Our In-House Cluster

Setting up a new computer

Hardware and OS

First buy or build the machine. If you’re building a node for the cabinet (rack), it’s helpful to first connect the power supply, motherboard, and hard drive to each other, then everything else. (The geometry of the case makes this easier.)

Then install Linux. Currently we are running Ubuntu Server 20.04 LTS; there are several thumb drives in the cabinet with the ISO. You can simply follow all the steps and install. Pick a name for the computer. Our current theme is something NYC related. Do not pick something too long or idiotic. (Note: if you want to change the computer name later, edit both /etc/hosts and /etc/hostname to the new name.) By default, the Ubuntu installer will not occupy the entire drive, but typically we want it to; this is easily changed after installation by doing the following:

sudo lvresize -l +100%FREE /dev/mapper/ubuntu--vg-ubuntu--lv
sudo resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv

If you're a new user, pick a username; it's better to keep it to 8 characters or fewer. If you're an existing user, use your usual username.

Now we need an IP address. We used to get one from Columbia for each machine, but now we just configure pennstation, which is our DHCP server. This is done as follows:

IP address: On grandcentral (room 1144-1145) via pennstation:

First, find your hostname, MAC address, and IP:

cat /etc/hostname
ifconfig
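If ifconfig is not available (the net-tools package is not always installed on Ubuntu Server), the ip tool gives the same information; a minimal sketch:

hostname             # same information as /etc/hostname
ip -br link show     # one line per interface: name, state, MAC address
ip -br addr show     # interface names with their assigned IP addresses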

Log onto pennstation (ask someone for the password) and run these commands, where YOURNAME is the hostname, IP is the last octet of the requested IP address (remember to check for conflicts!), and YOURMAC is your MAC address:

ssh admin@192.168.0.2
admin@192.168.0.2's password:
pennstation#conf
pennstation(conf)#ip dhcp server
pennstation(config-dhcp)#pool YOURNAME
pennstation(config-dhcp-YOURNAME)#host 192.168.0.IP
pennstation(config-dhcp-YOURNAME)#hardware-address YOURMAC
pennstation(config-dhcp-YOURNAME)#default-router 192.168.0.1
pennstation(config-dhcp-YOURNAME)#dns-server 128.59.1.3
pennstation(config-dhcp-YOURNAME)#show conf
!
 pool YOURNAME
  host 192.168.0.IP
  hardware-address YOURMAC
  default-router 192.168.0.1
  dns-server 128.59.1.3
pennstation(config-dhcp-YOURNAME)#exit
pennstation(config-dhcp)#exit
pennstation(conf)#exit
pennstation#copy run start
File with same name already exist.
Proceed to copy the file [confirm yes/no]: yes
pennstation#exit

DHCP reservation on router (deprecated)

  1. Log onto the router at 192.168.0.1. If you don’t know the username and password, ask someone in the group.

  2. Go to LAN, DHCP client, and find your machine’s name, and write down the MAC address and IP address. (This is the internal IP address, which should begin with 192.168.0.xxx)

  3. Go to LAN, DHCP Reservation, and add an entry for your machine. Make sure you put in the description as your computer’s name, and your internal IP address.

  4. Go to Advanced, Virtual Server, and create an entry: the name is the computer's name, the interface is WAN1, the internal port is 22, the Internal Server IP is your internal IP address, and the external port is (3000 + the last part of your IP address). For example, tribeca has the internal IP address 192.168.0.22, so its port is 3022. If there are too many entries, you can remove one of the compute nodes (but not a personal machine).

  5. Let everyone in the group know the IP and port number, so they can put it in their .ssh/config files. Someone can add it to the group-meeting agenda or directly onto the group website. You can get your own .ssh/config file from the generate_machs.py script; see below for instructions on getting this.

Static IP (deprecated)

(In case you need a static IP address, here are the old instructions. First, get the MAC address of the machine:

ifconfig | grep HWaddr | awk '{print $NF}'

Now you need to go to the Columbia CUIT webpage and request a new registration. You need to give the MAC address, the room number, and the Ethernet jack number. The requested hostname should be of the form name.apam.columbia.edu, where "name" is the name of the computer.

End of parenthetical comment.)

Installing Libraries & Users

We now use Ansible to create new users, install the Intel compiler suite and other libraries, and install commonly used software (e.g. VASP).

There is a git repo with the current setup here:

grandcentral:/home/cam1/git_repos/cluster_setup.git

New Users

If you are a new user, you will want an SSH key. This will let you log on without having to enter your password every time. On the machine you are going to log on from, go to ~/.ssh and type:

ssh-keygen -t rsa

Then press Enter; you do not want a passphrase. Any computer that has the private key (id_rsa) in ~/.ssh will be able to log into any computer that has the public key in ~/.ssh/authorized_keys. (We will create the authorized_keys files soon.) So never give out the private key, or anybody can log onto your computer.

Your id_rsa.pub should be sent to someone to be included in the software repository, so it can be installed on all machines.

You should get an SSH config file and put it in ~/.ssh/config. A config file can be generated with the generate_machs.py script, which is part of our software package. You get it and run it thus:

git clone ssh://username@grandcentral.apam.columbia.edu:2000/home/cam1/git_repos/software.git
cd software/utils
./generate_machs.py

and follow the instructions. Full documentation is available with the other group documentation.
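For reference, a generated config entry looks roughly like the following (the host, port, and username here are placeholders; take the real values from the table of machines below):

# illustrative ~/.ssh/config entry -- not the generate_machs.py output verbatim
Host tribeca
    HostName grandcentral.apam.columbia.edu
    Port 3022
    User XXX
    IdentityFile ~/.ssh/id_rsa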

The wireless network in 1144 is called “rockefeller”; for the password, ask someone in the group.

Managing and cleaning up space

Files should never be stored on the compute nodes, as there is not sufficient room and it is bad practice. Any meaningful results should be transferred back to your own machine. The current usage of all hard drives on our compute nodes is measured hourly and can be found here.

VASP runs can take a lot of space, so we now have a little script (in cluster.git; see above for how to get it) that cleans old directories. It's called cleannode.py. It goes through each boomerang-run directory, and if it is older than a threshold (default 14 days), it removes some larger files and puts the rest in a tgz. By default it deletes PROCAR, WAVECAR, DOSCAR, CHGCAR, and vasprun.xml, but this can be modified in the script. It prints out a list of directories that it wants to clean up and asks for confirmation.

An example use would be as follows. (The file config_cluster_all contains a line setting the shell variable machlist to the names of all machines; the version included in the cluster.git repository will be fine for this.)

cd <directory>/cluster/
source config_cluster_all
for mach in ${machlist[@]}; do echo $mach; scp cleannode.py $mach:bin/; ssh -t $mach bin/cleannode.py; done

Notes on our computers

  • grandcentral is both the name of our external address (grandcentral.apam.columbia.edu) and the name of a particular machine that we use for storage / website hosting, not computing.

  • batterypark is a machine in 1144 that is very old and useful only for testing and guests; don't try running anything on it unless you want to wait a long time.

  • some machines (broadway, uws, fidi, soho, columbuso, han, carlosj, hudson, morningside, queens) are owned by particular people, so please don’t clog them with jobs.

  • It's much more efficient to run 2 VASP jobs on a computer at a time. Try not to run two high-memory jobs on the same computer. See boomerang for details.

  • Update: Now the newest generate_machs.py file will allow you to access every machine from inside and outside the network, as long as grandcentral is up. [OLD: Due to router limitations, only some machines can be accessed from outside the network. These are the ones that have ports in the table of machines. If you want access to the others, log onto one of those (preferably not someone’s personal computer) and from there onto any of the computers.]

  • Most machines have 1 TB drives. With 15 users per machine, this comes out to an average 60 GB per person. Obviously on other people’s PCs you should keep it even lower than that.

  • If you have an account, you can get sudo access with the password for the root user. Ask someone for the password. Be advised that ssh root@machine is disabled. This is only true on shared machines, not personal machines.

  • Columbia sometimes throttles us for too much usage, especially when uploading from our cluster to a remote machine. Check this at http://www.columbia.edu/cgi-bin/acis/networks/quota/netquota.pl. We have two WAN ports on the router, which have 2 IP addresses, distributing the load somewhat. If we’re still throttled, try logging into the router (192.168.0.1), then go to WAN -> WAN 2 -> Release (and maybe then Obtain). If we’re lucky, that will give a new IP.

DFT codes

  • VASP, QE, and abinit are precompiled and available on each node:

    /opt/apps/vasp/5.4.1/intel/bin/vasp_{std,gam,ncl}
    /opt/apps/vasp/5.4.4/intel/bin/vasp_{std,gam,ncl}
    /opt/apps/abinit/9.4.2/bin/abinit (etc)
    /opt/apps/qe/6.7/bin/pw.x (etc)
    
  • In order to load the modules and run the executable, do the following:

    module load vasp/5.4.4
    mpirun -n 8 $(which vasp_std) > vasp.out
    
    module load abinit/9.4.2
    mpirun -n 8 $(which abinit) < input > output
    
    module load qe/6.7
    mpirun -n 8 $(which pw.x) < input > output
    
  • VASP 5.2 has a memory problem when running large supercells; if you will be doing these, add the following to your .bashrc, or better yet, to your boomerang script:

ulimit -s unlimited

  • To get the PAWGGA POTCAR for VASP, run the following commands; it's easy to extend to the other pseudopotentials we have. In this example, we grab the pseudopotentials from grandcentral and make a POTCAR for LaNiO3, using the soft oxygen pseudopotential.

    mkdir pawgga # this is going to be where we keep it
    cd pawgga
    scp grandcentral:~cam1/vasp/potcar_pawgga.tar . # copy it from grandcentral; other pseudopotentials are here too
    tar xvf potcar_pawgga.tar # untar it into current directory
    gunzip */POTCAR.Z # unzip the POTCAR files
    cat La/POTCAR Ni/POTCAR O_s/POTCAR > POTCAR_LaNiO_s
    

Installing Intel Compiler

The Intel compiler suite is now free for everyone and can easily be installed on Ubuntu using apt. Here we outline the basic installation steps as documented on the Intel site.

Begin by getting the public key and adding the intel repository.

wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB -O - | sudo apt-key add -
sudo add-apt-repository "deb https://apt.repos.intel.com/oneapi all main"

The above is actually deprecated and not recommended due to security concerns (though everyone still seems to do it); the proper way is documented here.
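If you want to follow the recommended route instead, a sketch of the keyring-based setup (following Intel's public oneAPI apt instructions; adjust if they have changed) is:

# download the key into a dedicated keyring instead of using apt-key
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
# reference that keyring explicitly in the source entry
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update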

Now install the packages.

sudo apt install intel-basekit
sudo apt install intel-hpckit

The following script will set up the environment; you may want to add this to your .bashrc.

source /opt/intel/oneapi/setvars.sh

Please note that this script will set your default python to the optimized Intel Python, which you may not want.

Troubleshooting nodes

Once in a while (one node every month or so?), a hard-drive error causes a drive to be remounted read-only. This usually generates messages such as:

mkdir: cannot create directory `sandbox': Read-only file system
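Before rebooting, you can confirm which filesystem went read-only and look for the underlying error; a quick check:

# list mounts whose options start with "ro"
grep ' ro[ ,]' /proc/mounts
# recent kernel messages usually show the I/O error that triggered the remount
dmesg | tail -n 50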

The easiest fix is to restart the computer; usually the system repairs the disk on restart. Instructions for how to do this:

  1. Find the physical machine and attach a monitor and keyboard, and preferably a power cable. There are workstations in 1144 (the green table) and 1145 (one of the desks) where you can find an extra monitor and keyboard.

  2. Boot up the machine and follow instructions to check the disk. If it doesn’t check the disk automatically, run fsck (see Internet for instructions).

  3. Restart the machine, and ensure it works, writes to disk, and connects to other machines.

  4. If it gives another error, try running badblocks to identify bad blocks in the HD.

  5. If all else fails, replace the hard drive.

  6. Return the node whence it came, plug in the power cable and Ethernet, and start it up.

Taking down a node

If you want to troubleshoot a node and take it down so nobody else will access it, or if you want to do some dedicated testing and don't want anyone running on your node, you should disable SSH access for other users. Here's how:

su # superuser access
cp /etc/ssh/sshd_config{,.`date +%s`} # copy config file as a backup
echo "AllowUsers cam1 mckornbluth" >> /etc/ssh/sshd_config # allow only cam1 and mckornbluth
/usr/sbin/sshd -t; echo $? # check that the config parses without error
service ssh restart # restart sshd

When you’re done, remember to let everyone back on:

su # superuser access
cp /etc/ssh/sshd_config{,.`date +%s`} # copy config file as a backup
head -n -1 /etc/ssh/sshd_config > temp # piping directly will give an empty file
mv temp /etc/ssh/sshd_config
/usr/sbin/sshd -t; echo $? # check that the config parses without error
service ssh restart # restart sshd

See here for details: http://knowledgelayer.softlayer.com/learning/how-do-i-permit-specific-users-ssh-access

List of machines in our group

Dual Socket Compute Nodes

Machine      MAC                IP address     Port  Proc       Mem  20.04
bqe          c8:1f:66:e8:83:5c  192.168.0.215  None  E5-2670v2  64G  Y
bruckner     c8:1f:66:e8:87:b7  192.168.0.200  None  E5-2670v2  64G  Y
crossbronx   c8:1f:66:f3:ee:a4  192.168.0.217  None  E5-2690v2  96G  Y
deegan       c8:1f:66:e8:2d:f3  192.168.0.214  None  E5-2670v2  64G  Y
fdr          c8:1f:66:ea:59:f4  192.168.0.211  None  E5-2670v2  64G  Y
goethals     c8:1f:66:f4:29:db  192.168.0.204  None  E5-2670v2  64G  Y
gowanus      c8:1f:66:ea:38:c8  192.168.0.213  None  E5-2670v2  64G  Y
gwb          c8:1f:66:e5:c9:e5  192.168.0.203  None  E5-2670v2  64G  Y
holland      c8:1f:66:ea:6a:d4  192.168.0.209  None  E5-2670v2  64G  Y
lincoln      c8:1f:66:e7:26:99  192.168.0.210  None  E5-2670v2  64G  Y
outerbridge  c8:1f:66:ea:5b:f5  192.168.0.205  None  E5-2670v2  64G  Y
pelham       c8:1f:66:ea:5e:55  192.168.0.212  None  E5-2670v2  64G  Y
pulaski      c8:1f:66:ea:4d:29  192.168.0.201  None  E5-2670v2  64G  Y
rfk          c8:1f:66:ea:79:f3  192.168.0.206  None  E5-2670v2  64G  Y
throgsneck   c8:1f:66:f3:93:8e  192.168.0.207  None  E5-2670v2  64G  Y
vanwyck      c8:1f:66:e8:3c:6f  192.168.0.216  None  E5-2670v2  64G  Y
verrazano    c8:1f:66:ea:68:29  192.168.0.208  None  E5-2670v2  64G  Y
whitestone   c8:1f:66:ea:68:b2  192.168.0.202  None  E5-2670v2  64G  Y

Single Socket Compute Nodes

Machine          MAC                IP address    Port  Proc      Mem  20.04
bowery           bc:5f:f4:8e:ef:9d  192.168.0.5   3005  i7-3770K  32G  N
broadway         bc:5f:f4:3a:74:8f  192.168.0.3   3003  i7-3770   32G  N
bronx            bc:5f:f4:44:0e:77  192.168.0.16  None  i7-3770K  32G  N
brooklyn         bc:5f:f4:9e:69:49  192.168.0.55  None  i7-3770K  32G  N
bryantpark       bc:5f:f4:9e:6a:ea  192.168.0.54  None  i7-3770K  32G  N
centralpark      bc:5f:f4:9e:67:62  192.168.0.43  None  i7-3770K  32G  Y
chinatown        bc:5f:f4:3a:75:9b  192.168.0.15  3015  i7-3770   32G  N
cityhall         bc:5f:f4:a1:10:b5  192.168.0.51  None  i7-3770K  32G  Y
civiccenter      bc:5f:f4:8e:ef:79  192.168.0.6   3006  i7-3770K  32G  N
diamonddistrict  bc:5f:f4:9e:67:66  192.168.0.65  None  i7-3770K  32G  Y
downtown         bc:5f:f4:47:bc:f2  192.168.0.12  3012  i7-3770K  32G  N
dumbo            bc:5f:f4:3a:73:dd  192.168.0.17  3017  i7-3770   32G  N
ellisisland      bc:5f:f4:9e:67:30  192.168.0.45  None  i7-3770K  32G  Y
flatbush         bc:5f:f4:9e:68:35  192.168.0.56  None  i7-3770K  32G  N
flatiron         bc:5f:f4:75:89:6d  192.168.0.8   3008  i7-3770K  32G  N
greatlawn        bc:5f:f4:9e:6b:04  192.168.0.46  None  i7-3770K  32G  Y
guggenheim       bc:5f:f4:9e:68:02  192.168.0.48  None  i7-3770K  32G  Y
harlem           bc:5f:f4:47:bd:ac  192.168.0.11  3011  i7-3770K  32G  N
hellskitchen     bc:5f:f4:8e:ef:90  192.168.0.13  3013  i7-3770K  32G  N
heraldsquare     bc:5f:f4:9e:67:e7  192.168.0.37  None  i7-3770K  32G  Y
highline         bc:5f:f4:9e:65:b1  192.168.0.63  None  i7-3770K  32G  Y
intrepid         bc:5f:f4:9e:67:eb  192.168.0.49  None  i7-3770K  32G  N
inwood           bc:5f:f4:9e:68:01  192.168.0.38  None  i7-3770K  32G  Y
lenoxhill        bc:5f:f4:9e:68:00  192.168.0.40  None  i7-3770K  32G  Y
les              bc:5f:f4:44:0d:7f  192.168.0.9   3009  i7-3770   32G  Y
lincolnsquare    bc:5f:f4:9e:68:0c  192.168.0.41  None  i7-3770K  32G  Y
madisonsquare    bc:5f:f4:9e:65:87  192.168.0.35  None  i7-3770K  32G  Y
manhattanville   bc:5f:f4:9e:65:91  192.168.0.39  None  i7-3770K  32G  Y
marinepark       bc:5f:f4:9e:67:e9  192.168.0.59  None  i7-3770K  32G  Y
met              bc:5f:f4:a1:10:b7  192.168.0.52  None  i7-3770K  32G  N
midtown          bc:5f:f4:47:bd:13  192.168.0.10  3010  i7-3770K  32G  N
murrayhill       bc:5f:f4:9e:68:04  192.168.0.44  None  i7-3770K  32G  Y
museummile       bc:5f:f4:9e:68:31  192.168.0.47  None  i7-3770K  32G  Y
naturalhistory   bc:5f:f4:9e:68:03  192.168.0.62  None  i7-3770K  32G  Y
noho             bc:5f:f4:3a:76:1a  192.168.0.18  3018  i7-3770   32G  N
nolita           bc:5f:f4:72:c0:e8  192.168.0.14  3014  i7-3770K  32G  N
nypd             bc:5f:f4:9e:67:d5  192.168.0.57  None  i7-3770K  32G  Y
nypl             bc:5f:f4:9e:68:06  192.168.0.50  None  i7-3770K  32G  Y
pier76           bc:5f:f4:9e:65:8f  192.168.0.53  None  i7-3770K  32G  Y
prospectpark     bc:5f:f4:9e:67:f9  192.168.0.64  None  i7-3770K  32G  Y
soho             bc:5f:f4:47:bd:fc  192.168.0.27  3027  i7-2600   32G  N
statenisland     bc:5f:f4:9e:6b:0a  192.168.0.61  None  i7-3770K  32G  Y
timessquare      bc:5f:f4:9e:68:1c  192.168.0.36  None  i7-3770K  32G  N
ues              bc:5f:f4:75:89:c7  192.168.0.7   3007  i7-3770K  32G  N
unionsquare      bc:5f:f4:8e:ef:6a  192.168.0.4   3004  i7-3770K  32G  Y
williamsburg     bc:5f:f4:9e:67:dd  192.168.0.58  None  i7-3770K  32G  N
butler (6c)      fc:aa:14:2f:38:7a  192.168.0.66  None  i7-5820K  32G  N
gramercy (6c)    8c:89:a5:80:a8:8a  192.168.0.21  3021  i7-3930K  64G  N

Administrative/Personal Nodes

Machine       MAC                IP address     Status      Port  Proc      Mem  20.04
tribeca       8c:89:a5:80:ab:42  192.168.0.22   slurm       3022  i7-3930K  64G  Y
grandcentral  a4:ba:db:03:17:d9  192.168.0.100  Git/apache  2000  i7-870    16G  N
parkingspace  d0:50:99:68:12:7e  192.168.0.102  Raid        None  G3258     4G   N
pennstation   00:01:e8:d7:ce:3f  192.168.0.2    Router      None  N/A       N/A  N
powerstrip    00:06:67:24:fa:e2  192.168.0.110  N/A         None  Power     N/A  N
morningside   00:1e:c9:4f:c4:31  192.168.0.101  Chris       3101  i7-       8G   N
fidi          bc:5f:f4:38:1a:1c  192.168.0.28   Chanul      3028  i7-3770   32G  N
westend       bc:5f:f4:9e:6b:16  192.168.0.24   Enda        3024  i7-3770K  32G  N
columbuso     bc:5f:f4:b9:3c:07  192.168.0.26   Sasaank     3026  i7-4770   32G  N
queens        bc:5f:f4:9e:44:a7  192.168.0.25   Lyuwen      3025  i7-3770K  32G  N
batterypark   00:0f:1f:db:04:f9  192.168.0.32   ?           3032  i7-       ?G   N
carlosj       00:22:19:23:c1:59  192.168.0.31   ?           3031  i7-       8G   N
chelsea       00:24:e8:33:79:b1  192.168.0.29   ?           3029  i7-       8G   N
han           00:24:e8:38:3b:40  192.168.0.30   ?           3030  i7-       8G   N

Single sockets that are currently broken

  1. noho - where is it?

  2. flatiron - where is it?

  3. gramercy - where is it?

  4. ues - where is it?

  5. hellskitchen - where is it?

  6. intrepid - where is it?

  7. brooklyn - won’t power on.

  8. chinatown - won’t power on.

  9. met - won’t power on.

  10. downtown - won’t power on.

  11. dumbo - won’t power on.

  12. harlem - won’t power on.

  13. bowery

  14. nolita

  15. civiccenter

  16. bronx

  17. bryantpark

  18. williamsburg

  19. midtown is not reachable.

  20. timessquare is not reachable.

Single sockets to update to 20.04 next

The only upgrades still needed are for the machines listed as broken above.

Dual sockets that are broken

  1. crossbronx - has a new hard drive, but still some serious problems. Appears to overheat.

  2. goethals, gowanus, and pulaski have bad hard drives.

  3. fdr and bruckner will not power on.

Config File

Here is a sample config file with all the single and dual socket machines (config). This file should be named config and placed in your ~/.ssh/ directory. Replace XXX with your user name.

Potential future purchases

We have not updated our cluster in many years, and technology keeps pushing forward. Memory bandwidth is one of the key performance bottlenecks for plane-wave DFT calculations. Our single sockets (i7-3770) have 25.6 GB/s and our dual sockets (E5-2670v2) have 59.7 GB/s. However, there are already relatively old single-socket processors of the Cascade Lake family running at 94 GB/s. The ten-core version goes for about $600, while the 18-core model goes for about $1000. We do not currently have funds to buy nodes, though perhaps we can scrape enough together to buy one to test the performance.

Instructions: python, git, sphinx, and apache

Python Virtual Environments

Python virtual environments allow you to create an isolated installation of Python and its packages. This can be useful when using Python programs that have many dependencies (like Jupyter notebooks), running older software, or developing new software. We provide a quick summary of the discussion here. The basic gist is to install and use virtualenv. First install it with pip:

pip install virtualenv

Now you can go to your project folder and create a virtual environment named venv, which is a standard naming convention:

cd project_folder
virtualenv venv

A directory named venv was created, housing all of the relevant information. You may want to explicitly set the version of Python to use:

virtualenv -p /usr/bin/python2.7 venv

To actually use the virtual environment, fire it up by sourcing the relevant bash file:

source venv/bin/activate
(venv) user@computer~/path/project_folder$

You can see that your bash prompt has been modified to alert you that you are in the virtualenv named venv. Now use pip to install any packages you would like, and you are ready to go. You can stop the virtualenv with:

deactivate

Intro to git

  • Useful links

    1. For more detailed information on what is covered below, please see here

    2. Git cheat-sheet

    3. A very nice tutorial here

  • About version control: Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

  • First time git setup: The first thing you should do when you install Git is to set your user name and e-mail address. This is important because every Git commit uses this information, and it’s immutably baked into the commits you pass around:

    git config --global user.name "John Doe"
    git config --global user.email johndoe@example.com

You can get a Git project using two main approaches. The first takes an existing project or directory and imports it into Git. The second clones an existing Git repository from another server. For the first approach, let us assume that the group meeting docs will be under version control. First make the directory:

mkdir group_meetings
cd group_meetings

Now we set up git to make a local repository:

git init
git add [filename]
git commit -m "My initial commit message"

This creates a new subdirectory named .git that contains all of your necessary repository files. It adds the file [filename] and commits it to the repository. The -m flag attaches your message to the log that tracks the changes made in this version.

You now have a working local repository and you can locally do version control. However, you will probably want to have a remote repository in order to develop with friends. Go to the remote host that will host the git repository:

ssh -p 2000 cam1@grandcentral.apam.columbia.edu
mkdir group_meetings.git
cd group_meetings.git
git init --bare

Now, if you want to make the remote host the origin for your local copy, go back to your local repository:

git remote add origin ssh://cam1@grandcentral.apam.columbia.edu:2000/home/cam1/group_meetings.git
git push -u origin master

Now perhaps another user will check out their own copy, or maybe I will check out a copy on my home computer or my laptop:

git clone ssh://cam1@grandcentral.apam.columbia.edu:2000/home/cam1/git_repos/group_meetings.git

Modifying the internal group website: Git and Sphinx

If you would like to make a change to this webpage, or any of the others on grandcentral, first pull a copy of the webpage for yourself:

git clone ssh://grandcentral/home/cam1/git_repos/documentation.git

This assumes that grandcentral is defined in your .ssh/config file. Now cd into the “documentation” directory.

If you already have a copy set up, always remember to "pull" so that you are working from the most up-to-date version:

git pull

Once you finish modifying the files, you add them:

git add filename

and commit them:

git commit -m 'commit message'

Then you must execute the build.sh script to generate the HTML files, followed by up.sh:

./build.sh
./up.sh

Finally, you push up to the master branch to complete the changes:

git push

Everyone should have access to change things and push up to the repository. If you run into any problems, make sure you are added to the gituser and wwwpub groups:

sudo adduser dkk2122 gituser
sudo adduser dkk2122 wwwpub

If you still have trouble, make sure that the git repository is owned by the gituser group:

sudo chgrp -R gituser myrepos.git

and that the myrepos.git/config file marks the repository as shared with the group (so git keeps its files group-writable), i.e. it includes the following line under [core]:

sharedRepository = group
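Equivalently, you can set this from the command line inside the bare repository (assuming you have write access to it):

# writes "sharedRepository = group" into the [core] section of the repository's config
git config core.sharedRepository group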

Intro to sphinx

All of the current documentation is generated using sphinx:

http://sphinx-doc.org/
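If you want to try Sphinx outside of the group repositories, a minimal local setup looks roughly like this (these are the standard Sphinx commands, not anything specific to our setup; the directory name is arbitrary):

pip install sphinx        # install into a virtualenv if you prefer (see the Python section above)
sphinx-quickstart mydocs  # answer the prompts; this creates conf.py, index.rst, and a Makefile
cd mydocs
make html                 # build; the output lands in _build/html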

Modifying the internal group website: Apache

The actual website is run on grandcentral and the data is at grandcentral:/var/www. The directories there are:

  • conferences - conference presentations. This is currently behind a password.

  • documentation - This website. This is NOT behind a password.

  • groupmeetings - presentations and agenda from group meetings. This is currently behind a password.

  • lockbox - miscellaneous files that ARE behind a password.

  • (redacted) - miscellaneous files. This is NOT behind a password. Useful to share work with yourself for later.

  • research - notes from people’s research. This is currently behind a password.

  • tutorials - this is NOT behind a password.

For the username and password, speak to someone in the group. (It is not the same as litshare.)

To set security for various directories, edit the Apache configuration and restart the server:

$ sudo vi /etc/apache2/httpd.conf
$ sudo /etc/init.d/apache2 restart

Environment Modules

The environment modules system dynamically manages environment variables in the shell to give users access to software and libraries on demand. It also allows users to swap between different versions or implementations of a software package without worrying about potential conflicts. (It also avoids the tedious and redundant loading procedure of the setvars.sh script from the latest Intel oneAPI kit.)

Here we use Lmod, a Lua-based environment module system, for our in-house clusters. The software is compiled and stored in the module base repository, accessible from our local network via the git daemon introduced below:

git://192.168.0.100/base.git

The repository houses the Lmod and Lua scripts and libraries, as well as some modulefiles (in the modulefiles directory) that are used to load certain software packages.

To list the available software managed by the system, simply run the command

module avail

In the list, the currently loaded software and the default version that will be loaded are marked.
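For example, loading, inspecting, and swapping versions works as follows (the version numbers are just the ones listed earlier in this documentation):

module load vasp/5.4.4              # put the 5.4.4 build on your PATH
module list                         # show what is currently loaded
module swap vasp/5.4.4 vasp/5.4.1   # exchange one version for another
module unload vasp                  # remove it from the environment again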

It is also relatively easy to create a script for the module system to manage a piece of software; one can simply follow the template below:

help([[
    <A multi-line text as the help message of the module>
  ]])

local <local variables> = ...

load("<The modules the current module depends on.>")

prepend_path("<environment variable>", "<path to prepend>")
setenv("<environment variable>", "<value of the environment variable>")

whatis("<what is information of the module>")

More information on writing modulefiles can be found in the Lmod documentation.

In-house debian repository

We have set up an in-house Debian repository to distribute software, such as DFT packages, within our in-house cluster. It is a very easy and lightweight solution that satisfies our needs for managing software versions and their dependencies.

The debian source is located at

"deb [trusted=yes] http://192.168.0.100/debian/ ./"

which is set up by simply exposing the directory with Apache2 directory listing. Once this source is added to the apt source list, you can install the software packages provided by the repo through apt install.
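Concretely, adding the source and installing a package looks like this (the sources.list.d file name is arbitrary; vasp-5.4.4-intel is the example package built later in this section):

# register the in-house repo with apt
echo "deb [trusted=yes] http://192.168.0.100/debian/ ./" | sudo tee /etc/apt/sources.list.d/inhouse.list
sudo apt update
# install a package served by the repo
sudo apt install vasp-5.4.4-intel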

In order to publish a compiled software package, one just needs to create a Debian binary package. For example, one can create the following directory structure to make a binary package for VASP 5.4.4 compiled with the Intel toolchain:

./vasp-5.4.4-intel
  |-- DEBIAN
  |   `-- control
  |-- opt/apps/vasp/5.4.4/intel/bin
      |-- vasp_std
      |-- vasp_ncl
      `-- vasp_gam

where vasp_std, vasp_ncl, and vasp_gam are the three variant builds of the VASP code, and control is the Debian control file:

Package: vasp-5.4.4-intel
Version: 1.0
Section: custom
Priority: optional
Architecture: all
Essential: no
Installed-Size: 1024
Maintainer: Lyuwen Fu
Description: VASP 5.4.4 intel with Wannier90 v2.1
Depends: intel-hpckit, intel-basekit

where Package is the package name referred to during apt install, Version is the incremental version number (needed when one wishes to replace an old build for whatever reason), and Depends lists the packages this package depends on, which will be installed alongside it.

Then the command dpkg-deb --build vasp-5.4.4-intel creates the vasp-5.4.4-intel.deb file. Usually the file is renamed to reflect the package version number. The package then needs to be uploaded to the Debian repo directory /var/www/html/debian on grandcentral, and the package listing there needs to be refreshed with dpkg-scanpackages . | gzip -c9 > Packages.gz.
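A sketch of the full publish sequence (this assumes grandcentral is defined in your .ssh/config and that you can write to /var/www/html/debian):

# build the .deb from the staging directory shown above
dpkg-deb --build vasp-5.4.4-intel
# optionally rename it to encode the version, then upload and refresh the package index
mv vasp-5.4.4-intel.deb vasp-5.4.4-intel_1.0_all.deb
scp vasp-5.4.4-intel_1.0_all.deb grandcentral:/var/www/html/debian/
ssh grandcentral 'cd /var/www/html/debian && dpkg-scanpackages . | gzip -c9 > Packages.gz'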

More details can be found in the Debian documentation.

Git Protocol - Git Daemon

Git daemon is a very simple way to set up the "Git" protocol, which allows fast, unauthenticated, read-only access to git repositories. In our case, we want read access to the environment-module base repository so that Lmod and the modulefiles can be installed and updated without authentication. Details of how to set up this service can be found in the git documentation.
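For reference, a minimal way to serve a directory of repositories over the Git protocol looks like this (the base path is illustrative; the actual service on grandcentral may be configured differently, e.g. via systemd):

# export every repository under the base path, read-only, on the default port 9418
git daemon --reuseaddr --base-path=/home/cam1/git_repos --export-all /home/cam1/git_repos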

Other Computing Resources

NERSC resources

Please follow the instructions at http://www.nersc.gov/users/accounts/user-accounts/get-a-nersc-account/ to request a new account. E-mail Pierre or Eric to get the allocation info needed.

NERSC keeps track of users who are authorized to use VASP, so you may need to have Chris send another e-mail to the VASP licensing people (vasp.materialphysik@univie.ac.at) with NERSC CC'd (vasp_licensing@nersc.gov) to add you to the list. Right now just Pierre, Mordechai, Chanul, and Eric are on the list.

The NERSC systems are extremely well documented so you can probably answer most of your questions from their web site (http://www.nersc.gov/users/).

NERSC now uses the Slurm scheduling system, and a VASP job script basically looks like this:

#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH -N 1
#SBATCH -A m1902
#SBATCH --time=60:00
#SBATCH -p regular
#SBATCH -C haswell

module load vasp/5.4.1-hsw
exe=$(which vasp_std)
NC=32
NP=4
sed -i '/NPAR.*/d' INCAR
echo "NPAR=$NP" >> INCAR
srun -n $NC $exe >>vasp.out 2>>vasp.err

Make sure you include the parallelization flags in your INCAR file. For example, VASP recommends that NPAR be set to approximately the square root of the number of CPUs being used.

If you’re sharing with other group members, you can use the project directory (e.g. /project/projectdirs/m1902/). If you want to change the directory “example” so that any new files have default read/write permissions for the entire group, you should run this command:

setfacl -R -d -m mask:007 /project/projectdirs/m1902/example/

It should already be set up so that new files belong to the project group; if this is not the case, you can do that by running:

chmod -R g+s /project/projectdirs/m1902/example/

And if the directory already has files, you need to run this:

chgrp -R m1902 /project/projectdirs/m1902/example/
chmod -R g+rw  /project/projectdirs/m1902/example/

Check the number of hours remaining in the group’s allocation from the command line by executing:

getnim -R m1902

Information on HPC Cluster at Columbia

Below is information about our HPC clusters at Columbia; the documentation for each is linked from its name:

Cluster name  Nodes Purchased  Cluster status  Group storage  Login node
Yeti          10               Retired         /vega/qmech    yetisubmit.cc.columbia.edu
Habanero      2                Online          /rigel/qmech   habanero.rcs.columbia.edu
Terremoto     1                Online          /moto/qmech    terremoto.rcs.columbia.edu
Ginsburg      1                Online          /moto/qmech    terremoto.rcs.columbia.edu

You can log in using your UNI (similar to cunix), and your SSOL password. If you do not have access to the cluster, please contact Chris to give you access.

All online clusters (Habanero and Terremoto) use the Slurm system, and the batch script has a similar form to the one for NERSC; a sketch follows.
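For example, a Habanero VASP run might look roughly like the following; the account name, core count, and wall time are assumptions, so check the cluster documentation and adjust:

#!/bin/bash
#SBATCH --job-name=<job_name>
#SBATCH --account=qmech      # assumption: the group account name on Habanero
#SBATCH -N 1
#SBATCH --time=2:00:00

module load intel-parallel-studio
# assumption: 24 cores per node; the binary path is the Habanero one listed below
srun -n 24 /rigel/qmech/projects/built_by_lyuwen/vasp.5.4.4.1/bin/vasp_std > vasp.out 2> vasp.err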

VASP on HPC Clusters

VASP 5.4.1 and VASP 5.4.4 are compiled using the method mentioned here, on both Habanero and Terremoto.

On Habanero, VASP 5.4.1 is located under the directory:

/rigel/qmech/projects/built_by_lyuwen/vasp.5.4.1/bin

and VASP 5.4.4:

/rigel/qmech/projects/built_by_lyuwen/vasp.5.4.4.1/bin

Note that this is the patched version of VASP 5.4.4; the original version is under the vasp.5.4.4 directory.

In order to execute them, you need to load the Intel Parallel Studio module:

module load intel-parallel-studio

Old stuff on Yeti

The VASP 5.2 executable, a sample PBS job script (run.vasp), and a script to check the cluster usage (freenodes.pl) are in our shared directory /vega/qmech/projects.

There is a hard limit on the per-process stack size, which can cause VASP to crash for larger unit cells. Setting "ulimit -s unlimited" to increase the stack size fixes this, but the cluster administration wanted to be cautious, so right now the stack size is set to 15360 on InfiniBand nodes.

Update May 2015: “The user stack limit has been removed from execute nodes.” (from the Yeti website)

The NPAR settings can also cause a crash. Some simple testing shows that for a one-node run, NCORE=4 (for 16 cores) is a good value.

Hyperthreading: Mordechai inquired in December 2015 and was told hyperthreading is enabled on the cluster, but the scheduler is not aware of it. So if you're running on a single node via OpenMP or round-robin tasks, it will use the hyperthreading, but I'm not sure whether it works across multiple nodes.

Efficiency: I did some testing running 16 separate small VASP jobs on 1 node vs. 32 separate small jobs (which uses the hyperthreading). The latter was about 4% faster than running the former twice, so it doesn't help much in that configuration.