Our In-House Cluster

Setting up a new computer

Hardware and OS

First buy or build the machine. If you’re building a node for the cabinet (rack), it’s helpful to first connect the power supply, motherboard, and hard drive to each other, then everything else. (The geometry of the case makes this easier.)

Then install Linux. Currently we are running Ubuntu Server 20.04 LTS; there are several thumb drives in the cabinet with the ISO. You can simply follow all the steps and install. Pick a name for the computer. Our current theme is something NYC related. Do not pick something too long or idiotic. (Note: if you want to change the computer name later, edit both /etc/hosts and /etc/hostname to the new name.) By default, the Ubuntu installer will not occupy the entire drive, but typically we want it to; this is easily changed after installation by doing the following:

sudo lvresize -l +100%FREE /dev/mapper/ubuntu--vg-ubuntu--lv
sudo resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv

If you're a new user, pick a username; it's better to keep it to 8 characters or fewer. If you're an existing user, use your usual username.

Now we need an IP address. We used to get one from Columbia for each machine, but now we just configure pennstation, which is our DHCP server. This is done as follows:

IP address: On grandcentral (room 1144-1145) via pennstation:

First, find your hostname, MAC address, and IP:

cat /etc/hostname
ifconfig
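If ifconfig is not available (the net-tools package is not always installed on Ubuntu Server), the ip tool gives the same information; a minimal sketch:

hostname             # same information as /etc/hostname
ip -br link show     # one line per interface: name, state, MAC address
ip -br addr show     # interface names with their assigned IP addresses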

Log onto pennstation (ask someone for the password) and run these commands, where YOURNAME is the hostname, IP is the last octet of the requested IP address (remember to check for conflicts!), and YOURMAC is your MAC address:

ssh admin@192.168.0.2
admin@192.168.0.2's password:
pennstation#conf
pennstation(conf)#ip dhcp server
pennstation(config-dhcp)#pool YOURNAME
pennstation(config-dhcp-YOURNAME)#host 192.168.0.IP
pennstation(config-dhcp-YOURNAME)#hardware-address YOURMAC
pennstation(config-dhcp-YOURNAME)#default-router 192.168.0.1
pennstation(config-dhcp-YOURNAME)#dns-server 128.59.1.3
pennstation(config-dhcp-YOURNAME)#show conf
!
 pool YOURNAME
  host 192.168.0.IP
  hardware-address YOURMAC
  default-router 192.168.0.1
  dns-server 128.59.1.3
pennstation(config-dhcp-YOURNAME)#exit
pennstation(config-dhcp)#exit
pennstation(conf)#exit
pennstation#copy run start
File with same name already exist.
Proceed to copy the file [confirm yes/no]: yes
pennstation#exit

DHCP reservation on router (deprecated)

  1. Log onto the router at 192.168.0.1. If you don’t know the username and password, ask someone in the group.

  2. Go to LAN, DHCP client, and find your machine’s name, and write down the MAC address and IP address. (This is the internal IP address, which should begin with 192.168.0.xxx)

  3. Go to LAN, DHCP Reservation, and add an entry for your machine. Make sure you put in the description as your computer’s name, and your internal IP address.

  4. Go to Advanced, Virtual Server, and create an entry: the name is the computer's name, the interface is WAN1, the internal port is 22, the Internal Server IP is your internal IP address, and the external port is (3000 + the last part of your IP address). For example, tribeca has the internal IP address 192.168.0.22, so its port is 3022. If there are too many entries, you can remove one of the compute nodes (but not a personal machine).

  5. Let everyone in the group know the IP and port number, so they can put it in their .ssh/config files. Someone can add it to the group-meeting agenda or directly onto the group website. You can get your own .ssh/config file from the generate_machs.py script; see below for instructions on getting this.

Static IP (deprecated)

(In case you need a static IP address, here are the old instructions. First, get the MAC address of the machine:

ifconfig | grep HWaddr | awk '{print $NF}'

Now you need to go to the Columbia CUIT webpage and request a new registration. You need to give the MAC address, the room number, and the Ethernet jack number. The requested hostname should be of the form name.apam.columbia.edu, where "name" is the name of the computer.

End of parenthetical comment.)

Installing Libraries & Users

We now use Ansible to create new users, install the Intel compiler suite and other libraries, and install commonly used software (e.g. VASP).

There is a git repo with the current setup here:

grandcentral:/home/cam1/git_repos/cluster_setup.git

New Users

If you are a new user, you will want an SSH key. This will let you log on without having to enter your password every time. On the machine you are going to log on from, go to ~/.ssh and type:

ssh-keygen -t rsa

Then press Enter; you do not want a passphrase. Any computer that has the private key (id_rsa) in ~/.ssh will be able to log into any computer that has the public key in ~/.ssh/authorized_keys. (We will create the authorized_keys files soon.) So never give out the private key, or anybody can log onto your computer.

Your id_rsa.pub should be sent to someone to be included in the software repository, so it can be installed on all machines.

You should get an SSH config file and put it in ~/.ssh/config. A config file can be generated with the generate_machs.py script, which is part of our software package. You get it and run it thus:

git clone ssh://username@grandcentral.apam.columbia.edu:2000/home/cam1/git_repos/software.git
cd software/utils
./generate_machs.py

and follow the instructions. Full documentation is available with the other group documentation.
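For reference, a generated config entry looks roughly like the following (the host, port, and username here are placeholders; take the real values from the table of machines below):

# illustrative ~/.ssh/config entry -- not the generate_machs.py output verbatim
Host tribeca
    HostName grandcentral.apam.columbia.edu
    Port 3022
    User XXX
    IdentityFile ~/.ssh/id_rsa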

The wireless network in 1144 is called “rockefeller”; for the password, ask someone in the group.

Managing and cleaning up space

Files should never be stored on the compute nodes, as there is not sufficient room and it is bad practice. Any meaningful results should be transferred back to your own machine. The current usage of all hard drives on our compute nodes is measured hourly and can be found here.

VASP runs can take a lot of space, so we now have a little script (in cluster.git; see above for how to get it) that cleans old directories. It's called cleannode.py. It goes through each boomerang-run directory, and if it is older than a threshold (default 14 days), it removes some larger files and puts the rest in a tgz. By default it deletes PROCAR, WAVECAR, DOSCAR, CHGCAR, and vasprun.xml, but this can be modified in the script. It prints out a list of directories that it wants to clean up and asks for confirmation.

An example use would be as follows. (The file config_cluster_all contains a line setting the shell variable machlist to the names of all machines; the version included in the cluster.git repository will be fine for this.)

cd <directory>/cluster/
source config_cluster_all
for mach in ${machlist[@]}; do echo $mach; scp cleannode.py $mach:bin/; ssh -t $mach bin/cleannode.py; done

Notes on our computers

  • grandcentral is both the name of our external address (grandcentral.apam.columbia.edu) and the name of a particular machine that we use for storage / website hosting, not computing.

  • batterypark is a machine in 1144 that is very old and useful only for testing and guests; don't try running anything on it unless you want to wait a long time.

  • some machines (broadway, uws, fidi, soho, columbuso, han, carlosj, hudson, morningside, queens) are owned by particular people, so please don’t clog them with jobs.

  • It's much more efficient to run 2 VASP jobs on a computer at a time. Try not to run two high-memory jobs on the same computer. See boomerang for details.

  • Update: Now the newest generate_machs.py file will allow you to access every machine from inside and outside the network, as long as grandcentral is up. [OLD: Due to router limitations, only some machines can be accessed from outside the network. These are the ones that have ports in the table of machines. If you want access to the others, log onto one of those (preferably not someone’s personal computer) and from there onto any of the computers.]

  • Most machines have 1 TB drives. With 15 users per machine, this comes out to an average 60 GB per person. Obviously on other people’s PCs you should keep it even lower than that.

  • If you have an account, you can get sudo access with the password for the root user. Ask someone for the password. Be advised that ssh root@machine is disabled. This is only true on shared machines, not personal machines.

  • Columbia sometimes throttles us for too much usage, especially when uploading from our cluster to a remote machine. Check this at http://www.columbia.edu/cgi-bin/acis/networks/quota/netquota.pl. We have two WAN ports on the router, which have 2 IP addresses, distributing the load somewhat. If we’re still throttled, try logging into the router (192.168.0.1), then go to WAN -> WAN 2 -> Release (and maybe then Obtain). If we’re lucky, that will give a new IP.

DFT codes

  • VASP, QE, and abinit are precompiled and available on each node:

    /opt/apps/vasp/5.4.1/intel/bin/vasp_{std,gam,ncl}
    /opt/apps/vasp/5.4.4/intel/bin/vasp_{std,gam,ncl}
    /opt/apps/abinit/9.4.2/bin/abinit (etc)
    /opt/apps/qe/6.7/bin/pw.x (etc)
    
  • In order to load the modules and run the executable, do the following:

    module load vasp/5.4.4
    mpirun -n 8 $(which vasp_std) > vasp.out
    
    module load abinit/9.4.2
    mpirun -n 8 $(which abinit) < input > output
    
    module load qe/6.7
    mpirun -n 8 $(which pw.x) < input > output
    
  • VASP 5.2 has a memory problem when running large supercells; if you will be doing these, add the following to your .bashrc, or better yet, to your boomerang script:

ulimit -s unlimited

  • To get the PAWGGA POTCAR for VASP, run the following commands; it's easy to extend to the other pseudopotentials we have. In this example, we grab the pseudopotentials from grandcentral and make a POTCAR for LaNiO3, using the soft oxygen pseudopotential.

    mkdir pawgga # this is going to be where we keep it
    cd pawgga
    scp grandcentral:~cam1/vasp/potcar_pawgga.tar . # copy it from grandcentral; other pseudopotentials are here too
    tar xvf potcar_pawgga.tar # untar it into current directory
    gunzip */POTCAR.Z # unzip the POTCAR files
    cat La/POTCAR Ni/POTCAR O_s/POTCAR > POTCAR_LaNiO_s
    

Installing Intel Compiler

The Intel compiler suite is now free for everyone and can easily be installed on Ubuntu using apt. Here we outline the basic installation steps as documented on the Intel site.

Begin by getting the public key and adding the intel repository.

wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB -O - | sudo apt-key add -
sudo add-apt-repository "deb https://apt.repos.intel.com/oneapi all main"

The above is actually deprecated and not recommended due to security concerns (though everyone still seems to do it); the proper way is documented here.
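If you want to follow the recommended route instead, a sketch of the keyring-based setup (following Intel's public oneAPI apt instructions; adjust if they have changed) is:

# download the key into a dedicated keyring instead of using apt-key
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
# reference that keyring explicitly in the source entry
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update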

Now install the packages.

sudo apt install intel-basekit
sudo apt install intel-hpckit

The following script will set up the environment; you may want to add this to your .bashrc.

source /opt/intel/oneapi/setvars.sh

Please note that this script will set your default python to the optimized Intel Python, which you may not want.

Troubleshooting nodes

Once in a while (one node every month or so?), a hard-drive error causes a drive to be remounted read-only. This usually generates messages such as:

mkdir: cannot create directory `sandbox': Read-only file system
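Before rebooting, you can confirm which filesystem went read-only and look for the underlying error; a quick check:

# list mounts whose options start with "ro"
grep ' ro[ ,]' /proc/mounts
# recent kernel messages usually show the I/O error that triggered the remount
dmesg | tail -n 50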

The easiest fix is to restart the computer; usually the system repairs the disk on restart. Instructions for how to do this:

  1. Find the physical machine and attach a monitor and keyboard, and preferably a power cable. There are workstations in 1144 (the green table) and 1145 (one of the desks) where you can find an extra monitor and keyboard.

  2. Boot up the machine and follow instructions to check the disk. If it doesn’t check the disk automatically, run fsck (see Internet for instructions).

  3. Restart the machine, and ensure it works, writes to disk, and connects to other machines.

  4. If it gives another error, try running badblocks to identify bad blocks in the HD.

  5. If all else fails, replace the hard drive.

  6. Return the node whence it came, plug in the power cable and Ethernet, and start it up.

Taking down a node

If you want to troubleshoot a node and take it down so nobody else will access it, or if you want to do some dedicated testing and don't want anyone running on your node, you should disable SSH access for other users. Here's how:

su # superuser access
cp /etc/ssh/sshd_config{,.`date +%s`} # copy config file as a backup
echo "AllowUsers cam1 mckornbluth" >> /etc/ssh/sshd_config # allow only cam1 and mckornbluth
/usr/sbin/sshd -t; echo $? # check that the config parses without error
service ssh restart # restart sshd

When you’re done, remember to let everyone back on:

su # superuser access
cp /etc/ssh/sshd_config{,.`date +%s`} # copy config file as a backup
head -n -1 /etc/ssh/sshd_config > temp # piping directly will give an empty file
mv temp /etc/ssh/sshd_config
/usr/sbin/sshd -t; echo $? # check that the config parses without error
service ssh restart # restart sshd

See here for details: http://knowledgelayer.softlayer.com/learning/how-do-i-permit-specific-users-ssh-access

List of machines in our group

Dual Socket Compute Nodes

Machine      MAC                IP address     Port  Proc       Mem  20.04
bqe          c8:1f:66:e8:83:5c  192.168.0.215  None  E5-2670v2  64G  Y
bruckner     c8:1f:66:e8:87:b7  192.168.0.200  None  E5-2670v2  64G  Y
crossbronx   c8:1f:66:f3:ee:a4  192.168.0.217  None  E5-2690v2  96G  Y
deegan       c8:1f:66:e8:2d:f3  192.168.0.214  None  E5-2670v2  64G  Y
fdr          c8:1f:66:ea:59:f4  192.168.0.211  None  E5-2670v2  64G  Y
goethals     c8:1f:66:f4:29:db  192.168.0.204  None  E5-2670v2  64G  Y
gowanus      c8:1f:66:ea:38:c8  192.168.0.213  None  E5-2670v2  64G  Y
gwb          c8:1f:66:e5:c9:e5  192.168.0.203  None  E5-2670v2  64G  Y
holland      c8:1f:66:ea:6a:d4  192.168.0.209  None  E5-2670v2  64G  Y
lincoln      c8:1f:66:e7:26:99  192.168.0.210  None  E5-2670v2  64G  Y
outerbridge  c8:1f:66:ea:5b:f5  192.168.0.205  None  E5-2670v2  64G  Y
pelham       c8:1f:66:ea:5e:55  192.168.0.212  None  E5-2670v2  64G  Y
pulaski      c8:1f:66:ea:4d:29  192.168.0.201  None  E5-2670v2  64G  Y
rfk          c8:1f:66:ea:79:f3  192.168.0.206  None  E5-2670v2  64G  Y
throgsneck   c8:1f:66:f3:93:8e  192.168.0.207  None  E5-2670v2  64G  Y
vanwyck      c8:1f:66:e8:3c:6f  192.168.0.216  None  E5-2670v2  64G  Y
verrazano    c8:1f:66:ea:68:29  192.168.0.208  None  E5-2670v2  64G  Y
whitestone   c8:1f:66:ea:68:b2  192.168.0.202  None  E5-2670v2  64G  Y

Single Socket Compute Nodes

Machine          MAC                IP address    Port  Proc      Mem  20.04
bowery           bc:5f:f4:8e:ef:9d  192.168.0.5   3005  i7-3770K  32G  N
broadway         bc:5f:f4:3a:74:8f  192.168.0.3   3003  i7-3770   32G  N
bronx            bc:5f:f4:44:0e:77  192.168.0.16  None  i7-3770K  32G  N
brooklyn         bc:5f:f4:9e:69:49  192.168.0.55  None  i7-3770K  32G  N
bryantpark       bc:5f:f4:9e:6a:ea  192.168.0.54  None  i7-3770K  32G  N
centralpark      bc:5f:f4:9e:67:62  192.168.0.43  None  i7-3770K  32G  Y
chinatown        bc:5f:f4:3a:75:9b  192.168.0.15  3015  i7-3770   32G  N
cityhall         bc:5f:f4:a1:10:b5  192.168.0.51  None  i7-3770K  32G  Y
civiccenter      bc:5f:f4:8e:ef:79  192.168.0.6   3006  i7-3770K  32G  N
diamonddistrict  bc:5f:f4:9e:67:66  192.168.0.65  None  i7-3770K  32G  Y
downtown         bc:5f:f4:47:bc:f2  192.168.0.12  3012  i7-3770K  32G  N
dumbo            bc:5f:f4:3a:73:dd  192.168.0.17  3017  i7-3770   32G  N
ellisisland      bc:5f:f4:9e:67:30  192.168.0.45  None  i7-3770K  32G  Y
flatbush         bc:5f:f4:9e:68:35  192.168.0.56  None  i7-3770K  32G  N
flatiron         bc:5f:f4:75:89:6d  192.168.0.8   3008  i7-3770K  32G  N
greatlawn        bc:5f:f4:9e:6b:04  192.168.0.46  None  i7-3770K  32G  Y
guggenheim       bc:5f:f4:9e:68:02  192.168.0.48  None  i7-3770K  32G  Y
harlem           bc:5f:f4:47:bd:ac  192.168.0.11  3011  i7-3770K  32G  N
hellskitchen     bc:5f:f4:8e:ef:90  192.168.0.13  3013  i7-3770K  32G  N
heraldsquare     bc:5f:f4:9e:67:e7  192.168.0.37  None  i7-3770K  32G  Y
highline         bc:5f:f4:9e:65:b1  192.168.0.63  None  i7-3770K  32G  Y
intrepid         bc:5f:f4:9e:67:eb  192.168.0.49  None  i7-3770K  32G  N
inwood           bc:5f:f4:9e:68:01  192.168.0.38  None  i7-3770K  32G  Y
lenoxhill        bc:5f:f4:9e:68:00  192.168.0.40  None  i7-3770K  32G  Y
les              bc:5f:f4:44:0d:7f  192.168.0.9   3009  i7-3770   32G  Y
lincolnsquare    bc:5f:f4:9e:68:0c  192.168.0.41  None  i7-3770K  32G  Y
madisonsquare    bc:5f:f4:9e:65:87  192.168.0.35  None  i7-3770K  32G  Y
manhattanville   bc:5f:f4:9e:65:91  192.168.0.39  None  i7-3770K  32G  Y
marinepark       bc:5f:f4:9e:67:e9  192.168.0.59  None  i7-3770K  32G  Y
met              bc:5f:f4:a1:10:b7  192.168.0.52  None  i7-3770K  32G  N
midtown          bc:5f:f4:47:bd:13  192.168.0.10  3010  i7-3770K  32G  N
murrayhill       bc:5f:f4:9e:68:04  192.168.0.44  None  i7-3770K  32G  Y
museummile       bc:5f:f4:9e:68:31  192.168.0.47  None  i7-3770K  32G  Y
naturalhistory   bc:5f:f4:9e:68:03  192.168.0.62  None  i7-3770K  32G  Y
noho             bc:5f:f4:3a:76:1a  192.168.0.18  3018  i7-3770   32G  N
nolita           bc:5f:f4:72:c0:e8  192.168.0.14  3014  i7-3770K  32G  N
nypd             bc:5f:f4:9e:67:d5  192.168.0.57  None  i7-3770K  32G  Y
nypl             bc:5f:f4:9e:68:06  192.168.0.50  None  i7-3770K  32G  Y
pier76           bc:5f:f4:9e:65:8f  192.168.0.53  None  i7-3770K  32G  Y
prospectpark     bc:5f:f4:9e:67:f9  192.168.0.64  None  i7-3770K  32G  Y
soho             bc:5f:f4:47:bd:fc  192.168.0.27  3027  i7-2600   32G  N
statenisland     bc:5f:f4:9e:6b:0a  192.168.0.61  None  i7-3770K  32G  Y
timessquare      bc:5f:f4:9e:68:1c  192.168.0.36  None  i7-3770K  32G  N
ues              bc:5f:f4:75:89:c7  192.168.0.7   3007  i7-3770K  32G  N
unionsquare      bc:5f:f4:8e:ef:6a  192.168.0.4   3004  i7-3770K  32G  Y
williamsburg     bc:5f:f4:9e:67:dd  192.168.0.58  None  i7-3770K  32G  N
butler (6c)      fc:aa:14:2f:38:7a  192.168.0.66  None  i7-5820K  32G  N
gramercy (6c)    8c:89:a5:80:a8:8a  192.168.0.21  3021  i7-3930K  64G  N

Administrative/Personal Nodes

Machine       MAC                IP address     Status      Port  Proc      Mem  20.04
tribeca       8c:89:a5:80:ab:42  192.168.0.22   slurm       3022  i7-3930K  64G  Y
grandcentral  a4:ba:db:03:17:d9  192.168.0.100  Git/apache  2000  i7-870    16G  N
parkingspace  d0:50:99:68:12:7e  192.168.0.102  Raid        None  G3258     4G   N
pennstation   00:01:e8:d7:ce:3f  192.168.0.2    Router      None  N/A       N/A  N
powerstrip    00:06:67:24:fa:e2  192.168.0.110  N/A         None  Power     N/A  N
morningside   00:1e:c9:4f:c4:31  192.168.0.101  Chris       3101  i7-       8G   N
fidi          bc:5f:f4:38:1a:1c  192.168.0.28   Chanul      3028  i7-3770   32G  N
westend       bc:5f:f4:9e:6b:16  192.168.0.24   Enda        3024  i7-3770K  32G  N
columbuso     bc:5f:f4:b9:3c:07  192.168.0.26   Sasaank     3026  i7-4770   32G  N
queens        bc:5f:f4:9e:44:a7  192.168.0.25   Lyuwen      3025  i7-3770K  32G  N
batterypark   00:0f:1f:db:04:f9  192.168.0.32   ?           3032  i7-       ?G   N
carlosj       00:22:19:23:c1:59  192.168.0.31   ?           3031  i7-       8G   N
chelsea       00:24:e8:33:79:b1  192.168.0.29   ?           3029  i7-       8G   N
han           00:24:e8:38:3b:40  192.168.0.30   ?           3030  i7-       8G   N

Single sockets that are currently broken

  1. noho - where is it?

  2. flatiron - where is it?

  3. gramercy - where is it?

  4. ues - where is it?

  5. hellskitchen - where is it?

  6. intrepid - where is it?

  7. brooklyn - won’t power on.

  8. chinatown - won’t power on.

  9. met - won’t power on.

  10. downtown - won’t power on.

  11. dumbo - won’t power on.

  12. harlem - won’t power on.

  13. bowery

  14. nolita

  15. civiccenter

  16. bronx

  17. bryantpark

  18. williamsburg

  19. midtown is not reachable.

  20. timessquare is not reachable.

Single sockets to update to 20.04 next

The only upgrades still needed are for the machines listed as broken above.

Dual sockets that are broken

  1. crossbronx - has a new hard drive, but still some serious problems. Appears to overheat.

  2. goethals, gowanus, and pulaski have bad hard drives.

  3. fdr and bruckner will not power on.

Config File

Here is a sample config file with all the single and dual socket machines (config). This file should be named config and placed in your ~/.ssh/ directory. Replace XXX with your user name.

Potential future purchases

We have not updated our cluster in many years, and technology keeps pushing forward. Memory bandwidth is one of the key performance bottlenecks for plane-wave DFT calculations. Our single sockets (i7-3770) have 25.6 GB/s and our dual sockets (E5-2670v2) have 59.7 GB/s. However, there are already relatively old single-socket processors of the Cascade Lake family running at 94 GB/s. The ten-core version goes for about $600, while the 18-core model goes for about $1000. We do not currently have funds to buy nodes, though perhaps we can scrape enough together to buy one to test the performance.

Instructions: python, git, sphinx, and apache

Python Virtual Environments

Python virtual environments allow you to create an isolated installation of Python and its packages. This can be useful when using Python programs that have many dependencies (like Jupyter notebooks), running older software, or developing new software. We provide a quick summary of the discussion here. The basic gist is to install and use virtualenv. First install it with pip:

pip install virtualenv

Now you can go to your project folder and create a virtual environment named venv, which is a standard naming convention:

cd project_folder
virtualenv venv

A directory named venv was created, housing all of the relevant information. You may want to explicitly set the version of Python to use:

virtualenv -p /usr/bin/python2.7 venv

To actually use the virtual environment, fire it up by sourcing the relevant bash file:

source venv/bin/activate
(venv) user@computer~/path/project_folder$

You can see that your bash prompt has been modified to alert you that you are in the virtualenv named venv. Now use pip to install any packages you would like, and you are ready to go. You can stop the virtualenv with:

deactivate

Intro to git

  • Useful links

    1. For more detailed information on what is covered below, please see here

    2. Git cheat-sheet

    3. A very nice tutorial here

  • About version control: Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

  • First time git setup: The first thing you should do when you install Git is to set your user name and e-mail address. This is important because every Git commit uses this information, and it’s immutably baked into the commits you pass around:

    git config --global user.name "John Doe"
    git config --global user.email johndoe@example.com

You can get a Git project using two main approaches. The first takes an existing project or directory and imports it into Git. The second clones an existing Git repository from another server. For the first approach, let us assume that the group meeting docs will be under version control. First make the directory:

mkdir group_meetings
cd group_meetings

Now we set up git to make a local repository:

git init
git add [filename]
git commit -m "My initial commit message"

This creates a new subdirectory named .git that contains all of your necessary repository files. It adds the file [filename] and commits it to the repository. The -m flag attaches your message to the log that tracks the changes made in this version.

You now have a working local repository and you can locally do version control. However, you will probably want to have a remote repository in order to develop with friends. Go to the remote host that will host the git repository:

ssh -p 2000 cam1@grandcentral.apam.columbia.edu
mkdir group_meetings.git
cd group_meetings.git
git init --bare

Now, if you want to make the remote host the origin for your local copy, go back to your local repository:

git remote add origin ssh://cam1@grandcentral.apam.columbia.edu:2000/home/cam1/group_meetings.git
git push -u origin master

Now perhaps another user will check out their own copy, or maybe I will check out a copy on my home computer or my laptop:

git clone ssh://cam1@grandcentral.apam.columbia.edu:2000/home/cam1/git_repos/group_meetings.git

Modifying the internal group website: Git and Sphinx

If you would like to make a change to this webpage, or any of the others on grandcentral, first pull a copy of the webpage for yourself:

git clone ssh://grandcentral/home/cam1/git_repos/documentation.git

This assumes that grandcentral is defined in your .ssh/config file. Now cd into the “documentation” directory.

If you already have a copy set up, always remember to "pull" so that you are working from the most up-to-date version:

git pull

Once you finish modifying the files, you add them:

git add filename

and commit them:

git commit -m 'commit message'

Then you must execute the build.sh script to generate the HTML files, followed by up.sh:

./build.sh
./up.sh

Finally, you push up to the master branch to complete the changes:

git push

Everyone should have access to change things and push up to the repository. If you run into any problems, make sure you are added to the gituser and wwwpub groups:

sudo adduser dkk2122 gituser
sudo adduser dkk2122 wwwpub

If you still have trouble, make sure that the git repository is owned by the gituser group:

sudo chgrp -R gituser myrepos.git

and that the myrepos.git/config file marks the repository as shared with the group (so git keeps its files group-writable), i.e. it includes the following line under [core]:

sharedRepository = group
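Equivalently, you can set this from the command line inside the bare repository (assuming you have write access to it):

# writes "sharedRepository = group" into the [core] section of the repository's config
git config core.sharedRepository group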

Intro to sphinx

All of the current documentation is generated using sphinx:

http://sphinx-doc.org/
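If you want to try Sphinx outside of the group repositories, a minimal local setup looks roughly like this (these are the standard Sphinx commands, not anything specific to our setup; the directory name is arbitrary):

pip install sphinx        # install into a virtualenv if you prefer (see the Python section above)
sphinx-quickstart mydocs  # answer the prompts; this creates conf.py, index.rst, and a Makefile
cd mydocs
make html                 # build; the output lands in _build/html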

Modifying the internal group website: Apache

The actual website is run on grandcentral and the data is at grandcentral:/var/www. The directories there are:

  • conferences - conference presentations. This is currently behind a password.

  • documentation - This website. This is NOT behind a password.

  • groupmeetings - presentations and agenda from group meetings. This is currently behind a password.

  • lockbox - miscellaneous files that ARE behind a password.

  • (redacted) - miscellaneous files. This is NOT behind a password. Useful to share work with yourself for later.

  • research - notes from people’s research. This is currently behind a password.

  • tutorials - this is NOT behind a password.

For the username and password, speak to someone in the group. (It is not the same as litshare.)

To set security for various directories, edit the Apache configuration and restart the server:

$ sudo vi /etc/apache2/httpd.conf
$ sudo /etc/init.d/apache2 restart

Environment Modules

The environment modules system dynamically manages environment variables in the shell to give users access to software and libraries on demand. It also allows users to swap between different versions or implementations of a software package without worrying about potential conflicts. (It also avoids the tedious and redundant loading procedure of the setvars.sh script from the latest Intel oneAPI kit.)

Here we use Lmod, a Lua-based environment module system, for our in-house clusters. The software is compiled and stored in the module base repository, accessible from our local network via the git daemon introduced below:

git://192.168.0.100/base.git

The repository houses the Lmod and Lua scripts and libraries, as well as some modulefiles (in the modulefiles directory) that are used to load certain software packages.

To list the available software managed by the system, simply run the command

module avail

In the list, the currently loaded software and the default version that will be loaded are marked.
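For example, loading, inspecting, and swapping versions works as follows (the version numbers are just the ones listed earlier in this documentation):

module load vasp/5.4.4              # put the 5.4.4 build on your PATH
module list                         # show what is currently loaded
module swap vasp/5.4.4 vasp/5.4.1   # exchange one version for another
module unload vasp                  # remove it from the environment again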

It is also relatively easy to create a script for the module system to manage a piece of software; one can simply follow the template below:

help([[
    <A multi-line text as the help message of the module>
  ]])

local <local variables> = ...

load("<The modules the current module depends on.>")

prepend_path("<environment variable>", "<path to prepend>")
setenv("<environment variable>", "<value of the environment variable>")

whatis("<what is information of the module>")

More information on writing modulefiles can be found in the Lmod documentation.

In-house debian repository

We have set up an in-house Debian repository to distribute software, such as DFT packages, within our in-house cluster. It is a very easy and lightweight solution that satisfies our needs for managing software versions and their dependencies.

The debian source is located at

"deb [trusted=yes] http://192.168.0.100/debian/ ./"

which is set up by simply exposing the directory with Apache2 directory listing. Once this source is added to the apt source list, you can install the software packages provided by the repo through apt install.
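Concretely, adding the source and installing a package looks like this (the sources.list.d file name is arbitrary; vasp-5.4.4-intel is the example package built later in this section):

# register the in-house repo with apt
echo "deb [trusted=yes] http://192.168.0.100/debian/ ./" | sudo tee /etc/apt/sources.list.d/inhouse.list
sudo apt update
# install a package served by the repo
sudo apt install vasp-5.4.4-intel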

In order to publish a compiled software package, one just needs to create a Debian binary package. For example, one can create the following directory structure to make a binary package for VASP 5.4.4 compiled with the Intel toolchain:

./vasp-5.4.4-intel
  |-- DEBIAN
  |   `-- control
  |-- opt/apps/vasp/5.4.4/intel/bin
      |-- vasp_std
      |-- vasp_ncl
      `-- vasp_gam

where vasp_std, vasp_ncl, and vasp_gam are the three variant builds of the VASP code, and control is the Debian control file:

Package: vasp-5.4.4-intel
Version: 1.0
Section: custom
Priority: optional
Architecture: all
Essential: no
Installed-Size: 1024
Maintainer: Lyuwen Fu
Description: VASP 5.4.4 intel with Wannier90 v2.1
Depends: intel-hpckit, intel-basekit

where Package is the package name referred to during apt install, Version is the incremental version number (needed when one wishes to replace an old build for whatever reason), and Depends lists the packages this package depends on, which will be installed alongside it.

Then the command dpkg-deb --build vasp-5.4.4-intel creates the vasp-5.4.4-intel.deb file. Usually the file is renamed to reflect the package version number. The package then needs to be uploaded to the Debian repo directory /var/www/html/debian on grandcentral, and the package listing there needs to be refreshed with dpkg-scanpackages . | gzip -c9 > Packages.gz.
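A sketch of the full publish sequence (this assumes grandcentral is defined in your .ssh/config and that you can write to /var/www/html/debian):

# build the .deb from the staging directory shown above
dpkg-deb --build vasp-5.4.4-intel
# optionally rename it to encode the version, then upload and refresh the package index
mv vasp-5.4.4-intel.deb vasp-5.4.4-intel_1.0_all.deb
scp vasp-5.4.4-intel_1.0_all.deb grandcentral:/var/www/html/debian/
ssh grandcentral 'cd /var/www/html/debian && dpkg-scanpackages . | gzip -c9 > Packages.gz'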

More details can be found in the Debian documentation.

Git Protocol - Git Daemon

Git daemon is a very simple way to set up the "Git" protocol, which allows fast, unauthenticated, read-only access to git repositories. In our case, we want read access to the environment-module base repository so that Lmod and the modulefiles can be installed and updated without authentication. Details of how to set up this service can be found in the git documentation.
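For reference, a minimal way to serve a directory of repositories over the Git protocol looks like this (the base path is illustrative; the actual service on grandcentral may be configured differently, e.g. via systemd):

# export every repository under the base path, read-only, on the default port 9418
git daemon --reuseaddr --base-path=/home/cam1/git_repos --export-all /home/cam1/git_repos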

Other Computing Resources

NERSC resources

Please follow the instructions at http://www.nersc.gov/users/accounts/user-accounts/get-a-nersc-account/ to request a new account. E-mail Pierre or Eric to get the allocation info needed.

NERSC keeps track of users who are authorized to use VASP, so you may need to have Chris send another e-mail to the VASP licensing people (vasp.materialphysik@univie.ac.at) with NERSC CC'd (vasp_licensing@nersc.gov) to add you to the list. Right now just Pierre, Mordechai, Chanul, and Eric are on the list.

The NERSC systems are extremely well documented so you can probably answer most of your questions from their web site (http://www.nersc.gov/users/).

NERSC now uses the Slurm scheduling system, and a VASP job script basically looks like this:

#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH -N 1
#SBATCH -A m1902
#SBATCH --time=60:00
#SBATCH -p regular
#SBATCH -C haswell

module load vasp/5.4.1-hsw
exe=$(which vasp_std)
NC=32
NP=4
sed -i '/NPAR.*/d' INCAR
echo "NPAR=$NP" >> INCAR
srun -n $NC $exe >>vasp.out 2>>vasp.err

Make sure you include the parallelization flags in your INCAR file. For example, VASP recommends that NPAR be set to approximately the square root of the number of CPUs being used.

If you’re sharing with other group members, you can use the project directory (e.g. /project/projectdirs/m1902/). If you want to change the directory “example” so that any new files have default read/write permissions for the entire group, you should run this command:

setfacl -R -d -m mask:007 /project/projectdirs/m1902/example/

It should already be set up so that new files belong to the project group; if this is not the case, you can do that by running:

chmod -R g+s /project/projectdirs/m1902/example/

And if the directory already has files, you need to run this:

chgrp -R m1902 /project/projectdirs/m1902/example/
chmod -R g+rw  /project/projectdirs/m1902/example/

Check the number of hours remaining in the group’s allocation from the command line by executing:

getnim -R m1902

Information on HPC Cluster at Columbia

Below is information about our HPC clusters at Columbia; the documentation for each is linked from its name:

Cluster name  Nodes Purchased  Cluster status  Group storage  Login node
Yeti          10               Retired         /vega/qmech    yetisubmit.cc.columbia.edu
Habanero      2                Online          /rigel/qmech   habanero.rcs.columbia.edu
Terremoto     1                Online          /moto/qmech    terremoto.rcs.columbia.edu
Ginsburg      1                Online          /moto/qmech    terremoto.rcs.columbia.edu

You can log in using your UNI (similar to cunix), and your SSOL password. If you do not have access to the cluster, please contact Chris to give you access.

All online clusters (Habanero and Terremoto) use the Slurm system, and the batch script has a similar form to the one for NERSC; a sketch follows.
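For example, a Habanero VASP run might look roughly like the following; the account name, core count, and wall time are assumptions, so check the cluster documentation and adjust:

#!/bin/bash
#SBATCH --job-name=<job_name>
#SBATCH --account=qmech      # assumption: the group account name on Habanero
#SBATCH -N 1
#SBATCH --time=2:00:00

module load intel-parallel-studio
# assumption: 24 cores per node; the binary path is the Habanero one listed below
srun -n 24 /rigel/qmech/projects/built_by_lyuwen/vasp.5.4.4.1/bin/vasp_std > vasp.out 2> vasp.err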

VASP on HPC Clusters

VASP 5.4.1 and VASP 5.4.4 are compiled using the method mentioned here, on both Habanero and Terremoto.

On Habanero, VASP 5.4.1 is located under the directory:

/rigel/qmech/projects/built_by_lyuwen/vasp.5.4.1/bin

and VASP 5.4.4:

/rigel/qmech/projects/built_by_lyuwen/vasp.5.4.4.1/bin

Note that this is the patched version of VASP 5.4.4; the original version is under the vasp.5.4.4 directory.

In order to execute them, you need to load the Intel Parallel Studio module:

module load intel-parallel-studio

Old stuff on Yeti

The VASP 5.2 executable, a sample PBS job script (run.vasp), and a script to check the cluster usage (freenodes.pl) are in our shared directory /vega/qmech/projects.

There is a hard limit on the per-process stack size, which can cause VASP to crash for larger unit cells. Setting "ulimit -s unlimited" to increase the stack size fixes this, but the cluster administration wanted to be cautious, so right now the stack size is set to 15360 on InfiniBand nodes.

Update May 2015: “The user stack limit has been removed from execute nodes.” (from the Yeti website)

The NPAR settings can also cause a crash. Some simple testing shows that for a one-node run, NCORE=4 (for 16 cores) is a good value.

Hyperthreading: Mordechai inquired in December 2015 and was told hyperthreading is enabled on the cluster, but the scheduler is not aware of it. So if you're running on a single node via OpenMP or round-robin tasks, it will use the hyperthreading, but I'm not sure whether it works across multiple nodes.

Efficiency: I did some testing running 16 separate small VASP jobs on 1 node vs. 32 separate small jobs (which uses the hyperthreading). The latter was about 4% faster than running the former twice, so it doesn't help much in that configuration.