ceph as data placement mechanism for htcondor/blast

The idea is to

  1. place ceph storage (OSD) on all nodes (that have a spare disk slot),
  2. configure ceph's CRUSH map to replicate the contents to all OSDs,
  3. mount it on all nodes as cephfs,
  4. put BLAST database in it, and
  5. let ceph distribute the reads when condor jobs kick in

and see how slow it gets.

Most hosts will act as both a client (which requires the new kernel) and a storage node (which requires running an OSD), so this covers setting a host up as both. The “master” is the host running the MON and the primary MDS.

Setup

1. Install kernel-lt from ELRepo.

Setup: servers

1. Install ceph:

rpm -Uvh http://ceph.com/rpm-bobtail/el6/x86_64/ceph-release-1-0.el6.noarch.rpm
rpm --import https://raw.github.com/ceph/ceph/master/keys/release.asc
yum install ceph

2. For the initial setup, follow the ceph quick start. Note: it requires 2 OSD hosts to work; with a single OSD you end up in a HEALTH_WARN state that goes away once you add another OSD.

3. Add another OSD: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/, but

  • when adding the new OSD to the CRUSH map (commands are run on the master host, manta; here adding osd.2 on host duck):
# ceph osd tree
# id    weight  type name       up/down reweight
-1      2       root default
-3      2               rack unknownrack
-2      1                       host manta
0       1                               osd.0   up      1
-4      1                       host swan
1       1                               osd.1   up      1

2       0       osd.2   down    0

# ceph osd crush set 2 osd.2 1.0 pool=default rack=unknownrack host=duck
updated item id 2 name 'osd.2' weight 1 at location {host=duck,pool=default,rack=unknownrack} to crush map

# ceph osd tree

# id    weight  type name       up/down reweight
-1      3       root default
-3      3               rack unknownrack
-2      1                       host manta
0       1                               osd.0   up      1
-4      1                       host swan
1       1                               osd.1   up      1
-5      1                       host duck
2       1                               osd.2   down    0

Then go on to “Starting the OSD” (sketched below);

  • the new OSD should already be “in” when it starts.
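
A sketch of that step, assuming the new OSD is osd.2 on host duck and the stock sysvinit script:

# service ceph start osd.2        (run on duck: starts the new OSD)
# ceph osd tree                   (run on manta: osd.2 should now show as up and in)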

4. (For completeness) add another MDS

  • create a directory for it in /var/lib/ceph/mds
  • update /etc/ceph/ceph.conf (see the sketch below)
  • service ceph start mds
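
A minimal sketch of the ceph.conf stanza for the new daemon (the name b and the host swan are assumptions; with cephx enabled it would also need a keyring):

[mds.b]
        host = swan
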
# ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=144.92.167.231:6789/0}, election epoch 1, quorum 0 a
   osdmap e47: 4 osds: 4 up, 4 in
    pgmap v7875: 576 pgs: 576 active+clean; 32533 MB data, 71055 MB used, 3411 GB / 3667 GB avail
   mdsmap e10: 1/1/1 up {0=a=up:active}, 1 up:standby

# ceph mds set_max_mds 2
max_mds = 2

# ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=144.92.167.231:6789/0}, election epoch 1, quorum 0 a
   osdmap e47: 4 osds: 4 up, 4 in
    pgmap v7879: 576 pgs: 576 active+clean; 32533 MB data, 71055 MB used, 3411 GB / 3667 GB avail
   mdsmap e13: 2/2/2 up {0=a=up:active,1=b=up:active}

(Note the change in the mdsmap line.)

5. Should add 2 more MONs for production.

6. Start it up on boot.

Mount the OSD data disk from fstab and chkconfig ceph on (ceph's init.d script is smart enough to read ceph.conf and start only the daemons configured for this host).
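
A sketch of what that looks like (the device, filesystem, and OSD data path are assumptions):

# /etc/fstab entry for the OSD data disk
/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime  0 0

# start ceph daemons at boot
chkconfig ceph on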

Setup: clients

1. Move on to ceph fs. Mount with -o ro on the condor nodes, mount rw on the submit machine where BLAST databases get generated:

mount -t ceph -o ro,noatime,nodiratime manta:/ /ceph/blastdb

(with multiple MONs, list them all before the path: … manta,duck,heron:/ …)
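
The matching fstab entry on a worker might look like this (it assumes cephx authentication is off; otherwise add name= and secretfile= options):

manta:/  /ceph/blastdb  ceph  ro,noatime,nodiratime,_netdev  0 0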

Setup: HTCondor

1. Add to /etc/condor/condor_config.local

HasCephFS = True
STARTD_ATTRS = $(STARTD_ATTRS) HasCephFS

and run condor_reconfig on each host.

# condor_status -long | grep HasCe
HasCephFS = true
...

2. Add Requirements = TARGET.HasCephFS to the BLAST job submit file (see the snippet after this list).

3. Limit the number of entries to BLAST so that the whole batch completes in a reasonable time.
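
A sketch of the relevant submit-file lines (the executable name and the local BLASTDB path are placeholders):

universe      = vanilla
executable    = run_blast.sh
Requirements  = TARGET.HasCephFS
environment   = "BLASTDB=/scratch/blastdb"
queue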

Complications:

  • there are only 3 SATA-based nodes with free hot-swap disk slots; there are also IDE-based nodes (free slots but no high-capacity drives) and SCSI-based nodes with neither free disk slots nor disks larger than 70GB.
  • CS-Rosetta jobs may come in and claim slots on our 3 test nodes.

baseline test

Run on the 3 OSD nodes only.

They seem to go through BLAST at ~ 125 jobs per hour, so limit the run to 2,000 entries from 16295 to 18987. That should get 'er done in 24 hours or so.

The run starts from “r” and the BMRB FASTA files, and then (as sketched below)

  1. creates the binary db files,
  2. copies (rsync) them over to the nodes, and
  3. makes and fires off a condor DAG with 2,000 BLAST jobs.
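
A rough sketch of those three steps (the script names, paths, and node list are assumptions):

formatdb -i bmrb.fasta -p T -n bmrb                   # 1. build the binary db files
for h in manta swan duck ; do                         # 2. rsync them to the worker nodes
    rsync -a /scratch/blastdb/ $h:/scratch/blastdb/
done
condor_submit_dag blast.dag                           # 3. fire off the DAG of BLAST jobs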

Result:

  • started @ 13:06 Feb 12
  • db creation/copy done @ 16:48 Feb 12: 3:42
  • dag finished @ 11:27 Feb 13: 22:21 total, 18:39 condor.

basic ceph test

As above, but with the databases on the ceph share:

  1. change environment = "BLASTDB=/ceph/blastdb …" in the submit file,
  2. change dbdir in the script creating the binary database files,
  3. delete the db copy loop from the script.

Result:

  • started @ 15:03 Feb 13,
  • db creation done @ 16:52 Feb 13: 1:49
  • killed after averaging approx. 6 jobs/hour for a day

next try: 2013-02-16

  1. double the RAM in heron (16-core worker host)
  2. halve the number of jobs (start at entry 17461) for faster turnaround

baseline result:

  • started @ 12:52
  • formatdb done @ 14:05
  • copying done @ 16:35 (same 3:4x as baseline test)
  • dag ran from 16:48 to 2:15 am: 9:27 hours

Ceph:

  1. modify the CRUSH map before doing anything else: set alg uniform and weight 1.000 for each host (see the bucket sketch after this list)
    ceph osd getcrushmap -o foo
    crushtool -d foo -o bar
    ...
    crushtool -c bar -o newfoo
    ceph osd setcrushmap -i newfoo
  2. adjust pool replication parameters:
    for i in data metadata rbd ; do ceph osd pool set $i size 4 ; ceph osd pool set $i min_size 4 ; done

    (size is how many replicas you want in a healthy cluster; min_size is the minimum number that must be available, with fewer than that I/O gets errors)

  3. mount cephfs with rsize=536870912 or something > 524288 (multiple of 4096) on worker hosts
  4. run cephfs /ceph/blastdb set_layout -u 2147483648 -s 2147483648 -c 2147483648 on manta before writing the databases
  5. to check the layout after writing: cephfs /ceph/blastdb map
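
For step 1, an edited host bucket in the decompiled map (bar) might look roughly like this (ids and weights as in the osd tree above):

host duck {
        id -5           # do not change
        alg uniform     # changed from the default
        hash 0          # rjenkins1
        item osd.2 weight 1.000
}

For step 3, the worker mount then becomes something like:

mount -t ceph -o ro,noatime,nodiratime,rsize=536870912 manta:/ /ceph/blastdb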

ceph result:

  • this simply crashes ceph while creating the database, so no further testing is possible

2013-03-08

Shutting down manta wedges it solid; it requires a power cycle.
