BMRB entry updates with BLAST/Condor

Collection of programs for updating external database links in BMRB entries.

Project location: https://svn.bmrb.wisc.edu/svn/blast/

Files in svn:

  • scripts in bin/,
  • submit and other config files in etc/,
  • python modules, jars, and other files in lib/,
  • source code for “getsequence” progams in get_seqs/,
  • source code for entry updater programs in update_entry/

Runtime:

  • /var/tmp/bmrb/blast – DAGman files and job logs,
  • /var/tmp/bmrb/blast/db – BLAST databases

Software:

  • /opt/ncbi-blast+ (symlink to /opt/ncbi-blast-X.Y.Z+)

Running BLAST

The run is started from bmrbgid user's crontab on Condor submit machine. To start it manually,

  1. login to minnow as bmrbgrid
  2. cd /minnow/bmrbgrid/blast/bin
  3. run ./blast_bmrb_entries.py

The script will update BLAST databases, create and submit the DAG and exit.

Monitoring

BMRB entries on public website are monitored for BLAST update by nagios. Monitoring script looks for files names bmr*.blast.xml in entry directories. If it does not find one updated less that 3 weeks ago, it returns an error code and an alert is generated.

Condor/BLAST details

"Redundant" DB

From BLAST's point of view, if the same sequence with same alignment details etc. is found in more that one source database (e.g. PDB and SWISS-PROT), those two records are the same “match”. In non-redundant database (NR) source database IDs, accession numbers, and descriptions are concatenated in one description field. In the source FASTA file they are delimited by Control-A, but in the output of BLAST search these get replaced by vertical pipe characters to make it human-readable.

Vertical pipe is also used as delimiter inside the description (between source DB ID, accession number, and description), and exact placement and number of delimiters depends on the source DB and on formatting options used when generating binary files for BLAST from input FASTA file.

As a result, parsing source DB details from BLAST output is impossible. The solution suggested by Christopher Tran from NCBI is to turn NR back into “not non-redundant” before generating binary database files for BLAST. At this point component fields are delimited by unambiguous Control-A, so the script for doing it fairly simple.

Resulting FASTA database (“R”) is formatted with makeblastdb to generate binary database files.

Database replication

The slowest part of BLAST search is usually loading database files into memory. When this is done off an NFS share, performance penalty can easily be an order of magnitude or worse, and with multiple condor nodes hitting the NFS server at the same time the whole cluster slows down to a crawl.

Binary database files are created on local filesystem, in /var/tmp/bmrb/blast/db and then copied to each condor worker host using rsync. This works fairly well with a small number of hosts connected via fast network.

To get the list of hosts, run

   condor_status -f '%s\n' Machine | sort |uniq

Cluster setup

Note set up rsync, /opt and /minow NFS mounts, and BLAST DB directory on each condor worker node before adding it to the cluster.

BLAST DB directory is /var/tmp/bmrb/blast/db. Condor logs, submit and other DAG files are in /var/tmp/bmrb/blast.

Everything else is currently stored on NFS shares:

  • BLAST is installed in /opt,
  • Other files are in bmrbgrid user's directory under /condor

Procedure

  1. NR:
    1. download from NCBI in FASTA format,
    2. “redundify” it,
    3. create binary DB files for BLAST.
  2. Each BMRB entry has a sequence file in FASTA format in its clean/ directory: bmrNNNN.MOLTYPE.fasta.
    1. Copy protein sequences (where MOLTYPE == prot) longer than 15 standard residues (i.e. X'es don't count) to job input directory,
    2. concatenate the above sequences into one FASTA file (BMRB BLAST DB),
    3. create binary DB file for BLAST.
  3. Copy binary DB files above to each worker host.
  4. Generate DAG submit file.
  5. Submit the DAG to Condor.

DAG steps:

  1. Run BLAST search on input file.
  2. Run update_entry.py on BLAST output file. This is a wrapper that
    1. compares the timestamps on original entry files and the .fasta file: if entry was updated after the .fasta, BLAST results are discarded,
    2. updates 2.1 file,
    3. compares updated file with the original one: if it was updated, overwrites the original (so the timestamp does not change if there were no updates),
    4. repeats the above for 3.1 file.

Update rules:

  1. We keep all BMRB and PDB hits, but only top 5 hits from other databases
  2. _Sequence_homology_query_date (and its 3.1 equivalent) is not updated if sequence was BLASTed. This way timestamps on entry files don't change when the data wasn't actually updated. To find out if/when an entry was last BLASTed, check the timestamp on bmrNNNN.blast.xml file in entry's clean/ directory.
  3. _Sequence_homology_query_revised_last_date (and its 3.1 equivalent) is updated if the list of hits is updated.
  4. Hits are sorted by database ID with BMRB first, then PDB, then the rest (alphabetically). Then they're sorted by accession numbers. Duplicate hits (same DB ID and accession number) are discarded, keeping the higher-scoring one. (Note that this discards hits to multiple chains in the same entry, so the links are entry-to-entry, not chain-to-chain. This may need to be changed.)
  5. Hist shorter than 51% of query sequence length are discarded.
  6. Hits with score below 98% are discarded. The score is calculated as num. identities divided by alignment length and multiplied by 100:
    (numIdentities/alignLength)*100
  7. If hit list returned by BLAST search is any different from hit list already in the entry, it replaces the entry list (old list is discarded).

Sample BLAST outputs

Text output:

>sp|P19339|SXL_DROME Sex-lethal protein
          Length = 354

 Score =  367 bits (941), Expect = e-100
 Identities = 183/183 (100%), Positives = 183/183 (100%)

Query: 2   GSDDLMNDPRASNTNLIVNYLPQDMTDRELYALFRAIGPINTCRIMRDYKTGYSFGYAFV 61
           GSDDLMNDPRASNTNLIVNYLPQDMTDRELYALFRAIGPINTCRIMRDYKTGYSFGYAFV
Sbjct: 112 GSDDLMNDPRASNTNLIVNYLPQDMTDRELYALFRAIGPINTCRIMRDYKTGYSFGYAFV 171

Query: 62  DFTSEMDSQRAIKVLNGITVRNKRLKVSYARPGGESIKDTNLYVTNLPRTITDDQLDTIF 121
           DFTSEMDSQRAIKVLNGITVRNKRLKVSYARPGGESIKDTNLYVTNLPRTITDDQLDTIF
Sbjct: 172 DFTSEMDSQRAIKVLNGITVRNKRLKVSYARPGGESIKDTNLYVTNLPRTITDDQLDTIF 231

Query: 122 GKYGSIVQKNILRDKLTGRPRGVAFVRYNKREEAQEAISALNNVIPEGGSQPLSVRLAEE 181
           GKYGSIVQKNILRDKLTGRPRGVAFVRYNKREEAQEAISALNNVIPEGGSQPLSVRLAEE
Sbjct: 232 GKYGSIVQKNILRDKLTGRPRGVAFVRYNKREEAQEAISALNNVIPEGGSQPLSVRLAEE 291

Query: 182 HGK 184
           HGK
Sbjct: 292 HGK 294

Parse result

 _Residue_count                               183
   
 loop_
    _Database_name                            SWISS-PROT # (sp)
    _Database_accession_code                  P19339
    _Database_entry_mol_name                  "SXL_DROME Sex-lethal protein"
    _Sequence_query_to_submitted_percentage   100%  # second 183 from "Identities = 183/183 (100%)" / _Residue_count * 100
    _Sequence_subject_length                  354   # from "Length = 354"
    _Sequence_identity                        100%  # from "Identities = 183/183 (100%)"
    _Sequence_positive                        100%  # from "Positives = 183/183 (100%)"
    _Sequence_homology_expectation_value      e-100 # from "Expect = e-100"

XML output:

<BlastOutput_iterations>
    <Iteration>
        <Iteration_iter-num>1</Iteration_iter-num>
        <Iteration_query-ID>lcl|1_0</Iteration_query-ID>
        <Iteration_query-def>gnl|mdb|bmrb15072:oscp-nt</Iteration_query-def>
        <Iteration_query-len>120</Iteration_query-len>
        <Iteration_hits>
            <Hit>
            <Hit_num>1</Hit_num>
            <Hit_id>gnl|mdb|bmrb15072:1 test</Hit_id>
<-- database name is either "gnl|mdb|bmrbID:entityID entity name" or "gi|ID|DB|ID|descr"
            <Hit_def>oscp-nt</Hit_def>
            <Hit_accession>bmrb15072:1 test</Hit_accession>
            <Hit_len>120</Hit_len>
            <Hit_hsps>
                <Hsp>
                <Hsp_num>1</Hsp_num>
                <Hsp_bit-score>232.646</Hsp_bit-score>
                <Hsp_score>592</Hsp_score>
                <Hsp_evalue>2.39683e-60</Hsp_evalue>
                <Hsp_query-from>1</Hsp_query-from>
                <Hsp_query-to>120</Hsp_query-to>
                <Hsp_hit-from>1</Hsp_hit-from>
                <Hsp_hit-to>120</Hsp_hit-to>
                <Hsp_query-frame>1</Hsp_query-frame>
                <Hsp_hit-frame>1</Hsp_hit-frame>
                <Hsp_identity>120</Hsp_identity>
                <Hsp_positive>120</Hsp_positive>
                <Hsp_align-len>120</Hsp_align-len>
                <Hsp_qseq>FAKLVRPPVQIYGIEGRYATALYSAASKQNKLEQVEKELLRVGQILKEPKMAASLLNPYVKRSVKVKSLSDMTAKEKFSPLTSNLINLLAENGRLTNTPAVISAFSTMMSVHRGEVPCTV</Hsp_qseq>
                <Hsp_hseq>FAKLVRPPVQIYGIEGRYATALYSAASKQNKLEQVEKELLRVGQILKEPKMAASLLNPYVKRSVKVKSLSDMTAKEKFSPLTSNLINLLAENGRLTNTPAVISAFSTMMSVHRGEVPCTV</Hsp_hseq>
                <Hsp_midline>FAKLVRPPVQIYGIEGRYATALYSAASKQNKLEQVEKELLRVGQILKEPKMAASLLNPYVKRSVKVKSLSDMTAKEKFSPLTSNLINLLAENGRLTNTPAVISAFSTMMSVHRGEVPCTV</Hsp_midline>
                </Hsp>
            </Hit_hsps>
            </Hit>

Parse result

 loop_
    _Database_name                            BMRB
    _Database_accession_code                  15072
    _Database_entry_mol_name                  test
    _Sequence_query_to_submitted_percentage   100%      # (Hsp_align-len / Iteration_query-len) * 100
    _Sequence_subject_length                  120       # (Hit_len)
    _Sequence_identity                        100%      # (Hsp_identity / Hsp_align-len ) * 100
    _Sequence_positive                        100%      # (Hsp_positive / Hsp_align-len ) * 100
    _Sequence_homology_expectation_value      2.39e-100 # (Hsp_evalue rounded)
Login