Condor tips

Removing stuck jobs

A job can sit in the idle state indefinitely when no resources match its request's constraints. (I in the ST column of the condor_q output means “idle”.)

[root@hawk ~]# condor_q


-- Submitter: hawk.bmrb.wisc.edu : <144.92.167.248:24733> : hawk.bmrb.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
2731.0   madings         6/6  12:47   0+00:44:12 R  0   7.3  condor_dagman     
2732.0   madings         6/6  12:47   0+00:43:51 R  0   7.3  condor_dagman     
2881.0   madings         6/6  12:56   0+00:01:04 I  0   17089.8 bash
...

[root@hawk ~]# condor_q -better-analyze 2881


-- Submitter: hawk.bmrb.wisc.edu : <144.92.167.248:24733> : hawk.bmrb.wisc.edu
---                                                                           
2881.000:  Run analysis summary.  Of 32 machines,                             
     32 are rejected by your job's requirements                               
      0 reject your job because of their own requirements                     
      0 match but are serving users with a better priority in the pool        
      0 match but reject the job for unknown reasons                          
      0 match but will not currently preempt their existing job               
      0 match but are currently offline                                       
      0 are available to run your job                                         
        Last successful match: Mon Jun  6 12:56:20 2011                       
        Last failed match: Mon Jun  6 13:32:53 2011                           
        Reason for last match failure: no match found                         

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) &&
( ( RequestMemory * 1024 ) >= ImageSize ) &&                                
( TARGET.FileSystemDomain == MY.FileSystemDomain )                          

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( 1024 * TARGET.Memory ) >= 17500000 )  0                   REMOVE
...
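
The failing clause is easier to read as a unit conversion: TARGET.Memory is advertised in megabytes and ImageSize is in kilobytes, which is why the expression multiplies by 1024. The sketch below works out the smallest Memory value a machine would need to satisfy the clause for the 17500000 shown in the analysis. The helper name mem_needed_mb is made up for illustration; it is not a Condor command.

```shell
#!/bin/sh
# Threshold taken from the condor_q -better-analyze output above (ImageSize units: KB).
image_size_kb=17500000

# mem_needed_mb (hypothetical helper): smallest whole-MB value of TARGET.Memory
# that satisfies ( TARGET.Memory * 1024 ) >= ImageSize, i.e. ceil(ImageSize / 1024).
mem_needed_mb() {
    echo $(( ($1 + 1023) / 1024 ))
}

needed=$(mem_needed_mb "$image_size_kb")
echo "A machine must advertise Memory >= ${needed} MB to match this job"
```

The result, 17090 MB, lines up with the 17089.8 in the SIZE column of condor_q. None of the 32 machines in this pool advertises that much memory, so the job never matches.
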
[root@hawk ~]# condor_q -long 2881                                                                              


-- Submitter: hawk.bmrb.wisc.edu : <144.92.167.248:24733> : hawk.bmrb.wisc.edu

… a long list of attributes, including

Requirements = ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.FileSystemDomain == MY.FileSystemDomain )

Copy and paste that expression into condor_qedit, editing the offending clause (the one comparing TARGET.Memory with ImageSize) so that it no longer rules out every machine:

[root@hawk ~]# condor_qedit 2881 Requirements ' ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( TARGET.Memory > 0 ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.FileSystemDomain == MY.FileSystemDomain )'
Set attribute "Requirements".
[root@hawk ~]# condor_q


-- Submitter: hawk.bmrb.wisc.edu : <144.92.167.248:24733> : hawk.bmrb.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
2731.0   madings         6/6  12:47   0+00:50:04 R  0   7.3  condor_dagman
2732.0   madings         6/6  12:47   0+00:49:43 R  0   7.3  condor_dagman
2881.0   madings         6/6  12:56   0+00:01:45 R  0   17089.8 bash
...

Job 2881 is now running (R).
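
Hand-editing a long expression is error-prone, so the same substitution can be done with sed. This is a minimal sketch, assuming the Requirements string has exactly the spacing shown in the condor_q -long output above; relax_memory is a hypothetical helper name, and the condor_qedit call is left commented out so the snippet can be tried away from the submit host.

```shell
#!/bin/sh
# Requirements expression copied from the "condor_q -long 2881" output above.
req='( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.FileSystemDomain == MY.FileSystemDomain )'

# relax_memory (hypothetical helper): reads an expression on stdin and replaces
# the machine-memory clause with ( TARGET.Memory > 0 ).
# In a basic regular expression the parentheses are literal; only "." and "*"
# need escaping.
relax_memory() {
    sed 's/( ( TARGET\.Memory \* 1024 ) >= ImageSize )/( TARGET.Memory > 0 )/'
}

new_req=$(printf '%s\n' "$req" | relax_memory)
printf '%s\n' "$new_req"

# On the submit host, apply the edited expression to the job:
# condor_qedit 2881 Requirements "$new_req"
```

If the clause is not found, sed prints the expression unchanged, so it is worth comparing the output with the original before running condor_qedit.
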

To avoid this in future submissions, put 'requirements = (TARGET.Memory > 0)' in the job's submit file. Condor will then use that clause instead of its default, ( TARGET.Memory * 1024 ) >= ImageSize.
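
As an illustration, the snippet below writes a small submit file containing that requirements line. Everything except the requirements line is an assumption made for the example: the file name job.sub, the script name myscript.sh, and the choice of the vanilla universe are placeholders, and a real submit file will have its own settings.

```shell
#!/bin/sh
# Write a hypothetical submit file; only the "requirements" line comes from the tip above.
cat > job.sub <<'EOF'
# job.sub: placeholder submit file for illustration
universe     = vanilla
executable   = /bin/bash
arguments    = myscript.sh
requirements = (TARGET.Memory > 0)
queue
EOF

# Show the line that replaces Condor's default memory clause.
grep '^requirements' job.sub

# On the submit host:
# condor_submit job.sub
```

After submitting, condor_q -long on the new job should show ( TARGET.Memory > 0 ) in its Requirements in place of the ImageSize comparison.
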
