HTCondor-CE Troubleshooting Guide

About This Guide

In this document, you will find a collection of files and commands to help troubleshoot HTCondor-CE along with a list of common issues with suggested troubleshooting steps.

HTCondor-CE Troubleshooting Data

The following files are located on the CE host.

MasterLog

The HTCondor-CE master log tracks the status of all of the other HTCondor daemons and thus contains valuable information if they fail to start.

To increase the debug level in this log, set the following value in /etc/condor-ce/config.d/99-local.conf on the CE host:

MASTER_DEBUG = D_FULLDEBUG

To apply these changes, reconfigure HTCondor-CE:

[root@client ~]$ condor_ce_reconfig

Location:

/var/log/condor-ce/MasterLog

Key Contents

  • Start-up, shut-down, and communication with other HTCondor daemons

What to look for:

Successful daemon start-up. The following line shows that the Collector daemon started successfully:

10/07/14 14:20:27 Started DaemonCore process "/usr/sbin/condor_collector -f -port 9619", pid and pgroup = 7318
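A quick way to confirm that each daemon came up is to search the log for these start-up lines (a simple sketch):

[root@client ~]$ grep 'Started DaemonCore' /var/log/condor-ce/MasterLog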

SchedLog

The HTCondor-CE schedd log contains information on all jobs that are submitted to the CE. It contains valuable information when trying to troubleshoot authentication issues.

To increase the debug level in this log, set the following value in /etc/condor-ce/config.d/99-local.conf on the CE host:

SCHEDD_DEBUG = D_FULLDEBUG

To apply these changes, reconfigure HTCondor-CE:

[root@client ~]$ condor_ce_reconfig

Location:

/var/log/condor-ce/SchedLog

Key Contents

  • Every job submitted to the CE
  • User authorization events

What to look for:

  • Job owner is authorized and mapped:
    10/07/14 16:52:17 Command=QMGMT_WRITE_CMD, peer=<131.225.154.68:42262>
    10/07/14 16:52:17 AuthMethod=GSI, AuthId=/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brian Lin 1047,/GLOW/Role=NULL/Capability=NULL, CondorId=glow@users.opensciencegrid.org
    In this example, the job is authorized with the job's proxy subject using GSI and is mapped to the glow user.
  • User job submission fails due to improper authentication or authorization:
    08/30/16 16:52:56 DC_AUTHENTICATE: required authentication of 72.33.0.189 failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXZpUlYa)
    08/30/16 16:53:12 PERMISSION DENIED to gsi@unmapped from host 72.33.0.189 for command 60021 (DC_NOP_WRITE), access level WRITE: reason: WRITE authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 72.33.0.189,dyn-72-33-0-189.uwnet.wisc.edu, hostname size = 1, original ip address = 72.33.0.189
    08/30/16 16:53:12 DC_AUTHENTICATE: Command not authorized, done!
  • Job is submitted to the CE queue:
    10/07/14 16:52:17 Submitting new job 234.0
    In this example, the ID of the submitted job is 234.0.
  • Missing negotiator:
    10/18/14 17:32:21 Can't find address for negotiator
    10/18/14 17:32:21 Failed to send RESCHEDULE to unknown daemon:
    Since HTCondor-CE does not manage any resources, it does not run a negotiator daemon by default and this error message is expected. In the same vein, you may see messages that there are 0 worker nodes:
    06/23/15 11:15:03 Number of Active Workers 0 
  • Corrupted job_queue.log:
    02/07/17 10:55:48 Logging per-job history files to: /var/lib/gratia/condorce_data
    02/07/17 10:55:48 Failed to execute /usr/sbin/condor_shadow.std, ignoring
    02/07/17 10:55:49 WARNING: Encountered corrupt log record 95654 (byte offset 5046225)
    02/07/17 10:55:49     103 1354325.0 PeriodicRemove ( StageInFinish > 0 ) 105
    02/07/17 10:55:49 Lines following corrupt log record 95654 (up to 3):
    02/07/17 10:55:49     103 1346101.0 RemoteWallClockTime 116668.000000
    02/07/17 10:55:49     104 1346101.0 WallClockCheckpoint
    02/07/17 10:55:49     104 1346101.0 ShadowBday
    02/07/17 10:55:49 ERROR "Error: corrupt log record 95654 (byte offset 5046225) occurred inside closed transaction, recovery failed" at line 1080 in file /builddir/build/BUILD/condor-8.4.8/src/condor_utils/classad_log.cpp
    02/07/17 10:55:49 Cron: Killing all jobs
    02/07/17 10:55:49 CronJobList: Deleting all jobs
    02/07/17 10:55:49 Cron: Killing all jobs
    02/07/17 10:55:49 CronJobList: Deleting all jobs
    This means /var/lib/condor-ce/spool/job_queue.log has been corrupted and you will need to remove the offending record by hand: search for the text specified after the Lines following corrupt log record... line. The most common cause of the corruption is that the disk containing job_queue.log has filled up. To avoid this problem, you can change the location of job_queue.log by setting JOB_QUEUE_LOG in a file in /etc/condor-ce/config.d/ to a path on a different disk, preferably a large SSD.
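    For example, a minimal override along these lines (the /data path is an illustrative assumption; place the setting in a file such as /etc/condor-ce/config.d/99-local.conf):

    JOB_QUEUE_LOG = /data/condor-ce/spool/job_queue.log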

JobRouterLog

The HTCondor-CE job router log is produced by the job router itself and thus contains valuable information when trying to troubleshoot issues with job routing.

To increase the debug level in this log, set the following value in /etc/condor-ce/config.d/99-local.conf on the CE host:

JOB_ROUTER_DEBUG = D_FULLDEBUG

To apply these changes, reconfigure HTCondor-CE:

[root@client ~]$ condor_ce_reconfig

Location:

/var/log/condor-ce/JobRouterLog

Key contents:

  • Every attempt to route a job
  • Routing success messages
  • Job attribute changes, based on chosen route
  • Job submission errors to an HTCondor batch system
  • Corresponding job IDs on an HTCondor batch system

Known Errors:

  • If you have D_FULLDEBUG turned on for the job router, you will see errors like the following:
    06/12/15 14:00:28 HOOK_UPDATE_JOB_INFO not configured.
    You can safely ignore these.

Symptoms

HTCondor batch systems only: The following error occurs when the job router daemon cannot submit the routed job:

10/19/14 13:09:15 Can't resolve collector condorce.example.com; skipping
10/19/14 13:09:15 ERROR (pool condorce.example.com) Can't find address of schedd
10/19/14 13:09:15 JobRouter failure (src=5.0,route=Local_Condor): failed to submit job

What to look for:

  • Job is considered for routing:
    09/17/14 15:00:56 JobRouter (src=86.0,route=Local_LSF): found candidate job

    In parentheses are the original HTCondor-CE job ID (e.g., 86.0) and the route (e.g., Local_LSF).

  • Job is successfully routed:
    09/17/14 15:00:57 JobRouter (src=86.0,route=Local_LSF): claimed job
  • Finding the corresponding job ID on your HTCondor batch system:
    09/17/14 15:00:57 JobRouter (src=86.0,dest=205.0, route=Local_Condor): claimed job

    In parentheses are the original HTCondor-CE job ID (e.g., 86.0) and the resultant job ID on the HTCondor batch system (e.g., 205.0)

  • If your job is not routed, there will not be any evidence of it within the log itself. To investigate why your jobs are not being considered for routing, use the condor_ce_job_router_info tool.
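To pull all of the routing messages above for a specific job out of the JobRouterLog, a simple grep works (86.0 is the example job ID from the bullets above):

[root@client ~]$ grep 'src=86.0' /var/log/condor-ce/JobRouterLog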

GridmanagerLog

The HTCondor-CE grid manager log tracks the submission and status of jobs on the batch system. It contains valuable information when trying to troubleshoot jobs that have been routed but failed to complete. Details on how to read the Gridmanager log can be found on the HTCondor Wiki.

To increase the debug level and log retention for this log, set the following values in /etc/condor-ce/config.d/99-local.conf on the CE host:

MAX_GRIDMANAGER_LOG = 6h
MAX_NUM_GRIDMANAGER_LOG = 8
GRIDMANAGER_DEBUG = D_FULLDEBUG

To apply these changes, reconfigure HTCondor-CE:

[root@client ~]$ condor_ce_reconfig

Location:

/var/log/condor-ce/GridmanagerLog.<job owner>

Key Contents

  • Every attempt to submit a job to a batch system or other grid resource
  • Status updates of submitted jobs
  • Corresponding job IDs on non-HTCondor batch systems

What to look for:

  • Job is submitted to the batch system:
    09/17/14 09:51:34 [12997] (85.0) gm state change: GM_SUBMIT_SAVE -> GM_SUBMITTED

    Every state change the Gridmanager tracks should have the job ID in parentheses (e.g., (85.0)).

  • Job status being updated:
    09/17/14 15:07:24 [25543] (87.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
    09/17/14 15:07:24 [25543] GAHP[25563] <- 'BLAH_JOB_STATUS 3 lsf/20140917/482046'
    09/17/14 15:07:24 [25543] GAHP[25563] -> 'S'
    09/17/14 15:07:25 [25543] GAHP[25563] <- 'RESULTS'
    09/17/14 15:07:25 [25543] GAHP[25563] -> 'R'
    09/17/14 15:07:25 [25543] GAHP[25563] -> 'S' '1'
    09/17/14 15:07:25 [25543] GAHP[25563] -> '3' '0' 'No Error' '4' '[ BatchjobId = "482046"; JobStatus = 4; ExitCode = 0; WorkerNode = "atl-prod08" ]'

    The first line tells us that the Gridmanager is initiating a status update and the following lines are the results. The most interesting line is the final GAHP result, which notes the job ID on the batch system and its status. If there are errors querying the job on the batch system, they will appear here.

  • Finding the corresponding job ID on your non-HTCondor batch system:
    09/17/14 15:07:24 [25543] (87.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
    09/17/14 15:07:24 [25543] GAHP[25563] <- 'BLAH_JOB_STATUS 3 lsf/20140917/482046'

    On the first line, after the timestamp and PID of the Gridmanager process, you will find the CE’s job ID in parentheses. At the end of the second line, you will find the batch system, date, and batch system job id separated by slashes.

  • Job completion on the batch system:
    09/17/14 15:07:25 [25543] (87.0) gm state change: GM_TRANSFER_OUTPUT -> GM_DONE_SAVE
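To follow a single job through all of its Gridmanager state changes, you can grep the log for its CE job ID (87.0 is the example ID from above):

[root@client ~]$ grep '(87.0)' /var/log/condor-ce/GridmanagerLog.<job owner>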

SharedPortLog

The HTCondor-CE shared port log keeps track of all connections to all of the HTCondor-CE daemons other than the collector. This log is a good place to check if you are experiencing connectivity issues with HTCondor-CE. More information on the shared port daemon can be found in the HTCondor manual.

To increase the debug level in this log, set the following value in /etc/condor-ce/config.d/99-local.conf on the CE host:

SHARED_PORT_DEBUG = D_FULLDEBUG

To apply these changes, reconfigure HTCondor-CE:

[root@client ~]$ condor_ce_reconfig

Location:

/var/log/condor-ce/SharedPortLog

Key Contents

  • Every attempt to connect to HTCondor-CE (except collector queries)

Messages log

The messages file can include output from lcmaps, which handles mapping of X.509 proxies to Unix usernames. If there are issues with the authentication setup, the errors may appear here.

Location:

/var/log/messages

Key Contents

  • User authentication

What to look for:

  • A user is mapped:
    Oct  6 10:35:32 osgserv06 htcondor-ce-llgt[12147]: Callout to "LCMAPS" returned local user (service condor): "osgglow01"
  • Specific error messages and methods to troubleshoot them can be found in this document

BLAHP Configuration File

HTCondor-CE uses the BLAHP to submit jobs to your local non-HTCondor batch system using your batch system's client tools. You can also tell the BLAHP to save the files that are being submitted to the local batch system to DIR_NAME by adding the following line:

blah_debug_save_submit_info=DIR_NAME

The BLAHP will then create a directory with the format bl_* for each submission to the local jobmanager with the submit file and proxy used.

HELP NOTE
Whitespace is important, so do not put any spaces around the = sign. In addition, the directory must already exist and HTCondor-CE must have sufficient permissions to create directories within DIR_NAME.
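For example, a sketch of preparing such a directory (the /var/tmp/bl_debug path is an illustrative assumption; mode 1777 lets jobs running as different mapped users create their bl_* subdirectories):

[root@client ~]$ mkdir -p /var/tmp/bl_debug
[root@client ~]$ chmod 1777 /var/tmp/bl_debug

Then set blah_debug_save_submit_info=/var/tmp/bl_debug in /etc/blah.config.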

Location:

/etc/blah.config

Key Contents

  • Locations of the batch system's client binaries and logs
  • Location to save files that are submitted to the local batch system

HTCondor-CE Troubleshooting Tools

HTCondor-CE has its own separate set of the HTCondor tools with _ce_ in the name (i.e., condor_ce_submit vs condor_submit). Some of the commands are only for the CE (e.g., condor_ce_run and condor_ce_trace) but many of them are just HTCondor commands configured to interact with the CE (e.g., condor_ce_q, condor_ce_status). It is important to differentiate the two: condor_ce_config_val will provide configuration values for your HTCondor-CE while condor_config_val will provide configuration values for your batch system.

condor_ce_run

Usage

Similar to globus-job-run, condor_ce_run is a tool that submits a simple job to your CE, so it is useful for verifying that job submission works from end-to-end. To submit a job to the CE and run the env command on the remote batch system:

[user@client ~]$ condor_ce_run -r condorce.example.com:9619 /bin/env

Replace condorce.example.com with the hostname of the CE. If you are troubleshooting an HTCondor-CE that you do not have a login for and the CE accepts local universe jobs, you can run commands locally on the CE by passing condor_ce_run the -l option. The following example outputs the JobRouterLog of the CE in question:

[user@client ~]$ condor_ce_run -lr condorce.example.com:9619 cat /var/log/condor-ce/JobRouterLog

Again, replace condorce.example.com with the hostname of the CE. To disable this feature on your CE, consult the install documentation.

Troubleshooting

  1. If you do not see any results: condor_ce_run does not display results until the job completes on the CE, which may take several minutes or longer if the CE is busy. In the meantime, you can use condor_ce_q in a separate terminal to track the job on the CE. If you never see any results, use condor_ce_trace to pinpoint errors.
  2. If you see an error message that begins with “Failed to…”: Check connectivity to the CE with condor_ce_trace or condor_ce_ping

condor_ce_trace

Usage

If condor_ce_run fails, the condor_ce_trace tool may help in verifying the install:

condor_ce_trace --debug condorce.example.com

[user@client ~]$ condor_ce_trace condorce.example.com
Testing HTCondor-CE collector connectivity.
***** condor_ping output *****
10/07/14 12:54:40 recognized 60011 as command number.
Remote Version:              $CondorVersion: 8.0.7 Sep 24 2014 $
Local  Version:              $CondorVersion: 8.0.7 Sep 24 2014 $
Session ID:                  condorce:22494:1412704480:2403
Instruction:                 60011
Command:                     60011
Encryption:                  none
Integrity:                   MD5
Authenticated using:         GSI
All authentication methods:  GSI
Remote Mapping:              glow@users.opensciencegrid.org
Authorized:                  TRUE

********************
- Successful ping of collector on <131.225.154.68:9619>.

Testing HTCondor-CE schedd connectivity.
***** condor_ping output *****
10/07/14 12:54:40 recognized 60011 as command number.
Remote Version:              $CondorVersion: 8.0.7 Sep 24 2014 $
Local  Version:              $CondorVersion: 8.0.7 Sep 24 2014 $
Session ID:                  condorce:22495:1412704480:336
Instruction:                 60011
Command:                     60011
Encryption:                  none
Integrity:                   MD5
Authenticated using:         GSI
All authentication methods:  GSI
Remote Mapping:              glow@users.opensciencegrid.org
Authorized:                  TRUE

********************
- Successful ping of schedd on <131.225.154.68:9620?sock=22489_8590_4>.

Job ad, pre-submit: 
    [
        Log = "/cloud/login/blin/.log_5237_YCPBqo"; 
        x509UserProxyVOName = "GLOW"; 
        x509userproxysubject = "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brian Lin 1047/CN=proxy"; 
        Out = "/cloud/login/blin/.stdout_5237_s9KGwd"; 
        LeaveJobInQueue = ( StageOutFinish > 0 ) isnt true; 
        x509UserProxyFirstFQAN = "/GLOW/Role=NULL/Capability=NULL"; 
        x509userproxy = "/tmp/x509up_u47646"; 
        x509UserProxyFQAN = "/GLOW/Role=NULL/Capability=NULL"; 
        Args = ""; 
        Err = "/cloud/login/blin/.stderr_5237_j_FluG"; 
        Cmd = "/bin/env"; 
        x509UserProxyExpiration = 1412736896
    ]
Submitting job to schedd <131.225.154.68:9620?sock=22489_8590_4>
- Successful submission; cluster ID 229
Resulting job ad: 
    [
        BufferSize = 524288; 
        NiceUser = false; 
        CoreSize = -1; 
        CumulativeSlotTime = 0; 
        OnExitHold = false; 
        RequestCpus = 1; 
        Err = "_condor_stderr"; 
        BufferBlockSize = 32768; 
        x509userproxy = "/tmp/x509up_u47646"; 
        TransferOutputRemaps = "_condor_stdout=/cloud/login/blin/.stdout_5237_s9KGwd;_condor_stderr=/cloud/login/blin/.stderr_5237_j_FluG"; 
        ImageSize = 100; 
        CurrentTime = time(); 
        WantCheckpoint = false; 
        CommittedTime = 0; 
        TargetType = "Machine"; 
        WhenToTransferOutput = "ON_EXIT"; 
        Cmd = "/bin/env"; 
        JobUniverse = 5; 
        ExitBySignal = false; 
        HoldReasonCode = 16; 
        Iwd = "/cloud/login/blin"; 
        NumRestarts = 0; 
        CommittedSuspensionTime = 0; 
        Owner = undefined; 
        NumSystemHolds = 0; 
        CumulativeSuspensionTime = 0; 
        RequestDisk = DiskUsage; 
        Requirements = true && TARGET.OPSYS == "LINUX" && TARGET.ARCH == "X86_64" && TARGET.HasFileTransfer && TARGET.Disk >= RequestDisk && TARGET.Memory >= RequestMemory; 
        MinHosts = 1; 
        JobNotification = 0; 
        NumCkpts = 0; 
        LastSuspensionTime = 0; 
        NumJobStarts = 0; 
        WantRemoteSyscalls = false; 
        JobPrio = 0; 
        RootDir = "/"; 
        CurrentHosts = 0; 
        x509UserProxyExpiration = 1412736896; 
        StreamOut = false; 
        WantRemoteIO = true; 
        OnExitRemove = true; 
        DiskUsage = 1; 
        In = "/dev/null"; 
        PeriodicRemove = false; 
        RemoteUserCpu = 0.0; 
        LocalUserCpu = 0.0; 
        LocalSysCpu = 0.0; 
        RemoteSysCpu = 0.0; 
        ClusterId = 229; 
        Log = "/cloud/login/blin/.log_5237_YCPBqo"; 
        CompletionDate = 0; 
        RemoteWallClockTime = 0.0; 
        x509UserProxyFQAN = "/GLOW/Role=NULL/Capability=NULL"; 
        LeaveJobInQueue = JobStatus == 4 && ( CompletionDate is UNDEFINED || CompletionDate == 0 || ( ( time() - CompletionDate ) < 864000 ) ); 
        CondorVersion = "$CondorVersion: 8.0.7 Sep 24 2014 $"; 
        MyType = "Job"; 
        StreamErr = false; 
        HoldReason = "Spooling input data files"; 
        PeriodicHold = false; 
        ProcId = 0; 
        x509UserProxyFirstFQAN = "/GLOW/Role=NULL/Capability=NULL"; 
        Out = "_condor_stdout"; 
        JobStatus = 5; 
        PeriodicRelease = false; 
        RequestMemory = ifthenelse(MemoryUsage isnt undefined,MemoryUsage,( ImageSize + 1023 ) / 1024); 
        Args = ""; 
        MaxHosts = 1; 
        TotalSuspensions = 0; 
        CommittedSlotTime = 0; 
        x509userproxysubject = "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brian Lin 1047/CN=proxy"; 
        x509UserProxyVOName = "GLOW"; 
        CondorPlatform = "$CondorPlatform: X86_64-CentOS_6.5 $"; 
        ShouldTransferFiles = "YES"; 
        ExitStatus = 0; 
        QDate = 1412704480; 
        EnteredCurrentStatus = 1412704480
    ]
Spooling cluster 229 files to schedd <131.225.154.68:9620?sock=22489_8590_4>
- Successful spooling
Querying job status (1/600)
Job status: Held
Querying job status (2/600)
Job status: Idle
Querying job status (3/600)
Job status: Idle
...
Querying job status (48/600)
Job status: Completed
***** Job output *****
_CONDOR_ANCESTOR_22435=22441:1412360582:256213668
_CONDOR_ANCESTOR_22441=5347:1412704518:1792957092
_CONDOR_ANCESTOR_5347=5356:1412704519:2135643703
PATH=/bin:/usr/bin:/sbin:/usr/sbin
OSG_JOB_CONTACT=host.name/jobmanager-condor
_CONDOR_SLOT=
OSG_DEFAULT_SE=None
OSG_GRID=/etc/osg/wn-client/
TMPDIR=/var/lib/condor/execute/dir_5347
GLOBUS_LOCATION=/usr
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_5347
_CONDOR_JOB_IWD=/var/lib/condor/execute/dir_5347
TEMP=/var/lib/condor/execute/dir_5347
OSG_HOSTNAME=condorce.example.com
OSG_STORAGE_ELEMENT=False
OSG_SITE_NAME=local
_CONDOR_JOB_PIDS=
OSG_APP=/share/osg/app
OSG_WN_TMP=None
X509_USER_PROXY=/var/lib/condor/execute/dir_5347/x509up_u47646
TMP=/var/lib/condor/execute/dir_5347
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_5347/.job.ad
OSG_SITE_WRITE=None
OSG_GLEXEC_LOCATION=None
OSG_DATA=UNAVAILABLE
HOME=/home/glow
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_5347/.machine.ad
OSG_SITE_READ=None
********************

The tool contacts both the CE’s Schedd and Collector daemons to see if you have permission to submit to the CE, displays the job ad that it submits to the CE, and tracks the resultant job.

Troubleshooting

  1. If the command fails with “Failed ping…”: Make sure that the HTCondor-CE daemons are running on the CE

  2. If you see “gsi@unmapped” in the “Remote Mapping” line: Either your credentials are not mapped on the CE or authentication is not set up at all. To set up authentication, refer to our installation document.

  3. If the job submits but does not complete: Look at the status of the job and perform the relevant troubleshooting steps.

condor_submit

Usage

Use the condor_submit (notice that this is not condor_ce_submit) command to manually test job submission from a remote host. You may need to try manual submission if you are having difficulty with condor_ce_run or condor_ce_trace, need to specify attributes for your local batch system, or simply want further tests of remote submission.

[user@client ~]$ condor_submit test.sub

universe = grid
grid_resource = condor condorce.example.com condorce.example.com:9619

executable = test.sh
output = test.out
error = test.err
log = test.log

ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT

use_x509userproxy = true
+Owner=undefined
queue

test.sub is a submit file that describes the attributes of the job in HTCondor’s job description language. It is outside the scope of this document to describe the full suite of attributes offered by condor_submit, but there are a few CE-specific attributes that can be set to override any defaults in the job routes:

# Set the requested memory to 2GB
+maxMemory = 2000
# Set the requested number of cores to 2
+xcount = 2
# Set the maximum amount of time that a job should run for to 10 minutes
+maxWallTime = 10
# Set the job's batch system queue to 'analysis'
+remote_queue = "analysis"

More information on condor_submit and submit files can be found in the HTCondor manual.

Troubleshooting

All interactions between condor_submit and the CE will be recorded in the file specified by the log attribute in your submit file. This includes acknowledgement of the job in your local queue, connection to the CE, and a record of job completion.

000 (786.000.000) 12/09 16:49:55 Job submitted from host: <131.225.154.68:53134>
...
027 (786.000.000) 12/09 16:50:09 Job submitted to grid resource
    GridResource: condor condorce.example.com condorce.example.com:9619
    GridJobId: condor condorce.example.com condorce.example.com:9619 796.0
...
005 (786.000.000) 12/09 16:52:19 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job


If there are issues contacting the CE, you will see error messages about a 'Down Globus Resource':

020 (788.000.000) 12/09 16:56:17 Detected Down Globus Resource
    RM-Contact: fermicloud133.fnal.gov
...
026 (788.000.000) 12/09 16:56:17 Detected Down Grid Resource
    GridResource: condor condorce.example.com condorce.example.com:9619

This indicates a communication issue with your CE that can be diagnosed with condor_ce_ping.

condor_ce_ping

Usage

Use the following condor_ce_ping command to test your ability to submit jobs to an HTCondor-CE:

[user@client ~]$ condor_ce_ping -verbose -name condorce.example.com -pool condorce.example.com:9619 WRITE

The following shows successful output where the user is able to submit jobs (Authorized: TRUE) as the glow user (Remote Mapping: glow@users.opensciencegrid.org):

Remote Version:              $CondorVersion: 8.0.7 Sep 24 2014 $
Local  Version:              $CondorVersion: 8.0.7 Sep 24 2014 $
Session ID:                  condorce:27407:1412286981:3
Instruction:                 WRITE
Command:                     60021
Encryption:                  none
Integrity:                   MD5
Authenticated using:         GSI
All authentication methods:  GSI
Remote Mapping:              glow@users.opensciencegrid.org
Authorized:                  TRUE

HELP NOTE
If you run the condor_ce_ping command on the CE that you are testing, omit the -name and -pool options. condor_ce_ping takes the same arguments as condor_ping and is documented in the HTCondor manual.

Troubleshooting

  1. If you see “ERROR: couldn’t locate (null)!”, that means the HTCondor-CE schedd (the daemon that schedules jobs) cannot be reached. To track down the issue, increase debugging levels on the CE:
    MASTER_DEBUG = D_FULLDEBUG
    SCHEDD_DEBUG = D_FULLDEBUG

    Then look in the Master log and Schedd log for any errors.

  2. If you see “gsi@unmapped” in the “Remote Mapping” line, this means that either your credentials are not mapped on the CE or that authentication is not set up at all. To set up authentication, refer to our installation document.

condor_ce_q

Usage

condor_ce_q can display job status or specific job attributes for jobs that are still in the CE’s queue.

To list jobs that are queued on a CE:

[user@client ~]$ condor_ce_q -name condorce.example.com -pool condorce.example.com:9619

To inspect the full ClassAd for a specific job, specify the -l flag and the job ID:

[user@client ~]$ condor_ce_q -name condorce.example.com -pool condorce.example.com:9619 -l <Job ID>

HELP NOTE
If you run the condor_ce_q command on the CE that you are testing, omit the -name and -pool options. condor_ce_q takes the same arguments as condor_q and is documented in the HTCondor manual.

Troubleshooting

If the jobs that you are submitting to a CE are not completing, condor_ce_q can tell you the status of your jobs.

  1. If the schedd is not running: You will see a lengthy message about being unable to contact the schedd. To track down the issue, increase the debugging levels on the CE with:
    MASTER_DEBUG = D_FULLDEBUG
    SCHEDD_DEBUG = D_FULLDEBUG
    

    To apply these changes, reconfigure HTCondor-CE:

    [root@client ~]$ condor_ce_reconfig

    Then look in the Master log and Schedd log on the CE for any errors.

  2. If there are issues with contacting the collector: You will see the following message:
    condor_ce_q -pool ce1.accre.vanderbilt.edu -name ce1.accre.vanderbilt.edu
    
    -- Failed to fetch ads from: <129.59.197.223:9620?sock=33630_8b33_4> : ce1.accre.vanderbilt.edu

    This may be due to network issues or bad HTCondor daemon permissions. To fix the latter issue, ensure that the ALLOW_READ configuration value is not set by checking condor_ce_config_val:

    [user@client ~]$ condor_ce_config_val -v ALLOW_READ
    Not defined: ALLOW_READ

    If it is defined, remove it from the file that is returned in the output.

  3. If a job is held: There should be an accompanying HoldReason that will tell you why it is being held. The HoldReason is in the job’s ClassAd, so you can use the long form of condor_ce_q to extract its value:
    [user@client ~]$ condor_ce_q -name condorce.example.com -pool condorce.example.com:9619 -l  | grep HoldReason
  4. If a job is idle: The most common cause is that it is not matching any routes in the CE’s job router. To find out whether this is the case, use the condor_ce_job_router_info tool.

condor_ce_history

Usage

condor_ce_history can display job status or specific job attributes for jobs that have left the CE’s queue.

To list jobs that have run on the CE:

[user@client ~]$ condor_ce_history -name condorce.example.com -pool condorce.example.com:9619

To inspect the full ClassAd for a specific job, specify the -l flag and the job ID:

[user@client ~]$ condor_ce_history -name condorce.example.com -pool condorce.example.com:9619 -l <Job ID>

HELP NOTE
If you run the condor_ce_history command on the CE that you are testing, omit the -name and -pool options. condor_ce_history takes the same arguments as condor_history and is documented in the HTCondor manual.

condor_ce_job_router_info

Usage

Use the condor_ce_job_router_info command to help troubleshoot your routes and how jobs will match to them.

To see all of your routes (the output is long because it combines your routes with the JOB_ROUTER_DEFAULTS configuration variable):

[root@client ~]$ condor_ce_job_router_info -config

Route 1
Name         : "Local_PBS"
Universe     : 9
MaxJobs      : 10000
MaxIdleJobs  : 2000
GridResource : batch pbs
Requirements : target.osgTestPBS is true
ClassAd      : 
    [
        set_osg_environment = "OSG_GRID='/etc/osg/wn-client/' OSG_SITE_READ='None' OSG_APP='/share/osg/app' OSG_HOSTNAME='condorce.example.com' OSG_DATA='UNAVAILABLE' OSG_GLEXEC_LOCATION='None' GLOBUS_LOCATION='/usr' OSG_STORAGE_ELEMENT='False' OSG_SITE_NAME='local' OSG_WN_TMP='None' PATH='/bin:/usr/bin:/sbin:/usr/sbin' OSG_SITE_WRITE='None' OSG_DEFAULT_SE='None' OSG_JOB_CONTACT='host.name/jobmanager-condor'"; 
        set_requirements = true; 
        MaxJobs = 10000; 
        copy_environment = "orig_environment"; 
        eval_set_remote_SMPGranularity = ifThenElse(InputRSL.xcount isnt null,InputRSL.xcount,ifThenElse(xcount isnt null,xcount,ifThenElse(default_xcount isnt null,default_xcount,1))); 
        delete_PeriodicRemove = true; 
        GridResource = "batch pbs"; 
        set_RoutedJob = true; 
        name = "Local_PBS"; 
        MaxIdleJobs = 2000; 
        eval_set_environment = debug(strcat("HOME=",userHome(Owner,"/")," ",ifThenElse(orig_environment is undefined,osg_environment,strcat(osg_environment," ",orig_environment)))); 
        eval_set_RequestMemory = ifThenElse(InputRSL.maxMemory isnt null,InputRSL.maxMemory,ifThenElse(maxMemory isnt null,maxMemory,ifThenElse(default_maxMemory isnt null,default_maxMemory,2000))); 
        eval_set_remote_NodeNumber = ifThenElse(InputRSL.xcount isnt null,InputRSL.xcount,ifThenElse(xcount isnt null,xcount,ifThenElse(default_xcount isnt null,default_xcount,1))); 
        eval_set_remote_queue = ifThenElse(InputRSL.queue isnt null,InputRSL.queue,ifThenElse(queue isnt null,queue,ifThenElse(default_queue isnt null,default_queue,""))); 
        eval_set_remote_cerequirements = ifThenElse(InputRSL.maxWallTime isnt null,strcat("Walltime == ",string(60 * InputRSL.maxWallTime)," && CondorCE == 1"),"CondorCE == 1"); 
        delete_osgTestPBS = true; 
        Requirements = target.osgTestPBS is true; 
        eval_set_RequestCpus = ifThenElse(InputRSL.xcount isnt null,InputRSL.xcount,ifThenElse(xcount isnt null,xcount,ifThenElse(default_xcount isnt null,default_xcount,1))); 
        delete_CondorCE = true; 
        TargetUniverse = 9
    ]

Route 2
Name         : "Condor Test"
Universe     : 5
MaxJobs      : 10000
MaxIdleJobs  : 2000
GridResource : 
Requirements : true
ClassAd      : 
    [
        set_osg_environment = "OSG_GRID='/etc/osg/wn-client/' OSG_SITE_READ='None' OSG_APP='/share/osg/app' OSG_HOSTNAME='condorce.example.com' OSG_DATA='UNAVAILABLE' OSG_GLEXEC_LOCATION='None' GLOBUS_LOCATION='/usr' OSG_STORAGE_ELEMENT='False' OSG_SITE_NAME='local' OSG_WN_TMP='None' PATH='/bin:/usr/bin:/sbin:/usr/sbin' OSG_SITE_WRITE='None' OSG_DEFAULT_SE='None' OSG_JOB_CONTACT='host.name/jobmanager-condor'"; 
        eval_set_accounting_group = "accounting_group"; 
        set_requirements = true; 
        MaxJobs = 10000; 
        copy_environment = "orig_environment"; 
        eval_set_remote_SMPGranularity = ifThenElse(InputRSL.xcount isnt null,InputRSL.xcount,ifThenElse(xcount isnt null,xcount,ifThenElse(default_xcount isnt null,default_xcount,1))); 
        delete_PeriodicRemove = true; 
        set_RoutedJob = true; 
        name = "Condor Test"; 
        MaxIdleJobs = 2000; 
        eval_set_environment = debug(strcat("HOME=",userHome(Owner,"/")," ",ifThenElse(orig_environment is undefined,osg_environment,strcat(osg_environment," ",orig_environment)))); 
        eval_set_accounting_group_user = "blin_user"; 
        eval_set_RequestMemory = ifThenElse(InputRSL.maxMemory isnt null,InputRSL.maxMemory,ifThenElse(maxMemory isnt null,maxMemory,ifThenElse(default_maxMemory isnt null,default_maxMemory,2000))); 
        eval_set_remote_NodeNumber = ifThenElse(InputRSL.xcount isnt null,InputRSL.xcount,ifThenElse(xcount isnt null,xcount,ifThenElse(default_xcount isnt null,default_xcount,1))); 
        eval_set_remote_queue = ifThenElse(InputRSL.queue isnt null,InputRSL.queue,ifThenElse(queue isnt null,queue,ifThenElse(default_queue isnt null,default_queue,""))); 
        eval_set_remote_cerequirements = ifThenElse(InputRSL.maxWallTime isnt null,strcat("Walltime == ",string(60 * InputRSL.maxWallTime)," && CondorCE == 1"),"CondorCE == 1"); 
        Requirements = true; 
        eval_set_RequestCpus = ifThenElse(InputRSL.xcount isnt null,InputRSL.xcount,ifThenElse(xcount isnt null,xcount,ifThenElse(default_xcount isnt null,default_xcount,1))); 
        delete_CondorCE = true; 
        TargetUniverse = 5
    ]

To see how the job router is handling a job that is currently in the CE’s queue, analyze the output of condor_ce_q (replace <Job ID> with the job ID that you are interested in):

[root@client ~]$ condor_ce_q -l <Job ID> | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.

Checking for candidate jobs. routing table is:
Route Name             Submitted/Max        Idle/Max     Throttle
Local_PBS                      0/  10000       0/   2000     none
Condor Test                    0/  10000       0/   2000     none

Umbrella constraint: ((target.x509userproxysubject =!= UNDEFINED) && (target.x509UserProxyExpiration =!= UNDEFINED) && (time() < target.x509UserProxyExpiration) && (target.JobUniverse =?= 5 || target.JobUniverse =?= 1)) && ( (target.osgTestPBS is true) || (true) ) && (target.ProcId >= 0 && target.JobStatus == 1 && (target.StageInStart is undefined || target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone" && target.Managed isnt "External" && target.Owner isnt Undefined && target.RoutedBy isnt "htcondor-ce")
Checking Job src=162,0 against all routes
	Route Matches: Condor Test
Found candidate job src=162,0,route=Condor Test
1 candidate jobs found

To inspect a job that has already left the queue, use condor_ce_history instead of condor_ce_q:

[root@client ~]$ condor_ce_history -l <Job ID> | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

HELP NOTE
If the proxy for the job has expired, the job will not match any routes. To work around this constraint:

[root@client ~]$ condor_ce_history -l <Job ID> | sed "s/^\(x509UserProxyExpiration\) = .*/\1 = `date +%s --date '+1 sec'`/" | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Alternatively, you can provide a file containing a job’s ClassAd as the input and edit attributes within that file:

[root@client ~]$ condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads <JobAd file>

Troubleshooting

  1. If the job does not match any route: You can identify this case when you see 0 candidate jobs found in the condor_ce_job_router_info output. This message means that, when compared to your job’s ClassAd, the Umbrella constraint does not evaluate to true. When troubleshooting, look at all of the expressions prior to the target.ProcId >= 0 expression, because it and everything following it is logic that the job router added so that routed jobs do not get routed again.

    Umbrella constraint: ((target.x509userproxysubject =!= UNDEFINED) && (target.x509UserProxyExpiration =!= UNDEFINED) && (time() < target.x509UserProxyExpiration) && (target.JobUniverse =?= 5 || target.JobUniverse =?= 1)) && ( (target.osgTestPBS is true) || (true) ) && (target.ProcId >= 0 && target.JobStatus == 1 && (target.StageInStart is undefined || target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone" && target.Managed isnt "External" && target.Owner isnt Undefined && target.RoutedBy isnt "htcondor-ce")
    0 candidate jobs found
    

  2. If your job matches more than one route: the tool will tell you by showing all matching routes after the job ID:
    Checking Job src=162,0 against all routes
    Route Matches: Local_PBS
    Route Matches: Condor Test

    To troubleshoot why this is occurring, look at the combined Requirements expressions for all routes and compare them to the job’s ClassAd. In the umbrella constraint below, the combined route Requirements are the ( (target.osgTestPBS is true) || (true) ) clause:

    Umbrella constraint: ((target.x509userproxysubject =!= UNDEFINED) && 
    (target.x509UserProxyExpiration =!= UNDEFINED) && 
    (time() < target.x509UserProxyExpiration) && 
    (target.JobUniverse =?= 5 || target.JobUniverse =?= 1)) && 
    ( (target.osgTestPBS is true) || (true) ) && 
    (target.ProcId >= 0 && target.JobStatus == 1 && 
    (target.StageInStart is undefined || target.StageInFinish isnt undefined) && 
    target.Managed isnt "ScheddDone" && 
    target.Managed isnt "Extenal" && 
    target.Owner isnt Undefined && 
    target.RoutedBy isnt "htcondor-ce")
    

    Both routes evaluate to true for the job’s ClassAd because it contained osgTestPBS = true. Make sure your routes are mutually exclusive, otherwise you may have jobs routed incorrectly! See the job route configuration page for more details.

  3. If it is unclear why jobs are matching a route: wrap the route's requirements expression in debug() and check the JobRouterLog for more information.

condor_ce_router_q

Usage

If you have multiple job routes and many jobs, condor_ce_router_q is a useful tool to see how jobs are being routed and their statuses:

[user@client ~]$ condor_ce_router_q

condor_ce_router_q takes the same options as condor_router_q and condor_q and is documented in the HTCondor manual.

condor_ce_status

Usage

To see the daemons running on a CE, you can run the following:

[user@client ~]$ condor_ce_status -any -name condorce.example.com -pool condorce.example.com:9619

Replace condorce.example.com with the hostname of the CE.

HELP NOTE
If you run the condor_ce_status command on the CE that you are testing, omit the -name and -pool options. condor_ce_status takes the same arguments as condor_status and is documented in the HTCondor manual.

Troubleshooting

To list the daemons that are configured to run:

[user@client ~]$ condor_ce_config_val -v DAEMON_LIST
DAEMON_LIST: MASTER, COLLECTOR, SCHEDD, JOB_ROUTER, SHARED_PORT
  Defined in '/etc/condor-ce/config.d/03-ce-shared-port.conf', line 9.

If you do not see these daemons in the output of condor_ce_status, check the Master log for errors.

condor_ce_config_val

Usage

To see the value of configuration variables and where they are set, use condor_ce_config_val. This tool is primarily used alongside the other troubleshooting tools to make sure your configuration is set properly. To see the value of a single variable and where it is set:

[user@client ~]$ condor_ce_config_val -v <configuration variable>

To see a list of all configuration variables and their values:

[user@client ~]$ condor_ce_config_val -dump

To see a list of all the files that are used to create your configuration and the order that they are parsed, use the following command:

[user@client ~]$ condor_ce_config_val -config

condor_ce_config_val takes the same arguments as condor_config_val and is documented in the HTCondor manual.

condor_ce_reconfig

Usage

To ensure that your configuration changes have taken effect, run condor_ce_reconfig.

[user@client ~]$ condor_ce_reconfig

condor_ce_{on,off,restart}

Usage

To turn on/off/restart HTCondor-CE daemons, use the following commands:

[root@client ~]$ condor_ce_on
[root@client ~]$ condor_ce_off
[root@client ~]$ condor_ce_restart

The HTCondor-CE service uses the previous commands with default values. Using these commands directly gives you finer-grained control over how HTCondor-CE turns on, turns off, and restarts:

  • If you have installed a new version of HTCondor-CE and want to restart the CE under the new version, run the following command:
    [root@client ~]$ condor_ce_restart -fast
    

    This will cause HTCondor-CE to restart and quickly reconnect to all running jobs.

  • If you need to stop running new jobs, run the following:
    [root@client ~]$ condor_ce_off -peaceful
    

    This will cause HTCondor-CE to accept new jobs without starting them and to wait for currently running jobs to complete before all of the daemons exit.

General Troubleshooting Items

Making sure packages are up-to-date

It is important to make sure that the HTCondor-CE and related RPMs are up-to-date.

[root@client ~]$ yum update "htcondor-ce*" blahp condor

If you just want to see the packages to update, but do not want to perform the update now, answer N at the prompt.
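Alternatively, yum can list the available updates without applying them:

[root@client ~]$ yum check-update "htcondor-ce*" blahp condor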

Verify package contents

If the contents of your HTCondor-CE packages have been changed, the CE may cease to function properly. To verify the contents of your packages (ignoring changes to configuration files):

[user@client ~]$ rpm -q --verify htcondor-ce htcondor-ce-client blahp | awk '$2 != "c" {print $0}'

If the verification command returns output, this means that your packages have been changed. To fix this, you can reinstall the packages:

[root@client ~]$ yum reinstall htcondor-ce htcondor-ce-client blahp

HELP NOTE
The reinstall command may place original versions of configuration files alongside the versions that you have modified. If this is the case, the reinstall command will notify you that the original versions have an .rpmnew suffix. Inspect these files to determine whether you need to merge them into your current configuration.

Verify clocks are synchronized

Like all GSI-based authentication, HTCondor-CE is sensitive to time skews. Make sure the clock on your CE is synchronized using a utility such as ntpd. Additionally, HTCondor itself is sensitive to time skews on the NFS server. If you see empty stdout/stderr being returned to the submitter, verify there is no NFS server time skew.
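A quick way to spot such a skew is to compare clocks on the two hosts (nfs.example.com is a placeholder for your NFS server):

[root@client ~]$ date -u; ssh nfs.example.com date -u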

HTCondor-CE Troubleshooting Items

This section contains common issues you may encounter using HTCondor-CE and next actions to take when you do. Before troubleshooting, we recommend increasing the log level:

  1. Write the following into /etc/condor-ce/config.d/99-local.conf to increase the log level for all daemons:
    ALL_DEBUG = D_FULLDEBUG
  2. Ensure that the configuration is in place:
    [root@client ~]$ condor_ce_reconfig
  3. Reproduce the issue

HELP NOTE
Before spending any time on troubleshooting, you should ensure that the state of configuration is as expected by running condor_ce_reconfig.

Daemons fail to start

If there are errors in your configuration of HTCondor-CE, some of its required daemons may fail to start. Check the following subsections in order:

Symptoms

Daemon startup failure may manifest in many ways; the following are a few symptoms of the problem.

  • The service fails to start:
    [root@client ~]$ service condor-ce start
    Starting Condor-CE daemons:			[ FAIL ]
    
  • condor_ce_q fails with a lengthy error message:
    [user@client ~]$ condor_ce_q
    Error: 
    
    Extra Info: You probably saw this error because the condor_schedd is not 
    running on the machine you are trying to query. If the condor_schedd is not 
    running, the Condor system will not be able to find an address and port to 
    connect to and satisfy this request. Please make sure the Condor daemons are 
    running and try again.
     
    Extra Info: If the condor_schedd is running on the machine you are trying to 
    query and you still see the error, the most likely cause is that you have 
    setup a personal Condor, you have not defined SCHEDD_NAME in your 
    condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE 
    setting. You must define either or both of those settings in your config 
    file, or you must use the -name option to condor_q. Please see the Condor 
    manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
    

Next actions

  1. If the MasterLog is filled with ERROR:SECMAN...TCP connection to collector...failed: This is likely due to a misconfiguration for a host with multiple network interfaces. Verify that you have followed the instructions for such hosts in the install guide.
  2. If the MasterLog is filled with DC_AUTHENTICATE errors: The HTCondor-CE daemons use the host certificate to authenticate with each other. Verify that your host certificate’s DN matches one of the regular expressions found in /etc/condor-ce/condor_mapfile.
  3. If the SchedLog is filled with Can’t find address for negotiator: You can ignore this error! The negotiator daemon is used in HTCondor batch systems to match jobs with resources but since HTCondor-CE does not manage any resources directly, it does not run one.

Jobs fail to submit to the CE

If a user is having issues submitting jobs to the CE and you've ruled out general connectivity or firewalls as the culprit, then you may have encountered an authentication or authorization issue. You may see error messages like the following in your schedd log:

08/30/16 16:52:56 DC_AUTHENTICATE: required authentication of 72.33.0.189 failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXZpUlYa)
08/30/16 16:53:12 PERMISSION DENIED to gsi@unmapped from host 72.33.0.189 for command 60021 (DC_NOP_WRITE), access level WRITE: reason: WRITE authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 72.33.0.189,dyn-72-33-0-189.uwnet.wisc.edu, hostname size = 1, original ip address = 72.33.0.189
08/30/16 16:53:12 DC_AUTHENTICATE: Command not authorized, done!

Next actions

  1. Check GUMS or grid-mapfile and ensure that the user's DN is known to your authentication method
  2. Check for lcmaps errors in /var/log/messages
  3. If you do not see helpful error messages in /var/log/messages, adjust the debug level by adding export LCMAPS_DEBUG_LEVEL=5 to /etc/sysconfig/condor-ce and checking /var/log/messages for errors again (see the sketch below).
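A sketch of that last step (the service command matches the SysV-style init used elsewhere in this guide; a restart is needed because the variable is read from the daemon's environment at start-up):

[root@client ~]$ echo 'export LCMAPS_DEBUG_LEVEL=5' >> /etc/sysconfig/condor-ce
[root@client ~]$ service condor-ce restart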

Jobs stay idle on the CE

Check the following subsections in order, but note that jobs may take several minutes or longer to run if the CE is busy.

Idle jobs on CE: Is the job router handling the incoming job?

Jobs on the CE will be put on hold if they do not match any job routes after 30 minutes, but you can check a few things if you suspect that the jobs are not being matched. To check whether the JobRouter sees a job before then, search the JobRouterLog for the text src=<job id>…claimed job.

Next actions

Use condor_ce_job_router_info to see why your idle job does not match any routes.

Idle jobs on CE: Verify correct operation between the CE and your local batch system

For HTCondor batch systems

HTCondor-CE submits jobs directly to an HTCondor batch system via the JobRouter, so any issues with the CE/local batch system interaction will appear in the JobRouterLog.

Next actions

  1. Check the JobRouterLog for failures.
  2. Verify that the local HTCondor is functional.
  3. Use condor_ce_config_val to verify that the JOB_ROUTER_SCHEDD2_NAME, JOB_ROUTER_SCHEDD2_POOL, and JOB_ROUTER_SCHEDD2_SPOOL configuration variables are set to the hostname of your CE, the hostname of your local HTCondor’s collector, and the location of your local HTCondor’s spool directory, respectively (see the sketch after this list).
  4. Use condor_config_val to verify that QUEUE_SUPER_USER_MAY_IMPERSONATE is set to '.*' (without the quotes).
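A quick way to perform the checks in steps 3 and 4 (the exact output will vary by site):

[user@client ~]$ condor_ce_config_val -v JOB_ROUTER_SCHEDD2_NAME JOB_ROUTER_SCHEDD2_POOL JOB_ROUTER_SCHEDD2_SPOOL
[root@client ~]$ condor_config_val QUEUE_SUPER_USER_MAY_IMPERSONATE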

For non-HTCondor batch systems

HTCondor-CE submits jobs to a non-HTCondor batch system via the Gridmanager, so any issues with the CE/local batch system interaction will appear in the GridmanagerLog. Look for gm state change… lines to figure out where the issues are occurring.

Next actions

  1. If you see failures in the GridmanagerLog during job submission: Save the submit files by adding the appropriate entry to blah.config and submit it manually to the batch system. If that succeeds, make sure that the BLAHP knows where your binaries are located by setting the <batch system>_binpath in /etc/blah.config.

  2. If you see failures in the GridmanagerLog during queries for job status: Query the resultant job with your batch system tools from the CE. If that succeeds, try the scripts that the BLAHP uses to query for status, located at /usr/libexec/blahp/<batch system>_status.sh (e.g., /usr/libexec/blahp/lsf_status.sh); they take an argument of the form batch system/YYYYMMDD/job ID (e.g., lsf/20141008/65053). Run the appropriate "status" script for your batch system (a sample invocation follows this list) and upon success, you should see the following output:

    [ BatchjobId = "894862"; JobStatus = 4; ExitCode = 0; WorkerNode = "atl-prod08" ]
    If the script fails, request help from the OSG.
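    For reference, a sample invocation using the LSF script and job ID from this item:

    [root@client ~]$ /usr/libexec/blahp/lsf_status.sh lsf/20141008/65053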

Idle jobs on CE: Make sure the underlying batch system can run jobs

HTCondor-CE communicates directly with an HTCondor batch system schedd, so if jobs are not running, examine the schedd log and diagnose the problem from there. For other batch systems, the BLAHP is used to submit jobs using your batch system's job submission binaries, whose location is specified in /etc/blah.config.

Procedure

  1. Manually create and submit a simple job (e.g., sleep; a sketch follows this list)
  2. Check for errors in the submission itself
  3. Watch the job in the batch system queue (e.g., using condor_q)
  4. If the job does not run, check for errors on the batch system

Next actions

If the underlying batch system does not run a simple manual job, it will probably not run a job coming from HTCondor-CE. Once you can run simple manual jobs on your batch system, try submitting to the HTCondor-CE again.

Idle jobs on CE: Verify ability to change permissions on key files

HTCondor-CE needs the ability to write and chown files in its spool directory and if it cannot, jobs will not run at all. Spool permission errors can appear in the SchedLog and the JobRouterLog.

Symptoms

09/17/14 14:45:42 Error: Unable to chown '/var/lib/condor-ce/spool/1/0/cluster1.proc0.subproc0/env' from 12345 to 54321

Next actions

As root, try to change ownership of the file or directory in question. If the file does not exist, a parent directory may have improper permissions.
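For example, using the path and target UID from the symptom above:

[root@client ~]$ chown 54321 /var/lib/condor-ce/spool/1/0/cluster1.proc0.subproc0/env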

Jobs stay idle on a remote host submitting to the CE

If you are submitting your job from a separate submit host to the CE, it stays idle in the queue forever, and you do not see a resultant job in the CE's queue, this means that your job cannot contact the CE for submission or it is not authorized to run there. Note that jobs may take several minutes or longer if the CE is busy.

Remote idle jobs: Can you contact the CE?

To check basic connectivity to a CE, use condor_ce_ping:

Symptoms

[user@client ~]$ condor_ping -verbose -name condorce.example.com -pool condorce.example.com:9619 WRITE 
ERROR: couldn't locate condorce.example.com!

Next actions

  1. Make sure that the HTCondor-CE daemons are running with condor_ce_status.

  2. Verify the CE is reachable from your submit host:
    ping condorce.example.com
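If ICMP is blocked at your site, checking the CE's default port directly is more telling (assuming the nc utility is available; 9619 is the HTCondor-CE port used throughout this document):

[user@client ~]$ nc -zv condorce.example.com 9619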

Remote idle jobs: Are you authorized to run jobs on the CE?

The CE will only accept jobs from users that authenticate via GUMS or a grid mapfile. You can use condor_ce_ping to check if you are authorized and what user your proxy is being mapped to.

Symptoms

[user@client ~]$ condor_ping -verbose -name condorce.example.com -pool condorce.example.com:9619 WRITE 
Remote Version:              $CondorVersion: 8.0.7 Sep 24 2014 $
Local  Version:              $CondorVersion: 8.0.7 Sep 24 2014 $
Session ID:                  condorce:3343:1412790611:0
Instruction:                 WRITE
Command:                     60021
Encryption:                  none
Integrity:                   MD5
Authenticated using:         GSI
All authentication methods:  GSI
Remote Mapping:              gsi@unmapped
Authorized:                  FALSE

Next actions

  1. Verify that an authentication method is set up on the CE
  2. Verify that your proxy is mapped to an existing system user

Jobs go on hold

Jobs will be put on hold with a HoldReason attribute that can be inspected with condor_ce_q:

[user@client ~]$ condor_ce_q -l <job ID> -attributes HoldReason
HoldReason = "CE job in status 5 put on hold by SYSTEM_PERIODIC_HOLD due to non-existent route or entry in JOB_ROUTER_ENTRIES."

Held jobs: Missing/expired user proxy

HTCondor-CE requires a valid user proxy for each job that is submitted. You can check the status of your proxy with the following:

[user@client ~]$ voms-proxy-info -all

Next actions

Ensure that the owner of the job generates their proxy with voms-proxy-init.
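For example (GLOW, the VO used in the examples throughout this document, stands in for the job owner's VO):

[user@client ~]$ voms-proxy-init -voms GLOW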

Held jobs: Invalid job universe

The HTCondor-CE only accepts jobs that have universe in their submit files set to vanilla, standard, local, or scheduler. These universes also have corresponding integer values that can be found in the HTCondor manual.

Next actions

  1. Ensure jobs submitted locally, from the CE host, are submitted with universe = vanilla
  2. Ensure jobs submitted from a remote submit point are submitted with:
    universe = grid
    grid_resource = condor condorce.example.com condorce.example.com:9619

    where condorce.example.com is replaced by the hostname of the CE.

Held jobs: Non-existent route or entry in JOB_ROUTER_ENTRIES

Jobs on the CE will be put on hold if they do not match any job routes within 30 minutes.

Next actions

Use condor_ce_job_router_info to see why your idle job does not match any routes.

Identifying the corresponding job ID on the local batch system

When troubleshooting interactions between your CE and your local batch system, you will need to associate the CE job ID with the resultant job ID on the batch system. The method for finding the resultant job ID differs between batch systems.

HTCondor batch systems

  1. To inspect the CE’s job ad, use condor_ce_q or condor_ce_history:

    # Use condor_ce_q if the job is still in the CE’s queue
    [user@client ~]$ condor_ce_q <Job ID> -af RoutedToJobId
    # Use condor_ce_history if the job has left the CE’s queue
    [user@client ~]$ condor_ce_history <Job ID> -af RoutedToJobId
  2. Parse the JobRouterLog for the CE’s job ID.

Non-HTCondor batch systems

When HTCondor-CE records the corresponding batch system job ID, it is written in the form <BATCH SYSTEM>/<DATE>/<JOB ID>:

lsf/20141206/482046

  1. To inspect the CE’s job ad, use condor_ce_q:

    [user@client ~]$ condor_ce_q <Job ID> -af GridJobId
  2. Parse the GridmanagerLog for the CE’s job ID.

Jobs removed from the local HTCondor pool are resubmitted (HTCondor batch systems only)

By design, HTCondor-CE will resubmit jobs that have been removed from the underlying HTCondor pool. Therefore, to remove misbehaving jobs, they will need to be removed at the CE level using the following steps:

  1. Identify the misbehaving job ID
  2. Find the job's corresponding CE job ID:

    [user@client ~]$ condor_q <Job ID> -af RoutedFromJobId
  3. Use condor_ce_rm to remove the CE job from the queue
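For example, with the CE job ID found in step 2:

[user@client ~]$ condor_ce_rm <Job ID>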

Missing HTCondor tools

Most of the HTCondor-CE tools are just wrappers around existing HTCondor tools that load the CE-specific config. If you are trying to use HTCondor-CE tools and you see the following error:

[user@client ~]$ condor_ce_job_router_info
/usr/bin/condor_ce_job_router_info: line 6: exec: condor_job_router_info: not found

This means that condor_job_router_info (note that this is not the CE version) is not in your PATH.

Next Actions

  1. Either the condor RPM is missing or there are some other issues with it (try rpm --verify condor).
  2. You have installed HTCondor in a non-standard location that is not in your PATH.
  3. The condor_job_router_info itself wasn't available until OSG's release of Condor-8.0.7-5 (available in osg-release) or Condor-8.2.3-1.1 (available in osg-upcoming).

Known Issues

SUBMIT_EXPRS are not applied to jobs on the local HTCondor

If you are adding attributes to jobs submitted to your HTCondor pool with SUBMIT_EXPRS, these will not be applied to jobs that are entering your pool from the HTCondor-CE. To get around this, you will want to add the attributes to your job routes. If the CE is the only entry point for jobs into your pool, you can get rid of SUBMIT_EXPRS on your backend. Otherwise, you will have to maintain your list of attributes both in your list of routes and in your SUBMIT_EXPRS.

Getting Help

If you are still experiencing issues after using this document, please let us know!

  1. Gather basic HTCondor-CE and related information (versions, relevant configuration, problem description, etc.)

  2. Gather system information:

    osg-system-profiler
  3. Start a support request using a web interface or by email to goc@opensciencegrid.org

    • Describe issue and expected or desired behavior
    • Include basic HTCondor-CE and related information
    • Attach the osg-system-profiler output

