
Minutes WSGram Testing Dec 10

Introduction

  • Attendees: Jeff, Rob, Suchandra, Stu, Charles, Terrence
  • Apologies: none
  • Coordinates: Monday, 4:00pm Central; 510-665-5437, #1212
  • Previous meetings, MeetingMinutes

WS Gram scalability tests (Suchandra)

  • Using local invocations appears to have made things worse. Will review and rerun.

WS Gram scalability tests (Jeff)

  • Last week I observed a consistent failure rate (10%-30%) from a single source, which was determined to be a gridftp authentication timeout on the data sink. It affected Type III (data-producing) jobs run both with Condor-g and with the standalone client. The error was explicit in the standalone client:
    globusrun-ws: Job failed: Staging error for RSL element fileStageOut.
    Error authenticating user at source/dest hostServer refused performing the request. 
    Custom message: Server refused GSSAPI authentication. (error code 1) 
    [Nested exception message:  Custom message: Unexpected reply: 421 Idle Timeout: closing control connection.] 
    [Caused by: Server refused performing the request. 
    Custom message: Server refused GSSAPI authentication. (error code 1) 
    [Nested exception message:  Custom message: Unexpected reply: 421 Idle Timeout: closing control connection.]] 
    I was asked to test:
    • increasing the control_preauth_timeout value from its default of 30 seconds; the globus gridftp team now suggests 200 seconds as the default
    • allowing retries by specifying a number directly in the RSL
    • allowing retries by specifying a default number in the RFT configuration db
  • Increasing the control_preauth_timeout value to 300 seconds worked. I was able to push jobs through at a high rate without ever seeing the timeout error. I eventually hit different failure modes when I tried pushing 400 jobs at a time.
    • Scanned the gridftp-auth.log file on the destination to measure the time required for authentication. The measurement is made between the log entries:
      [23490] Mon Dec 10 09:11:22 2007 :: New connection from: XXXX
      [23490] Mon Dec 10 09:11:22 2007 :: DN /DC=org/DC=doegrids/OU=People/CN= XXXX  successfully authorized.
      
      I evaluated the results as they depend on how many separate jobs I threw at the CE: 20 jobs, 100 jobs, and 400 jobs. It's unclear how much interaction goes on between the client and server between these log entries, so the meaning of the absolute time scale is also unclear. However, as seen in the plot below, the time difference was only weakly dependent on the load on the CE: mean values of 59, 57, and 87 seconds for the 3 cases. In any case, the value of 300 seconds (and probably 200 seconds) is large enough for this configuration.
      Figure: Time between server connect and authentication (gftp_auth_dt.gif)
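The measurement between the two log entries could be scripted along the following lines. This is only a sketch, not the script used for the plot: it assumes the bracketed number is the server process ID and that each process handles a single connection, and the timestamps in the sample are made up for illustration.

```python
import re
from datetime import datetime
from statistics import mean

# Matches "[pid] Mon Dec 10 09:11:22 2007 :: message"
LINE_RE = re.compile(
    r"^\s*\[(\d+)\]\s+(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) :: (.*)$"
)

def auth_delays(lines):
    """Seconds between 'New connection' and 'successfully authorized', per PID."""
    started = {}  # pid -> time of the most recent "New connection" entry
    delays = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip entries that do not follow the expected layout
        pid, stamp, msg = m.groups()
        when = datetime.strptime(" ".join(stamp.split()), "%a %b %d %H:%M:%S %Y")
        if msg.startswith("New connection from"):
            started[pid] = when
        elif msg.rstrip().endswith("successfully authorized.") and pid in started:
            delays.append((when - started.pop(pid)).total_seconds())
    return delays

# Illustrative entries only; the times are invented for this example.
sample = [
    "[23490] Mon Dec 10 09:11:22 2007 :: New connection from: XXXX",
    "[23491] Mon Dec 10 09:11:25 2007 :: New connection from: XXXX",
    "[23490] Mon Dec 10 09:12:02 2007 :: DN /DC=org/DC=doegrids/OU=People/CN= XXXX  successfully authorized.",
    "[23491] Mon Dec 10 09:12:25 2007 :: DN /DC=org/DC=doegrids/OU=People/CN= XXXX  successfully authorized.",
]

delays = auth_delays(sample)
print(delays, mean(delays))
```

Running the same function over the log for each load case (20, 100, 400 jobs) would give the per-case means directly.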
  • I tried allowing for retry from within the RSL via the syntax:
     
     <job> ...
         <fileStageOut>
             <maxAttempts>5</maxAttempts>
             <transfer>
                 ....
             </transfer>
         </fileStageOut>
     ... </job>
    
    This did NOT work. I continued to have authentication errors.
  • I tried changing the default value in the CE's RFT configuration by altering the rft_database.request table, making the "maxattempts" field default to 5. This also did NOT work. I will submit a globus trouble ticket.
    • I did not double-check the row entries in the database for either test to confirm that the number actually changed.
  • Once I was able to run several hundred jobs at a time at the server, I reached a different set of errors. I'll provide details directly.
    • The total error rate was about 15% and the server remained functional afterwards. I found 3 sets of errors, which all seem to have the same source.
      • Most were
         Delegating user credentials...Failed.
        globusrun-ws: globus_i_delegate.c::1142:
        Error trying to delegate
        globus_i_delegate.c::673:
        Error querying delegation factories
        ManagedJobFactoryService_client.c::2209:
        Failed sending request ManagedJobFactoryPortType_GetMultipleResourceProperties.
        globus_xio_system_select.c:globus_l_xio_system_cancel_cb:664:
        Operation was canceled
        globus_xio_system_select.c:globus_l_xio_system_cancel_cb:664:
        Operation timed out
        
      • 2nd most were
        Delegating user credentials...Failed.
        globusrun-ws: globus_i_delegate.c::1142:
        Error trying to delegate
        globus_delegation_client_util.c:globus_l_delegation_client_util_get_cert_cb:488:
        DelegationFactoryPortType_GetResourceProperty callback failed.
        DelegationFactoryService_client.c::720:
        Failed sending request DelegationFactoryPortType_GetResourceProperty.
        globus_xio_system_select.c:globus_l_xio_system_cancel_cb:664:
        Operation was canceled
        globus_xio_system_select.c:globus_l_xio_system_cancel_cb:664:
        Operation timed out
        
      • the rest were
         Delegating user credentials...Done.
        Submitting job...Failed.
        Cleaning up any delegated credentials...Done.
        globusrun-ws: globus_i_submit.c::731:
        Error submitting job
        ManagedJobFactoryService_client.c::5202:
        Failed sending request ManagedJobFactoryPortType_createManagedJob.
        globus_xio_system_select.c:globus_l_xio_system_cancel_cb:664:
        Operation was canceled
        globus_xio_system_select.c:globus_l_xio_system_cancel_cb:664:
        Operation timed out
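The three error sets above can be told apart by the request name in the "Failed sending request" line. A minimal sketch for tallying captured globusrun-ws outputs follows; the category labels and the function names are hypothetical, and only the marker strings come from the output quoted above.

```python
from collections import Counter

# (label, marker) pairs; the labels are made up for this sketch, the markers
# are the request names that appear in the three error outputs.
SIGNATURES = [
    ("delegation factory query", "ManagedJobFactoryPortType_GetMultipleResourceProperties"),
    ("delegation cert fetch", "DelegationFactoryPortType_GetResourceProperty"),
    ("job submission", "ManagedJobFactoryPortType_createManagedJob"),
]

def classify(output):
    """Return the label of the first marker found in one job's client output."""
    for label, marker in SIGNATURES:
        if marker in output:
            return label
    return "other"

def tally(outputs):
    """Count how many outputs fall into each category."""
    return Counter(classify(text) for text in outputs)

# Shortened example outputs, one string per failed job.
outputs = [
    "Failed sending request ManagedJobFactoryPortType_GetMultipleResourceProperties.",
    "Failed sending request DelegationFactoryPortType_GetResourceProperty.",
    "Failed sending request ManagedJobFactoryPortType_createManagedJob.",
    "Failed sending request ManagedJobFactoryPortType_GetMultipleResourceProperties.",
]
print(tally(outputs))
```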
        

WS Gram talk at Site Admin meeting (Charles & Jeff)

Tests at UCSD (Terrence)

  • went straight to condor-g (no desire to test globusrun-ws). had some problems with ws-gram/nfslite support, discovered the correct workarounds within Condor-g
  • seems ready to begin scaling tests
  • will try to have local team modify CMS software to add gt4 submissions
  • will review container optimizations pointed to by OSG and Globus docs
    • see : GlobusWebServicesInstall
    • Stu will send out a link for enabling local invocation -> it's included in GT4 performance recommendations which is linked to in the OSG doc above.
    • Jeff will follow up on gridftp-auth timeout if not added to gt4 docs directly: as listed in gridftp config options, simply add control_preauth_timeout 200 into the $GLOBUS_LOCATION/etc/gridftp.conf file.
  • has decided that the current instability of the container on his test site is likely a hardware issue and will move on.
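For sites that want to script the control_preauth_timeout change mentioned above, a minimal sketch is given here. It assumes gridftp.conf holds one whitespace-separated "name value" pair per line; the helper name set_option is hypothetical, and the sample file content is illustrative.

```python
def set_option(conf_text, key, value):
    """Return conf_text with `key value` set: replace an existing line or append one."""
    out = []
    found = False
    for line in conf_text.splitlines():
        parts = line.split()
        if parts and parts[0] == key:
            out.append(f"{key} {value}")  # overwrite the existing setting
            found = True
        else:
            out.append(line)  # keep comments and other options untouched
    if not found:
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"

# Illustrative file content, not taken from a real installation.
before = "port 2811\ncontrol_preauth_timeout 30\n"
print(set_option(before, "control_preauth_timeout", 200))
```

In practice the text would be read from and written back to $GLOBUS_LOCATION/etc/gridftp.conf, after keeping a backup copy of the file.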

AOB

  • we'll likely have a meeting next week but will decide during the site-admin meeting and/or in email


-- JeffPorter - 07 Dec 2007

Topic attachments
  • Globusrun_Comparison_between_0.8_and_0.6 (576.5 K, 10 Dec 2007 - 21:13, SuchandraThapa): Results from comparing globusrun-ws at 1Hz with different settings
  • gftp_auth_dt.gif (13.8 K, 10 Dec 2007 - 19:01, JeffPorter): time between gridftp server connect and authentication
Topic revision: r5 - 10 Dec 2007 - 23:31:48 - JeffPorter
 