-- JoseCaballero - Started in October 2008

Effort Reports from Jose Caballero


1st quarter

January 2009

  • First week on vacation.

  • Panda Data Movement:
    • Script pdm_file_register.py now checks whether the LFC Python API can be imported, and avoids using it to register checksum and file size in the LFC catalog when the library is not available. A new error code raises an exception when users try to register checksum and file size but the API cannot be imported.
    • Allowing several pilot DNs to be specified at a time when delegating end-user proxies to a MyProxy server from pathena.
    • Studying how to configure the myproxy-server to allow retrieval based on FQAN, instead of using option -Z with myproxy-init.
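
The guarded import described for pdm_file_register.py could look roughly like the sketch below. It is only an illustration, not the actual script: the helper names, the exception class and the returned dictionary are hypothetical, and the module name passed to the loader (for example "lfc") is an assumption about how the bindings are installed.

```python
import importlib


class MissingLFCError(Exception):
    """Hypothetical error: LFC metadata was requested but the API is missing."""


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def plan_registration(lfn, checksum=None, filesize=None, api=None):
    """Decide what would be registered for one logical file name.

    `api` is whatever load_optional() returned; None means the bindings
    are not available on this host.
    """
    wants_metadata = checksum is not None or filesize is not None
    if wants_metadata and api is None:
        raise MissingLFCError("checksum/filesize requested but the LFC API cannot be imported")
    entry = {"lfn": lfn}
    if wants_metadata:
        entry.update(checksum=checksum, filesize=filesize)
    return entry
```

A plain registration still works without the library; only the request for checksum or file size fails loudly.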

  • gLExec and MyProxy integration into Panda:
    • Studying how to split runJob.py into three scripts (one for stage-in, one to run athena, one for stage-out) so each part can be run under gLExec or not separately, as a site policy.

  • Outreach Activities:
    • Preparation for the workshop in Chile, as a member of the outreach and education team: learning the lessons, testing the exercises, testing the software and tools installed on the workshop machine.
    • Trip to Chile.
    • Writing up my feedback and impressions after the trip to Chile for Ruth and the Outreach team. Translating into English the comments from the students (written in Spanish).
    • Starting the translation into Spanish of the documentation for training courses.

February 2009

  • Panda Data Movement:
    • Editing the Panda Data Movement wiki page.
    • pdmUtils.py created, containing common classes and functions for the pdm scripts. The content of errors.py has also been moved into pdmUtils.py. Twiki updated accordingly.
    • Refactoring the code to make it more readable, easier to maintain, more scalable, and more consistent with the operations performed.
    • Started the development of a script pdm_get.py to recover information registered in the LFC catalog.

  • gLExec and MyProxy integration into Panda:
    • Preparing poster for CHEP'09.
    • Studying how to split runJob into three processes (stagein, payload, stageout).
    • Studying what people in EGEE are doing with gLExec and its integration with SCAS.
    • Phone meetings (19 Feb, 26 Feb)
    • Testing gLExec/SCAS installation at Gridka.

  • Outreach activities:
    • Translation from English into Spanish of the material for training courses (work in progress).

  • Cloud Computing:
    • Reading some documentation.

March 2009

  • First week at the OSG All Hands Meeting at LIGO: attending talks and arranging private meetings (Alina, Igor, Mine, and Alain Roy)

  • gLExec and MyProxy integration into Panda:
    • Finishing the testing of the gLExec/SCAS installation at Gridka, to check whether the problems with expired voms attributes have already been fixed.
    • Implementing comments from authors in the poster for CHEP.

  • Outreach Activities:
    • Phone meeting with Alina Bejan and people from Colombia to start the organization of a workshop in Colombia in August/September.
    • Reading the document the Outreach Team has about how to organize a workshop. It looks like Ruth wants me to be in charge of the whole organization of an event in Brazil and a follow-up event in Chile, with Alina helping me, instead of the other way around.

  • ATLAS Software and Computing Workshop at CERN.
    • Attending presentations.
    • Meeting with Charles Waldman on Tuesday to discuss his HTTPS LFC interface. We found a few weak points in his design and discussed several improvements. He agreed that I include his interface in my generic catalog interface, and I offered him my web-based read-only interface (the one I made with django). He liked it and will probably include my code in his project.
    • Meeting with Paul Nilsson on Tuesday, who taught me about the structure of the pilot code and gave me some tasks to do.
    • Meeting with SCAS people on Tuesday (Massimo, Antonio Retico, and Gianni Pucciani), requested by me and scheduled by Massimo. Attending on Friday, this time in person, the regular SCAS/gLExec meeting.
    • Reported on my tests of the gLExec installation at GridKa during the distributed analysis session.

  • CHEP at Prague

  • PanDA/pilot:
    • Started cleaning up the pilot wrapper code (suggested by Paul Nilsson): deleting orphan variables and orphan functions, deleting commented-out lines, removing global variables when not needed, splitting long functions into several short functions, sorting lists, etc.

2nd quarter

April 2009

  • gLExec and MyProxy integration into Panda:
    • Writing the proceedings for CHEP.
    • Phone meetings with EGEE people (about SCAS/gLExec installation): still problems with the installation at IN2P3; UK sites are going to collaborate.
    • Changes in the myproxyUtils.py interface:
      • bug fixes
      • allowing DNs with parentheses, so that the MyProxy server does not read them as regexp syntax
      • allowing a list of DNs instead of a single DN as authorized retrievers.
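
The two interface changes above (parentheses in DNs, and a list of DNs as retrievers) can be illustrated with a small sketch. This is not the myproxyUtils.py code; the function names and the example DN are hypothetical, and Python's `re` module is used here only as a stand-in for whatever regexp matching the MyProxy server applies.

```python
import re


def escape_dn(dn):
    """Backslash-escape parentheses so a regexp matcher treats them literally."""
    return dn.replace("(", r"\(").replace(")", r"\)")


def retriever_patterns(dns):
    """Accept a single DN or a list of DNs; return one escaped pattern per DN."""
    if isinstance(dns, str):
        dns = [dns]
    return [escape_dn(dn) for dn in dns]


if __name__ == "__main__":
    dn = "/DC=org/DC=example/CN=Some Pilot (robot)"  # made-up DN
    print(re.fullmatch(dn, dn))             # None: "(robot)" is read as a group
    print(re.fullmatch(escape_dn(dn), dn))  # a match object
```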

  • PanDA/pilot:
    • More cleanup of the pilot wrapper code [thinPilotWrapper.sh, ProdPilotThin.py, trans-atlasprod-thin.sh]
    • Implementing the code to pick one package at random (99%-1%) from libcode [trans-atlasprod-thin.sh]
    • Merge of ProdPilotThin.py and trans-atlasprod-thin.sh into a single script
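
The 99%-1% random choice mentioned above amounts to a weighted coin flip. The sketch below shows the idea in Python, although the original lives in a shell script; the function name, the parameter names and the reading of the split (a default package 99% of the time, an alternative 1% of the time) are assumptions made for illustration.

```python
import random


def pick_package(default, alternative, alternative_share=0.01, rng=random.random):
    """Return `alternative` with probability `alternative_share`, else `default`.

    `rng` must return a float in [0, 1); it is injectable so the choice
    can be tested deterministically.
    """
    return alternative if rng() < alternative_share else default
```

Injecting the random source keeps the function trivially testable, for example `pick_package("a", "b", rng=lambda: 0.5)` always returns `"a"`.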

  • Panda/USATLAS:
    • Attending some sessions of the US workshop at BNL on Apr 14th and 15th

  • Panda/monitoring:
    • Phone meeting to discuss the ideas of Alexei's students.

  • Outreach Activities:
    • Communication with Julio Ibarra and Heidi Alvarez, from the University of Chicago, to find out how to get funding for the workshop in Colombia. They propose to use the program NSF/PASI, but the deadline is 15th January.
    • Trying to clarify my responsibilities, with no success.

May 2009

  • gLExec and MyProxy integration into Panda:
    • Finishing the refactoring of the code.
    • Enhancing the documentation.
    • Periodic retrieval implemented as a thread. That solution is the safest one: if the parent process dies, the child dies too, which avoids leaving a process running forever on the WN retrieving proxies that could be stolen by bad guys.
    • Phone meetings with EGEE people about SCAS/gLExec installation.
    • Proceedings submitted to Journal of Physics.
    • Testing the installation at Lancaster. Complicated, with a lot of problems. Many emails between Peter Love, Oscar Koeroo, Maarten and myself. There are also problems delegating proxies from BNL when the environment variable GT_PROXY_MODE is set to "old".
    • Contacting Jim Basney and Alain Roy to find out when the new version of the MyProxy client will be officially released (new versions fix problems during delegation when GT_PROXY_MODE is set to old).
    • Working with Paul Nilsson to split runJob.py into three different modules (for stage-in, run the payload, stage-out)
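
The "periodic retrieval as a thread" idea from this list can be sketched as a daemon thread that runs an action at a fixed interval. This is not the code from the pilot or from myproxyUtils.py; the class name and its arguments are hypothetical, and the action (the proxy retrieval) is supplied by the caller.

```python
import threading


class PeriodicTask(threading.Thread):
    """Call `action` every `interval` seconds until stop() is called."""

    def __init__(self, action, interval):
        # daemon=True: the thread never outlives the main process.
        super().__init__(daemon=True)
        self._action = action
        self._interval = interval
        self._halt_event = threading.Event()

    def run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._halt_event.wait(self._interval):
            self._action()

    def stop(self):
        self._halt_event.set()
```

Because the thread lives inside the pilot process, it disappears together with it, which is the property the entry above relies on.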

  • Outreach Activities:
    • Trying to find out how to fund lodging for students in a possible workshop in Colombia.
    • Contact with Chilean people to see if there is interest in a follow-up workshop. They say "yes".
    • Discussing with Jim Shank the option of ATLAS partially supporting a workshop in Colombia.

  • PanDA/monitoring:
    • Phone meeting to discuss the work of Alexei's students.
    • Re-creating the structure in the /panda-monitor/ directory within the svn repository to host the new code developed by Dmitry.
    • Directory branches/ created in svn.

  • PanDA/pilot:
    • Removed warning messages when the commands `which python32` and `which voms-proxy-info` fail. Users have complained a lot about these messages.
    • Removed old, unneeded files from svn.

  • PanDA Data Movement:
    • Subdirectory for pdm scripts created in the repository.
    • A new project (including the whole structure: current/ and tags/ directories, (empty) config files for guidance, etc.) created in the svn repository.
    • Testing the data-host project.
    • Directory branches/ created in svn

June 2009

  • PanDA:
    • cleaning up the svn repository:
      • old, unused project tags/ and branches/ removed
      • non-standard config files removed from some projects
      • new project panda-pilot/ created, and whole content from pilot3/ migrated.
      • old project osgddm/ removed; we are already using panda-osgddm/.
      • content of panda-osgddm/ split into two subprojects: osgddm-client/ and osgddm-server/. Also osgddm-common/ created, just in case we need it in the future. Two sets of setup files (for packaging and distribution) created, so far empty (they have to be filled).
    • Some improvements in the autopilot wiki page.

  • PanDA Data Movement:
    • Testing the data-host project. Proposing improvements.
    • I have repeated the installation, but using sqlite3 instead of mysql as backend database. It works fine.

  • Outreach activities:
    • A lot of emails with the Colombian people.
    • I have contacted Marta Losada (coordinator of ATLAS-Colombia), and put her in contact with Jim Shank.
    • We have started to discuss the calendar and schedule for the workshop in Colombia.

  • gLExec and MyProxy integration into PanDA:
    • More tests at Lancaster and GridKa, trying to understand why they fail (again problems with expired VOMS attributes). Discussion with Oscar Koeroo and Maarten Litmaath about this.
    • Trying to find ways to work around this problem in the meantime (regenerating the voms attributes on the WN in a safe way). Testing Tadashi's proposal, with no success.
    • Setting GT_PROXY_MODE='old' on the WN (if needed) before the retrieval, and unsetting it after the retrieval.
    • Testing the scripts that the Nikhef guys have written to pack and unpack the environment. They don't work since they need a non-standard Perl library. Anyway, I don't like them due to their lack of flexibility.
    • Testing the new pilot code, and trying to split it into two processes (stagein+payload and stageout).
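
Setting GT_PROXY_MODE only around the retrieval, as described above, maps naturally onto a context manager. The following is a minimal sketch with hypothetical function names; the retrieval itself is left to a caller-supplied callable.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set os.environ[name] for the duration of the block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


def retrieve_with_old_proxy_mode(retrieve):
    """Run the caller-supplied `retrieve` callable with GT_PROXY_MODE=old."""
    with temporary_env("GT_PROXY_MODE", "old"):
        return retrieve()
```

The `finally` clause restores the previous state even if the retrieval raises, so the variable cannot leak into later steps of the job.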

3rd quarter

July 2009

  • PanDA:
    • Testing recipe http://code.activestate.com/recipes/576704/ to reduce the size of the pilotcode tarball.
    • Fixed a bug in the pilot wrapper that printed a harmless error message when g++32 is not on the WN.
    • new project in svn created, with the whole structure and (empty to be filled asap) packaging and distribution files:
        |---- tags/
        |---- branches/
        |---- current/
                |---- commonutils/
                |---- INSTALL.txt
                |---- MANIFEST.in
                |---- README.txt
                |---- setup.cfg
                |---- setup.py
    • Some modules added to common-utils:
      • Patterns.py: with a class to handle threads (so far just for periodic tasks) and a class to perform command line operations.
      • pyminifier.py: copied from recipe http://code.activestate.com/recipes/576704/
      • argparse.py: customized version improved to raise my own exceptions and to split options between several namespaces

  • gLExec and MyProxy integration into PanDA:
    • Regular phone meetings with EGEE people.
    • More tests at Lancaster. Finally they work. Problems with expired VOMS attributes fixed.
    • Still trying to split runJob.py into two modules. Problem: forking implementation in the pilot framework. Replacing execvpe() by os.system() + sys.exit() does not work.
    • Script multipleTestJobs.py created to perform a loop over /home/sm/nilsson/subversion/offline/Production/panda/testJob.py in order to submit more than one single job to a given queue/cloud.

  • ATLAS:
    • Joining ADC meetings.

  • CHEP09:
    • Reviewing the article I am refereeing.

August 2009

  • gLExec and MyProxy integration into PanDA:
    • More tests. Problems with the pilot understood: stagein and stageout no longer have the same pid as the parent process, which is checked at some point in pilot.py.
    • Contacted the Lancaster system admins to request that Nurcan's proxy be allowed to call gLExec.

  • PanDA:
    • Improvements in the content and documentation of the generic tools I am putting in /svn/common-utils/
    • class Logging implemented in module patterns.py to handle message logging. It is an extension of function tolog(). Using a timestamp is optional, and its format is configurable. Whether messages are written to stdout or to a file is an input option.
    • The name of the module which is invoking tolog() is added to the messages printed out.
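
A logger with the features listed above (optional timestamp, configurable format, selectable output, name of the calling module) might be sketched as follows. This is an illustration only, not the class from patterns.py: the constructor arguments and the message layout are assumptions, and only the names Logging and tolog() come from the entries above.

```python
import inspect
import sys
import time


class Logging:
    """Write messages to a stream, optionally prefixed with a timestamp."""

    def __init__(self, stream=None, use_timestamp=True, time_format="%Y-%m-%d %H:%M:%S"):
        # Any object with write() works: sys.stdout, an open file, etc.
        self.stream = stream if stream is not None else sys.stdout
        self.use_timestamp = use_timestamp
        self.time_format = time_format

    def tolog(self, message):
        # Frame 1 is whoever called tolog(); report the module it lives in.
        caller = inspect.stack()[1].frame.f_globals.get("__name__", "?")
        parts = []
        if self.use_timestamp:
            parts.append(time.strftime(self.time_format))
        parts.append("[%s]" % caller)
        parts.append(message)
        self.stream.write(" ".join(parts) + "\n")
```

Passing an open file as `stream` sends the messages to that file instead of stdout.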

  • Outreach activities:
    • Contacted people in Colombia to find out how the workshop plans are going. It seems it will take place in October.

September 2009

  • First week at CERN: ATLAS Software & Computing week.

  • PanDA:
    • Tested the tool to reduce the size of the pilot tarball. It does not work. Creating my own, much simpler but working.
    • Meeting with Xin Zhao, Suchandra and Rob Gardner. The goal is to adapt PanDA (new site/queue, transformation scripts and maybe a specific view in the monitoring) to submit test jobs to validate ITBs. Once I have gained expertise doing that, I can apply that knowledge to other groups such as the DUSEL guys, or even to outreach.

  • gLExec and MyProxy integration into PanDA:
    • The end-user proxy cannot be read by the payload after the identity switch, because it has permissions 600. If I change the permissions, gLExec complains. I have to copy the proxy and make the second copy more readable. But then I need a third copy, because a proxy credential with permissions other than 600 is not valid (e.g. LFC commands complain). Investigating how to bypass this problem using a specific env var provided by gLExec itself: GLEXEC_TARGET_PROXY.
    • A bug in myproxyInterface parsing input options fixed.
    • Running payload under gLExec and stage-out under pilot identity works.
    • Trying to run payload and stage-out both under gLExec. Problems with permissions to create new directories in pnfs/dCache and in LFC. Temporarily I am changing the permissions of parent directories by hand, just trying to run the job until the very end.
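
The proxy-copy juggling described in this list comes down to copying a file and then forcing specific permission bits on the copy. A minimal sketch, with hypothetical function names and no claim to match the pilot code:

```python
import os
import stat


def copy_with_mode(src, dst, mode):
    """Copy `src` to `dst` and set the permission bits of the copy to `mode`."""
    # A newly created copy starts owner-only, so it is not briefly world-readable.
    fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
        out.write(inp.read())
    os.chmod(dst, mode)
    return dst


def has_mode(path, mode):
    """True if the permission bits of `path` are exactly `mode`."""
    return stat.S_IMODE(os.stat(path).st_mode) == mode
```

For example, `copy_with_mode(proxy, readable_copy, 0o644)` would produce the readable copy and `copy_with_mode(proxy, strict_copy, 0o600)` the one that tools insisting on 600 accept; the path variables here are placeholders.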

  • Outreach activities:
    • Alina has left UoC. Now I am the leader for the workshop at Bogota. Looking for an expert in sys admin. Meeting at Chicago with Rob Gardner, Marco Mambelli and Suchandra to organize the workshop, from Wed Sep 30 to Fri Oct 2.

4th quarter

October 2009

  • Two days at Chicago in a private meeting with Rob Gardner, Suchandra, Marco Mambelli and Charles Waldman. Topics discussed:
    • workshop at Bogota
    • panda framework for ITBs testing
    • pilot/pcache integration

  • Outreach activities:
    • A lot of emails with people at Chicago and Colombia to organize the workshop in Bogota. Experts we contacted for remote talks are declining.
    • Trying to collect exercises and slides from different sources to give a module in Bogota focused on Information System + Brokering.
    • Skype meeting with Rob and others in Chicago.
    • Testing with John Hover the ITB at BNL to submit jobs belonging to VO OSGEDU.
    • Preparing slides for my talks: reworking the slides I used in Chile.

  • Autopilot:
    • Collecting documentation about how PanDA works (queues, sites, tags, pilot types, ...) and how the script that handles the specifications (pilotController.py) works.
    • Removing everything related to MyProxy and gLExec from the autopilot scripts, since myproxyUtils.py is gone from that directory.
    • Fixing the name of the server in many files; they still had the BNL server name.
    • Supporting the CHARMM guys' activities; they are now trying it again.

  • gLExec:
    • Testing the new installation of gLExec at ITB at BNL.

  • OSG ITB Integration:
    • I am in contact with Suchandra (Chicago) and Iwona Sakrejda (LBNL) to create a panda framework for ITB testing.
    • Investigating the best way to do it (create new panda queues? reusing the already existing queues? creating a tag? creating a new panda pilot for this purpose?)

  • OGF27:
    • Attending the Open Grid Forum at Banff (Canada)

November 2009

  • Outreach:
    • Some post-workshop activities through the mailing list created for the event (sending them useful information, answering questions, follow-up of activities).

  • Autopilot:
    • Problem with RC = 60 fixed (pilot was not sending the proxy to the server)
    • Supporting the CHARMM guys' activities. They need to have the osg role=pilot to be accepted by the server.
    • Learning how to pass input options to the jobs at the time of the job submission instead of the pilot submission.
    • Updating the documentation: https://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/AutoPilot#Scheduling_and_monitoring_pilots
    • Testing cross-VO interoperability in PanDA: pilots belonging to VO "A" can pick up jobs belonging to VO "B". Is that acceptable?
    • Changes in the wrapper (atlasProdPilotWrapper) to source OSG_APP/atlas_app/atlas_rel/local/setup.sh when it exists.

  • Pilot:
    • LocalSiteMover.py modified to pass the GUID to lsm-put and lsm-get when required.

  • Panda job submission:
    • Creation of a new script for end-user job submission: sendJob. It is just an improved version of toPanda (raising exceptions when the input from the command line is not correct).
    • Creation of a project on SVN for this tool.

  • OSG ITB Integration:
    • Started with the configuration (processsingType="ITB_Integration") and tests for ITB Integration. Using fake jobs provided by Suchandra.
    • Attending the OSG Integration meetings.
    • Helping Suchandra to fix problems with the UoC site. Its jobmanager is not sending back the output and error files.
    • Running a script called site_verify.pl, provided by Xin, at BNL and UoC.
    • Suchandra is testing the new code for job submission.
    • Contact with LBNL guys (Iwona and Doug) to get all the needed information to create a panda site for them.

  • CERN-VM:
    • In contact with Arturo Sanchez, and others, to help with PanDA virtualization.
    • Question: is BOINC really a good platform to run ATLAS software?

  • DUSEL:
    • Meeting with the DUSEL guys to discuss their needs, and the suitability of Panda for their activities.
    • Helping them with the first steps: grid certificates request, joining the OSG VO, creation of their own VO.

December 2009

  • First week at CERN, attending the ATLAS Computing and Software Workshop.
  • Informal ATLAS Distributed Computing Workshop at BNL

  • OSG ITB Integration:
    • File glite/vomses created.
    • Certificate files needed to create proxies for OSG VO copied and cron created to renew the proxy.
    • Cron to submit pilots to UC set up.
    • Testing the gatekeeper/jobmanager at LBNL, Caltech and FNAL: for a while they did not work, or they did not accept role /osg/role=pilot, which is what we have decided to use for the pilots.
    • Attending ITB meetings.
    • Setting up a panda site for LBNL.
    • Testing panda sites for LBNL, BNL_ITB and Caltech. They are not working for some reason (cloud=none?, ddm=none?, status=none?...)

  • gLExec:
    • Attending gLExec/Argus meetings.
    • Testing the script provided by the developers to create a directory owned by the end user. It does not fix my problems (I still have to give some files and directories world read/write permissions).

  • Autopilot:
    • Problems with SimplePilot fixed: the new modules on the server reject getJob requests when some unexpected variable is passed. I have removed all unneeded variables from the requests.
    • Trivial version of wrapper created (and added in the pilots block in pilotController), with minimum level of verbosity: trivialWrapper.sh and trivialPilot.py

  • Panda Job Submission:
    • Starting to write the documentation for the new submission tool sendJob.py

  • DUSEL:
    • Live demo on how to submit jobs and pilots, use the monitor, etc.
    • Answering a lot of questions: about grid certificates, submission with condor-g, panda architecture...
    • Helping them to start working with grid certificates, roles, etc, and sending jobs to panda.
    • Helping Mats Rynge (one of the Engage VO admins) to set up a role=pilot in the VO Engage.

    • Panda queue created for CERNVM.


1st quarter

January 2010

  • First week on vacation.

  • Panda/Pilot/Autopilot:
    • As requested by John Hover (to get rid of gridui04), moving files and scripts belonging to the sm account from gridui04 to pandadev01.
    • Investigating why sometimes the output file is empty. Seems to be a problem with Chicago sites.

  • gLExec:
    • Development resumed. All the refactoring done to split runJob.py into two modules is no longer valid. Paul proposes a new model, running a single module twice.
    • Allowing the retrieval of long-lifetime proxies, instead of 12 h proxies. Removing the periodic retrieval.
    • Installation at BNL seems not to work properly. My tests are failing. Working with John and Xin to understand the problem.
    • Modifying the latest version of the pilot code to use the process group ID instead of the process ID.
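
The switch from process ID to process group ID works because a forked child gets a new pid but inherits the process group of its parent. The sketch below demonstrates that on a POSIX system; the function names are hypothetical and this is not the check used in pilot.py.

```python
import os


def in_my_process_group(pid):
    """True if process `pid` belongs to the caller's process group."""
    return os.getpgid(pid) == os.getpgrp()


def child_ids():
    """Fork a child and return the (pid, process group id) it reports."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report our ids through the pipe and exit immediately.
        try:
            os.write(write_end, ("%d %d" % (os.getpid(), os.getpgrp())).encode())
        finally:
            os._exit(0)
    os.close(write_end)
    data = os.read(read_end, 64)
    os.close(read_end)
    os.waitpid(pid, 0)
    child_pid, child_pgid = (int(x) for x in data.split())
    return child_pid, child_pgid
```

A check based on `os.getpgrp()` therefore still recognises stage-in and stage-out children, whereas a comparison of pids does not.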

  • Outreach:
    • Preparing the workshop at Bucaramanga (Colombia).
    • Meeting with Carlos Gamboa and John Hover to discuss the role of Carlos during the next workshop.
    • Phone meeting with Rob Gardner, Carlos Gamboa and people from Colombia to discuss logistics.

  • OSG ITB Integration:
    • Working with Alden and Maxim to fix problems with some of the new panda sites.
    • Attending phone meetings.
    • pilotCron_OSG fixed to avoid duplicated code.
    • trivialWrapper tested. Cron on gridui07 changed to use it. NOTE: the output still contains ATLAS-specific content which I must eliminate.
    • Removing manageData from trivialPilot.py to avoid some ATLAS specific messages in the out/log files.
    • Fixing the name of the script in trivialWrapper.sh to avoid misleading messages.

  • Panda for non-ATLAS users:
    • Exchange of emails with Ian Stokes-Rees and Peter Doherty about Panda for SBGRID/LifeSciences.

February 2010

  • Panda/Pilot/Autopilot:
    • Finished the implementation of a check in pUtil to print out the module invoking function tolog()

  • OSG ITB Integration:
    • Discussion with Steven Timm from FNAL about the best way to pass the payload proxy to invoke gLExec. After reading our CHEP paper he agrees with adopting the same mechanism as ATLAS.
    • Testing gridFTP to stage out the output files. Studying how to pass the destinationURL to the pilot; so far the only way seems to be embedding it in jobParameters, which I do not like.
    • trivialWrapper.sh was overriding the OSG PATH env var, fixed.
    • Staging out the output files with gridFTP if the user requests it.
    • Attending the phone meetings.
    • Working on the usage of gLExec for FNAL:
      • To make it work, the payload (and wrapper?) must be moved to the local disk, i.e. /tmp/
      • The name of the output file is passed as a variable, and the file is moved to job.output after execution. job.output is what pilotWrapper expects.

  • Panda for non-ATLAS users:
    • Exchange of emails with Ian Stokes-Rees and Peter Doherty about SBGRID/LifeSciences.
    • Phone meeting with Ian and Peter...
    • prodSourceLabel=standalone allows end users to set the job priority using --currentPriority
    • There seems to be a VO at FNAL for LBNE. They are adding the BNL guys to it.
    • Helping Tim Miller (CHARMM) to improve operations. His condor queue had collapsed under thousands of idle pilots. BNL_ITB_Install removed from the tag CHARMM. Looking for more sites to include under the tag CHARMM.
    • Joining the weekly VOs meeting to discuss how to improve CHARMM's productivity.
    • Functions added to trivialPilot.py to get the queue parameters (a la queuedata in atlas pilot) to find out if the site requires glexec.
    • Debugging why jobs sent to BNL_ITB_Test1 by Yuri are not being picked up by any pilot. Two reasons: different prodSourceLabel between jobs and pilots, and jobs were atlas production jobs which have no URL for the transformation script.
    • I have asked for interactive access to the CHARMM pilot submission host so that I can easily check the log files myself. The last problem seems to be a corrupted condor performance at wisc (we have asked Alain Roy for help).
    • Working with Tim Miller to find out why all his pilots are failing (again!). Many reasons:
      • bugs in several queue definitions in schedconfig (with extra quotations, \n, etc.),
      • sites down,
      • using obsolete panda sites, etc.

  • gLExec/Panda integration:
    • Added a list of excluded env vars. The env vars in this list must not be recreated when glexec is invoked, so they must not go to the wrapper. The get() method adds default values to the ones set by the user.
    • Removing a lot of garbage (print-outs, checks, etc.) from myproxyUtils.py
    • Problems running glexec at BNL fixed.
    • Talk at the T1/T2/T3 Jamboree.
    • Using GLEXEC_TARGET_PROXY the end user proxy does not belong to the pilot anymore, so the pilot cannot remove it at the end. It has to be removed from the wrapper.
    • Implemented a variable in myproxyUtils.py to set the output file, to be used for ITB.
    • run.sh removed from the code. It was hardcoded.
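
The list of excluded env vars, with a get() method that merges defaults with the user's own entries, could be sketched as below. The class name, the default names and the filter() helper are hypothetical; only the get() behaviour (defaults added to the user-supplied values) is taken from the entry above.

```python
# Illustrative defaults only; the real list is not given in this report.
DEFAULT_EXCLUDED = ("X509_USER_PROXY", "HOME", "LOGNAME")


class ExcludedEnv:
    """Names of env vars that must not be passed on to the wrapper."""

    def __init__(self, user_excluded=()):
        self._user = tuple(user_excluded)

    def get(self):
        """Defaults plus user-supplied names, without duplicates, order kept."""
        names = []
        for name in DEFAULT_EXCLUDED + self._user:
            if name not in names:
                names.append(name)
        return names

    def filter(self, environ):
        """Return a copy of `environ` without the excluded variables."""
        excluded = set(self.get())
        return {k: v for k, v in environ.items() if k not in excluded}
```

The filtered dictionary is what would be handed to the wrapper that recreates the environment after the identity switch.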

  • Outreach:
    • Discussion with Rob Gardner to prepare the workshop in Colombia.
    • A series of phone meetings with Rob, Aaron, and Carlos, and Suchandra and Marco, to prepare the workshop.
    • Preparing the twiki, links, documentation, etc.
    • Creating the agenda.
    • Contacting the remote speakers: Mats Rynge and Burt Holzman.

March 2010

  • First week in Bucaramanga (Colombia)
  • Second week attending the OSG All Hands Meeting at FNAL: I have contacted many people involved in outreach (including Ruth, James Weichel, Horst Severini...) and started negotiations for a workshop in Sao Paulo (Brazil).

  • Outreach:
    • Writing a full report about the activities in Colombia as part of the funding request to NSF.
    • Interview for the online magazine iSGTW about Grid Colombia.

  • ATLAS:
    • Attending ADC Monitoring meetings organized by Alex Read.
    • Preparing slides for talk "experience of using PanDA for non-HEP experiments"

  • OSG ITB integration:
    • Fixing a bug with gridFTP. Now --destinationURL is an attribute of a specific class to handle pilot options, and its value is inserted into --jobParameters internally (transparently for the users), with a different name for each job (adding an index at the end of the filename).

  • Autopilot/pilot:
    • Fixed the name of the pandaserver in PilotUtils.py. It was using voatlas19, which has been turned off.
    • Creating a script to perform globus-job-run to check site availability, using threads (to improve performance)
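
Checking many sites in parallel with threads can be sketched with a thread pool, as below. This is not the script mentioned above: the function names are hypothetical, and running `/bin/true` through globus-job-run is just one assumed way of probing a gatekeeper. The probe is injectable so the parallel part can be exercised without a grid client.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def probe_site(contact, timeout=60):
    """Run a trivial command through globus-job-run; True if it exits with 0."""
    try:
        result = subprocess.run(
            ["globus-job-run", contact, "/bin/true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Client not installed, or the gatekeeper did not answer in time.
        return False
    return result.returncode == 0


def check_sites(contacts, probe=probe_site, max_workers=10):
    """Probe all contacts concurrently and return {contact: True/False}."""
    contacts = list(contacts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(contacts, pool.map(probe, contacts)))
```

Threads suit this job because each probe spends nearly all its time waiting on a remote gatekeeper, so the total time approaches that of the slowest site rather than the sum over all sites.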

  • Last week attending the BNL ATLAS Distributed Computing Workshop.

2nd quarter

April 2010

  • glexec/panda integration:
    • Parameters to allow the proxy retrievals (end-user DN, myproxy server name) are known programmatically.
    • Passing whether glexec must be used as a variable from pilot.py to runJob.py
    • Panda site TESTGLEXEC created for tests.
    • The list of particular files whose permissions have to be changed is passed as an argument, no longer hardcoded.

  • Outreach:
    • Putting people from Colombia in contact with Rob Quick, and answering their questions about using the OSG GOC for Colombia.
    • Phone meeting with Dave Richie, Jim Weichel and Marcia Teckenbrok to talk about fixing the outreach and education web pages.
    • Phone meeting with people from Brazil to discuss a workshop in Sao Paulo.
    • Discussion with Rob Quick about a mini-workshop in Colombia about GOC.
    • Discussion about starting an International Latin America Grid Initiative, with me as the link to OSG. Possible trip to the CLCAR this summer.

  • OSG ITB integration:
    • phone meetings with ITB people.
    • Discussing the use of MyProxy to renew credentials on the job submit hosts.
    • Discussing automatic plot creation.
    • Testing FNAL. There was a typo in their GUMS configuration, fixed now.

  • Non-HEP experiments:
    • Debugging problems with curl on some SBGrid site.
    • Requested an RACF account "dayaplt" to set up the pilot submission on DayaBay hosts.

  • PanDA for OSG:
    • Created a mailing list panda-osg (at) opensciencegrid.org and added a few guys (from CHARMM, SBGrid, LBNE, UoC...)
    • Fixing some bugs in the documentation of sendJob.py
    • Creating a version for csh of the setup files for sendJob and sendPilot
    • Option --verify implemented. When used, the trf URL is checked and the job is not sent if the check fails.
    • Improving error reporting. If job submission fails due to a curl error, an explanatory message is printed: http://www.usatlas.bnl.gov/~caballer/files/curl_rc.
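
The --verify behaviour described above (check the trf URL first, and do not send the job if the check fails) can be sketched as follows. Only the option name --verify comes from the entry above; the --trf option, the function names and the HEAD-request check are assumptions made for this illustration and are not taken from sendJob.py.

```python
import argparse
import urllib.error
import urllib.request


def url_is_reachable(url, timeout=10):
    """True if a HEAD request to `url` succeeds."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (urllib.error.URLError, ValueError, OSError):
        return False


def build_parser():
    parser = argparse.ArgumentParser(prog="sendJob")
    parser.add_argument("--trf", required=True, help="URL of the transformation script")
    parser.add_argument("--verify", action="store_true", help="check the trf URL before sending")
    return parser


def submit(argv, send, check=url_is_reachable):
    """Send the job unless --verify is given and the trf URL check fails."""
    options = build_parser().parse_args(argv)
    if options.verify and not check(options.trf):
        return False
    send(options.trf)
    return True
```

Without --verify the check is skipped entirely, so the default behaviour of the client stays unchanged.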

  • CHEP2010:
    • Writing the abstract about security on ATLAS operations.
    • Writing the abstract about OSG outreach activities.
    • Writing the abstract about PanDA for ITB validation.

  • Panda pilot:

  • Autopilot:
    • Unsetting the BNL http proxy in pilotCron to avoid burning the BNL proxy when it is not needed

May 2010

  • Panda pilot:
    • Finished the refactoring of the pilot wrapper to avoid it checking out files from SVN.

  • OSG ITB integration:
    • Phone meetings with ITB people.
    • Working on the integration of glexec and MyProxy with the OSG pilot. Removing hardcoded values and doing operations programmatically, as in ATLAS.

  • Outreach:
    • Putting some people from Colombia in contact with the OSG GOC to figure out problems they have with voms proxies. After trying it myself on a different host, we understand the problem is in the client installation.
    • Putting some people from Colombia in contact with an OSG expert to fix problems deploying a CE.
    • Helping people from Colombia with questions about whether the WNs in a grid site can belong to different subnets.
    • Meeting with Ruth Pordes, Miron Livny and Dan Fraser to discuss my contribution to the CLCAR workshop.
    • Starting discussions with the people involved in CLCAR about the creation of a Latin American grid.

  • OSG issues:
    • Attending VOs meeting (https://twiki.grid.iu.edu/bin/view/Trash/Trash/VirtualOrganizations/VOGroupMeeting20100506 )
    • Starting the negotiations to become administrator for the VO OSGEDU.
    • Presentation about PanDA in the monthly OSG Sites Meeting ( https://twiki.grid.iu.edu/bin/view/SiteCoordination/SitesCoord100513 ).
    • Documentation for PanDA in the main OSG web area: www.opensciencegrid.org/panda
    • While trying to add more sites for CHARMM I discovered around 20 sites publishing that they support VO OSG when that is not true. I am in contact with the GOC guys to figure out how to fix that problem. I have reported it on the OSG sites mailing list and am getting answers from the sites involved.
    • Set up voatlas19 to serve a small tarball with the code needed by the "osg-like" wrapper, so that it is downloaded from there instead of from SVN at BNL. Scripts to create the pilot tarball modified to create the OSG tarball and not only the ATLAS tarball (one in /svn/monitor/ and another one in /svn/panda-cacheschedconfig).

  • Non-HEP experiments:
    • Account "dayaplt" finally created. Also a personal account "jcb" to allow me to log on the RHIC gateway (and from there jump to daya hosts)
    • Adding a bunch of panda sites to tag=CHARMM
    • Investigating why all those sites added to CHARMM are not receiving pilots
    • Fixing the myproxy/glexec integration. After some changes to adapt the code for ATLAS it stopped working for OSG. Fixed now.

  • DOE issues:
    • Becoming a DOE Agent for the OSGEDU VO (to have privileges to accept or reject grid certificate requests from new users)
    • Working as DOE Agent accepting and rejecting user/host/service grid certificate requests.

June 2010

  • Non-ATLAS experiments:
    • Setting up the daya pilot submission host. Cron created to generate a proxy with my credentials with voms attributes osg/role=pilot.
    • Testing the condor feature x509userproxy, which allows delegating a proxy to the WN even in Condor-C.
    • Meeting with David and Brett to clarify some questions about PanDA for Daya.
    • Setup for Daya FINISHED !!
    • Attending VOs meeting and reporting issues related to CHARMM and the new setup for DAYA.
    • Tim Miller (CHARMM) is having problems using the new client sendJob.py because of uuidgen. I have replaced uuidgen with a timestamp.
    • Recording status and output after job submission in the API client into two variables instead of just printing them out, requested by Suchandra.
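
The uuidgen replacement mentioned above can be illustrated with a minimal sketch. It assumes the client only needs an identifier that is unlikely to repeat; the function name make_job_id and the idea of appending the process id are hypothetical and are not taken from sendJob.py.

```python
import os
import time


def make_job_id(prefix="job"):
    """Build an identifier from the current time instead of calling uuidgen."""
    # time.time() returns seconds since the epoch as a float; keep the
    # microseconds so that two calls in the same second still differ.
    stamp = ("%.6f" % time.time()).replace(".", "")
    # Appending the process id is an assumption of this sketch: it lowers the
    # chance of a clash between two clients started at the same instant.
    return "%s-%s-%d" % (prefix, stamp, os.getpid())


print(make_job_id())
```

A timestamp is weaker than a UUID as a uniqueness guarantee, but it has no dependency on an external binary being installed on the submission host.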

  • Panda:
    • Working with Graeme to update the pilot tarball builder scripts at CERN hosts.
    • Problems with trivialWrapper.sh at BNL: the curl test against www.bnl.gov is not working. I have removed all the testing code. The same with atlasProdPilotWrapper.sh.
    • Using pUtil.py instead of PilotUtils.py: pUtil.py modified so that it does not crash on some unneeded imports. trivialPilot.py modified to use the pUtil functions (they have different names than in PilotUtils).
    • Replacing "WARNING Dispatcher has no jobs" with "FINISHED Dispatcher has no jobs" in pUtil, because the former caused the monitor to mark pilots as failed.
    • Fixing some mistakes in the documentation at www.opensciencegrid.org/panda
    • Minor changes in pilotWrapper to make the code cleaner and to remove job.out after finishing, so that its content is not included in the next pilot's output when no job is run.
    • Testing the new gLExec installation at BNL.
    • Testing gLExec at Glasgow. Problems with Graeme's certificate (it was rejected by the PanDA server due to the extra level of delegation introduced by the gatekeeper). The proxy must be regenerated on the WN with option -noregen.

  • Outreach:
    • Phone meeting about OSG & GridColombia on Jun 9.
    • Phone meeting about the Latin American Grid project.
    • Google group created to discuss Latin Grid.
    • Migrating people from mailing list gridco-ws09 to osg-americas, as requested by Ruth.
    • Negotiations with GOC people to set up a wiki to discuss Latin Grid and to allow a few people from SA to edit it. The TWiki was set up and I have added initial content.

  • OSG ITB integration:
    • Phone meetings with ITB people.

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

3rd quarter

July 2010

  • First week on vacation.

  • Non-ATLAS experiments:
    • Testing the sites under TAG=CHARMM one by one to see why activity is stuck. Opening GOC tickets for failing sites. It seems the main problem was that CHARMM was using an old version of trivialWrapper; trying with the latest version. Specs for site osg.hpc.ufl.edu fixed. BNL proxy (firewall) environment variables removed from 23 sites under TAG=CHARMM.
    • Now submitting pilots for CHARMM from BNL under my credentials. Jobs seem to be running, but some sites have complained and have been marked offline; phone meeting with Ruth to discuss it. At antaeus.hpcc.ttu.edu-sge all (?) jobs failed. That triggered a long thread, and we discovered that Globus does not work well with SGE.

  • PanDA client for OSG:
    • Setting prodSourceLabel='user', cloud='OSG' and jobParameters='' as default values.
    • Input option --VO implemented and default value set to 'OSG'

  • PanDA pilot for OSG:
    • heartbeat implemented.

  • PanDA:
    • Trying to test gLExec at Glasgow. As a first step, I am using the wrapper runpilot3-wrapper.sh directly, without gLExec, with Rod Walker's credentials. Problems staging out files.
    • Changing my OSG certificates on pilot submission hosts: gridui07, daya0001
    • GOC ticket to change the configuration of the VOMS server for VO OSG. Maarten has detected a bug in the "uri".
    • Testing the new atlas pilot wrapper I wrote (downloads the entire pilot tarball directly from CERN) to put it in production.

  • Outreach:
    • Skype meeting to prepare things before the CLCAR workshop.
    • Working with GOC to grant access to the South American collaborators to edit the twiki I created.

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

August 2010

  • PanDA client and pilot for OSG:
    • Input options to allow user authentication without an X509 proxy, by pointing to the usercert and userkey files.
    • Allowing the transformation script to be given as the path to the executable instead of a URL.
    • Input options and code to stage in and stage out files with gridFTP: --pilotStageInSource, --pilotStageInDestination, --pilotStageOutSource, --pilotStageOutDestination.
    • Documentation for the new options.
    • Input options and code to use any protocol (not only gridFTP) for stage-in and stage-out.
    • Python API made more flexible: jobs can be defined separately and passed as an argument (as a list) to method makeJobList(). This allows each job to have a different specification, i.e. different jobParameters.
    • Input options and code to delegate a proxy to a myproxy server and to register it in the panda server.
    • Documentation updated.
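
The four stage-in/stage-out options listed above can be sketched as follows. Only the option names come from this report; the use of argparse, the helper names build_parser and transfer_pairs, and the example URLs are assumptions made for illustration and do not reproduce the real client code.

```python
import argparse


def build_parser():
    """Declare the four stage-in/stage-out options named in the report."""
    parser = argparse.ArgumentParser(description="hypothetical client options")
    for name in ("pilotStageInSource", "pilotStageInDestination",
                 "pilotStageOutSource", "pilotStageOutDestination"):
        parser.add_argument("--" + name, default=None)
    return parser


def transfer_pairs(args):
    """Return (stage, source, destination) tuples for the requested transfers."""
    pairs = []
    for stage, src, dst in (
        ("stage-in", args.pilotStageInSource, args.pilotStageInDestination),
        ("stage-out", args.pilotStageOutSource, args.pilotStageOutDestination),
    ):
        if src and dst:
            pairs.append((stage, src, dst))
        elif src or dst:
            # A source without a destination (or the reverse) cannot be used.
            raise ValueError("%s needs both a source and a destination" % stage)
    return pairs


# Example with made-up URLs; a real client would parse sys.argv instead.
example = build_parser().parse_args([
    "--pilotStageInSource", "gsiftp://example.org/data/input.dat",
    "--pilotStageInDestination", "file:///tmp/input.dat",
])
print(transfer_pairs(example))
```

Keeping source and destination as plain URLs leaves the choice of transfer tool open, which fits the later step of supporting protocols other than gridFTP.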

  • Non-ATLAS experiments:
    • Testing more potential sites for CHARMM.
    • Adding Uconn to the list of sites.
    • Reporting progress with CHARMM in the OSG VO Forum weekly meeting.
    • Discussions with Brett and David for Daya. New transformation script including code to transfer the output rootfile and semfile with xrootd.
    • Opened a ticket to fix a problem in the farm: pilots do not start running. Fixed now.
    • Learning how to add queues myself. Adding (myself) the directories and files in SchedConfigs/ and JDLConfigs/ for new queue AGLT2-OSG-condor
    • Re-started the scheduler on daya0001. Now Daya is up and running

  • PanDA wrapper:
    • atlasProdPilotWrapper.sh updated to source cctools setup file.

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

  • Last week attending the CLCAR workshop in Gramado (Brazil) + a couple of vacation days.

September 2010

  • Non-ATLAS experiments:
    • Some GOC tickets have been open since before I left for Brazil. Testing the sites again. At least one is working fine now.
    • New ticket: Uconn failing:
      • curl fails randomly
      • the jobmanager kills jobs, randomly
      • The ticket was closed after the admin claimed the problem was fixed. That was not true, so I have reopened the ticket. I am in contact with the admin, trying to understand the problem at his site.

  • Outreach:
    • Full report to Ruth about the results of the workshop in Gramado.
    • Engaging the Venezuelan community in OSG.
    • Some coordination activities after the workshop in Gramado with Mats Rynge, Marco Mambelli, and Ruth.
    • Preparing the INDICO agenda for the next workshop in Sao Paulo.
    • Article for the September edition of OSG Newsletter.
    • Preparing poster for CHEP.
    • Starting to organize the International OSG school next year.
    • Learning how to submit a grant proposal to the PASI program (NSF).

  • PanDA:
    • Started with the alert system for the pilots. I have written the framework: a python script parsing an XML config file. The functions that get and analyze data and that trigger actions are added as plug-ins, without requiring additional code.
    • Started with the new wrapper system:
      • bash wrapper has 4 input options:
        • queue
        • site
        • project (?)
        • grid (?)
        • url (?)
    • The idea of moving the setup of the grid environment for OSG to the python part of the chain seems not to be feasible. When the local batch system is CONDOR, the env var $PATH is set up only after setting up the grid environment.
    • Refactoring of pUtil.py (in pilot3) to split it into several modules. So far:
      • curlUtil.py
      • logUtil.py
      • timeUtil.py
      • hostUtil.py
      • genericUtil.py
      • pandaUtil.py
      • atlasUtil.py
      • jobUtil.py
    • Doing this refactoring I have detected things that should be fixed:
      • bugs, never detected.
      • docstrings unrelated to their function.
      • two functions with different names doing the same thing.
      • orphan code. Functions not invoked by anyone.
      • variables named after Python built-ins (for example: file, id...). These names should not be used for variables; shadowing them can introduce problems in the future.
    • Adding some content to the pilot documentation wiki.
    • Trying gLExec at Glasgow again. Now I am using Graeme's wrapper. The problems seem to be because I was performing tests with production jobs but with a proxy with Role=pilot. Now we are going to perform all tests with analysis jobs:
      • I have requested pilot and production roles for /atlas/ and /atlas/usatlas/
      • I have created a new PanDA site: ANALY_GLASGOW_GLEXEC, starting with 'ANALY_' to avoid problems.
      • Reading the tutorials to learn how to submit athena jobs.
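
The problem with variables named like `id`, noted in the refactoring list above, can be shown with a small sketch. The function names are hypothetical and the code is not taken from pUtil.py; it only illustrates how a local variable hides a built-in within its scope.

```python
def describe(obj):
    """Return a label built with the built-in id()."""
    return "object-%d" % id(obj)


def broken(obj):
    """Same intent, but a local variable named 'id' hides the built-in."""
    id = "job-42"  # from here on, 'id' is a string inside this function
    try:
        return "object-%d" % id(obj)  # calling a str raises TypeError
    except TypeError as exc:
        return "failed: %s" % exc


print(describe([]))
print(broken([]))
```

The failure only appears when a later edit tries to use the built-in in the same scope, which is why such names tend to cause trouble long after they were introduced.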

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

4th quarter

October 2010

  • PanDA:
    • Still trying to test gLExec at Glasgow. I am learning how to submit analysis jobs.
    • Schedconfig for ANALY_GLASGOW_GLEXEC changed to allow me to write the output files in the SE.
    • There is a bug in gLExec 0.7; trying to bypass it by executing $ glexec bash wrapper.
    • Writing slides for the CHEP talk on security.
    • Created new queue at OSCER for MPI testing.
    • Cleaning ATLAS-specific code out of trivialPilot.py: all code related to XML catalogs.

  • Non-ATLAS experiments:
    • Meetings with Daya guys to talk about running MC with PanDA.
    • Problems with CHARMM: gridFTP is failing. Their hostcert has expired. I have approved their request for a new hostcert.
    • trivialPilot now returns the job's actual return code (RC) when a job fails, instead of always returning 1 as before.
    • Opened a GOC ticket to fix the URL pointing to the VOMS server in MyOSG. Fixed.
    • Condor-C jobs for daya are not starting to run. Opened a ticket. Fixed now. Maybe the JDL should be modified to prevent the problem from happening again?
    • 'stream_output=True' added to the JDL for Daya to allow streaming back the output in "real time".
    • commands.getstatusoutput() replaced by subprocess.Popen() for site LBNE_DAYA_1 to get the payload output in real time.
    • Implemented code in trivialPilot to search for daya-specific error strings ('FATAL', 'ERROR', 'segmentation violation', 'IOError', and 'ValueError') and error codes added to the monitor. Now failed jobs are displayed in red in the PanDA monitor with a short message.
    • Discussing with Daya guys how to proceed with PDSF. Several separate sites or one TAG? Do we encapsulate the specific site details in the condor submission file?
    • People from PDSF contacted by Brett to talk about running PanDA there for DayaBay.
    • Running MC jobs for Daya.
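
The switch to subprocess.Popen() and the search for error strings described above can be sketched together. This is a minimal illustration written for Python 3, not the trivialPilot code: the function name run_payload and the way matches are collected are assumptions, and the child command in the example is a stand-in for a real payload.

```python
import subprocess
import sys

# Marker strings as listed in the report.
ERROR_STRINGS = ("FATAL", "ERROR", "segmentation violation",
                 "IOError", "ValueError")


def run_payload(cmd):
    """Run cmd, echo each output line as it arrives, and collect markers."""
    found = []
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    # Iterating over the pipe yields lines as the child writes them, unlike
    # a call that returns the whole output only after the child exits.
    for line in proc.stdout:
        sys.stdout.write(line)
        sys.stdout.flush()
        for marker in ERROR_STRINGS:
            if marker in line and marker not in found:
                found.append(marker)
    proc.stdout.close()
    return proc.wait(), found


# Stand-in payload: a short Python child that prints one failing line.
rc, found = run_payload(
    [sys.executable, "-c", "print('processing'); print('FATAL: boom')"])
print(rc, found)
```

Note that the child's own buffering still matters: a payload that buffers its standard output will only appear in bursts, however the parent reads the pipe.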

  • Outreach:
    • Setting up the INDICO agenda for the school in Sao Paulo. Contacting people for remote talks (Mine, Ruth, Miron...) Organizing the schedule.
    • People from NSF contacted to ask about the requirements for applying to the PASI program.

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

November 2010

  • Non-ATLAS experiments:
    • Testing the trivialPilot code interactively. Some ITB robots are hanging forever at OU. The trivialPilot code seems to be OK.
    • Transformation script for DayaBay changed to remove the output files (root and log) from the local directory after they are staged-out to xrootd.
    • Problem: PDSF does not support VO OSG.
    • Writing the proceedings for CHEP (poster on ITB Robot).
    • Fixing the cron on dayaplt@daya0001, making all entries redirect their output to /dev/null to prevent periodic emails.
    • Getting account at PDSF to use the DayaBay cluster with OSG credentials
    • Fixing the schedconfig definition for the PanDA queue PDSF-NERSC
    • Problems for CHARMM at Firefly: curl+tar not working. GOC ticket opened.
    • Running 100 testing jobs for Daya.
    • Testing remote pilot submission to NERSC-PDSF. The condor jdl file from pandaconf does not work.
    • Fixing some problems with the output of jobs for Daya: a lot of empty lines (every line had an extra \n) and the whole output duplicated.
    • Fixing the setup in my account at NERSC-PDSF in order to get the correct python version needed by NuWa.

  • Outreach:
    • Preparing the introduction slides for the teachers in the school in Sao Paulo.
    • Finishing the INDICO agenda.
    • Coordinating the setup for the exercises with the other teachers and the local organizers.
    • Coordinating content of presentations with remote speakers.
    • Writing the proceedings for CHEP (poster on outreach).

  • PanDA:
    • Removing .py and .pyc files at the end of trivialWrapper execution, to reduce the size of the directory on local submission.
    • New panda queue created: ANALY_BNL_GLEXEC.
    • Trying to run with gLExec at BNL. After changing permissions to directory $PilotHomeDir I am able to run a buildJob with gLExec successfully.
    • svn update added to panda_nightly.py on daya0001
    • Writing the proceedings for CHEP (talk on security).
    • Adding a lock mechanism for stageout.sh in trivialPilot.py. In this way stageout.sh can be invoked at any moment from the transformation script and it is not executed twice.
    • Writing the final code in trivialPilot to use stage-in and stage-out command line options (this is just a temporary solution until something based on site schedconfig and/or SiteMovers is in place)
    • List of bugs in pilot3/ reported to Paul.
    • Some progress with gLExec. The problem seems to occur when gLExec sources the script that sets up CMT.
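
The lock mechanism for stageout.sh mentioned above can be sketched with an atomically created lock file. This is an assumption about one possible approach, not the code in trivialPilot.py: the function name run_once and the lock file name are hypothetical, and the command used in the example is a harmless stand-in for stageout.sh.

```python
import os
import subprocess
import sys
import tempfile


def run_once(lock_path, cmd):
    """Run cmd only if lock_path does not exist yet; return True if it ran."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one caller succeeds.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # somebody already ran (or is running) the command
    os.close(fd)
    subprocess.call(cmd)
    # The lock file is left in place on purpose so later calls are skipped.
    return True


workdir = tempfile.mkdtemp()
lock = os.path.join(workdir, "stageout.lock")
stand_in = [sys.executable, "-c", "print('staging out')"]
print(run_once(lock, stand_in))  # first call runs the command
print(run_once(lock, stand_in))  # second call is skipped
```

Leaving the lock file behind is what lets the transformation script and the pilot both request the stage-out without coordinating with each other.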

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

  • Last week at CERN, attending the ATLAS C&S Week.

December 2010

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

  • Outreach:
    • OSG School in Sao Paulo
    • article for the OSG newsletter


1st quarter

January 2011

  • First week on vacation.

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

  • Non-HEP experiments:
    • Put Tim Miller in contact with Bob Ball about some weird behavior of CHARMM jobs accessing disk.
    • Derek Weitzel has opened a GOC ticket complaining about the high rate of pilot submission to ff-grid.unl.edu. The OSG VO has been banned in that site.
    • Testing trivialPilot at NERSC-PDSF. It fails because curl is not available, or has no connectivity. Also voms-proxy-info does not work.
    • Adding setup instructions to the NERSC-PDSF schedconfig so that pilots can set up the account properly to run nuwa at NERSC.
    • Testing pilots at NERSC-PDSF with a real daya job. There is a problem at NERSC-PDSF. Jobs crash randomly. Output truncated, so impossible to debug. Ticket opened.

  • PanDA:
    • Testing gLExec in Glasgow, interactively, with the same version of gLExec we have at BNL. It fails.
    • Testing gLExec in Oxford. It fails. The MyProxy client is not installed there.
    • Some refactoring of trivialPilot to handle the queue configuration retrieval better.
    • Adding the content of 'envsetup' from the queue configuration at the beginning of the job command in trivialPilot.

February 2011

  • DOE issues:
    • Working as DOE Agent accepting/rejecting/revoking user/host/service grid certificate requests.

  • gLExec:
    • Testing OXFORD: first tests failed:
      • I cannot change permissions to the full chain of directories (dirty trick I used to use): /home is protected.
      • gLExec reports this problem: [gLExec]: LCMAPS failed, see '/var/log/glexec/lcas_lcmaps.log' for more info.
      • Ticket https://gus.fzk.de/ws/ticket_info.php?ticket=67197

  • PanDA pilots:
    • Working on the wrapper refactoring, needed to deploy autopyfactory. First version of wrapper.sh done.


  • Non-HEP experiments:
    • Opening a ticket to ask why pilots do not start running at daya000x fixed the problem.
    • Script to kill all condor jobs (pilots) in "HOLD" status on daya000x.
    • Problems launching automatic pilot submission to NERSC-PDSF. I have checked the schedconfig: several fields in the queue specifications were empty. Fixed now.
    • Running tests at NERSC-PDSF. Some of them failed. Now the problem is understood: a bug in the nuwa code sometimes raises the memory usage. Fixed by adding a line in ~/.sge_request:
               $ cat ~/.sge_request
               -l h_vmem=5G

    • Until the problems with autopilot are fixed, I was running pilots by hand. First official MC campaign at NERSC-PDSF. Jobs are failing:
              sort: close failed: -: Disk quota exceeded
    • Other jobs have finished, but failed with a segmentation violation. Brett said it seems related to an already known bug.
    • All problems are understood: jobs run in the user home directory, which is on a shared filesystem -> the root files being created filled up the available space in the /home partition. Solution: create the root files (and sem files and log files) directly in the destination path ( /eliza16/dayabay/users/dayaplt/FMCP11a/ )
    • Pilots at LBNE_DAYA_1 were failing. Reason is that the queue definition included old ATLAS related info for variable 'envsetup' that now the pilot is using before running the payload. Fixed now.
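
The script that kills all condor jobs in "HOLD" status, mentioned in the list above, could look roughly like the sketch below. The original script is not shown in this report, so this is an assumption: it relies on the condor_rm -constraint option and on JobStatus value 5 meaning "held" in HTCondor, and the helper name held_removal_command and the optional owner filter are hypothetical.

```python
import subprocess

HELD = 5  # HTCondor JobStatus value for jobs in the held state


def held_removal_command(owner=None):
    """Build the condor_rm command line that removes held jobs."""
    constraint = "JobStatus == %d" % HELD
    if owner:
        # Optionally restrict the removal to one user's jobs.
        constraint += ' && Owner == "%s"' % owner
    return ["condor_rm", "-constraint", constraint]


cmd = held_removal_command()
print(" ".join(cmd))
# On a submit host the command would then be executed, for example with:
# subprocess.call(cmd)
```

Building the command separately from running it makes it easy to inspect what would be removed before anything is actually killed.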

Topic revision: r245 - 16 Jun 2016 - 20:43:34 - ElizabethChism