SitesCoordOct4

Introduction

  • Background: This is the first tele-meeting of the OSG Sites Coordination group which consists of participating OSG site admins and managers.
  • Attendees: Terrence / UCSD, Tom / GOC, Xin / BNL, Rob / UC, Kevin / Vanderbilt, Arvind / GOC, Wayne Betts / BNL, Charles / Caltech, Jeff / LBL, Sherayus / NERSC, Anand / UIUC, Brian Moe / UW Milwaukee, Alain / VDT, Gabriele / FNAL, Michael / Caltech, Wei / SLAC, Horst + Karthik / OU, Chris Green / FNAL, Todd / U Wisc, Suchandra / UC, Doug / LBL, Burt / FNAL, Steven / FNAL, Anne / FNAL, Abhishek / UCSD
  • Apologies: none
  • Coordinates: Thursday, 2:30pm Central; 510-665-5437, #1212
  • Previous meetings:
  • Planning: SitesCoordinationPlanning

News about OSG 0.8 and VDT 1.8.1 (Alain/RobG)

OSG performance metrics review (Todd Tannenbaum, OSG Deputy Facility Coordinator)

  • sitemetrics.ppt: Todd's Metrics Talk
  • Introduces the OSG metrics: measurements of our effectiveness.
  • Q: How are we auditing the metrics? How do we know the metrics are "authentic"? Good question! Note that there have been examples of problems with units in the recent past.
  • Q: Trouble ticket metrics: how do we account for cases where users are unresponsive? There is lots of room for improvement. Institutional pride plays a big role!
  • Q: Call for a place on the web where metrics can be viewed.
  • Q: How do we account for the differing capabilities of different sites in the metrics?

Report from OSG users (Chris Green)

  • UG_Report_2007_09_28.ppt: Chris' OSG user report
  • There are some recurring issues, owing to the maturity of the software in the OSG stack.
    • Sites should support the VOs that they claim to support. There is significant overhead in getting onto a site: we find that only about 50% of the sites authenticate all the VOs they claim to.
    • Several VOs have multiple applications from multiple groups and people, e.g., the Engage (non-physics) and OSG (new and small users) VOs. The Fermilab VO is an umbrella for several experiments.
    • Sites that suspend or preempt jobs are a problem for long-running jobs, which run long because their overhead is high (lots of I/O). Some sites "trap", suspending the job without releasing it.
    • It would be nice if sites could reconsider the use of preemption. If they really need it, can we advertise this somehow so users know not to match against it?
    • There is a preemption flag in the GLUE schema, but it's just yes/no. There is also the job retirement time, but it's not advertised. This is supposed to be in the Policy URL.
    • Mappings of DNs to accounts: do sites honor FQAN information from VOMS? From a user's point of view, the more sites that use VOMS and GUMS the better. Note: there is plenty of help available with the deployment. People will be trying to run with two different VOs.
  • If sites have MPI resources, we would request that these be made available to the OSG infrastructure. (Two sites provide this now: NERSC and Purdue.)
  • There is an OSG - Users group with an open agenda every second week. Will use this forum to communicate user requests, and to inform users of specific usage problems.
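
As an illustration of how a figure like the "about 50% of sites authenticate all the VOs they claim to" above could be tallied, here is a minimal Python sketch. The site names, VO lists and authentication results are all invented for the example, and the helper names `missing_vos` and `fraction_fully_supporting` are hypothetical; this is not the method used for the user report.

```python
# Hypothetical input: every site name, VO list and result below is invented
# for illustration and does not describe any real OSG site.
claimed = {
    "SITE_A": {"cms", "engage", "osg"},
    "SITE_B": {"atlas", "fermilab"},
    "SITE_C": {"engage", "osg"},
    "SITE_D": {"cms"},
}
# VOs for which a test authentication actually succeeded at each site.
authenticated = {
    "SITE_A": {"cms", "osg"},
    "SITE_B": {"atlas", "fermilab"},
    "SITE_C": {"engage", "osg"},
    "SITE_D": set(),
}


def missing_vos(claimed, authenticated):
    """Return {site: sorted list of claimed VOs that did not authenticate}."""
    gaps = {}
    for site, vos in claimed.items():
        missing = vos - authenticated.get(site, set())
        if missing:
            gaps[site] = sorted(missing)
    return gaps


def fraction_fully_supporting(claimed, authenticated):
    """Fraction of sites that authenticate every VO they claim to support."""
    if not claimed:
        return 0.0
    gaps = missing_vos(claimed, authenticated)
    return (len(claimed) - len(gaps)) / len(claimed)


gaps = missing_vos(claimed, authenticated)
fraction = fraction_fully_supporting(claimed, authenticated)

for site, vos in sorted(gaps.items()):
    print(f"{site}: claimed but not authenticated: {', '.join(vos)}")
print(f"Sites authenticating all claimed VOs: {fraction:.0%}")
```

Listing the per-site gaps alongside the overall fraction gives site admins something concrete to follow up on, rather than only an aggregate number.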

Lightning talk: RSV (Arvind Gopu)

  • What is RSV?
    • Site level validation of resources and services that run on them
    • We provide default scheduling infrastructure based on Condor-cron
    • Default set of probes including critical OSG probes
    • Site level web interface to RSV results for all hosts being monitored by that site admin
    • Results of local site validation tests are uploaded into a central GOC-maintained RSV database.
    • Data exchange between OSG GOC and WLCG SAM for interoperability
  • How can I get RSV? RSV Install Guide
  • Can I contribute probes to RSV? Yes! In fact, we welcome OSG collaborators and site admins to contribute probes for specific tests they think will be useful to the OSG community as a whole
    • You can use the Perl module we've developed as a starting point, and plug tests into a template script
    • Or you can write your own scripts in any other programming language, as long as they conform to the WLCG probe specification 0.91 with respect to the input parameters they take and the output they print, plus some additional OSG-specific requirements (necessary to facilitate uploading of data into the central RSV database).
    • Expect more documentation about both of the above options RSN
  • Why Condor-cron instead of Unix cron?
  • Questions: please email RSV developers mailing list
  • Coming soon: Updated RSV in VDT 1.8.1a -- includes capability to use existing Condor 6.9.x install; and monitoring multiple CEs from one monitoring host; web interface to local RSV results; RSV documentation in the rsv.grid.iu.edu domain.

Questions

  • Concern about documentation: most of it is on an external webpage. Arvind will make sure that it's well integrated with the TWiki.
  • How about using Nagios plugins or MonALISA? Note that Tomasz is working on Nagios plugin wrappers; Arvind will follow up.
  • What are the long term plans for proxies and their renewal? One idea was to have a central myproxy server; don't know the best answer at the moment.

Next face-to-face meeting

  • Tentative dates: Wednesday-Thursday, December 5-6, 2007.
    • Would start the meeting at 1:30pm on 12/5/07 and go to noon on 12/6/07
  • Location: Fermilab
  • Decided at the meeting we would take an online survey for this

-- ArvindGopu - 04 Oct 2007

Topic attachments

  • UG_Report_2007_09_28.ppt (136.0 K, 04 Oct 2007 - 19:51, RobGardner): Chris' OSG user report
  • sitemetrics.ppt (151.0 K, 04 Oct 2007 - 19:29, ToddTannenbaum): Todd's Metrics Talk

Topic revision: r9 - 16 Dec 2008 - 16:16:03 - KyleGross
 