Introduce the OSG metrics: measurements of our effectiveness.
Q: How are we auditing the metrics? How do we know the metrics are "authentic"? Good question! Note that there have been problems with units in the recent past.
Q: Trouble ticket metrics. How do we account for unresponsive users? There is lots of room for improvement. Institutional pride plays a big role!
Q: Call for a place on web where metrics can be viewed.
Q: How to account for differing capabilities of different sites in the metrics?
There are some recurring issues, owing to the maturity of the software in the OSG stack.
Sites should support the VOs that they claim to support. There is significant overhead in getting onto a site: only about 50% of the sites authenticate all the VOs they claim to.
Several VOs have multiple applications from multiple groups and people, e.g., Engage (non-physics) and OSG VOs (new and small users). Fermilab VO is an umbrella for several experiments.
Sites which suspend or preempt jobs: this hurts jobs that run long because their overhead is high (lots of I/O). Some sites "trap" jobs, suspending them without releasing them.
Would be nice if sites could reconsider the use of preemption. If they really need it, can we advertise this somehow so users know not to match against it?
There is the preemption flag in the glue schema, but it's just yes/no. There is also job retirement time, but it's not advertised. It is supposed to be in the Policy URL.
Mappings of DNs to accounts. Do sites honor FQAN information from VOMS? From a user's point of view, the more sites that use VOMS and GUMS the better. Note - there is plenty of help with the deployment. People will be trying to run with two different VOs.
If sites have MPI resources, we request that these be made available to the OSG infrastructure. (Two sites are providing this now: NERSC and Purdue.)
There is an OSG - Users group with an open agenda every second week. Will use this forum to communicate user requests, and to inform users of specific usage problems.
Lightning talk: RSV (Arvind Gopu)
What is RSV?
Site level validation of resources and services that run on them
We provide default scheduling infrastructure based on Condor-cron
Default set of probes including critical OSG probes
Site level web interface to RSV results for all hosts being monitored by that site admin
Results of local site validation tests are uploaded into a central GOC-maintained RSV database.
Data exchange between OSG GOC and WLCG SAM for interoperability
Can I contribute probes to RSV? Yes! In fact, we welcome OSG collaborators and site admins to contribute probes for specific tests they think will be useful to the OSG community as a whole
You can use the Perl module we've developed as a starting point, and plug tests into a template script
Or, you can write your own scripts in any other programming language, as long as they conform to the WLCG probe specification 0.91 with respect to the input parameters they take and the output they print, plus some additional OSG-specific requirements (necessary to facilitate uploading of data into the central RSV database).
Expect more documentation about both of the above options RSN
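To give a feel for the second option (a probe written in another language), here is a minimal Python sketch. It is not taken from RSV or from the WLCG specification: the field names (metricName, metricStatus, timestamp, serviceURI, summaryData, detailsData), the EOT terminator, the status strings, and the helper names format_probe_result and run_probe are all assumptions made for illustration. Check the probe specification and the OSG-specific requirements before relying on any of them.

```python
from datetime import datetime, timezone

# Status strings assumed for this sketch; the specification defines the real set.
VALID_STATUSES = {"OK", "WARNING", "CRITICAL", "UNKNOWN"}


def format_probe_result(metric_name, status, summary, details="",
                        service_uri="", timestamp=None):
    """Render one result as 'key: value' lines closed by an end marker.

    Field names and the 'EOT' marker are assumptions, not a verified
    copy of the WLCG probe specification.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)
    lines = [
        f"metricName: {metric_name}",
        f"metricStatus: {status}",
        f"timestamp: {timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"serviceURI: {service_uri}",
        f"summaryData: {summary}",
        f"detailsData: {details}",
        "EOT",
    ]
    return "\n".join(lines)


def run_probe(metric_name, service_uri, test, timestamp=None):
    """Run a test callable returning (ok, message) and format its outcome.

    A failed test maps to CRITICAL; an exception inside the test maps to
    UNKNOWN, so a broken probe is not reported as a broken service.
    """
    try:
        ok, message = test()
        status = "OK" if ok else "CRITICAL"
    except Exception as exc:
        status, message = "UNKNOWN", f"probe raised {exc!r}"
    return format_probe_result(metric_name, status, message,
                               service_uri=service_uri, timestamp=timestamp)


if __name__ == "__main__":
    # Hypothetical metric name and host; the lambda stands in for a real test.
    print(run_probe("org.example.always-ok", "ce.example.org",
                    lambda: (True, "service reachable")))
```

Keeping the test logic separate from the output formatting means a new test only has to return a pass/fail flag and a message, which is roughly the role the template script plays in the Perl option.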
Coming soon: Updated RSV in VDT 1.8.1a -- includes capability to use existing Condor 6.9.x install; and monitoring multiple CEs from one monitoring host; web interface to local RSV results; RSV documentation in the rsv.grid.iu.edu domain.
Questions
Concern about documentation - most of it is on an external webpage. Arvind will make sure that it's well integrated with the twiki.
How about using Nagios plugins or MonALISA? Note Tomasz is working on Nagios plugin wrappers - Arvind will follow up.
What are the long term plans for proxies and their renewal? One idea was to have a central myproxy server; don't know the best answer at the moment.
Next face-to-face meeting
Tentative date: Wednesday-Thursday, December 5-6, 2007.
Would start the meeting at 1:30pm on 12/5/07 and go to noon on 12/6/07.
Location: Fermilab
Decided at the meeting we would take an online survey for this