OSG Area Coordinators Meeting
Thursday, Feb 25, 2010
Noon Central
Attendees: Abhishek Rana, Maxim Potekhin, Jason Zurawski, Mine Altunay, Alain Roy, Igor Sfiligoi, David Ritchie, Chander Sehgal
- 1.1 Software - Alain Roy (confirmed)
- 1.4 VOs - Abhishek Rana (confirmed)
- 1.6 Campus Grids - Sebastien Goasguen (could not attend; need to reschedule)
- 1.7 Security - Mine Altunay (confirmed)
We last reported on the full software group on October 15th, 2009, but Tanya gave an update on the storage portion of the software on January 28th, 2010, so this report will mostly not cover the storage-related aspects.
Since last October, we've released three software updates to the (Pacman packaged) software in the OSG Software Stack, and a fourth is pending in the ITB. They've covered a variety of issues, including software updates and security releases. Of note, we've changed how RSV is distributed: it is now shipped from a tarball (not a Subversion repository) hosted at the VDT.
Much of the responsibility of hosting RSV has shifted from the GOC to the VDT. We've updated the RSV documentation considerably. In particular, we've documented how RSV metric development should proceed, since our developers are distributed, and we've clearly defined how changes go from development to testing to release.
We've been working on bringing the new rsv-control client tool up to production quality, and it's nearing release. It should greatly simplify interaction with RSV for site administrators.
The UW VDT group has been hard at work on native packaging. Working closely with LIGO, we've continued to focus on their needs and have made many improvements. As of today (25-Feb-2010) the native packages are nearing finalization and release.
We've recently been asked to help with the NEES file transfer work, and are doing so.
Globus 5 & CREAM
Igor Sfiligoi has been evaluating GRAM 5 and CREAM for deployment, and Alain has provided some assistance. We have a document about what needs to happen for GRAM 5 to be deployed in OSG, and are working on the same thing for CREAM.
ATLAS and alternate installation method
Just recently, we started having discussions with the USATLAS Tier 3 effort about how the VDT is packaged. They have specific needs for the client tools that aren't met by the current Pacman packaging or native packaging. We are working on a proposal to deal with this by early May, when it will be needed by USATLAS.
Debian packaging of the OSG CA certificates for LIGO
The UW VDT team (particularly Tim Cartwright) helped package the OSG CA certificates as Debian packages. Tim wrote tools to automate the creation of the Debian packages and the APT repository to host them, then passed them off to the OSG security team and the GOC for use and deployment.
We last reported Oct 15, '09.
Groundwork and impact on VO production
- 1. Dzero Monte Carlo production remains cyclic, fluctuating with peaks of 10+ million events/week. Production dropped to near-zero for 2-4 weeks; the issues were localized to the D0 submission infrastructure and a lack of workload. New application software has restored the average rate.
- 2. Prepared CDF for SL5 migration; Kerberos CA transition. Used a separate queue for SL5 to keep SL5 evaluation isolated from production on SL4. CDF successfully upgraded new portal to GlideinWMS; working well.
- 3. Also brought Geant4's biannual EGEE-based exercise onto OSG. Ran on OSG in Dec'09; scale was low; major fraction provided by EGEE. Peak 12,000 hours/day; average 2,000 hours/day; but limited to 2-4 sites. Discrepancy in wasted wall hours: 50% wasted in the OSG resource view, 0% wasted in the Geant4 view. The likely cause is an OSG-side issue: an exit-code discrepancy due to pilots.
- 4. US ALICE's internal plan for a Tier-2 Facility (NERSC/LBNL + LLNL) moving forward; contingent on ALICE funding. Trash/Integration site up, scalability evaluation of OSG-AliEN interface to restart in coming months.
- 5. Sustained nanoHUB production at 200-700 wall hours/day. 5+ types of nanotechnology applications. Opened some 'end-user' production jobs; 'probe' jobs ongoing as earlier. Accounting based on nanoHUB-view is still not clear; expected only in long term.
- 6. Ongoing work with SBGrid/NEBioGrid to sustain an increased scale of production. Peak 60,000 hours/day; usage in bursts; average 7,000 hours/day; current average 40,000 hours/day. Achieved milestone of ~3000 simultaneous jobs. Job success rate 30-50%. Using MatchMaker and Squid; plans are to evaluate Panda. Multiple factors evaluated/addressed; thus, progress steady but slow.
- 7. IceCube started up; proof of principle accomplished in Oct'09. Limited data-access model. Total usage across 6 sites was 4,000 wall hours; 600 jobs at 50% efficiency. Work to integrate and evaluate data staging mechanisms has been slow; expected only in long term.
- 8. GridUNESP started up in Brazil; coordinated assistance and direction from DOSAR/Horst. GridUNESP researchers submit MPI jobs through full site/VO infrastructure; running on OSG. Intermittent switch/networking/TCP/IP problems.
- 9. GlueX started up. Local site UConn-OSG functional. Working with Richard Jones to get jobs from other VOs on the GlueX site.
- 10. Facilitated NYSGrid & nanoHUB on NYSGrid's new portal; based on HUBzero, an offshoot of early nanoHUB work. NYSGrid's new consortium - http://hpc2.org
- 11. CompBioGrid site functional; used by other VOs. Application integration of Virtual Cell expected in long-term; contingent on CompBioGrid internal FTEs.
- 12. GPN and GROW have active sites; but scale is moderate.
- 13. CIGI, GRASE, NWICG valuable partners in Consortium; but science production is moderate. Plans need to be encouraged.
- 14. Restarting work with CHARMM team (part of Engage VO); Maxim is coordinating effort with NHLBI/NIH.
Facilitated VO participation for peer Areas
- MyOSG SLA public review. Feedback from Fermi-VO, NYSGrid, STAR, SB/NEBioGrid. (Rob Q)
- Identity Management Survey; and the following Workshop. Feedback from 15 VOs. (Mine)
- Monitoring & End-to-end Information Requirements. Partly out of loop; Fermilab-internal coordination. (Burt, Anthony)
- Preparedness of at-large VOs for SL5 migration of LHC sites.
Current general concerns pointed out by VOs
- Globus 5's adoption timeline and VOs' preparedness.
- Workload Management solutions and choices.
- Opportunistic storage availability.
- Accounting discrepancies - e.g., uncaught/mismatched exit codes, pilots.
- Lack of real-time job status monitoring.
- Lack of accuracy in advertisement of heterogeneous subcluster parameters.
- Fermilab VO implemented GIP configuration changes; working on documentation.
- Impact of FNAL FCC electrical power outage/regulation - Dzero, CDF, Fermi-VOs, (+ CMS).
- Preemption/Eviction remains to be fully addressed. Dan F, Brian, and Abhishek worked with CMS T2s to enforce Dan Bradley's solution: sites should use LastHeardFrom instead of the more popular CurrentTime. Needs more discussion.
- Possible need for phase-II of Joint Taskforces in coming months: nanoHUB, SBGrid, ALICE.
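The accounting discrepancy noted above (for example, 50% of wall hours counted as wasted in the OSG resource view versus 0% in the Geant4 view) is consistent with the two views keying on different exit codes. The following is only an illustrative sketch: the record fields `pilot_exit_code` and `payload_exit_code` and the job data are hypothetical, and this is not the actual OSG accounting logic.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical accounting record for one pilot-run job."""
    wall_hours: float
    pilot_exit_code: int    # assumed: what the site-side (resource) view sees
    payload_exit_code: int  # assumed: what the VO's own bookkeeping sees


def wasted_fraction(jobs, exit_code_of):
    """Fraction of wall hours spent in jobs whose chosen exit code is non-zero."""
    total = sum(j.wall_hours for j in jobs)
    if total == 0:
        return 0.0
    wasted = sum(j.wall_hours for j in jobs if exit_code_of(j) != 0)
    return wasted / total


# Made-up data: the second pilot exits non-zero although its payload succeeded.
jobs = [
    JobRecord(wall_hours=10.0, pilot_exit_code=0, payload_exit_code=0),
    JobRecord(wall_hours=10.0, pilot_exit_code=1, payload_exit_code=0),
]

resource_view = wasted_fraction(jobs, lambda j: j.pilot_exit_code)
vo_view = wasted_fraction(jobs, lambda j: j.payload_exit_code)
print(f"resource view: {resource_view:.0%} wasted, VO view: {vo_view:.0%} wasted")
```

With identical wall hours, the same jobs can look half wasted from the resource side and fully successful from the VO side, which is why agreeing on which exit code accounting should record matters.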
- The last security reports were on 10/29 and 8/20
- Top issues and ongoing work items
- IGTF's new distribution layout: MD5 to SHA-1 changes
- Doug is testing our CA package generation scripts. We will make an ITB testing plan once we understand the new layout better.
- Action Items from the ID mgmt workshop
- Working on an easy-to-use certificate life cycle management tool for the desktop. Found some tools from European CAs, but none are complete. Evaluated Fermi's NetIDManager solution, but it does not solve everything either. No plans for development or for adopting a specific tool yet. We will work on that later with STG once we have a more concrete plan.
- Worked with SBGrid to shorten their registration workflow. It was an 8-step process; we managed to eliminate a couple of steps and have ideas for eliminating more.
- Examined Federated CAs and LIGO CAs. Our understanding is that the LIGO CA is ready for IGTF accreditation.
- Certificate issuance is too heavyweight and too slow: this was the biggest complaint from the workshop. We installed monitoring of Agents' behavior so we can understand where the bottlenecks are. Had a series of meetings with all RA agents. Changed the process slightly to include the end user. We are monitoring whether Agents are following the changed process. Noticed that ATLAS agents are more inconsistent and slower than those of other VOs. Worked with the BNL service desk to understand the reasons. Sponsor problems should be solved by ATLAS, since only they know who their sponsors are. Relayed these to Hover.
- Follow up meeting to ID Mgmt Workshop at AHM
- LIGO, CILogon, and ESNet are coming
- Have a session on Federated CAs.
- VO risk scenarios were asked for at the last workshop. We will continue with those in this meeting.
- T3 support
- Vulnerability detection at the OS level was requested by ATLAS and CMS T3 admins. We got a tool from EGEE, Pakiti, and installed and configured it for testing purposes. T3 admins liked how the tool works. Now we need to figure out how to securely distribute the vulnerability reports to T3 sys admins. Pakiti has no feature for that yet; v3 will hopefully add one. For now we are considering setting up automated emails to T3 admins, nothing very fancy. We will check with Benjamin and Snihur whether they find the outputs useful.
- Documentation and communication with T3s are going well. We continue to call T3 admins to introduce ourselves, and we are also working on a security best practices document for them. We attend T3 meetings and constantly communicate with Snihur and Benjamin.
- Another software request: a feature add-on to the vdt-ca-manage tool. 2 days' worth of work.
- We need to find a home for the Pakiti server. We are considering asking BNL and FNAL to provide this service.
- GUMS bug fixes and releases will not happen for at least two more months
- Too many requests and work items popping up unexpectedly. Sent a list of WBS items vs ongoing work to EB for prioritization
- Debian CA package, IGTF distribution, etc
- CRL distribution problems bugged a lot of site admins. We found a solution, but that involves running a service. So we will postpone this for later.
- Items last reported: what happened to them
- Security Announcement problem -- completed. We set up a security blog and send the announcements to security contacts; contacts now get the announcements.
- Monitoring DOEGrids CA database for un-mitigated risks. Complete. We are monitoring the Agents' behavior. We did an audit of some agents and discovered they were not archiving the vetting trail as they should. Changed the process to ensure they do.
- Automated VO package updates: completed. Tested in ITB.
- ID mgmt workshops: Completed
- Tier 2 drills -- Completed.
- 28 Jan 2010