-- DanFraser - 02 May 2012

Action/Significant Items:


  • (to be updated after the meeting) Alain, Mats, Xin, Armen, Brian, Suchandra, Tony, Marco, Rob Q., Scott T., Mine, Chander, Dan

CMS (Tony)

  • Job statistics for last week
    • 8,281 jobs/day

  • Transfer statistics for last week
    • ~562 TB/day

ATLAS (Armen & Xin)

  • General production status
    • Data taking has been stable since the LHC came back from the technical stop last Tuesday. Some beam quality issues remain to be fixed. Luminosity collected by ATLAS so far: ~1.15 fb^-1.
    • US ATLAS production was quite stable during the week, averaging about 18-20K running jobs, mostly simulation.
  • Job statistics for last week.
    • Gratia report: 1.4M pilot jobs run on USATLAS sites, with CPU/walltime ratio of 90%
    • Real jobs processed by US sites last week, as reported by the PanDA monitor
      • 1M
  • Data Transfer statistics for last week
    • Data transfer rate at the BNL T1 was 250-300 TB/day last week.
  • Issues
    • MWT2 pilot submission problem: upgraded Condor on the submit host to 7.7.6, which includes many improvements that let Condor-G handle GRAM5 CEs more gracefully. Also increased the interval between job status probes against the remote CE. The situation is much more stable now.
    • GlideInWMS and PanDA: under discussion; ATLAS will test it out.

Grid Operations Center (Rob Q.)

Operations last week

  • GOC Services Availability/Reliability
  • Current Status
  • Kyle and Elizabeth were at Condor Week for training
  • ITB release, Release note
  • FermiGrid Ops
    • Bluearc NAS appliance outage at FermiGrid will not affect OSG-facing services (Gratia, RESS, VOMS) but will affect all job submission to FNAL_FERMIGRID gatekeepers. Job submissions will be blocked from 8AM CST 5/1/2012 to approximately 9AM CST 5/2/2012.
  • WMS Glide In Factory
    • Thursday around noon PT, a bad pressure gauge caused a cooling problem at UCSD; the water was effectively shut off.
    • All T2 services had to be shut down
    • Important services were brought back up after 4 hours - including UCSD Glidein factory
    • Cooling system brought back up on Friday
    • Continued to add O(100) new entries for ATLAS in the UCSD Factory

Operations this week

  • Rob Q. on vacation this week, Scott next week.
  • /usr/local on repo.grid.iu.edu to be grown to 128 GB from 16 GB
  • www.grid.iu.edu retired
  • Physical move of an unused VM host from IUPUI to IUB
    • Will host GOC internal services (monitor, jump, etc.)
  • FermiGrid Ops
    • No changes planned.
  • WMS Glide In Factory
    • Plan to test Condor 7.6.7 on the ITB Factory. The GOC Factory will likely not be upgraded this week, but in the next production window.

Campus Infrastructures / HTPC (Dan, Brooklin)

Integration (Suchandra)

  • ALICE started testing simple jobs on ITB sites
  • Will be talking to Alain to discuss testing tasks for the next several weeks

Site Coordination (Marco)

Note that this report lists the currently active resources in OSG. A site that is down or not reporting is not counted, so the numbers may fluctuate. Each line gives the current number, with the variation from last week in parentheses. You can find a table with current OSG and VDT versions at http://www.mwt2.org/~marco/myosgldr.php
  • Site update status (from MyOSG as of today):
    • Most recent production version is OSG 3.1.1 / 1.2.28
    • 23 (0) OSG 3.x resources (4 are 3.1.1)
    • 72 (-3) OSG 1.2.X resources (15 are 1.2.28)
    • 2 (0) OSG 1.0.X resources (0 are 1.0.6)
    • 1 (-1) OSG 1.0.0 resources
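
Since each line above gives the current count and the variation in parentheses, last week's counts can be recovered by subtracting the variation. A minimal sketch, assuming the four lines above cover all reported categories; the dictionary layout and the names `site_counts` and `last_week` are illustrative and are not a MyOSG export format:

```python
# Current resource counts and week-over-week variation, copied from the list above.
# The structure and names here are illustrative only.
site_counts = {
    "OSG 3.x": (23, 0),
    "OSG 1.2.X": (72, -3),
    "OSG 1.0.X": (2, 0),
    "OSG 1.0.0": (1, -1),
}


def last_week(current, delta):
    """Last week's count is the current count minus the reported variation."""
    return current - delta


total_now = sum(cur for cur, _ in site_counts.values())
total_prev = sum(last_week(cur, d) for cur, d in site_counts.values())

for name, (cur, d) in site_counts.items():
    print(f"{name}: {last_week(cur, d)} -> {cur} ({d:+d})")
print(f"Total: {total_prev} -> {total_now} ({total_now - total_prev:+d})")
```

Because sites that are down or not reporting are not counted, a negative total here does not necessarily mean resources left OSG.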

User Support (Chander, Mats)

  • SuperB VO is now running test jobs at various sites; and the Ohio Supercomputing Center cluster is almost integrated as a new site in OSG under the SuperB VO
  • PNNL is integrating a site onto OSG as part of Belle VO
  • U-Maryland Institute for Genome Sciences is integrating a site onto OSG as part of OSG VO
  • Plan is being executed to re-connect Baker Lab flocking from RENCI to UCSD by mid-May

Security (Mine)

  • Emergencies/Vulnerabilities/Incidents
    • None.
  • Operational Items.
    • Incident Drill for Tier3s. We are conducting the drill with ATLAS Tier3 sites on May 15th. We contacted the prospective sites; 7 sites committed to participating.
    • The annual Security Test and Controls and the update of the Risk Assessment Plan have started.
    • New CA bundle is coming out.
    • IGTF and SPI meetings are taking place in Munich. I will join remotely. Will report next week with significant work items.
    • Old GOC CA rpm repo will be turned off on May 31st. We contacted sites still using this repo in April. We will remind them again this week and send out a broad OSG announcement as a reminder next week. The sites that were using the old repo all promised to switch to the new repo. We will confirm this week.
    • Worker node certificates. We have been exploring whether it is technically feasible to not have any service certs on WNs. We all seem to agree there is no longer any need for these certs. We should start the work to get rid of them in production. Tasks such as communicating with sites and documenting the change are needed. Who will lead this work? Software, Ops, Security?

The full report with links is available at https://twiki.grid.iu.edu/bin/view/Production/WeeklyProductionMeetings

Topic revision: r12 - 08 May 2012 - 20:53:44 - ScottTeige