Gratia Release v0.34 and v0.34.10 (5/14/08 - 5/29/2008)

Overview

The main purpose of this release (v0.34.10) is to improve DN/FQAN collection and to require authentication for administering the Collector.

Reporter Improvements (v0.34.9)

  • Report displaying those jobs that have been running more than X days by VO, where X is a week and a month (RR23)
  • Report displaying those sites that show 0 Wall Duration for more than X days - where X is a week and a month. (RR24)
  • Update the osg daily report for Birt 2.2
  • Periodically generated (daily, weekly and/or monthly) csv files that are available for immediate view without need to read db records
    1. CPUday per day for the last month/30 days for each VO
    2. CPUday per day for the last month/30 days for each site
    3. total CPUweeks per week for the last 52 weeks (year)
  • Extend User daily report to include VO name.
  • Enable easy selection/reporting of CMS-Tier-2 sites.

Collector Improvements (v0.34.7)

  • Enable authentication in the administration pages
  • Implement looking up the DN, Role and VO directly from certificate
  • Update of the record checksum to exclude DN, Role, VO
  • Review and improve the Summary table creation mechanism
  • Update Collector to Collector connection to send more than one record on each connection
  • Add VO to User Summary Table
  • Replace pre-duplicate check by exception handling

Probe Improvements (v0.34.1)

  • Document on the issues related to full and complete probe black-out detection.
  • Globus patched to interrogate delegated proxies wherever available for DN and FQAN information.
  • Condor probe updated:
    1. to get information on WS jobs;
    2. to use ClassAds generated by PER_JOB_HISTORY_DIR (if available) to avoid expensive calls to condor_history. Will fall back on condor_history;
    3. to facilitate "local" probes by means of a constraint option so probes only process certain types of job;
    4. to avoid double-counting when information about the same job is received from multiple sources.
  • Package dCache probe as RPM.
    1. dCache / SRM probes now available (see dCache release notes).
  • Minor improvements to SGE probe from Shreyas Cholia.
  • PBS/LSF probes improved to prevent misreading an incomplete line in the batch manager log.
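
The PBS/LSF item above guards against parsing a log line that the batch manager has only partly written. A minimal Python sketch of that idea follows. It assumes (this is not taken from the probe source) that a line is complete only once it is newline-terminated; `split_complete_lines` is a hypothetical helper name.

```python
def split_complete_lines(buffer):
    """Split text read from a log into (complete_lines, remainder).

    Assumption: a line counts as complete only if it ends with a newline.
    Whatever follows the last newline is returned as the remainder, so the
    caller can prepend it to the next read instead of parsing a partial line.
    """
    if not buffer:
        return [], ""
    head, sep, tail = buffer.rpartition("\n")
    if not sep:
        # No newline at all: the whole buffer is one incomplete line.
        return [], buffer
    return head.split("\n"), tail
```

A caller would keep the remainder between reads and only hand the complete lines to the record parser.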

Anticipated downtime

It is expected that this release will require the Gratia services and reporting to be unavailable beginning at:
  • Start: 5/14/08 hh:mm CST
  • Available: 5/14/08 hh:mm CST

The changes affecting downtime for this release are:

  1. Length of time to make a backup of the database to the backup area using mysqlhotcopy
  2. Installation and validation on the 6 Gratia schemas

Build the v0.34.10 for distribution

This release was done in 2 stages and a couple interim stages when problems were detected:
  • v0.34 - for the VDT 1.10.1 distribution (ITB 0.9.0)
  • v0.34.1 - for local distribution. Changes were required to the local scripts/configurations for the actual conversion at FNAL on gratia08 and gratia09 collectors.

  1. Make sure your build area contains all committed changes.
    • cvs update
  2. In gratia/build-scripts/Makefile , change the version_default to:
    • version_default = v0.35
    • commit the change
  3. Tag the release:
    • cvs tag -R v0-34 gratia
      (This tags everything that is in your build directory and has presumably been tested.)
  4. Into a new area, export the tagged release:
    • cvs export -d gratia-v0.34 -r v0-34 gratia
  5. Build it for the release (this ensures that tar files are produced for VDT):
    • cd gratia-v0.34/build-scripts
    • source setup-jdk15.sh
    • make release
  6. Copy the build/release area to the gratia home directory for posterity:

    • mkdir /home/gratia/gratia-releases/v0.34
    • cp -pr /home/weigand/cdcvs/gratia-v0.34/* /home/gratia/gratia-releases/v0.34/.

  7. Copy the built tar files to the release area:
    • scp ../target/*_v0.34.tar flxi07.fnal.gov:/afs/fnal.gov/files/expwww/gratia/html/Files/
  8. Update the version number on the services release TWiki page:
    • Edit and update the TWiki variable ReleaseVersion.

Status: Completed on 5/10/2008 08:00 CDT
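
The tag and export steps above use two spellings of the same release: the version (v0.34) and the CVS tag (v0-34), because CVS tag names may not contain a period. A small Python sketch of that mapping is given here for illustration; `cvs_tag_for` and `release_commands` are hypothetical helpers, not part of the Gratia build scripts.

```python
import re


def cvs_tag_for(version):
    """Map a release version such as 'v0.34.10' to a CVS tag such as 'v0-34-10'.

    CVS tag names may not contain '.', so each dot is replaced with '-'.
    """
    if not re.fullmatch(r"v\d+(\.\d+)+", version):
        raise ValueError("unexpected version format: %r" % version)
    return version.replace(".", "-")


def release_commands(version, module="gratia"):
    """Return the tag and export command lines used in steps 3 and 4."""
    tag = cvs_tag_for(version)
    return [
        "cvs tag -R %s %s" % (tag, module),
        "cvs export -d %s-%s -r %s %s" % (module, version, tag, module),
    ]
```

The sketch only builds the command strings; running them is left to the person doing the release.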


*Update: 5/11/2008 14:45 CDT:*
Had to modify Makefile to include the voms_servers and voms-server.sh files in the Gratia-Services distribution for VDT. Re-tagged, rebuilt, moved to the flxi07.fnal.gov site.


*Update: 5/12/2008 14:15 CDT:*
Problems encountered when working with an empty database. Previous testing was done with an existing database. The following programs required modification:

  • collector/gratia-services/net/sf/gratia/services/ListenerThread.java
  • collector/gratia-services/net/sf/gratia/services/NewVOUpdate.java
  • collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
  • common/configuration/hibernate.cfg.xml

Steps 3 through 6 were repeated for these modules. Note: To tag the individual module (example):

  • cvs tag -F v0-34 collector/gratia-services/net/sf/gratia/services/ListenerThread.java

Provided a new configure_gratia to VDT. It was tested in a Condor environment as thoroughly as we could, with particular attention to validating the Condor PER_JOB_HISTORY_DIR setting, which is critical.

This is the v0.34 used in VDT 1.10.1 (ITB 0.9.0).


*Update: 5/14/2008 15:45 CDT: Tagged as v0-34-1*
The following modules were modified for changes to local (non-VDT) installations/upgrades. Tests are currently being run by Chris Green on a gratia instance at gratia07:/test/data/mysqldb/gratia using this software:

  • gratia/build-scripts/Makefile
  • gratia/common/configuration/cleanup_server_lib
  • gratia/common/configuration/configure-collector
  • gratia/common/configuration/update-gratia-local

Steps 3 through 6 were repeated for these modules. Note: The tagging was done as:

  • cvs tag -R v0-34-1 gratia

and the export done as:

  • cvs export -d gratia-v0.34.1 -r v0-34-1 gratia

and the build done as:

  • cd gratia-v0.34.1/build-scripts
  • source setup-jdk15.sh
  • make release

and the copy to the release area:

  • scp ../target/*_v0.34.1.tar flxi07.fnal.gov:/afs/fnal.gov/files/expwww/gratia/html/Files/.

The version number on the services release TWiki page was then updated.

Also note these changes do not affect the current VDT 1.10.1 release of Gratia v0.34. The latest version was copied in as in steps 6 and 7, but VDT does not need to access this version.


Update: 5/20/2008 14:42 CDT: Tagged as v0-34-2
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecord.java
U collector/gratia-services/net/sf/gratia/storage/SummaryUpdater.java
U collector/gratia-services/net/sf/gratia/storage/Utils.java
U common/configuration/collector-pro.dat
U common/configuration/service-configuration.properties
U probe/build/gratia-probe.spec
U probe/common/Gratia.py
U probe/common/GRAM/JobManagerGratia.pm
U probe/condor/condor_meter.pl
U reporting/summary/PSACCTReport.py

cvs tag -R v0-34-2 gratia


5/20/2008 16:46 CDT
cvs tag -F v0-34-2  collector/gratia-services/net/sf/gratia/services/CollectorService.java
cvs tag -F v0-34-2  collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
cvs tag -F v0-34-2  common/configuration/summary-procedures.sql


5/21/2008 11:03 CDT
Since the osg_daily record does not have a jobid or anything similar, we must keep the
UserIdentity field in the md5 calculation; otherwise we find many false duplicates.
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecord.java

Add support for a DN coming from cron
(/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=fermigrid0.fnal.gov/CN=cron/CN=Keith Chadwick/CN=UID:chadwick).
In that case, use the 3rd CN.
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecordUpdater.java

Re-add VOName Correction link
U collector/gratia-administration/WebContent/dashboard.jsp

use full md5 for ps accounting too
U common/configuration/collector-pro.dat

cvs tag -R v0-34-3 gratia


5/22/2008 10:45 CDT Changes:
Re-add missing Host field lookup needed by psacct
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
U common/configuration/summary-procedures.sql

actually use the common name we just found
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecordUpdater.java

cvs tag -R v0-34-4 gratia


5/27/2008 13:10 CDT Changes for gratia_psacct
report library is in production-osg only
  reporting/gratia-reports/WebContent/reports/production-fermi/gratia-lib1.rptlibrary is no longer in the repository
  reporting/gratia-reports/WebContent/reports/production-osgdaily/gratia-lib1.rptlibrary is no longer in the repository
U reporting/gratia-reports/WebContent/reports/production-osg/gratia-lib1.rptlibrary

cvs tag -R v0-34-5 gratia

One more change:

Give up trying to guess which CN is the "right" one: record them all.
  U collector/gratia-services/net/sf/gratia/storage/JobUsageRecordUpdater.java

cvs tag -F v0-34-5  collector/gratia-services/net/sf/gratia/storage/JobUsageRecordUpdater.java
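
The change above stops guessing which CN is the "right" one and records them all. The real change is in JobUsageRecordUpdater.java; the Python sketch below only illustrates the idea. It assumes a slash-separated DN whose values contain no "/" characters, and `common_names` is a hypothetical name.

```python
def common_names(dn):
    """Return every CN value found in a slash-separated DN, in order.

    Assumption: components are separated by '/' and no value contains '/'.
    """
    names = []
    for part in dn.split("/"):
        key, sep, value = part.partition("=")
        if sep and key.strip().upper() == "CN":
            names.append(value)
    return names


# The cron robot DN quoted in the 5/21/2008 entry.
cron_dn = ("/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=fermigrid0.fnal.gov"
           "/CN=cron/CN=Keith Chadwick/CN=UID:chadwick")
```

For `cron_dn` the helper returns four entries, the third of which is the person's name that the earlier 5/21 change tried to pick directly.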


5/28/2008 12:16 CDT Changes:
Config change to address "APPARENT DEADLOCK" problem.
  U common/configuration/hibernate.cfg.xml

cvs tag -R v0-34-6 gratia

More 5/28/2008 13:00 CDT

uncomment WeeklyUsageByVO and comment out WeeklyUsageByVORanked (as intended in previous ci)
  U common/configuration/create_build-stored-procedures-sql

Update DB version to trigger stored procedure reload.
  U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java

Add collector_host to enable SSL operation. Configure ITB to run 5 threads.
Add conversion collector for gratia-fermi on 8884.
   U common/configuration/collector-pro.dat

Need to cope with a previous incomplete upgrade (DB 34 did *not* reload
the triggers like it was supposed to IFF the existing DB version was
already 33).
  U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java

cvs tag -F v0-34-6 common/configuration/create_build-stored-procedures-sql
cvs tag -F v0-34-6 collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
cvs tag -F v0-34-6 common/configuration/collector-pro.dat


5/29/2008 10:00 CDT Changes:
Summary table rebuild required to correct discrepancies with ResourceType filtering
  U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java

Trigger filter logic from ResourceType should match summary table build.
  U common/configuration/summary-procedures.sql

Expand the list of ResourceType values allowed in the summary table to Batch and RawCPU
  U common/configuration/build-summary-tables.sql 

cvs tag -R v0-34-7 gratia


6/3/2008 17:00 CDT Changes:
Remove (after verification) grouping from views and trigger reload of views.
  gratia/common/configuration build-summary-view.sql
  gratia/collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java,

Fix problem with old job data removal.
  gratia/probe/common Gratia.py
  
More statistics when operating in verbose mode.
Only set Grid=Local if we're sure that the ClassAd attribute should have
been added by the JobManager (assuming it passed through same) but was
not (indicating a local job).
Improved logic testing whether we can rely on the check for the
JobManager in the case that the site is running Condor for managed fork
and PBS for standard batch jobs.
  gratia/probe/condor condor_meter.pl

Turn verbose mode on by default and use DebugPrint.py
  gratia/probe/condor condor_meter.cron.sh

Fix behavior upon receiving blank lines (would truncate input).
  gratia/probe/common DebugPrint.py
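
The DebugPrint.py source is not shown in these notes, so the cause of the truncation is not stated. One common way for blank lines to truncate input (an assumption here) is a read loop that strips each line before testing for end of input. A minimal Python sketch of a loop that avoids this follows; `read_all_lines` is a hypothetical name.

```python
import io


def read_all_lines(stream):
    """Yield every line from stream, including blank ones.

    readline() returns an empty string only at end of input; a blank line
    comes back as a lone newline. Testing the raw value before stripping
    keeps a blank line from being mistaken for end of input.
    """
    while True:
        raw = stream.readline()
        if raw == "":
            break
        yield raw.rstrip("\n")


# Example input with a blank line in the middle.
sample = io.StringIO("first\n\nlast\n")
```

Iterating over `read_all_lines(sample)` continues past the blank line rather than stopping at it.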

Incorporate fixes for condor and common packages:
- Correct cleanup of no-longer-useful files in gratia/var/data.
- Improve DebugPrint.py in the case that input contains blank lines.
- Improve logic used in condor probe to decide whether we can use the absence
  of the GratiaJobOrigin ClassAd attribute to infer that a job is local.
- Condor probe is now verbose but prints to main Gratia log.
- Condor probe only assigns grid=Local to jobs it's really sure are local.
Bump version number to correct local permissions problem with DebugPrint.py.
  gratia/probe/build gratia-probe.spec

cvs tag -R v0-34-8 gratia


6/6/2008 12:00 CDT Changes:
Fixed birt "Export Data": shows all columns (penelope)
  gratia/reporting/gratia-reports/WebContent/reports/production-osg/WeeklyUsageByVO-ranked.rptdesign,
  gratia/reporting/gratia-reports/WebContent/reports/production-osg/UsageBySiteByDate-ranked.rptdesign
  gratia/reporting/gratia-reports/WebContent/reports/production-osg/2007WeeklyUsageByVO-ranked.rptdesign

First attempt to correct the parsing of the bundled records sent by the replicator.
Also fix the content of the replication admin page (rowcount).
Clarify the debug message listing the number of input messages and records.
Change the envelope for bunching to be record-type independent. Fix the test for adding it.
U collector/gratia-services/net/sf/gratia/services/ChecksumUpgrader.java
U collector/gratia-services/net/sf/gratia/services/ListenerThread.java
U collector/gratia-services/net/sf/gratia/services/ReplicationDataPump.java
U collector/gratia-services/net/sf/gratia/storage/MetricRecordLoader.java
U collector/gratia-services/net/sf/gratia/storage/UsageRecordLoader.java


Updated to create a gratia-release file in the top level of every deployable
war file and in the gratia directory. The gratia-release file contains, for example:
Gratia release: v0.34.x 
  Build date: Thu Jun  5 14:40:53 CDT 2008 
  Build host: gratia06.fnal.gov 
  Build path: /home/weigand/cdcvs/gratia
  Builder: uid=9789(weigand)
U gratia/build-scripts Makefile
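
The gratia-release file is written by the Makefile; its layout can be illustrated with the short Python sketch below. This is not the Makefile rule itself, and `format_gratia_release` is a hypothetical helper that only reproduces the five fields shown in the example above.

```python
def format_gratia_release(version, build_date, build_host, build_path, builder):
    """Render the text of a gratia-release file in the layout shown above."""
    return (
        "Gratia release: %s\n"
        "  Build date: %s\n"
        "  Build host: %s\n"
        "  Build path: %s\n"
        "  Builder: %s\n"
    ) % (version, build_date, build_host, build_path, builder)


# Values copied from the example in the text.
example = format_gratia_release(
    "v0.34.x",
    "Thu Jun  5 14:40:53 CDT 2008",
    "gratia06.fnal.gov",
    "/home/weigand/cdcvs/gratia",
    "uid=9789(weigand)",
)
```

Keeping the file as plain "key: value" lines makes it easy to check which build a deployed war file came from.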


Added the service.admin.DN.0 attribute so manual modification is not
required on each install.  This needs to be changed later to use a
VOMS fermigrid role.
U gratia/common/configuration collector-pro.dat,

Unique index on old md5 column is no longer needed after upgrade is complete.
Update TableStatistics triggers to not attempt to put a null value into a non-null column.
U collector/gratia-services/net/sf/gratia/services/ChecksumUpgrader.java
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
U common/configuration/JobUsage.hbm.xml
U common/configuration/post-install.sh

Fix problem with history replay.
U collector/gratia-services/net/sf/gratia/services/ListenerThread.java

Probe changes not affecting collector:
U probe/build/gratia-probe.spec
U probe/common/GRAM/JobManagerGratia.pm
U probe/glexec/README
U probe/psacct/gratia-psacct

Save only the part of the bundle that describes the current record.
U collector/gratia-services/net/sf/gratia/services/ListenerThread.java

cvs tag -R v0-34-10 gratia

6/10/2008 12:00 CDT

Changes:

Remove Grid from attributes taken into account for checksum.
New column for GridDescription.
Improve debug messages when handling duplicates.
Correct minor problem in warning message.
U collector/gratia-services/net/sf/gratia/services/ListenerThread.java
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecord.java
U collector/gratia-services/net/sf/gratia/storage/UsageRecordLoader.java
U common/configuration/JobUsage.hbm.xml

  • cvs tag -F v0-34-10 collector/gratia-services/net/sf/gratia/services/ListenerThread.java
  • cvs tag -F v0-34-10 collector/gratia-services/net/sf/gratia/storage/JobUsageRecord.java
  • cvs tag -F v0-34-10 collector/gratia-services/net/sf/gratia/storage/UsageRecordLoader.java
  • cvs tag -F v0-34-10 common/configuration/JobUsage.hbm.xml
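
These notes describe a record checksum that deliberately leaves out some attributes (DN, Role and VO earlier in the release, and now Grid). The Python sketch below shows the general technique of hashing a record while skipping excluded fields. The field names, the serialization and the helper name `record_checksum` are assumptions for illustration; the actual calculation is in JobUsageRecord.java and may differ.

```python
import hashlib

# Hypothetical field names; the real attribute list lives in JobUsageRecord.java.
EXCLUDED_FROM_CHECKSUM = frozenset({"DN", "Role", "VO", "Grid"})


def record_checksum(record, excluded=EXCLUDED_FROM_CHECKSUM):
    """Return an upper-case MD5 hex digest over the non-excluded fields.

    Fields are serialized as sorted "key=value" lines so the result does not
    depend on dictionary ordering.
    """
    parts = []
    for key in sorted(record):
        if key in excluded:
            continue
        parts.append("%s=%s" % (key, record[key]))
    joined = "\n".join(parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest().upper()
```

With this scheme two copies of a record that differ only in an excluded attribute hash to the same value, which is what lets a duplicate be recognized after its DN, VO or Grid value has been corrected.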

6/11/2008 09:00 CDT

Changes:

Refrain from overwriting the locally-generated ExtraXML attribute with
that received from an external source.
P collector/gratia-services/net/sf/gratia/services/ListenerThread.java

  • cvs tag -F v0-34-10 collector/gratia-services/net/sf/gratia/services/ListenerThread.java

Status of checksum differences
Philippe and Chris have ascertained that the reason for the difference between the checksums is benign: the ExtraXML which is used in the checksum calculation is different due to legitimate differences in the make-up of the incoming record. The fact that there are no differences in checksum for anything received since 5/27 on gratiax33 is encouraging.

One change was made to ListenerThread.java. Previously, the ExtraXML produced from the parts of a record that the receiving server could not understand was overwritten by the incoming ExtraXML from the replicated (or history) record. Now the locally generated ExtraXML is used and any incoming ExtraXML is ignored.

Steps 3 through 6 were repeated for these modules.
Note: the export is done as yourself:

  • cvs export -d gratia-v0.34.10 -r v0-34-10 gratia

then copy it to the /home/gratia/gratia-releases area (as the gratia user):

  • cp -pr /home/weigand/cdcvs/gratia-v0.34.10 /home/gratia/gratia-releases/.

and build it (as the gratia user):

  • cd /home/gratia/gratia-releases/gratia-v0.34.10/build-scripts
  • source setup-jdk15.sh
  • make release

then make a copy of the tarballs in the /home/gratia/gratia-releases directory (as the gratia user):

  • cp -p /home/gratia/gratia-releases/gratia-v0.34.10/target/*_v0.34.10.tar /home/gratia/gratia-releases/tarballs/.

and copy them to the release distribution area (as yourself):

  • scp /home/gratia/gratia-releases/gratia-v0.34.10/target/*_v0.34.10.tar flxi07.fnal.gov:/afs/fnal.gov/files/expwww/gratia/html/Files/.

Finally, update the version number on the services release TWiki page.

Status: Finally Completed v0.34.9 on 6/11/2008 09:00 CDT

Collectors and Databases Affected

The following Gratia collectors and databases will be converted with this release:

Schema                  MySql port              Collector URL           Size (bytes)  Size (rows)  Max dbid  Duplicates
fermi_transfer          gratia06.fnal.gov:3320  gratia08.fnal.gov:8886  5.2M          0            0         0
gratia_osg_transfer     gratia06.fnal.gov:3320  gratia09.fnal.org:8886  5.2M          0            0         0
gratia_osg_daily        gratia06.fnal.gov:3320  gratia09.fnal.gov:8884  89M           84,437       85117     0
fermi_itb               gratia06.fnal.gov:3320  gratia08.fnal.gov:8881  558M          58,082       159187    92,879
gratia_qcd              gratia06.fnal.gov:3320  gratia08.fnal.gov:8883  4.3G          1,210,400    1588366   0
gratia_osg_integration  gratia06.fnal.gov:3320  gratia09.fnal.gov:8885  8.1G          1,438,686    2561449   691,626
gratia_psacct           gratia06.fnal.gov:3320  gratia08.fnal.gov:8882  13G           7,763,036    7792895   0
gratia_itb              gratia06.fnal.gov:3320  gratia09.fnal.gov:8881  21G           8,327,662    8958895   101,365
fermi_osg               gratia06.fnal.gov:3320  gratia08.fnal.gov:8880  24G           8,707,956    8712636   3,652
gratia                  gratia06.fnal.gov:3320  gratia09.fnal.gov:8880  117G          68,418,063   70987563  2,231,774

Shutdown and database backup

The nightly backups will suffice for this implementation.

Small DB - Upgrade and implementation

The upgrades should be single-threaded, that is, performed for each database schema one at a time.

We will perform these upgrades based on the size of the individual database schema, in ascending order.
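
The ascending-size ordering can be derived from the sizes in the table. The Python sketch below is one way to do it, under the assumption that the M and G suffixes are binary multiples; `size_in_bytes` and `upgrade_order` are hypothetical helper names.

```python
# Assumption: the suffixes in the size column are binary multiples.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def size_in_bytes(text):
    """Convert a size such as '5.2M' or '117G' to an approximate byte count."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return float(text[:-1]) * UNITS[text[-1].upper()]
    return float(text)


def upgrade_order(schemas):
    """Return schema names sorted by ascending database size."""
    ordered = sorted(schemas.items(), key=lambda item: size_in_bytes(item[1]))
    return [name for name, _size in ordered]


# A few rows from the table above.
sizes = {
    "gratia": "117G",
    "fermi_itb": "558M",
    "gratia_qcd": "4.3G",
    "gratia_osg_daily": "89M",
}
```

Starting with the smallest schemas means any problem in the upgrade procedure shows up on a database that is quick to restore and retry.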

The DupRecord table (duplicates) should be removed before starting the upgrade.

The general procedure for this upgrade is:

  1. Stop the tomcat service
    • service TOMCAT_INITD_SERVICE stop
  2. Install the new software on the Gratia tomcat instance:
    • pswd=xxx
    • source=/home/weigand/cdcvs/gratia-v0.34.10
    • pgm=/home/weigand/cdcvs/gratia-v0.34.10/common/configuration/update-gratia-local
    • $pgm -d $pswd -S $source -s DATABASE
  3. Tar the existing log files and save to a backup area. Then clean the log file directory.
    • date=`date '+%Y%m%d'`
    • cd TOMCAT_LOCATION/logs/
    • tar zcf /data/gratia_tomcat_logs_backups/tomcat-TOMCAT_INITD_SERVICE.$date.tgz *
    • rm -f *
  4. Effective with this release, there is an administrative login process that, by default, does not allow access to the admin functions. You will need to update the TOMCAT_LOCATION/gratia/service-configuration.properties file with the following attributes. (Note that if you forget to do this, the administrator can be added at any time without restarting the tomcat service):
    • service.admin.DN.0=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal
    • service.admin.DN.1=/DC=org/DC=doegrids/OU=People/CN=Christopher H. Green 851859
    • service.admin.DN.2=/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
    • service.admin.DN.3=/DC=org/DC=doegrids/OU=People/CN=Steven Timm 74183
    • service.admin.DN.4=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/CN=UID:timm
    • service.admin.DN.5=/DC=org/DC=doegrids/OU=People/CN=Dan Yocum 346615
    • service.admin.DN.6=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Dan Yocum/CN=UID:yocum
  5. Start the tomcat service
    • service TOMCAT_INITD_SERVICE start
      When the tomcat service initializes, it will detect any schema changes that have been made and a conversion process will begin.
    • tail the log files in TOMCAT_LOCATION/logs. When the conversion process completes, the log will show the following message: "INFO: Server startup in xxx ms"
    • In your browser, access the gratia-administrative web service, select the System / Administration menu option in the left menu, then scroll down to the Starting/Stopping Database Update Services section and select the Stop Update Services link.
  6. Wait until the schema upgrade is complete, then start the Gratia update services for the database schema just upgraded.
    • In your browser, connect to the Gratia administrative services url for each of the databases.
    • Select the System / Administration menu option in the left menu
    • Then scroll down to the Starting/Stopping Database Update Services section and select the Start Update Services link.
    • As the collector/tomcat host update service is started, monitor the tomcat logs files for any errors.
    • Bring up the gratia administration web interface and verify that the collectors are processing the data on the status page.
    • Bring up the gratia reporting web interface and verify that the reports look reasonable while still tail'ing the log files.
  7. Run the static reports cron as root and verify these reports are generated:
    • 3 pdf files and 3 csv files
  8. If satisfied, you may continue to the next database schema.
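
Step 4 of the procedure adds numbered service.admin.DN.<n> attributes by hand. The Python sketch below shows one way to append DNs that are not yet listed while continuing the numbering. It assumes the indexes should simply continue from the highest existing one (the notes do not say whether gaps are allowed), and `add_admin_dns` is a hypothetical helper.

```python
PREFIX = "service.admin.DN."


def add_admin_dns(properties_text, dns):
    """Append service.admin.DN.<n> entries for any DN not already listed.

    Existing lines are kept unchanged; new entries continue from the
    highest index already present.
    """
    lines = properties_text.splitlines()
    known = set()
    next_index = 0
    for line in lines:
        # Split on the first '=' only, because DNs themselves contain '='.
        key, sep, value = line.partition("=")
        suffix = key[len(PREFIX):]
        if sep and key.startswith(PREFIX) and suffix.isdigit():
            known.add(value.strip())
            next_index = max(next_index, int(suffix) + 1)
    for dn in dns:
        if dn not in known:
            lines.append("%s%d=%s" % (PREFIX, next_index, dn))
            known.add(dn)
            next_index += 1
    return "\n".join(lines) + "\n"
```

The function works on the file contents as text, so it can be tried on a copy of service-configuration.properties before touching the live file.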

Did not have to do the stop/start of the update services, as those services never got started.

Status v0.34 through v0.34.7

Database        Tomcat node             init.d service         tomcat location              Size  Start       Complete    Version
fermi_transfer  gratia08.fnal.gov:8886  tomcat-fermi_transfer  /data/tomcat-fermi_transfer  5.2M  5/18 09:00  5/22 13:05  v0.34.4
fermi_itb       gratia08.fnal.gov:8881  tomcat-fermi_itb       /data/tomcat-fermi_itb       558M  5/18 10:10  5/22 12:54  v0.34.4
gratia_qcd      gratia08.fnal.gov:8883  tomcat-qcd             /data/tomcat-qcd             4.3G  5/18 11:20  5/22 12:59  v0.34.4
gratia_psacct   gratia08.fnal.gov:8882  tomcat-ps              /data/tomcat-ps              13G   5/18 11:58  5/27 16:00  v0.34.5
fermi_osg       gratia08.fnal.gov:8880  fermi_osg              /data/tomcat-fermi_osg       24G   5/19 07:51  5/22 12:44  v0.34.4

Database                Tomcat node             init.d service          tomcat location               Size  Start       Complete    Version
gratia_osg_transfer     gratia09.fnal.org:8886  tomcat-osg_transfer     /data/tomcat-osg_transfer     5.2M  5/18 13:55  5/22 13:12  v0.34.4
gratia_osg_daily        gratia09.fnal.gov:8884  tomcat-osg_daily        /data/tomcat-osg_daily        89M   5/18 14:28  5/22 13:15  v0.34.4
gratia_osg_integration  gratia09.fnal.gov:8885  tomcat-osg_integration  /data/tomcat-osg_integration  8.1G  5/18 18:20  5/22 13:18  v0.34.4
gratia_itb              gratia09.fnal.gov:8881  tomcat-itb              /data/tomcat-itb              21G   5/18 18:48  5/29 10:32  v0.34.7

Issues/Problems:

  1. No host certificate, so administration functions are disabled. Will have to wait until Monday to request host and http certificates from Steve.
    • Installed host certificate on 5/12
  2. Static reports not working on fermi_transfer and osg_transfer. These databases are completely empty (zero records) and this is causing the problem.
  3. The update-gratia-local script adds a cron entry for static reports every time it is run. You have to manually remove the old entries from the crontab.
    • It is also adding it with the ssl port number where in the past it was using the non-ssl port. Resolved on 6/6
      42 0 * * * '/data/tomcat-fermi_osg/gratia/staticReports.py'
                       '/data/tomcat-fermi_osg'
                       'http://gratia-fermi.fnal.gov:8880/gratia-reporting/'
      The entry below is the newly added one:
      ##42 0 * * * '/data/tomcat-fermi_osg/gratia/staticReports.py'
                       '/data/tomcat-fermi_osg'
                       'https://gratia-fermi.fnal.gov:8845/gratia-reporting/'
      
  4. osg_daily catalina.out logs
    • May 18, 2008 2:33:26 PM org.eclipse.birt.chart.script.ScriptHandler register
      WARNING: An invalid Chart Java event handler class [class net.sf.gratia.reporting.GratiaBirtReportEH] has been detected and ignored.
    • Reporting menu is entirely different.
    • should the static reports even be run for these
  5. gratia_psacct
    • Appears to be hung in the post-install.sh summary step. Killed tomcat around 5/19 07:47 with an explicit kill and restarted it. It is hung again in the same step.
  6. fermi_osg
    • Encountered this error with index12 :
      FINE: Command: Error: alter table ProbeDetails_Meta add unique index index12(md5) : 
      com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: 
           Duplicate entry 'E6A2B4DC7D3C910EF4D584CAEB6F10FA' for key 2
            
    • These are the alter table statements in the gratia-0.log:
      380:FINE: Executing: alter table MetricRecord_Meta add unique index index12(md5)
      382:FINE: Command: OK: alter table MetricRecord_Meta add unique index index12(md5)
      388:FINE: Executing: alter table ProbeDetails_Meta add unique index index12(md5)
      390:FINE: Command: Error: alter table ProbeDetails_Meta add unique index index12(md5) :
          com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: 
          Duplicate entry 'E6A2B4DC7D3C910EF4D584CAEB6F10FA' for key 2
      470:FINE: Executing: alter table ProbeDetails_Meta add unique index index12(md5)
      472:FINE: Command: Error: alter table ProbeDetails_Meta add unique index index12(md5) :
          com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: 
          Duplicate entry 'E6A2B4DC7D3C910EF4D584CAEB6F10FA' for key 2
      590:FINE: Executing: alter table ProbeDetails_Meta add unique index index12(md5)
      592:FINE: Command: Error: alter table ProbeDetails_Meta add unique index index12(md5) :
          com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: 
          Duplicate entry 'E6A2B4DC7D3C910EF4D584CAEB6F10FA' for key 2
      618:FINE: Executing: alter table JobUsageRecord_Meta add index index17(md5v2
            
  7. osg_daily
    • 5/20 08:00 noticed java exceptions
      FINEST: fixDuplicatesOnce: resolving 3 duplicates 
      with checksum C7E72574A5A891270BF2F6ECA1127A10
      May 20, 2008 9:12:45 AM net.sf.gratia.util.Logging debug
      FINEST: fixDuplicatesOnce: deleting record 54646 in favor of record 54704
      May 20, 2008 9:12:45 AM net.sf.gratia.util.Logging warning
      WARNING: fixDuplicatesOnce: caught exception resolving duplicates 
      with checksum C7E72574A5A891270BF2F6ECA1127A10
      com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: 
      PROCEDURE gratia_osg_daily.del_JUR_from_summary does not exist
            
    • Philippe reports that the checksum calculation is wrong for this instance and is looking into it
    • stopped the instance 5/20 09:12
  8. gratia_itb
    • Hibernate got into a deadlock state. Recycled on 5/20 09:08. Waiting to see if it occurs again.
      2008-05-19 08:22:50,075 com.mchange.v2.async.ThreadPoolAsynchronousRunner [WARN]:
       com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@e39068 
      -- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!
      2008-05-19 08:22:50,091 com.mchange.v2.async.ThreadPoolAsynchronousRunner [WARN]:
       com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@e39068 
      -- APPARENT DEADLOCK!!! Complete Status:
              Managed Threads: 3
              Active Threads: 3
              Active Tasks:
      
      2008-05-20 10:16:50,019 com.mchange.v2.async.ThreadPoolAsynchronousRunner [WARN]:
       com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@d7f3b9 
      -- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!
      2008-05-20 10:16:50,048 com.mchange.v2.async.ThreadPoolAsynchronousRunner [WARN]:
       com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@d7f3b9 
      -- APPARENT DEADLOCK!!! Complete Status:
              Managed Threads: 3
              Active Threads: 3
              Active Tasks:
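
Regarding issue 6 (fermi_osg), where adding the unique index index12(md5) failed on a duplicate entry: duplicated checksum values can be listed before the index is attempted. The Python sketch below shows the query shape using an in-memory SQLite table as a stand-in for the MySQL schema. The two-column table layout and the helper name `duplicate_checksums` are assumptions for illustration only.

```python
import sqlite3


def duplicate_checksums(conn, table, column="md5"):
    """Return [(checksum, count), ...] for values that occur more than once."""
    # Table and column names are interpolated, so only pass trusted values.
    query = (
        "SELECT {col}, COUNT(*) FROM {tab} "
        "WHERE {col} IS NOT NULL "
        "GROUP BY {col} HAVING COUNT(*) > 1 ORDER BY {col}"
    ).format(col=column, tab=table)
    return conn.execute(query).fetchall()


# Stand-in data: an in-memory table with one duplicated checksum.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ProbeDetails_Meta (dbid INTEGER PRIMARY KEY, md5 TEXT)")
conn.executemany(
    "INSERT INTO ProbeDetails_Meta (md5) VALUES (?)",
    [
        ("E6A2B4DC7D3C910EF4D584CAEB6F10FA",),
        ("E6A2B4DC7D3C910EF4D584CAEB6F10FA",),
        ("0123456789ABCDEF0123456789ABCDEF",),
    ],
)
conn.commit()
```

Any value the query returns would make the "add unique index" statement fail in the same way as shown in the log excerpt for issue 6.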
            

Notes

  1. The tmp directory being used by MySql on gratia06 is defined on the command line and not by the my.cnf value
    gratia    6415 14500 42 Apr25 ?        10-01:17:45 
              /fnal/ups/prd/mysql/v5_0_22/Linux-2-4/libexec/mysqld 
             --basedir=/fnal/ups/prd/mysql/v5_0_22/Linux-2-4 
             --datadir=/data/mysqldb/ --user=gratia 
             --pid-file=/data/mysqldb//gratia06.fnal.gov.pid --skip-locking 
             --port=3320 --socket=/data/mysqldb/mysql.sock 
             --tmpdir=/data/mysqltmp/tmp
       
  2. gratia_psacct
    • Had to perform an InnoDB conversion in the beginning. (Start 12:59 Completed: 13:09, elapsed 01:11)
    • Then it had to create the new table for the JobUsageRecord_Meta table column. (Start 13:11 Complete:14:26 elapsed 01:15)
    • Adding the index (Start 15:21 ). No clue when it ended
    • At 15:46 running gratia/build-summary-tables.sql
  3. fermi_osg
    • Creating the JobUsageRecord_Meta table column and index (Start 5/19 09:32 Completed: 5/19 11:57 Elapsed: 02:25)
    • Summary table creation (Start: 11:57 Completed: Elapsed: )
  4. Philippe changed the creation of the summary tables ( build-summary-tables.sql )
    • This was to eliminate glexec records from the tables due to a large number of bad records in the main database when glexec first fired up.
    • Ran from gratia-vm02:/data/tomcat-greenc_test area against the gratia07:/test/data/mysqldb/gratia database
    • Started 5/19 14:40 Completed 5/20 04:52 Elapsed: 12:12
    • Reduced time by 4 hours

Certificates on gratia08 and gratia09

  1. Installed pacman
  2. Installed VDT CA-Certificates
    • mkdir /etc/grid-security
    • mkdir /usr/local/vdt-certificates.1101
    • cd /usr/local/vdt-certificates.1101
    • pacman -get VDT:CA-Certificates
    • (Answer yes to all questions and "r (root)" when asked where you want to install them. This puts them directly in /etc/grid-security.)
  3. Enable the VDT services (crons in this release) to maintain the certificates and CRLs
    • source /usr/local/vdt-certificates.1101/setup.sh
    • vdt-control --on

Status v0.34.9

Since changes have been made to the checksum calculation and these instances have already been upgraded for the new md5 (md5v2) value, the upgrade to v0.34.9 requires the following additional steps so that a new checksum (md5v2) value is calculated.

The general procedure for this upgrade is:

  1. Stop the tomcat service
    • service TOMCAT_INITD_SERVICE stop
  2. Tar the existing log files and save to a backup area. Then clean the log file directory.
    • date=`date '+%Y%m%d'`
    • cd TOMCAT_LOCATION/logs/
    • tar zcf /data/gratia_tomcat_logs_backups/tomcat-TOMCAT_INITD_SERVICE.$date.tgz *
    • rm -f *
  3. In the MySql client (as root), force the recalculation of the new checksum values:
    • show index from JobUsageRecord_Meta;
    • alter table JobUsageRecord_Meta drop index index17;
      (index17 should be the md5v2 unique index. If not, drop the one that is.)
    • update JobUsageRecord_Meta set md5v2 = null;
    • delete from SystemProplist where car = 'gratia.database.disableChecksumUpgrade';
  4. Install the new software on the Gratia tomcat instance:
    • pswd=xxx
    • source=/home/gratia/gratia-releases/gratia-v0.34.10
    • pgm=/home/gratia/gratia-releases/gratia-v0.34.10/common/configuration/update-gratia-local
    • $pgm -d $pswd -S $source -s DATABASE
  5. Effective with this release, there is an administrative login process that, by default, does not allow access to the admin functions. You will need to update the TOMCAT_LOCATION/gratia/service-configuration.properties file with the following attributes. (Note that if you forget to do this, the administrator can be added at any time without restarting the tomcat service):
    • service.admin.DN.0=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal
    • service.admin.DN.1=/DC=org/DC=doegrids/OU=People/CN=Christopher H. Green 851859
    • service.admin.DN.2=/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
    • service.admin.DN.3=/DC=org/DC=doegrids/OU=People/CN=Steven Timm 74183
    • service.admin.DN.4=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/CN=UID:timm
    • service.admin.DN.5=/DC=org/DC=doegrids/OU=People/CN=Dan Yocum 346615
    • service.admin.DN.6=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Dan Yocum/CN=UID:yocum
  6. Also change the host in these attributes to either gratia-fermi.fnal.gov (gratia08) or gratia.opensciencegrid.org (gratia09). Note: the port number should be correct, but double-check it.
    • service.open.connection=http://localhost:8884
    • service.secure.connection=https://localhost:8860
  7. Start the tomcat service
    • service TOMCAT_INITD_SERVICE start
      When the tomcat service initializes, it will detect any schema changes that have been made and begin a conversion process.
    • tail the log files in TOMCAT_LOCATION/logs. When the conversion process completes the log will show the following message: "INFO: Server startup in xxx ms"
    • In your browser, access the gratia-administrative web service, select the System / Administration menu option in the left menu, then scroll down to the Starting/Stopping Database Update Services section and select the Stop Update Services link
  8. Wait until the schema upgrade is complete, then start the Gratia update services for the database schema just upgraded.
    • In your browser, connect to the Gratia administrative services url for each of the databases.
    • Select the System / Administration menu option in the left menu
    • Then scroll down to the Starting/Stopping Database Update Services section and select the Start Update Services link.
    • As the collector/tomcat host update service is started, monitor the tomcat log files for any errors.
    • Bring up the gratia administration web interface and verify that the collectors are processing the data on the status page.
    • Bring up the gratia reporting web interface and verify that the reports look reasonable while still tail'ing the log files.
  9. Run the static reports cron as root and verify these reports are generated:
    • 3 pdf files and 3 csv files
  10. If satisfied, you may continue to the next one.
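
Step 7 relies on spotting the "Server startup" message in the tomcat log. A minimal sketch of that check is given here, assuming the message has the form quoted in step 7; the exact spacing around the millisecond count and the sample value are assumptions, and `startup_millis` is a hypothetical helper, not part of Gratia.

```python
import re

# Pattern for the startup message quoted in step 7. The spacing around
# the millisecond count is an assumption about the real log format.
STARTUP_RE = re.compile(r"INFO: Server startup in\s*(\d+)\s*ms")


def startup_millis(log_lines):
    """Return the startup time in ms from the first matching line, else None."""
    for line in log_lines:
        match = STARTUP_RE.search(line)
        if match:
            return int(match.group(1))
    return None


# Hypothetical log excerpt; the 48213 value is made up for illustration.
sample = [
    "INFO: ChecksumUpgrader: reactivating listener threads",
    "INFO: Server startup in 48213 ms",
]
print(startup_millis(sample))
```

A result of None means the message has not appeared yet, so the conversion should be treated as still running.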

gratia08 services

Database        Tomcat node             init.d service         tomcat location              Size  Rows  Start       Complete    Alter table  checksum
fermi_transfer  gratia08.fnal.gov:8886  tomcat-fermi_transfer  /data/tomcat-fermi_transfer  5.2M  0     6/12 11:20  6/12 11:39  n/a          n/a
fermi_itb       gratia08.fnal.gov:8881  tomcat-fermi_itb       /data/tomcat-fermi_itb       513M  59K   6/12 11:40  6/12 12:02  n/a          00:11
gratia_qcd      gratia08.fnal.gov:8883  tomcat-qcd             /data/tomcat-qcd             4.6G  1.2M  6/12 12:33  6/12 17:24  00:08        29:00
gratia_psacct   gratia08.fnal.gov:8882  tomcat-ps              /data/tomcat-ps              19G   8.5M  6/13 09:54  6/18 08:45  04:23        started 6/18 08:45
fermi_osg       gratia08.fnal.gov:8880  fermi_osg              /data/tomcat-fermi_osg       26G   9.5M  6/13 10:00  6/14 12:48  unknown      unknown

tomcat-qcd

Started: 6/12 12:46 Complete: 6/12 17:24 Elapsed: 04:38
Jun 12, 2008 5:19:13 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: main duplicate resolution phase complete in 1 iterations.
INFO: ChecksumUpgrader: resolved duplicates for a total of 65 duplicated checksums.
FINEST: fixDuplicatesOnce: starting duplicate resolution cycle
FINEST: fixDuplicatesOnce: this cycle detected 0 duplicated checksums
INFO: ChecksumUpgrader: make index on md5v2 unique (could take some time)

Jun 12, 2008 5:21:56 PM net.sf.gratia.util.Logging log
FINE: Executing: alter table JobUsageRecord_Meta drop index index12
Jun 12, 2008 5:24:01 PM net.sf.gratia.util.Logging log
FINE: Command: OK: alter table JobUsageRecord_Meta drop index index12
INFO: ChecksumUpgrader: reactivating listener threads
FINE: ListenerThread: ListenerThread: 0:Duplicate Check: true
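
The elapsed time quoted for the tomcat-qcd run (12:46 to 17:24 on 6/12) can be cross-checked with a short sketch. It assumes the "M/D HH:MM" timestamp style used in these notes and the year 2008; `elapsed_hhmm` is a hypothetical helper.

```python
from datetime import datetime


def elapsed_hhmm(start, end, year=2008):
    """Return the elapsed time between two 'M/D HH:MM' stamps as 'HH:MM'."""
    fmt = "%Y %m/%d %H:%M"
    begin = datetime.strptime(f"{year} {start}", fmt)
    finish = datetime.strptime(f"{year} {end}", fmt)
    minutes = int((finish - begin).total_seconds()) // 60
    # Hours are not wrapped at 24, so multi-day runs print as e.g. 31:09.
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


print(elapsed_hhmm("6/12 12:46", "6/12 17:24"))
```

The same helper works for the multi-day conversions recorded elsewhere in these notes, since the hour count is allowed to exceed 24.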
   

tomcat-psacct

-rw-rw----  1 gratia gratia   209715200 Jun 13 23:29 MasterSummaryData.ibd
-rw-rw----  1 gratia gratia   155189248 Jun 14 10:48 NodeSummary.ibd

Jun 13, 2008 11:28:09 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-summary-tables.sql ... OK
Jun 13, 2008 11:28:10 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-summary-view.sql ... OK

2008-06-13 22:01:21,483 org.hibernate.tool.hbm2ddl.SchemaUpdate [INFO]: schema update complete

Restarted on 6/15

Jun 15, 2008 1:47:00 PM net.sf.gratia.util.Logging log
FINE: DatabaseMaintenance: calling post-install script for action "summary"
Jun 15, 2008 2:23:39 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-summary-tables.sql ... OK
Jun 15, 2008 2:23:39 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-summary-view.sql ... OK
Jun 17, 2008 7:27:11 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-ps-node-summary-table.sql ... OK
Jun 17, 2008 7:27:11 PM net.sf.gratia.util.Logging log
FINE: exitValue: 0
Jun 17, 2008 7:27:11 PM net.sf.gratia.util.Logging log
FINE: Executing: select count(*) from SystemProplist where car = "gratia.database.summaryTableVersion"
Jun 17, 2008 7:27:11 PM net.sf.gratia.util.Logging log
FINE: Command: Error: select count(*) from SystemProplist where car = "gratia.database.summaryTableVersion" : com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
java.io.EOFException

6/18 08:30 - The above error occurred because of the timeout on the connection. The SystemProplist table has to be updated manually for this entry:

use gratia_psacct;
update SystemProplist set cdr=38 where car = "gratia.database.summaryTableVersion";
   

6/18 08:35 - restarted tomcat-psacct service

6/16 13:05 - Philippe questioned why the summary tables had to be rebuilt

Chris' reply: 
For the first 0.34 upgrade: if the upgrade "failed" due to the connection timeout 
during summary table production, the DB entry should have been updated. 
If that didn't happen, that's where the problem is.
   - summaryTableVersion is 33.
   - Table reload required at v36
   

tomcat-fermi_osg

-rw-rw----  1 gratia gratia    98304 Jun 14 12:48 SystemProplist.ibd
-rw-rw----  1 gratia gratia 12582912 Jun 14 12:38 MasterSummaryData.ibd
(it appears no more updates occurred against the summary table, as all are glexec records)
   

6/18 10:00 - The checksum calculation is doing one of its resolutions and appears to be 10 hours into this query. Do not want to do anything against the gratia06 database until this completes. It appears to be done entirely in memory. Output of show processlist:

fermi_osg Execute 36353 Sending data select md5v2, count(*) as `Count` from JobUsageRecord_Meta where md5v2 is not null group by md5v2 ..
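
The Time column in show processlist is reported in seconds. As a small sketch, the 36353 seconds shown above can be converted to hours, minutes and seconds to confirm the "10 hours" estimate; `format_seconds` is a hypothetical helper.

```python
def format_seconds(total_seconds):
    """Convert a processlist Time value (seconds) to 'H:MM:SS'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


print(format_seconds(36353))  # prints 10:05:53
```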

6/18 13:20

FINEST: fixDuplicatesOnce: this cycle detected 0 duplicated checksums
Jun 18, 2008 1:16:51 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: main duplicate resolution phase complete in 1 iterations.
Jun 18, 2008 1:16:51 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: resolved duplicates for a total of 34 duplicated checksums.
Jun 18, 2008 1:16:51 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: final pass on duplicates
Jun 18, 2008 1:16:51 PM net.sf.gratia.util.Logging debug
FINEST: fixDuplicatesOnce: starting duplicate resolution cycle
Jun 18, 2008 1:20:36 PM net.sf.gratia.util.Logging debug
FINEST: fixDuplicatesOnce: this cycle detected 0 duplicated checksums
Jun 18, 2008 1:20:36 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: make index on md5v2 unique (could take some time)

      

gratia09 services

Database                Tomcat node             init.d service          tomcat location               Size  Rows  Start       Complete    Alter table  checksum
gratia_osg_transfer     gratia09.fnal.gov:8886  tomcat-osg_transfer     /data/tomcat-osg_transfer     5.2M  0     6/19 7:20   6/19 9:00   n/a          n/a
gratia_osg_daily        gratia09.fnal.gov:8884  tomcat-osg_daily        /data/tomcat-osg_daily        155M  86K   6/19 09:50  6/19 10:15  n/a          n/a
gratia_osg_integration  gratia09.fnal.gov:8885  tomcat-osg_integration  /data/tomcat-osg_integration  7.1G  1.4M  6/19 10:35  6/19 11:37
gratia_itb              gratia09.fnal.gov:8881  tomcat-itb              /data/tomcat-itb              23G   8.2M  6/19 13:05  6/19

osg_transfer

Got this stacktrace shortly after startup:
In catalina.out:
 java.rmi.ServerException: RemoteException occurred in server thread; nested exception is:
         java.rmi.AccessException: Registry.Registry.rebind disallowed; origin /131.225.107.75 is non-local host
In gratia-0.log:
 Jun 19, 2008 7:41:42 AM net.sf.gratia.util.Logging warning
 WARNING: java.rmi.server.ExportException: Port already in use: 21000; nested exception is:
         java.net.BindException: Address already in use

A look at collector-pro.dat shows:

  • gratia09 collectors:
    gratia => { http_port => 8880, rmi_port => 17000,
    itb => { http_port => 8881, rmi_port => 18000,
    osg_daily => { http_port => 8884, rmi_port => 20000,
    osg_integration => { http_port => 8885, rmi_port => 21000,
    osg_transfer => { http_port => 8886, rmi_port => 21000,
  • gratia08 collectors:
    fermi_osg => { http_port => 8880, rmi_port => 17000,
    fermi_itb => { http_port => 8881, rmi_port => 18000,
    ps => { http_port => 8882, rmi_port => 19000,
    qcd => { http_port => 8883, rmi_port => 21000,
    fermi_transfer => { http_port => 8886, rmi_port => 22000,
    conversion => { http_port => 8884, rmi_port => 20000,
  • The osg_transfer collector is the only one with a conflicting port: it shares rmi_port 21000 with osg_integration on gratia09. Changed manually to 22000 in /data/tomcat-osg_transfer and in CVS.
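
A conflict like this can be caught before startup by grouping collectors on the same host by RMI port. The sketch below uses the gratia09 and gratia08 values listed from collector-pro.dat; the dictionaries and `find_port_conflicts` are illustrative only and do not parse the real file.

```python
from collections import defaultdict

# rmi_port values copied from the collector-pro.dat excerpt in these notes.
gratia09_rmi_ports = {
    "gratia": 17000,
    "itb": 18000,
    "osg_daily": 20000,
    "osg_integration": 21000,
    "osg_transfer": 21000,
}
gratia08_rmi_ports = {
    "fermi_osg": 17000,
    "fermi_itb": 18000,
    "ps": 19000,
    "qcd": 21000,
    "fermi_transfer": 22000,
    "conversion": 20000,
}


def find_port_conflicts(ports):
    """Return {port: [collector, ...]} for ports used by more than one collector."""
    by_port = defaultdict(list)
    for name, port in ports.items():
        by_port[port].append(name)
    return {port: names for port, names in by_port.items() if len(names) > 1}


print(find_port_conflicts(gratia09_rmi_ports))
print(find_port_conflicts(gratia08_rmi_ports))
```

Ports only need to be unique per host, which is why each host is checked separately.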

6/19 08:30

  • Now got this error on startup in catalina.out:
    java.rmi.ServerException: RemoteException occurred in server thread; nested exception is:
    java.rmi.AccessException: Registry.Registry.rebind disallowed; origin /131.225.107.75 is non-local host
  • caused by this alias in the service-configuration.properties file
    service.rmi.rmibind=//gratia-transfer.opensciencegrid.org:22000
    service.rmi.rmilookup=rmi://gratia-transfer.opensciencegrid.org:22000
  • these were changed as well
    monitor.subject=gratia-transfer.opensciencegrid.org:8886
    service.open.connection=http://gratia-transfer.opensciencegrid.org:8886
    service.secure.connection=https://gratia-transfer.opensciencegrid.org:8840
  • Changed to gratia.opensciencegrid.org. Changed manually in /data/tomcat-osg_transfer and in CVS.

Done at 6/19 09:00

osg_daily

No problems.
Jun 19, 2008 11:53:42 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: make index on md5v2 unique (could take some time)
Jun 19, 2008 11:53:55 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: drop index on old md5 column
FINE: Command: OK: alter table JobUsageRecord_Meta drop index index12
Jun 19, 2008 11:54:03 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: reactivating listener threads
Jun 19, 2008 11:54:04 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: checksum upgrade operation complete

osg_integration

alter table JobUsageRecord_Meta drop index index17;
elapsed: 00:10
update JobUsageRecord_Meta set md5v2 = null;
Started 11:17 Ended: 11:19
Jun 19, 2008 11:23:50 AM net.sf.gratia.util.Logging log
FINE: Executing: alter table JobUsageRecord_Meta add index index17(md5v2)
Jun 19, 2008 11:27:19 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: updating all checksums in JobUsageRecord_Meta

gratia_itb

Post-mortem

At this time, these will appear to be random notes. After the conversion, they may be organized.

Main gratia DB - Upgrade to v0.34.9

The main Gratia database ( gratia ) served by the gratia09:/data/tomcat-gratia instance will require a very long conversion time. So to maximize availability we will:

  1. create a current copy of the gratia06/gratia database on gratia07
  2. use a temporary Gratia collector on gratia08:/data/tomcat-conversion against that copy
    • this collector will be upgraded to v0.34.9 allowing the new summary table and md5v2 checksums to be calculated.
    • we will then replicate from gratia06 to gratia07 to "catch up" the database before we switch the production gratia09:/data/tomcat-gratia over to using it.
  3. switch the current gratia09:/data/tomcat-gratia to use the gratia07/gratia database
  4. perform the upgrade of the gratia06/gratia database to v0.34.9 using the Gratia collector on gratia08:/data/tomcat-conversion .
  5. While the checksum upgrades are occurring, we will replicate from gratia07 to gratia06 so the "catch up" time to make it current is reduced.
  6. when the upgrade is complete (inclusive of the conversion to the new md5 key), we will switch the gratia09:/data/tomcat-gratia instance back to using the gratia06/gratia database

During the upgrade period, this will be the tomcat/database environment for the gratia database:

Tomcat instance                              Database instance
gratia09:/data/tomcat-gratia port 8880       gratia07:/data/mysqldb/gratia port 3320
gratia08:/data/tomcat-conversion port 8884   gratia06:/data/mysqldb/gratia port 3320

The above is a high level summary of the process. The details are contained in subsequent sections.

Create a copy of the current database on gratia07

On 5/28, the gratia07:/data/mysqldb was re-partitioned into a single 200Gb area to provide the required space for the gratia database.

Create the gratia07:gratia database using the ZRM backup

This is in process now (5/30) and a finalized procedure will be included when it completes.

Done - 6/3/08 08:00 CDT

Configure gratia08:/data/tomcat-conversion to use the gratia07:gratia database

On gratia08, edit the /data/tomcat-conversion/gratia/service-configuration.properties file for this resulting attribute:
  • service.mysql.url=jdbc:mysql://gratia07.fnal.gov:3320/gratia

Done - 6/3/08 12:37 CDT (John Weigand)
Re-confirmed - 6/11/08 10:30 CDT (John Weigand)

Upgrade gratia08:/data/tomcat-conversion to v0.34.9

On gratia08 :
  1. Stop the tomcat service
    • service tomcat-conversion stop
    • Done - 6/11/08 14:48 CDT

On gratia07 :

  1. Disable the checksum upgrade on gratia07/gratia database. In a MySql client:
    • update SystemProplist set cdr=1 where car='gratia.database.disableChecksumUpgrade';
    • Done - 6/11/08 14:51 CDT

Back on gratia08 :

  1. Perform the upgrade to v0.34.9
    • pswd=xxx
    • source=/home/gratia/gratia-releases/gratia-v0.34.9
    • pgm=/home/gratia/gratia-releases/gratia-v0.34.9/common/configuration/update-gratia-local
    • $pgm -d $pswd -S $source -s conversion
  2. Effective with this release, there is an administrative login process that, by default, does not allow access to the admin functions. You will need to update the TOMCAT_LOCATION/gratia/service-configuration.properties file with the following attributes. You will also need to change 2 connection attributes. (Note that if you forget to do this, the administrator can be added at any time without restarting the tomcat service):
    • service.admin.DN.0=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal
    • service.admin.DN.1=/DC=org/DC=doegrids/OU=People/CN=Christopher H. Green 851859
    • service.admin.DN.2=/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
    • service.admin.DN.3=/DC=org/DC=doegrids/OU=People/CN=Steven Timm 74183
    • service.admin.DN.4=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/CN=UID:timm
    • service.admin.DN.5=/DC=org/DC=doegrids/OU=People/CN=Dan Yocum 346615
    • service.admin.DN.6=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Dan Yocum/CN=UID:yocum
    • service.open.connection=http://gratia-fermi.fnal.gov:8884
    • service.secure.connection=https://gratia-fermi.fnal.gov:8860
  3. IMPORTANT IMPORTANT... THESE MUST BE CHANGED
    • service.mysql.url=jdbc:mysql://gratia07.fnal.gov:3320/gratia in the service-configuration.properties file
    • in post-install.sh, change the mysql client execution from gratia-db01 to gratia07.
  4. Start the tomcat service
    • service tomcat-conversion start

Done - 6/11/08 14:55 CDT
The first attempt ran against the production database. I did not realize that the collector-pro.dat file set the mysql url to gratia_db01:3320/gratia. Chris bailed me out and confirmed that no damage had occurred. To avoid this on future releases, Chris changed the collector-pro.dat entry for the conversion instance to garbage data, which forces the file to be changed manually. I will add this to the upgrade instructions for gratia09:tomcat-gratia.
Re-Done - 6/11/08 15:21 CDT
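
A pre-start check on the service.mysql.url property would guard against pointing a collector at the wrong database. The following is a minimal sketch, assuming the properties file uses plain key=value lines as shown in these notes; `mysql_target` is a hypothetical helper and the sample text is abbreviated.

```python
from urllib.parse import urlparse


def mysql_target(properties_text):
    """Return (host, port, database) from service.mysql.url, or None if absent."""
    for line in properties_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() == "service.mysql.url":
            url = value.strip()
            # Drop the leading "jdbc:" so urlparse sees an ordinary mysql:// URL.
            if url.startswith("jdbc:"):
                url = url[len("jdbc:"):]
            parsed = urlparse(url)
            return parsed.hostname, parsed.port, parsed.path.lstrip("/")
    return None


# Abbreviated sample of service-configuration.properties for the conversion instance.
sample_properties = """\
service.open.connection=http://gratia-fermi.fnal.gov:8884
service.mysql.url=jdbc:mysql://gratia07.fnal.gov:3320/gratia
"""

host, port, database = mysql_target(sample_properties)
print(host, port, database)
```

Comparing the returned host with the intended one (gratia07.fnal.gov for the conversion instance) before running "service tomcat-conversion start" would have flagged the production URL.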

First thing it does is add the GridDescription to the JobUsageRecord_Meta table.
In MySql client ( show processlist; ):

Id User Host db Command Time State Info
2786 gratia gratia08.fnal.gov:11262 gratia Query 57891 copy to tmp table alter table JobUsageRecord_Meta add column GridDescription varchar(255)
Started: 6/11 15:20 CDT
Status as of 6/12 07:27 CDT
  • -rw-rw---- 1 gratia gratia 23G Jun 5 09:08 JobUsageRecord_Meta.ibd
  • -rw-rw---- 1 gratia gratia 14K Jun 11 15:20 #sql-3275_ae2.frm
  • -rw-rw---- 1 gratia gratia 16G Jun 12 07:27 #sql-3275_ae2.ibd

GridDescription add to JobUsageRecord_Meta table

  • Started - 6/11 15:21 CDT Ended - 6/12 22:30 CDT Elapsed - 31:09

md5v2 and index add to JobUsageRecord_Meta

  • Started - 6/12 23:26 CDT Ended - 6/14 10:07 CDT Elapsed - 34:41

MasterSummary table creation (started tomcat-conversion service)

  • Started - 6/14 10:45 CDT Ended - 6/12 CDT Elapsed
    Jun 14, 2008 10:55:21 AM net.sf.gratia.util.Logging log
    FINE: Executing: /data/tomcat-conversion/gratia/post-install.sh summary
    ABOVE ONE WAS AGAINST THE gratia06 database as the post-install.sh script needed to be changed too
    
    Jun 14, 2008 11:21:24 AM net.sf.gratia.util.Logging log
    FINE: Executing: /data/tomcat-conversion/gratia/post-install.sh summary
    
    Jun 15, 2008 3:01:31 AM net.sf.gratia.util.Logging log
    FINE: ERROR - ERROR 1206 (HY000) at line 45: The total number of locks exceeds the lock table size
    
    In the gratia07.fnal.gov.err file....
    080615  2:47:28  InnoDB: WARNING: over 67 percent of the buffer pool is occupied by
    InnoDB: lock heaps or the adaptive hash index! Check that your
    InnoDB: transactions do not set too many row locks.
    InnoDB: Your buffer pool size is 512 MB. Maybe you should make
    InnoDB: the buffer pool bigger?
    InnoDB: Starting the InnoDB Monitor to print diagnostics, including
    InnoDB: lock heap and hash index sizes.
    
    
    Jun 15, 2008 13:30
    Increased the my.cnf variable:
    innodb_buffer_pool_size = 512M
    to
    innodb_buffer_pool_size = 768M
    
    also stopped mysql_test instance... restarted mysql ... restarted tomcat-conversion
    
    Jun 15, 2008 1:34:57 PM net.sf.gratia.util.Logging log
    FINE: Executing: /data/tomcat-conversion/gratia/post-install.sh summary
    Jun 16, 2008 9:27:50 AM net.sf.gratia.util.Logging log
    FINE: OUTPUT>post-install.sh: loading /data/tomcat-conversion/gratia/build-summary-tables.sql ... OK 
    Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging log
    FINE: OUTPUT>post-install.sh: loading /data/tomcat-conversion/gratia/build-summary-view.sql ... OK
    Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging log
    FINE: exitValue: 0
    Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging log
    FINE: Executing: select count(*) from SystemProplist where car = "gratia.database.summaryTableVersion"
    Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging log
    FINE: Command: Error: select count(*) from SystemProplist where car = "gratia.database.summaryTableVersion" : com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
    ** BEGIN NESTED EXCEPTION **
    java.io.EOFException
       

Apparently the connection timed out around 6/16 09:27. The summary table and views were updated successfully; only the update of the SystemProplist table entries failed. Chris updated them manually. Then we recycled the tomcat-conversion service on gratia08.

Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging warning
WARNING: Command: Error: insert into SystemProplist(car,cdr) values("gratia.database.summaryTableVersion", "38")

root      7218     1 20 10:05 ?        00:32:35 /data/jre/bin/java 
   

WAIT until the summary tables have been created. Then you can proceed to the next step of "catching up" the gratia07 database.

Bring the gratia07 database up to current

Since the gratia07 database is based on a nightly backup and production has been collecting data from the CE nodes, it needs to be brought up to current by replicating from gratia06 to gratia07 using the administrative services Replication menu.
  • on http://gratia.opensciencegrid.org:8880/gratia-administration
  • both Job Usage and Metric replication services should be activated to replicate to:
    - http://gratia-fermi.fnal.gov:8884
  • The starting dbid should be determined ( once only after the backup is restored) using the mysql client on gratia07
    - select max(dbid) from JobUsageRecord_Meta;
    - select max(dbid) from MetricRecord_Meta;

Partially done - 6/3/08 13:45 CDT (John Weigand)

select max(dbid) from JobUsageRecord_Meta;
+-----------+
| max(dbid) |
+-----------+
|  75242300 |
+-----------+

select max(dbid) from MetricRecord_Meta;
+-----------+
| max(dbid) |
+-----------+
|      1506 |
+-----------+

On gratia06:
update Replication set dbid=75242300 where openconnection='http://gratia-fermi.fnal.gov:8884' and replicationid=52;
update Replication set dbid=1506 where openconnection='http://gratia-fermi.fnal.gov:8884' and replicationid=54;

Starting point (approx 6/3/08 14:00 CDT)

gratia06 max(dbid) 75,851,594
gratia07 max(dbid) 75,242,300
catch up 609,294
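
The catch-up figure is simply the difference between the two max(dbid) values. A small sketch of that arithmetic follows, with a closeness check for deciding when the databases are near enough to proceed; the tolerance of 1000 records is an assumption, not a value from these notes, and both function names are hypothetical.

```python
def replication_backlog(source_max_dbid, target_max_dbid):
    """Number of records the target still has to receive from the source."""
    return source_max_dbid - target_max_dbid


def is_close(source_max_dbid, target_max_dbid, tolerance=1000):
    """True when the backlog is within the (assumed) tolerance."""
    return replication_backlog(source_max_dbid, target_max_dbid) <= tolerance


# max(dbid) values recorded at the starting point (approx 6/3/08 14:00 CDT).
gratia06_max = 75_851_594
gratia07_max = 75_242_300

print(replication_backlog(gratia06_max, gratia07_max))  # prints 609294
print(is_close(gratia06_max, gratia07_max))             # prints False
```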

Catch up complete around 6/3/08 23:00 CDT

hour UTC updates
080604 12 7715
080604 11 8224
080604 10 8435
080604 09 6962
080604 08 7869
080604 07 8005
080604 06 7753
080604 05 6874
080604 04 7944
080604 03 9629
080604 02 14588
080604 01 92283
080604 00 89794
080603 23 87849
080603 22 70699
080603 21 89476
080603 20 89201
080603 19 32873

2nd round of "catchup" statistics

  • started at 6/16 10:10
  • 2M records to process
  • at 6/16 12:46 these are the counts.
Period JobUsage Metric
1 hour ago 43173 0
2 hours ago 35046 0
3 hours ago 7590 25
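
These counts allow a rough estimate of how long the second catch-up will take. The sketch below assumes the "2M records" figure is exact, that the most recent hourly rate holds steady, and that no new records arrive in the meantime; none of these assumptions is guaranteed, so the result is only an order-of-magnitude guide. `hours_remaining` is a hypothetical helper.

```python
def hours_remaining(total_records, processed, records_per_hour):
    """Estimate the hours left at a constant processing rate."""
    remaining = max(total_records - processed, 0)
    return remaining / records_per_hour


# JobUsage counts from the three hourly periods above.
processed = 43173 + 35046 + 7590
# Rate taken from the most recent hour; 2,000,000 is the rounded "2M" figure.
estimate = hours_remaining(2_000_000, processed, 43173)

print(f"{processed} records processed, about {estimate:.1f} hours remaining")
```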

Wait until the gratia07 is caught up to the gratia06 database. You can monitor this by:

  • on gratia06
    - select max(dbid) from JobUsageRecord_Meta;
    - select max(dbid) from MetricRecord_Meta;
  • on gratia07
    - select max(dbid) from JobUsageRecord_Meta;
    - select max(dbid) from MetricRecord_Meta;

When they are close, stop the administrative update service on

The replication services should still be running to ensure we "catch up" the last of the updates on gratia06 to gratia07. The replication activity shall be complete when both replication data pumps (as shown by gratia-0.log on gratia08) have been through at least one check-duplicate cycle without uploading records.

Then, turn off both the Job Usage and Metric replication services on:

Take note of max(dbid) in JobUsageRecord_Meta and MetricRecord_Meta on gratia07, as these numbers will be used to set up replication from gratia07 back to gratia06 while checksums are being upgraded on gratia06.

Update the Replication services on gratia06 to gratia07

On gratia06, if there are Job Usage and Metric replication services active, we will need to create entries in the gratia07 instance as this will shortly become our production database.

When creating these entries, we will need to ensure that we start with the last record replicated from gratia06, which may not have the same dbid on gratia06 as it did on gratia07.

Instructions for doing this will be provided later if this needs to be performed for this release.

This should be done before we switch the gratia09:/data/tomcat-gratia to use the gratia07/gratia database.

The entries should be added and only a test that the entry is correct made at this time. We will set the starting dbid later and activate the entry after we switch the tomcat collectors.

Stop gratia09:/data/tomcat-gratia and gratia08:/data/tomcat-conversion

Stop each collector:
  1. on gratia08
    • service tomcat-conversion stop
  2. on gratia09
    • service tomcat-gratia stop

Upgrade the current gratia09:/data/tomcat-gratia to v0.34.9

REMEMBER:
  1. on gratia07 my.cnf, we need to eliminate the binary log files.... NO BACKUPS WILL NOT WORK W/O BINARY LOGS... AT LEAST SO FAR.
  2. on gratia06 my.cnf, we need to change the buffer size as we did on gratia07. Also need to eliminate the binary log files. This will mean stopping all collectors and bouncing the mysql. BE CAREFUL THAT BACKUPS HAVE COMPLETED.
  3. backups on gratia07

On gratia09:

  1. Perform the upgrade to v0.34.9
    • pswd=xxx
    • source=/home/gratia/gratia-releases/gratia-v0.34.9
    • pgm=/home/gratia/gratia-releases/gratia-v0.34.9/common/configuration/update-gratia-local
    • $pgm -d $pswd -S $source -s gratia
  2. Effective with this release, there is an administrative login process that, by default, does not allow access to the admin functions. You will need to update the /data/tomcat-gratia/gratia/service-configuration.properties file with the following attributes. You will also need to change 2 connection attributes. (Note that if you forget to do this, the administrator can be added at any time without restarting the tomcat service):
    • service.admin.DN.0=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal
    • service.admin.DN.1=/DC=org/DC=doegrids/OU=People/CN=Christopher H. Green 851859
    • service.admin.DN.2=/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
    • service.admin.DN.3=/DC=org/DC=doegrids/OU=People/CN=Steven Timm 74183
    • service.admin.DN.4=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/CN=UID:timm
    • service.admin.DN.5=/DC=org/DC=doegrids/OU=People/CN=Dan Yocum 346615
    • service.admin.DN.6=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Dan Yocum/CN=UID:yocum
    • service.open.connection=http://gratia.opensciencegrid.org:8880
    • service.secure.connection=https://gratia.opensciencegrid.org:8845
  3. Change the post-install.sh script to point to gratia07.fnal.gov

Started 6/24 08:17 Done 6/24 08:24

Switch gratia09:/data/tomcat-gratia to use the gratia07/gratia database

Update the service-configuration.properties file to change the database that the production instance points to: IMPORTANT IMPORTANT... THIS MUST BE CHANGED
  1. In gratia09:/data/tomcat-gratia/gratia/service-configuration.properties
    • service.mysql.url=jdbc:mysql://gratia07.fnal.gov:3320/gratia Done 6/24 08:24
  2. In gratia09:/data/tomcat-gratia/gratia/post-install.sh, change the mysql client execution from gratia-db01 to gratia07.
  3. Start the gratia09 production collector only (it will be using the gratia07 database)
    • service tomcat-gratia start

At this point we have production gratia reporting and services back on-line and available. Verify there are no problems by tailing the log files in gratia09:/data/tomcat-gratia/logs .

Done 6/24 08:30

Problems:

  1. catalina.out exceptions
    command: select T.SiteName, P.currenttime, P.probename from Probe P, Site T where P.active = 1 and T.siteid = P.siteid order by T.SiteName,P.currenttime desc
    java.lang.ArrayIndexOutOfBoundsException
  2. gratia-rmi-servlet0.log - coming every 10 minutes. Chris thinks from a probe.
    INFO: RMIHandlerServlet: Error: Problematic req: org.apache.catalina.connector.RequestFacade@564a38
    Jun 24, 2008 2:29:25 PM net.sf.gratia.util.Logging info
    • A v0.34.9a was installed (not put in CVS) at 11:35am to troubleshoot the problem.
    • A new v0.34.a installed (11:47 - 11:51) with problem fixed hopefully
  3. Static reports appear to be truncating... should be landscape to fit in or scaled differently.

If all is well, we can proceed to the v0.34.9 upgrade of the gratia06:gratia database.

Backups

We will need to take backups on the gratia07 database for the gratia database.

And also stop the backups on gratia06 for the gratia database.

Crons

This section can be delayed depending on the time of day.

Cron processes during upgrade

There are several processes (cron) that will need to be copied/modified on gratia06 and gratia09 for reports and interfaces.
  1. gratia06 - user gratia
    • APEL-WLCG interface
       dir=/home/gratia/interfaces/apel-lcg; cd $dir; ./lcg.sh --config=lcg.conf --date=current --update 
      This cron process can remain on gratia06. The only modification required is to the /home/gratia/interfaces/apel-lcg/lcg-db.conf file for the following parameter:
      • GratiaHost gratia07.fnal.gov

        Done - 6/24 13:19

    • OSG daily reports
      /home/gratia/gratia-summary/gratiaSum.cron.sh
      /home/gratia/gratia-summary/daily_mail_cron.sh
      /home/gratia/gratia-summary/range_mail_cron.sh --weekly
      /home/gratia/gratia-summary/range_mail_cron.sh --monthly
      /home/gratia/gratia-summary/rungratia.sh --production 
      
    • CMS weekly reports
      /home/gratia/gratia-summary/cms_mail_cron.sh --weekly

      In /home/gratia/gratia-summary/PSACCTReport.py, change the following setting:

      gMySQLConnectString = " -h gratia-db01.fnal.gov -u reader --port=3320 --password=reader "
      Done - 6/24 13:20

  2. gratia06 - user root
    Database backups on gratia06 will need to be changed to exclude the gratia database. This includes the purging of log files by the zrm process.
    /usr/local/bin/gratia_backup_cron.sh /usr/bin/mysql-zrm --action purge 
Done 06/24 12:43 - commented the gratia database backup.

However, could not really stop the purging. So copied them to /backup/mysqldb/gratia/zrm.20080624.backup to preserve them.
Done 6/24 13:05

  3. gratia09 - user root
    The static reports cron on gratia09 does not