Gratia Release v0.34 and v0.34.10 (5/14/08 - 5/29/2008)
Overview
The main purposes of this release (v0.34.10) are the DN/FQAN collection improvements and the new authentication requirement for administering the Collector.
Reporter Improvements (v0.34.9)
- Report displaying, by VO, those jobs that have been running for more than X days, where X is a week or a month. (RR23)
- Report displaying those sites that show 0 Wall Duration for more than X days, where X is a week or a month. (RR24)
- Update the osg daily report for Birt 2.2
- Periodically generated (daily, weekly and/or monthly) csv files that are available for immediate view without need to read db records
- CPUday per day for the last month/30 days for each VO
- CPUday per day for the last month/30 days for each site
- total CPUweeks per week for the last 52 weeks (year)
- Extend User daily report to include VO name.
- Enable easy selection/reporting of CMS-Tier-2 sites.
Collector Improvements (v0.34.7)
- Enable authentication in the administration pages
- Implement looking up the DN, Role and VO directly from certificate
- Update of the record checksum to exclude DN, Role, VO
- Review and improve the Summary table creation mechanism
- Update Collector to Collector connection to send more than one record on each connection
- Add VO to User Summary Table
- Replace pre-duplicate check by exception handling
Probe Improvements (v0.34.1)
- Document on the issues related to full and complete probe black-out detection.
- Globus patched to interrogate delegated proxies wherever available for DN and FQAN information.
- Condor probe updated:
- to get information on WS jobs;
- to use ClassAds generated by PER_JOB_HISTORY_DIR (if available) to avoid expensive calls to condor_history. Will fall back on condor_history;
- to facilitate "local" probes by means of a constraint option so probes only process certain types of job.
- to avoid double-counting when information about the same job is received from multiple sources.
- Package dCache probe as RPM.
- dCache / SRM probes now available (see dCache release notes).
- Minor improvements to SGE probe from Shreyas Cholia.
- PBS/LSF probes improved to prevent misreading an incomplete line in the batch manager log.
Anticipated downtime
It is expected that this release will require the Gratia services and reporting to be unavailable beginning at:
- Start: 5/14/08 hh:mm CST
- Available: 5/14/08 hh:mm CST
The changes affecting downtime for this release are:
- Length of time to make a backup of the database to the backup area using mysqlhotcopy
- Installation and validation on the 6 Gratia schemas
Build the v0.34.10 for distribution
This release was done in 2 stages, plus a couple of interim stages when problems were detected:
- v0.34 - for the VDT 1.10.1 distribution (ITB 0.9.0)
- v0.34.1 - for local distribution. Changes were required to the local scripts/configurations for the actual conversion at FNAL on gratia08 and gratia09 collectors.
- Make sure your build area contains all committed changes.
- In gratia/build-scripts/Makefile , change the version_default to:
- version_default = v0.35
- commit the change
- Tag the release:
- cvs tag -R v0-34 gratia This tags everything that is in your build directory and presumably tested.
- Into a new area, export the tagged release:
- cvs export -d gratia-v0.34 -r v0-34 gratia
- Build it for the release (this ensures that tar files are produced for VDT):
- cd gratia-v0.34/build-scripts
- source setup-jdk15.sh
- make release
- Copy the build/release area to the gratia home directory for posterity:
- mkdir /home/gratia/gratia-releases/v0.34
- cp -pr /home/weigand/cdcvs/gratia-v0.34/* /home/gratia/gratia-releases/v0.34/.
- Copy the built tar files to the release area:
- scp ../target/*_v0.34.tar flxi07.fnal.gov:/afs/fnal.gov/files/expwww/gratia/html/Files/
- Update the version number on the services release TWiki page:
- Edit and update the TWiki variable ReleaseVersion.
Status: Completed on 5/10/2008 08:00 CDT
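The tag, export, build, and copy steps above follow the same pattern for every version. A minimal sketch in Python of that pattern: it derives the CVS tag from a version string and lists the corresponding commands. The helper names `cvs_tag_for` and `release_commands` are hypothetical, and the commands are only printed, not executed.

```python
def cvs_tag_for(version):
    """Convert a release version such as 'v0.34' to a CVS tag such as 'v0-34'.

    CVS tag names may not contain '.', so each dot becomes a hyphen.
    """
    return version.replace(".", "-")


def release_commands(version, release_host="flxi07.fnal.gov",
                     release_dir="/afs/fnal.gov/files/expwww/gratia/html/Files/"):
    """Return the shell commands for tagging, exporting, building and publishing."""
    tag = cvs_tag_for(version)
    export_dir = "gratia-" + version
    return [
        "cvs tag -R %s gratia" % tag,
        "cvs export -d %s -r %s gratia" % (export_dir, tag),
        "cd %s/build-scripts" % export_dir,
        "source setup-jdk15.sh",
        "make release",
        "scp ../target/*_%s.tar %s:%s" % (version, release_host, release_dir),
    ]


for command in release_commands("v0.34"):
    print(command)
```

Keeping the version-to-tag conversion in one place avoids hand-typed tags that drift from the version number.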
*Update: 5/11/2008 14:45 CDT:*
Had to modify Makefile to include the voms_servers and voms-server.sh files in the Gratia-Services distribution for VDT. Re-tagged, rebuilt, moved to the flxi07.fnal.gov site.
*Update: 5/12/2008 14:15 CDT:*
Problems encountered when working with an empty database. Previous testing was done with an existing database. The following programs required modification:
- collector/gratia-services/net/sf/gratia/services/ListenerThread.java
- collector/gratia-services/net/sf/gratia/services/NewVOUpdate.java
- collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
- common/configuration/hibernate.cfg.xml
Steps 3 through 6 were repeated for these modules. Note: To tag the individual module (example):
- cvs tag -F v0-34 collector/gratia-services/net/sf/gratia/services/ListenerThread.java
Provided a new configure_gratia to VDT. It was tested in a Condor environment as thoroughly as we could, with particular attention to validating the PER_JOB_HISTORY_DIR setting, which is critical.
This is the v0.34 used in VDT 1.10.1 (ITB 0.9.0).
*Update: 5/14/2008 15:45 CDT: Tagged as v0-34-1*
The following modules were modified for changes to local (non-VDT) installations/upgrades. Tests are currently being run by Chris Green and a gratia instance on gratia07:/test/data/mysqldb/gratia using this software:
- gratia/build-scripts/Makefile
- gratia/common/configuration/cleanup_server_lib
- gratia/common/configuration/configure-collector
- gratia/common/configuration/update-gratia-local
Steps 3 through 6 were repeated for these modules. Note: The tagging was done as:
- cvs tag -R v0-34-1 gratia
and the export done as:
- cvs export -d gratia-v0.34.1 -r v0-34-1 gratia
and the build done as:
- cd gratia-v0.34.1/build-scripts
- source setup-jdk15.sh
- make release
and the copy to the release area:
- scp ../target/*_v0.34.1.tar flxi07.fnal.gov:/afs/fnal.gov/files/expwww/gratia/html/Files/.
- updated the version number on the services release TWiki page
Also note these changes do not affect the current VDT 1.10.1 release of Gratia v0.34. The latest version was copied in as in step 6 and 7, but VDT does not need to access this version.
Update: 5/20/2008 14:42 CDT: Tagged as v0-34-2
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecord.java
U collector/gratia-services/net/sf/gratia/storage/SummaryUpdater.java
U collector/gratia-services/net/sf/gratia/storage/Utils.java
U common/configuration/collector-pro.dat
U common/configuration/service-configuration.properties
U probe/build/gratia-probe.spec
U probe/common/Gratia.py
U probe/common/GRAM/JobManagerGratia.pm
U probe/condor/condor_meter.pl
U reporting/summary/PSACCTReport.py
cvs tag -R v0-34-2 gratia
5/20/2008 16:46 CDT
cvs tag -F v0-34-2 collector/gratia-services/net/sf/gratia/services/CollectorService.java
cvs tag -F v0-34-2 collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
cvs tag -F v0-34-2 common/configuration/summary-procedures.sql
5/21/2008 11:03 CDT
Since the osg_daily record does not have a jobid or anything similar, we must keep the
UserIdentity field in the md5 calculation; otherwise we find many false duplicates.
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecord.java
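To illustrate why dropping a field from the checksum can produce false duplicates, here is a minimal sketch in Python. It is not the collector's actual checksum code: the function `record_checksum`, the field list, and the field names other than UserIdentity are assumptions made for illustration.

```python
import hashlib


def record_checksum(record, fields):
    """Return an upper-case MD5 hex digest over the chosen fields of a record."""
    digest = hashlib.md5()
    for name in fields:
        # Include the field name so that values cannot shift between fields.
        digest.update(("%s=%s;" % (name, record.get(name, ""))).encode("utf-8"))
    return digest.hexdigest().upper()


# Two hypothetical daily records that differ only in the user.
rec_a = {"ProbeName": "daily:example", "EndTime": "2008-05-20", "UserIdentity": "alice"}
rec_b = {"ProbeName": "daily:example", "EndTime": "2008-05-20", "UserIdentity": "bob"}

without_user = ["ProbeName", "EndTime"]
with_user = ["ProbeName", "EndTime", "UserIdentity"]

# Without UserIdentity the two records collide; with it they stay distinct.
false_duplicate = record_checksum(rec_a, without_user) == record_checksum(rec_b, without_user)
distinct = record_checksum(rec_a, with_user) != record_checksum(rec_b, with_user)
print(false_duplicate, distinct)
```

When no job identifier is available, the user field is the only thing separating otherwise identical records, so it has to stay in the hashed input.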
Add support for DN coming from cron
(/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=fermigrid0.fnal.gov/CN=cron/CN=Keith Chadwick/CN=UID:chadwick).
In that case, use the 3rd CN
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecordUpdater.java
Re-add VOName Correction link
U collector/gratia-administration/WebContent/dashboard.jsp
use full md5 for ps accounting too
U common/configuration/collector-pro.dat
cvs tag -R v0-34-3 gratia
5/22/2008 10:45 CDT Changes:
Re-add missing Host field lookup needed by psacct
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
U common/configuration/summary-procedures.sql
actually use the common name we just found
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecordUpdater.java
cvs tag -R v0-34-4 gratia
5/27/2008 13:10 CDT Changes for gratia_psacct
report library is in production-osg only
reporting/gratia-reports/WebContent/reports/production-fermi/gratia-lib1.rptlibrary is no longer in the repository
reporting/gratia-reports/WebContent/reports/production-osgdaily/gratia-lib1.rptlibrary is no longer in the repository
U reporting/gratia-reports/WebContent/reports/production-osg/gratia-lib1.rptlibrary
cvs tag -R v0-34-5 gratia
One more change:
Give up trying to guess which CN is the "right" one: record them all.
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecordUpdater.java
cvs tag -F v0-34-5 collector/gratia-services/net/sf/gratia/storage/JobUsageRecordUpdater.java
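The two CN-handling approaches mentioned in these notes (use the 3rd CN for cron DNs, then later record every CN) can be sketched in Python as follows. This is an illustration, not the code in JobUsageRecordUpdater.java. The helper `common_names` is hypothetical and assumes that no DN component contains a "/" character.

```python
def common_names(dn):
    """Return every CN value in a slash-separated certificate DN, in order."""
    names = []
    for component in dn.split("/"):
        if component.startswith("CN="):
            names.append(component[len("CN="):])
    return names


# The cron DN quoted earlier in these notes.
CRON_DN = ("/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=fermigrid0.fnal.gov"
           "/CN=cron/CN=Keith Chadwick/CN=UID:chadwick")

names = common_names(CRON_DN)
print(names)      # every CN, as in the "record them all" change
print(names[2])   # the 3rd CN, as in the earlier cron-specific rule
```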
5/28/2008 12:16 CDT Changes:
Config change to address "APPARENT DEADLOCK" problem.
U common/configuration/hibernate.cfg.xml
cvs tag -R v0-34-6 gratia
More 5/28/2008 13:00 CDT
uncomment WeeklyUsageByVO and comment out WeeklyUsageByVORanked (as intended in previous ci)
U common/configuration/create_build-stored-procedures-sql
Update DB version to trigger stored procedure reload.
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
Add collector_host to enable SSL operation. Configure ITB to run 5 threads.
Add conversion collector for gratia-fermi on 8884.
U common/configuration/collector-pro.dat
Need to cope with a previous incomplete upgrade (DB 34 did *not* reload
the triggers like it was supposed to IFF the existing DB version was
already 33).
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
cvs tag -F v0-34-6 common/configuration/create_build-stored-procedures-sql
cvs tag -F v0-34-6 collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
cvs tag -F v0-34-6 common/configuration/collector-pro.dat
5/29/2008 10:00 CDT Changes:
Summary table rebuild required to correct discrepancies with ResourceType filtering
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
Trigger filter logic from ResourceType should match summary table build.
U common/configuration/summary-procedures.sql
Expand the list of ResourceType values allowed in the summary table to Batch and RawCPU
U common/configuration/build-summary-tables.sql
cvs tag -R v0-34-7 gratia
6/3/2008 17:00 CDT Changes:
Remove (after verification) grouping from views and trigger reload of views.
gratia/common/configuration build-summary-view.sql
gratia/collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java,
Fix problem with old job data removal.
gratia/probe/common Gratia.py
More statistics when operating in verbose mode.
Only set Grid=Local if we're sure that the ClassAd attribute should have
been added by the JobManager (assuming it passed through same) but was
not (indicating a local job).
Improved logic testing whether we can rely on the check for the
JobManager in the case that the site is running Condor for managed fork
and PBS for standard batch jobs.
gratia/probe/condor condor_meter.pl
Turn verbose mode on by default and use DebugPrint.py
gratia/probe/condor condor_meter.cron.sh
Fix behavior upon receiving blank lines (would truncate input).
gratia/probe/common DebugPrint.py
Incorporate fixes for condor and common packages:
- Correct cleanup of no-longer-useful files in gratia/var/data.
- Improve DebugPrint.py in the case that input contains blank lines.
- Improve logic used in condor probe to decide whether we can use the absence of the GratiaJobOrigin ClassAd attribute to infer that a job is local.
- Condor probe is now verbose but prints to main Gratia log.
- Condor probe only assigns grid=Local to jobs it's really sure are local.
Bump version number to correct local permissions problem with DebugPrint.py.
gratia/probe/build gratia-probe.spec
cvs tag -R v0-34-8 gratia
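The notes do not say what caused the blank-line truncation in DebugPrint.py. One common way this class of bug arises is a read loop that treats an empty string as end of input. The sketch below is only an illustration of the safe pattern, and the function name `read_all_lines` is hypothetical.

```python
import io


def read_all_lines(stream):
    """Yield every input line, blank ones included, without the trailing newline."""
    # Iterating over the stream stops only at end-of-file. A loop such as
    # `while line.strip():` would instead stop at the first blank line.
    for line in stream:
        yield line.rstrip("\n")


sample = io.StringIO("first\n\nthird\n")
lines = list(read_all_lines(sample))
print(lines)
```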
6/6/2008 12:00 CDT Changes:
Fixed birt "Export Data": shows all columns (penelope)
gratia/reporting/gratia-reports/WebContent/reports/production-osg/WeeklyUsageByVO-ranked.rptdesign,
gratia/reporting/gratia-reports/WebContent/reports/production-osg/UsageBySiteByDate-ranked.rptdesign
gratia/reporting/gratia-reports/WebContent/reports/production-osg/2007WeeklyUsageByVO-ranked.rptdesign
First attempt to correct the parsing of the bundled records sent by replicator.
Also fix the content of the replication admin page (rowcount).
Clarify debug message listing the number of input messages and records
Change the envelope for bunching to be record type independent. Fix the test for adding it
U collector/gratia-services/net/sf/gratia/services/ChecksumUpgrader.java
U collector/gratia-services/net/sf/gratia/services/ListenerThread.java
U collector/gratia-services/net/sf/gratia/services/ReplicationDataPump.java
U collector/gratia-services/net/sf/gratia/storage/MetricRecordLoader.java
U collector/gratia-services/net/sf/gratia/storage/UsageRecordLoader.java
Updated to create a gratia-release file in the top level of every deployable
war file and in the gratia directory. The gratia-release file contains, for example:
Gratia release: v0.34.x
Build date: Thu Jun 5 14:40:53 CDT 2008
Build host: gratia06.fnal.gov
Build path: /home/weigand/cdcvs/gratia
Builder: uid=9789(weigand)
U gratia/build-scripts Makefile
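A gratia-release file in the format shown above can be read with a few lines of Python. This is a minimal sketch. The function name `parse_release_file` is hypothetical, and it assumes each line has the form "Key: value", with any later colons belonging to the value.

```python
def parse_release_file(text):
    """Parse 'Key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        info[key.strip()] = value.strip()
    return info


SAMPLE = """Gratia release: v0.34.x
Build date: Thu Jun 5 14:40:53 CDT 2008
Build host: gratia06.fnal.gov
Build path: /home/weigand/cdcvs/gratia
Builder: uid=9789(weigand)
"""

info = parse_release_file(SAMPLE)
print(info["Gratia release"], "built on", info["Build host"])
```

Splitting on the first colon only keeps the time in the build date intact.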
Added the service.admin.DN.0 attribute so manual modification is not
required on each install. This needs to be changed later to use a
VOMS fermigrid role.
U gratia/common/configuration collector-pro.dat,
Unique index on old md5 column is no longer needed after upgrade is complete.
Update TableStatistics triggers to not attempt to put a null value into a non-null column.
U collector/gratia-services/net/sf/gratia/services/ChecksumUpgrader.java
U collector/gratia-services/net/sf/gratia/storage/DatabaseMaintenance.java
U common/configuration/JobUsage.hbm.xml
U common/configuration/post-install.sh
Fix problem with history replay.
U collector/gratia-services/net/sf/gratia/services/ListenerThread.java
Probe changes not affecting collector:
U probe/build/gratia-probe.spec
U probe/common/GRAM/JobManagerGratia.pm
U probe/glexec/README
U probe/psacct/gratia-psacct
Save only the part of the bundle that describes the current record
U collector/gratia-services/net/sf/gratia/services/ListenerThread.java
cvs tag -R v0-34-10 gratia
6/10/2008 12:00 CDT
Changes:
Remove Grid from attributes taken into account for checksum.
New column for GridDescription.
Improve debug messages when handling duplicates.
Correct minor problem in warning message.
U collector/gratia-services/net/sf/gratia/services/ListenerThread.java
U collector/gratia-services/net/sf/gratia/storage/JobUsageRecord.java
U collector/gratia-services/net/sf/gratia/storage/UsageRecordLoader.java
U common/configuration/JobUsage.hbm.xml
- cvs tag -F v0-34-10 collector/gratia-services/net/sf/gratia/services/ListenerThread.java
- cvs tag -F v0-34-10 collector/gratia-services/net/sf/gratia/storage/JobUsageRecord.java
- cvs tag -F v0-34-10 collector/gratia-services/net/sf/gratia/storage/UsageRecordLoader.java
- cvs tag -F v0-34-10 common/configuration/JobUsage.hbm.xml
6/11/2008 09:00 CDT
Changes:
Refrain from overwriting the locally-generated ExtraXML attribute with
that received from an external source.
P collector/gratia-services/net/sf/gratia/services/ListenerThread.java
- cvs tag -F v0-34-10 collector/gratia-services/net/sf/gratia/services/ListenerThread.java
Status of checksum differences
Philippe and Chris have ascertained that the reason for the difference between the checksums is benign: the ExtraXML which is used in the checksum calculation is different due to legitimate differences in the make-up of the incoming record. The fact that there are no differences in checksum for anything received since 5/27 on gratiax33 is encouraging.
One change was made to ListenerThread.java. The previous behavior was to overwrite the ExtraXML produced from the parts of the record that the receiving server could not understand with the incoming ExtraXML from the replicated (or history) record. Now the locally generated ExtraXML is used and any ExtraXML already present in the incoming record is ignored.
Steps 3 through 6 were repeated for these modules.
Note: the export was done as yourself:
- cvs export -d gratia-v0.34.10 -r v0-34-10 gratia
and copied to the /home/gratia/gratia-releases area (as gratia user):
- cp -pr /home/weigand/cdcvs/gratia-v0.34.10 /home/gratia/gratia-releases/.
and built (as gratia user):
- cd /home/gratia/gratia-releases/gratia-v0.34.10/build-scripts
- source setup-jdk15.sh
- make release
and a copy of the tarballs made in the /home/gratia/gratia-releases directory (as gratia user):
- cp -p /home/gratia/gratia-releases/gratia-v0.34.10/target/*_v0.34.10.tar /home/gratia/gratia-releases/tarballs/.
and the copy to the release distribution area (as yourself):
- scp /home/gratia/gratia-releases/gratia-v0.34.10/target/*_v0.34.10.tar flxi07.fnal.gov:/afs/fnal.gov/files/expwww/gratia/html/Files/.
- updated the version number on the services release TWiki page
Status: Finally Completed v0.34.9 on 6/11/2008 09:00 CDT
Collectors and Databases Affected
The following Gratia collectors and databases will be converted with this release:
Shutdown and database backup.
The nightly backups will suffice for this implementation.
Small DB - Upgrade and implementation
The upgrades should be single-threaded, that is, performed for each database schema one at a time.
We will perform these upgrades based on the size of the individual database schema, in ascending order.
The DupRecord table (duplicates) should be removed before starting the upgrade.
The general procedure for this upgrade is:
- Stop the tomcat service
- service TOMCAT_INITD_SERVICE stop
- Install the new software on the Gratia tomcat instance:
- pswd=xxx
- source=/home/weigand/cdcvs/gratia-v0.34.10
- pgm=/home/weigand/cdcvs/gratia-v0.34.10/common/configuration/update-gratia-local
- $pgm -d $pswd -S $source -s DATABASE
- Tar the existing log files and save to a backup area. Then clean the log file directory.
- date=`date '+%Y%m%d'`
- cd TOMCAT_LOCATION/logs/
- tar zcf /data/gratia_tomcat_logs_backups/tomcat-TOMCAT_INITD_SERVICE.$date.tgz *
- rm -f *
- Effective with this release, there is an administrative login process that, by default, does not allow access to the admin functions. You will need to update the TOMCAT_LOCATION/gratia/service-configuration.properties file with the following attributes. (Note that if you forget to do this, the administrator can be added at any time without restarting the tomcat service):
- service.admin.DN.0=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal
- service.admin.DN.1=/DC=org/DC=doegrids/OU=People/CN=Christopher H. Green 851859
- service.admin.DN.2=/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
- service.admin.DN.3=/DC=org/DC=doegrids/OU=People/CN=Steven Timm 74183
- service.admin.DN.4=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/CN=UID:timm
- service.admin.DN.5=/DC=org/DC=doegrids/OU=People/CN=Dan Yocum 346615
- service.admin.DN.6=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Dan Yocum/CN=UID:yocum
- Start the tomcat service
- service TOMCAT_INITD_SERVICE start
When the tomcat service initializes, it will detect any schema changes and a conversion process will begin.
- Tail the log files in TOMCAT_LOCATION/logs. When the conversion process completes, the log will show the following message: "INFO: Server startup in xxx ms"
- In your browser, access the gratia-administrative web service, select the System / Administration menu option in the left menu, then scroll down to the Starting/Stopping Database Update Services section and select the Stop Update Services link.
- Wait until the schema upgrade is complete, then start the Gratia update services for the database schema just upgraded.
- In your browser, connect to the Gratia administrative services url for each of the databases.
- Select the System / Administration menu option in the left menu
- Then scroll down to the Starting/Stopping Database Update Services section and select the Start Update Services link.
- As the collector/tomcat host update service is started, monitor the tomcat logs files for any errors.
- Bring up the gratia administration web interface and verify that the collectors are processing the data on the status page.
- Bring up the gratia reporting web interface and verify that the reports look reasonable while still tail'ing the log files.
- Run the static reports cron as root and verify these reports are generated:
- 3 pdf files and 3 csv files
- If satisfied, you may continue to the next one.
Did not have to do the stop/start of the update service, as those services never got started.
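The log archiving step in the procedure above (tar the logs under a dated name, then empty the directory) can also be scripted. A minimal sketch in Python, assuming only regular files need to be archived; the function name `archive_logs` and the temporary directories in the demonstration are hypothetical stand-ins for TOMCAT_LOCATION/logs and the backup area.

```python
import os
import tarfile
import tempfile
import time


def archive_logs(log_dir, backup_dir, service, stamp=None):
    """Tar and gzip the regular files in log_dir, then remove the originals."""
    stamp = stamp or time.strftime("%Y%m%d")
    archive = os.path.join(backup_dir, "tomcat-%s.%s.tgz" % (service, stamp))
    names = sorted(n for n in os.listdir(log_dir)
                   if os.path.isfile(os.path.join(log_dir, n)))
    with tarfile.open(archive, "w:gz") as tar:
        for name in names:
            tar.add(os.path.join(log_dir, name), arcname=name)
    # Delete only after the archive has been written and closed.
    for name in names:
        os.remove(os.path.join(log_dir, name))
    return archive


# Demonstration in throwaway directories.
logs = tempfile.mkdtemp()
backups = tempfile.mkdtemp()
with open(os.path.join(logs, "catalina.out"), "w") as handle:
    handle.write("sample log line\n")
archive_path = archive_logs(logs, backups, "example", stamp="20080612")
print(archive_path)
```

Removing the files only after the archive is closed means a failed tar step leaves the original logs in place.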
Status v0.34 through v0.34.7
Issues/Problems:
- No host certificate, so administration functions are disabled. Will have to wait until Monday to request host and http certificates from Steve.
- Installed host certificate on 5/12
- Static reports not working on fermi_transfer and osg_transfer. These databases are completely empty (zero records) and this is causing the problem.
- The update-gratia-local script adds a cron entry for static reports every time it is run. You have to manually edit the crontab to remove the old entries.
- osg_daily catalina.out logs
- May 18, 2008 2:33:26 PM org.eclipse.birt.chart.script.ScriptHandler register
WARNING: An invalid Chart Java event handler class [class net.sf.gratia.reporting.GratiaBirtReportEH] has been detected and ignored.
- Reporting menu is entirely different.
- should the static reports even be run for these
- gratia_psacct
- Appears to be hung in the post-install.sh summary . Killed tomcat around 5/19 07:47 with an explicit kill and restart. It is hung again in the same step.
- fermi_osg
- Encountered this error with index12 :
FINE: Command: Error: alter table ProbeDetails_Meta add unique index index12(md5) :
com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException:
Duplicate entry 'E6A2B4DC7D3C910EF4D584CAEB6F10FA' for key 2
- These are the alter table statements in gratia-0.log:
380:FINE: Executing: alter table MetricRecord_Meta add unique index index12(md5)
382:FINE: Command: OK: alter table MetricRecord_Meta add unique index index12(md5)
388:FINE: Executing: alter table ProbeDetails_Meta add unique index index12(md5)
390:FINE: Command: Error: alter table ProbeDetails_Meta add unique index index12(md5) :
com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException:
Duplicate entry 'E6A2B4DC7D3C910EF4D584CAEB6F10FA' for key 2
470:FINE: Executing: alter table ProbeDetails_Meta add unique index index12(md5)
472:FINE: Command: Error: alter table ProbeDetails_Meta add unique index index12(md5) :
com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException:
Duplicate entry 'E6A2B4DC7D3C910EF4D584CAEB6F10FA' for key 2
590:FINE: Executing: alter table ProbeDetails_Meta add unique index index12(md5)
592:FINE: Command: Error: alter table ProbeDetails_Meta add unique index index12(md5) :
com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException:
Duplicate entry 'E6A2B4DC7D3C910EF4D584CAEB6F10FA' for key 2
618:FINE: Executing: alter table JobUsageRecord_Meta add index index17(md5v2
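MySQL rejects a unique index when the column already holds repeated values, which is what the "Duplicate entry" error above reports. A minimal sketch in Python of a pre-check that lists repeated checksums before the index is added; the function name `duplicated_checksums` and the sample column contents are hypothetical, and only the first checksum value comes from the log. The same check could be written in SQL with GROUP BY and HAVING COUNT(*) > 1.

```python
from collections import Counter


def duplicated_checksums(md5_values):
    """Return {checksum: count} for every checksum that occurs more than once."""
    counts = Counter(md5_values)
    return {md5: n for md5, n in counts.items() if n > 1}


# Hypothetical column contents; only the first value comes from the log above.
column = [
    "E6A2B4DC7D3C910EF4D584CAEB6F10FA",
    "00000000000000000000000000000001",
    "E6A2B4DC7D3C910EF4D584CAEB6F10FA",
]
dups = duplicated_checksums(column)
print(dups)
```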
- osg_daily
- 5/20 08:00 noticed java exceptions
FINEST: fixDuplicatesOnce: resolving 3 duplicates
with checksum C7E72574A5A891270BF2F6ECA1127A10
May 20, 2008 9:12:45 AM net.sf.gratia.util.Logging debug
FINEST: fixDuplicatesOnce: deleting record 54646 in favor of record 54704
May 20, 2008 9:12:45 AM net.sf.gratia.util.Logging warning
WARNING: fixDuplicatesOnce: caught exception resolving duplicates
with checksum C7E72574A5A891270BF2F6ECA1127A10
com.mysql.jdbc.exceptions.MySQLSyntaxErrorException:
PROCEDURE gratia_osg_daily.del_JUR_from_summary does not exist
- Philippe reports that the checksum calculation is wrong for this instance and is looking into it
- stopped the instance 5/20 09:12
- gratia_itb
- hibernate got in a deadlock state. Recycle on 5/20 09:08. Waiting to see if it occurs again.
2008-05-19 08:22:50,075 com.mchange.v2.async.ThreadPoolAsynchronousRunner [WARN]:
com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@e39068
-- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!
2008-05-19 08:22:50,091 com.mchange.v2.async.ThreadPoolAsynchronousRunner [WARN]:
com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@e39068
-- APPARENT DEADLOCK!!! Complete Status:
Managed Threads: 3
Active Threads: 3
Active Tasks:
2008-05-20 10:16:50,019 com.mchange.v2.async.ThreadPoolAsynchronousRunner [WARN]:
com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@d7f3b9
-- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!
2008-05-20 10:16:50,048 com.mchange.v2.async.ThreadPoolAsynchronousRunner [WARN]:
com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@d7f3b9
-- APPARENT DEADLOCK!!! Complete Status:
Managed Threads: 3
Active Threads: 3
Active Tasks:
Notes
- The tmp directory being used by MySql on gratia06 is defined on the command line and not by the my.cnf value
gratia 6415 14500 42 Apr25 ? 10-01:17:45
/fnal/ups/prd/mysql/v5_0_22/Linux-2-4/libexec/mysqld
--basedir=/fnal/ups/prd/mysql/v5_0_22/Linux-2-4
--datadir=/data/mysqldb/ --user=gratia
--pid-file=/data/mysqldb//gratia06.fnal.gov.pid --skip-locking
--port=3320 --socket=/data/mysqldb/mysql.sock
--tmpdir=/data/mysqltmp/tmp
- gratia_psacct
- Had to perform an InnoDB conversion in the beginning. (Start 12:59 Completed: 13:09, elapsed 01:11)
- Then it had to create the new table for the JobUsageRecord_Meta table column. (Start 13:11 Complete:14:26 elapsed 01:15)
- Adding the index (Start 15:21 ). No clue when it ended
- At 15:46 running gratia/build-summary-tables.sql
- fermi_osg
- Creating the JobUsageRecord_Meta table column and index (Start 5/19 09:32 Completed: 5/19 11:57 Elapsed: 02:25)
- Summary table creation (Start: 11:57 Completed: Elapsed: )
- Philippe changed the creation of the summary tables ( build-summary-tables.sql )
- This was to eliminate glexec records from the tables due to a large number of bad records in the main database when glexec first fired up.
- Ran from gratia-vm02:/data/tomcat-greenc_test area against the gratia07:/test/data/mysqldb/gratia database
- Started 5/19 14:40 Completed 5/20 04:52 Elapsed: 12:12
- Reduced time by 4 hours
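The elapsed times in these notes were worked out by hand. A small Python helper can do the same arithmetic, including across midnight. This is a sketch: the function name `elapsed_hhmm` is hypothetical, and the year is supplied only so the timestamps parse as full dates.

```python
from datetime import datetime


def elapsed_hhmm(start, end, year=2008):
    """Return the time between two 'M/D HH:MM' stamps as 'HH:MM'."""
    fmt = "%Y/%m/%d %H:%M"
    t0 = datetime.strptime("%d/%s" % (year, start), fmt)
    t1 = datetime.strptime("%d/%s" % (year, end), fmt)
    minutes = int((t1 - t0).total_seconds()) // 60
    return "%02d:%02d" % (minutes // 60, minutes % 60)


# The fermi_osg column and index creation times recorded above.
print(elapsed_hhmm("5/19 09:32", "5/19 11:57"))
```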
Certificates on gratia08 and gratia09
- Installed pacman
- Installed VDT CA-Certificates
- mkdir /etc/grid-security
- mkdir /usr/local/vdt-certificates.1101
- cd /usr/local/vdt-certificates.1101
- pacman -get VDT:CA-Certificates
- (answer yes to all questions and "r (root)" when asked where you want to install them). This puts them directly in /etc/grid-security.
- Enable the VDT services (crons in this release) to maintain the certificates and crls
- source /usr/local/vdt-certificates.1101/setup.sh
- vdt-control --on
Status v0.34.9
Since changes have been made to the checksum calculation and these instances have already been upgraded for the new md5 (md5v2) value, the upgrade to v0.34.9 will require the following. This is so a new checksum (md5v2) value is calculated.
The general procedure for this upgrade is:
- Stop the tomcat service
- service TOMCAT_INITD_SERVICE stop
- Tar the existing log files and save to a backup area. Then clean the log file directory.
- date=`date '+%Y%m%d'`
- cd TOMCAT_LOCATION/logs/
- tar zcf /data/gratia_tomcat_logs_backups/tomcat-TOMCAT_INITD_SERVICE.$date.tgz *
- rm -f *
- In MySql client ( as root ), we need to force the recalculation of the new checksum values:
- show index from JobUsageRecord_Meta;
- alter table JobUsageRecord_Meta drop index index17;
(index17 should be the md5v2 unique index. If not, drop the one that is.)
- update JobUsageRecord_Meta set md5v2 = null;
- delete from SystemProplist where car = 'gratia.database.disableChecksumUpgrade';
- Install the new software on the Gratia tomcat instance:
- pswd=xxx
- source=/home/gratia/gratia-releases/gratia-v0.34.10
- pgm=/home/gratia/gratia-releases/gratia-v0.34.10/common/configuration/update-gratia-local
- $pgm -d $pswd -S $source -s DATABASE
- Effective with this release, there is an administrative login process that, by default, does not allow access to the admin functions. You will need to update the TOMCAT_LOCATION/gratia/service-configuration.properties file with the following attributes. (Note that if you forget to do this, the administrator can be added at any time without restarting the tomcat service):
- service.admin.DN.0=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal
- service.admin.DN.1=/DC=org/DC=doegrids/OU=People/CN=Christopher H. Green 851859
- service.admin.DN.2=/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
- service.admin.DN.3=/DC=org/DC=doegrids/OU=People/CN=Steven Timm 74183
- service.admin.DN.4=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/CN=UID:timm
- service.admin.DN.5=/DC=org/DC=doegrids/OU=People/CN=Dan Yocum 346615
- service.admin.DN.6=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Dan Yocum/CN=UID:yocum
- AND change these attributes as well to either gratia-fermi.fnal.gov (gratia08) or gratia.opensciencegrid.org (gratia09). Note : the port number should be correct, but double-check it.
- service.open.connection=http://localhost:8884
- service.secure.connection=https://localhost:8860
- Start the tomcat service
- service TOMCAT_INITD_SERVICE start
When the tomcat service initializes, it will detect any schema changes and a conversion process will begin.
- Tail the log files in TOMCAT_LOCATION/logs. When the conversion process completes, the log will show the following message: "INFO: Server startup in xxx ms"
- In your browser, access the gratia-administrative web service, select the System / Administration menu option in the left menu, then scroll down to the Starting/Stopping Database Update Services section and select the Stop Update Services link.
- Wait until the schema upgrade is complete, then start the Gratia update services for the database schema just upgraded.
- In your browser, connect to the Gratia administrative services url for each of the databases.
- Select the System / Administration menu option in the left menu
- Then scroll down to the Starting/Stopping Database Update Services section and select the Start Update Services link.
- As the collector/tomcat host update service is started, monitor the tomcat logs files for any errors.
- Bring up the gratia administration web interface and verify that the collectors are processing the data on the status page.
- Bring up the gratia reporting web interface and verify that the reports look reasonable while still tail'ing the log files.
- Run the static reports cron as root and verify these reports are generated:
- 3 pdf files and 3 csv files
- If satisfied, you may continue to the next one.
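The index-drop step in the procedure above says to check that index17 really is the unique index on md5v2 and, if not, to drop the one that is. A minimal sketch in Python of that check, working on rows shaped like the output of MySQL's SHOW INDEX (the Key_name, Column_name and Non_unique columns). The function name `unique_index_on` and the sample rows are hypothetical, and the sketch only prints the statement rather than running it.

```python
def unique_index_on(rows, column):
    """Return the names of unique indexes that cover `column`.

    `rows` mimics the result of MySQL's SHOW INDEX: one dict per indexed
    column with the Key_name, Column_name and Non_unique fields.
    """
    return sorted({
        row["Key_name"]
        for row in rows
        if row["Column_name"] == column and int(row["Non_unique"]) == 0
    })


# Hypothetical SHOW INDEX output for JobUsageRecord_Meta.
rows = [
    {"Key_name": "PRIMARY", "Column_name": "dbid", "Non_unique": 0},
    {"Key_name": "index12", "Column_name": "md5", "Non_unique": 1},
    {"Key_name": "index17", "Column_name": "md5v2", "Non_unique": 0},
]
for name in unique_index_on(rows, "md5v2"):
    print("alter table JobUsageRecord_Meta drop index %s;" % name)
```

Looking the index up by column and uniqueness, rather than by name, guards against instances where the index numbering differs.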
gratia08 services
tomcat-qcd
Started 06/12 12:46 Complete:6/12 17:24 Elapsed: 04:38
Jun 12, 2008 5:19:13 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: main duplicate resolution phase complete in 1 iterations.
INFO: ChecksumUpgrader: resolved duplicates for a total of 65 duplicated checksums.
FINEST: fixDuplicatesOnce: starting duplicate resolution cycle
FINEST: fixDuplicatesOnce: this cycle detected 0 duplicated checksums
INFO: ChecksumUpgrader: make index on md5v2 unique (could take some time)
Jun 12, 2008 5:21:56 PM net.sf.gratia.util.Logging log
FINE: Executing: alter table JobUsageRecord_Meta drop index index12
Jun 12, 2008 5:24:01 PM net.sf.gratia.util.Logging log
FINE: Command: OK: alter table JobUsageRecord_Meta drop index index12
INFO: ChecksumUpgrader: reactivating listener threads
FINE: ListenerThread: ListenerThread: 0:Duplicate Check: true
tomcat-psacct
-rw-rw---- 1 gratia gratia 209715200 Jun 13 23:29 MasterSummaryData.ibd
-rw-rw---- 1 gratia gratia 155189248 Jun 14 10:48 NodeSummary.ibd
Jun 13, 2008 11:28:09 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-summary-tables.sql ... OK
Jun 13, 2008 11:28:10 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-summary-view.sql ... OK
2008-06-13 22:01:21,483 org.hibernate.tool.hbm2ddl.SchemaUpdate [INFO]: schema update complete
Restarted on 6/15
Jun 15, 2008 1:47:00 PM net.sf.gratia.util.Logging log
FINE: DatabaseMaintenance: calling post-install script for action "summary"
Jun 15, 2008 2:23:39 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-summary-tables.sql ... OK
Jun 15, 2008 2:23:39 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-summary-view.sql ... OK
Jun 17, 2008 7:27:11 PM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-ps/gratia/build-ps-node-summary-table.sql ... OK
Jun 17, 2008 7:27:11 PM net.sf.gratia.util.Logging log
FINE: exitValue: 0
Jun 17, 2008 7:27:11 PM net.sf.gratia.util.Logging log
FINE: Executing: select count(*) from SystemProplist where car = "gratia.database.summaryTableVersion"
Jun 17, 2008 7:27:11 PM net.sf.gratia.util.Logging log
FINE: Command: Error: select count(*) from SystemProplist where car = "gratia.database.summaryTableVersion" : com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
java.io.EOFException
6/18 08:30 - The above error occurred because of the timeout on the connection. The
SystemProplist table has to be updated manually for this entry:
use gratia_psacct;
update SystemProplist set cdr=38 where car = "gratia.database.summaryTableVersion";
6/18 08:35 - restarted tomcat-psacct service
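The manual SystemProplist fix above is an "update, or insert if missing" operation. The following is a rough sketch of that logic only. It uses an in-memory SQLite database as a stand-in for the gratia_psacct MySQL database, the column types are assumed, and the helper name `set_system_prop` is hypothetical.

```python
import sqlite3

def set_system_prop(conn, car, cdr):
    """Update the SystemProplist row for `car`, inserting it if it is missing."""
    cur = conn.execute("update SystemProplist set cdr = ? where car = ?", (cdr, car))
    if cur.rowcount == 0:
        conn.execute("insert into SystemProplist(car, cdr) values (?, ?)", (car, cdr))
    conn.commit()

# Stand-in database; the real table lives in MySQL and its column types
# are an assumption here.
conn = sqlite3.connect(":memory:")
conn.execute("create table SystemProplist (car text, cdr text)")
conn.execute(
    "insert into SystemProplist values ('gratia.database.summaryTableVersion', '33')"
)

set_system_prop(conn, "gratia.database.summaryTableVersion", "38")
row = conn.execute(
    "select cdr from SystemProplist where car = ?",
    ("gratia.database.summaryTableVersion",),
).fetchone()
print(row[0])
```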
6/16 13:05 - Philippe questioned why the summary tables had to be rebuilt
Chris' reply:
For the first 0.34 upgrade: if the upgrade "failed" due to the connection timeout
during summary table production, the DB entry should have been updated.
If that didn't happen, that's where the problem is.
- summaryTableVersion is 33.
- Table reload required at v36
tomcat-fermi_osg
-rw-rw---- 1 gratia gratia 98304 Jun 14 12:48 SystemProplist.ibd
-rw-rw---- 1 gratia gratia 12582912 Jun 14 12:38 MasterSummaryData.ibd
(appears no more updates occurred against the summary table as all are glexec records)
6/18 10:00 - The checksum calculation is doing one of its resolutions and appears to be 10 hours into this query. Do not do anything against the gratia06 database until this completes. It appears to be done entirely in memory. Output of show processlist:
| fermi_osg | Execute | 36353 | Sending data | select md5v2, count(*) as `Count` from JobUsageRecord_Meta where md5v2 is not null group by md5v2 .. |
6/18 13:20
FINEST: fixDuplicatesOnce: this cycle detected 0 duplicated checksums
Jun 18, 2008 1:16:51 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: main duplicate resolution phase complete in 1 iterations.
Jun 18, 2008 1:16:51 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: resolved duplicates for a total of 34 duplicated checksums.
Jun 18, 2008 1:16:51 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: final pass on duplicates
Jun 18, 2008 1:16:51 PM net.sf.gratia.util.Logging debug
FINEST: fixDuplicatesOnce: starting duplicate resolution cycle
Jun 18, 2008 1:20:36 PM net.sf.gratia.util.Logging debug
FINEST: fixDuplicatesOnce: this cycle detected 0 duplicated checksums
Jun 18, 2008 1:20:36 PM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: make index on md5v2 unique (could take some time)
gratia09 services
osg_transfer
Got this stacktrace shortly after startup:
In catalina.out:
java.rmi.ServerException: RemoteException occurred in server thread; nested exception is:
java.rmi.AccessException: Registry.Registry.rebind disallowed; origin /131.225.107.75 is non-local host
In gratia-0.log:
Jun 19, 2008 7:41:42 AM net.sf.gratia.util.Logging warning
WARNING: java.rmi.server.ExportException: Port already in use: 21000; nested exception is:
java.net.BindException: Address already in use
A look at collector-pro.dat shows:
- gratia09 collectors:
gratia => { http_port => 8880, rmi_port => 17000,
itb => { http_port => 8881, rmi_port => 18000,
osg_daily => { http_port => 8884, rmi_port => 20000,
osg_integration => { http_port => 8885, rmi_port => 21000,
osg_transfer => { http_port => 8886, rmi_port => 21000,
- gratia08 collectors:
fermi_osg => { http_port => 8880, rmi_port => 17000,
fermi_itb => { http_port => 8881, rmi_port => 18000,
ps => { http_port => 8882, rmi_port => 19000,
qcd => { http_port => 8883, rmi_port => 21000,
fermi_transfer => { http_port => 8886, rmi_port => 22000,
conversion => { http_port => 8884, rmi_port => 20000,
- The osg_transfer is the only one with conflicting ports. Changing manually to 22000 in /data/tomcat-osg_transfer and in CVS.
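A conflict like this one can be caught before startup by checking the port assignments mechanically. The sketch below assumes the gratia09 entries from collector-pro.dat have been copied by hand into a Python dictionary. The function name `port_conflicts` is illustrative and nothing here parses the real file.

```python
from collections import defaultdict

def port_conflicts(collectors, key):
    """Map each port used by more than one collector to the collector names."""
    by_port = defaultdict(list)
    for name, ports in collectors.items():
        by_port[ports[key]].append(name)
    return {port: names for port, names in by_port.items() if len(names) > 1}

# gratia09 values as listed in collector-pro.dat above (transcribed by hand).
gratia09 = {
    "gratia": {"http_port": 8880, "rmi_port": 17000},
    "itb": {"http_port": 8881, "rmi_port": 18000},
    "osg_daily": {"http_port": 8884, "rmi_port": 20000},
    "osg_integration": {"http_port": 8885, "rmi_port": 21000},
    "osg_transfer": {"http_port": 8886, "rmi_port": 21000},
}

print(port_conflicts(gratia09, "rmi_port"))
print(port_conflicts(gratia09, "http_port"))
```

The check is per host, since collectors on gratia08 and gratia09 can reuse the same port numbers without clashing.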
6/19 08:30
- Now got this error on startup in catalina.out:
java.rmi.ServerException: RemoteException occurred in server thread; nested exception is: java.rmi.AccessException: Registry.Registry.rebind disallowed; origin /131.225.107.75 is non-local host
- caused by this alias in the *service-configuration.properties file
service.rmi.rmibind=//gratia-transfer.opensciencegrid.org:22000
service.rmi.rmilookup=rmi://gratia-transfer.opensciencegrid.org:22000
- these were changed as well
monitor.subject=gratia-transfer.opensciencegrid.org:8886
service.open.connection=http://gratia-transfer.opensciencegrid.org:8886
service.secure.connection=https://gratia-transfer.opensciencegrid.org:8840
- Changed to gratia.opensciencegrid.org. Changing manually in /data/tomcat-osg_transfer and in CVS.
Done at 6/19 09:00
osg_daily
No problems.
Jun 19, 2008 11:53:42 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: make index on md5v2 unique (could take some time)
Jun 19, 2008 11:53:55 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: drop index on old md5 column
FINE: Command: OK: alter table JobUsageRecord_Meta drop index index12
Jun 19, 2008 11:54:03 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: reactivating listener threads
Jun 19, 2008 11:54:04 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: checksum upgrade operation complete
osg_integration
alter table JobUsageRecord_Meta drop index index17; elapsed: 00:10
update JobUsageRecord_Meta set md5v2 = null; Started 11:17 Ended: 11:19
Jun 19, 2008 11:23:50 AM net.sf.gratia.util.Logging log
FINE: Executing: alter table JobUsageRecord_Meta add index index17(md5v2)
Jun 19, 2008 11:27:19 AM net.sf.gratia.util.Logging info
INFO: ChecksumUpgrader: updating all checksums in JobUsageRecord_Meta
gratia_itb
Post-mortem
At this time, this will appear to be random notes. After the conversion, they may be organized.
Main gratia DB - Upgrade to v0.34.9
The main Gratia database (gratia) served by the gratia09:/data/tomcat-gratia instance will require a very long conversion time. So to maximize availability we will:
- create a current copy of the gratia06/gratia database on gratia07
- using a temporary Gratia collector on gratia08:/data/tomcat-conversion
- this collector will be upgraded to v0.34.9 allowing the new summary table and md5v2 checksums to be calculated.
- we will then replicate from gratia06 to gratia07 to "catch up" the database before we switch the production gratia09:/data/tomcat-gratia over to using it.
- switch the current gratia09:/data/tomcat-gratia to use the gratia07/gratia database
- perform the upgrade of the gratia06/gratia database to v0.34.9 using the Gratia collector on gratia08:/data/tomcat-conversion .
- While the checksum upgrades are occurring, we will replicate from gratia07 to gratia06 so the "catch up" time to make it current is reduced.
- when the upgrade is complete (inclusive of the conversion to the new md5 key), we will switch the gratia09:/data/tomcat-gratia instance back to using the gratia06/gratia database
During the upgrade period, this will be the tomcat/database environment for the gratia database:
| Tomcat instance | Database instance |
| gratia09:/data/tomcat-gratia port 8880 | gratia07:/data/mysqldb/gratia port 3320 |
| gratia08:/data/tomcat-conversion port 8884 | gratia06:/data/mysqldb/gratia port 3320 |
The above is a high level summary of the process. The details are contained in subsequent sections.
Create a copy of the current database on gratia07
On 5/28, the gratia07:/data/mysqldb was re-partitioned into a single 200GB area to provide the required space for the gratia database.
Create the gratia07:gratia database using the ZRM backup
This is in process now (5/30) and a finalized procedure will be included when it completes.
Done - 6/3/08 08:00 CDT
Configure gratia08:/data/tomcat-conversion to use the gratia07:gratia database
On gratia08, edit the /data/tomcat-conversion/gratia/service-configuration.properties file for this resulting attribute:
- service.mysql.url=jdbc:mysql://gratia07.fnal.gov:3320/gratia
Done - 6/3/08 12:37 CDT (John Weigand)
Re-confirmed - 6/11/08 10:30 CDT (John Weigand)
Upgrade gratia08:/data/tomcat-conversion to v0-34-9
On gratia08:
- Stop the tomcat service
- service tomcat-conversion stop
- Done - 6/11/08 14:48 CDT
On gratia07:
- Disable the checksum upgrade on the gratia07/gratia database. In a MySQL client:
- update SystemProplist set cdr=1 where car='gratia.database.disableChecksumUpgrade';
- Done - 6/11/08 14:51 CDT
Back on gratia08:
- Perform the upgrade to v0.34.9
- pswd=xxx
- source=/home/gratia/gratia-releases/gratia-v0.34.9
- pgm=/home/gratia/gratia-releases/gratia-v0.34.9/common/configuration/update-gratia-local
- $pgm -d $pswd -S $source -s conversion
- Effective with this release, there is an administrative login process that, by default, does not allow access to the admin functions. You will need to update the TOMCAT_LOCATION/gratia/service-configuration.properties file with the following attributes. You will also need to change 2 connection attributes. (Note that if you forget to do this, the administrator can be added at any time without restarting the tomcat service):
- service.admin.DN.0=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal
- service.admin.DN.1=/DC=org/DC=doegrids/OU=People/CN=Christopher H. Green 851859
- service.admin.DN.2=/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
- service.admin.DN.3=/DC=org/DC=doegrids/OU=People/CN=Steven Timm 74183
- service.admin.DN.4=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/CN=UID:timm
- service.admin.DN.5=/DC=org/DC=doegrids/OU=People/CN=Dan Yocum 346615
- service.admin.DN.6=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Dan Yocum/CN=UID:yocum
-
- service.open.connection=http://gratia-fermi.fnal.gov:8884
- service.secure.connection=https://gratia-fermi.fnal.gov:8860
- IMPORTANT IMPORTANT... THESE MUST BE CHANGED
- service.mysql.url=jdbc:mysql://gratia07.fnal.gov:3320/gratia in the service-configuration.properties file
- in post-install.sh, change the mysql client execution from gratia-db01 to gratia07.
- Start the tomcat service
- service tomcat-conversion start
Done - 6/11/08 14:55 CDT
The first attempt ran against the production database: I did not realize that the collector-pro.dat file set the mysql url to gratia_db01:3320/gratia. Chris bailed me out and confirmed that no damage had occurred. Chris changed the collector-pro.dat entry for the conversion instance to garbage data, so that future releases force a manual change of the file. I will add this to the upgrade instructions for the gratia09:tomcat-gratia.
Re-Done - 6/11/08 15:21 CDT
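A pre-start check of the database URL would have flagged the first attempt against production. The following is only a sketch under assumptions: it treats service-configuration.properties as plain key=value lines, and the helper names `parse_properties` and `mysql_host` are hypothetical, not part of the Gratia tooling.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

def mysql_host(props):
    """Extract the host from a jdbc:mysql://host:port/db URL."""
    url = props["service.mysql.url"]
    prefix = "jdbc:mysql://"
    if not url.startswith(prefix):
        raise ValueError("unexpected service.mysql.url: " + url)
    return url[len(prefix):].split("/", 1)[0].split(":", 1)[0]

# Sample content; in practice this would be read from the instance's
# service-configuration.properties file.
sample = """
service.mysql.url=jdbc:mysql://gratia07.fnal.gov:3320/gratia
service.open.connection=http://gratia-fermi.fnal.gov:8884
"""

expected = "gratia07.fnal.gov"
host = mysql_host(parse_properties(sample))
print("OK" if host == expected else "STOP: collector points at " + host)
```

The same check would need a counterpart for post-install.sh, which names the database host separately.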
First thing it does is add the GridDescription to the JobUsageRecord_Meta table.
In MySQL client (show processlist;):
| Id | User | Host | db | Command | Time | State | Info |
| 2786 | gratia | gratia08.fnal.gov:11262 | gratia | Query | 57891 | copy to tmp table | alter table JobUsageRecord_Meta add column GridDescription varchar(255) |
Started: 6/11 15:20 CDT
Status as of 6/12 07:27 CDT
- -rw-rw---- 1 gratia gratia 23G Jun 5 09:08 JobUsageRecord_Meta.ibd
- -rw-rw---- 1 gratia gratia 14K Jun 11 15:20 #sql-3275_ae2.frm
- -rw-rw---- 1 gratia gratia 16G Jun 12 07:27 #sql-3275_ae2.ibd
GridDescription add to JobUsageRecord_Meta table
- Started - 6/11 15:21 CDT Ended - 6/12 22:30 CDT Elapsed - 31:09
md5v2 and index add to JobUsageRecord_Meta
- Started - 6/12 23:26 CDT Ended - 6/14 10:07 CDT Elapsed - 34:41
MasterSummary table creation (started tomcat-conversion service)
- Started - 6/14 10:45 CDT Ended - 6/12 CDT Elapsed
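The elapsed times recorded above can be reproduced from the start and end stamps. This is a small sketch, assuming all stamps fall in 2008 and share one time zone. The function name `elapsed` is illustrative.

```python
from datetime import datetime

def elapsed(start, end, year=2008):
    """Elapsed time between 'M/D HH:MM' stamps as 'HH:MM' (hours may exceed 24)."""
    fmt = "%Y %m/%d %H:%M"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    minutes = int((t1 - t0).total_seconds() // 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

print(elapsed("6/11 15:21", "6/12 22:30"))  # GridDescription column add
print(elapsed("6/12 23:26", "6/14 10:07"))  # md5v2 column and index add
```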
Jun 14, 2008 10:55:21 AM net.sf.gratia.util.Logging log
FINE: Executing: /data/tomcat-conversion/gratia/post-install.sh summary
ABOVE ONE WAS AGAINST THE gratia06 database as the post-install.sh script needed to be changed too
Jun 14, 2008 11:21:24 AM net.sf.gratia.util.Logging log
FINE: Executing: /data/tomcat-conversion/gratia/post-install.sh summary
Jun 15, 2008 3:01:31 AM net.sf.gratia.util.Logging log
FINE: ERROR - ERROR 1206 (HY000) at line 45: The total number of locks exceeds the lock table size
In the gratia07.fnal.gov.err file....
080615 2:47:28 InnoDB: WARNING: over 67 percent of the buffer pool is occupied by
InnoDB: lock heaps or the adaptive hash index! Check that your
InnoDB: transactions do not set too many row locks.
InnoDB: Your buffer pool size is 512 MB. Maybe you should make
InnoDB: the buffer pool bigger?
InnoDB: Starting the InnoDB Monitor to print diagnostics, including
InnoDB: lock heap and hash index sizes.
Jun 15, 2008 13:30
Increased the my.cnf variable:
innodb_buffer_pool_size = 512M
to
innodb_buffer_pool_size = 768M
also stopped mysql_test instance... restarted mysql ... restarted tomcat-conversion
Jun 15, 2008 1:34:57 PM net.sf.gratia.util.Logging log
FINE: Executing: /data/tomcat-conversion/gratia/post-install.sh summary
Jun 16, 2008 9:27:50 AM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-conversion/gratia/build-summary-tables.sql ... OK
Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging log
FINE: OUTPUT>post-install.sh: loading /data/tomcat-conversion/gratia/build-summary-view.sql ... OK
Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging log
FINE: exitValue: 0
Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging log
FINE: Executing: select count(*) from SystemProplist where car = "gratia.database.summaryTableVersion"
Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging log
FINE: Command: Error: select count(*) from SystemProplist where car = "gratia.database.summaryTableVersion" : com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
** BEGIN NESTED EXCEPTION **
java.io.EOFException
Apparently connection timed-out around 6/16 09:27. The summary table and views were updated successfully. Just failed in the update of the SystemProplist table entries. Chris updated them manually. Then we recycled the tomcat-conversion service on gratia08.
Jun 16, 2008 9:27:51 AM net.sf.gratia.util.Logging warning
WARNING: Command: Error: insert into SystemProplist(car,cdr) values("gratia.database.summaryTableVersion", "38")
root 7218 1 20 10:05 ? 00:32:35 /data/jre/bin/java
WAIT until the summary tables have been created. Then you can proceed to the next step of "catching up" the gratia07 database.
Bring the gratia07 database up to current
Since the gratia07 database is based on a nightly backup and production has been collecting data from the CE nodes, it needs to be brought up to current by replicating from gratia06 to gratia07 using the administrative services Replication menu.
- on http://gratia.opensciencegrid.org:8880/gratia-administration
- both Job Usage and Metric replication services should be activated to replicate to:
- http://gratia-fermi.fnal.gov:8884
- The starting dbid should be determined (once only, after the backup is restored) using the mysql client on gratia07
- select max(dbid) from JobUsageRecord_Meta;
- select max(dbid) from MetricRecord_Meta;
Partially done - 6/3/08 13:45 CDT (John Weigand)
select max(dbid) from JobUsageRecord_Meta;
+-----------+
| max(dbid) |
+-----------+
| 75242300 |
+-----------+
select max(dbid) from MetricRecord_Meta;
+-----------+
| max(dbid) |
+-----------+
| 1506 |
+-----------+
On gratia06:
update Replication set dbid=75242300 where openconnection='http://gratia-fermi.fnal.gov:8884' and replicationid=52;
update Replication set dbid=1506 where openconnection='http://gratia-fermi.fnal.gov:8884' and replicationid=54;
Starting point (approx 6/3/08 14:00 CDT)
| gratia06 max(dbid) | 75,851,594 |
| gratia07 max(dbid) | 75,242,300 |
| catch up | 609,294 |
Catch up complete around 6/3/08 23:00 CDT
| hour UTC | updates |
| 080604 12 | 7715 |
| 080604 11 | 8224 |
| 080604 10 | 8435 |
| 080604 09 | 6962 |
| 080604 08 | 7869 |
| 080604 07 | 8005 |
| 080604 06 | 7753 |
| 080604 05 | 6874 |
| 080604 04 | 7944 |
| 080604 03 | 9629 |
| 080604 02 | 14588 |
| 080604 01 | 92283 |
| 080604 00 | 89794 |
| 080603 23 | 87849 |
| 080603 22 | 70699 |
| 080603 21 | 89476 |
| 080603 20 | 89201 |
| 080603 19 | 32873 |
2nd round of "catchup" statistics
- started at 6/16 10:10
- 2M records to process
- at 6/16 12:46 these are the counts.
| Period | JobUsage | Metric |
| 1 hour ago | 43173 | 0 |
| 2 hours ago | 35046 | 0 |
| 3 hours ago | 7590 | 25 |
Wait until the gratia07 database has caught up to the gratia06 database. You can monitor this by:
- on gratia06
- select max(dbid) from JobUsageRecord_Meta;
- select max(dbid) from MetricRecord_Meta;
- on gratia07
- select max(dbid) from JobUsageRecord_Meta;
- select max(dbid) from MetricRecord_Meta;
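Comparing the two max(dbid) values amounts to a simple subtraction. The sketch below shows it with the starting-point numbers recorded on 6/3/08. The function names and the `threshold` cut-off are assumptions for illustration, and since the same record may not carry the same dbid on both hosts, the result is only an approximate indicator.

```python
def replication_lag(source_max_dbid, target_max_dbid):
    """Approximate number of records the target still has to receive."""
    return source_max_dbid - target_max_dbid

def caught_up(source_max_dbid, target_max_dbid, threshold=1000):
    """True when the lag is at or below `threshold` (an arbitrary cut-off)."""
    return replication_lag(source_max_dbid, target_max_dbid) <= threshold

# Starting point on 6/3/08: gratia06 vs gratia07 max(dbid) in JobUsageRecord_Meta.
print(replication_lag(75851594, 75242300))
print(caught_up(75851594, 75242300))
```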
When they are close, stop the administrative update service on
The replication services should still be running to ensure we "catch up" the last of the updates on gratia06 to gratia07. The replication activity shall be complete when both replication data pumps (as shown by gratia-0.log on gratia08) have been through at least one check-duplicate cycle without uploading records.
Then, turn off both the Job Usage and Metric replication services on:
Take note of max(dbid) in JobUsageRecord_Meta and MetricRecord_Meta on gratia07, as these numbers will be used to set up replication from gratia07 back to gratia06 while checksums are being upgraded on gratia06.
Update the Replication services on gratia06 to gratia07
On gratia06, if there are Job Usage and Metric replication services active, we will need to create entries in the gratia07 instance as this will shortly become our production database.
When creating these entries, we will need to ensure that we start with the last record replicated from gratia06, which may not have the same dbid on gratia06 as it did on gratia07.
Instructions for doing this will be provided later if this needs to be performed for this release.
This should be done before we switch the gratia09:/data/tomcat-gratia to use the gratia07/gratia database.
The entries should be added and only a test that the entry is correct made at this time. We will set the starting dbid later and activate the entry after we switch the tomcat collectors.
Stop gratia09:/data/tomcat-gratia and gratia08:/data/tomcat-conversion
Stop each collector:
- on gratia08
- service tomcat-conversion stop
- on gratia09
- service tomcat-gratia stop
Upgrade the current gratia09:/data/tomcat-gratia to v0.34.9
REMEMBER:
- on gratia07 my.cnf, we need to eliminate the binary log files.... NO, BACKUPS WILL NOT WORK W/O BINARY LOGS... AT LEAST SO FAR.
- on gratia06 my.cnf, we need to change the buffer size as we did on gratia07. Also need to eliminate the binary log files. This will mean stopping all collectors and bouncing the mysql. BE CAREFUL THAT BACKUPS HAVE COMPLETED.
- backups on gratia07
On gratia09:
- Perform the upgrade to v0.34.9
- pswd=xxx
- source=/home/gratia/gratia-releases/gratia-v0.34.9
- pgm=/home/gratia/gratia-releases/gratia-v0.34.9/common/configuration/update-gratia-local
- $pgm -d $pswd -S $source -s gratia
- Effective with this release, there is an administrative login process that, by default, does not allow access to the admin functions. You will need to update the /data/tomcat-gratia/gratia/service-configuration.properties file with the following attributes. You will also need to change 2 connection attributes. (Note that if you forget to do this, the administrator can be added at any time without restarting the tomcat service):
- service.admin.DN.0=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal
- service.admin.DN.1=/DC=org/DC=doegrids/OU=People/CN=Christopher H. Green 851859
- service.admin.DN.2=/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
- service.admin.DN.3=/DC=org/DC=doegrids/OU=People/CN=Steven Timm 74183
- service.admin.DN.4=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/CN=UID:timm
- service.admin.DN.5=/DC=org/DC=doegrids/OU=People/CN=Dan Yocum 346615
- service.admin.DN.6=/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Dan Yocum/CN=UID:yocum
-
- service.open.connection=http://gratia.opensciencegrid.org:8880
- service.secure.connection=https://gratia.opensciencegrid.org:8845
- Change the post-install.sh script to point to gratia07.fnal.gov
Started 6/24 08:17 Done 6/24 08:24
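Since the same administrator list is entered by hand for each collector, the numbered property lines could be generated from a single list. This is an illustrative sketch only: the function name `admin_dn_properties` is hypothetical, and it simply follows the zero-based numbering shown in the list above.

```python
def admin_dn_properties(dns):
    """Render a list of certificate DNs as numbered service.admin.DN.N lines."""
    return [f"service.admin.DN.{i}={dn}" for i, dn in enumerate(dns)]

# Two of the DNs listed above, used as sample input.
admins = [
    "/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Philippe G. Canal/CN=UID:pcanal",
    "/DC=org/DC=doegrids/OU=People/CN=John Weigand 458491",
]

for line in admin_dn_properties(admins):
    print(line)
```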
Switch gratia09:/data/tomcat-gratia to use the gratia07/gratia database
Update the service-configuration.properties file to change the database that the production instance points to:
IMPORTANT IMPORTANT... THIS MUST BE CHANGED
- In gratia09:/data/tomcat-gratia/gratia/service-configuration.properties
- service.mysql.url=jdbc:mysql://gratia07.fnal.gov:3320/gratia
- Done 6/24 08:24
- In gratia09:/data/tomcat-gratia/gratia/post-install.sh, change the mysql client execution from gratia-db01 to gratia07.
- Start the gratia09 production collector only (it will be using the gratia07 database)
- service tomcat-gratia start
At this point we have production gratia reporting and services back on-line and available. Verify there are no problems by tailing the log files in gratia09:/data/tomcat-gratia/logs.
Done 6/24 08:30
Problems:
- catalina.out exceptions
command: select T.SiteName, P.currenttime, P.probename from Probe P, Site T where P.active = 1 and T.siteid = P.siteid order by T.SiteName,P.currenttime desc
java.lang.ArrayIndexOutOfBoundsException
- gratia-rmi-servlet0.log - coming every 10 minutes. Chris thinks from a probe.
INFO: RMIHandlerServlet: Error: Problematic req: org.apache.catalina.connector.RequestFacade@564a38
Jun 24, 2008 2:29:25 PM net.sf.gratia.util.Logging info
- A v0.34.9a was installed (not put in CVS) at 11:35am to troubleshoot the problem.
- A new v0.34.a installed (11:47 - 11:51) with problem fixed hopefully
- Static reports appear to be truncating... should be landscape to fit in or scaled differently.
If all is well, we can proceed to the v0.34.9 upgrade of the gratia06:gratia database.
Backups
We will need to take backups of the gratia database on gratia07, and also stop the backups of the gratia database on gratia06.
Crons
This section can be delayed depending on the time of day.
Cron processes during upgrade
There are several processes (cron) that will need to be copied/modified on gratia06 and gratia09 for reports and interfaces.
- gratia06 - user gratia
- APEL-WLCG interface
dir=/home/gratia/interfaces/apel-lcg; cd $dir; ./lcg.sh --config=lcg.conf --date=current --update
This cron process can remain on gratia06. The only modification required is to the /home/gratia/interfaces/apel-lcg/lcg-db.conf file for the following parameter:
- GratiaHost gratia07.fnal.gov
Done - 6/24 13:19
- OSG daily reports
/home/gratia/gratia-summary/gratiaSum.cron.sh
/home/gratia/gratia-summary/daily_mail_cron.sh
/home/gratia/gratia-summary/range_mail_cron.sh --weekly
/home/gratia/gratia-summary/range_mail_cron.sh --monthly
/home/gratia/gratia-summary/rungratia.sh --production
- CMS weekly reports
/home/gratia/gratia-summary/cms_mail_cron.sh --weekly
Change in /home/gratia/gratia-summary/PSACCTReport.py the
gMySQLConnectString = " -h gratia-db01.fnal.gov -u reader --port=3320 --password=reader "
Done - 6/24 13:20
- gratia06 - user root
Database backups on gratia06 will need to be changed to exclude the gratia database. This includes the purging of log files by the zrm process.
/usr/local/bin/gratia_backup_cron.sh
/usr/bin/mysql-zrm --action purge
Done 06/24 12:43 - commented the gratia database backup.
However, could not really stop the purging. So copied them to /backup/mysqldb/gratia/zrm.20080624.backup to preserve them.
Done 6/24 13:05
- gratia09 - user root
The static reports cron on gratia09 does not