Gratia Release v1.08.0 (July 15, 2011)

Overview

The main purpose of this release (v1.08.0) is to enhance the database schema and to add monitoring of the health of the whole Gratia ecosystem.

Main Features:

  • Introduced support for monitoring and displaying any potential backlog in any of the Gratia components (see the probe and collector details below).
  • Increased the range of the dbid for the raw data (from a 32-bit to a 64-bit integer), virtually removing any size limitation (besides the limitations of the database engine or hardware).
  • Completely revamped the gathering of information about the status and performance of the Collector (see details below).

Notices

  • The first restart will be slow due to the significant database schema upgrade required.
  • Duplicate detection on SE, CE, and Subcluster records was replaced with "transition detection" at the request of HCC and OSG (we do not actually want duplicate detection for these records, only detection of transitions).

Report Improvements

  • Static reports were upgraded so that they no longer produce truncated PDFs.
  • Daily pre-generated reports are now PDF files.
  • Upgraded to BIRT 2.6.1

Probe Improvements

  • Added the ability to delegate the lifetime of transient input files to the Gratia core library so that they are deleted only once the record has been properly saved and backed up; this fixes the file accumulation problem seen with the Condor probe, which now uses this feature.
  • Added upload to the Collector of information about the amount of data a probe still needs to process or upload.
  • Added an interface (GratiaCore::RegisterEstimatedServiceBacklog) to let the probe tell the system how much work is left to do (individual probes still need to be updated to take advantage of this feature); a usage sketch follows this list.
  • Updated GratiaCore to upload both the amount of data remaining in its local queue and the estimated service backlog.
  • Added checks on the number of inodes and the disk space used by the urCollector part of the PBS/LSF probe (to prevent overflow).
  • Added extra status reporting.
  • Added support for HTPC jobs.
  • Added support for CREAM (by using blahp.log to provide certificate information).
  • Repaired the SGE probe.
  • Merged the GlideInWMS probe into the Condor probe.
  • Added support in the PBS probe for ‘array jobs’.
  • Added support for Torque installations that store old accounting logs in .gz files.
  • Removed the dependency on SQLAlchemy and setuptools.
  • Added support for older Python versions (back to 2.3.4).
  • Several fixes for the dCache probe.
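
A minimal sketch of how a probe might use the backlog and transient-file features above, assuming the standard Gratia Python probe library is importable as Gratia; RegisterEstimatedServiceBacklog is the call named in these notes, while the other calls follow the usual probe pattern and AddTransientInputFile in particular is an assumption based on the description above:

    # Minimal probe sketch (illustrative only): RegisterEstimatedServiceBacklog is
    # named in these release notes; AddTransientInputFile and the record calls are
    # assumptions following the usual Gratia probe pattern.
    import Gratia

    Gratia.Initialize("ProbeConfig")      # read the probe configuration

    # Report how much work is still pending on the service side
    # (for example, the number of batch jobs not yet turned into records).
    Gratia.RegisterEstimatedServiceBacklog(250)

    r = Gratia.UsageRecord()
    r.LocalJobId("12345")
    r.WallDuration(3600)

    # Delegate the lifetime of the transient input file to Gratia so that it is
    # deleted only once the record has been properly saved and backed up.
    r.AddTransientInputFile("/var/lib/gratia/data/job.12345.log")

    Gratia.Send(r)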

Collector Improvements

  • General improvement in the response time of the administration page and of enabling/disabling Collector features.
  • Housekeeping is now automatically paused if the number of records in any of the queues reaches the value of 'max.housekeeping.nrecords' in the configuration file. The housekeeping service is restarted once the number of records in every queue drops back below 'min.housekeeping.nrecords' (see the example after this list).
  • Increased the range of the dbid for the raw data (from a 32-bit to a 64-bit integer)
    • The Collector schema will be automatically updated the first time the Collector is restarted; this involves (essentially) a complete rewrite of all the data tables and can therefore be quite time consuming.
  • Improved XML error handling when parsing the incoming data (i.e. eliminating potential data loss when a non-standard record is uploaded).
  • New administration page listing the performance of the Collector, containing:
    • current uploading rate
    • current incoming rate
    • recovery estimate
    • recovery estimate including estimated backlog on the service side
    • housekeeping status and recovery rate
  • New administration pages detailing the current backlog content:
    • Backlog of any probe or collector sending information to the local Collector.
    • Historical information on those backlogs
    • Details on which probes still have data files in the local queue
  • Improved the information available on the Status page and added a collector-status.html page reporting the state of the sub-systems. See the documentation page for details.
  • Greatly improved response time and stability of the Status page.
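
As an illustration of the housekeeping thresholds mentioned above, the two properties are set in the Collector configuration file; the property names come from this release, while the values shown here are purely illustrative:

#
# Housekeeping queue thresholds (values below are illustrative only)
#
# Pause housekeeping when any record queue reaches this many records.
max.housekeeping.nrecords = 1000000
#
# Resume housekeeping once every queue drops back below this many records.
min.housekeeping.nrecords = 500000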

Technical information

  • The file name used to stage incoming data in the Collector's thread directory now encodes the name of the origin and the number of records (i.e. job#randomnumber#.#origin#.#numberOfRecords#.xml, where the origin is the sending collector or probe with all ':' and '/' characters replaced by underscores; for example, a hypothetical origin of condor:probehost.fnal.gov would yield a file such as job987654321.condor_probehost.fnal.gov.250.xml).
  • The way the Collector gathers and displays the statistics and performance information has been completely revamped.
    • The Status and MonitorStatus administration pages have been updated to take advantage of the new tables and are now fast even under heavy load.
    • Four tables now participate: TableStatistics, TableStatisticsSnapshots, TableStatisticsHourly, TableStatisticsDaily
      • TableStatistics contains, for each record type (JobUsage, Metric, etc.), the number of records currently in the database and the number of records that have been processed (even if they have since been deleted). For each, the table records not only the number of good records but also the number of duplicates and the number of ‘erroneous’ records (the ‘Qualifier’). The main new feature in this table is the addition of the ‘lifetime’ counts alongside the current counts.
      • TableStatisticsSnapshots contains a copy of the content of TableStatistics taken every five minutes (customizable) and kept for a day (default).
      • TableStatisticsHourly contains a summary of the snapshots taken during a given hour. For each entry (current/lifetime, RecordType, Qualifier), the table records the exact start and end time it covers and the minimum, maximum and average number of records during the period. The default is to keep those hourly summaries for one year.
      • TableStatisticsDaily contains a summary of the snapshots taken during a given day. For each entry (current/lifetime, RecordType, Qualifier), the table records the exact start and end time it covers and the minimum, maximum and average number of records during the period. The default is to keep those daily summaries indefinitely.
  • A new class, TableStatisticsManager, is introduced to manage the lifetime of the rows in the new statistics tables and to do/schedule the copying and summarizing. Two new MySQL functions are also introduced for this purpose: table_statictics_hourly_summary and table_statictics_daily_summary.
  • The Collector now gathers and displays current and historical information about its backlog and the backlogs of its providers.
    • New backlog-detail administration pages have been created to take advantage of the new tables.
    • Four tables now participate: BacklogStatistics, BacklogStatisticsSnapshots, BacklogStatisticsHourly, BacklogStatisticsDaily
      • BacklogStatistics contains, for each provider, the number of records and XML files (and tar files) currently in its Gratia queue, as well as an estimate of the amount of service information (for example, the number of batch jobs) the probe still needs to process. In addition, the table records the bundle size used by the provider to upload its records and the maximum size allowed for its queue.
      • BacklogStatisticsSnapshots contains a copy of the content of BacklogStatistics taken every fifteen minutes (customizable) and kept for a day (default).
      • BacklogStatisticsHourly contains a summary of the snapshots taken during a given hour. For each entry (EntityType/Name), the table records the exact start and end time it covers and the minimum, maximum and average number of records during the period. The default is to keep those hourly summaries for one year.
      • BacklogStatisticsDaily contains a summary of the snapshots taken during a given day. For each entry (EntityType/Name), the table records the exact start and end time it covers and the minimum, maximum and average number of records during the period. The default is to keep those daily summaries indefinitely.
  • A new class, BacklogStatisticsManager, is introduced to manage the lifetime of the rows in the new statistics tables and to do/schedule the copying and summarizing. Two new MySQL functions are also introduced for this purpose: backlog_statictics_hourly_summary and backlog_statictics_daily_summary.
  • These two new sets of features are customizable via the following configuration entries:
#
# Table Statistics history 
#
# frequency of the snapshots, 0 disable the history recording.
# In number of minutes between snapshots
#
tableStatistics.snapshots.wait = 5
#
# How long to keep the individual snap shots of the table statistics
#
service.lifetime.TableStatisticsSnapshots = 1 day
#
# How long to keep the hourly summary of the table statistics
#
service.lifetime.TableStatisticsHourly = 1 year
#
# How long to keep the daily summary of the table statistics
#
service.lifetime.TableStatisticsDaily = UNLIMITED
#
# input queue control - whether or not to monitor queue sizes.
#
#
# Backlog Statistics history 
#
# frequency of the snapshots, 0 disable the history recording.
# In number of minutes between snapshots
#
backlogStatistics.snapshots.wait = 15
#
# How long to keep the individual snap shots of the backlog statistics
#
service.lifetime.BacklogStatisticsSnapshots = 1 day
#
# How long to keep the hourly summary of the backlog statistics
#
service.lifetime.BacklogStatisticsHourly = 1 year
#
# How long to keep the daily summary of the backlog statistics
#
service.lifetime.BacklogStatisticsDaily = UNLIMITED
#
# input queue control - whether or not to monitor queue sizes.
#

Anticipated downtime

It is expected that this release will require the Gratia services and reporting to be unavailable during the following window:
  • Start: July 15, 2011 hh:mm CST
  • Available: July 15, 2011 hh:mm CST

The changes affecting downtime for this release are:

  1. Length of time to make a backup of the database to the backup area using mysqlhotcopy
  2. Installation and validation on the 6 Gratia schemas

Collectors and Databases Affected

The following Gratia collectors and databases will be converted with this release:

Schema | MySQL host:port | Collector URL | Collector host | Size (bytes) | Size (rows)
fermi_itb | gratia06.fnal.gov:3320 | gratia-fermi.fnal.gov:8881 | gratia08.fnal.gov | 868K | 1
fermi_osg | gratia06.fnal.gov:3320 | gratia-fermi.fnal.gov:8880 | gratia08.fnal.gov | 5.0G | 2,554,484
gratia | gratia06.fnal.gov:3320 | gratia.opensciencegrid.org:8880 | gratia09.fnal.gov | 60.0G | 30,226,750
gratia_itb | gratia06.fnal.gov:3320 | gratia.opensciencegrid.org:8881 | gratia09.fnal.gov | 9.2G | 2,796,812
gratia_osg_integration | gratia06.fnal.gov:3320 | gratia.opensciencegrid.org:8885 | gratia09.fnal.gov | 3.5G | 905,703
gratia_qcd | gratia06.fnal.gov:3320 | gratia-fermi.fnal.gov:8883 | gratia08.fnal.gov | 2.2G | 817,080


The following Gratia collectors and databases will NOT be converted with this release. These repositories contain specialized reports that have not yet been upgraded to the new BIRT v2.2 software. However, they will be taken out of service while the other databases are being updated.

Schema | MySQL host:port | Collector URL | Collector host | Size (bytes) | Size (rows)
gratia_psacct | gratia06.fnal.gov:3320 | gratia08.fnal.gov:8882 | gratia08.fnal.gov | 6.6G | 4,144,690
gratia_osg_daily | gratia06.fnal.gov:3320 | gratia.opensciencegrid.org:8884 | gratia09.fnal.gov | 65.0M | 56,300

Build v1.08.0 for distribution

  1. Make sure your build area contains all committed changes.
    • svn status
  2. In gratia/build-scripts/Makefile , change the version_default to:
    • version_default = v1.09
    • commit the change
  3. Tag the release (for all committed changes); a command sketch is given after this list.
  4. Into a new area, export the tagged release (see the sketch after this list).
  5. Build it for the release (this ensures that tar files are produced for VDT):
    • cd gratia-v1.08.0/build-scripts
    • source setup-jdk15.sh
    • make release
  6. Copy the built tar files to the release area:
    • scp ../target/*_v1.08.0.tar flxi07.fnal.gov:/afs/fnal.gov/files/expwww/gratia/html/Files/
  7. Update the version number on the services release TWiki page:
    • Edit and update the TWiki variable ReleaseVersion.
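
For steps 3 and 4 above, commands along the following lines can be used; the repository URL is a placeholder, not the actual Gratia Subversion location:

    • svn copy <REPO_URL>/trunk <REPO_URL>/tags/gratia-v1.08.0 -m "Tag the gratia v1.08.0 release"
    • svn export <REPO_URL>/tags/gratia-v1.08.0 gratia-v1.08.0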

Shutdown and database backup

On the tomcat/collector nodes

  1. Comment out the root user cron entries for the static reports:

    On gratia08 :

    42 0 * * * '/data/tomcat-fermi_itb/gratia/staticReports.py' '/data/tomcat-fermi_itb' 'http://gratia-fermi.fnal.gov:8881/gratia-reporting/' 
    42 0 * * * '/data/tomcat-qcd/gratia/staticReports.py' '/data/tomcat-qcd' 'http://gratia-fermi.fnal.gov:8883/gratia-reporting/' >/dev/null 
    42 0 * * * '/data/tomcat-fermi_osg/gratia/staticReports.py' '/data/tomcat-fermi_osg' 'http://gratia-fermi.fnal.gov:8880/gratia-reporting/' 
    

    On gratia09 :

    42 0 * * * '/data/tomcat-osg_integration/gratia/staticReports.py' '/data/tomcat-osg_integration' 'http://gratia.opensciencegrid.org:8885/gratia-reporting/' 
    42 0 * * * '/data/tomcat-itb/gratia/staticReports.py' '/data/tomcat-itb' 'http://gratia.opensciencegrid.org:8881/gratia-reporting/' 
    42 0 * * * '/data/tomcat-gratia/gratia/staticReports.py' '/data/tomcat-gratia' 'http://gratia.opensciencegrid.org:8880/gratia-reporting/' 
    
  2. Disable init.d services as root user:

    On gratia08:

    • chkconfig tomcat-ps off
    • chkconfig tomcat-qcd off
    • chkconfig tomcat-fermi_osg off
    • chkconfig tomcat-fermi_itb off

    On gratia09 :

    • chkconfig tomcat-gratia off
    • chkconfig tomcat-osg_daily off
    • chkconfig tomcat-osg_integration off
    • chkconfig tomcat-itb off

  3. For ALL of the Gratia collectors (including those not being upgraded), stop the Gratia update services:
    • In your browser, connect to the Gratia administrative services url for each of the databases.
    • Select the System / Administration menu option in the left menu
    • Then scroll down to the Starting/Stopping Database Update Services section and select the Stop Update Services link.

    Note: Effective with Gratia v0.31, this step is not necessary. For any pre-v0.31 installation, this step is still necessary.

  4. Stop the tomcat init.d service for ALL Gratia collectors.

    On gratia08:

    • service tomcat-fermi_itb stop
    • service tomcat-fermi_osg stop
    • service tomcat-ps stop
    • service tomcat-qcd stop

    On gratia09 :

    • service tomcat-gratia stop
    • service tomcat-itb stop
    • service tomcat-osg_daily stop
    • service tomcat-osg_integration stop

  5. This is optional, but it is probably a good idea to save off the logs under each tomcat instance and empty the log directory; it will make it easier to catch any errors that may occur (of course there won't be any, but it is a good idea anyway). This can be performed while the database backups are running.

    On gratia09 :

    • date=`date '+%Y%m%d'`
    • cd /data/tomcat-osg_integration/logs
    • tar zcf /data/gratia_tomcat_logs_backups/tomcat-osg_integration.$date.tgz *
    • rm -f *
    • cd /data/tomcat-itb/logs/
    • tar zcf /data/gratia_tomcat_logs_backups/tomcat-itb.$date.tgz *
    • rm -f *
    • cd /data/tomcat-gratia/logs/
    • tar zcf /data/gratia_tomcat_logs_backups/tomcat-gratia.$date.tgz *
    • rm -f *

    On gratia08 :

    • date=`date '+%Y%m%d'`
    • cd /data/tomcat-fermi_itb/logs/
    • tar zcf /data/gratia_tomcat_logs_backups/tomcat-fermi_itb.$date.tgz *
    • rm -f *
    • cd /data/tomcat-qcd/logs/
    • tar zcf /data/gratia_tomcat_logs_backups/tomcat-qcd.$date.tgz *
    • rm -f *
    • cd /data/tomcat-fermi_osg/logs/
    • tar zcf /data/gratia_tomcat_logs_backups/tomcat-fermi_osg.$date.tgz *
    • rm -f *

On the MySQL server node (gratia06)

  1. Comment out the following cron entries:

    The gratia user cron jobs:

    0 0 1-15 * *   dir=/home/gratia/interfaces/apel-lcg; cd $dir; ./lcg.sh --config=lcg.conf --date=previous --update
    30 01 * * *   dir=/home/gratia/interfaces/apel-lcg; cd $dir; ./lcg.sh --config=lcg.conf --date=current --update
    

    The root user cron entry:

    43 2 * * * /usr/local/bin/mysqlhotcopy_cron.sh > /var/log/mysqlhotcopy.out 2>&1
    
  2. Take a backup of the database instances (this will include all schemas) using part of the mysqlhotcopy_cron script, run from the command line as the root user.

    The backup will only be done to /backup/mysqldb on gratia06 in order to reduce the downtime.

    • mysqlhotcopy -p lisp01 --addtodest fermi_itb /backup/mysqldb
    • mysqlhotcopy -p lisp01 --addtodest gratia_osg_daily /backup/mysqldb
    • mysqlhotcopy -p lisp01 --addtodest gratia_qcd /backup/mysqldb
    • mysqlhotcopy -p lisp01 --addtodest gratia_osg_integration /backup/mysqldb
    • mysqlhotcopy -p lisp01 --addtodest fermi_osg /backup/mysqldb
    • mysqlhotcopy -p lisp01 --addtodest gratia_psacct /backup/mysqldb
    • mysqlhotcopy -p lisp01 --addtodest gratia_itb /backup/mysqldb
    • mysqlhotcopy -p lisp01 --addtodest gratia /backup/mysqldb

    Backup times:

    Schema | Expected Duration | Actual Duration
    fermi_itb | 1 sec |
    gratia_osg_daily | 6 secs |
    gratia_qcd | 64 secs |
    gratia_osg_integration | 99 secs |
    fermi_osg | 132 secs |
    gratia_psacct | 179 secs |
    gratia_itb | 461 secs (8 min) |
    gratia | 2900 secs (48+ min) |

Upgrade and implementation

The upgrades should be single-threaded, that is, performed for each database schema one at a time.

We will perform these upgrades in ascending order of the size of the individual database schemas (i.e. starting with fermi_itb and finishing with gratia).

On the tomcat/collector nodes

  1. Install the new software on a Gratia tomcat instance:
    pswd=xxx
    source=/home/weigand/cdcvs/gratia-v1.08.0
    pgm=/home/weigand/cdcvs/gratia-v1.08.0/common/configuration/update-gratia-local
    On gratia09:
    • $pgm -d $pswd -S $source osg_integration
    • $pgm -d $pswd -S $source itb
    • $pgm -d $pswd -S $source gratia

    On gratia08:
    • $pgm -d $pswd -S $source fermi_itb
    • $pgm -d $pswd -S $source qcd
    • $pgm -d $pswd -S $source fermi_osg

  2. Start the gratia tomcat services:
    • service <tomcat service> start
    • When the tomcat service initializes, it will detect that schema changes are required and a conversion process will begin.
    • Tail the catalina.out log (e.g. /data/tomcat-<instance>/logs/catalina.out). When the conversion process completes, the log will show the following message:
      "INFO: Server startup in xxx ms"

  3. Then, start the Gratia update services for the database schema just upgraded.
    • In your browser, connect to the Gratia administrative services url for each of the databases.
    • Select the System / Administration menu option in the left menu
    • Then scroll down to the Starting/Stopping Database Update Services section and select the Start Update Services link.
    • As each collector/tomcat host update service is started, monitor the tomcat log files for any errors.
    • Bring up the Gratia administration web interface and verify that the collectors are processing the data.
    • Bring up the Gratia reporting web interface and verify that the reports look reasonable while still tailing the log files.

  4. If all looks good, stop the Gratia update services for the database schema just upgraded.

  5. Run the static reports cron as root and verify these reports are generated:
    • ./gratia-reports/reports-static/UsageByVOByDate-ranked.pdf
    • ./gratia-reports/reports-static/UsageBySiteByDate-ranked.pdf
    • ./gratia-reports/reports-static/WeeklyUsageByVO-ranked.pdf

    On gratia08 :

    42 0 * * * '/data/tomcat-fermi_itb/gratia/staticReports.py' '/data/tomcat-fermi_itb' 'http://gratia-fermi.fnal.gov:8881/gratia-reporting/' 
    42 0 * * * '/data/tomcat-qcd/gratia/staticReports.py' '/data/tomcat-qcd' 'http://gratia-fermi.fnal.gov:8883/gratia-reporting/' 
    42 0 * * * '/data/tomcat-fermi_osg/gratia/staticReports.py' '/data/tomcat-fermi_osg' 'http://gratia-fermi.fnal.gov:8880/gratia-reporting/' 
    

    On gratia09 :

    42 0 * * * '/data/tomcat-osg_integration/gratia/staticReports.py' '/data/tomcat-osg_integration' 'http://gratia.opensciencegrid.org:8885/gratia-reporting/' 
    42 0 * * * '/data/tomcat-itb/gratia/staticReports.py' '/data/tomcat-itb' 'http://gratia.opensciencegrid.org:8881/gratia-reporting/' 
    42 0 * * * '/data/tomcat-gratia/gratia/staticReports.py' '/data/tomcat-gratia' 'http://gratia.opensciencegrid.org:8880/gratia-reporting/' 
    
  6. If satisfied, stop the tomcat service for that database schema so you can proceed to the next one.
    • service <tomcat service> stop

Activating the new release

On the tomcat/collector nodes

  1. Start the tomcat service for the Gratia collectors.

    On gratia08:

    • service tomcat-fermi_itb start
    • service tomcat-fermi_osg start
    • service tomcat-ps start
    • service tomcat-qcd start

    On gratia09:

    • service tomcat-gratia start
    • service tomcat-itb start
    • service tomcat-osg_daily start
    • service tomcat-osg_integration start
  2. As each collector/tomcat host service is started, monitor the tomcat log files for any errors.

  3. Bring up the gratia administrative web interface and verify that the collectors are processing the data.
    If they are not processing any data, verify the Gratia update services are active.
    • In your browser, connect to the Gratia administrative services url for each of the databases.
    • Select the System / Administration menu option in the left menu
    • Then scroll down to the Starting/Stopping Database Update Services section and view the status.
  4. Enable init.d services for all tomcats as root user:

    On gratia08:

    • chkconfig tomcat-ps on
    • chkconfig tomcat-qcd on
    • chkconfig tomcat-fermi_osg on
    • chkconfig tomcat-fermi_itb on

    On gratia09:

    • chkconfig tomcat-gratia on
    • chkconfig tomcat-osg_daily on
    • chkconfig tomcat-osg_integration on
    • chkconfig tomcat-itb on
  5. Uncomment/verify the root user cron entry for the static reports:

    On gratia08 :

    42 0 * * * '/data/tomcat-fermi_itb/gratia/staticReports.py' '/data/tomcat-fermi_itb' 'http://gratia-fermi.fnal.gov:8881/gratia-reporting/' 
    42 0 * * * '/data/tomcat-qcd/gratia/staticReports.py' '/data/tomcat-qcd' 'http://gratia-fermi.fnal.gov:8883/gratia-reporting/' 
    42 0 * * * '/data/tomcat-fermi_osg/gratia/staticReports.py' '/data/tomcat-fermi_osg' 'http://gratia-fermi.fnal.gov:8880/gratia-reporting/' 
    

    On gratia09 :

    42 0 * * * '/data/tomcat-osg_integration/gratia/staticReports.py' '/data/tomcat-osg_integration' 'http://gratia.opensciencegrid.org:8885/gratia-reporting/' 
    42 0 * * * '/data/tomcat-itb/gratia/staticReports.py' '/data/tomcat-itb' 'http://gratia.opensciencegrid.org:8881/gratia-reporting/' 
    42 0 * * * '/data/tomcat-gratia/gratia/staticReports.py' '/data/tomcat-gratia' 'http://gratia.opensciencegrid.org:8880/gratia-reporting/' 
    

On the MySQL server node (gratia06)

  1. Uncomment the root user cron job(s) for gratia backups
    43 2 * * * /usr/local/bin/mysqlhotcopy_cron.sh > /var/log/mysqlhotcopy.out 2>&1
    
  2. Uncomment the gratia user cron job(s).
    0 0 1-15 * *   dir=/home/gratia/interfaces/apel-lcg; cd $dir; ./lcg.sh --config=lcg.conf --date=previous --update
    30 01 * * *   dir=/home/gratia/interfaces/apel-lcg; cd $dir; ./lcg.sh --config=lcg.conf --date=current --update
    

Post-mortem

For now this section collects random notes; after the conversion they may be organized.

Major updates
