Mon ALISA

Why would you use the optional MonALISA service?

MonALISA gives options for monitoring running jobs on a CE. Although it has been made optional and is known to be heavy on resource use, some VOs still like to have it running on the resources they support. At the moment there is no other batch-system-agnostic monitoring system that can graphically show the number of jobs per vo both running and queued on a CE.

The MonALISA (Monitoring Agents in A Large Integrated Services Architecture) system provides a distributed service for monitoring, control and global optimization of complex systems. It is based on an ensemble of autonomous multi-threaded, agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring and decision tasks in large scale distributed applications, and to be discovered and used by other services or clients that require such information.

Known Problems

Installing the MonALISA software package

MonALISA is downloaded in the CE node distribution but not automatically activated.

Configuring MonALISA

During the OSG CE configuration ($VDT_LOCATION/monitoring/configure-osg.py), you will be asked if you wish to activate/run MonALISA and population of any additional information that MonALISA might need will be requested in that script. The VDT configure_monalisa script will be run internally at that time. So, you should never run the configure_monalisa script independently.

The MonALISA configuration files

From the OSG perspective, all changes to the 'real' MonALISA configuration files should be done using the configure-osg.py script. This section is provided only as general information about these files.

Additional information about these variables and other MonALISA configuration variables can be viewed at the MonALISA web site under the "Documentation" menu item.

The variables/parameters that are updated by the configure-osg.py script are noted in each section and should not be updated manually.

ml_env

The file for the global configuration is $VDT_LOCATION/MonaLisa/Service/CMD/ml_env. Some key configuration parameters in this file are:

These variables are updated by configure-osg.py

MONALISA_USER The name of the user that is running the service. The daemon account is the default and should be satisfactory for most installations. MonALISA should never be run under the root account. The /etc/init.d/MLD script will su to MONALISA_USER before starting.
SHOULD_UPDATE This parameter tells MonALISA to check for updates when it is started. (OSG recommends this option be set to false.) When MonALISA is started:
%u20AC If "true", it will first check for updates, install them and then start MonALISA.
%u20AC If "false" it will not check for updates before starting.
MonaLisa_HOME Path to your MonALISA installation directory.
FARM_NAME The name for your farm (as entered in the configuration above). (e.g FARM_NAME="OSG_INSTALL_TEST"). FARM_NAME in MonALISA and OSG OSG Resource Name are synonomous.

These variables are not updated by configure-osg.py

FARM_CONF_FILE The file used at the startup of the services to define the clusters, nodes and the monitor modules to be used for collecting metrics. It should be in the ${FARM_HOME} directory. (e.g FARM_CONF_FILE="${MonaLisa_HOME}/Services/VDTFarm/vdtFarm.conf").
JAVA_HOME The path to your current JDK.
JAVA_OPTS An optional parameter to pass parameters directly to the Java Virtual Machine (e.g JAVA_OPTS="-Xmx=128m").

site_env

The file for the site (local) configuration is $VDT_LOCATION/MonaLisa/Service/CMD/site_env. This file is used to configure the location for the local batch queuing system used by VO_Modules. An example of this file is shown below for a CE node using the Condor job manager:

ALERT! IMPORTANT
All changes to the variables in this file should be done using the configure-osg.py script.

  VDT_LOCATION=/storage/local/data1/osg
  export VDT_LOCATION
  . $VDT_LOCATION/setup.sh
  FBS_LOCATION=
  export FBS_LOCATION
  LSF_LOCATION=
  export LSF_LOCATION
  CONDOR_LOCATION=/storage/local/data1/osg/condor
  export CONDOR_LOCATION
  GLOBUS_LOCATION=/storage/local/data1/osg/globus
  export GLOBUS_LOCATION
  SGE_LOCATION=
  export SGE_LOCATION
  CONDOR_CONFIG=/storage/local/data1/osg/condor/etc/condor_config
  export CONDOR_CONFIG
  SGE_ROOT=
  export SGE_ROOT
  PBS_LOCATION=
  export PBS_LOCATION
  CONDOR_LOCAL_DIR=/storage/local/data1/osg/condor/local.cmssrv09
  export CONDOR_LOCAL_DIR

HELP NOTE
If you are using a batch system other than Condor, configure-osg.py should set the correct variables in this file correctly.

vdtFarm.conf

This file, $VDT_LOCATION/MonaLisa/Service/VDTFarm/vdtFarm.conf, specifies the modules which are used to collect the monitoring metrics and the frequency at which they are executed.

HELP NOTE
The modules that are turned on/off (uncommented/commented) by the configure-osg.py script are noted. Do not change these manually except when trouble-shooting as they will be reset on the next execution of the configure-osg.py.

Here is a simple snapshot.

  ###########################################################
  ###    Modules used to monitor the master node
  #
  
  *Master
  >localhost
  monProcLoad%30
  monProcStat%30
  monProcIO%30
  
  ###########################################################
  ### new VO Modules for Jobs and IO 
  # 
  # The module will use the local job manager
  # defined in CMD/site_env. It works with
  # Condor, PBS, LSF, FBS, SGE
  
  *osgVO_JOBS{monOsgVoJobs, localhost, }%180          (Set by configure-osg.py)
  
  #
  #   
  #  If you have Condor installed at your site and you
  #  would like to monitor all Condor servers
  #  please comment the line above and uncomment the following one
  #
  #*osgVO_JOBS{monOsgVoJobs, localhost, condoruseglobal }%180
  #
  # If you would like to export the monitoring information only from
  # a specified set of Condor servers, and not from all of them
  # use the following format
  # The following line will query only the servers IP1, IP2 and IP3
  #
  #*osgVO_JOBS{monOsgVoJobs, localhost, server=IP1, server=IP2, server=IP3}%180
  #
  #
  # The new VO IO module is using grid FTP logs  
  
  *osgVO_IO{monOsgVO_IO, localhost, }%180              (Set by configure-osg.py)
  
  # The old VO modules are not necessary
  #*VO_JOBS{monVoJobs,localhost, }%300
  #*VO_IO{monVOgsiftpIO, localhost, }%300
  #
  #
  ###########################################################
  ###   Configuration to monitor the worker nodes
  #
  # If your site uses Ganglia to monitor the nodes
  # left this line uncommented.
  # 
  # Replace localhost with the node where "gmond" runs.
  # If gmond runs on a different port than the standard port
  # 8649, please replace 8649 in the following line
  #
  *PN_vdt{monIGangliaTCP, localhost, 8649}%30                     (Set by configure-osg.py)
  # 
  # If you want to monitor the worker nodes based on Condor_status 
  # using only the local manager please uncomment the following line
  # 
  #*PN_Condor{monPN_Condor, localhost, }%120                   
  #
  
  # If you want to monitor the worker nodes based on Condor_status 
  # using a list of condor managers please uncomment the following line
  # and provide the names or IPs for the servers.
  #
  #*PN_Condor{monPN_Condor, localhost, server=IP1, server=IP2, server=IP3 }%120
  #
  # 
  # If you want to use PBS to get monitoring information for the worker node
  # please uncomment this line
  #
  #*PN_PBS{monPN_PBS, localhost, }%120
  # 
  # PBS module accepts multiple servers and the syntax is as for the condor module
  #
  ###########################################################
  ###         The ABping module
  #   It is used to perform simple network measurements and it is configured dynamically
  #   
  *ABPing{monABPing, localhost, " "}
  #
  # 
  ###########################################################
  ###      ApMon module  
  #        Used to collect monitoring information from the CMS/ARDA jobs 
  ^monXDRUDP{ParamTimeout=3600,NodeTimeout=3600,ClusterTimeout=3600,ListenPort=58884,AppendIPToNodeName=true}%20
  #
  #
  #
  ###########################################################
  ###     Tracepath module 
  #       Used to perform monitoring for network topology
  #       It is configured dynamically 
  #*Tracepath{monTracepath, localhost, " "}
 

Operational Notes

Starting and Stopping MonALISA

MonALISA should always be started and stopped using the /etc/rc.d services only. For an OSG installation, this is done using the vdt-control script as documented in the Starting Services topic.

   source $VDT_LOCATION/setup.sh
   vdt-control [--on | --off] MLD

By default, MonALISA is run under the daemon account. This can be changed (actually manually overridden) but it is not advisable to do so as most of files and directories require read and write permissions by the MonALISA user and ownership has been changed to the daemon account.

Identifying the MonALISA processes

MonALISA is a java application and can be identified as follows:

 > ps -efwww |grep -i $VDT_LOCATION/MonaLisa
  .... output....
daemon   11062  1.2  1.8 276328 75108 ?      Sl   Mar22   7:33 /storage/local/data1/osg/jdk1.4/bin/java 
      -DMonaLisa_HOME=/storage/local/data1/osg/MonaLisa -Djava.util.logging.config.class=lia.Monitor.monitor.LoggerConfigClass 
      -Djava.security.policy=/storage/local/data1/osg/MonaLisa/policy/policy.all 
      -Dlia.Monitor.SKeyStore=/storage/local/data1/osg/MonaLisa/Service/SSecurity/FarmMonitor.ks -Dlia.Monitor.update=false 
      -Dlia.Monitor.Farm.HOME=/storage/local/data1/osg/MonaLisa/Service/VDTFarm 
      -Dlia.monitor.updateURLs=http://monalisa.cacr.caltech.edu/FARM_ML,http://monalisa.cern.ch/MONALISA/FARM_ML -
      -Dlia.Monitor.ConfigURL=file:/storage/local/data1/osg/MonaLisa/Service/VDTFarm/ml.properties -Xms64m -Xmx64m -jar 
      /storage/local/data1/osg/MonaLisa/Service/lib/JFarmMonitor.jar ITB_INSTALL_TEST
     /storage/local/data1/osg/MonaLisa/Service/VDTFarm/vdtFarm.conf

HELP NOTE
Depending on your kernel configuration, you may see many threads (the highest count thus far is 98) or you may see just one as shown above. Do not be concerned.

MonALISA uses a Postgres database for local persistent storage of the metrics collected and runs as an independent process (not a child or thread) of MonALISA. If you are running other Postgres applications on your platform, it may be difficult to distinguish which instance is for what application. This following is a display from a machine on which the MonALISA Postgres is the only one running:

> ps -efwww |grep -i postgres
.... output ...
daemon   11497 11494  0 Mar22 ?        00:00:00 postgres: writer process                         
daemon   11498 11494  0 Mar22 ?        00:00:00 postgres: stats buffer process                   
daemon   11499 11498  0 Mar22 ?        00:00:00 postgres: stats collector process                
daemon   11536 11494  0 Mar22 ?        00:01:00 postgres: mon_user mon_data 127.0.0.1(58721) idle

ALERT! IMPORTANT
You should never have to explicitly kill/stop the MonALISA Postgres processes. This is handled in the /etc/init.d/MLD script (using the vdt-control script).

A little subtlety that you might see when stopping MonALISA, and may raise some concern on your part, is an "ML_SER stopedb" that shows up:

> vdt-control --off MLD
Trying to stop MonaLisa ....STOPPED
Trying to stop epgsqldb...OK

> ps -ef  | grep -i monalisa 
daemon   21034     1  0 10:33 ?        00:00:00 /bin/sh -c sleep 45; /storage/local/data1/osg/MonaLisa/Service/CMD/ML_SER stopedb

Do not be concerned even when you start MonALISA immediately. This process is spawned whenever MonALISA is stopped and will eventually stop the Postgres database server if it has not already been stopped. The latency may be there as a safeguard for the main MonALISA repositories processes that pull from the local databases.

Some things to look for in your MonALISA log file.

After MonALISA has started, here are a couple items to look for related to the VDT_LOCATION/monitoring/osg-user-vo-map.txt file. The MonALISA log file is $VDT_LOCATION/MonaLisa/Service/VDTFarm/ML0.log.

  • INFO: osgVO_IO-monOsgVO_IO:  [ computeVoAcctsDiff ] Dead VOs [  No VOs removed ]
    If a VO was identified as 'dead', it means you user-vo-map file contained a VO that no users will be mapped to.
  • osgVO_IO-monOsgVO_IO[validateMappings]: No VO account for unix account (des)
    osgVO_IO-monOsgVO_IO[validateMappings]: ignoredUNIXAccts = des

    Says the 'des' UNIX account will not map to a VO.

VO Modules

For a description of the metrics collected by the VO Modules in MonALISA, please review the following documentation. The documentation is general in nature and the only part you should be concerned with is the description of the metrics collected.

Impact Of MonALISA On Host Services

CPU consumption on the host (CE) server is collected using the 'ps' command. If you are showing 0 (zero) CPU on that node, there may be a problem with the command.

The MonALISA monitoring agents should have a minimal impact on the CPU load of the host (CE) servers. The job and gridftp metrics collected by the VO_Modules processes have generally been the biggest consumers of CPU on the host server.

Potential things to look at are:

  • The level of CPU used when your job manager status command is executed as a part of the VO Modules..
  • The size of your globus-gatekeeper.log and gridftp.log files. Both of these log files are parsed to collect the job and gridftp metrics. If logrotate (or your own script) is performed daily on these logs, CPU utilization will be minimal. These files should be rotated at midnight and no more that once per day.
  • Some sites require all E-mail to be sent through a fixed outbound E-mail gateway. MonALISA is not using a normal sendmail process to send its E-mail and thus any such configuration will be ignored and the mail will not be delivered. It is possible to disable MonALISA sending E-mail by adding the following lines to $VDT_LOCATION/MonaLisa/Service/VDTFarm:
    • lia.util.mail.DEVEL_STATION=true

Verifying that your MonALISA client is reporting.

To verify your CE node is reporting to MonALISA, in a web browser, go to the MonaALISA Home page:
  • select Repositories in the menu on the left
  • select the Repository name that corresponds to the OSG Group you specified when you ran the configure-osg.py script to configure your CE node.

You should then see a world map showing all the CE nodes reporting to MonALISA for that group. To view the status of your individual site, select Site status from the menu on the left.

Topic revision: r28 - 01 Jul 2010 - 20:07:10 - SuchandraThapa
Hello, TWikiGuest
Register

Introduction

Installation and Update Tools

Clients

Compute Element

Storage Element

Other Site Services

VO Management

Software and Caches

Central OSG Services

Additional Information

Community
linkedin-favicon_v3.icoLinkedIn
FaceBook_32x32.png Facebook
campfire-logo.jpgChat
 
TWIKI.NET

TWiki | Report Bugs | Privacy Policy

This site is powered by the TWiki collaboration platformCopyright by the contributing authors. All material on this collaboration platform is the property of the contributing authors..