ITG Unix Support

Restarting services on the cluster

Introduction

Sometimes unexpected problems (e.g. running out of disk space) may bring down essential system services on a cluster without being severe enough to justify the disruption caused by rebooting. Once the original cause of failure has been corrected, these services may not come back on their own, so they must be restarted manually. This page gives general instructions on selecting services and restarting them.

Selecting services to be restarted

Selecting which services to restart takes a little detective work: some knowledge of what kinds of services the system should be offering, and of what seems not to be working. Many Linux systems, including SUSE, have the chkconfig command, which can be used to see which services should be running at a given run level.

First find the default run level for the system.

grep id: /etc/inittab

This should return a line such as

id:3:initdefault:

which indicates a default run level of 3. Now use this number to find the services that should be running at this level:

chkconfig --list | grep 3:on

This should return several lines, e.g.:

alsasound                 0:off  1:off  2:on   3:on   4:off  5:on   6:off
autofs                    0:off  1:off  2:off  3:on   4:off  5:on   6:off
cron                      0:off  1:off  2:on   3:on   4:off  5:on   6:off
fbset                     0:off  1:on   2:on   3:on   4:off  5:on   6:off
heartbeat                 0:off  1:off  2:off  3:on   4:off  5:on   6:off
hwscan                    0:off  1:off  2:on   3:on   4:off  5:on   6:off
irq_balancer              0:off  1:on   2:on   3:on   4:off  5:on   6:off
jobhunter-DEV             0:off  1:off  2:off  3:on   4:off  5:on   6:off
jobhunter-PRD             0:off  1:off  2:off  3:on   4:off  5:on   6:off
jobhunter-TST             0:off  1:off  2:off  3:on   4:off  5:on   6:off
kbd                       0:off  1:on   2:on   3:on   4:off  5:on   6:off
mysql                     0:off  1:off  2:off  3:on   4:off  5:off  6:off
network                   0:off  1:off  2:on   3:on   4:off  5:on   6:off
nfs                       0:off  1:off  2:off  3:on   4:off  5:on   6:off
nfsboot                   0:off  1:off  2:off  3:on   4:off  5:on   6:off
nfsserver                 0:off  1:off  2:off  3:on   4:off  5:on   6:off
osirisd                   0:off  1:off  2:off  3:on   4:on   5:on   6:off
portmap                   0:off  1:off  2:off  3:on   4:off  5:on   6:off
postfix                   0:off  1:off  2:off  3:on   4:off  5:on   6:off
pvfs2-client              0:off  1:off  2:off  3:on   4:off  5:on   6:off
pvfs2-server              0:off  1:off  2:off  3:on   4:off  5:on   6:off
random                    0:off  1:off  2:on   3:on   4:off  5:on   6:off
resmgr                    0:off  1:off  2:on   3:on   4:off  5:on   6:off
running-kernel            0:off  1:off  2:on   3:on   4:off  5:on   6:off
rwhod                     0:off  1:off  2:off  3:on   4:off  5:on   6:off
smb                       0:off  1:off  2:off  3:on   4:off  5:on   6:off
sshd                      0:off  1:off  2:off  3:on   4:off  5:on   6:off
syslog                    0:off  1:off  2:on   3:on   4:off  5:on   6:off
xinetd                    0:off  1:off  2:off  3:on   4:off  5:on   6:off
xntpd                     0:off  1:off  2:on   3:on   4:off  5:on   6:off
zabbix_agentd             0:off  1:off  2:off  3:on   4:off  5:off  6:off

Most of these services, such as jobhunter-DEV, smb, and sshd, run in the background as daemons. A few, such as kbd, run for a short time when the run level is entered (usually at boot) and then exit; it should not be necessary to restart those. Each service has a corresponding shell script of the same name in the /etc/init.d directory. You may examine these initialization scripts to familiarize yourself with what they start and to determine whether a restart is needed. It is generally safe to restart any of them, but restarting services that don't need it can be time-consuming.
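
The two lookups above (default run level, then the services switched on at that level) can be combined into one step. The following is only a sketch: the function names default_runlevel and services_on_at are made up for this page, and it assumes the inittab and chkconfig output formats shown above.

```shell
# Print the default run level from an inittab-style file.
# The file defaults to /etc/inittab; another path may be given for testing.
default_runlevel() {
    awk -F: '$1 == "id" && $3 == "initdefault" { print $2; exit }' "${1:-/etc/inittab}"
}

# Read `chkconfig --list` output on stdin and print only the names of
# the services that are switched on at the given run level.
services_on_at() {
    awk -v flag="$1:on" '{
        for (i = 2; i <= NF; i++)
            if ($i == flag) { print $1; break }
    }'
}

# Typical use on the node itself (requires chkconfig):
#   chkconfig --list | services_on_at "$(default_runlevel)"
```

This prints one service name per line, which is convenient to feed into a loop. It does not distinguish daemons from short-lived services such as kbd; that still has to be judged by reading the init scripts.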

Examining services to be restarted

Now that you have a list of candidate services, it is almost time to restart them. You may wish to examine their states first, however. The init scripts are generally invoked as follows:

/etc/init.d/service-name command

To see the status of a given service, use the "status" command. For example, for the jobhunter-DEV service:

/etc/init.d/jobhunter-DEV status

which may return the following line:

Checking for service JobHunter [DEV]                                  stopped

The "stopped" indicates that this service is not running, so it probably needs to be restarted. If the service is marked as "running", that tells you only that its process exists; it does not necessarily confirm that the service is running properly. For example, during a past problem with a filled disk partition, this service showed up in the "running" state but still had to be restarted, once disk space was freed, before it would work again.
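
Checking many services one at a time gets tedious, so a small loop can help. This is a sketch under assumptions: the function name report_status and the INIT_DIR variable are invented here, and it assumes the init scripts follow the LSB convention of exiting with 0 for "running" and 3 for "not running" on the "status" command. Scripts that do not support "status", or that use other exit codes, show up as "unknown".

```shell
# INIT_DIR is a knob added for this sketch; this page always uses /etc/init.d.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Print one line per service named on the command line.
report_status() {
    for svc in "$@"; do
        script="$INIT_DIR/$svc"
        if [ ! -x "$script" ]; then
            echo "$svc: no init script"
            continue
        fi
        "$script" status >/dev/null 2>&1
        case $? in
            0) echo "$svc: running" ;;   # process exists; not proof it works
            3) echo "$svc: stopped" ;;
            *) echo "$svc: unknown" ;;   # status unsupported or other state
        esac
    done
}

# Example: report_status jobhunter-DEV jobhunter-TST sshd
```

As with running the scripts by hand, a "running" line here is not proof that the service is healthy.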

Also note that not all services support the "status" command. Most do, but support varies from service to service; check the init script in /etc/init.d. You can also check on a service by looking for its process in the list of running processes:

ps -Af

Many, but not all, services spawn processes with the same name as the service. If you cannot find a matching process, check the init script for clues to the proper process name; the absence may also simply mean that the service is not running.
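
Scanning the full process list by eye is error-prone, so the listing can be filtered. The helper below is a sketch with an invented name (find_service_procs). It assumes the usual `ps -Af` layout, in which the command is the eighth column, and it matches only processes whose command base name equals the name given, so it will miss services whose processes are named differently.

```shell
# Read `ps -Af` output on stdin and print the lines whose command
# (8th column) has the given base name. The header line is skipped.
find_service_procs() {
    awk -v name="$1" 'NR > 1 {
        n = split($8, parts, "/")      # drop any leading directory
        if (parts[n] == name) print
    }'
}

# Example: ps -Af | find_service_procs sshd
```

Because the match is on the command column rather than the whole line, a `grep sshd` running at the same time is not reported as a hit.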

Restarting services

Again, the service's init script is used to restart the service, jobhunter-TST in this example:

/etc/init.d/jobhunter-TST restart

This might return the following output if successful:

Shutting down JobHunter [TST]                                         done
Starting JobHunter [TST]: startproc: Empty pid file /var/run/jobhunter-TST.pid for /data/TST/bin/jobhunter-TST
JobHunter 1.4-843 logging to /data/TST/jobs/jobhunter.log
Running on ionman-n1 [TST]

One final caution: some init scripts do not support the "restart" command. In that case, a restart can be done with a "stop" command followed by a "start" command:

/etc/init.d/jobhunter-TST stop
/etc/init.d/jobhunter-TST start
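
The choice between "restart" and "stop" followed by "start" can be wrapped in a helper. This is a rough sketch, not a tested tool: restart_service and INIT_DIR are names invented here, and the test for "restart" support is only a heuristic that looks for a case label such as `restart)` or `restart|reload)` in the init script text.

```shell
# INIT_DIR is a knob added for this sketch; this page always uses /etc/init.d.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Restart a service, falling back to stop + start when the init script
# does not appear to have a "restart" case label.
restart_service() {
    script="$INIT_DIR/$1"
    if grep -Eq '^[[:space:]]*([a-z|-]*\|)?restart(\|[a-z|-]*)?\)' "$script"; then
        "$script" restart
    else
        "$script" stop
        "$script" start
    fi
}

# Example: restart_service jobhunter-TST
```

A script that only mentions "restart" in its usage message is not treated as supporting it, since the pattern is anchored to the start of a case label. Scripts that write their labels in some other style will fall through to the stop and start path.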

Final checks

Remember that since these clusters operate cooperatively, a problem on one node can cause problems on other nodes. For example, one node may provide jobhunter while pvfs runs on another. Check the other nodes as well.
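
Checking the other nodes can be done from a single login if remote commands are available. The sketch below assumes passwordless ssh between nodes, which this page does not confirm. The function name check_nodes and the REMOTE variable are invented here, and any node name other than ionman-n1 is a placeholder.

```shell
# Run "/etc/init.d/<service> status" on each listed node and summarize.
# REMOTE is a knob added for this sketch; it defaults to ssh.
check_nodes() {
    svc="$1"; shift
    for node in "$@"; do
        if ${REMOTE:-ssh} "$node" "/etc/init.d/$svc status" >/dev/null 2>&1; then
            echo "$node: $svc reported running"
        else
            echo "$node: $svc not running or status unavailable"
        fi
    done
}

# Example (ionman-n2 is a hypothetical node name):
#   check_nodes pvfs2-server ionman-n1 ionman-n2
```

A failed connection and a stopped service look the same in this output, so follow up by hand on any node that is not reported as running.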

See the end of one of the following documents for more information on verifying essential cluster services.

 

Reference http://wiki.chem.indiana.edu/HPC/RestartingServicesOnTheCluster