An introduction to PORTABLE BATCH SYSTEM (PBS): System configuration
Source: http://hpc.sissa.it/pbs/pbs-4.html (saved Sat, 12 Nov 2005)
Without any specification, the installation phase will produce a working PBS system with the following defaults:
Once the system has been built and installed, the Server and Moms must be configured and the scheduling policy must be implemented. These items are closely coupled. Which jobs, and how many, are scheduled into execution can be managed in several ways. Each method has an impact on the implementation of the scheduling policy and server attributes. An example is the decision to schedule jobs out of a single pool (queue) or to divide jobs among multiple queues, each of which is managed differently. If you want to run jobs on more than a single computer, you will need to install the execution daemon (pbs_mom) on each host where jobs are expected to execute. If you are running the default scheduler, fifo, you will need to fill in a nodes file (PBS_HOME/server_priv/nodes) with one entry for each execution host specifying, if appropriate, the number of processors per host. For example:
node1 np=4
node2 np=4
node3 np=2
node4 np=2
If you write your own Scheduler, it can be told on which hosts jobs may be run by means other than the Server's nodes file.
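The nodes file format shown above is simple enough to check with a few lines of Python. The following is a minimal sketch, not the Server's own parser: the function name parse_nodes is hypothetical, and both the comment handling and the default of one processor when np= is absent are assumptions made for illustration.

```python
def parse_nodes(text):
    """Parse nodes-file text into {hostname: processor_count}."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # assumption: blank lines and '#' lines are ignored
        host, np = fields[0], 1  # assumption: one processor unless np= says otherwise
        for field in fields[1:]:
            if field.startswith("np="):
                np = int(field[3:])
        nodes[host] = np
    return nodes


example = """\
node1 np=4
node2 np=4
node3 np=2
node4 np=2
"""

nodes = parse_nodes(example)
total_processors = sum(nodes.values())
print(nodes, total_processors)
```

A check like this can be run before restarting the Server to catch a malformed entry or an unexpected processor total.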
qmgr [-a] [-c command] [-e] [-n] [-z] [server...]
The qmgr command provides an administrator interface to the batch system. The command reads directives from standard input. The syntax of each directive is checked and the appropriate request is sent to the batch server or servers. The list and print subcommands of qmgr can be executed by general users. Creating or deleting a queue requires PBS Manager privilege. Setting or unsetting server or queue attributes requires PBS Operator or Manager privilege. The server operands identify the batch servers to which the administrator requests are sent. Each server operand conforms to the following syntax:
host_name[:port]
where host_name is the network name of the host on which the server is running and port is the port number to which to connect. If no server is specified, the administrator requests are sent to the local server.
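The host_name[:port] operand syntax can be illustrated with a small parser. This is only a sketch: parse_server is a hypothetical helper, the port value in the example is arbitrary, and a missing port is returned as None rather than being replaced with any default.

```python
def parse_server(operand):
    """Split 'host_name[:port]' into (host_name, port), with port None if absent."""
    host, sep, port = operand.partition(":")
    if not host:
        raise ValueError("missing host name in %r" % operand)
    if sep and not port.isdigit():
        raise ValueError("port must be a number in %r" % operand)
    return host, (int(port) if sep else None)


print(parse_server("fe.widget.com"))
print(parse_server("fe.widget.com:15001"))  # illustrative port number
```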
A qmgr directive is one of the following forms:
command server [names] [attr OP value[,attr OP value,...]]
command queue [names] [attr OP value[,attr OP value,...]]
command node [names] [attr OP value[,attr OP value,...]]
where command is the command to perform on an object. Commands are:
active sets the active objects. If active objects are specified and no name is given in a qmgr command, the active object names will be used.
create is to create a new object; applies to queues and nodes.
delete is to destroy an existing object; applies to queues and nodes.
set is to define or alter attribute values of the object.
unset is to clear the value of attributes of the object. Note that this form does not accept an OP and value, only the attribute name.
list is to list the current attributes and associated values of the object.
print is to print all the queue and server attributes in a format that will be usable as input to the qmgr command.
names is a list of one or more names of specific objects. The name list is in the form: [name][@server][,queue_name[@server]...] with no intervening white space. The name of an object is declared when the object is first created. If the name is @server, then all the objects of the specified type at the server will be affected.
attr specifies the name of an attribute of the object which is to be set or modified. If the attribute is one which consists of a set of resources, then the attribute is specified in the form: attribute_name.resource_name
OP is the operation to be performed with the attribute and its value:
= set the value of the attribute. If the attribute has an existing value, the current value is replaced with the new value.
+= increase the current value of the attribute by the amount in the new value.
-= decrease the current value of the attribute by the amount in the new value.
value is the value to assign to an attribute. If the value includes white space, commas or other special characters, such as the # character, the value string must be enclosed in quote marks (").
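The directive syntax and the quoting rule described above can be sketched as a small formatter. This is an illustration under stated assumptions, not part of PBS: the helper names quote_value and directive are hypothetical, the operator spellings =, += and -= are those given in the qmgr man page, and values that themselves contain quote marks are not handled.

```python
SPECIAL_CHARS = set(" \t,#")      # characters that force quoting, per the rule above
OPERATORS = ("=", "+=", "-=")     # set, increase, decrease


def quote_value(value):
    """Wrap the value in quote marks if it contains white space, commas or '#'."""
    value = str(value)
    if any(ch in SPECIAL_CHARS for ch in value):
        return '"%s"' % value
    return value


def directive(command, obj, names="", attrs=()):
    """Build 'command object [names] [attr OP value[,attr OP value,...]]'."""
    parts = [command, obj]
    if names:
        parts.append(names)
    pairs = []
    for attr, op, value in attrs:
        if op not in OPERATORS:
            raise ValueError("unknown operator: %r" % op)
        pairs.append("%s%s%s" % (attr, op, quote_value(value)))
    if pairs:
        parts.append(",".join(pairs))
    return " ".join(parts)


print(directive("create", "queue", "dque", [("queue_type", "=", "execution")]))
print(directive("set", "queue", "dque", [("resources_max.cput", "=", "2:00:00")]))
```

The resulting strings are in the long form of the commands; they could be written to a file and fed to qmgr on standard input.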
All three of the daemon processes, Server, Scheduler and Mom, must run with the real and effective uid of root. Typically, the daemons are started from the system's boot files, e.g. /etc/rc.local. However, it is recommended that the Server be brought up "by hand" the first time and configured before being run at boot time.
Mom should be started at boot time. Typically there are no required options. It works best if Mom is started before the Server on every node, so that the Moms will be ready to respond to the Server's "are you there?" ping. Start Mom with the line:
{sbindir}/pbs_mom [options]
in the /etc/rc2 or equivalent boot file. If the Server or Scheduler is running on a different host, the host name(s) must be specified in Mom's configuration file; see the pbs_mom configuration section.
The initial run of the Server, or any first run after recreating the home directory, must be with the -t create option:
{sbindir}/pbs_server -t create
This option directs the Server to discard any existing configuration files, queues and jobs, and to initialize configuration files to the default values. This is best done by hand. At this point it is necessary to configure the Server. See the pbs_server configuration section.
After the Server is configured it may be placed into service. Normally it is started in the system boot file via a line such as:
{sbindir}/pbs_server [options]
The -t start_type option may be specified, where start_type is one of the options (hot|warm|cold) specified in the pbs_server man page. The default is warm.
The Scheduler should also be started at boot time. Start it with an entry in the /etc/rc2 or equivalent file:
{sbindir}/pbs_sched [options]
There are no required options for the default fifo scheduler.
The function of pbs_mom is to place jobs into execution as directed by the server, establish resource usage limits, monitor the job's usage, and notify the server when the job completes. If they exist, pbs_mom will execute a prologue script before executing a job and an epilogue script after executing the job. The next function of pbs_mom is to provide information about the status of running jobs, memory available, etc. in response to a resource monitor request, typically submitted by the PBS scheduler. Pbs_mom will record a diagnostic message in a log file for any error that occurs. The log files are maintained in the mom_logs directory below the home directory of the server (default /usr/spool/PBS/mom_logs). If the log file cannot be opened, the diagnostic message is written to the system console.
Mom must know the name of the server that manages it: it must be declared in the file PBS_HOME/server_name. Mom is configured via a configuration file which is read at initialization time and whenever Mom receives a SIGHUP signal. This file is described in the pbs_mom(8) man page as well as in the following section. If the -c option is not specified when Mom is run, she will open PBS_HOME/mom_priv/config if it exists. If it does not, Mom will continue anyway. The configuration file must be "secure": it must be owned by a user id and group id less than 10 and must not be world writable.
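The "secure" rule for the configuration file can be checked before Mom is started. The sketch below applies the rule exactly as stated in the text; the function names is_secure and config_is_secure are hypothetical, and Mom's own check may differ in detail.

```python
import os
import stat


def is_secure(uid, gid, mode):
    """True if owner uid and gid are below 10 and the file is not world writable."""
    return uid < 10 and gid < 10 and not (mode & stat.S_IWOTH)


def config_is_secure(path):
    """Apply is_secure() to the ownership and permission bits of a file on disk."""
    st = os.stat(path)
    return is_secure(st.st_uid, st.st_gid, st.st_mode)


print(is_secure(0, 0, 0o644))     # root-owned, not world writable
print(is_secure(0, 0, 0o646))     # world writable
print(is_secure(1000, 0, 0o600))  # owned by an ordinary user
```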
The file provides several types of run time information to pbs_mom: static resource names and values, external resources provided by a program to be run on request via a shell escape, and values to pass to internal set up functions at initialization (and re-initialization).
Each item type is on a single line with the component parts separated by white space. If the line starts with a hash mark (pound sign, #), the line is considered to be a comment and is skipped. An example of a configuration file is:
$logevent 0x0ff                             #enables logging of all events except debug events
$clienthost fe.widget.com                   #mom will accept privileged connections from this host,
                                            #typically the host where server and scheduler run
$restricted *.widget.com                    #mom will accept connections from these hosts,
                                            #typically hosts on which a monitoring tool
                                            #(such as xpbsmon) can be run
$ideal_load 2.0                             #when the load average on the node drops below this value,
                                            #Mom informs the server that the node is no longer busy
$max_load 3.5                               #when the load average on the node exceeds this value,
                                            #Mom informs the server that the node is busy
$cputmult 1.3                               #factor used to adjust the cpu time used by the job, to allow
                                            #comparison across nodes with different cpu performance
$wallmult 1.3                               #factor used to adjust the wall time used by the job, to allow
                                            #comparison across nodes with different cpu performance
$usecp bevyboss.widget.com:/u/home /r/home  #tells mom to use cp instead of rcp or scp to transfer
                                            #files from/to that destination because it is NFS mounted
tape8mm 2                                   #informs mom of the value of a static resource
                                            #(e.g. number of resources)
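To make the line format concrete, the following sketch separates an example configuration into "$" directives and static resources. It is not Mom's parser: parse_mom_config is a hypothetical name, and treating "#" as a comment start anywhere on a line is an assumption drawn from the annotated example above (the text only guarantees that lines beginning with "#" are skipped).

```python
def parse_mom_config(text):
    """Return (directives, static_resources) parsed from Mom config text."""
    directives, static = {}, {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment anywhere
        if not line:
            continue
        name, *args = line.split()
        if name.startswith("$"):
            # a directive may appear more than once, so keep a list of argument lists
            directives.setdefault(name[1:], []).append(args)
        else:
            static[name] = " ".join(args)
    return directives, static


example = """\
# sample Mom configuration
$logevent 0x0ff
$clienthost fe.widget.com
$ideal_load 2.0
$max_load 3.5
$usecp bevyboss.widget.com:/u/home /r/home  # NFS mounted
tape8mm 2
"""

directives, static = parse_mom_config(example)
print(directives)
print(static)
```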
The directories and files involved are:
$PBS_SERVER_HOME/mom_priv the default directory for configuration files, typically /usr/spool/PBS/mom_priv.
$PBS_SERVER_HOME/mom_logs directory for log files recorded by the server.
$PBS_SERVER_HOME/mom_priv/config the default configuration file.
$PBS_SERVER_HOME/mom_priv/prologue the administrative script to be run before job execution.
$PBS_SERVER_HOME/mom_priv/epilogue the administrative script to be run after job execution.
Server management consists of configuring the Server attributes and establishing queues and their attributes. Unlike Mom and the Job Scheduler, the Job Server (pbs_server) is configured while it is running, except for the nodes file. Configuring server and queue attributes and creating queues is done with the qmgr command. This must be done either as root or as a user who has been granted PBS Manager privilege. Exactly what needs to be set depends on your scheduling policy and how you choose to implement it. The system needs at least one queue established and certain server attributes initialized.
The following are the "minimum required" server attributes and the recommended attributes; see the pbs_server_attributes man page for a complete list of server attributes. They are set via the set server (s s) subcommand of the qmgr command.
default_queue Declares the default queue to which jobs are submitted if a queue is not specified on the qsub command. The queue must be created first. Example:
Qmgr: c q dque queue_type=execution
Qmgr: s s default_queue=dque
acl_hosts A list of hosts from which jobs may be submitted. Example:
Qmgr: s s acl_hosts=*.foo.bar.com,boss.hq.bar.com
acl_host_enable Enables the Server's host access control list, see above. Example:
Qmgr: s s acl_host_enable=true
default_node Defines the node on which jobs are run if not otherwise directed. Example:
Qmgr: s s default_node=big
managers Defines which users, at a specified host, are granted batch system administrator privilege. Example:
Qmgr: s s managers=me@*.foo.bar.com,sam@big.foo.bar.com
node_pack Defines the order in which multiple cpu cluster nodes are allocated to jobs.
resources_default This attribute establishes the resource limits assigned to jobs that were submitted without a limit and for which there are no queue limits. See the pbs_resources_* man page for your system type (* is irix6, linux, solaris5, ...). Example:
Qmgr: s s resources_default.cput=5:00
Qmgr: s s resources_default.mem=4mb
resources_max This attribute sets the maximum amount of resources which can be used by a job entering any queue on the Server. This limit is checked only if there is no queue-specific resources_max attribute defined for the specific resource.
There are two types of queues defined by PBS, routing and execution. A routing queue is a queue used to move jobs to other queues, which may even exist on different PBS Servers. Routing queues are similar to the old NQS pipe queues. A job must reside in an execution queue to be eligible to run. The job remains in the execution queue during the time it is running.
A Server may have multiple queues of either or both types. A Server must have at least one queue defined. Typically it will be an execution queue; jobs cannot be executed while residing in a routing queue.
Queue attributes fall into three groups: those which are applicable to both types of queues, those applicable only to execution queues, and those applicable only to routing queues. If an "execution queue only" attribute is set for a routing queue, or vice versa, it is simply ignored by the system. However, as this situation might indicate the administrator made a mistake, the Server will issue a warning message about the conflict. The same message will be issued if the queue type is changed and there are attributes that do not apply to the new type.
Not all of the queue attributes are discussed here, only what is needed to get a reasonable system up and running. See the pbs_queue_attributes man page for a complete list of queue attributes.
queue_type Must be set to either execution or routing (e or r will do). The queue type must be set before the queue can be enabled. Example:
Qmgr: s q dque queue_type=execution
enabled If set to true, jobs may be enqueued into the queue. If false, jobs will not be accepted.
started If set to true, jobs in the queue will be processed, either routed by the Server if the queue is a routing queue or scheduled to run if it is an execution queue.
route_destinations (Only for routing queues) Lists the local queues or queues at other Servers to which jobs in this routing queue may be sent. Example:
Qmgr: s q routem route_destinations=dque,overthere@another.foo.bar.com
resources_max If you choose to have more than one execution queue based on the size or type of job, you may wish to establish maximum and minimum values for various resource limits. This will restrict which jobs may enter the queue and will override the resources_max value defined for the same resource at the Server level. If there is no maximum value declared for a resource type, there is no restriction on that resource. For example:
Qmgr: s q dque resources_max.cput=2:00:00
places a restriction that no job requesting more than 2 hours of cpu time will be allowed in the queue. There is no restriction on the memory (mem) limit a job may request.
resources_min Defines the minimum value of a resource limit specified by a job before the job will be accepted into the queue. If not set, there is no minimum restriction.
resources_default Defines a set of default values for jobs entering the queue that did not specify certain resource limits. There is a corresponding server attribute which sets a default for all jobs.
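The effect of resources_min and resources_max on queue admission can be sketched as follows. This is a simplified model rather than the Server's logic: the names to_seconds and queue_accepts are hypothetical, only resources the job explicitly requested are checked, and time values are assumed to use the [[hours:]minutes:]seconds form seen in the examples.

```python
def to_seconds(value):
    """Convert '[[hours:]minutes:]seconds' (e.g. '2:00:00' or '5:00') to seconds."""
    seconds = 0
    for part in value.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def queue_accepts(requested, resources_min=None, resources_max=None):
    """True if every requested amount lies within the queue's min/max limits.

    A resource with no declared minimum or maximum is unrestricted.
    """
    resources_min = resources_min or {}
    resources_max = resources_max or {}
    for name, amount in requested.items():
        if name in resources_max and amount > resources_max[name]:
            return False
        if name in resources_min and amount < resources_min[name]:
            return False
    return True


# Mirrors "s q dque resources_max.cput=2:00:00": cput is capped, mem is not.
dque_max = {"cput": to_seconds("2:00:00")}
print(queue_accepts({"cput": to_seconds("1:30:00"), "mem": 10**12}, resources_max=dque_max))
print(queue_accepts({"cput": to_seconds("3:00:00")}, resources_max=dque_max))
```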
The limit for a specific resource usage is established by checking various job, queue, and server attributes. The following list shows the attributes and their order of precedence:
1. The job attribute Resource_List, i.e. what was requested by the user.
2. The queue attribute resources_default.
3. The Server attribute resources_default.
4. The queue attribute resources_max.
5. The Server attribute resources_max.
Please note, an unset resource limit for a job is treated as an infinite limit.
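The order of precedence listed above amounts to "first value found wins". A minimal sketch, assuming each attribute is represented as a dictionary mapping resource names to numeric limits (the function name effective_limit and the parameter names are hypothetical):

```python
import math


def effective_limit(resource, job=None, queue_default=None, server_default=None,
                    queue_max=None, server_max=None):
    """Return the limit that applies to a job for one resource.

    Sources are consulted in the documented order; if none sets the
    resource, the limit is treated as infinite.
    """
    for source in (job, queue_default, server_default, queue_max, server_max):
        if source and resource in source:
            return source[resource]
    return math.inf


# The job did not request cput, so the queue default applies before any maximum.
print(effective_limit("cput", job={"mem": 4}, queue_default={"cput": 300},
                      server_max={"cput": 7200}))
# Nothing sets mem anywhere, so the limit is infinite.
print(effective_limit("mem"))
```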
Should you wish to record the configuration of a Server for re-use, you may use the print subcommand of qmgr. For example,
qmgr -c "print server" > /tmp/server.con
will record in the file /tmp/server.con the qmgr subcommands required to recreate the current configuration, including the queues. The commands can be fed back into qmgr via standard input:
qmgr < /tmp/server.con
It isn't necessary to do this at every pbs_server startup because (unless -t create is specified) the Server maintains its current configuration in a private database (server_priv/serverdb).
PBS provides a separate process to schedule which jobs should be placed into execution. This is a flexible mechanism by which you may implement a very wide variety of policies. In fact it is possible to implement a replacement Scheduler using the provided APIs which will enforce the desired policies. The configuration required for a Scheduler depends on the Scheduler itself. The delivered FIFO Scheduler provides the ability to sort the jobs in several different ways, in addition to FIFO order. There is also the ability to sort on user and group priority. Mainly this Scheduler is intended to be a jumping off point for a real Scheduler to be written. A good amount of code has been written to make it easier to change and add to this Scheduler. As distributed, the fifo Scheduler is configured with the following options; see the file PBS_HOME/sched_priv/sched_config:
Change directory into PBS_HOME/sched_priv and edit the scheduling policy config file sched_config, or use the default values. This file controls the scheduling policy (which jobs are run when). The format of the sched_config file is:
name: value [prime | non_prime | all]
name and value may not contain any white space. value can be: true | false | number | string. Any line starting with a '#' is a comment. A blank third word is equivalent to "all", which is both prime and non-prime. The values shipped as defaults are shown in braces {}. Here are some of the scheduler attributes you can set:
round_robin {false all} boolean: If true, run one job from each queue in a circular fashion; if false, run as many jobs as possible, up to queue/server limits, from one queue before processing the next queue. The following server and queue attributes, if set, will control whether a job "can be" run: resources_max, max_running, max_user_run, and max_group_run. See the man pages pbs_server_attributes and pbs_queue_attributes.
by_queue {true all} boolean: If true, the jobs will be run from their queues; if false, the entire job pool in the Server is looked at as one large queue.
strict_fifo {false all} boolean: If true, jobs will be run in strict FIFO order. This means that if a job fails to run for any reason, no more jobs will run from that queue/server during that scheduling cycle. If strict_fifo is not set, large jobs can be starved, i.e., not allowed to run because a never ending series of small jobs uses the available resources. Also see the server attribute resources_max and the fifo parameter help_starving_jobs below.
fair_share {false all} boolean: This will turn on the fair share algorithm. It will also turn on usage collecting, and jobs will be selected using a function of their usage and priority (shares).
load_balancing {false all} boolean: If this is set, the Scheduler will load balance the jobs between a list of time-shared hosts (:ts) obtained from the Server (pbs_server). The Server reads the list from its nodes file.
help_starving_jobs boolean: This setting has the Scheduler turn on its rudimentary starving jobs support. Once jobs have waited for the amount of time given by max_starve, they are considered starving, i.e. no other jobs will run until the starving job can be run. max_starve needs to be set as well.
max_starve The amount of time before a job is considered starving. This config variable is not used if help_starving_jobs is not set.
sort_by {shortest_job_first} string: Selects how the jobs are sorted. sort_by can be set to a single sort type or to multi_sort. If set to multi_sort, multiple key fields are used. Each key field will be a key for the multi sort. The order of the key fields decides which sort type is used first. Possible sort keys: no_sort, shortest_job_first, longest_job_first, smallest_memory_first, largest_memory_first, high_priority_first, low_priority_first, multi_sort, fair_share, large_walltime_first, short_walltime_first.
log_filter {256} What event types not to log. The value should be the sum of the event classes which should be filtered (i.e. ORing them together). The numbers are defined in src/include/log.h. NOTE: those numbers are in hex and log_filter is in base 10.
dedicated_prefix {ded} The queues with this prefix will be considered dedicated queues.
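The difference between round_robin true and false can be pictured with a toy ordering function. This sketch only illustrates the order in which jobs are considered: run_order and the queue and job names are hypothetical, every job is assumed to be runnable, and the queue/server limits mentioned above are ignored.

```python
from itertools import zip_longest


def run_order(queues, round_robin=False):
    """Return job ids in the order they would be considered.

    queues maps queue name -> list of job ids, already in queue order.
    """
    if not round_robin:
        # drain each queue completely before moving to the next one
        return [job for jobs in queues.values() for job in jobs]
    order = []
    # take one job from each queue in turn until all queues are empty
    for tier in zip_longest(*queues.values()):
        order.extend(job for job in tier if job is not None)
    return order


queues = {"short": ["s1", "s2", "s3"], "long": ["l1", "l2"]}
print(run_order(queues, round_robin=False))
print(run_order(queues, round_robin=True))
```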
#Set the boolean values which define how the scheduling policy finds
#the next job to consider to run.
round_robin: False ALL
by_queue: True prime
by_queue: false non_prime
strict_fifo: true ALL
fair_share: True prime
fair_share: false non_prime
# help jobs which have been waiting too long
help_starving_jobs: true prime
help_starving_jobs: false non_prime
# Set a multi_sort
# This example will sort jobs first by ascending cpu time requested, and then
# by ascending memory requested, and then finally by descending job priority
#
sort_by: multi_sort
key: shortest_job_first
key: smallest_memory_first
key: high_priority_first
# Set the debug level to only show high level messages.
# Currently this only shows jobs being run
debug_level: high_mess
# a job is considered starving if it has waited for this long
max_starve: 24:00:00
# If the Scheduler comes by a user which is not currently in the resource group
# tree, they get added to the "unknown" group. The "unknown" group is in root's
# resource group. This says how many shares it gets.
unknown_shares: 10
# The usage information needs to be written to disk in case the Scheduler
# goes down for any reason. This is the amount of time between when the
# usage information in memory is written to disk. The example syncs the
# information every hour.
sync_time: 1:00:00
# What events do you not want to log. The event numbers are defined in
# src/include/log.h. NOTE: the numbers are in hex, and log_filter is in
# base 10.
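The "name: value [prime | non_prime | all]" format lends itself to a short parser, which can help when checking a sched_config file by hand. This is a sketch rather than the Scheduler's own reader: parse_sched_config is a hypothetical name, the sample uses the non_prime spelling from the format line, and settings are returned as a list because names such as key and by_queue may legitimately repeat.

```python
def parse_sched_config(text):
    """Return a list of (name, value, when) tuples from sched_config text."""
    settings = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no settings
        name, sep, rest = line.partition(":")  # split on the first colon only
        fields = rest.split()
        if not sep or not fields:
            raise ValueError("malformed line: %r" % raw)
        value = fields[0]
        # a blank third word is equivalent to "all"
        when = fields[1].lower() if len(fields) > 1 else "all"
        settings.append((name.strip(), value, when))
    return settings


sample = """\
# scheduling policy
round_robin: False ALL
by_queue: True prime
by_queue: false non_prime
sort_by: multi_sort
key: shortest_job_first
key: smallest_memory_first
max_starve: 24:00:00
"""

settings = parse_sched_config(sample)
sort_keys = [value for name, value, when in settings if name == "key"]
print(settings)
print(sort_keys)
```

Splitting on the first colon only keeps time values such as 24:00:00 intact.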