Resource Pools¶
To run tasks such as experiments or notebooks, Determined needs resources (CPUs, GPUs) on which to run them. However, different tasks have different resource requirements and, given the cost of GPU resources, it is important to choose the right resources for specific goals so that you get the most value for your money. For example, you may want to run training on powerful V100 GPU machines while running your TensorBoards on cheap CPU machines with minimal resources.
Determined has the concept of a resource pool, which is a collection of identical resources that are located physically close to each other. Determined allows you to configure your cluster to have multiple resource pools and to assign tasks to a specific resource pool, so that you can use different sets of resources for different tasks. Each resource pool handles scheduling and instance provisioning independently.
When you configure a cluster, you set which pool is the default for auxiliary tasks and which pool
is the default for compute tasks. CPU-only tasks such as TensorBoards will run on the default
auxiliary pool unless you specify a different pool when launching the task. Tasks that require a
slot, such as experiments or GPU notebooks, will use the default compute pool unless otherwise
specified. For this reason, we recommend always creating a cluster with at least two pools: one
with low-cost CPU instances for auxiliary tasks and one with GPU instances for compute tasks. This
is the default setup when launching a cluster on AWS or GCP via det deploy.
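For reference, a minimal master configuration implementing this two-pool recommendation might look like the following sketch (pool names and descriptions are illustrative; provisioner details are omitted):

resource_manager:
  type: agent
  default_aux_resource_pool: aux-pool          # CPU-only tasks land here by default
  default_compute_resource_pool: compute-pool  # slot-using tasks land here by default

resource_pools:
  - pool_name: aux-pool
    description: low-cost CPU instances for TensorBoards and other auxiliary tasks
  - pool_name: compute-pool
    description: GPU instances for experiments and notebooks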
Here are some scenarios where it can be valuable to use multiple resource pools:
Use GPUs for training while using CPUs for TensorBoards.

You create one pool, aws-v100, that provisions p3dn.24xlarge instances (large V100 EC2 instances) and another pool, aws-cpu, that provisions m5.large instances (small, cheap CPU instances). You train your experiments using the aws-v100 pool, while you run your TensorBoards in the aws-cpu pool. When your experiments complete, the aws-v100 pool can scale down to zero to save money, while your TensorBoards keep running. Without resource pools, you would have needed to keep a p3dn.24xlarge instance running just to keep the TensorBoard alive. By default, TensorBoards always run on the default auxiliary pool.

Use GPUs in different availability zones on AWS.

You have one pool, aws-v100-us-east-1a, that runs p3dn.24xlarge instances in the us-east-1a availability zone and another pool, aws-v100-us-east-1b, that runs p3dn.24xlarge instances in the us-east-1b availability zone. You can launch an experiment into aws-v100-us-east-1a and, if AWS does not have sufficient p3dn.24xlarge capacity in that availability zone, relaunch the experiment into aws-v100-us-east-1b to check whether that zone has capacity. Note that currently the "AWS does not have capacity" notification is only visible in the master logs, not on the experiment itself.

Use spot/preemptible instances and fall back to on-demand if needed.

You have one pool, aws-v100-spot, that you use to try to run training on spot instances and another pool, aws-v100-on-demand, that you fall back to if AWS does not have enough spot capacity to run your job. Determined will not switch from spot to on-demand instances automatically, but with appropriately configured resource pools (see the sketch after this list), users can easily select the right pool for a job based on the current availability of spot instances in their AWS region. For more information on using spot instances, refer to AWS Spot Instances.

Use cheaper GPUs for prototyping on small datasets and more expensive GPUs for training on full datasets.

You have one pool with less expensive GPUs that you use for initial prototyping on small datasets and another pool that you use for training more mature models on large datasets.
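As a sketch of the spot/on-demand scenario above (the instance type is taken from the example; the spot setting follows the AWS provider configuration, and other provider fields are elided), the two pools might be defined like this:

resource_pools:
  - pool_name: aws-v100-spot
    provider:
      type: aws
      instance_type: p3dn.24xlarge
      spot: true                 # request spot instances
  - pool_name: aws-v100-on-demand
    provider:
      type: aws
      instance_type: p3dn.24xlarge
      spot: false                # fall back to on-demand capacity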
Limitations¶
Currently, resource pools are completely independent of each other, so it is not possible to launch an experiment that uses one pool and then falls back to another when a certain condition is met. You must manually decide to shift an experiment from one pool to another.
We do not currently allow a cluster to have resource pools in multiple AWS/GCP regions or across multiple cloud providers. If the master is running in one AWS/GCP region, all resource pools must also be in that region.
If you create a task that needs slots and assign it to a pool that will never have slots (e.g., a pool with CPU-only instances), that task can never be scheduled. Currently, such a task will appear to be PENDING indefinitely.
We are constantly working to improve Determined and would love to hear your feedback either through GitHub issues or in our community Slack.
Setting Up Resource Pools¶
Resource pools are configured via the Master Configuration. For each resource pool, you can configure scheduler and provider information.
If you are using static resource pools and launching agents by hand, you will need to update the Agent Configuration to specify which resource pool the agent should join.
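For example, an agent joining a pool named compute-pool might use an agent configuration along these lines (a sketch; the master address and pool name are placeholders):

master_host: 192.0.2.10      # address of the Determined master (placeholder)
master_port: 8080
resource_pool: compute-pool  # the resource pool this agent joins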
Migrating to Resource Pools¶
With the introduction of resource pools, the Master Configuration has a new format.
This change is backwards compatible, and cluster configurations in the old format will continue to work: a configuration in the old format is interpreted as a cluster with a single resource pool that is the default for both auxiliary and compute tasks. However, to take full advantage of resource pools, you will need to convert to the new format, which is a simple matter of moving and renaming a small number of top-level fields.
The old format had the top-level fields scheduler and provisioner, which set the scheduler and
provisioner settings for the cluster. The new format has the top-level fields resource_manager
and resource_pools. The resource_manager section holds cluster-level settings, such as which
pools should be used by default, along with the default scheduler settings; the scheduler
information is identical to the scheduler field in the legacy format. The resource_pools section
is a list of resource pools, each of which has a name, a description, and pool-level settings.
Each resource pool can be configured with a provider field that contains the same information as
the provisioner field in the legacy format. Each resource pool can also have a scheduler field
that sets pool-specific scheduler settings. If the scheduler field is not set for a specific
resource pool, the default settings are used.
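As a rough sketch of the mapping (assuming a single pool named default; most scheduler and provisioner fields are elided), a legacy configuration and its new-format equivalent might look like this:

# legacy format: one implicit pool for everything
scheduler:
  type: fair_share
provisioner:
  instance_type: p3dn.24xlarge

# equivalent new format
resource_manager:
  type: agent
  scheduler:
    type: fair_share
  default_aux_resource_pool: default
  default_compute_resource_pool: default

resource_pools:
  - pool_name: default
    provider:
      instance_type: p3dn.24xlarge  # same fields as the legacy provisioner section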
Note that defining pool-specific scheduler settings is all-or-nothing. If the pool-specific
scheduler field is blank, all scheduler settings will be inherited from the settings defined in
resource_manager.scheduler. If any fields are set in the pool-specific scheduler section, no
settings will be inherited from resource_manager.scheduler; you need to redefine everything.
Here is an example master configuration that illustrates the potential problem:
resource_manager:
  type: agent
  scheduler:
    type: round_robin
    fitting_policy: best
  default_aux_resource_pool: pool1
  default_compute_resource_pool: pool1

resource_pools:
  - pool_name: pool1
    scheduler:
      fitting_policy: worst
In this example, we set the cluster-wide scheduler defaults to a best-fit, round-robin scheduler
in resource_manager.scheduler. We then override the scheduler settings at the pool level for
pool1. Because we set scheduler.fitting_policy=worst, no settings are inherited from
resource_manager.scheduler, so pool1 ends up using a worst-fit, fair-share scheduler (when
scheduler.type is left blank, the default value is fair_share).
If you want pool1 to use a worst-fit, round-robin scheduler, you need to redefine the scheduler
type at the pool level:
resource_manager:
  type: agent
  scheduler:
    type: round_robin
    fitting_policy: best
  default_aux_resource_pool: pool1
  default_compute_resource_pool: pool1

resource_pools:
  - pool_name: pool1
    scheduler:
      type: round_robin
      fitting_policy: worst
Launching Tasks Into Resource Pools¶
When creating a task, the task configuration file has a section called "resources". You can set
the resource_pool subfield to specify the resource pool that the task should be launched into:
resources:
  resource_pool: pool1
If this field is not set, the task will be launched into one of the two default pools defined in
the Master Configuration. Experiments will be launched into the default compute pool.
TensorBoards will be launched into the default auxiliary pool. Commands, shells, and notebooks
that request a slot (the default behavior if the resources.slots field is not set) will be
launched into the default compute pool. Commands, shells, and notebooks that explicitly request 0
slots (for example, via the "Launch CPU-only Notebook" button in the Web UI) will use the default
auxiliary pool.
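For instance, assuming the aws-v100-spot pool from the earlier scenario exists, a notebook could be launched into it explicitly with a configuration file like the following, passed via det notebook start --config-file (a sketch; the pool name is illustrative):

resources:
  slots: 1                       # request one GPU slot
  resource_pool: aws-v100-spot   # override the default compute pool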