Experiment Configuration¶
The behavior of an experiment can be configured via a YAML file. A configuration file is typically passed as a command-line argument when an experiment is created with the Determined CLI. For example:
det experiment create config-file.yaml model-directory
Metadata¶
Optional Fields
name: A short human-readable name for the experiment.
description: A human-readable description of the experiment. This does not need to be unique but should be limited to less than 255 characters for the best experience.
labels: A list of label names (strings). Assigning labels to experiments allows you to identify experiments that share the same property or should be grouped together. You can add and remove labels using either the CLI (det experiment label) or the WebUI.
data: This field can be used to specify information about how the experiment accesses and loads training data. The content and format of this field are user-defined: it should be used to specify whatever configuration is needed for loading data for use by the experiment’s model definition. For example, if your experiment loads data from Amazon S3, the data field might contain the S3 bucket name, object prefix, and AWS authentication credentials.
Entrypoint¶
Required Fields
entrypoint: The location of the trial class in a user’s model definition as an entrypoint specification string. The entrypoint specification is expected to take the form <module>:<object reference>. <module> specifies the module containing the trial class within the model definition, relative to the root. <object reference> specifies the naming of the trial class within the module. It may be a nested object delimited by dots. Examples:
:MNistTrial expects an MNistTrial class that is exposed in a __init__.py file at the top level of the context directory.
model_def:CIFAR10Trial expects a CIFAR10Trial class that is defined in a file model_def.py at the top level of the context directory.
determined_lib.trial:trial_classes.NestedTrial expects a NestedTrial class that is an attribute of trial_classes, where trial_classes is defined in a file determined_lib/trial.py.
Note that this follows the Entry points specification defined in the Python Packaging User Guide with a single difference: the name of the context directory is prefixed to <module>, or used as the module if <module> is empty.
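The entrypoint rules above can be sketched in a few lines of Python. This is only an illustration of the string format, not Determined's own loader; the function name split_entrypoint and the context_dir_name argument are hypothetical.

```python
def split_entrypoint(spec, context_dir_name):
    """Split '<module>:<object reference>' into a module path and attribute names.

    context_dir_name is a hypothetical stand-in for the name of the context
    directory, which is prefixed to <module> or used alone if <module> is empty.
    """
    module, sep, obj_ref = spec.partition(":")
    if not sep or not obj_ref:
        raise ValueError(f"expected '<module>:<object reference>', got {spec!r}")
    # An empty <module> means the class lives in the context directory's __init__.py.
    full_module = f"{context_dir_name}.{module}" if module else context_dir_name
    # The object reference may be nested, delimited by dots.
    return full_module, obj_ref.split(".")


print(split_entrypoint("model_def:CIFAR10Trial", "ctx"))
print(split_entrypoint(":MNistTrial", "ctx"))
```

The returned list of attribute names would be resolved one at a time, for example with getattr, after importing the module.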
Basic Behaviors¶
Optional Fields
scheduling_unit: Specifies how frequently system operations, such as periodic checkpointing and preemption, are performed, in units of batches. The number of records in a batch is controlled by the global_batch_size hyperparameter. Defaults to 100.
Setting this value too small can increase the overhead of system operations and decrease training throughput.
Setting this value too large might prevent the system from reallocating resources from this workload to another, potentially more important, workload.
As a rule of thumb, it should be set to the number of batches that can be trained in roughly 60–180 seconds.
records_per_epoch: The number of records in the training data set. It must be configured if you want to specify min_validation_period, min_checkpoint_period, searcher.max_length, or searcher.length_per_round in units of epochs.
The system does not attempt to determine the size of an epoch automatically, because the size of the training set might vary based on data augmentation, changes to external storage, or other factors.
max_restarts: The maximum number of times that trials in this experiment will be restarted due to an error. If an error occurs while a trial is running (e.g., a container crashes abruptly), the Determined master will automatically restart the trial and continue running it. This parameter specifies a limit on the number of times to try restarting a trial; this ensures that Determined does not go into an infinite loop if a trial encounters the same error repeatedly. Once max_restarts trial failures have occurred for a given experiment, subsequent failed trials will not be restarted; instead, they will be marked as errored. The experiment itself will continue running; an experiment is considered to complete successfully if at least one of its trials completes successfully. The default value is 5.
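As a worked example of the scheduling_unit rule of thumb, the sketch below turns a measured training throughput into a range of batch counts that take roughly 60 to 180 seconds. The helper name and the throughput figure are hypothetical; only the 60–180 second window comes from the text.

```python
import math


def scheduling_unit_range(batches_per_second, min_seconds=60, max_seconds=180):
    """Return (low, high) batch counts that train in about min_seconds..max_seconds."""
    if batches_per_second <= 0:
        raise ValueError("batches_per_second must be positive")
    low = max(1, math.ceil(batches_per_second * min_seconds))
    high = max(low, math.floor(batches_per_second * max_seconds))
    return low, high


# Hypothetical measurement: the model trains 2.5 batches per second.
low, high = scheduling_unit_range(2.5)
print(f"scheduling_unit between {low} and {high} batches")
```

Any value in the printed range satisfies the rule of thumb; the default of 100 suits workloads that train roughly one batch per second.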
Validation Policy¶
Optional Fields
min_validation_period: Specifies the minimum frequency at which validation is run for each trial.
This needs to be set in the unit of records, batches, or epochs using a nested dictionary. For example:
min_validation_period:
  epochs: 2
If this is in the unit of epochs, records_per_epoch must be specified.
perform_initial_validation: Instructs Determined to perform an initial validation before any training begins, for each trial. This can be useful to determine a baseline when fine-tuning a model on a new dataset.
Checkpoint Policy¶
Determined takes a checkpoint in the following situations:
During training, periodically, to keep a record of the training progress;
During training, so that the trial’s execution can be resumed after a pause or recovered after an error;
When the trial is completed;
Before the searcher makes a decision based on the validation of trials, to maintain consistency in the event of a failure.
Optional Fields
min_checkpoint_period: Specifies the minimum frequency at which checkpoints are taken for each trial.
This needs to be set in the unit of records, batches, or epochs using a nested dictionary. For example:
min_checkpoint_period:
  epochs: 2
If this is in the unit of epochs, records_per_epoch must be specified.
checkpoint_policy: Controls how Determined performs checkpoints after validation operations, if at all. Should be set to one of the following values:
best (default): A checkpoint will be taken after every validation operation that performs better than all previous validations for this experiment. Validation metrics are compared according to the metric and smaller_is_better options in the searcher configuration.
all: A checkpoint will be taken after every validation, no matter the validation performance.
none: A checkpoint will never be taken due to a validation. However, even with this policy selected, checkpoints are still expected to be taken after the trial is finished training, due to cluster scheduling decisions, before search method decisions, or due to min_checkpoint_period.
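The three checkpoint_policy values can be read as a small decision function. The sketch below is an illustration under two assumptions that the text does not state: the first validation counts as the best so far, and ties do not count as an improvement. The function name is hypothetical.

```python
def checkpoint_after_validation(policy, metric, best_so_far, smaller_is_better=True):
    """Decide whether a validation result triggers a checkpoint.

    best_so_far is the best metric seen so far in the experiment, or None
    if no validation has completed yet (assumption: that case checkpoints).
    """
    if policy == "all":
        return True
    if policy == "none":
        return False
    if policy != "best":
        raise ValueError(f"unknown checkpoint_policy: {policy!r}")
    if best_so_far is None:
        return True
    # Assumption: a strict improvement is required.
    return metric < best_so_far if smaller_is_better else metric > best_so_far


print(checkpoint_after_validation("best", 0.4, 0.5))  # improvement on a loss-like metric
print(checkpoint_after_validation("best", 0.6, 0.5))  # no improvement
```

Checkpoints taken for other reasons, such as min_checkpoint_period or trial completion, are outside this function.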
Checkpoint Storage¶
The checkpoint_storage section defines how model checkpoints will be stored. A checkpoint
contains the architecture and weights of the model being trained. Each checkpoint has a UUID, which
is used as the name of the checkpoint directory on the external storage system.
If this field is not specified, the experiment will default to the checkpoint storage configured in the Master Configuration.
Checkpoint Garbage Collection¶
When an experiment finishes, the system will optionally delete some checkpoints to reclaim space.
The save_experiment_best, save_trial_best and save_trial_latest parameters specify which
checkpoints to save. If multiple save_* parameters are specified, the union of the specified
checkpoints is saved.
save_experiment_best: The number of the best checkpoints with validations over all trials to save (where best is measured by the validation metric specified in the searcher configuration).
save_trial_best: The number of the best checkpoints with validations of each trial to save.
save_trial_latest: The number of the latest checkpoints of each trial to save.
These fields default to the following respective values:
save_experiment_best: 0
save_trial_best: 1
save_trial_latest: 1
This policy will save the most recent and the best checkpoint per trial. In other words, if the most recent checkpoint is also the best checkpoint for a given trial, only one checkpoint will be saved for that trial. Otherwise, two checkpoints will be saved.
Examples¶
Suppose an experiment has the following trials, checkpoints and validation metrics (where
smaller_is_better is true):
| Trial ID | Checkpoint ID | Validation Metric |
|---|---|---|
| 1 | 1 | null |
| 1 | 2 | null |
| 1 | 3 | 0.6 |
| 1 | 4 | 0.5 |
| 1 | 5 | 0.4 |
| 2 | 6 | null |
| 2 | 7 | 0.2 |
| 2 | 8 | 0.3 |
| 2 | 9 | null |
| 2 | 10 | null |
The effect of various policies is enumerated in the following table:
| save_experiment_best | save_trial_best | save_trial_latest | Saved Checkpoint IDs |
|---|---|---|---|
| 0 | 0 | 0 | none |
| 2 | 0 | 0 | 8,7 |
| >= 5 | 0 | 0 | 8,7,5,4,3 |
| 0 | 1 | 0 | 7,5 |
| 0 | >= 3 | 0 | 8,7,5,4,3 |
| 0 | 0 | 1 | 10,5 |
| 0 | 0 | 3 | 10,9,8,5,4,3 |
| 2 | 1 | 0 | 8,7,5 |
| 2 | 0 | 1 | 10,8,7,5 |
| 0 | 1 | 1 | 10,7,5 |
| 2 | 1 | 1 | 10,8,7,5 |
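The policy table can be reproduced with a short simulation. This is a sketch of the union rule described above, not Determined's garbage collector; it assumes that a higher checkpoint ID means a more recent checkpoint, and the function and variable names are hypothetical.

```python
from collections import defaultdict

# (trial_id, checkpoint_id, validation_metric) rows from the example experiment.
CHECKPOINTS = [
    (1, 1, None), (1, 2, None), (1, 3, 0.6), (1, 4, 0.5), (1, 5, 0.4),
    (2, 6, None), (2, 7, 0.2), (2, 8, 0.3), (2, 9, None), (2, 10, None),
]


def saved_checkpoints(rows, experiment_best=0, trial_best=0, trial_latest=0,
                      smaller_is_better=True):
    """Return the set of checkpoint IDs kept by the three save_* parameters."""
    sign = 1 if smaller_is_better else -1

    def rank(row):
        return sign * row[2]  # smaller rank is better

    keep = set()
    # save_experiment_best: best validated checkpoints over all trials.
    validated = [r for r in rows if r[2] is not None]
    keep.update(r[1] for r in sorted(validated, key=rank)[:experiment_best])

    by_trial = defaultdict(list)
    for row in rows:
        by_trial[row[0]].append(row)
    for trial_rows in by_trial.values():
        # save_trial_best: best validated checkpoints within this trial.
        trial_validated = [r for r in trial_rows if r[2] is not None]
        keep.update(r[1] for r in sorted(trial_validated, key=rank)[:trial_best])
        # save_trial_latest: assumes a higher checkpoint ID is more recent.
        latest = sorted(trial_rows, key=lambda r: r[1], reverse=True)
        keep.update(r[1] for r in latest[:trial_latest])
    return keep


print(sorted(saved_checkpoints(CHECKPOINTS, trial_best=1, trial_latest=1), reverse=True))
```

With the default policy (save_trial_best and save_trial_latest both 1), the call keeps checkpoints 10, 7, and 5, matching the corresponding table row.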
If aggressive reclamation is desired, set save_experiment_best to 1 or 2 and leave the other
parameters at zero. For more conservative reclamation, set save_trial_best to 1 or 2; optionally
set save_trial_latest as well.
Checkpoints of an existing experiment can be garbage collected by changing the GC policy using the
det experiment set gc-policy subcommand of the Determined CLI.
Storage Type¶
Determined currently supports several kinds of checkpoint storage (gcs, hdfs, s3,
azure, and shared_fs), identified by the type subfield. Additional fields may also be
required, depending on the type of checkpoint storage in use. For example, to store checkpoints on
Google Cloud Storage:
checkpoint_storage:
  type: gcs
  bucket: <your-bucket-name>
Google Cloud Storage¶
If type: gcs is specified, checkpoints will be stored on Google Cloud Storage (GCS).
Authentication is done using GCP’s “Application Default Credentials” approach. When using Determined
inside Google Compute Engine (GCE), the simplest approach is to ensure that the VMs used by
Determined are running in a service account that has the “Storage Object Admin” role on the GCS
bucket being used for checkpoints. As an alternative (or when running outside of GCE), you can add
the appropriate service account credentials
to your container (e.g., via a bind-mount), and then set the GOOGLE_APPLICATION_CREDENTIALS
environment variable to the container path where the credentials are located. See
Environment Variables for more details on how to set environment variables in containers.
The following fields are required when using GCS checkpoint storage:
bucket: The GCS bucket name to use.
HDFS¶
If type: hdfs is specified, checkpoints will be stored in HDFS using the WebHDFS API for
reading and writing checkpoint resources.
Required Fields
hdfs_url: Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode. Multiple namenodes are allowed as a semicolon-separated list (e.g., "http://namenode1:50070;http://namenode2:50070").
hdfs_path: The prefix path where all checkpoints will be written to and read from. The resources of each checkpoint will be saved in a subdirectory of hdfs_path, where the subdirectory name is the checkpoint’s UUID.
Optional Fields
user: The user name to use for all read and write requests. If not specified, this defaults to the user of the trial runner container.
Amazon S3¶
If type: s3 is specified, checkpoints will be stored in Amazon S3 or an S3-compatible object
store such as MinIO.
Required Fields
bucket: The S3 bucket name to use.
access_key: The AWS access key to use.
secret_key: The AWS secret key to use.
Optional Fields
prefix: The optional path prefix to use. Must not contain "..". Note: Prefix is normalized, e.g., /pre/.//fix -> /pre/fix
endpoint_url: The endpoint to use for S3 clones, e.g., http://127.0.0.1:8080/. If not specified, Amazon S3 will be used.
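The prefix normalization can be illustrated with Python's standard library. This sketch uses posixpath.normpath as a stand-in for whatever Determined does internally, and it assumes the restriction on parent references applies to whole path components; the function name is hypothetical.

```python
import posixpath


def normalize_prefix(prefix):
    """Reject parent-directory components, then collapse '.' and repeated slashes."""
    # Assumption: the restriction applies to path components, not substrings.
    if ".." in prefix.split("/"):
        raise ValueError("prefix must not contain '..'")
    return posixpath.normpath(prefix)


print(normalize_prefix("/pre/.//fix"))
```

The call reproduces the example in the prefix description, mapping /pre/.//fix to /pre/fix.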
Azure Blob Storage¶
If type: azure is specified, checkpoints will be stored in Microsoft’s Azure Blob Storage.
Specify only one of connection_string or the (account_url, credential) pair.
Required Fields
container: The Azure Blob Storage container name to use.
connection_string: The connection string for the Azure Blob Storage service account to use.
account_url: The account URL for the Azure Blob Storage service account to use.
Optional Fields
credential: The credential to use with the account_url.
Hyperparameters¶
The hyperparameters section defines the hyperparameter space for the experiment. Which
hyperparameters are appropriate for a given model is up to the user and depends on the nature of the
model being trained. In Determined, it is common to specify hyperparameters that influence many
aspects of the model’s behavior, including how data augmentation is done, the architecture of the
neural network, and which optimizer to use, along with how that optimizer should be configured.
The value chosen for a hyperparameter in a given trial can be accessed via the trial context using
context.get_hparam(). For instance, the current value
of a hyperparameter named learning_rate can be accessed by
context.get_hparam("learning_rate").
Note
Every experiment must specify a hyperparameter named global_batch_size. This is because this
hyperparameter is treated specially: when doing distributed training, the global batch size must
be known so that the per-worker batch size can be computed appropriately. Batch size per slot is
computed at runtime, based on the number of slots used to train a single trial of this experiment
(see resources.slots_per_trial). The updated values
should be accessed via the trial context, using context.get_per_slot_batch_size() and context.get_global_batch_size().
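The arithmetic behind the per-slot batch size can be sketched as follows. The integer division is an assumption, since the text only says the value is computed at runtime from the number of slots; in a real trial the values should come from context.get_per_slot_batch_size() and context.get_global_batch_size(). The function name is hypothetical.

```python
def split_global_batch(global_batch_size, slots_per_trial):
    """Return (per_slot_batch_size, effective_global_batch_size).

    Assumption: the global batch is divided evenly with integer division, so
    the effective global batch size can be smaller than the configured one.
    """
    if slots_per_trial < 1:
        raise ValueError("slots_per_trial must be at least 1")
    per_slot = global_batch_size // slots_per_trial
    if per_slot < 1:
        raise ValueError("global_batch_size must be at least slots_per_trial")
    return per_slot, per_slot * slots_per_trial


print(split_global_batch(64, 8))  # divides evenly
print(split_global_batch(64, 3))  # does not divide evenly
```

The second call shows why the updated values should be read from the trial context rather than from the configuration file.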
The hyperparameter space is defined by a dictionary. Each key in the dictionary is the name of a hyperparameter; the associated value defines the range of the hyperparameter. If the value is a scalar, the hyperparameter is a constant; otherwise, the value should be a nested map. Here is an example:
hyperparameters:
  global_batch_size: 64
  optimizer_config:
    optimizer:
      type: categorical
      vals:
        - SGD
        - Adam
        - RMSprop
    learning_rate:
      type: log
      minval: -5.0
      maxval: 1.0
      base: 10.0
  num_layers:
    type: int
    minval: 1
    maxval: 3
  layer1_dropout:
    type: double
    minval: 0.2
    maxval: 0.5
This configuration defines the following hyperparameters:
global_batch_size: a constant value
optimizer_config: a top level nested hyperparameter with two child hyperparameters:
  optimizer: a categorical hyperparameter
  learning_rate: a log scale hyperparameter
num_layers: an integer hyperparameter
layer1_dropout: a double hyperparameter
The field optimizer_config demonstrates how nesting can be used to organize hyperparameters.
Arbitrary levels of nesting are supported with all types of hyperparameters. Aside from
hyperparameters with constant values, the four types of hyperparameters – categorical,
double, int, and log – can take on a range of possible values. The following sections
cover how to configure the hyperparameter range for each type of hyperparameter.
Categorical¶
A categorical hyperparameter ranges over a set of specified values. The possible values are
defined by the vals key. vals is a list; each element of the list can be of any valid YAML
type, such as a boolean, a string, a number, or a collection.
Double¶
A double hyperparameter is a floating point variable. The minimum and maximum values of the
variable are defined by the minval and maxval keys, respectively (inclusive of endpoints).
When doing a grid search, the count key can also be specified; this defines the number of points
in the grid for this hyperparameter. Grid points are evenly spaced between minval and
maxval. See Hyperparameter Search: Grid for details.
Integer¶
An int hyperparameter is an integer variable. The minimum and maximum values of the variable are
defined by the minval and maxval keys, respectively (inclusive of endpoints).
When doing a grid search, the count key can also be specified; this defines the number of points
in the grid for this hyperparameter. Grid points are evenly spaced between minval and
maxval. See Hyperparameter Search: Grid for details.
Log¶
A log hyperparameter is a floating point variable that is searched on a logarithmic scale. The
base of the logarithm is specified by the base field; the minimum and maximum exponent values of
the hyperparameter are given by the minval and maxval fields, respectively (inclusive of
endpoints).
When doing a grid search, the count key can also be specified; this defines the number of points
in the grid for this hyperparameter. Grid points are evenly spaced between minval and
maxval. See Hyperparameter Search: Grid for details.
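To make the grid spacing concrete, the sketch below computes grid points for a double and a log hyperparameter. It assumes that count is at least 2 and that, for log, the even spacing applies to the exponents between minval and maxval; the function names are hypothetical, and int hyperparameters are left out because the text does not say how their grid points are rounded.

```python
def linear_grid(minval, maxval, count):
    """Return count evenly spaced points from minval to maxval, inclusive."""
    if count < 2:
        raise ValueError("this sketch assumes count >= 2")
    step = (maxval - minval) / (count - 1)
    return [minval + i * step for i in range(count)]


def log_grid(minval, maxval, base, count):
    """Evenly space the exponents, then raise base to each exponent."""
    return [base ** exponent for exponent in linear_grid(minval, maxval, count)]


print(linear_grid(0.2, 0.5, 4))          # layer1_dropout with count: 4
print(log_grid(-5.0, 1.0, 10.0, 7))      # learning_rate with count: 7
```

For the learning_rate range used earlier (minval -5.0, maxval 1.0, base 10.0), seven grid points land on the powers of ten from 1e-5 to 10.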
Searcher¶
The searcher section defines how the experiment’s hyperparameter space will be explored. To run
an experiment that trains a single trial with fixed hyperparameters, specify the single searcher
and specify constant values for the model’s hyperparameters. Otherwise, Determined supports four
different hyperparameter search algorithms: adaptive_asha, random, grid, and pbt.
The name of the hyperparameter search algorithm to use is configured via the name field; the
remaining fields configure the behavior of the searcher and depend on the searcher being used. For
example, to configure a random hyperparameter search that trains 5 trials for 1000 batches each:
searcher:
  name: random
  metric: accuracy
  max_trials: 5
  max_length:
    batches: 1000
For details on using Determined to perform hyperparameter search, and for more information on the search methods supported by Determined, refer to Hyperparameter Tuning.
Single¶
The single search method does not perform a hyperparameter search at all; rather, it trains a
single trial for a fixed length. When using this search method, all of the hyperparameters specified
in the hyperparameters section must be constants.
By default, validation metrics are only computed once, after the specified length of training has
been completed; min_validation_period can be used
to specify that validation metrics should be computed more frequently.
Required Fields
metric: The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_length: The length of the trial.
This needs to be set in the unit of records, batches, or epochs using a nested dictionary. For example:
max_length:
  epochs: 2
If this is in the unit of epochs, records_per_epoch must be specified.
Optional Fields
smaller_is_better: Whether to minimize or maximize the metric defined above. The default value is true (minimize).
source_trial_id: If specified, the weights of this trial will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of this experiment.
source_checkpoint_uuid: Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
Random¶
The random search method implements a simple random search. The user specifies how many
hyperparameter configurations should be trained and how long each configuration should be trained
for; the configurations are sampled randomly from the hyperparameter space. Each trial is trained
for the specified length and then validation metrics are computed. min_validation_period can be used to specify that validation metrics should be
computed more frequently.
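The sampling step of a random search can be sketched as below. This is an illustration rather than Determined's sampler: it assumes uniform sampling within each range (uniform over the exponent for log), and it treats any dictionary without a type key as a nested group. The function name and the SPACE dictionary, which mirrors the hyperparameters example earlier in this document, are hypothetical.

```python
import random

SPACE = {
    "global_batch_size": 64,
    "optimizer_config": {
        "optimizer": {"type": "categorical", "vals": ["SGD", "Adam", "RMSprop"]},
        "learning_rate": {"type": "log", "minval": -5.0, "maxval": 1.0, "base": 10.0},
    },
    "num_layers": {"type": "int", "minval": 1, "maxval": 3},
    "layer1_dropout": {"type": "double", "minval": 0.2, "maxval": 0.5},
}


def sample_hparams(space, rng):
    """Draw one hyperparameter configuration from a (possibly nested) space."""
    config = {}
    for name, spec in space.items():
        if not isinstance(spec, dict):
            config[name] = spec  # scalar value: a constant
        elif "type" not in spec:
            config[name] = sample_hparams(spec, rng)  # nested group
        elif spec["type"] == "categorical":
            config[name] = rng.choice(spec["vals"])
        elif spec["type"] == "int":
            config[name] = rng.randint(spec["minval"], spec["maxval"])  # inclusive
        elif spec["type"] == "double":
            config[name] = rng.uniform(spec["minval"], spec["maxval"])
        elif spec["type"] == "log":
            exponent = rng.uniform(spec["minval"], spec["maxval"])
            config[name] = spec["base"] ** exponent
        else:
            raise ValueError(f"unknown hyperparameter type: {spec['type']!r}")
    return config


rng = random.Random(0)
trials = [sample_hparams(SPACE, rng) for _ in range(5)]  # max_trials: 5
print(trials[0])
```

Each of the five sampled configurations would then be trained for max_length before its validation metric is computed.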
Required Fields
metric: The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_trials: The number of trials, i.e., hyperparameter configurations, to evaluate.
max_length: The length of each trial.
This needs to be set in the unit of records, batches, or epochs using a nested dictionary. For example:
max_length:
  epochs: 2
If this is in the unit of epochs, records_per_epoch must be specified.
Optional Fields
smaller_is_better: Whether to minimize or maximize the metric defined above. The default value is true (minimize).
max_concurrent_trials: The maximum number of trials that can be worked on simultaneously. The default value is 0, in which case we will try to work on as many trials as possible.
source_trial_id: If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is incompatible with the model architecture of any of the trials in this experiment.
source_checkpoint_uuid: Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
Grid¶
The grid search method performs a grid search. The coordinates of the hyperparameter grid are
specified via the hyperparameters field. For more details, see
Hyperparameter Search: Grid.
Required Fields
metric: The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_length: The length of each trial.
This needs to be set in the unit of records, batches, or epochs using a nested dictionary. For example:
max_length:
  epochs: 2
If this is in the unit of epochs, records_per_epoch must be specified.
Optional Fields
smaller_is_better: Whether to minimize or maximize the metric defined above. The default value is true (minimize).
max_concurrent_trials: The maximum number of trials that can be worked on simultaneously. The default value is 0, in which case we will try to work on as many trials as possible.
source_trial_id: If specified, the weights of this trial will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of this experiment.
source_checkpoint_uuid: Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
Adaptive ASHA¶
The adaptive_asha search method employs multiple calls to the asynchronous successive halving
algorithm (ASHA), which is suitable for large-scale
experiments with hundreds or thousands of trials.
Required Fields
metric: The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_length: The maximum training length of any one trial. The vast majority of trials will be stopped early, and thus only a small fraction of trials will actually be trained for this long. This quantity is domain-specific and should roughly reflect the length of training needed for the model to converge on the data set.
This needs to be set in the unit of records, batches, or epochs using a nested dictionary. For example:
max_length:
  epochs: 2
If this is in the unit of epochs, records_per_epoch must be specified.
max_trials: The number of trials, i.e., hyperparameter configurations, to evaluate.
Optional Fields
smaller_is_better: Whether to minimize or maximize the metric defined above. The default value is true (minimize).
mode: How aggressively to perform early stopping. There are three modes: aggressive, standard, and conservative; the default is standard.
These modes differ in the degree to which early-stopping is used. In aggressive mode, the searcher quickly stops underperforming trials, which enables the searcher to explore more hyperparameter configurations, but at the risk of discarding a configuration too soon. On the other end of the spectrum, conservative mode performs significantly less downsampling, but as a consequence does not explore as many configurations given the same budget. We recommend using either aggressive or standard mode.
stop_once: If stop_once is set to true, we will use a variant of ASHA that will not resume trials once stopped. This variant defaults to continuing training and will only stop trials if there is enough evidence to terminate training. We recommend using this version of ASHA when training a trial for the max length as fast as possible is important or when fault tolerance is too expensive.
divisor: Determines the fraction of trials to keep at each rung, as well as the training length for each rung. The default setting is 4; only advanced users should consider changing this value.
max_rungs: The maximum number of times we evaluate intermediate results for a trial and terminate poorly performing trials. The default value is 5; only advanced users should consider changing this value.
max_concurrent_trials: The maximum number of trials that can be worked on simultaneously. The default value is 0, and we set reasonable values depending on max_trials and the number of rungs in the brackets. This is akin to controlling the degree of parallelism of the experiment. If this value is less than the number of brackets produced by the adaptive algorithm, it will be rounded up.
source_trial_id: If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of any of the trials in this experiment.
source_checkpoint_uuid: Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
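To give a feel for how divisor and max_rungs interact, the sketch below follows the textbook, synchronous form of successive halving: each rung trains for divisor times longer than the previous one, and the top 1/divisor of trials advances. This is a simplification and an assumption; it does not model ASHA's asynchronous promotions or how adaptive_asha constructs its brackets, and the function names are hypothetical.

```python
def rung_lengths(max_length, divisor=4, max_rungs=5):
    """Training length at each rung, ending at max_length for the top rung."""
    return [
        max(1, max_length // divisor ** (max_rungs - 1 - rung))
        for rung in range(max_rungs)
    ]


def promote(metrics, divisor=4, smaller_is_better=True):
    """Return the trial IDs in the top 1/divisor of a rung, best first."""
    keep = max(1, len(metrics) // divisor)
    ranked = sorted(metrics, key=metrics.get, reverse=not smaller_is_better)
    return ranked[:keep]


print(rung_lengths(1024))  # lengths in the same unit as max_length
rung0 = {"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.3, "e": 0.2, "f": 0.7, "g": 0.4, "h": 0.8}
print(promote(rung0))      # two of eight trials advance
```

Under these assumptions, most trials stop after the shortest rung, which is why only a small fraction of trials is ever trained for the full max_length.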
PBT¶
The pbt search method uses population-based training, which maintains a
population of active trials to train. After each trial has been trained the length specified by
length_per_round, all the trials are validated. The searcher then closes some trials and
replaces them with altered copies of other trials. This process makes up one “round”; the searcher
runs some number of rounds to execute a complete search. The model definition class must be able to
restore from a checkpoint that was created with a different set of hyperparameters; in particular,
you will not be able to vary any hyperparameters that change the sizes of weight matrices without
taking special steps to save or restore models.
Required Fields
metric: Specifies the name of the validation metric used to evaluate the performance of a hyperparameter configuration.
population_size: The number of trials (i.e., different hyperparameter configurations) to keep active at a time.
length_per_round: The length to train each trial during a round.
This needs to be set in the unit of records, batches, or epochs using a nested dictionary. For example:
length_per_round:
  epochs: 2
If this is in the unit of epochs, records_per_epoch must be specified.
num_rounds: The total number of rounds to execute.
replace_function: How to choose which trials to close and which trials to copy at the end of each round. At present, only a single replacement function is supported:
  truncate_fraction: Defines truncation selection, in which the worst truncate_fraction (multiplied by the population size) trials, ranked by validation metric, are closed and the same number of top trials are copied.
explore_function: How to alter a set of hyperparameters when a copy of a trial is made. Each parameter is either resampled (i.e., its value is chosen from the configured distribution) or perturbed (i.e., its value is computed based on the value in the original set). explore_function has two required sub-fields:
  resample_probability: The probability that a parameter is replaced with a new value sampled from the original distribution specified in the configuration.
  perturb_factor: The amount by which parameters that are not resampled are perturbed. Each numerical hyperparameter is multiplied by either 1 + perturb_factor or 1 - perturb_factor with equal probability; categorical and const hyperparameters are left unchanged.
Optional Fields
smaller_is_better: Whether to minimize or maximize the metric defined above. The default value is true (minimize).
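One PBT round, as described by replace_function and explore_function, can be sketched as follows. This is a simplified illustration, not Determined's implementation: it assumes the number of closed trials is rounded down, it perturbs every numeric value it is given (so constants should be excluded by the caller), and the resample callback and function names are hypothetical.

```python
import random


def truncate(population, truncate_fraction, smaller_is_better=True):
    """population: list of (trial_id, metric). Return (closed_ids, copied_ids)."""
    n = int(truncate_fraction * len(population))  # assumption: round down
    ranked = sorted(population, key=lambda t: t[1], reverse=not smaller_is_better)
    copied = [trial_id for trial_id, _ in ranked[:n]]                 # top n
    closed = [trial_id for trial_id, _ in ranked[len(ranked) - n:]]   # worst n
    return closed, copied


def explore(hparams, resample, resample_probability, perturb_factor, rng):
    """Return an altered copy of hparams for a newly copied trial."""
    altered = {}
    for name, value in hparams.items():
        if rng.random() < resample_probability:
            altered[name] = resample(name, rng)  # draw from the configured range
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            factor = 1 + perturb_factor if rng.random() < 0.5 else 1 - perturb_factor
            altered[name] = value * factor
        else:
            altered[name] = value  # categorical values are left unchanged
    return altered


rng = random.Random(0)
closed, copied = truncate([("a", 0.5), ("b", 0.1), ("c", 0.9), ("d", 0.3)], 0.5)
print(closed, copied)
child = explore({"lr": 0.1, "optimizer": "SGD"}, lambda name, r: None, 0.0, 0.2, rng)
print(child)
```

In a full search, each closed trial would be replaced by a copy of a top trial that restores the top trial's checkpoint and continues training with the altered hyperparameters.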
Resources¶
The resources section defines the resources that an experiment is allowed to use.
Optional Fields
slots_per_trial: The number of slots to use for each trial of this experiment. The default value is 1; specifying a value greater than 1 means that multiple GPUs will be used in parallel. Training on multiple GPUs is done using data parallelism. Configuring slots_per_trial to be greater than max_slots is not sensible and will result in an error.
Note
Using slots_per_trial to enable data parallel training for PyTorch can alter the behavior of certain models, as described in the PyTorch documentation.
agent_label: If set, tasks launched for this experiment will only be scheduled on agents that have the given label set. If this is not set (the default behavior), tasks launched for this experiment will only be scheduled on unlabeled agents. An agent’s label can be configured via the label field in the agent configuration.
max_slots: The maximum number of scheduler slots that this experiment is allowed to use at any one time. The slot limit of an active experiment can be changed using det experiment set max-slots <id> <slots>. By default, there is no limit on the number of slots an experiment can use.
Warning
max_slots is only considered when scheduling jobs; it is not currently used when provisioning dynamic agents. This means that we may provision more instances than the experiment can schedule.
weight: The weight of this experiment in the scheduler. When multiple experiments are running at the same time, the number of slots assigned to each experiment will be approximately proportional to its weight. The weight of an active experiment can be changed using det experiment set weight <id> <weight>. The default weight is 1.
shm_size: The size in bytes of /dev/shm for trial containers. Defaults to 4294967296 (4GiB). If set, this value overrides the value specified in the master configuration.
priority: The priority assigned to this experiment. Only applicable when using the priority scheduler. Experiments with smaller priority values are scheduled before experiments with higher priority values. If using Kubernetes, the opposite is true; experiments with higher priorities are scheduled before those with lower priorities. Refer to Scheduling for more information.
resource_pool: The resource pool where this experiment will be scheduled. If no resource pool is specified, experiments will run in the default GPU pool. Refer to Resource Pools for more information.
devices: A list of device strings to pass to the Docker daemon. Each entry in the list is equivalent to a --device DEVICE command line argument to docker run. devices is honored by resource managers of type agent but is ignored by resource managers of type kubernetes. See master configuration for details about resource managers.
Bind Mounts¶
The bind_mounts section specifies directories that are bind-mounted into every container
launched for this experiment. Bind mounts are often used to enable trial containers to access
additional data that is not part of the model definition directory.
This field should consist of an array of entries; each entry has the form described below. Users must ensure that the specified host paths are accessible on all agent hosts (e.g., by configuring a network file system appropriately).
For each bind mount, the following fields are required:
host_path: The file system path on each agent to use. Must be an absolute filepath.
container_path: The file system path in the container to use. May be a relative filepath, in which case it will be mounted relative to the working directory inside the container. It is not allowed to mount directly into the working directory (i.e., container_path == ".") to reduce the risk of cluttering the host filesystem.
For each bind mount, the following optional fields may also be specified:
read_only: Whether the bind-mount should be a read-only mount. Defaults to false.
propagation: Propagation behavior for replicas of the bind-mount. Defaults to rprivate.
For example, to mount /data on the host to the same path in the container, use:
bind_mounts:
  - host_path: /data
    container_path: /data
It is also possible to mount multiple paths:
bind_mounts:
  - host_path: /data
    container_path: /data
  - host_path: /shared/read-only-data
    container_path: /shared/read-only-data
    read_only: true
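The constraints on bind mount entries can be checked before an experiment is submitted. The sketch below is a hypothetical pre-flight check, not part of the Determined CLI; it applies the rules stated above (absolute host_path, no mount directly into the working directory) and fills in the documented defaults.

```python
import posixpath


def check_bind_mount(entry):
    """Validate one bind_mounts entry and return it with defaults applied."""
    host_path = entry["host_path"]
    container_path = entry["container_path"]
    if not posixpath.isabs(host_path):
        raise ValueError(f"host_path must be absolute: {host_path!r}")
    # Mounting directly into the working directory is not allowed.
    if posixpath.normpath(container_path) == ".":
        raise ValueError("container_path must not be the working directory")
    return {
        "host_path": host_path,
        "container_path": container_path,
        "read_only": entry.get("read_only", False),
        "propagation": entry.get("propagation", "rprivate"),
    }


mounts = [
    {"host_path": "/data", "container_path": "/data"},
    {"host_path": "/shared/read-only-data",
     "container_path": "/shared/read-only-data", "read_only": True},
]
for mount in mounts:
    print(check_bind_mount(mount))
```

The check cannot confirm that the host paths exist on every agent; that still has to be ensured separately, for example with a network file system.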
Environment¶
The environment section defines properties of the container environment that is used to execute
workloads for this experiment. For more information on customizing the trial environment, refer to
Custom Environment.
Optional Fields
imageThe Docker image to use when executing the workload. This image must be accessible via
docker pullto every Determined agent machine in the cluster. Users can configure different container images for NVIDIA GPU tasks usingcudakey (gpuprior to 0.17.6), CPU tasks usingcpukey, and ROCm (AMD GPU) tasks usingrocmkey. Default values:determinedai/environments:cuda-11.3-pytorch-1.10-lightning-1.5-tf-2.8-gpu-0.17.15for NVIDIA GPUs.determinedai/environments:py-3.8-pytorch-1.10-lightning-1.5-tf-2.8-cpu-0.17.15for CPUs.determinedai/environments:rocm-4.2-pytorch-1.9-tf-2.5-rocm-0.17.15for ROCm.
force_pull_image: Forcibly pull the image from the Docker registry, bypassing the Docker cache. Defaults to false.
registry_auth: The Docker registry credentials to use when pulling a custom base Docker image, if needed. Credentials are specified as the following nested fields:
    username (required)
    password (required)
    server (optional)
    email (optional)
environment_variables: A list of environment variables that will be set in every trial container. Each element of the list should be a string of the form NAME=VALUE. See Environment Variables for more details. Users can customize environment variables for CUDA (NVIDIA GPU), CPU, and ROCm (AMD GPU) tasks differently by specifying a dict with cuda (gpu prior to 0.17.6), cpu, and rocm keys.
pod_spec: Only applicable when running Determined on Kubernetes. Applies a pod spec to the pods that are launched by Determined for this task. See Custom Pod Specs for details.
add_capabilities: A list of Linux capabilities to grant to task containers. Each entry in the list is equivalent to a --cap-add CAP command line argument to docker run. add_capabilities is honored by resource managers of type agent but is ignored by resource managers of type kubernetes. See master configuration for details about resource managers.
drop_capabilities: Just like add_capabilities but corresponding to the --cap-drop argument of docker run rather than --cap-add.
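As a rough sketch of how the fields above fit together, an environment section might look like the following. The image tags are the documented defaults; the variable names and the capability are hypothetical placeholders chosen for illustration.

```yaml
environment:
  # Per-device images; the values shown are the documented defaults.
  image:
    cuda: determinedai/environments:cuda-11.3-pytorch-1.10-lightning-1.5-tf-2.8-gpu-0.17.15
    cpu: determinedai/environments:py-3.8-pytorch-1.10-lightning-1.5-tf-2.8-cpu-0.17.15
  force_pull_image: false
  # Hypothetical variables, each in NAME=VALUE form.
  environment_variables:
    - DATA_ROOT=/data
    - LOG_LEVEL=info
  # Hypothetical capability; ignored by resource managers of type kubernetes.
  add_capabilities:
    - SYS_PTRACE
```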
Optimizations¶
The optimizations section contains configuration options that influence the performance of the
experiment.
Optional Fields
aggregation_frequency: Specifies after how many batches gradients are exchanged during Distributed Training. Defaults to 1.
average_aggregated_gradients: Whether gradients accumulated across batches (when aggregation_frequency > 1) should be divided by the aggregation_frequency. Defaults to true.
average_training_metrics: For multi-GPU training, whether to average the training metrics across GPUs instead of only using metrics from the chief GPU. This impacts the metrics shown in the Determined UI and TensorBoard, but does not impact the outcome of training or hyperparameter search. This option is currently only supported in PyTorch. Defaults to false.
gradient_compression: Whether to compress gradients when they are exchanged during Distributed Training. Compression may alter gradient values to achieve better space reduction. Defaults to false.
mixed_precision: Whether to use mixed precision training with PyTorch during Distributed Training. Setting O1 enables mixed precision and loss scaling. Defaults to O0, which disables mixed precision training. This configuration setting is deprecated; users are advised to call context.configure_apex_amp in the constructor of their trial class instead.
tensor_fusion_threshold: The threshold in MB for batching together gradients that are exchanged during Distributed Training. Defaults to 64.
tensor_fusion_cycle_time: The delay (in milliseconds) between each tensor fusion during Distributed Training. Defaults to 5.
auto_tune_tensor_fusion: When enabled, configures tensor_fusion_threshold and tensor_fusion_cycle_time automatically. Defaults to false.
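For illustration, an optimizations section that exchanges gradients every 4 batches and averages training metrics across GPUs could be written roughly as follows. The value 4 is an arbitrary example; the remaining values are the documented defaults spelled out explicitly.

```yaml
optimizations:
  aggregation_frequency: 4            # example value; the default is 1
  average_aggregated_gradients: true  # divide accumulated gradients by 4
  average_training_metrics: true      # currently PyTorch only
  gradient_compression: false
  tensor_fusion_threshold: 64         # MB
  tensor_fusion_cycle_time: 5         # milliseconds
  auto_tune_tensor_fusion: false
```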
Reproducibility¶
The reproducibility section specifies configuration options related to reproducible experiments.
See Reproducibility for more details.
Optional Fields
experiment_seed: The random seed to use to initialize random number generators for all trials in this experiment. Must be an integer between 0 and 2^31 - 1. If an experiment_seed is not explicitly specified, the master will automatically generate an experiment seed.
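A minimal sketch of pinning the seed, assuming an arbitrary example value of 42:

```yaml
reproducibility:
  experiment_seed: 42  # any integer between 0 and 2^31 - 1
```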
Profiling¶
The profiling section specifies configuration options related to profiling experiments. See
How To Profile An Experiment for a more detailed walkthrough.
Optional Fields
profiling: Profiling is supported for all frameworks, though timings are only collected for PyTorchTrial. Profiles are collected for a maximum of 5 minutes, regardless of the settings below.
    enabled: Defines whether profiles should be collected or not. Defaults to false.
    begin_on_batch: Specifies the batch on which profiling should begin.
    end_after_batch: Specifies the batch after which profiling should end.
    sync_timings: Specifies whether Determined should wait for all GPU kernel streams before considering a timing as ended. Defaults to true. Applies only to frameworks that collect timing metrics (currently just PyTorch).
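As a sketch, profiling the first hundred batches of a trial might be configured as below. The batch numbers are illustrative values, not recommendations.

```yaml
profiling:
  enabled: true
  begin_on_batch: 0     # illustrative starting batch
  end_after_batch: 100  # illustrative final batch; collection still stops after 5 minutes
  sync_timings: true
```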
Data Layer¶
The data_layer section specifies configuration options related to the Data Layer API for Keras and Estimator.
Determined currently supports three types of storage for the data_layer: s3, gcs, and
shared_fs, identified by the type subfield. Defaults to shared_fs.
Shared File System¶
If type: shared_fs is specified, the cache will be stored in a directory on an agent’s file
system.
Optional Fields
host_storage_path: The file system path on each agent to use.
container_storage_path: The file system path to use as the mount point in the trial runner container.
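A possible shared file system configuration is sketched below; both paths are hypothetical and should be replaced with locations that exist on your agents.

```yaml
data_layer:
  type: shared_fs
  host_storage_path: /mnt/data-layer-cache     # hypothetical path on each agent
  container_storage_path: /data-layer-cache    # hypothetical mount point in the container
```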
Amazon S3¶
If type: s3 is specified, the cache will be stored on Amazon S3 or an S3-compatible object store
such as MinIO.
Required Fields
bucket: The S3 bucket name to use.
bucket_directory_path: The path in the S3 bucket to store the cache.
Optional Fields
local_cache_host_path: The file system path to store a local copy of the cache, which is synchronized with the S3 cache.
local_cache_container_path: The file system path to use as the mount point in the trial runner container for storing the local cache.
access_key: The AWS access key to use.
secret_key: The AWS secret key to use.
endpoint_url: The endpoint to use for S3 clones, e.g., http://127.0.0.1:8080/.
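The following sketch shows an S3-backed cache pointed at an S3-compatible endpoint. The bucket name and paths are hypothetical; the endpoint URL is the example value given above.

```yaml
data_layer:
  type: s3
  bucket: my-cache-bucket                    # hypothetical bucket name
  bucket_directory_path: data-layer/cache    # hypothetical path inside the bucket
  local_cache_host_path: /mnt/local-cache    # hypothetical path on each agent
  local_cache_container_path: /local-cache   # hypothetical mount point in the container
  endpoint_url: http://127.0.0.1:8080/
```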
Google Cloud Storage¶
If type: gcs is specified, the cache will be stored on Google Cloud Storage (GCS).
Authentication is done using GCP’s “Application Default Credentials” approach. When using Determined
inside Google Compute Engine (GCE), the simplest approach is to ensure that the VMs used by
Determined are running in a service account that has the “Storage Object Admin” role on the GCS
bucket being used for checkpoints. As an alternative (or when running outside of GCE), you can add
the appropriate service account credentials
to your container (e.g., via a bind-mount), and then set the GOOGLE_APPLICATION_CREDENTIALS
environment variable to the container path where the credentials are located. See
Environment Variables for more details on how to set environment variables in containers.
Required Fields
bucket: The GCS bucket name to use.
bucket_directory_path: The path in the GCS bucket to store the cache.
Optional Fields
local_cache_host_path: The file system path to store a local copy of the cache, which is synchronized with the GCS cache.
local_cache_container_path: The file system path to use as the mount point in the trial runner container for storing the local cache.
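One way to combine the pieces described in this section is sketched below: the credentials file is bind-mounted read-only, GOOGLE_APPLICATION_CREDENTIALS points at its container path, and the data layer uses GCS. The bucket name and all file paths are hypothetical.

```yaml
bind_mounts:
  - host_path: /etc/gcp/credentials.json       # hypothetical location on each agent
    container_path: /run/gcp/credentials.json  # hypothetical location in the container
    read_only: true
environment:
  environment_variables:
    - GOOGLE_APPLICATION_CREDENTIALS=/run/gcp/credentials.json
data_layer:
  type: gcs
  bucket: my-cache-bucket                      # hypothetical bucket name
  bucket_directory_path: data-layer/cache      # hypothetical path inside the bucket
```

When running inside GCE with a suitably privileged service account, the bind mount and environment variable are not needed.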
Training Units¶
Some configuration settings, such as searcher training lengths and budgets,
min_validation_period, and min_checkpoint_period, can be expressed in terms of a few
training units: records, batches, or epochs.
records: A record is a single labeled example (sometimes called a sample).
batches: A batch is a group of records. The number of records in a batch is configured via the global_batch_size hyperparameter.
epochs: An epoch is a single copy of the entire training data set; the number of records in an epoch is configured via the records_per_epoch configuration field.
For example, to specify the max_length for a searcher in terms of batches, the configuration
would read as shown below.
  max_length:
    batches: 900
To express it in terms of records or epochs, records or epochs would be specified in place
of batches. In the case of epochs, records_per_epoch must also
be specified. Below is an example that configures a single searcher to train a model for 64
epochs.
records_per_epoch: 50000
searcher:
  name: single
  metric: validation_error
  max_length:
    epochs: 64
  smaller_is_better: true
The configured records_per_epoch is only used for interpreting configuration fields that are expressed in epochs. Actual epoch boundaries are still determined by the dataset itself (specifically, the end of an epoch occurs when the training data loader runs out of records).
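The same unit syntax applies to the other fields mentioned at the start of this section. The sketch below assumes that min_validation_period and min_checkpoint_period are set at the top level of the configuration file; the numeric values are illustrative.

```yaml
records_per_epoch: 50000   # needed because a field below is expressed in epochs
min_validation_period:
  batches: 100             # illustrative value
min_checkpoint_period:
  epochs: 1                # illustrative value
```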
Note
If the amount of data to train a model on is specified using records or epochs and the batch size does not divide evenly into the configured number of inputs, the remaining “partial batch” of data will be dropped (ignored). For example, if an experiment is configured to train a single model on 10 records with a configured batch size of 3, the model will only be trained on 9 records of data. In the corner case that a trial is configured to be trained for less than a single batch of data, a single complete batch will be used instead.
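The arithmetic in this note can be written out as follows, using N for the configured number of records and B for the batch size (both symbols are introduced here only for illustration). The number of batches actually trained is

```latex
\text{batches} = \max\!\left(1, \left\lfloor \frac{N}{B} \right\rfloor\right),
\qquad
\text{records used} = \text{batches} \cdot B .
```

For N = 10 and B = 3 this gives 3 batches and 9 records, matching the example above. For N = 2 and B = 3 it gives 1 batch, the corner case in which a single complete batch is used.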
Caveats¶
In most cases, a value expressed using one type of training unit can be converted to a different type of training unit with identical behavior, with a few caveats:
Because training units must be positive integers, converting between quantities of different types is not always possible. For example, converting 50 records into batches is not possible if the batch size is 64.
When doing a hyperparameter search over a range of values for global_batch_size, the specified batches cannot be converted to a fixed number of records or epochs and hence cause different behaviors in different trials of the search.
When using adaptive_asha, a single training unit is treated as atomic (unable to be divided into fractional parts) when dividing max_length into the series of rounds (or rungs) by which we early-stop underperforming trials. This rounding may result in unexpected behavior when configuring max_length in terms of a small number of large epochs or batches.
To verify your search is working as intended before committing to a full run, you can use the CLI’s “preview search” feature:
det preview-search <configuration.yaml>