Generating CMORised data with CDDS for GCModelDev simulations using the CDDS Workflow
See also guidance for CMIP6 / CMIP6 Plus generation of CMORised data.
Tip
Use <script> -h or <script> --help to print information about the script, including available parameters.
Example
A simulation for the pre-industrial control from HadGEM3 will be used as an example in these instructions.
Note
The procedure below assumes that you are keeping track of progress using a CDDS progress ticket. For exercises such as CMIP6 this was managed centrally, but as GCModelDev is intended to support ad-hoc processing, we recommend that you keep some form of progress note. You are welcome to use the CDDS Trac for this purpose if you wish.
Prerequisites
Before running the CDDS Operational Procedure, please ensure that:
-
you have a project framework to work within (see next section)
-
you use a bash shell. CDDS uses Conda which can experience problems running in a shell other than bash.
Tip
You can check which shell you use with the following command:
echo $SHELL
If the result is not /bin/bash, you can switch to a bash shell by running:
/bin/bash
If any of the above are not true please contact the CDDS Team for guidance.
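As a minimal sketch, the shell prerequisite can be checked with a small function before starting. The name is_bash_shell is hypothetical and not part of CDDS; it only inspects the path passed to it (normally the value of $SHELL).

```shell
# Hypothetical helper (not part of CDDS): succeeds if the given shell path is bash.
is_bash_shell() {
    # ${1##*/} strips the directory part, e.g. /bin/bash -> bash
    [ "${1##*/}" = "bash" ]
}

# Usage: warn if the login shell is not bash.
if ! is_bash_shell "${SHELL:-unknown}"; then
    echo "Login shell is ${SHELL:-unknown}, not bash; run /bin/bash before using CDDS." >&2
fi
```

Note that $SHELL reports the login shell, so this check will still warn after you have started /bin/bash by hand from another shell.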
What to do when things go wrong
On occasion, issues will arise with tasks performed by users of CDDS. These will trigger CRITICAL error messages in the logs and usually require user intervention. Many simple issues (MASS/MOOSE or file system problems) can be resolved by re-triggering tasks. Support is available via the CDDS Team.
The project framework
The tools we have were written primarily for CMIP6 to CMORise HadGEM3 and UKESM1 model output, but can be applied to other projects provided an appropriate set of variable and metadata definitions is available. Defining a new project, e.g. CMIP6, requires a reasonable amount of information, as does adding an entirely new model configuration; the CDDS team should be involved in discussions if you need either. However, it is straightforward to use an existing project and include new activities and experiments.
When run in relaxed mode CDDS will allow you to use any value for the mip (activity id) and experiment_id. We have a general purpose project called GCModelDev for users interested in CMORising data for ad-hoc use. It takes the CMIP6 variable definitions and standards, and we can add new variables to it as required. This is not intended for preparing data for immediate publication to locations such as ESGF, but can be used for analysis alongside CMIP6 data and for feeding into tools that base themselves on the same data structure/standards.
Activate the CDDS install
-
Set up the environment to use the central installation of CDDS and its dependencies:
source ~cdds/bin/setup_env_for_cdds <cdds_version>
where <cdds_version> is the version of CDDS you wish to use, e.g. 3.0.0. Unless instructed otherwise you should use the most recent version of CDDS available (to ensure that all bugfixes are picked up), and this version should be used in all stages of the package being processed. If in doubt contact the CDDS team for advice.
-
Ticket: Record the version of CDDS being used on the CDDS progress ticket.
Note
- The available version numbers for this script can be found here (MetOffice github access required)
- If you wish to deactivate the CDDS environment then you can use the command conda deactivate.
Create the request configuration file
Info
Please also see the documentation of the request configuration.
A request configuration file contains a number of fields that guide what CDDS processing does and can be viewed as a "control" file with a reasonable number of arguments. The simplest approach is to copy an existing file and edit certain fields.
Examples:
- GCModelDev HadGEM3-GC31-LL
- GCModelDev HadGEM3-GC31-LL using ens class
- GCModelDev HadGEM3-GC31-MM
- GCModelDev UKESM1-0-LL
These request files should be suitable for use both within the Met Office and on JASMIN.
If you are working with a particular model, then to set up a new CDDS processing "package" you would need to alter the experiment_id and/or variant_label fields, possibly the mip, and the workflow_id, along with the set of streams.
Important
- Check that the mode in the common section of the request configuration file is set to relaxed. In relaxed mode CDDS will allow you to use any value for the mip (activity id) and experiment_id.
- Check that the mip_era in the metadata section of the request configuration file is set to GCModelDev.
You also need to set the following values manually:
Value | Description |
---|---|
root_proc_dir | Path to the CDDS proc directory |
root_data_dir | Path to the CDDS data directory |
output_mass_root | Path to the MOOSE location where the data should be archived (starts with moose:) |
output_mass_suffix | Sub-directory in MASS to use when moving data |
Info
The CDDS data directory is the directory where the model output files are written to. The CDDS proc directory is the directory where all the non-data outputs are written to, like the log files.
Note
Please check the other values as well and adjust them as needed. For any help, please contact the CDDS Team.
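The checks described in this section can be sketched as a small pre-flight script. This is an illustration only: cfg_value and check_request_cfg are hypothetical helpers, not CDDS commands, and the sketch assumes the request file uses plain "key = value" lines. It ignores section headers, so it does not verify that a key sits in the right section.

```shell
# Hypothetical helpers (not part of CDDS). Assumes "key = value" lines;
# section headers such as [common] are ignored in this sketch.
cfg_value() {
    # $1 = key, $2 = file; prints the first value found, trimmed of whitespace
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$2" | head -n 1
}

check_request_cfg() {
    cfg="$1"
    status=0
    [ "$(cfg_value mode "$cfg")" = "relaxed" ] || { echo "mode is not relaxed"; status=1; }
    [ "$(cfg_value mip_era "$cfg")" = "GCModelDev" ] || { echo "mip_era is not GCModelDev"; status=1; }
    # The four values that must be set manually
    for key in root_proc_dir root_data_dir output_mass_root output_mass_suffix; do
        [ -n "$(cfg_value "$key" "$cfg")" ] || { echo "$key is not set"; status=1; }
    done
    # output_mass_root should start with moose:
    case "$(cfg_value output_mass_root "$cfg")" in
        moose:*) ;;
        *) echo "output_mass_root does not start with moose:"; status=1 ;;
    esac
    return $status
}
```

For example, check_request_cfg /home/foo/cdds-example-1/request.cfg prints one line per problem found and returns a non-zero status if any check fails.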
Prepare a list of variables to process
-
Create a text file with the list of variables or copy and modify an existing list. Each line in the file should have the form
<mip table>/<variable name>:<stream>
Example
For example, to process the variable tas for the MIP table Amon when processing the ap5 stream:
Amon/tas:ap5
-
Set the value variable_list_file in the request configuration to the path of the created variable file.
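Before pointing variable_list_file at a new list, it can help to confirm that every line has the expected shape. The sketch below is not a CDDS tool: check_variable_list is a hypothetical helper, and the character classes in the pattern are an assumption about what MIP table, variable and stream names look like, so loosen them if your names differ.

```shell
# Hypothetical helper (not part of CDDS): report lines that do not look like
# <mip table>/<variable name>:<stream>. The allowed characters are an assumption.
check_variable_list() {
    # grep -v selects the non-matching lines; -n prefixes them with line numbers.
    # grep exits 0 when it printed at least one offending line.
    if grep -n -v -E '^[A-Za-z0-9]+/[A-Za-z0-9_]+:[A-Za-z0-9]+$' "$1"; then
        return 1
    fi
    return 0
}
```

Running check_variable_list on a file containing Amon/tas:ap5 prints nothing and succeeds; blank or malformed lines are printed with their line numbers and the function returns a non-zero status.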
Checkout and configure the CDDS workflow
-
Run the following command after replacing values within <>:
checkout_processing_workflow <name for processing workflow> \
  <path to request configuration> \
  --workflow_destination .
Example
Check out the CDDS processing workflow with the name my-cdds-test and the request file location /home/foo/cdds-example-1/request.cfg:
checkout_processing_workflow my-cdds-test \
  /home/foo/cdds-example-1/request.cfg \
  --workflow_destination .
Info
A directory containing a rose workflow will be placed in a subdirectory under the location specified in --workflow_destination. If this is not specified it will be checked out under ~/roses/
-
This step is optional: Set some useful environment variables to access the CDDS directories:
export CDDS_PROC_DIR=/<root_proc_dir>/<mip_era>/<mip>/<model_id>_<experiment_id>_<variant_label>/<package>/
export CDDS_DATA_DIR=/<root_data_dir>/<mip_era>/<mip>/<model_id>/<experiment_id>/<variant_label>/<package>/
ls $CDDS_PROC_DIR
ls $CDDS_DATA_DIR
where you must replace all values within <>. The root_proc_dir and root_data_dir are the values that have been specified in the request configuration.
-
Run the workflow:
cd <name for processing workflow>
cylc vip .
Info
Cylc 8 is used for running the processing workflow. If your default version of cylc is not cylc 8 (run cylc --version to check) you will need to run the following command before running the workflow:
export CYLC_VERSION=8
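The two directory templates in the optional step above differ only in how the model, experiment and variant are joined, which is easy to get wrong by hand. The sketch below builds both paths from their components. The function names are hypothetical, and every value in the usage lines (root paths, mip, package name) is illustrative only; use the values from your own request configuration.

```shell
# Hypothetical helpers (not part of CDDS). Argument order for both:
# root_dir mip_era mip model_id experiment_id variant_label package
# root_dir is passed with its leading slash.
cdds_proc_dir() {
    # proc: <model_id>_<experiment_id>_<variant_label> is a single directory
    printf '%s/%s/%s/%s_%s_%s/%s/\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

cdds_data_dir() {
    # data: model_id, experiment_id and variant_label are separate directories
    printf '%s/%s/%s/%s/%s/%s/%s/\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Illustrative values only.
CDDS_PROC_DIR=$(cdds_proc_dir /path/to/proc GCModelDev CMIP HadGEM3-GC31-LL piControl r1i1p1f1 my-package)
CDDS_DATA_DIR=$(cdds_data_dir /path/to/data GCModelDev CMIP HadGEM3-GC31-LL piControl r1i1p1f1 my-package)
export CDDS_PROC_DIR CDDS_DATA_DIR
echo "$CDDS_PROC_DIR"
echo "$CDDS_DATA_DIR"
```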
Monitor conversion workflows
For each stream a CDDS Convert workflow will be triggered by the processing workflow. Each of the workflows launched by CDDS Convert requires monitoring. This can be done using the command line tool cylc gui to obtain a window with an updating summary of workflow progress, or equivalently the Cylc Review online tools.
Conversion workflows will usually be named cdds_<model id>_<experiment id>_<variant_label>_<stream> and each stream will run completely independently.
If a workflow has issues, due to task failure, it will stall, and you will receive an e-mail.
If you hit issues or are unsure how to proceed update the CDDS progress ticket for your package with anything you believe is relevant (include the location of your working directory) and contact the CDDS Team for advice.
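When several streams are being processed it can be handy to list the workflow names you expect to see while monitoring. The sketch below follows the usual naming pattern given above; conversion_workflow_name is a hypothetical helper, the model, experiment, variant and stream values are illustrative, and actual names may differ since the pattern is only the usual one.

```shell
# Hypothetical helper (not part of CDDS): build the usual conversion workflow name.
conversion_workflow_name() {
    # Args: model_id experiment_id variant_label stream
    printf 'cdds_%s_%s_%s_%s\n' "$1" "$2" "$3" "$4"
}

# Illustrative values: print the expected names for two streams.
for stream in ap4 ap5; do
    conversion_workflow_name HadGEM3-GC31-LL piControl r1i1p1f1 "$stream"
done
```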
The conversion workflows run the following steps
-
run_extract_<stream>
Extract
- Run CDDS Extract for this stream.
- Runs in the long queue with a wall time of 2 days.
- If there are any issues with extracting data, they will be reported in the job.err log file in the workflow and in the $CDDS_PROC_DIR/extract/log/cdds_extract_<stream>_<date stamp>.log log file, and the task will fail.
- The extraction task will automatically resubmit 4 times if it fails, after which manual intervention is required to proceed.
- Most issues are related to either MASS (i.e. moo commands failing), file system anomalies (failure to create files/directories) or running out of time.
- Identify issues either by searching for "CRITICAL" in the job.err logs in Cylc Review or by using
grep CRITICAL $CDDS_PROC_DIR/extract/log/cdds_extract_<stream>_<date stamp>.log
- If the issue appears to be due to MASS issues you can re-run the failed CDDS Extract job by re-triggering the run_extract_<stream> task via the cylc gui or via the cylc command line tools:
cylc trigger cdds_<model id>_<experiment id>_<variant label>_<stream> run_extract_<stream>:failed
- If in doubt update your CDDS progress ticket and contact CDDS Team for advice.
-
validate_extract_<stream>
Extract Validation
Validation of the output is now performed as a separate task from extracting it. This task will report missing or unexpected files and unreadable netCDF files.
-
setup_output_dir_<stream>
Setup Output Directory
This task will create output directories for conversion output.
-
mip_convert_<stream>_<grid group>
MIP Convert
- Run MIP Convert to produce output files for a small time window for this simulation.
- Will retry up to 3 times before workflow stalls.
- CRITICAL issues are appended to $CDDS_PROC_DIR/convert/log/critical_issues.log. These will likely need user action to correct, so update your CDDS progress ticket and contact the CDDS Team for advice.
- The CRITICAL log file will not exist if there are no critical issues.
- A variant named mip_convert_first_<stream>_<grid group> may be launched to align the cycling dates with the concatenation processing.
-
finaliser_<stream>
MIP Convert Finaliser
This ensures that Concatenation Tasks are launched once all MIP Convert tasks have been successfully performed for a particular time range. This step should never fail.
-
organise_files_<stream>
Organise Files
- Re-arranges the output files on disk from a directory structure created by the MIP Convert tasks of the form
$CDDS_DATA_DIR/output/<stream>_mip_convert/<YYYY-MM-DD>/<grid>/<files>
to
$CDDS_DATA_DIR/output/<stream>_concat/<MIP table>/<variable name>/<files>
ready for concatenation.
- A variation named organise_files_final_<stream> does the same thing but at the end of the conversion process.
-
mip_concatenate_setup_<stream>
MIP Concatenate Setup
- This step constructs a list of concatenation jobs that must be performed
-
mip_concatenate_batch_<stream>
MIP Concatenate Batch
- Perform the concatenation commands (ncrcat) required to join small files together.
- Runs in the long queue with a wall time of 2 days and can retry up to 3 times before the workflow stalls (failures are usually due to running out of time while performing a concatenation).
- Only one mip_concatenate_batch_<stream> task can run at one time.
- Issues can be identified using:
grep CRITICAL $CDDS_PROC_DIR/convert/log/mip_concatenate_*.log
If any critical issues arise or tasks fail, update your CDDS progress ticket and contact the CDDS Team for advice.
- Output data is written to $CDDS_DATA_DIR/output/<stream>/<MIP table>/<variable name>/<files>
-
run_qc_<stream>
Quality Check (QC)
- Run the QC process on output data for this stream
- Produces a report at:
$CDDS_PROC_DIR/qualitycheck/report_<stream>_<datestamp>.json
a list of variables which pass the quality checks at:
$CDDS_PROC_DIR/qualitycheck/approved_variables_<stream>_<datestamp>.txt
and a log file at:
$CDDS_PROC_DIR/qualitycheck/log/qc_run_and_report_<stream>_<datestamp>.log
- The approved variables file will have one line per successfully produced Dataset of the form:
<MIP table>/<variable name>;<Directory containing files>
- This task will fail if any QC issues are found and will not resubmit. If this occurs please update your CDDS progress ticket and contact the CDDS Team for advice.
-
run_transfer_<stream>
Transfer
- Archive data for variables that are marked active in the requested variables file produced by CDDS Prepare and have successfully passed the QC checks, i.e. are listed in the approved variables file.
- Will not automatically retry, even if failure was due to MASS/MOOSE issues.
- The location in MASS to which these data are archived is determined by the output_mass_suffix argument specified in the request configuration file.
- Task will fail if:
- There are MASS issues: For example if the following command returns anything there has been a MASS outage and you can re-trigger the task:
grep SSC_STORAGE_SYSTEM_UNAVAILABLE $CDDS_PROC_DIR/archive/log/cdds_store_<stream>_<date stamp>.log
- An attempt is made to archive data that already exists in MASS. If this occurs please update your CDDS progress ticket and contact the CDDS Team for advice.
VERY IMPORTANT
Do not delete data from MASS without consultation with Matt Mizielinski.
-
completion_<stream>
Completion
This is a dummy task that is the last thing to run in the workflow -- this is to allow inter workflow dependencies by allowing the
CDDS workflow
to monitor whether each per stream workflow has completed.
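Several of the steps above suggest searching individual log files for CRITICAL. As a sketch, the same searches can be combined into one function that scans the extract and convert logs under a given proc directory. find_critical_messages is a hypothetical helper, not a CDDS command; it uses only the log locations quoted in the steps above and skips any that do not exist.

```shell
# Hypothetical helper (not part of CDDS): print CRITICAL lines from the extract,
# MIP Convert and concatenation logs. Returns 0 if any were found, 1 otherwise.
find_critical_messages() {
    proc_dir="$1"   # normally the value of $CDDS_PROC_DIR
    status=1        # 1 = nothing found yet
    for log in "$proc_dir"/extract/log/cdds_extract_*.log \
               "$proc_dir"/convert/log/critical_issues.log \
               "$proc_dir"/convert/log/mip_concatenate_*.log; do
        # Unmatched globs stay as literal patterns, so skip anything that is not a file
        [ -f "$log" ] || continue
        # -H prefixes each match with the file name
        if grep -H CRITICAL "$log"; then
            status=0
        fi
    done
    return $status
}
```

For example, find_critical_messages "$CDDS_PROC_DIR" prints each matching line prefixed with the log file it came from. The job.err files held in the workflow run directories are not covered by this sketch.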
If all goes well the workflow will complete, and you will receive an email confirming that the workflow has shut down, containing content of the form:
Message: AUTOMATIC
See: http://fcm1/cylc-review/taskjobs/<user id>/<workflow name>
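Once a stream has completed, the approved variables file written by the QC step records what was produced. The sketch below reads that file using the line format described in the QC step (<MIP table>/<variable name>;<Directory containing files>). Both function names are hypothetical and not part of CDDS.

```shell
# Hypothetical helpers (not part of CDDS) for reading an
# approved_variables_<stream>_<datestamp>.txt file.
list_approved_variables() {
    # $1 = approved variables file; print only "<MIP table>/<variable name>"
    cut -d ';' -f 1 "$1"
}

approved_directory() {
    # $1 = approved variables file, $2 = "<MIP table>/<variable name>"
    # Print the directory recorded for that dataset (nothing if it is not listed).
    awk -F ';' -v key="$2" '$1 == key { print $2 }' "$1"
}
```

For example, approved_directory <approved variables file> Amon/tas prints the directory holding the tas files for the Amon table, or nothing if that dataset did not pass QC.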