Generating CMORised data with CDDS for CMIP6 / CMIP6 Plus simulations using the CDDS Workflow
See also guidance for adhoc generation of CMORised data.
Tip
Use <script> -h or <script> --help to print information about the script, including available parameters.
Example
A simulation for the pre-industrial control from UKESM will be used as an example in these instructions.
Prerequisites
Before running the CDDS Operational Procedure, please ensure that:
-
you own a CDDS operational simulation ticket (see the list of CDDS operational simulation tickets) that will monitor the processing of a CMIP6 / CMIP6 Plus simulation using CDDS.
-
you belong to the cdds group.
Tip
Type groups on the command line to print the groups a user is in.
-
you have write permissions to moose:/adhoc/projects/cdds/ on MASS.
Tip
You can check whether you have the correct permissions by running the following command and checking that your moose username is included in the access control list output:
moo getacl moose:/adhoc/projects/cdds
-
you use a bash shell. CDDS uses Conda, which can experience problems running in a shell other than bash.
Tip
You can check which shell you use by running the following command:
echo $SHELL
If the result is not /bin/bash, you can switch to a bash shell by running:
/bin/bash
If any of the above are not true please contact the CDDS Team for guidance.
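The group and shell checks above can be sketched as a small pre-flight script. This is only an illustration: the function names has_group, is_bash and preflight are hypothetical and are not part of CDDS, and the MASS permission check is left out because it needs the moo client.

```shell
# Hypothetical pre-flight helpers (not part of CDDS); POSIX sh compatible.

# Succeeds if the space-separated group list $1 contains the group $2.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the shell path $1 points at bash (e.g. /bin/bash).
is_bash() {
    case "$1" in
        */bash|bash) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a warning for each prerequisite the current user does not meet.
# `id -nG` prints the same group names as `groups`.
preflight() {
    has_group "$(id -nG)" cdds || echo "WARNING: not in the cdds group"
    is_bash "${SHELL:-}" || echo "WARNING: login shell is not bash: ${SHELL:-unset}"
}
```

Running preflight prints nothing when both checks pass.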
Packages
CDDS is designed to handle a "package" of simulation data at one time; a set of variables from a particular simulation run. Multiple "packages" can be run for a given simulation to add new or corrected variables to the archive. Each package should be run using a separate processing (proc) and data directory. The simplest way to separate two run-throughs of CDDS is to use a different package name. This is set either when running the write_request script below or by modifying the request configuration itself.
Partial processing of a simulation
In certain circumstances it may be desirable to process and submit a subset of an entire simulation, e.g. the first 250 years of the esm-piControl simulation. Please contact the CDDS Team to discuss this prior to starting processing to:
- Get appropriate guidance on the steps needed to correctly construct the requested variables file in CDDS Prepare
- Arrange for an appropriate Errata to be issued following submission of data sets.
What to do when things go wrong
On occasion issues will arise with tasks performed by users of CDDS and these will trigger CRITICAL error messages in the logs and usually require user intervention. Many simple issues (MASS/MOOSE or file system problems) can be resolved by re-triggering tasks. When you take any action please ensure that you update your CDDS operational simulation ticket, and if support is needed contact the CDDS Team.
Set up the CDDS operational simulation ticket
- Select start work on the CDDS operational simulation ticket (so that the status is in_progress) to indicate that work is starting.
Activate the CDDS install
-
Set up the environment to use the central installation of CDDS and its dependencies:
source ~cdds/bin/setup_env_for_cdds <cdds_version>
where <cdds_version> is the version of CDDS you wish to use, e.g. 3.0.0. Unless instructed otherwise you should use the most recent version of CDDS available (to ensure that all bugfixes are picked up), and this version should be used in all stages of the package being processed. If in doubt contact the CDDS team for advice.
-
Ticket: Record the version of CDDS being used on the CDDS operational simulation ticket.
-
Set up the environment to use the central installation of CDDS and its dependencies:
source ~cdds/bin/setup_env_for_cdds <cdds_version>
where <cdds_version> is the version of CDDS you wish to use, e.g. 3.3.0. Unless instructed otherwise you should use the most recent version of CDDS available (to ensure that all bugfixes are picked up), and this version should be used in all stages of the package being processed. If in doubt contact the CDDS team for advice.
-
Ticket: Record the version of CDDS being used on the CDDS operational simulation ticket.
Note
- The available version numbers for this script can be found here
- If you wish to deactivate the CDDS environment then you can use the command conda deactivate.
Create the request configuration file
The request configuration file is constructed from information in the rose-suite.info files within each workflow.
Important
If the rose-suite.info file contains incorrect information, this will be propagated through CDDS. As such it is critically important that the information in these files is correct.
To construct the request configuration file take the following steps:
-
Set up a working directory:
mkdir cdds-example-1
cd cdds-example-1
export WORKING_DIR=`pwd`
Add the location of your working directory to the CDDS operational simulation ticket.
-
Collect the required information on the rose workflow for the simulation:
- workflow id, e.g. u-aw310
- branch, e.g. cdds
- revision
Info
You can find the revision of the workflow branch for a CMIP6 workflow by using the following command:
rosie lookup --prefix=u --query project eq u-cmip6 and id eq u-aw310 and branch eq cdds
Create the request configuration file:
write_request <workflow id> <branch> <revision> <package name> [<list of streams>] -c <path to proc dir> -t <path to data dir>
Example
Create a request configuration file for the rose suite u-aw310, branch cdds and package round-20:
write_request u-aw310 cdds 115492 round-20 ap4 ap5 ap6 onm inm \
    -c /project/cdds/proc -t /project/cdds_data
Note
Be careful when re-running CDDS using the same request configuration file: pre-existing data will cause problems for the extraction tasks and pre-existing logs in the proc directory may cause issues when diagnosing problems. If in doubt use a different package name in the request configuration file.
Tip
If necessary the start and end dates for processing can be overridden using the --start_date and --end_date arguments. Please consult with the CDDS Team if you believe this is necessary.
Info
The log file and request configuration file are written to the current working directory.
Prepare a list of variables to process
Warning
This method does not refer to the Data Request or CDDS inventory database (to check which Datasets have been previously produced), so care should be taken with the choice of variables.
-
Create a text file with the list of variables, or copy and modify an existing list. Each line in the file should have the form
<mip table>/<variable name>:<stream>
Example
To process the variable tas for the MIP table Amon when processing the ap5 stream:
Amon/tas:ap5
-
Set the value variable_list_file in the request configuration to the path of the created variable file.
Note
If you are using a workflow with the CMIP6 STASH set up then you can add the default stream to a list of variables using the command
stream_mappings --varfile <filename without streams> --outfile <new file with streams>
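Because the variable list is not checked against the Data Request, it can help to check the line format before starting. The following is a minimal sketch, assuming every non-empty line must match <mip table>/<variable name>:<stream> exactly; the function names are hypothetical and are not part of CDDS.

```shell
# Hypothetical format check for a variable list file (not part of CDDS).

# Succeeds if $1 has the form <mip table>/<variable name>:<stream>.
is_valid_variable_line() {
    case "$1" in
        */*/*|*:*:*) return 1 ;;   # more than one "/" or ":"
        *" "*) return 1 ;;         # embedded space
        ?*/?*:?*) return 0 ;;      # non-empty table, variable and stream
        *) return 1 ;;
    esac
}

# Print each malformed non-empty line of file $1; succeed only if none is found.
check_variable_file() {
    bad=0
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue
        if ! is_valid_variable_line "$line"; then
            echo "malformed: $line"
            bad=1
        fi
    done < "$1"
    return "$bad"
}
```

For example, check_variable_file variables.txt accepts a line such as Amon/tas:ap5 and reports any line that has no stream.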
Configure request configuration
Important
The request.cfg file contains all information that is needed to process the data through CDDS. Creating the file does not set all values, so it must be adjusted manually.
You need to adjust your request.cfg:
-
Open the request.cfg in a text editor, e.g. vi or gedit.
-
The following values need to be set manually:
Value | Description |
---|---|
variable_list_file | Path to your variables file |
output_mass_root | Path to the MOOSE location where the data should be archived; it starts with moose: |
output_mass_suffix | Sub-directory in MASS to use when moving data |
Note
Please check the other values as well and adjust them as needed. For any help, please contact the CDDS Team.
Info
The MIP era (CMIP6 or CMIP6 Plus) you are using is defined in the value mip_era of the metadata section.
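A quick check that the manually set values are present can be sketched as follows. It assumes that request.cfg is an INI-style file with [section] headers and key = value lines, which is an assumption here rather than something this guide states, and it ignores which section a key lives in. The functions cfg_get and check_request_cfg are hypothetical helpers, not CDDS commands.

```shell
# Hypothetical helpers (not part of CDDS); assumes "key = value" lines.

# Print the value of key $2 from file $1, ignoring section headers.
cfg_get() {
    awk -F= -v key="$2" '
        /^[ \t]*(#|;|\[)/ { next }   # skip comments and [section] lines
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
                exit
            }
        }' "$1"
}

# Report required values that are empty and check the MASS root prefix.
check_request_cfg() {
    status=0
    for key in variable_list_file output_mass_root output_mass_suffix; do
        if [ -z "$(cfg_get "$1" "$key")" ]; then
            echo "not set: $key"
            status=1
        fi
    done
    case "$(cfg_get "$1" output_mass_root)" in
        moose:*|"") ;;
        *) echo "output_mass_root should start with moose:"; status=1 ;;
    esac
    return "$status"
}
```

Running check_request_cfg request.cfg prints one line per problem and succeeds only when all three values are set.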
Checkout and configure the CDDS workflow
-
Run the following command after replacing the values within <>:
checkout_processing_workflow <name for processing workflow> \
    <path to request configuration> \
    --workflow_destination .
Example
Check out the CDDS processing workflow with the name my-cdds-test and the request file location /home/foo/cdds-example-1/request.cfg:
checkout_processing_workflow my-cdds-test \
    /home/foo/cdds-example-1/request.cfg \
    --workflow_destination .
Info
A directory containing a rose workflow will be placed in a subdirectory under the location specified in --workflow_destination. If this is not specified, it will be checked out under ~/roses/.
-
This step is optional: set some useful environment variables to access the CDDS directories:
export CDDS_PROC_DIR=/<root_proc_dir>/<mip_era>/<mip>/<model_id>_<experiment_id>_<variant_label>/<package>/
export CDDS_DATA_DIR=/<root_data_dir>/<mip_era>/<mip>/<model_id>/<experiment_id>/<variant_label>/<package>/
ls $CDDS_PROC_DIR
ls $CDDS_DATA_DIR
where you must replace all values within <>. The root_proc_dir and root_data_dir are the values that have been specified in the request configuration.
Example
Assume:
- Path to the root proc directory is /home/foo/cdds-example-1/proc.
- Path to the root data directory is /home/foo/cdds-example-1/data.
- MIP era is CMIP6 and MIP is CMIP.
- Model ID is UKESM1-0-LL for experiment piControl with variant label r1i1p1f2 and package round-1.
Then the commands to set the environment variables are:
export CDDS_PROC_DIR=/home/foo/cdds-example-1/proc/CMIP6/CMIP/UKESM1-0-LL_piControl_r1i1p1f2/round-1/
export CDDS_DATA_DIR=/home/foo/cdds-example-1/data/CMIP6/CMIP/UKESM1-0-LL/piControl/r1i1p1f2/round-1/
-
Run the workflow:
cd <name for processing workflow>
cylc vip .
Example
If the name of the processing workflow is my-cdds-test, then run:
cd my-cdds-test
cylc vip .
Info
Cylc 8 is used for running the processing workflow. You can select it by running the following command before running the workflow:
export CYLC_VERSION=8
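The proc and data directory layouts used for the optional environment variables differ only in how the model, experiment and variant label are joined, which makes them easy to mix up when typed by hand. A minimal sketch that builds both paths from their components is shown below; cdds_proc_dir and cdds_data_dir are hypothetical helper names, not CDDS commands.

```shell
# Hypothetical helpers (not part of CDDS). Arguments, in order:
#   root_dir mip_era mip model_id experiment_id variant_label package

# Proc layout: model, experiment and variant label are joined with "_".
cdds_proc_dir() {
    printf '%s/%s/%s/%s_%s_%s/%s/\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Data layout: model, experiment and variant label are separate directories.
cdds_data_dir() {
    printf '%s/%s/%s/%s/%s/%s/%s/\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}
```

With the example values used in this section, the variables could then be set with:
export CDDS_PROC_DIR=$(cdds_proc_dir /home/foo/cdds-example-1/proc CMIP6 CMIP UKESM1-0-LL piControl r1i1p1f2 round-1)
export CDDS_DATA_DIR=$(cdds_data_dir /home/foo/cdds-example-1/data CMIP6 CMIP UKESM1-0-LL piControl r1i1p1f2 round-1)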
Monitor conversion workflows
For each stream a CDDS Convert workflow will be triggered by the processing workflow. Each of the workflows launched by CDDS Convert requires monitoring. This can be done using the command line tool cylc gui to obtain a window with an updating summary of workflow progress, or equivalently the Cylc Review online tools.
Conversion workflows will usually be named cdds_<workflow_base_name>_<stream> and each stream will run completely independently.
If a workflow has issues, due to task failure, it will stall, and you will receive an e-mail.
If you hit issues or are unsure how to proceed update the CDDS operational simulation ticket for your package with anything you believe is relevant (include the location of your working directory) and contact the CDDS Team for advice.
The conversion workflows run the following steps:
-
run_extract_<stream>
Extract
- Runs CDDS Extract for this stream.
- Runs in the long queue with a wall time of 2 days.
- If there are any issues with extracting data, they will be reported in the job.out log file in the workflow and in the $CDDS_PROC_DIR/extract/log/cdds_extract_<stream>_<date stamp>.log log file, and the task will fail.
- The extraction task will automatically resubmit 4 times if it fails; after that, manual intervention is required to proceed.
- Most issues are related to either MASS (i.e. moo commands failing), file system anomalies (failure to create files or directories) or running out of time.
- Identify issues either by searching for "CRITICAL" in the job.out logs in Cylc Review or by using
grep CRITICAL $CDDS_PROC_DIR/extract/log/cdds_extract_<stream>_<date stamp>.log
- If the issue appears to be due to MASS, you can re-run the failed CDDS Extract job by re-triggering the run_extract_<stream> task via the cylc gui or via the cylc command line tools:
cylc trigger cdds_<workflow_base_name>_<stream> run_extract_<stream>:failed
- If in doubt update your CDDS operational simulation ticket and contact the CDDS Team for advice.
-
validate_extract_<stream>
Extract Validation
Validation of the output is now performed as a separate task from extracting it. This task will report missing or unexpected files and unreadable netCDF files.
-
setup_output_dir_<stream>
Setup Output Directory
This task will create output directories for conversion output.
-
mip_convert_<stream>_<grid group>
MIP Convert
- Runs MIP Convert to produce output files for a small time window for this simulation.
- Will retry up to 3 times before the workflow stalls.
- CRITICAL issues are appended to $CDDS_PROC_DIR/convert/log/critical_issues.log. These will likely need user action to correct, so update your CDDS operational simulation ticket and contact the CDDS Team for advice.
- The CRITICAL log file will not exist if there are no critical issues.
- A variant named mip_convert_first_<stream>_<grid group> may be launched to align the cycling dates with the concatenation processing.
-
finaliser_<stream>
MIP Convert Finaliser
This ensures that Concatenation Tasks are launched once all MIP Convert tasks have been successfully performed for a particular time range. This step should never fail.
Note
If this task fails, the reason is that the adjustment of the memory and time limits failed. So, please resubmit the task.
-
organise_files_<stream>
Organise Files
- Re-arranges the output files on disk from a directory structure created by the MIP Convert tasks of the form
$CDDS_DATA_DIR/output/<stream>_mip_convert/<YYYY-MM-DD>/<grid>/<files>
to
$CDDS_DATA_DIR/output/<stream>_concat/<MIP table>/<variable name>/<files>
ready for concatenation.
- A variation named organise_files_final_<stream> does the same thing but at the end of the conversion process.
-
mip_concatenate_setup_<stream>
MIP Concatenate Setup
- This step constructs a list of concatenation jobs that must be performed
-
mip_concatenate_batch_<stream>
MIP Concatenate Batch
- Performs the concatenation commands (ncrcat) required to join small files together.
- Runs in the long queue with a wall time of 2 days and can retry up to 3 times before the workflow stalls (failures are usually due to running out of time while performing a concatenation).
- Only one mip_concatenate_batch_<stream> task can run at one time.
- Issues can be identified using:
grep CRITICAL $CDDS_PROC_DIR/convert/log/mip_concatenate_*.log
If any critical issues arise or tasks fail, update your CDDS operational simulation ticket and contact the CDDS Team for advice.
- Output data is written to $CDDS_DATA_DIR/output/<stream>/<MIP table>/<variable name>/<files>
-
run_qc_<stream>
Quality Check (QC)
- Run the QC process on output data for this stream
- Produces a report at:
$CDDS_PROC_DIR/qualitycheck/report_<stream>_<datestamp>.json
and a list of variables which pass the quality checks at:
$CDDS_PROC_DIR/qualitycheck/approved_variables_<stream>_<datestamp>.txt
and a log file at:
$CDDS_PROC_DIR/qualitycheck/log/qc_run_and_report_<stream>_<datestamp>.log
- The approved variables file will have one line per successfully produced Dataset of the form:
<MIP table>/<variable name>;<Directory containing files>
- This task will fail if any QC issues are found and will not resubmit. If this occurs please update your CDDS operational simulation ticket and contact the CDDS Team for advice.
-
run_transfer_<stream>
Transfer
- Archive data for variables that are marked active in the requested variables file produced by CDDS Prepare and have successfully passed the QC checks, i.e. are listed in the approved variables file.
- Will not automatically retry, even if failure was due to MASS/MOOSE issues.
- The location in MASS to which these data are archived is determined by the output_mass_suffix argument specified in the request configuration file.
- The task will fail if
- There are MASS issues. For example, if the following command returns anything, there has been a MASS outage and you can re-trigger the task:
grep SSC_STORAGE_SYSTEM_UNAVAILABLE $CDDS_PROC_DIR/archive/log/cdds_store_<stream>_<date stamp>.log
- An attempt is made to archive data that already exists in MASS. If this occurs please update your CDDS operational simulation ticket and contact the CDDS Team for advice.
VERY IMPORTANT
Do not delete data from MASS without consultation with Matt Mizielinski.
-
completion_<stream>
Completion
This is a dummy task that is the last thing to run in the workflow. It allows inter-workflow dependencies by letting the CDDS workflow monitor whether each per-stream workflow has completed.
If all goes well the workflow will complete, and you will receive an email confirming that the workflow has shut down, containing content of the form:
Message: AUTOMATIC
See: http://fcm1/cylc-review/taskjobs/<user id>/<workflow name>
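The per-task grep commands listed in the steps above can be combined into a single scan of the proc directory. This is a minimal sketch, assuming all relevant logs end in .log and live somewhere under $CDDS_PROC_DIR; critical_logs and no_critical_issues are hypothetical helper names, not CDDS commands.

```shell
# Hypothetical helpers (not part of CDDS).

# Print every *.log file under directory $1 that contains "CRITICAL".
# grep exits non-zero when nothing matches, so the status is ignored here.
critical_logs() {
    find "$1" -type f -name '*.log' -exec grep -l CRITICAL {} + || true
}

# Succeeds only if no log under $1 reports a CRITICAL message.
no_critical_issues() {
    [ -z "$(critical_logs "$1")" ]
}
```

For example, critical_logs "$CDDS_PROC_DIR" lists the log files that need attention before you update the ticket.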
Prepare CDDS operational simulation ticket for review & submission
Once all workflows for a particular package have completed update your CDDS operational simulation ticket confirming that the Extract, Convert, QC and Transfer tasks have been completed.
Note
You can check whether workflows have completed by using the command cylc gscan or the cylc review tool.
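When confirming on the ticket that QC has completed, it can be useful to list what was approved. The sketch below reads an approved variables file, assuming the <MIP table>/<variable name>;<Directory containing files> line format described for the QC task; list_approved is a hypothetical helper name, not a CDDS command.

```shell
# Hypothetical helper (not part of CDDS).

# Print "<MIP table> <variable name> <directory>" for each non-empty
# line of the approved variables file $1.
list_approved() {
    while IFS=';' read -r dataset directory || [ -n "$dataset" ]; do
        [ -z "$dataset" ] && continue
        printf '%s %s %s\n' "${dataset%%/*}" "${dataset#*/}" "$directory"
    done < "$1"
}
```

Piping the output through wc -l gives the number of approved Datasets for a stream.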
-
Copy the request JSON file and any logs to $CDDS_PROC_DIR:
cp request.json *.log $CDDS_PROC_DIR/
-
Add a comment to the CDDS operational simulation ticket specifying the archived data is ready for submission, and include the full path to your request configuration location.
-
Select assign for review to on the CDDS operational simulation ticket (so that the status is reviewing) and assign the CDDS operational simulation ticket to Matthew Mizielinski by selecting this name from the list.
-
The ticket will then be reviewed according to the CDDS simulation review procedure by members of the CDDS team.
Info
The review script used by the CDDS team involves running the following command
cdds_sim_review <path to the request configuration>
Run CDDS Teardown
- Once the approved ticket has been returned to you following submission, delete the contents of the data directory:
cd <path to the data directory>
rm -rf input output
-
Delete all workflows used:
cdds_clean <path to the request configuration>
-
Update and close the CDDS operational simulation ticket
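Because the teardown uses rm -rf, a small guard against running it in the wrong place may be worthwhile. The following is a sketch only: teardown_data_dir is a hypothetical helper, not a CDDS command, and it removes just the input and output subdirectories of the directory it is given.

```shell
# Hypothetical helper (not part of CDDS).

# Remove only the input and output subdirectories of data directory $1.
# Refuses to run when $1 is empty or is not an existing directory.
teardown_data_dir() {
    if [ -z "$1" ] || [ ! -d "$1" ]; then
        echo "not a directory: '$1'" >&2
        return 1
    fi
    rm -rf -- "$1/input" "$1/output"
}
```

For example, teardown_data_dir "$CDDS_DATA_DIR" leaves any other contents of the package data directory untouched.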