Generating CMORised data with CDDS for CMIP6 / CMIP6 Plus simulations using the CDDS Workflow

See also guidance for adhoc generation of CMORised data.

Tip

Use <script> -h or <script> --help to print information about the script, including available parameters.

Example

A simulation for the pre-industrial control from UKESM will be used as an example in these instructions.

Prerequisites

Before running the CDDS Operational Procedure, please ensure that:

  • you own a CDDS operational simulation ticket (see the list of CDDS operational simulation tickets) that will monitor the processing of a CMIP6 / CMIP6 Plus simulation using CDDS.

  • you belong to the cdds group

    Tip

    Type groups on the command line to print the groups you belong to.

  • you have write permissions to moose:/adhoc/projects/cdds/ on MASS

    Tip

    You can check that you have the correct permissions by running the following command and confirming that your moose username appears in the access control list output:

    moo getacl moose:/adhoc/projects/cdds
    

  • you use a bash shell. CDDS uses Conda which can experience problems running in a shell other than bash.

    Tip

    You can check which shell you are using with the following command:

    echo $SHELL
    
    If the result is not /bin/bash, you can switch to a bash shell by running:
    /bin/bash
    

If any of the above are not true please contact the CDDS Team for guidance.
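
The group and shell checks above can be combined into one small report. This is a minimal sketch, not part of CDDS: the function names has_group and is_bash are illustrative, and the MASS permission check is left out because it needs the moo client.

```shell
#!/bin/bash
# Sketch: report on two of the prerequisites listed above.
# has_group and is_bash are illustrative helper names, not CDDS tools.

# Succeeds if the group named in $1 appears in the space-separated list $2.
has_group() {
    local wanted="$1" groups_list="$2" g
    for g in $groups_list; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# Succeeds if the path in $1 names a bash executable, e.g. /bin/bash.
is_bash() {
    [ "$(basename "$1")" = "bash" ]
}

# Report only; nothing here changes the environment.
if has_group cdds "$(id -Gn)"; then
    echo "OK: member of the cdds group"
else
    echo "WARNING: not a member of the cdds group"
fi

if is_bash "${SHELL:-}"; then
    echo "OK: shell is bash"
else
    echo "WARNING: shell is '${SHELL:-unset}', run /bin/bash first"
fi
```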

Packages

CDDS is designed to handle a "package" of simulation data at one time; a set of variables from a particular simulation run. Multiple "packages" can be run for a given simulation to add new or corrected variables to the archive. Each package should be run using a separate processing (proc) and data directory. The simplest way to separate two run throughs of CDDS is to use a different package name. This is set either when running the write_request script below or by modifying the request configuration itself.

Partial processing of a simulation

In certain circumstances it may be desirable to process and submit a subset of an entire simulation, e.g. the first 250 years of the esm-piControl simulation. Please contact the CDDS Team to discuss this prior to starting processing in order to:

  1. Get appropriate guidance on the steps needed to correctly construct the requested variables file in CDDS Prepare
  2. Arrange for an appropriate Errata to be issued following submission of data sets.

What to do when things go wrong

On occasion issues will arise with tasks performed by users of CDDS and these will trigger CRITICAL error messages in the logs and usually require user intervention. Many simple issues (MASS/MOOSE or file system problems) can be resolved by re-triggering tasks. When you take any action please ensure that you update your CDDS operational simulation ticket and if support is needed contact the CDDS Team.

Set up the CDDS operational simulation ticket

  • Select start work on the CDDS operational simulation ticket (so that the status is in_progress) to indicate that work is starting.

Activate the CDDS install

  1. Set up the environment to use the central installation of CDDS and its dependencies:

    source ~cdds/bin/setup_env_for_cdds <cdds_version>
    
    where <cdds_version> is the version of CDDS you wish to use, e.g. 3.0.0. Unless instructed otherwise you should use the most recent version of CDDS available (to ensure that all bugfixes are picked up), and this version should be used in all stages of the package being processed. If in doubt contact the CDDS team for advice.

  2. Ticket: Record the version of CDDS being used on the CDDS operational simulation ticket.

Note

  • The available version numbers for this script can be found here
  • If you wish to deactivate the CDDS environment then you can use the command conda deactivate.

Create the request configuration file

The request configuration file is constructed from information in the rose-suite.info files within each workflow.

Important

If the rose-suite.info file contains incorrect information, this will be propagated through CDDS. As such it is critically important that the information in these files is correct.

To construct the request configuration file take the following steps

  1. Set up a working directory

    mkdir cdds-example-1
    cd cdds-example-1
    export WORKING_DIR=`pwd`
    
    Add the location of your working directory to the CDDS operational simulation ticket.

  2. Collect required information on the rose workflow for the simulation;

    • workflow id, e.g. u-aw310
    • branch, e.g. cdds
    • revision

    Info

    You can find the revision of the workflow branch for a CMIP6 workflow by using the following command:

    rosie lookup --prefix=u --query project eq u-cmip6 and id eq u-aw310 and branch eq cdds
    

  3. Create the request configuration file;

    write_request <workflow id> <branch> <revision> <package name> [<list of streams>] -c <path to proc dir> -t <path to data dir>
    

    Example

    Create a request configuration file for the rose suite u-aw310, branch cdds and package round-20:

    write_request u-aw310 cdds 115492 round-20 ap4 ap5 ap6 onm inm \
    -c /project/cdds/proc -t /project/cdds_data
    

Note

Be careful when re-running CDDS using the same request configuration file: pre-existing data will cause problems for the extraction tasks and pre-existing logs in the proc directory may cause issues when diagnosing problems. If in doubt use a different package name in the request configuration file.

Tip

If necessary the start and end dates for processing can be overridden using the --start_date and --end_date arguments. Please consult with the CDDS Team if you believe this is necessary.

Info

The log file and request configuration file are written to the current working directory.

Prepare a list of variables to process

Warning

This method does not refer to the Data Request or CDDS inventory database (to check which Datasets have been previously produced), so care should be taken with the choice of variables.

  1. Create a text file with the list of variables or copy and modify an existing list. Each line in the file should have the form

    <mip table>/<variable name>:<stream>
    

    Example

    To process the variable tas from the MIP table Amon using the ap5 stream, add the line:

    Amon/tas:ap5
    

  2. Set variable_list_file in the request configuration to the path of the variable list file you created.
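
Because the list is written by hand, it can be worth checking its format before running CDDS. The sketch below is not a CDDS tool: check_variable_list is an illustrative name, and the characters allowed in each field and the tolerance of empty lines are assumptions.

```shell
# Sketch: print "<line number>:<line>" for every line of the file $1 that is
# not of the form <mip table>/<variable name>:<stream>.
# Returns 0 if all lines are well formed, 1 if any are not, 2 if unreadable.
check_variable_list() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 2
    fi
    # grep -v selects lines that do NOT match; -n prefixes the line number.
    # The trailing "?" lets empty lines through (an assumption).
    if grep -Evn '^([A-Za-z0-9]+/[A-Za-z0-9_-]+:[a-z0-9]+)?$' "$1"; then
        return 1
    fi
    return 0
}

# Usage (the path is a placeholder):
#   check_variable_list "$WORKING_DIR/variables.txt"
```

An entry such as Amon/tas with no stream would be reported with its line number, so it can be fixed before the extraction tasks run.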

Note

If you are using a workflow with the CMIP6 STASH set up then you can add the default stream to a list of variables using the command

stream_mappings --varfile <filename without streams> --outfile <new file with streams>
If you are not using a workflow with the CMIP6 STASH configuration then contact us for advice as this process will need to be performed by hand.

Configure request configuration

Important

The request.cfg file contains all the information needed to process the data through CDDS. Creating the file does not set every value, so it must be adjusted manually.

You need to adjust your request.cfg:

  1. Open request.cfg in a text editor, e.g. vi or gedit.

  2. Set the following values manually:

    Value                 Description
    variable_list_file    Path to your variables file.
    output_mass_root      Path to the moose location where the data should be archived (starts with moose:).
    output_mass_suffix    Sub-directory in MASS to use when moving data.

Note

Please check the other values as well and adjust them as needed. For help, please contact the CDDS Team.
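
A quick way to confirm that the three values above have been filled in is sketched below. It assumes that request.cfg holds plain "key = value" lines, which is an assumption about the file layout. The function name missing_request_values is illustrative and not part of CDDS.

```shell
# Sketch: print the name of each required key that is missing from the file $1
# or has an empty value. Assumes "key = value" lines; other lines are ignored.
missing_request_values() {
    local cfg="$1" key value
    for key in variable_list_file output_mass_root output_mass_suffix; do
        value=$(awk -v wanted="$key" '
            index($0, "=") > 0 {
                name = substr($0, 1, index($0, "=") - 1)
                gsub(/^[ \t]+|[ \t]+$/, "", name)
                if (name == wanted) {
                    val = substr($0, index($0, "=") + 1)
                    gsub(/^[ \t]+|[ \t]+$/, "", val)
                    print val
                    exit
                }
            }
        ' "$cfg")
        if [ -z "$value" ]; then
            echo "$key"
        fi
    done
}

# Usage: no output means all three values are set.
#   missing_request_values request.cfg
```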

Info

The MIP era (CMIP6 or CMIP6 Plus) you are using is defined by the value mip_era in the metadata section.

Checkout and configure the CDDS workflow

  1. Run the following command after replacing values within <>:

    checkout_processing_workflow <name for processing workflow> \
    <path to request configuration> \
    --workflow_destination .
    

    Example

    Checkout the CDDS processing workflow with the name my-cdds-test and the request file location /home/foo/cdds-example-1/request.cfg:

    checkout_processing_workflow my-cdds-test \
    /home/foo/cdds-example-1/request.cfg \
    --workflow_destination .
    

    Info

    A directory containing a rose workflow will be placed in a subdirectory under the location specified in --workflow_destination.
    If this is not specified it will be checked out under ~/roses/

  2. This step is optional: Set some useful environmental variables to access the CDDS directories:

    export CDDS_PROC_DIR=/<root_proc_dir>/<mip_era>/<mip>/<model_id>_<experiment_id>_<variant_label>/<package>/
    export CDDS_DATA_DIR=/<root_data_dir>/<mip_era>/<mip>/<model_id>/<experiment_id>/<variant_label>/<package>/
    ls $CDDS_PROC_DIR
    ls $CDDS_DATA_DIR
    
    where you must replace all values within <>. The root_proc_dir and root_data_dir are the values that have been specified in the request configuration.

    Example

    Assume:

    • Path to the root proc directory is /home/foo/cdds-example-1/proc.
    • Path to the root data directory is /home/foo/cdds-example-1/data.
    • MIP era is CMIP6 and MIP CMIP.
    • Model ID is UKESM1-0-LL for experiment piControl with variant label r1i1p1f2 and package round-1

    Then the command to set the environmental variables is:

    export CDDS_PROC_DIR=/home/foo/cdds-example-1/proc/CMIP6/CMIP/UKESM1-0-LL_piControl_r1i1p1f2/round-1/
    export CDDS_DATA_DIR=/home/foo/cdds-example-1/data/CMIP6/CMIP/UKESM1-0-LL/piControl/r1i1p1f2/round-1/
    

  3. Run the workflow:

    cd <name for processing workflow>
    cylc vip .
    

    Example

    If the name of the processing workflow is my-cdds-test, then run:

    cd my-cdds-test
    cylc vip .
    

Info

Cylc 8 is used for running the processing workflow. You can select it by running the following command before running the workflow:

export CYLC_VERSION=8
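
The proc and data directory paths from the optional step 2 differ only in how the model, experiment and variant label are joined, which is easy to get wrong by hand. The following sketch builds both from their components, following the path templates given in that step. The function names cdds_proc_dir and cdds_data_dir are illustrative and not part of CDDS.

```shell
# Sketch: build the CDDS proc and data directory paths from their components.
# Arguments for both functions:
#   root_dir mip_era mip model_id experiment_id variant_label package

# Proc layout joins model, experiment and variant label with underscores.
cdds_proc_dir() {
    printf '%s/%s/%s/%s_%s_%s/%s/\n' "${1%/}" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Data layout uses one directory level per component.
cdds_data_dir() {
    printf '%s/%s/%s/%s/%s/%s/%s/\n' "${1%/}" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Example values taken from the assumptions listed in step 2.
export CDDS_PROC_DIR=$(cdds_proc_dir /home/foo/cdds-example-1/proc \
    CMIP6 CMIP UKESM1-0-LL piControl r1i1p1f2 round-1)
export CDDS_DATA_DIR=$(cdds_data_dir /home/foo/cdds-example-1/data \
    CMIP6 CMIP UKESM1-0-LL piControl r1i1p1f2 round-1)
echo "$CDDS_PROC_DIR"
echo "$CDDS_DATA_DIR"
```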

Monitor conversion workflows

For each stream a CDDS Convert workflow will be triggered by the processing workflow. Each of the workflows launched by CDDS Convert requires monitoring. This can be done using the command line tool cylc gui to obtain a window with an updating summary of workflow progress, or equivalently with the Cylc Review online tools.

Conversion workflows will usually be named cdds_<workflow_base_name>_<stream> and each stream will run completely independently. If a workflow has issues, due to task failure, it will stall, and you will receive an e-mail.

If you hit issues or are unsure how to proceed update the CDDS operational simulation ticket for your package with anything you believe is relevant (include the location of your working directory) and contact the CDDS Team for advice.
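
When several streams are processed it can help to write down which conversion workflows to expect. The sketch below prints one name per stream, assuming the usual cdds_<workflow_base_name>_<stream> pattern holds. The function name is illustrative, and the base name and streams in the usage line are placeholders.

```shell
# Sketch: print the expected conversion workflow name for each stream.
# $1 is the workflow base name; the remaining arguments are stream names.
conversion_workflow_names() {
    local base="$1" stream
    shift
    for stream in "$@"; do
        echo "cdds_${base}_${stream}"
    done
}

# Usage (placeholder values):
#   conversion_workflow_names <workflow_base_name> ap4 ap5 ap6 onm inm
```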

The conversion workflows run the following steps

  • run_extract_<stream>

    Extract
    • Run CDDS Extract for this stream.
    • Runs in long queue with a wall time of 2 days.
    • If there are any issues with extracting data they will be reported in the job.out log file in the workflow and the $CDDS_PROC_DIR/extract/log/cdds_extract_<stream>_<date stamp>.log log file and the task will fail.
    • The extraction task will automatically resubmit up to 4 times if it fails; after that, manual intervention is required to proceed.
    • Most issues are related to either MASS (i.e. moo commands failing), file system anomalies (failure to create files /directories) or running out of time.
    • Identify issues either by searching for "CRITICAL" in the job.out logs in Cylc Review or by using
      grep CRITICAL $CDDS_PROC_DIR/extract/log/cdds_extract_<stream>_<date stamp>.log
      
    • If the issue appears to be due to MASS issues you can re-run the failed CDDS Extract job by re-triggering the run_extract_<stream> task via the cylc gui or via the cylc command line tools:
      cylc trigger cdds_<workflow_base_name>_<stream> run_extract_<stream>:failed
      
    • If in doubt update your CDDS operational simulation ticket and contact CDDS Team for advice.
  • validate_extract_<stream>

    Extract Validation

    Validation of the output is now performed as a separate task from extracting it. This task will report missing or unexpected files and unreadable netCDF files.

  • setup_output_dir_<stream>

    Setup Output Directory

    This task will create output directories for conversion output.

  • mip_convert_<stream>_<grid group>

    MIP Convert
    • Run MIP Convert to produce output files for a small time window for this simulation.
    • Will retry up to 3 times before workflow stalls.
    • CRITICAL issues are appended to $CDDS_PROC_DIR/convert/log/critical_issues.log. These will likely need user action to correct, so update your CDDS operational simulation ticket and contact the CDDS Team for advice.
    • The CRITICAL log file will not exist if there are no critical issues.
    • A variant named mip_convert_first_<stream>_<grid group> may be launched to align the cycling dates with the concatenation processing.
  • finaliser_<stream>

    MIP Convert Finaliser

    This ensures that Concatenation Tasks are launched once all MIP Convert tasks have been successfully performed for a particular time range. This step should never fail.

    Note

    If this task fails, the reason is that the adjustment of the memory and time limits failed. So, please resubmit the task.

  • organise_files_<stream>

    Organise Files
    • Re-arranges the output files on disk from a directory structure created by the MIP Convert tasks of the form
      $CDDS_DATA_DIR/output/<stream>_mip_convert/<YYYY-MM-DD>/<grid>/<files>
      
      to
      $CDDS_DATA_DIR/output/<stream>_concat/<MIP table>/<variable name>/<files>
      
    • The files are then ready for concatenation. A variant named organise_files_final_<stream> does the same thing at the end of the conversion process.
  • mip_concatenate_setup_<stream>

    MIP Concatenate Setup
    • This step constructs a list of concatenation jobs that must be performed
  • mip_concatenate_batch_<stream>

    MIP Concatenate Batch
    • Perform the concatenation commands (ncrcat) required to join small files together.
    • Runs in long queue with a wall time of 2 days and can retry up to 3 times before workflow stalls (failures are usually due to running out of time while performing a concatenation).
    • Only one mip_concatenate_batch_<stream> task can run at one time.
    • Issues can be identified using:
      grep CRITICAL $CDDS_PROC_DIR/convert/log/mip_concatenate_*.log
      
      If any critical issues arise or tasks fail update your CDDS operational simulation ticket and contact the CDDS Team for advice.
    • Output data is written to
      $CDDS_DATA_DIR/output/<stream>/<MIP table>/<variable name>/<files>
      
  • run_qc_<stream>

    Quality Check (QC)
    • Run the QC process on output data for this stream
    • Produces a report at:
      $CDDS_PROC_DIR/qualitycheck/report_<stream>_<datestamp>.json
      
      and a list of variables which pass the quality checks at:
      $CDDS_PROC_DIR/qualitycheck/approved_variables_<stream>_<datestamp>.txt
      
      and a log file at:
      $CDDS_PROC_DIR/qualitycheck/log/qc_run_and_report_<stream>_<datestamp>.log
      
    • The approved variables file will have one line per successfully produced Dataset of the form:
      <MIP table>/<variable name>;<Directory containing files>
      
    • This task will fail if any QC issues are found and will not resubmit. If this occurs please update your CDDS operational simulation ticket and contact the CDDS Team for advice.
  • run_transfer_<stream>

    Transfer
    • Archive data for variables that are marked active in the requested variables file produced by CDDS Prepare and have successfully passed the QC checks, i.e. are listed in the approved variables file.
    • Will not automatically retry, even if failure was due to MASS/MOOSE issues.
    • The location in MASS to which these data are archived is determined by the output_mass_suffix argument specified in the request configuration file.
    • The task will fail if:
      • There are MASS issues. For example, if the following command returns anything, there has been a MASS outage and you can re-trigger the task:
        grep SSC_STORAGE_SYSTEM_UNAVAILABLE $CDDS_PROC_DIR/archive/log/cdds_store_<stream>_<date stamp>.log
        
      • An attempt is made to archive data that already exists in MASS. If this occurs please update your CDDS operational simulation ticket and contact the CDDS Team for advice.

    VERY IMPORTANT

    Do not delete data from MASS without consultation with Matt Mizielinski.

  • completion_<stream>

    Completion

    This is a dummy task that is the last thing to run in the workflow -- this is to allow inter workflow dependencies by allowing the CDDS workflow to monitor whether each per stream workflow has completed.
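
The individual grep CRITICAL checks described in the steps above can be gathered into a single pass over the proc directory. This is a sketch rather than a CDDS tool: critical_summary is an illustrative name, and it assumes that all relevant logs sit in <component>/log/ directories directly under $CDDS_PROC_DIR, as the paths quoted above suggest.

```shell
# Sketch: print "<count> <file>" for every log file matching
# <proc dir>/*/log/*.log that contains at least one CRITICAL line.
# Returns 1 if any such file is found, 0 otherwise.
critical_summary() {
    local proc_dir="$1" file count found=0
    for file in "$proc_dir"/*/log/*.log; do
        [ -f "$file" ] || continue   # the glob matched nothing
        # grep -c exits non-zero when the count is 0, hence "|| true"
        count=$(grep -c CRITICAL "$file" || true)
        if [ "$count" -gt 0 ]; then
            echo "$count $file"
            found=1
        fi
    done
    return "$found"
}

# Usage:
#   critical_summary "$CDDS_PROC_DIR" && echo "no CRITICAL messages found"
```

Any file it lists still needs to be read and followed up on the ticket as described above; the summary only shows where to look.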

If all goes well the workflow will complete, and you will receive an email confirming that the workflow has shut down, containing content of the form:

Message: AUTOMATIC
See: http://fcm1/cylc-review/taskjobs/<user id>/<workflow name>

Prepare CDDS operational simulation ticket for review & submission

Once all workflows for a particular package have completed update your CDDS operational simulation ticket confirming that the Extract, Convert, QC and Transfer tasks have been completed.

Note

You can check if workflows have completed by using the command cylc gscan or using the cylc review tool.
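
Before confirming completion on the ticket, it can be useful to see which requested variables for a stream are absent from that stream's approved variables file. The sketch below compares the two files using the line formats described earlier (<mip table>/<variable name>:<stream> and <MIP table>/<variable name>;<directory>). The function name unapproved_variables is illustrative and not part of CDDS.

```shell
# Sketch: print each <mip table>/<variable name> that is requested for a
# stream but not listed in that stream's approved variables file.
# Arguments: variable_list_file stream approved_variables_file
unapproved_variables() {
    awk -F: -v approved="$3" -v stream="$2" '
        BEGIN {
            # Load the approved names, i.e. the text before ";" on each line
            while ((getline line < approved) > 0) {
                split(line, parts, ";")
                ok[parts[1]] = 1
            }
        }
        $2 == stream && !($1 in ok) { print $1 }
    ' "$1"
}

# Usage (paths are placeholders):
#   unapproved_variables variables.txt ap5 \
#       "$CDDS_PROC_DIR"/qualitycheck/approved_variables_ap5_<datestamp>.txt
```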

  • Copy the request configuration file and any logs to $CDDS_PROC_DIR

    cp request.cfg *.log $CDDS_PROC_DIR/
    

  • Add a comment to the CDDS operational simulation ticket specifying the archived data is ready for submission, and include the full path to your request configuration location.

  • Select assign for review to on the CDDS operational simulation ticket (so that the status is reviewing) and assign the CDDS operational simulation ticket to Matthew Mizielinski by selecting this name from the list.

  • The ticket will then be reviewed according to the CDDS simulation review procedure by members of the CDDS team.

Info

The review script used by the CDDS team involves running the following command:

cdds_sim_review <path to the request configuration>
and then checking any CRITICAL issues and following up any other anomalies.

Run CDDS Teardown

  1. Once the approved ticket has been returned to you following submission, delete the contents of the data directory:
    cd <path to the data directory>
    rm -rf input output
    
  2. Delete all workflows used:

    cdds_clean <path to the request configuration>
    

  3. Update and close the CDDS operational simulation ticket
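
Because step 1 runs rm -rf relative to the current directory, a failed cd would delete input and output directories in the wrong place. A more defensive variant is sketched below. The function name clean_data_dir is illustrative and not part of CDDS.

```shell
# Sketch: remove the input and output subdirectories of the data directory $1.
# Refuses to run if the argument is empty or is not an existing directory.
clean_data_dir() {
    local data_dir="${1:-}"
    if [ -z "$data_dir" ] || [ ! -d "$data_dir" ]; then
        echo "refusing to clean: '$data_dir' is not a directory" >&2
        return 1
    fi
    # Explicit paths mean rm never acts relative to the current directory
    rm -rf -- "$data_dir/input" "$data_dir/output"
}

# Usage:
#   clean_data_dir "$CDDS_DATA_DIR"
```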