Data Operators

So far one aspect of manipulating UM files hasn’t been addressed - modifying the actual data content of the fields. This section will cover how to do this in detail, as it is a little more involved than simple filtering and header modifications.

The principles behind operators

A key goal of this API is to be fairly lightweight and efficient; this is at its most difficult when processing large UM files containing tens of thousands of fields. In the basic part of the user guide you saw that a mule.Field object doesn’t store any data directly - instead it provides a get_data() method which returns a data array when called. This mechanism is central to the way data operators work.
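As a quick reminder of this behaviour (a minimal sketch - the filename here is just a placeholder for one of your own files):

import mule

# Loading a file reads only the headers; no field data yet
ff = mule.FieldsFile.from_file("example_file.ff")

field = ff.fields[0]

# The data is only read from disk (and unpacked) at this point
data = field.get_data()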

In the case of fields in a file object loaded from disk, the get_data() method is directly linked to some subclass of the mule.DataProvider class attached to the field. When reading from a file that class contains instructions to:

  1. Open the file containing the field (if it wasn’t already open).

  2. Read in the raw data of the field.

  3. Unpack and/or uncompress the field data if it was packed.

All of the above then allow the 2-d data array to be returned to you. In the earlier section we called get_data() manually to do this, but consider what happens when you don’t do this and instead write some fields out to a new file: for each field being written, the API first calls get_data() to retrieve the field’s data, then writes that data out to the new file.

Note

Actually, it is a little more complicated than this - if the field’s data hasn’t been modified as we are about to describe and the packing settings (lbpack and bacc) of the field haven’t been changed, the data provider actually bypasses step 3 above (because there’s no point in unpacking all the data only to immediately re-pack it again!)

So with all that in mind - in order to efficiently make changes to the data in a field you hook into this get_data() mechanism, intercepting the data given by the field’s normal data provider and adding your own changes. A mule.DataOperator provides a simple and re-usable framework to do exactly this.

Defining an operator

Before we dive in and try to write a mule.DataOperator let’s first quickly examine what parts make up an operator. Here’s a definition of an operator:

import mule

class ExampleOperator(mule.DataOperator):

    def __init__(self):
        pass

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        return source_field.get_data()

This is pretty much the bare minimum example of an operator; if you carried the example through it would work, but it wouldn’t actually have any effect on anything. Still, let’s take a moment to analyse what we can see above.

Firstly, the operator inherits from mule.DataOperator - this is an important detail, as without the logic contained in this parent class the functionality will not work. Your operator must override the 3 methods you see here (not doing so will cause an exception to be raised when the operator is used). Each of these methods has a specific purpose.

The new_field() method

Let’s start with the new_field() method. When you come to use this operator you will apply it to an existing mule.Field object; at that point a reference to the original field object will be passed to new_field(). The method must return a new field object (as the name implies), and in the example above it does so by taking an exact copy of the original field. In practice, however, this is where you might want to make any changes to the lookup header that are required by the operation, for instance:

import mule

class ExampleOperator(mule.DataOperator):

    def __init__(self):
        pass

    def new_field(self, source_field):
        field = source_field.copy()

        field.lbproc += 512

        return field

    def transform(self, source_field, new_field):
        return source_field.get_data()

Now the new_field() method is again copying the source field, but it increments the “lbproc” value of the new field by 512 before returning it. To save you reaching for UMDP F03: this change indicates that the field’s data is the “square root of a field”, so if this operator were designed to take the square root of the original data this would be a suitable change to make here.

Warning

It is highly advisable not to modify the “source_field” argument in this routine. If you do, the original field will be modified after the call to your operator - unless you are being very careful this will be confusing and could lead to all sorts of problems.

The transform() method

This is the most important method in the operator - it is the method that will be called by the new field object (returned by new_field()) when that field’s get_data() method is called. It must return the data array for the field, and it is where you introduce your own modifications (in practice it won’t be called until it is time to write the field out to a new file).

As with the new_field() method, this method will be passed a reference to the original field object, as well as a reference to the new field object. In the example above the transform() method simply took the data from the original field and returned it (resulting in no change), so let’s update that:

import mule
import numpy as np

class ExampleOperator(mule.DataOperator):

    def __init__(self):
        pass

    def new_field(self, source_field):
        field = source_field.copy()

        field.lbproc += 512

        return field

    def transform(self, source_field, new_field):
        data = source_field.get_data()

        data = np.sqrt(data)

        return data

Continuing the idea from the new_field() method - our transform() method now does what the new “lbproc” code indicates. It first obtains the original data from the source field (by calling its get_data() method) and then calculates the element-wise square root before returning it.

Warning

Just like with the new_field() method - it is strongly recommended that you do not modify either the “source_field” or “new_field” arguments in this routine. They are intended to be for reference only.

The __init__() method

That only leaves the __init__() method - this is just like any other class initialisation method in Python. There are no special requirements for what it should do, but it can be used to pass additional information to different instances of the same operator; we will see this in action in the upcoming example.

Your first operator

Let’s create a real operator now and try applying it to some fields; we’ll start with the same barebones example as above. (You may want to put this into a script at this point, as running it at the command line will become tiresome!):

import mule

class ExampleOperator(mule.DataOperator):

    def __init__(self):
        pass

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        return source_field.get_data()

To make it easy to see what the operator is doing, we are going to scale a region of the input field by a fixed factor. Here’s some code to do that (note we will also re-name the operator to something more relevant):

class ScaleBoxOperator(mule.DataOperator):

    def __init__(self):
        pass

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        data = source_field.get_data()

        # The data array is dimensioned (rows, points-per-row)
        size_x = new_field.lbrow
        size_y = new_field.lbnpt

        # Indices bounding (roughly) the middle third of the domain
        # (note the integer division, so these are valid indices)
        x_1 = size_x // 3
        x_2 = 2*x_1
        y_1 = size_y // 3
        y_2 = 2*y_1

        data[x_1:x_2, y_1:y_2] = 0.1*data[x_1:x_2, y_1:y_2]

        return data

This simply grabs approximately the middle third of the data and lowers the values there by 90%. Before we continue, let’s apply the operator to a field (we’ll take a field from one of the example files used in the basic section of the guide - see that section for details):

scale_operator = ScaleBoxOperator()

# "ff" is a FieldsFile object and we take the second field this time
field = ff.fields[1]

new_field = scale_operator(field)

Try calling the get_data() method of either the original field or the new field and plotting the data (again, see the basic section for details). You should be able to see that the new field has the central region scaled as intended.
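For instance, a quick visual comparison might look something like this (a sketch assuming matplotlib is available - it is not a dependency of the API itself):

import matplotlib.pyplot as plt

# Plot the original and scaled fields side-by-side
fig, (ax_1, ax_2) = plt.subplots(1, 2)
ax_1.pcolormesh(field.get_data())
ax_1.set_title("Original")
ax_2.pcolormesh(new_field.get_data())
ax_2.set_title("Scaled")
plt.show()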

Notice that the operator still needs to be instantiated (the first line above), but once created it can be used to process any number of fields. The instantiation is also the point where you could pass arguments to the __init__() method; for example, here it might be logical to allow the scaling factor to be passed in:

class ScaleBoxOperator(mule.DataOperator):

    def __init__(self, factor):
        self.factor = factor

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        data = source_field.get_data()

        size_x = new_field.lbrow
        size_y = new_field.lbnpt

        x_1 = size_x // 3
        x_2 = 2*x_1
        y_1 = size_y // 3
        y_2 = 2*y_1

        data[x_1:x_2, y_1:y_2] = self.factor*data[x_1:x_2, y_1:y_2]

        return data

The passed argument is simply saved onto the operator instance and then re-used in the transform() method as required. Doing it this way means we can create slightly different operator instances from the same class, like this:

scale_half_operator = ScaleBoxOperator(0.5)
scale_quarter_operator = ScaleBoxOperator(0.25)

We aren’t going to do anything in the new_field() method here, because we already covered it in the example above (and there isn’t really anything sensible we could set in the header for this slightly odd manipulation), but it would work in just the same way.
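To round things off, a typical end-to-end pattern is to apply an operator to the fields of an input file and write the results to a new file. Here is a minimal sketch (assuming “ff” is the FieldsFile object from earlier; the output filename is just an example):

# Copy the input file object; this copies the various header
# components but not the fields themselves
ff_out = ff.copy()

# Attach a processed version of each field to the new file
for field in ff.fields:
    ff_out.fields.append(scale_half_operator(field))

# The transform() calls happen here, as each field's data is
# retrieved and written out
ff_out.to_file("output_file.ff")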

Multi-field or other operators

In some cases the approach discussed above might not be quite sufficient for a task - for example if the new field is supposed to be a product or a difference of two or more existing fields, or if the new field isn’t based on an existing field at all.

The operator class allows for this; the first argument to both the new_field() and transform() methods is actually completely generic. You can pass any type you like, so long as the methods still return the correct results (a new mule.Field object and a data array, respectively). So, for example, an operator which multiplies two existing fields together might look like this:

class FieldProductOperator(mule.DataOperator):

    def __init__(self):
        pass

    def new_field(self, field_list):
        field = field_list[0].copy()

        field.lbproc += 256

        return field

    def transform(self, field_list, new_field):

        data_1 = field_list[0].get_data()
        data_2 = field_list[1].get_data()

        return data_1*data_2

Note that the input to new_field() is now a list of fields, and we simply assume the headers should be copied from the first field in the list (we increase “lbproc” by 256 - “Product of two fields” according to UMDP F03). The transform() method then retrieves the data from both fields and multiplies them together.
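Applying this operator follows the same pattern as before, except that a list of two fields is passed in (here “field_1” and “field_2” stand for any two compatible fields you have loaded):

product_operator = FieldProductOperator()

new_field = product_operator([field_1, field_2])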

Note

This example is designed for brevity, but in practice you might want to include some input checking in the methods - for example the above could check that the input is actually a list and that it contains exactly 2 fields. However, note that you don’t need to repeat the checks in both methods (the argument passed to transform() will always be exactly what was passed to new_field()).
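One possible way to add such a check - this is just a suggestion, the API doesn’t require any particular approach - is at the top of the new_field() method:

import mule

class FieldProductOperator(mule.DataOperator):

    def __init__(self):
        pass

    def new_field(self, field_list):
        # Validate here only; transform() is guaranteed to receive
        # exactly the same object, so it need not re-check
        if not isinstance(field_list, list) or len(field_list) != 2:
            raise ValueError(
                "FieldProductOperator expects a list of exactly 2 fields")

        field = field_list[0].copy()
        field.lbproc += 256
        return field

    def transform(self, field_list, new_field):
        return field_list[0].get_data()*field_list[1].get_data()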

In fact the first argument can be literally anything - so you are free to implement your operator however you wish (as long as each method returns the correct output).
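As a final illustration, here is a sketch of an operator whose input isn’t a single field at all, but a (template field, value) pair - it produces a field filled with a constant value, using the template field only for its headers (the class name and approach here are our own invention):

import mule
import numpy as np

class ConstantFieldOperator(mule.DataOperator):

    def __init__(self):
        pass

    def new_field(self, args):
        template_field, _ = args
        return template_field.copy()

    def transform(self, args, new_field):
        template_field, value = args
        # Use the template purely to obtain the required shape
        data = template_field.get_data()
        return np.full(data.shape, value)

constant_operator = ConstantFieldOperator()
new_field = constant_operator((field, 273.15))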

Provided Operators for LBCs

Compared to the other file types, the data sections of the fields in LBC files are slightly more awkward to interpret. In this section we will explain the features which can help with transforming LBC data - for full details of exactly how the data is arranged, consult the main UM documentation.

Supposing we have loaded an LBC file, accessing the data from the first field will return an array with one dimension being the vertical level and the other containing all of the points in the field, in an LBC-specific ordering:

>>> # "lbc" is an LBCFile object
>>> field = lbc.fields[0]
>>> data = field.get_data()
>>> data.shape
(38, 272)

In some cases this might be suitable for your requirements without any extra interpretation - for example, if you simply want to scale the entire field by a factor or add it to another field, it doesn’t matter that the points are arranged in this way. However, if your processing needs to refer to specific parts of the domain, or you wish to visualise the data, you can make use of the following built-in operator:

>>> from mule.lbc import LBCToMaskedArrayOperator
>>> lbc_to_masked = LBCToMaskedArrayOperator()
>>> masked_field = lbc_to_masked(field)
>>> data = masked_field.get_data()
>>> type(data)
<class 'numpy.ma.core.MaskedArray'>
>>> data.shape
(38, 18, 24)

It’s a simple operator, requiring no arguments and applied directly to a standard LBC field. The resulting field’s get_data() method returns a masked array: the level dimension is unchanged, but the points dimension has been expanded into a 2-d representation of the full domain, with the central (non-boundary) portion of the domain providing the mask.

Of course, if this is being done as part of a broader set of data operations with the intention of writing out the modified field, it will need to be translated back the other way before writing. An equivalent operator exists to perform this reverse translation:

>>> from mule.lbc import MaskedArrayToLBCOperator
>>> masked_to_lbc = MaskedArrayToLBCOperator()
>>> field = masked_to_lbc(masked_field)
>>> data = field.get_data()
>>> type(data)
<class 'numpy.ndarray'>
>>> data.shape
(38, 272)

As discussed above, the modular nature of the operators means that for LBC files a common pattern will be to apply the LBCToMaskedArrayOperator to a field from an input file, followed by one or more operators of your own, and finally the MaskedArrayToLBCOperator to prepare the field for output.
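Put together, that pattern looks something like this (a sketch - “my_operator” stands for any operator of your own that works on the masked 3-d arrays):

from mule.lbc import LBCToMaskedArrayOperator, MaskedArrayToLBCOperator

lbc_to_masked = LBCToMaskedArrayOperator()
masked_to_lbc = MaskedArrayToLBCOperator()

# Translate to the masked 2-d domain, apply our own processing,
# then translate back to the LBC ordering ready for output
masked_field = lbc_to_masked(lbc.fields[0])
processed_field = my_operator(masked_field)
output_field = masked_to_lbc(processed_field)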

Conclusion

Having read through this section you should have an idea of how to use data operators to manipulate the data in UM files. As this is a slightly abstract concept, the best way to improve your understanding from here is to try writing a few simple operators of your own and see what you can come up with!