Data Operators¶
So far one aspect of manipulating UM files hasn’t been addressed - modifying the actual data content of the fields. This section covers how to do this in detail, as it is a little more involved than simple filtering and header modifications.
The principles behind operators¶
A key goal of this API is to be fairly lightweight and efficient; this
is at its most difficult when processing large UM files containing
tens of thousands of fields. In the basic part of the user guide you
saw that a mule.Field object doesn’t store any data, only a
get_data() method which returns a data array when called; this
is central to the way data operators work.
In the case of fields in a file object loaded from disk, the get_data()
method is directly linked to some subclass of the mule.DataProvider
class attached to the field. When reading from a file that class contains
instructions to:
1. Open the file containing the field (if it wasn’t already open).
2. Read in the raw data of the field.
3. Unpack and/or uncompress the field data if it was packed.
All of the above then allow the 2-d data array to be returned to you. In
the earlier section we called get_data() manually to do this, but
consider what happens when you don’t do this and instead write out
some fields to a new file. For each field being written the API first
calls get_data() to retrieve the field’s data, then it writes
that data out to the new file.
Note
Actually, it is a little more complicated than this - if the field’s data hasn’t been modified (in the way we are about to describe) and the packing settings (lbpack and bacc) of the field haven’t been changed, the data provider bypasses step 3 above (because there’s no point in unpacking all the data only to immediately re-pack it again!)
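To make this concrete, here is a minimal sketch of the deferred-read behaviour (the file paths here are hypothetical, and "ff" is a FieldsFile as in the basic section):

import mule

# Loading a file reads the headers, but no field data yet:
ff = mule.FieldsFile.from_file("input.ff")
field = ff.fields[0]

# This call triggers the field's data provider to open the file,
# read the raw bytes and unpack them into a 2-d array:
data = field.get_data()

# Writing the fields out; to_file() calls get_data() on each field
# behind the scenes:
ff.to_file("output.ff")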
So with all that in mind - to efficiently make changes to the data
in a field you hook into this get_data() mechanism, intercepting
the data given by the field’s normal data provider and adding your own
changes. A mule.DataOperator provides a simple and re-usable
framework to do exactly this.
Defining an operator¶
Before we dive in and try to write a mule.DataOperator, let’s first
quickly examine the parts that make up an operator. Here’s a definition of an
operator:
import mule

class ExampleOperator(mule.DataOperator):
    def __init__(self):
        pass

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        return source_field.get_data()
This is pretty much the absolute barebones minimal example of an operator; if you carried the example through it would work, but it won’t actually have any effect on anything yet. Still, let’s take a moment to analyse what we can see above.
Firstly, the operator inherits from mule.DataOperator - this is an
important detail, as without the logic contained in this parent class the
functionality will not work. Your operator must override all three methods
you see here (not doing so will cause an exception to be raised when the
operator is used). Each of these methods has a special purpose.
The new_field() method¶
Let’s start with the new_field() method - when you come to use this
operator you will apply it to an existing mule.Field object. At
that point a reference to the original field object is passed to
new_field(). The method must return a new field object (as the
name implies), and in the example above it does so by taking an exact
copy of the original field. In practice, however, this is where you might
want to make any changes to the lookup header that are required by the
operation, for instance:
import mule

class ExampleOperator(mule.DataOperator):
    def __init__(self):
        pass

    def new_field(self, source_field):
        field = source_field.copy()
        field.lbproc += 512
        return field

    def transform(self, source_field, new_field):
        return source_field.get_data()
Now the new_field() method is again copying the source field, but it
increments the “lbproc” value of the new field by 512 before returning it.
To save you reaching for UMDP F03: this change indicates that
the field’s data is the “square root of a field”, so if this operator were
designed to take the square root of the original data this would be a suitable
change to make here.
Warning
It is highly advisable not to modify the “source_field” argument in this routine. If you do, the original field will be modified after the call to your operator; unless you are being very careful this will be confusing and could lead to all sorts of problems.
The transform() method¶
This is the most important method in the operator - it is the method
that will be called by the new field object (returned by the new_field()
method) when that field’s get_data() method is called. It must return
the data array for the field, and this is where you introduce your own
modifications (in practice this won’t be called until it is time to
write the field out to a new file).
As with the new_field() method, this method is passed a reference to
the original field object, as well as a reference to the new field object.
In the example above the transform() method simply took the data
from the original field and returned it (resulting in no change), so let’s
update that:
import mule
import numpy as np

class ExampleOperator(mule.DataOperator):
    def __init__(self):
        pass

    def new_field(self, source_field):
        field = source_field.copy()
        field.lbproc += 512
        return field

    def transform(self, source_field, new_field):
        data = source_field.get_data()
        data = np.sqrt(data)
        return data
Continuing the idea from the new_field() method - our transform()
method now does what the new “lbproc” code indicates: it first obtains the
original data from the source field (by calling its get_data() method)
and then calculates the element-wise square root before returning it.
Warning
Just like with the new_field() method - it is strongly recommended
that you do not modify either the “source_field” or “new_field”
arguments in this routine. They are intended to be for reference only.
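As a quick illustration, here is how the operator above might be applied; a sketch assuming “ff” is an already-loaded FieldsFile object, as in the basic section of the guide:

# Hypothetical usage of the square-root operator defined above:
sqrt_operator = ExampleOperator()
sqrt_field = sqrt_operator(ff.fields[0])

# Nothing has been computed yet - the square root is only taken when
# the new field's data is requested (e.g. at write time):
data = sqrt_field.get_data()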
The __init__() method¶
That only leaves the __init__() method - this method is just like any other class initialiser in Python. There are no special requirements for what it should do, but it can be used to pass additional information to different instances of the same operator. An example of this appears in the next section.
Your first operator¶
Let’s actually create a real operator now and try applying it to some fields; we’ll start with the same barebones example as above. (You may want to put this into a script at this point, as running it at the command line will become tiresome!):
import mule

class ExampleOperator(mule.DataOperator):
    def __init__(self):
        pass

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        return source_field.get_data()
To make it easy to see what the operator is doing, we are going to scale a region of the input field by a factor. Here’s some code to do that (note that we also rename the operator to something more relevant):
class ScaleBoxOperator(mule.DataOperator):
    def __init__(self):
        pass

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        data = source_field.get_data()
        # Integer division here, since these values are array indices
        size_x = new_field.lbrow
        size_y = new_field.lbnpt
        x_1 = size_x // 3
        x_2 = 2 * x_1
        y_1 = size_y // 3
        y_2 = 2 * y_1
        data[x_1:x_2, y_1:y_2] = 0.1 * data[x_1:x_2, y_1:y_2]
        return data
We’re just grabbing approximately the middle third of the data and reducing the values there by 90%. Before we continue, let’s apply this to a field (we’ll take a field from one of the example files used in the basic section of the guide; see that section for details):
scale_operator = ScaleBoxOperator()
# "ff" is a FieldsFile object and we take the second field this time
field = ff.fields[1]
new_field = scale_operator(field)
Try calling the get_data() method of either the original field or the
new field and plotting the data (again see the basic section for details).
You should be able to see that the new field has the central region scaled
as we intended.
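For example, a quick visual comparison might look like this (a sketch assuming matplotlib is available; see the basic section for the guide’s own plotting examples):

import matplotlib.pyplot as plt

# Side-by-side plots of the original and scaled fields:
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(field.get_data(), origin="lower")
ax1.set_title("Original")
ax2.imshow(new_field.get_data(), origin="lower")
ax2.set_title("Scaled")
plt.show()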
Notice that the operator still needs to be instantiated (the first line above),
but it can then be used to process any number of fields. The initial call is
also the point at which you can pass arguments to the __init__() method;
for example, here it might be logical to pass in the scaling factor:
class ScaleBoxOperator(mule.DataOperator):
    def __init__(self, factor):
        self.factor = factor

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        data = source_field.get_data()
        size_x = new_field.lbrow
        size_y = new_field.lbnpt
        x_1 = size_x // 3
        x_2 = 2 * x_1
        y_1 = size_y // 3
        y_2 = 2 * y_1
        data[x_1:x_2, y_1:y_2] = self.factor * data[x_1:x_2, y_1:y_2]
        return data
The passed argument is simply saved to the operator and then re-used in the
transform() method as required. By doing it this way we can create
slightly different operator instances from the same class, like this:
scale_half_operator = ScaleBoxOperator(0.5)
scale_quarter_operator = ScaleBoxOperator(0.25)
We aren’t doing anything in the new_field() method here, because
we already covered it in the example above (and there isn’t really anything
sensible we can set in the header for this slightly odd manipulation), but it
would work in just the same way.
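To round the example off, here is a sketch of applying one of these instances to every field and writing the result out. It assumes, as in the basic section, that “ff” is a FieldsFile object whose copy() method copies the file-level headers without the fields; the output path is hypothetical:

# Build an output file with every field scaled by a half:
ff_out = ff.copy()  # file-level headers only, no fields
ff_out.fields = [scale_half_operator(f) for f in ff.fields]
ff_out.to_file("scaled_output.ff")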
Multi-field or other operators¶
In some cases the approach discussed above might not be quite sufficient for a task - for example, if the new field is supposed to be a product or a difference of two or more existing fields, or if the new field isn’t based on an existing field at all.
The operator class allows for this; the first argument to both the
new_field() and transform() methods is actually completely
generic. You can pass any type you like to these, so long as the methods
still return the correct result (a new mule.Field object and a
data array, respectively). So for example an operator which multiplies
two existing fields together might look like this:
class FieldProductOperator(mule.DataOperator):
    def __init__(self):
        pass

    def new_field(self, field_list):
        field = field_list[0].copy()
        field.lbproc += 256
        return field

    def transform(self, field_list, new_field):
        data_1 = field_list[0].get_data()
        data_2 = field_list[1].get_data()
        return data_1 * data_2
Note that our input to new_field() is now a list of fields, and we
simply assume the headers should be copied from the first field in the list
(we increment “lbproc” by 256 - “Product of two fields” according to UMDP F03).
The transform() method then retrieves the data from both fields and multiplies
them together.
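Usage follows the same pattern as before, except that the operator is called with a list of two fields rather than a single field (the field indices here are purely illustrative):

product_operator = FieldProductOperator()

# The same list object is given to both new_field() and transform():
product_field = product_operator([ff.fields[1], ff.fields[2]])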
Note
This example is designed for brevity, but in practice you might want
to include some input checking in the methods - for example the above
could check that the input is actually a list and that it contains
at least 2 fields (and perhaps that it contains exactly 2 fields). However,
note that you don’t need to repeat the checks in both of the methods
(the argument passed to transform() will always be exactly
what was passed to new_field())
In actual fact the first argument can be literally anything, so you are free to implement your operator however you wish (as long as each method returns the correct output).
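For instance, here is a hypothetical operator whose argument is a (field, value) tuple rather than a field:

class AddConstantOperator(mule.DataOperator):
    def __init__(self):
        pass

    def new_field(self, field_and_value):
        # Only the field part of the tuple is needed for the header
        field, _ = field_and_value
        return field.copy()

    def transform(self, field_and_value, new_field):
        field, value = field_and_value
        return field.get_data() + value

# Applying it; "ff" is assumed to be a loaded FieldsFile as before:
add_constant = AddConstantOperator()
new_field = add_constant((ff.fields[0], 10.0))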
Provided Operators for LBCs¶
Compared to the other file types, the data sections of the fields in LBC files are slightly more awkward to interpret. In this section we will explain the features which can help with transforming LBC data; for full details of exactly how the data is arranged, consult the main UM documentation.
Suppose we have loaded an LBC file; accessing the data from the first field will return an array with one dimension being the vertical level and the other containing all points in the field in an LBC-specific ordering:
>>> # "lbc" is an LBCFile object
>>> field = lbc.fields[0]
>>> data = field.get_data()
>>> data.shape
(38, 272)
In some cases this might be suitable for your requirements without any extra interpretation. For example, if you simply want to scale the entire field by a factor or add it to another field, it doesn’t matter that the points are arranged in this way (a sketch of such an operator follows below). However, if your processing needs to refer to specific parts of the domain, or if you wish to visualise the data in some way, you can make use of the built-in operator demonstrated after the sketch.
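Here is a brief, hypothetical sketch of such a whole-field operator; because it treats the data array as a whole, it works on an LBC field without any knowledge of the point ordering:

import mule

class ScaleOperator(mule.DataOperator):
    """Scale an entire field by a constant factor."""
    def __init__(self, factor):
        self.factor = factor

    def new_field(self, source_field):
        return source_field.copy()

    def transform(self, source_field, new_field):
        # Whole-array arithmetic is indifferent to point ordering
        return self.factor * source_field.get_data()

Returning to the built-in operator: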
>>> from mule.lbc import LBCToMaskedArrayOperator
>>> lbc_to_masked = LBCToMaskedArrayOperator()
>>> masked_field = lbc_to_masked(field)
>>> data = masked_field.get_data()
>>> type(data)
<class 'numpy.ma.core.MaskedArray'>
>>> data.shape
(38, 18, 24)
It’s a simple operator, requiring no arguments and mapping directly from a
standard LBC field. The resulting field’s get_data() method returns
a masked array, in which the central portion of the LBC domain provides the
mask. The level dimension remains, but the points dimension has been expanded
into two spatial dimensions, so each level appears as a 2-d array.
Of course, if this is being done as part of a broader set of data operations with the intention of writing out the modified field, the data will need to be translated back the other way before writing. An equivalent operator exists to perform this reverse translation:
>>> from mule.lbc import MaskedArrayToLBCOperator
>>> masked_to_lbc = MaskedArrayToLBCOperator()
>>> field = masked_to_lbc(masked_field)
>>> data = field.get_data()
>>> type(data)
<class 'numpy.ndarray'>
>>> data.shape
(38, 272)
As discussed above, the modular nature of the operators means that for LBC files
a common pattern is to apply the LBCToMaskedArrayOperator to a
field from an input file, follow it with one or more operators of your own, and
eventually use the MaskedArrayToLBCOperator to prepare the field for output.
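A sketch of that pattern might look like this, where “my_operator” stands in for any operator of your own, such as those developed earlier:

lbc_to_masked = LBCToMaskedArrayOperator()
masked_to_lbc = MaskedArrayToLBCOperator()

# Translate to a masked array, apply your own operator, then
# translate back ready for writing out:
out_field = masked_to_lbc(my_operator(lbc_to_masked(field)))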
Conclusion¶
Having read through this section you should have an idea of how to use data operators to manipulate the data in UM files. Since this is a slightly abstract concept, the best way to improve your understanding from here is to write a few simple operators of your own and see what you can come up with!