The Default Transform & Working with Records
Part 1 - The Default Transform & Working with Records
Part 2 - Reading from Records
Part 3 - Writing to Records
Part 4 - Removing Fields from a Record
Part 5 - Working with Strings in Python
Matterbeam allows you to write your own Transformation steps in Python: functional units of code that run on individual records. (If you’re familiar with Kafka, RedPanda, or Cloudera, you could think of them as similar in concept to SMTs: Single Message Transforms.)
The “default” Transform passes all records through unchanged. It is not really a “transformation”, per se, but it’s still useful as an example for explaining how Transforms work in general:
def transform(records):
    for dataset_id, record_id, record in records:
        yield dataset_id, record_id, record
Let’s go through it line by line:
def transform(records):
This defines our Transform function. By convention, it should be called “transform” and accept a single argument: a collection of records to be transformed. (If we wanted to import other code or resources, we would do so above this line.)
for dataset_id, record_id, record in records:
This iterates through the collection of records, “unpacking” each item into three named variables:
dataset_id: The string ID of the Dataset that this record came from. (This may be useful for Transforms that read from multiple Datasets.)
record_id: The string ID of the individual record being processed, within the source Dataset above. (This may be useful for referencing Records outside Matterbeam.)
record: The field names and values of the actual record that we’ll be transforming. All Transforms will need to read from (and possibly also write to) this object.
yield dataset_id, record_id, record
Finally, this yields the dataset_id, record_id, and record from our Transform function, allowing the record to “pass through”, along with any changes you’ve made to it, to another Transformation Step or into an output Dataset.
This is a very important part of the Transform: if you add conditions or other logic that prevents a record from being yielded, the record is effectively “filtered out”, so it won’t be included in the output Dataset (or processed by any Transform steps that run after this one).
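As a minimal sketch of that filtering behaviour, the Transform below skips some records instead of yielding them. The "status" field, its "inactive" value, and the sample data are hypothetical and exist only for illustration. The list of tuples is a local stand-in for whatever collection Matterbeam actually passes in.

```python
def transform(records):
    for dataset_id, record_id, record in records:
        # Hypothetical rule: drop any record whose "status" field is "inactive".
        if record.get("status") == "inactive":
            continue  # never yielded, so this record is filtered out
        yield dataset_id, record_id, record


# Invented sample data, standing in for the records Matterbeam would supply.
sample = [
    ("customers", "1", {"name": "Ada", "status": "active"}),
    ("customers", "2", {"name": "Bob", "status": "inactive"}),
]

kept = list(transform(sample))
print(kept)
```

Only the first tuple reaches the output, because the second record is never yielded.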
In order to make a Transform that meaningfully changes data, we’ll need to access the contents of the record: adding, removing, or altering fields, or using information from the record to decide whether it should be included in, or filtered out of, the output of our Transform.
Records are implemented as Python dictionary objects, which are mappings of unique “keys” (field names) to “values” (the values of those fields). It’s worth noting that, in a Matterbeam Record, the “keys” will always be strings (essentially, plain text), though the “values” may be strings, numbers, or dictionaries themselves...
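Because a record is an ordinary Python dictionary, the standard dictionary operations apply to it. The sketch below reads, adds, and removes fields inside a Transform. All of the field names ("first_name", "address", "internal_notes", and so on) and the sample record are invented for illustration, and changing the record in place is an assumption based on the description above, not a confirmed Matterbeam requirement.

```python
def transform(records):
    for dataset_id, record_id, record in records:
        # Read a field, falling back to a default if it is missing.
        first_name = record.get("first_name", "")

        # Add (or overwrite) a field by assigning to a string key.
        record["greeting"] = "Hello, " + first_name

        # A value may itself be a dictionary, so lookups can be nested.
        record["city"] = record.get("address", {}).get("city")

        # Remove a field; the None default avoids a KeyError if it is absent.
        record.pop("internal_notes", None)

        yield dataset_id, record_id, record


# Invented sample record: string keys, with string, number, and dictionary values.
sample = [
    ("people", "42", {
        "first_name": "Ada",
        "age": 36,
        "address": {"city": "London"},
        "internal_notes": "do not export",
    }),
]

result = list(transform(sample))
print(result[0][2])
```

Using `get` and `pop` with defaults keeps the Transform from failing on records that lack a field, which matters when not every record in a Dataset has the same shape.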
Continue reading part 2 of this series here, where we dive into reading from records.