Writing Transforms in Python - Part 1

The Default Transform & Working with Records: Matterbeam allows you to write your own Transformation steps in Python, functional units of code that run on individual records.

Full Series

Part 1 - The Default Transform & Working with Records
Part 2 - Reading from Records
Part 3 - Writing to Records
Part 4 - Removing Fields from a Record
Part 5 - Working with Strings in Python


Matterbeam allows you to write your own Transformation steps in Python: functional units of code that run on individual records. (If you’re familiar with Kafka, Redpanda, or Cloudera, you can think of them as similar in concept to SMTs, or Single Message Transforms.)

The “Default” Transform

The “default” Transform passes every record through unchanged. It is not really a “transformation”, per se, but it is still a useful example for explaining how Transforms work in general:

def transform(records):
    for dataset_id, record_id, record in records:
        yield dataset_id, record_id, record

Let’s go through it line by line:

def transform(records): 

This defines our Transform function. By convention, it should be called “transform” and accept a single argument: a collection of records to be transformed. (If we wanted to import other code or resources, we would do so above this line.)

for dataset_id, record_id, record in records:

This iterates through the collection of records, “unpacking” the contents of each item into three named variables: dataset_id, record_id, and record.
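
The unpacking in this line is ordinary Python tuple unpacking. Here is a minimal sketch of it in isolation, assuming each item in the collection is a three-element tuple. The dataset name, record IDs, and field values are made up for illustration:

```python
# Hypothetical sample data: each item is a (dataset_id, record_id, record) tuple.
sample_records = [
    ("customers", "rec-1", {"name": "Ada"}),
    ("customers", "rec-2", {"name": "Grace"}),
]

seen = []
for dataset_id, record_id, record in sample_records:
    # On every pass of the loop, one tuple is split into three variables.
    seen.append((record_id, record["name"]))

print(seen)
```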

yield dataset_id, record_id, record

Finally, this yields the dataset_id, record_id, and record from our Transform function, allowing the record to “pass through”, along with any changes you’ve made to it, to another Transformation step or into an output Dataset.

This is a very important part of the Transform: if you add conditions or other logic that prevent a record from being yielded, you effectively “filter out” that record, so it won’t be included in the output Dataset (or processed by any Transform steps that come after this one).
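
As a rough sketch of that filtering behavior, the Transform below only yields records that pass a condition. The “status” field and its values are hypothetical, and the sample data at the bottom is only a local check, not a description of how Matterbeam itself calls the function:

```python
def transform(records):
    for dataset_id, record_id, record in records:
        # Hypothetical rule: keep only records whose "status" field is "active".
        if record.get("status") != "active":
            continue  # never yielded, so this record is filtered out
        yield dataset_id, record_id, record


# Local check with made-up sample data.
sample_records = [
    ("users", "1", {"status": "active"}),
    ("users", "2", {"status": "inactive"}),
]
kept = list(transform(sample_records))
print(kept)
```

Using record.get("status") rather than record["status"] means a record that lacks the field is simply filtered out instead of raising a KeyError. Whether that is the right choice depends on your data.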

Working with Records

To make a Transform that meaningfully changes data, we’ll need to access the contents of the record: adding, removing, or altering fields, or using information from the record to decide whether it should be included in, or filtered out of, the output of our Transform.

Records are implemented as Python dictionary objects, which map unique “keys” (field names) to “values” (the values of those fields). It’s worth noting that, in a Matterbeam Record, the “keys” will always be strings (essentially, plain text), though the “values” may be strings, numbers, or dictionaries themselves...
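
To make that shape concrete, here is an illustrative record written as a plain Python dictionary. The field names and values are invented for this example and are not taken from any real Dataset:

```python
# A made-up record: every key is a string, while the values vary in type.
record = {
    "name": "Ada Lovelace",  # string value
    "age": 36,               # number value
    "address": {             # nested dictionary value
        "city": "London",
    },
}

# Confirm that every top-level key is a string.
all_keys_are_strings = all(isinstance(key, str) for key in record)

# Nested dictionaries are read one key at a time.
city = record["address"]["city"]

print(all_keys_are_strings, city)
```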


Continue reading part 2 of this series here, where we dive into reading from records.
