Writing Transforms in Python

Full Series

Part 1 - The Default Transform & Working with Records
Part 2 - Reading from Records
Part 3 - Writing to Records
Part 4 - Removing Fields from a Record
Part 5 - Working with Strings in Python

Matterbeam allows you to write your own Transformation steps in Python: functional units of code that run on individual records. (If you’re familiar with Kafka, Redpanda, or Cloudera, you can think of them as similar in concept to SMTs, or Single Message Transforms.)

The “Default” Transform

The “default” Transform passes every record through unchanged. It is not really a “transformation” per se, but it is a useful example for explaining how Transforms work in general:

def transform(record):
    return record

Let’s go through it line by line:

def transform(record):

This defines our Transform function. By convention, it should be called “transform” and accept a single argument: the record to be transformed. (If we wanted to import other code or resources, we would do so above this line.)

return record

This returns the record from our Transform function, allowing it to “pass through”, along with any changes you’ve made to the record. (The record might move on to another Transformation Step, or into an output Dataset.)

This is a very important part of the Transform: if you add conditions or other logic that prevent a record from being returned, the record is effectively “filtered out”. It won’t be included in the output dataset, and it won’t be processed by any Transform steps that come after this one.
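As a minimal sketch of that filtering behavior, the Transform below only returns records that pass a condition. The field name “status” and the value “inactive” are hypothetical, chosen purely for illustration. When the condition fails, the function reaches its end without returning a record, which in Python means it implicitly returns None.

```python
def transform(record):
    # "status" is a hypothetical field name used only for this example.
    # .get() returns None instead of raising if the field is missing.
    if record.get("status") != "inactive":
        # Returning the record lets it pass through to the next step.
        return record
    # No return on this path: the function implicitly returns None,
    # so no record is handed back for "inactive" entries.
```

Records without a “status” field still pass through here, because the comparison only rejects records whose “status” is exactly “inactive”.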

Working with Records

To make a Transform that meaningfully changes data, we’ll need to access the contents of the record: adding, removing, or altering fields, or using information from the record to decide whether it should be included in, or filtered out of, the output of our Transform.

Records are implemented as Python dictionary objects, which map unique “keys” (field names) to “values” (the values of those fields). Note that in a Matterbeam Record the keys are always strings (essentially, plain text), though the values may be strings, numbers, or other objects.
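To make the shape of a record concrete, here is a small sketch using an ordinary Python dictionary. The field names and values are made up for illustration and do not come from any real dataset; the point is only that every key is a string while the values can be of different types.

```python
# A made-up record: every key is a string, values vary in type.
record = {
    "id": 42,              # number
    "name": "Ada",         # string
    "tags": ["a", "b"],    # list (an example of "other objects")
    "active": True,        # boolean
}

# Check that every field name is a string.
all_keys_are_strings = all(isinstance(key, str) for key in record)

# Map each field name to the type name of its value.
value_types = {key: type(value).__name__ for key, value in record.items()}
```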

Continue to Part 2 of this series, where we dive into reading from records.
