The Default Transform & Working with Records
Part 1 - The Default Transform & Working with Records
Part 2 - Reading from Records
Part 3 - Writing to Records
Part 4 - Removing Fields from a Record
Part 5 - Working with Strings in Python
Matterbeam allows you to write your own Transformation steps in Python: functional units of code that will be run on individual records. (If you’re familiar with Kafka, Redpanda, or Cloudera, you could think of them as being similar in concept to SMTs: Single Message Transforms.)
The “default” Transform passes every record through unchanged. It is not really a “transformation”, per se, but it’s still a useful example for explaining how Transforms work in general:
def transform(record):
    return record
Let’s go through it line by line:
def transform(record):
This defines our Transform function. By convention, it should be called “transform” and accept a single argument: the record to be transformed. (If we wanted to import other code or resources, we would do so above this line.)
return record
This returns the record from our Transform function, allowing it to “pass through” along with any changes you’ve made to it. (The record might move on to another Transformation step, or into an output Dataset.)
This is a very important part of the Transform: if you add conditions or other logic that prevents a record from being returned, the record is effectively “filtered out”, so it won’t be included in the output Dataset (or processed by any Transform steps that come after this one).
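As a rough sketch of that filtering behavior, the Transform below only returns records that pass a condition. The "status" field and the "inactive" value are hypothetical, chosen purely for illustration. The sketch also assumes that falling off the end of the function (which returns None in Python) is what counts as "not returning" the record.

```python
def transform(record):
    # "status" is a hypothetical field name, used only for illustration.
    # Records that are not marked inactive pass through unchanged.
    if record.get("status") != "inactive":
        return record
    # Inactive records fall through to here. Nothing is returned
    # (Python returns None implicitly), so the record is filtered out.
```

Using record.get("status") rather than record["status"] means records that lack the field entirely are passed through instead of raising a KeyError.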
To make a Transform that meaningfully changes data, we’ll need to access the contents of the record: adding, removing, or altering fields, or using information from the record to decide whether it should be included in, or filtered out of, the output of our Transform.
Records are implemented as Python dictionary objects, which are mappings of unique “keys” (field names) to “values” (the values of those fields). It’s worth noting that, in a Matterbeam Record, the “keys” will always be strings (essentially, plain text), though the “values” may be strings, numbers, or other objects.
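Because a record is an ordinary dictionary, the usual dictionary operations apply. The sketch below uses a made-up record (the field names "name", "age", and "tags" are assumptions, not part of any real Dataset) to show one field being altered, one added, and one removed. The later parts of this series cover each operation in more detail.

```python
# A hypothetical record: keys are strings, values can be of mixed types.
sample_record = {
    "name": "Ada",
    "age": 36,
    "tags": ["admin", "beta"],
}

def transform(record):
    # Alter an existing field (falls back to "" if the field is missing).
    record["name"] = str(record.get("name", "")).upper()
    # Add a new field derived from another one.
    record["tag_count"] = len(record.get("tags", []))
    # Remove a field; the None default avoids a KeyError if it is absent.
    record.pop("age", None)
    return record

# Try the Transform locally on a copy of the sample record.
result = transform(dict(sample_record))
```

Calling the function on a plain dictionary like this is a convenient way to check a Transform's logic before running it on real records.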
Continue with Part 2 of this series, where we dive into reading from records.