Reading from Records
Part 1 - The Default Transform & Working with Records
Part 2 - Reading from Records
Part 3 - Writing to Records
Part 4 - Removing Fields from a Record
Part 5 - Working with Strings in Python
There are two main ways to access the individual values (or “fields”) of a Record: indexing and the get() method.
We can do direct lookups on record using square brackets and the name of the field we want to access, which is always a case-sensitive string. For example, to look up a field called “Username” and assign it to a variable:
username = record["Username"]
This is very simple, but will raise an Exception (essentially, cause an Error, stopping your Transform) if the referenced field doesn’t exist. This isn’t necessarily a bad thing, but it is a tradeoff: if you expect a field to exist, and its absence represents a problem, this syntax makes that very clear and prevents your Transform from running without it!
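To see this failure mode in isolation, here is a minimal sketch that uses a plain Python dictionary as a stand-in for record (an assumption; the sample field and value are made up for illustration). With a plain dictionary, the exception raised for a missing key is a KeyError:

```python
# A plain dict standing in for a record (hypothetical sample data).
record = {"Email": "ada@example.com"}

try:
    username = record["Username"]  # "Username" is missing, so this raises
except KeyError as error:
    # The missing key is available as the exception's first argument.
    missing_field = error.args[0]
    print(f"Missing field: {missing_field}")
```

Inside a real Transform you would usually let the error propagate, since that is what stops the Transform from running without the field.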
If a field is optional, or is otherwise not expected to be present in every record, the “get” method is a convenient option:
username = record.get("Username")
# If record has no "Username" field, username will be left as None
If “Username” is available, this behaves the same as the square bracket syntax above. However, if the “Username” field isn’t available on a record, there won’t be an error: it will return None instead. (This is the Python equivalent of NULL. Don’t worry too much about the distinction, as None values will automatically become NULL values if written out to a JSON file, SQL Database, or other non-Python service.)
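As a small illustration of that conversion, the sketch below uses Python's standard json module and a plain dictionary standing in for record (an assumption; how your own output is written depends on the service involved). JSON spells the value in lowercase, as null:

```python
import json

# A plain dict standing in for a record (hypothetical sample data).
record = {"Email": "ada@example.com"}

username = record.get("Username")  # None, because the field is missing

# Serializing to JSON turns Python's None into JSON's null.
output = json.dumps({"Username": username})
print(output)  # {"Username": null}
```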
None (and JSON’s null) are special values used to indicate a "lack of data": when actual data is missing, unavailable, or hasn’t arrived yet...
The “get” method also allows us to specify a default by adding a second argument:
username = record.get("Username", "(unknown?)")
# If record has no "Username" field, username will be set to "(unknown?)"
This changes the behavior when the “Username” field is missing from a record: instead of returning None, it now returns the default value “(unknown?)”. This is very useful if we want to enforce a default value for a field that isn’t always available, or otherwise improve the consistency of our data.
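One detail worth knowing: with a plain Python dictionary, the default is used only when the key is missing entirely, not when the key is present and holds None. The sketch below assumes records behave like plain dictionaries, and the sample data is made up:

```python
# Plain dicts standing in for records (hypothetical sample data).
records = [
    {"Username": "ada"},
    {"Email": "grace@example.com"},  # no "Username" field at all
    {"Username": None},              # field present, but holding None
]

usernames = [record.get("Username", "(unknown?)") for record in records]
print(usernames)  # ['ada', '(unknown?)', None]
```

If a None value should also fall back to the default, an explicit check for None after the lookup is needed.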
If you need to do something more complicated when a field isn’t present, you can combine the “get” method with some additional logic:
username = record.get("Username")
if username is None:
    # Do something more interesting here!
We can attempt to get the value of “Username” from record, but if it’s not available, our username variable will be set to None. We can then check for this using a conditional “if” statement, specifically testing whether username “is” None. In that case, we can implement more logic to determine what the username variable should be set to.
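As one possible shape for that extra logic, here is a minimal sketch that falls back to the part of an email address before the “@” sign. The resolve_username helper and the “Email” field are hypothetical, introduced only for this example, and record is assumed to behave like a plain dictionary:

```python
def resolve_username(record):
    """Return a username for the record, deriving one if the field is missing."""
    username = record.get("Username")
    if username is None:
        # Hypothetical fallback: use the local part of an "Email" field.
        email = record.get("Email")
        if email is not None:
            username = email.split("@")[0]
        else:
            username = "(unknown?)"
    return username


print(resolve_username({"Username": "ada"}))             # ada
print(resolve_username({"Email": "grace@example.com"}))  # grace
print(resolve_username({}))                              # (unknown?)
```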
== checks for equality: whether two things are considered to be equal.
is checks for identity: whether two things are “exactly the same thing”, not just equivalent to each other.
if username is None will only be True if username is exactly, specifically None.
If we just want to know whether or not a field is present, without trying to read its value or assign it to a variable, we can use the in operator:
if "Username" in record:
    # Do something that relies on record["Username"] here...
(This checks to see if the "Username" key is present in the record dictionary, without trying to access it or retrieve the value.)
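A presence check can also tell apart two cases that the “get” method reports identically: a field that is missing, and a field that is present but set to None. The sketch below uses plain dictionaries as stand-ins for records (an assumption), with made-up sample data:

```python
# Plain dicts standing in for records (hypothetical sample data).
present_but_empty = {"Username": None}
missing = {}

# get() returns None in both cases...
print(present_but_empty.get("Username"))  # None
print(missing.get("Username"))            # None

# ...but the in operator distinguishes them.
print("Username" in present_but_empty)    # True
print("Username" in missing)              # False
```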
Sometimes it may be helpful to check for the opposite: If the “Username” field is not present:
if "Username" not in record:
    # Do something if record["Username"] is missing here...
Checking for a field with in and not in is good for two reasons: it's less code than reading the value and then testing it, making our Transform shorter and easier to read, and it more clearly expresses our intentions, which will be helpful to other people who are trying to understand this Transform. (Which may include you, in the future!)
Given what we’ve learned: Could a Transform filter out records without a “Username”?
Yes, see below:
def transform(records):
    for dataset_id, record_id, record in records:
        if "Username" in record:
            yield dataset_id, record_id, record
        # (otherwise, we won't yield the record, filtering it out...)
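To try this filter outside of the platform, the sketch below feeds it a small list of (dataset_id, record_id, record) tuples. The sample IDs and field values are made up, and plain dictionaries are assumed to be an adequate stand-in for real records:

```python
def transform(records):
    for dataset_id, record_id, record in records:
        if "Username" in record:
            yield dataset_id, record_id, record
        # (otherwise, we won't yield the record, filtering it out...)


# Hypothetical sample input: three records, one without a "Username".
sample_records = [
    ("dataset-1", "rec-1", {"Username": "ada"}),
    ("dataset-1", "rec-2", {"Email": "grace@example.com"}),
    ("dataset-1", "rec-3", {"Username": "linus"}),
]

kept = list(transform(sample_records))
kept_ids = [record_id for _, record_id, _ in kept]
print(kept_ids)  # ['rec-1', 'rec-3']
```

Because transform is a generator, nothing is filtered until the results are consumed, which is why the example wraps the call in list().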
Continue reading in part 3 of this series, where we dive into writing to records.