Google Cloud Dataflow 101

Data Intensive Dreamer
5 min read · Dec 10, 2022

Photo by Joshua Sortino on Unsplash

If you are a streaming lover and also love Google Cloud, you are probably already familiar with Dataflow. If that is the case, this article is the right place to recap something you already know. If you’ve lived on the moon for the past five years or are just new to streaming on Google Cloud, read this article :).

A little disclaimer: this article is a brief introduction to the concepts you need to know when approaching Dataflow. As always, the best reference is the official documentation.

Google Cloud Dataflow is a unified stream and batch data processing service that is serverless and fully scalable. You can have massive amounts of historical batch data; that is not a problem! You can have unbounded streaming data coming from multiple sources; that is not a problem!

Google Cloud Dataflow is based on the Apache Beam programming model, which I’ve squeezed into the following image:

As you can see, I’ve divided all the concepts into boxes labeled by difficulty. So let’s start from the beginning.

PCollections

Whatever data you work on in Dataflow is a PCollection. Whether it is a stream of data or a bounded dataset, it is represented as a PCollection.

You can do different things with your PCollection; you can:

  • Take data from sources and sink data to destinations; this is what Pipeline IO is for;
  • Apply a ParDo on it, so, let’s say, map a value that is 1 or 0 to a Boolean type with True and False; any transformation takes a PCollection as input and returns a (transformed) PCollection as output. You can also apply filters that reduce the cardinality of your output PCollection (see the sketch after this list);
  • Apply an Aggregation, say a GroupByKey operation or a CoGroupByKey (on two PCollections), and so on and so forth.
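
To make this a bit more concrete, here is a minimal sketch with the Apache Beam Python SDK that touches all three points; the in-memory source, the sensor keys, and the 1/0 values are made up purely for illustration.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        # Source: in a real job this would be a Pipeline IO read,
        # e.g. beam.io.ReadFromText or beam.io.ReadFromPubSub.
        | "Create" >> beam.Create([("sensor-a", 1), ("sensor-b", 0), ("sensor-a", 1)])
        # ParDo-style transform: map the 1/0 value to a Boolean.
        | "ToBool" >> beam.Map(lambda kv: (kv[0], kv[1] == 1))
        # A filter that reduces the cardinality of the output PCollection.
        | "OnlyTrue" >> beam.Filter(lambda kv: kv[1])
        # Aggregation: group the remaining elements by key.
        | "GroupByKey" >> beam.GroupByKey()
        # Sink: here we just print; a real job would use a Pipeline IO write.
        | "Print" >> beam.Map(print)
    )
```

In a real pipeline, the Create source and the final print would of course be replaced by proper Pipeline IO reads and writes.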

Windowing

Ok, that’s easy stuff, some basic streaming processing. Now we will dive into something a little bit more complex, windowing.

Photo by Mohammed lak on Unsplash

As stated in the official Apache Beam documentation: windowing subdivides a PCollection according to the timestamps of its individual elements. In other words, you are grouping the elements of a PCollection together in time. You can have different types of windows in Dataflow; the famous ones are:

  • Fixed Time Windows are the simplest form of windowing. For example, given a timestamped PCollection, each window might capture all the elements whose timestamps fall into a 60-second interval;
  • Sliding Time Windows also represent time intervals in the data stream; however, sliding time windows can overlap. For example, each window might capture 60 seconds of data, but a new window starts every 30 seconds;
  • Per-Session Windows represent, as the name says, a session. If data arrives after the specified minimum gap duration, a new window is started. All three are sketched in the snippet below.
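
Here is a short sketch of the three window types in the Apache Beam Python SDK; the elements, their timestamps, and the ten-minute session gap are just example values.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.window import TimestampedValue

with beam.Pipeline() as pipeline:
    events = (
        pipeline
        | "Create" >> beam.Create([("key", 1), ("key", 2), ("key", 3)])
        # Attach event timestamps (in seconds) so windowing has something to use.
        | "Stamp" >> beam.Map(lambda kv: TimestampedValue(kv, kv[1] * 20))
    )

    # Fixed windows: every element falls into exactly one 60-second window.
    fixed = events | "Fixed" >> beam.WindowInto(window.FixedWindows(60))

    # Sliding windows: 60 seconds long, with a new window starting every 30 seconds.
    sliding = events | "Sliding" >> beam.WindowInto(window.SlidingWindows(60, 30))

    # Session windows: a gap of more than 10 minutes (an example value) closes the session.
    sessions = events | "Sessions" >> beam.WindowInto(window.Sessions(10 * 60))
```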

I hope the image and the snippet above help you summarize the concept of windowing. The next step is in the direction of triggers.

Triggers

You are grouping events in windows, ok; now what? You need something that fires them, sooooo, triggers :). Triggers let Dataflow determine when to emit the aggregated result of a window.

Beam provides the following pre-built triggers:

  • Event time triggers. These operate on event time, i.e. the timestamp carried by each data element; the default trigger is event-time based.
  • Processing time triggers. These triggers operate on processing time, in other words, the time when the data element is processed in the pipeline.
  • Data-driven triggers. Currently, the only data-driven trigger supported fires after a certain number of data elements have arrived in a window.
  • Composite triggers. These triggers combine multiple triggers in various ways, as in the sketch below.
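
As a rough idea of how these look in the Apache Beam Python SDK, here is a sketch that combines them, assuming the timestamped events PCollection from the windowing snippet above; the 60-second window, 30-second delay, and 100-element count are arbitrary example values.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# Fixed 60-second windows that fire at the watermark (event time), with
# early firings either every 30 seconds of processing time or after 100
# elements, whichever comes first: a composite of the trigger kinds above.
windowed = events | "TriggeredWindow" >> beam.WindowInto(
    window.FixedWindows(60),
    trigger=trigger.AfterWatermark(
        early=trigger.AfterAny(
            trigger.AfterProcessingTime(30),
            trigger.AfterCount(100),
        )
    ),
    accumulation_mode=trigger.AccumulationMode.DISCARDING,
)
```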

Watermarks

Ok, now things are becoming interesting; as you may imagine, as in real life, also in streaming, bad things happen. So there may be some lag between the time an event occurs and the time its actual data elements are received by your pipeline.

In addition, there are no guarantees that data events will appear in your pipeline in the same order they were generated.

That’s why the watermark is an important concept. It is the system’s notion of when all the data in a specific window can be expected to have arrived in the pipeline. Once the watermark progresses past the end of a window, any additional element that arrives with a timestamp in that window is considered late data.

You can use triggers to decide when each window aggregates and reports its results, including how it emits late elements.
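
For example, here is a hedged sketch of how tolerance for late data is configured in the Python SDK, again assuming the events PCollection from above; the two-minute allowed lateness is just an illustrative value.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger
from apache_beam.utils.timestamp import Duration

# Elements arriving up to two minutes after the watermark passes the end of
# their window are still accepted and cause one extra firing per late
# element; anything arriving later than that is dropped.
late_tolerant = events | "LateTolerantWindow" >> beam.WindowInto(
    window.FixedWindows(60),
    trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
    accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
    allowed_lateness=Duration(seconds=120),
)
```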

State and Timers

As stated in the Apache Beam doc: Beam’s windowing and triggering facilities provide a powerful abstraction for grouping and aggregating unbounded input data based on timestamps. However, there are aggregation use cases for which developers may require more control than those provided by windows and triggers. Therefore, Beam provides an API for manually managing per-key state, allowing for fine-grained control over aggregations.

The basic concept is that these APIs let you manage a stateful application exactly as your use case requires. You can specify:

  • Whether to use a ValueState (a scalar state value) or, better, a BagState (a state that accumulates multiple elements);
  • Which timer will emit the aggregated result, either an event-time timer or a processing-time timer;

But the most powerful option you have here is to implement your own data-driven trigger logic. So, for example, you can start a timer (a very short one) when a particular event arrives in the processing pipeline.

Imagine a use case where you have to group all the events arriving for a specific key and emit the results only when a “finished” event arrives; this is the perfect use case for state and timers.
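
Below is a rough sketch of that use case with the Beam Python state and timer API; the (key, event) element shape and the literal 'finished' marker are assumptions made for illustration, not part of any real schema.

```python
import apache_beam as beam
from apache_beam.coders import FastPrimitivesCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.utils.timestamp import Duration, Timestamp


class EmitOnFinish(beam.DoFn):
    """Buffers events per key and emits them shortly after a 'finished' event."""

    # BagState accumulates every event seen for the key.
    BUFFER = BagStateSpec('buffer', FastPrimitivesCoder())
    # A processing-time timer used as a very short fuse.
    FLUSH = TimerSpec('flush', TimeDomain.REAL_TIME)

    def process(self,
                element,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        key, event = element
        buffer.add(event)
        if event == 'finished':
            # Data-driven behaviour: arm a short timer when the closing
            # event for this key arrives.
            flush.set(Timestamp.now() + Duration(seconds=1))

    @on_timer(FLUSH)
    def flush_buffer(self, buffer=beam.DoFn.StateParam(BUFFER)):
        # Emit everything accumulated for the key, then reset the state.
        events = list(buffer.read())
        buffer.clear()
        yield events
```

You would apply it with beam.ParDo(EmitOnFinish()) to a keyed PCollection of (key, event) pairs.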

Final Notes

I truly hope you liked this article :). If you did, please like it!

Data Intensive Dreamer

A dreamer in love with data engineering and streaming data pipeline development.