Apache Beam and the Art of Scalable Pipelines

When I joined the Google project at Wizeline I had used Spark and a bit of Flink, but Beam was new to me. A year later I have strong opinions. Here’s the executive summary.

The unified model is the real value

Beam’s selling point — write once, run on any runner — sounds like marketing. In practice it’s genuinely useful. We developed and tested pipelines locally on the DirectRunner, then promoted them to Dataflow without changing a line of pipeline code. That feedback loop is fast.

Windows and triggers are hard

Handling late data correctly is where most Beam projects go wrong. The API is expressive but the mental model takes time to internalise. The key insight: windows are labels, triggers decide when to emit. Once that clicked, everything else followed.

(
    events
    | "Window" >> beam.WindowInto(
        beam.window.SlidingWindows(size=60, period=10),
        trigger=AfterWatermark(
            late=AfterCount(1)
        ),
        accumulation_mode=AccumulationMode.DISCARDING,
    )
    | "Sum" >> beam.CombinePerKey(sum)
)

Testing is your best friend

The DirectRunner makes unit testing pipelines trivial. I wrote tests for every transform. When a production job failed at scale, the bug was almost always in business logic that a test would have caught — not in the framework.

The takeaway

Beam rewards patience. The abstractions are genuinely elegant once you stop fighting them. If you’re building data pipelines that need to scale to petabytes and run on managed infrastructure, it’s worth the investment.