Workflow Orchestration - Part 2 (Why do I care?)
An increasingly distributed and fragile world
Workflow platforms are important because software engineers are increasingly adopting distributed systems in their architectures. There are two reasons for this change: 1) users are demanding more frequent releases, more features, better performance, and higher availability; 2) providers are increasingly moving away from “use our library” (Spring Framework) to “use our APIs” (AWS, Azure, and GCP).
This change is undoubtedly a good thing; however, it also introduces new problems. It is much harder to trace program execution in a distributed system. A business process can span multiple services, created by multiple teams, in a variety of programming languages. There are more ways for things to fail, less consistency in code quality and documentation, and it’s harder to understand what happens when things go wrong.
Some hypothetical situations where a workflow platform is useful are:
Ingesting and processing data
Victoria is a data scientist working on deploying a machine learning model to production. She needs to deploy the trained model to live servers, as well as an automated process to ingest new data and re-train the model. This data pipeline is complicated and fragile. This use case was explored in part 1 of this series.
Running automated functional tests
Janet is a DevOps engineer. Her colleagues recently came to her with a thorny problem. “Can we have a copy of the server spun up after code is checked in on a feature branch, then run functional tests against that server if it builds successfully, and finally send a notification to everyone after the tests are run? Can you have it time out and clean itself up after 20 minutes? Oh yeah, it’d be great if we could test the entire process on our local machines.”
The team points to what another team has built, which turns out to be a jumble of cron jobs, RabbitMQ workers, email alerts, and Jenkins Pipelines, each in its own code repository. Naturally, each thing uses a different branching process. Can Janet do better?
Yes. Janet can define the requirements as a single workflow:
- Spin up a machine (the platform handles concurrency and retries)
- Run functional tests, clean up the infrastructure, and send notifications (the platform handles timeouts)
- Test the workflow as a unit test before deployment
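To make the steps above concrete, here is a rough sketch in plain Python of what that single, unit-testable workflow might look like. This is not Temporal or any real SDK – the retry helper and the step names are invented for illustration, and a real platform would supply the retry, timeout, and notification machinery for you.

```python
import time

def with_retries(step, attempts=3, delay=0.1):
    """Re-run a step until it succeeds or attempts are exhausted.
    A workflow platform would provide this (plus timeouts) for free."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)

def feature_branch_workflow():
    """The whole process, expressed as one function that can be
    exercised as an ordinary unit test on a local machine."""
    events = []
    with_retries(lambda: events.append("spin_up_server"))
    with_retries(lambda: events.append("run_functional_tests"))
    events.append("tear_down_infrastructure")
    events.append("notify_team")
    return events
```

Because the workflow is just code, “test the workflow as a unit test” means calling `feature_branch_workflow()` in a test and asserting on the recorded steps.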
Processing financial transactions
Joshua is part of a pretty successful startup. Their online business is bringing in $10M ARR through customer purchases. However, there are occasional hiccups with a few microservices that require manual intervention. The customer service team spends about 10% of their time fixing payment problems. Recently, the team discovered that they had been unknowingly charging a former customer $500 a month for the past six months. The problem was caught, a refund was issued, and an apology was sent to the customer. Joshua realized they need to tighten up their transaction workflows: execute transactions across multiple services in a truly fault-resilient way, and replace human intervention with automation.
Can a workflow platform help? Yes. Instead of building retries, error handling, logging, and tracing into every microservice, the orchestration between microservices is done on a workflow platform and unit tested during CI/CD. Even human interactions are included – a missed action on an important problem can be logged and escalated just like a service outage.
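For multi-service transactions like Joshua’s, one common technique workflow platforms support is the saga pattern: each step carries a compensating action that is run if a later step fails. Below is a minimal, hypothetical sketch in plain Python – the service steps are invented, and a real platform would also persist state and retry the compensations rather than run them best-effort.

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order; if any action fails,
    undo the completed steps in reverse order, then re-raise."""
    compensations = []
    try:
        for action, compensate in steps:
            action()
            compensations.append(compensate)
    except Exception:
        for compensate in reversed(compensations):
            compensate()  # a real platform would retry and log these too
        raise

log = []

def shipping_is_down():
    raise RuntimeError("shipping service down")

steps = [
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge card"),       lambda: log.append("refund card")),
    (shipping_is_down,                        lambda: log.append("cancel shipment")),
]

try:
    run_saga(steps)
except RuntimeError:
    log.append("escalate to a human")
```

When the shipping step fails, the charge is refunded and the inventory released automatically – the kind of cleanup Joshua’s customer service team was doing by hand.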
Using Temporal to outsource operational concerns
These hypotheticals are imperfect, but I think they get to the point. In these scenarios, the team inevitably ends up implementing retries, error notifications, logging, message queues, timeouts, heartbeats, the circuit breaker pattern, health metrics, maybe even a simplistic UI to look at workflow histories and debug errors. That’s a lot to think about, and a lot of operational “fluff” for every single team to implement on their own.
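As one example of that “fluff”, here is roughly what a hand-rolled circuit breaker looks like in plain Python. This is a naive sketch of the pattern, not any particular library’s implementation – and it’s exactly the kind of code every team ends up writing and maintaining themselves without a platform.

```python
import time

class CircuitBreaker:
    """Naive circuit breaker: after max_failures consecutive failures,
    reject calls for reset_after seconds before allowing a trial call."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

And this sketch still ignores per-endpoint state, metrics, and concurrency – multiply it by retries, timeouts, heartbeats, and tracing, and the “fluff” dwarfs the business logic.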
Your team just wants to solve problems – release a new feature, automate manual work, or scale the system to handle more traffic. Yet they spend a lot of their energy on operational “fluff”. Wouldn’t it be great if these concerns could be outsourced so the team could focus on shipping software?
When I was asked to work with AWS Step Functions last year, I started to research alternatives and to understand the orchestration field as a whole. Step Functions was painful to maintain and debug. It promised simplicity with a JSON-based programming language, but it felt like the same unfulfilled promise XML once made. Finally, the disconnect between the State Machine definition and the Activity code makes understanding the program very challenging.
I knew there had to be something better. Looking across the landscape, some of the alternatives are: AWS Simple Workflow Service (SWF), Netflix Conductor, Apache Airflow, and Temporal. Out of them, I liked Temporal the most.
Temporal (pronounced like “tempura”, not “temperature”) is a platform and framework for building fault-tolerant “workflows”. For simplicity, one can think of any non-trivial computer program as a workflow. Temporal provides the SDK for programming these workflows and the platform they run on.
Unlike Step Functions and Conductor, the entire Temporal Workflow is defined by code, which makes the behavior of the whole program easier to understand. Moreover, users of the Go or Java SDKs can take advantage of the compiler to catch a wide array of errors and to maintain quality in a large team. The code is then guaranteed to execute as specified on the Temporal Platform. All of the operational stuff – retries, recovery, persistent state, message queues, scaling, work distribution, tracing, and monitoring – is handled by the platform itself. This really is amazing stuff! If C simplified writing portable code, GC simplified memory management, and serverless simplified scalable architectures, then Temporal simplifies implementing a computer program on top of a fragile, distributed system.
I thought I’d demo examples of Temporal code today. I was wrong – this article turned out to be longer than I expected. In the final part of this series, we’ll dive into Temporal with some hands-on code.