I briefly worked on a workflow system at
$past_job, implemented using AWS Step Functions. My experience was pretty
terrible. I wasn’t sure which technical requirements led the team to this system. Some people said we needed a “system
that is configuration driven and not code driven” and some people said “we needed something that scales.” Whatever the
reason was, making improvements to this system was a pain in the ass, with AWS Step Functions itself being somewhat
What is a configuration-driven system? It is a system where changes can be described concisely and the effects of a
change can be easily understood. Ideally code changes, which may have wide-reaching consequences, are rarely
necessary. However, either because of poorly understood requirements or rushed delivery, the code that underlies this
system needed to be changed frequently. There was actually very little “configuration” in this system: mostly
API endpoints and credentials, which almost never changed.
The AWS state machine definition, written in Amazon States Language, is a domain-specific language (DSL). When a workflow changed, the DSL needed to be
changed. That is a code change, not a configuration change. For this project a change-set (for a user story) often
implement the workflow. But that’s a different problem and we won’t talk about that. Anyway, most changes required
someone to be familiar with both the DSL and the underlying programming languages. That’s actually a pretty big
problem. Most engineers have a hard time being proficient at a single programming language. Combine that with an
AWS-specific DSL and you end up with a lot of risk. Unsurprisingly, this system ended up quite fragile and bug-ridden.
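To make that coupling concrete, here is a minimal sketch, not the actual system. It pairs a small Amazon States Language definition (written as a Python dict) with the Lambda handler behind one of its states. The state names, the ARN, the `valid` field, and the handler are all hypothetical, invented purely for illustration.

```python
import json

# Hypothetical state machine definition in Amazon States Language (ASL).
# The account ID, function name, and state names are made up.
DEFINITION = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            "Next": "IsValid",
        },
        "IsValid": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.valid", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "Rejected",
        },
        "Done": {"Type": "Succeed"},
        "Rejected": {"Type": "Fail", "Error": "InvalidOrder"},
    },
}


def validate_order_handler(event, context):
    """Hypothetical Lambda handler behind the ValidateOrder state."""
    # The Choice state reads "$.valid", so this key name is an implicit
    # contract between the Python code and the DSL. Rename it in one place
    # only and the workflow silently takes the Default branch.
    return {**event, "valid": bool(event.get("items"))}


if __name__ == "__main__":
    print(json.dumps(DEFINITION, indent=2))
```

Nothing ties the `"$.valid"` path in the definition to the `"valid"` key in the handler except the person making the change, which is why a single user story tended to touch both.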
Creating a deployment was painful. Recall that our changes generally affected both the state machine and underlying
code. This meant changes generally required updating both the underlying Lambda functions and the state machine definition. This
required learning a specialized deployment framework. Sometimes, we also needed to coordinate changes to the underlying
infrastructure (e.g., databases, S3, queues). Suddenly, you need to understand a pretty complex system and AWS-specific
tooling to work on a rather straightforward problem.
Testing changes was difficult. We tested our AWS Step Functions by deploying the changes into a “QA” AWS environment. At
least with other frameworks like Cadence, you can test things locally. Unlike with Cadence, your pile of spaghetti AWS
Lambda definitions and DSL can’t be validated statically before you even deploy any code. Not to mention it’s not easy to
run automated integration tests. I guess the advantage is that AWS manages the infrastructure and scale. But unless you really
need that scale, is it really worth committing to their framework? I just don’t buy it.
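Part of this gap can be narrowed with a hand-rolled check that runs before anything is deployed. The following is a minimal sketch, assuming the state machine definition is available as parsed JSON. The function name and the sample definition are hypothetical, and this is not something we had on the project. It only catches transitions that point at states that do not exist. It does not descend into Parallel or Map states, and it says nothing about whether the Lambdas behave correctly.

```python
def find_dangling_transitions(definition):
    """Return (source, target) pairs whose target is not a defined state."""
    states = definition.get("States", {})
    problems = []

    # The entry point must name a real state.
    start = definition.get("StartAt")
    if start not in states:
        problems.append(("StartAt", start))

    for name, state in states.items():
        targets = []
        if "Next" in state:
            targets.append(state["Next"])
        if "Default" in state:  # fallback branch of a Choice state
            targets.append(state["Default"])
        for rule in state.get("Choices", []):
            if "Next" in rule:
                targets.append(rule["Next"])
        for catcher in state.get("Catch", []):
            if "Next" in catcher:
                targets.append(catcher["Next"])

        for target in targets:
            if target not in states:
                problems.append((name, target))

    return problems


# Hypothetical definition with a typo-style bug: "IsValid" is never defined.
broken = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            "Next": "IsValid",
        }
    },
}

if __name__ == "__main__":
    print(find_dangling_transitions(broken))
```

Even with a check like this in CI, the interesting failures (wrong payload shapes, wrong branch taken) still only showed up after deploying to the QA environment.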
AWS Step Functions felt like an unwieldy solution to the problem we were trying to solve and almost certainly slowed
down our delivery speed. It might have worked out well if its use had been planned better. But in this particular project it
was a liability.
Other workflow systems