In this article, we share Open Climate Fix’s decision-making process around selecting Airflow as a data orchestrator and why and how we’re managing it ourselves.
Instead of paying for a managed Airflow instance at a cool ~$400 a month, we weighed the costs and decided to deploy Airflow ourselves on AWS Elastic Beanstalk for ~$30 a month.
Benefits of an orchestrator
Airflow and other data orchestrators are designed to make data flows easier to run and more robust. Tools like Dagster and Prefect provide a pleasant UI where you can see the business logic, trigger new runs easily, and see what might have gone wrong. Throw in a bit of version control, and you’ve got a production-ready system.
Why not just use managed Airflow?
At Open Climate Fix we’ve found that there’s business value in understanding the problem we’re trying to solve with a new technology. Just because a solution’s trendy and everybody’s using it doesn’t mean it’s an appropriate solution for Open Climate Fix. And we want to avoid paying for bells and whistles that fix problems we don’t have.
Managed instances such as GCP Cloud Composer or AWS MWAA provide an infrastructure-free route to orchestration, removing the need to invest developer time and enabling interoperability with other services. Business-critical workflows can be confidently offloaded to these instances because of their focus on resilience and scalability, and because they define security scopes using the cloud provider’s existing IAM tools. All this comes at a cost far lower than the salary of a dedicated DevOps engineer, even one who would spend only a few days a month managing the service. Clearly, for a large majority of users, deploying a self-managed Airflow instance is not going to be the best option when compared with a cloud-managed solution.
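As a rough illustration of this trade-off, the sketch below takes the approximate monthly figures quoted at the top of this article (~$400 managed, ~$30 self-hosted) and estimates how many hours of maintenance a month self-hosting can absorb before it stops being cheaper. The hourly rate is an assumption chosen purely for illustration; it is not an Open Climate Fix figure.

```python
# Approximate monthly figures quoted in this article (USD).
MANAGED_COST_PER_MONTH = 400
SELF_HOSTED_COST_PER_MONTH = 30

# Assumption for illustration only: the cost of one hour of developer time.
ASSUMED_HOURLY_RATE = 50


def monthly_saving(managed: float, self_hosted: float) -> float:
    """Infrastructure saving from self-hosting, before any maintenance time."""
    return managed - self_hosted


def break_even_hours(managed: float, self_hosted: float, hourly_rate: float) -> float:
    """Maintenance hours per month at which self-hosting stops being cheaper."""
    return monthly_saving(managed, self_hosted) / hourly_rate


saving = monthly_saving(MANAGED_COST_PER_MONTH, SELF_HOSTED_COST_PER_MONTH)
hours = break_even_hours(
    MANAGED_COST_PER_MONTH, SELF_HOSTED_COST_PER_MONTH, ASSUMED_HOURLY_RATE
)
print(f"Saving: ~${saving}/month (~${saving * 12}/year)")
print(f"Break-even maintenance time: {hours:.1f} hours/month")
```

With these inputs the saving is about $370 a month, so under the assumed rate self-hosting only pays off while upkeep stays below roughly 7.4 hours a month. A different hourly rate moves that threshold accordingly.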
Weighing up the options
In the pre-Airflow era, we handled orchestration using ECS task scheduling. ECS task scheduling makes it simple to set off cron jobs, and it definitely got Open Climate Fix started. It also has no cost on top of the compute that has to be paid for anyway. However, as the number and business importance of those tasks increase, cron jobs can quickly become an inadequate solution, especially when something fails. Diagnosing an error in a workflow spanning multiple tasks is tricky: without a centralised place to view the workflow graph, it is difficult to find the point of failure and the reason for it.
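To make the point about the workflow graph concrete, here is a minimal sketch of the bookkeeping an orchestrator does and independent cron jobs do not: given the dependencies between tasks, it can tell which downstream tasks are blocked by a single failure. The task names are hypothetical, and this uses Python's standard `graphlib` module rather than anything from Airflow itself.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "fetch_satellite": set(),
    "fetch_weather": set(),
    "run_forecast": {"fetch_satellite", "fetch_weather"},
    "publish_forecast": {"run_forecast"},
}


def blocked_by(failed: str, workflow: dict[str, set[str]]) -> list[str]:
    """Return the tasks that cannot run because `failed` did not succeed."""
    blocked: set[str] = set()
    # Visit tasks in dependency order so upstream tasks are always seen first.
    for task in TopologicalSorter(workflow).static_order():
        if workflow[task] & ({failed} | blocked):
            blocked.add(task)
    return sorted(blocked)


print(blocked_by("fetch_weather", WORKFLOW))
```

With separate cron jobs, each downstream job would simply run on its own timer and fail, or silently use stale data, leaving a person to reconstruct this chain by hand.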
An orchestrator seemed like the logical next step for Open Climate Fix. Amongst the many options available, Airflow seemed the sensible choice. It is an open source orchestration solution which has been in active use for almost 10 years. Its longevity and community support give it a leg up on its competitors, and, crucially, it can run AWS ECS tasks, where our jobs are defined already. 
So what did we do?
We created a docker compose file with the following services:
Airflow scheduler: This keeps track of all the jobs, and runs new jobs when they are needed.
Airflow server: The front-end UI for Airflow. Without this, there would not be much point, as the UI is a key feature in allowing Airflow to save us time.
S3-sync: We sync an S3 bucket with a local folder. The tasks (or DAGs) can then be uploaded to S3 and used straight away in Airflow.
Airflow init: This sets up various things Airflow needs before it is ready to use, such as database tables.
Database: Originally we had a local Postgres instance for Airflow to use. We later switched to an AWS RDS instance we had set up already. Using RDS increases stability and decreases dependence on this docker compose file.
Trick: The DAGs and logs need to be accessible by all parts of the docker-compose. This took ages (2 days) to do, and we got very frustrated by `write permission errors`.
We created about ten DAGs (and growing) which are used to provide our 24/7 solar generation forecast for National Grid ESO. Some of these DAGs run every 5 minutes, some run once a day.
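As a minimal sketch of what such a DAG could look like, the example below defines one job on a 5-minute schedule and one daily job, each launching an ECS task. It is not our actual code: the DAG ids, cluster name, task definition names and the 03:00 daily run time are placeholders. It assumes Airflow 2.4 or later (for the `schedule` argument) and a version of the Amazon provider package that includes `EcsRunTaskOperator`.

```python
from datetime import datetime

# Hypothetical jobs: DAG ids, schedules and ECS task definition names are
# placeholders, not Open Climate Fix's real configuration.
JOBS = [
    {
        "dag_id": "forecast_every_5_minutes",
        "schedule": "*/5 * * * *",  # every 5 minutes
        "task_definition": "forecast",
    },
    {
        "dag_id": "daily_housekeeping",
        "schedule": "0 3 * * *",  # once a day, at 03:00
        "task_definition": "housekeeping",
    },
]


def make_dag(job: dict):
    """Build one DAG that launches an already-defined ECS task on a schedule."""
    # Imported inside the function so the job table above can be read and
    # checked on a machine that does not have Airflow installed.
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

    with DAG(
        dag_id=job["dag_id"],
        schedule=job["schedule"],  # cron expression (Airflow 2.4+ argument)
        start_date=datetime(2023, 1, 1),
        catchup=False,  # do not back-fill runs missed in the past
    ) as dag:
        EcsRunTaskOperator(
            task_id="run_ecs_task",
            cluster="example-cluster",  # assumed cluster name
            task_definition=job["task_definition"],
            overrides={"containerOverrides": []},  # run the task unchanged
        )
    return dag
```

For Airflow to discover the DAGs, the file in the synced DAG folder would also need to bind them to module-level names, for example with `for job in JOBS: globals()[job["dag_id"]] = make_dag(job)`. Depending on the ECS launch type, the operator may need further arguments such as `launch_type` and `network_configuration`.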
How did we deploy it?
At Open Climate Fix we like Terraform, and in fact we use Terraform Cloud. This makes version control easy and removes any dependency on locally installed Terraform versions. We’ve set up the following:
S3 bucket, for docker compose file and DAG files
An Elastic Beanstalk Application: this is the main thing
IAM roles: These let the application run, and also allow it to kick off ECS tasks
Secrets: Store a few Airflow secrets
For some companies the AWS MWAA managed service for Airflow can be set up very quickly and will be the appropriate choice. However, for companies trying to keep their AWS overhead costs low, and with some experience of coordinating services in the cloud, or just a willingness to learn, taking Airflow deployment into your own hands could prove effective and make good financial sense.
Thanks for reading. If you would like to find out more, you can see what we have done on GitHub, or get in contact if you have any questions.