Amplify Operations With Apache Airflow

Apache Airflow is turning heads these days. It integrates with many different systems, and it is quickly becoming as full-featured as anything that has been around for workflow management over the last 30 years. Much of that comes from its hundreds of operators, which handle tasks such as executing Bash scripts, executing Hadoop jobs, and querying data sources with SQL.

It is important to understand the intent behind the creation of Airflow. If you check out the Airflow website, you will get a very simple explanation of what it is: “A platform to programmatically author, schedule, and monitor workflows.” Airflow is not just a scheduler or an ETL tool, and it is critical to appreciate why it was created so you can determine how it can best be used.

Airflow has gained rapid popularity for its flexibility, for how simple it is to extend, and at least in part because it plugs into Kubernetes (k8s). But if you were to summarize the appeal of Airflow in one word, it would be “programmatic,” because being able to program a workflow is a big deal. This enables far simpler integration than many other tools in the space. Airflow models a workflow as a directed acyclic graph (DAG): tasks are the nodes, dependencies are one-way edges between them, and no path ever loops back on itself, so every workflow runs from start to finish.
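To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard-library `graphlib` module (Python 3.9 or later) rather than Airflow's own API. The task names (`extract`, `transform`, `load`, `report`) and the helper functions are hypothetical. The sketch shows only that one-way dependencies always yield a valid run order, and that a loop is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dag):
    """Return one valid execution order for the tasks in the graph."""
    return list(TopologicalSorter(dag).static_order())


def has_cycle(dag):
    """Return True if the graph contains a loop and so is not a DAG."""
    try:
        run_order(dag)
    except CycleError:
        return True
    return False


order = run_order(workflow)
print(order)

# Making "extract" depend on "report" closes a loop, so no run order exists.
looped = {**workflow, "extract": {"report"}}
print(has_cycle(looped))
```

In Airflow itself the same structure is declared in Python code, which is what makes a workflow "programmatic": the dependency graph is ordinary data that a program can build, inspect, and validate.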

Plugging into k8s enables Airflow to be operated and scheduled by k8s. More complex workflows often require identifying the hardware on which a workflow should run. This is also appealing to many administrators because k8s has effectively become the data center resource manager of choice.

Airflow also delivers on the promise of portability. It supports running workflows for testing purposes, as well as locally in a non-distributed mode. This is ideal for users, enabling them to create their workflows locally and then push them to another environment when they are ready. The only change is the executor used to run the workflow, so a workflow developed locally can later be run with Kubernetes.
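The "only the executor changes" idea can be illustrated by analogy with Python's standard-library `concurrent.futures`. This is not Airflow code: `InlineExecutor`, `run_workflow`, and the three step functions are hypothetical names invented for the sketch. The same workflow definition is run twice, once in-process and once on a thread pool, and nothing in the workflow itself changes.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class InlineExecutor(Executor):
    """Hypothetical local executor: runs each task immediately, in-process."""

    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


# Hypothetical workflow steps; each receives the previous step's result.
def extract(_):
    return [3, 1, 2]


def transform(rows):
    return sorted(rows)


def load(rows):
    return {"loaded": rows}


STEPS = [extract, transform, load]


def run_workflow(executor):
    """Run the steps in order on whichever executor is supplied."""
    result = None
    for step in STEPS:
        result = executor.submit(step, result).result()
    return result


# Same workflow, two executors: only the argument differs.
local_result = run_workflow(InlineExecutor())
with ThreadPoolExecutor(max_workers=2) as pool:
    pooled_result = run_workflow(pool)

print(local_result == pooled_result)
```

In Airflow the switch is made in configuration rather than in code: the `executor` setting in the `[core]` section of `airflow.cfg`, or the `AIRFLOW__CORE__EXECUTOR` environment variable, selects which executor runs the workflow.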

Workflow management is a critical piece of the infrastructure within most organizations, especially if a variety of systems are needed to operate the daily business, or if anything requires multiple steps to accomplish. It is vital that a workflow management tool has a barrier to entry that is not too high, as a high barrier can impede adoption and overall success. In addition, the tool must not only be flexible for those creating the workflows but also help reduce the complexity of managing and operating workflows on a daily basis. Airflow helps here with a well-refined user interface that provides a variety of views: a complete overview, a tree view that spans time and shows blocking steps, the traditional directed graph view, a view that provides all the variables used by the workflows, and even a Gantt chart.

One of the bigger questions that comes up these days is which workflow tool is the right one for the job. Airflow is a general-purpose workflow tool with immense flexibility. That flexibility comes with a trade-off, however: a general-purpose tool will not fit a specific domain as closely as a purpose-built one.

Consider Kubeflow, which is winning hearts and minds and is off-the-charts in its adoption rates. Technically speaking, it is a workflow tool, but it is purpose-built to simplify machine learning workflows. The moment a workflow in Airflow starts to look as if it is being used for machine learning purposes, Kubeflow should be considered. They both support distributed workflow management with Kubernetes, which means that they will play nice without fighting each other. For the data scientist, however, there is the benefit of their specific toolset integrations working the way they expect, without the hoops that come with a generic workflow management tool.

While Airflow is turning heads and may offer a plethora of benefits, treat it like any other new technology. Identify a proper use case, test it out, get comfortable with it, and learn its intricacies. Many use cases and documented case studies have been made available for public consumption, and I encourage anyone interested in Airflow to read them. It is still young by all measures, but the promise is strong.

Jim Scott, VP of enterprise architecture at MapR (www.mapr.com), is the co-founder of the Chicago Hadoop Users Group (CHUG), where he coordinates the Chicago Hadoop community.


