Session
Too Big for DAG Factories?
You’re working on a project that needs to aggregate petabytes of data, and it doesn’t make sense to manually hard-code thousands of tables, DAGs (Directed Acyclic Graphs) and pipelines. How can you transform, optimize and scale your data workflow? Developers around the world (especially those who love Python) are using Apache Airflow — a platform created by the community to programmatically author, schedule and monitor workflows without limiting the scope of your pipelines.
In this talk, we’ll review use cases, and you’ll learn best practices for how to:
- use Airflow to transfer data, manage your infrastructure and more;
- implement Airflow in practical use cases, including as a:
  - workflow controller for ETL pipelines loading big data;
  - scheduler for a manufacturing process; and/or
  - batch process coordinator for any type of enterprise;
- scale and dynamically generate thousands of DAGs from JSON configuration files (see the sketch after this list);
- automate the release of both DAG and infrastructure updates via a CI/CD pipeline;
- run all tasks simultaneously using Airflow.
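To make the dynamic-DAG-factory idea concrete, here is a minimal sketch of the pattern: each JSON file in a configs/ directory describes one DAG, and a loop at module level registers one DAG object per file so the Airflow scheduler discovers them all. The configs/ layout, the config keys, and the load_table() callable are illustrative assumptions rather than the talk's actual code, and the sketch assumes Airflow 2.4+.

```python
# Minimal sketch of a dynamic DAG factory (assumes Airflow 2.4+).
# Each JSON file in configs/ describes one DAG, e.g.:
#   {"dag_id": "load_sales", "schedule": "@daily", "tables": ["orders", "customers"]}
# The configs/ layout, keys, and load_table() are illustrative assumptions.
import json
from pathlib import Path

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator


def load_table(table_name, **context):
    """Placeholder task body; swap in real extract/load logic."""
    print(f"Loading table {table_name}")


def create_dag(config):
    """Build one DAG object from a parsed JSON config."""
    dag = DAG(
        dag_id=config["dag_id"],
        schedule=config.get("schedule", "@daily"),
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        catchup=False,
    )
    with dag:
        # One independent task per table, so they can run in parallel.
        for table in config.get("tables", []):
            PythonOperator(
                task_id=f"load_{table}",
                python_callable=load_table,
                op_kwargs={"table_name": table},
            )
    return dag


# Register every generated DAG in the module namespace so the
# Airflow scheduler discovers each one when it parses this file.
for path in Path(__file__).parent.joinpath("configs").glob("*.json"):
    config = json.loads(path.read_text())
    globals()[config["dag_id"]] = create_dag(config)
```

The key design point is the final loop: Airflow only registers DAG objects it finds at the top level of a module, so the factory must publish each generated DAG into globals() (or otherwise expose it at module level) for the scheduler to see it.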
Both beginner and intermediate developers will benefit from this talk, which is ideal for anyone wanting to learn how to use Airflow to manage big data. Beginners will learn about dynamic DAG factories, and intermediate developers will learn how to scale DAG factories to thousands of DAGs, something Airflow can't do out of the box.
After this talk and live demo, you'll leave with best practices (and access to a code repo) that let you scale to thousands of DAGs and spend more time having fun with big data.