3 Surprising Use-cases for Branching in Airflow you’ve not seen before

Your Data Pipelines can have as many branches as this nice tree. Photo by Andrew Svk on Unsplash

Branching: conditionality is an important feature of many DAGs.

Introduction

How often is it that you’re writing a data pipeline and wish you could do something contingently: something that only happens if a set of conditions is satisfied? Hopefully, not that often! Airflow has supported this type of functionality via the BranchPythonOperator. Many other workflow orchestration tools have followed suit. Prefect has conditional flows, Dagster has DynamicOutput, and in Orchestra we facilitate branching based on status.

This leads us to the most important question: why? Why bother at all with branching, thereby making your pipeline more complicated than it needs to be? We’ll see there are actually some pretty incredible use-cases, especially for folks who are looking for a greater amount of automation in their lives.

A quick example of Branching in Airflow

Before diving into use-cases, we’ll use the code below as a reference so we can understand how branching works in practice.

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
from datetime import datetime


def choose_branch(**kwargs):
    # Pull the value returned by the check_value task from XCom
    value = kwargs['ti'].xcom_pull(task_ids='check_value')
    if value > 10:
        return 'path_a'
    else:
        return 'path_b'


default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

dag = DAG('example_branching', default_args=default_args, schedule_interval='@daily')

start = DummyOperator(task_id='start', dag=dag)

check_value = PythonOperator(
    task_id='check_value',
    python_callable=lambda: 15,  # Example condition value; pushed to XCom as the return value
    dag=dag,
)

branch_task = BranchPythonOperator(
    task_id='branch_task',
    provide_context=True,  # needed on Airflow 1.x; deprecated/ignored on Airflow 2+
    python_callable=choose_branch,
    dag=dag,
)

path_a = DummyOperator(task_id='path_a', dag=dag)
path_b = DummyOperator(task_id='path_b', dag=dag)

# One branch is always skipped, which would skip `end` under the default
# all_success trigger rule, so relax it to run once the chosen branch finishes.
end = DummyOperator(task_id='end', dag=dag, trigger_rule='none_failed')

start >> check_value >> branch_task >> [path_a, path_b] >> end

The choose_branch function returns a different task_id depending on a value that is stored in XCom (a temporary data store for passing small values between tasks). The branch_task is a separate task that invokes a Python callable (in this case the choose_branch function). By defining the tasks path_a and path_b, and adding them as the possible downstream paths (in list format) of branch_task, Airflow knows which branch to follow based on the branching logic.

Automating Model Training and Deployment

Branching is really powerful in the Machine Learning and Data Science world. Suppose you have a Machine Learning model that needs to be trained every week, because every week yo...
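The article is truncated at this point, but to make the idea concrete, here is a minimal sketch of what a weekly "evaluate, then either deploy or retrain" branch could look like. The task names, the accuracy threshold, and the evaluate/deploy/retrain callables are illustrative assumptions, not taken from the article itself.

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator
from datetime import datetime


def evaluate_model(**kwargs):
    # Placeholder: in practice this would score the current model on fresh data
    return 0.87  # pushed to XCom as the task's return value


def decide_next_step(**kwargs):
    accuracy = kwargs['ti'].xcom_pull(task_ids='evaluate_model')
    # Hypothetical rule: redeploy if the model still performs well, otherwise retrain
    return 'deploy_model' if accuracy >= 0.85 else 'retrain_model'


with DAG('weekly_model_branch', start_date=datetime(2023, 1, 1),
         schedule_interval='@weekly', catchup=False) as dag:
    evaluate = PythonOperator(task_id='evaluate_model', python_callable=evaluate_model)
    branch = BranchPythonOperator(task_id='decide_next_step', python_callable=decide_next_step)
    deploy = PythonOperator(task_id='deploy_model', python_callable=lambda: print('deploying'))
    retrain = PythonOperator(task_id='retrain_model', python_callable=lambda: print('retraining'))

    evaluate >> branch >> [deploy, retrain]

Only the branch returned by decide_next_step runs each week; the other task is skipped, which is exactly the behaviour the article's simpler example demonstrates.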
Azizi Othman’s Post
More Relevant Posts
-
Week 3 of #mlopszoomcamp focused on doing orchestration with #Mage from mage.ai. This is an open-source data transformation framework which aims to make constructing complex workflows easy. I learned a lot this week about building robust ETL data pipelines, making them observable and testable, as well as deploying and monitoring models.

Mage is really intuitive to use, thanks to a nice UI and an obvious mapping between the objects in the UI and the files on disk. It's also very easy to install and set up. I also liked the straightforward way one can pass data from one stage of the pipeline to the next without needing to serialise/deserialise between stages. For homework we implemented a simple pipeline in Mage, starting from loading and cleaning data to training a simple linear model and logging everything in #MLFlow. You can find HW3 here: https://lnkd.in/dfbfxuUq

To get a feeling of how it compares to other tools, I also constructed the homework in #Airflow. Building the actual pipeline was also fairly easy (but actually configuring Airflow was not). I found the Airflow UI to be more spartan than Mage, but that's not necessarily a big limitation, given that the focus of Airflow is creating Python DAGs. You can find the Airflow version here: https://lnkd.in/dut2MyRy

Overall, I think I have a slight preference for Mage but I'm sure I'll continue to use both tools in the future.
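For context (not part of the original post): the closest Airflow analogue to Mage's block-to-block data passing is the TaskFlow API, where plain return values are handed between tasks and Airflow moves them via XCom behind the scenes. A minimal sketch, assuming Airflow 2.x; the pipeline and task names are illustrative, not the homework's.

from airflow.decorators import dag, task
from datetime import datetime


@dag(schedule_interval=None, start_date=datetime(2023, 1, 1), catchup=False)
def example_pipeline():
    @task
    def load_data():
        # Placeholder for reading raw records
        return [{"distance": 2.5}, {"distance": 7.1}]

    @task
    def clean_data(rows):
        return [r for r in rows if r["distance"] > 0]

    @task
    def train_model(rows):
        print(f"training on {len(rows)} rows")

    # Return values flow task-to-task; serialisation to XCom happens implicitly
    train_model(clean_data(load_data()))


example_pipeline()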
-
🚀 Exciting News! I just published my first article on Medium! 🎉 As a data engineer and a big fan of Apache Airflow, I’m passionate about improving the quality and consistency of Airflow DAGs. In this article, I dive into how you can use Python’s Abstract Syntax Tree (AST) to lint your DAGs and standardize workflows across your team. Plus, I introduce DAGLint, a tool we built internally at Next Insurance to enforce best practices and ensure smooth operations. Check it out and let me know your thoughts!👇 #DataEngineering #Airflow #Python
Mastering Airflow DAG Standardization with Python’s AST: A Deep Dive into Linting at Scale
link.medium.com
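Not from the article itself, but as a flavour of the approach it describes: a minimal sketch of AST-based DAG linting, assuming a made-up rule that every DAG(...) call must pass a tags= keyword. DAGLint's actual rules and API are internal to Next Insurance and not shown here.

import ast
import sys


def dag_calls_missing_tags(source: str):
    """Yield line numbers of DAG(...) calls that do not pass a `tags` keyword."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both DAG(...) and models.DAG(...)
            name = func.id if isinstance(func, ast.Name) else getattr(func, 'attr', None)
            if name == 'DAG' and not any(kw.arg == 'tags' for kw in node.keywords):
                yield node.lineno


if __name__ == '__main__':
    for path in sys.argv[1:]:
        with open(path) as f:
            for lineno in dag_calls_missing_tags(f.read()):
                print(f"{path}:{lineno}: DAG() call without tags=")

Because the check works on the syntax tree rather than by importing the DAG file, it runs quickly in CI without needing an Airflow environment.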
-
A few days ago, I delved into the world of shell commands (bash) for data transformation. Today, the synergy unfolded as I completed the "Introduction to Airflow in Python" course.

🚀 Key Takeaways from the Introduction to Airflow in Python Course 🚀

1. Operators and Tasks: Explored various operators to define tasks in Airflow workflows, providing flexibility in task execution.
2. Task Dependencies: Learned to establish dependencies between tasks using bitshift operators, ensuring the proper sequence of task execution.
3. Sensors for Workflow Conditions: Leveraged sensors to react to workflow conditions and states, enhancing adaptability in response to changing scenarios.
4. Scheduling Strategies: Explored diverse DAG scheduling approaches, tailoring workflows to meet specific requirements.
5. SLAs and Alerting: Utilized SLAs (Service Level Agreements) and alerting mechanisms to maintain visibility and proactively manage workflow performance.
6. Templating for Flexibility: Unleashed the power of templating to create highly flexible workflows, allowing dynamic task definition.
7. Conditional Logic with Branching: Implemented conditional logic using branching, adding decision-making capabilities to DAGs.
8. Airflow Interfaces: Explored the Airflow command line and UI interfaces for efficient workflow management.
9. Understanding Executors: Gained insights into Airflow executors, understanding their role in task execution.
10. Debugging and Troubleshooting: Acquired skills in debugging and troubleshooting, ensuring smooth workflow execution.

Embarking on a journey with Apache Airflow has been enlightening, unlocking a world of possibilities in orchestrating data workflows! 💡✨
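Not part of the original post, but a minimal sketch that ties a few of these takeaways together: bitshift dependencies (item 2), a FileSensor (item 3), and a templated BashOperator (item 6). The DAG name, file path, and tasks are illustrative, assuming Airflow 2.x with the default fs_default connection.

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor
from datetime import datetime

with DAG('takeaways_demo', start_date=datetime(2023, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    # (3) Sensor: wait until an upstream system has dropped today's file
    wait_for_file = FileSensor(task_id='wait_for_file',
                               filepath='/data/incoming/report.csv',
                               poke_interval=60)

    # (6) Templating: Jinja fields like {{ ds }} are rendered at runtime
    process = BashOperator(task_id='process',
                           bash_command='echo "processing report for {{ ds }}"')

    notify = BashOperator(task_id='notify', bash_command='echo "done"')

    # (2) Bitshift operators define the execution order
    wait_for_file >> process >> notify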
Faiz Samsudin's Statement of Accomplishment | DataCamp
datacamp.com
-
🎉 Excited to share that I've just completed the "Introduction to Airflow in Python" course! 🚀 Airflow is a powerful tool for automating, scheduling, and monitoring data workflows. This course provided me with a solid foundation in building and managing data pipelines, including creating DAGs, using operators, and setting up sensors. I'm looking forward to applying these skills to streamline data processes and drive efficiency in my future projects. #DataEngineering #Airflow #Python #Automation #DataPipelines #Learning
Prashant Singh's Statement of Accomplishment | DataCamp
datacamp.com
-
Recently in a discussion, when asked about inheritance, I stumbled while answering (sad). Despite having implemented the concept before in a few projects, I never really thought about its applications in setting up data pipelines until a few days ago, when I read about using operator classes to define tasks in Apache Airflow.

To provide some context: operator classes are the fundamental components for defining tasks (a unit of work) in Airflow, and they define the functionality of tasks within a workflow. When an Airflow DAG (Directed Acyclic Graph) executes, each instance of an operator class represents a distinct task within the DAG.

This is where inheritance shows up: some operator classes serve as base classes (e.g., BaseOperator, BaseSensorOperator), and specific operator classes (e.g., BashOperator, PythonOperator) are subclasses of these base classes. As with inheritance in general, subclasses inherit common behavior from their base classes but can also define additional functionality specific to their task type.

Though there are similarities, based on my understanding, concepts like multiple inheritance might not apply in the Airflow setting, as operator classes typically inherit from a single parent class. However, I'm open to suggestions. #dataengineering #oopsconcepts #airflow #dataprocessing
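To illustrate the parallel (my own sketch, not from the post above): a custom operator is literally a Python subclass of BaseOperator that overrides execute(), so it inherits scheduling, retry, and templating behaviour for free. The GreetOperator here is hypothetical.

from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    """Hypothetical operator: inherits retries, templating, logging, etc. from BaseOperator."""

    # Fields listed here are rendered with Jinja before execute() runs
    template_fields = ('name',)

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)  # BaseOperator handles task_id, retries, and so on
        self.name = name

    def execute(self, context):
        # The only method a subclass must provide: the task's actual work
        self.log.info("Hello, %s (run date: %s)", self.name, context['ds'])
        return f"greeted {self.name}"

It is then used in a DAG exactly like the built-ins, e.g. GreetOperator(task_id='greet', name='Airflow', dag=dag).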
-
Dean’s List #21: Big Data London Spotlight — Prefect’s Workflow Orchestration Revolution: A Conversation with CTO Chris White

At Big Data London, I had a fantastic conversation with Chris White, CTO of Prefect, about how their Python-based orchestration framework is transforming how data teams handle complex workflows. Here’s why Prefect is becoming a favorite for data practitioners:

What Makes Prefect Stand Out?
- Scalable Performance: Built to handle thousands of workflows and tasks with minimal overhead.
- Dynamic Workflows: Define workflows at runtime, enabling conditional branching and adaptability in native Python.
- Enterprise-Grade Security: A federated architecture supports secure, distributed operations without exposing infrastructure.

Why Teams Choose Prefect?
- Perfect for high-volume workflows that require scalability.
- Ideal for teams building complex, dynamic workflow structures.
- Trusted by distributed enterprises to balance collaboration and security.

Prefect empowers teams to automate, recover gracefully, and optimize workflows, all while staying in the familiar Python environment. Big thanks to Chris for the insights!

Jeremiah Lowin Sarah Moses Brad Evans Thomas E. Prefect #BigDataLondon #DataEngineering #WorkflowOrchestration #Python
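To ground the "dynamic workflows" point (my own illustration, not from the conversation): in Prefect 2.x, conditional branching is just ordinary Python inside a flow, with no dedicated branch operator. A minimal sketch; the task names and threshold are made up.

from prefect import flow, task


@task
def check_row_count() -> int:
    # Placeholder for a real query against a warehouse or API
    return 1500


@task
def full_refresh():
    print("running full refresh")


@task
def incremental_load():
    print("running incremental load")


@flow
def nightly_load(threshold: int = 1000):
    rows = check_row_count()
    # Branching is plain Python: the flow graph is built at runtime
    if rows > threshold:
        incremental_load()
    else:
        full_refresh()


if __name__ == "__main__":
    nightly_load()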
-
Introduction to Airflow in Python
Konstiantyn Ivashchenko's Statement of Accomplishment | DataCamp
datacamp.com
-
I'll probably add to the long-time battle between Airflow and Prefect with this post. Sharing practical insights has become a habit, so here’s another one about data work and workflow orchestration. Many of you are familiar with Airflow, its standard-setting capabilities but also its drawbacks—being bulky and sometimes restrictive. I've had personal success using Prefect as a modern alternative, a newer and simpler option that works well with Python. It makes things easier to manage and fits well with the fast-paced needs of data teams. Using Prefect for data processing has made it easier and faster to handle machine learning tasks, which is key for effective MLOps. MLOps, essential for managing machine learning models in production, requires tools for data monitoring, model retraining, and updating deployments. Prefect here has proven to be more flexible and simpler than Airflow, boosting productivity for data engineers and analysts. Of course, choosing data orchestration tools should depend on your specific needs and priorities. However, for those dealing with the challenges of MLOps, Prefect is a good option to consider. #mlops #prefect #airflow #machinelearning