Throughout the past few years, Apache Airflow has established itself as the go-to data workflow management tool within any modern tech ecosystem. One of the main reasons Airflow became so popular is its simplicity and how easy it is to get up and running. This article will guide you through your first steps with Apache Airflow, up to the creation of your first Directed Acyclic Graph (DAG).

Installing Airflow

Airflow's ease of use starts right from the installation process, because it only requires one pip command to get all of its components:

    pip install apache-airflow

Adding external packages to support certain features (like compatibility with your cloud provider) is also a seamless operation. If we opt to add the Microsoft Azure subpackage, for example, the command becomes:

    pip install 'apache-airflow[azure]'

Afterward, you only need to initialize a database for Airflow to store its own data. The recommended option is to start with Airflow's own SQLite database, but you can also connect it to another database. To initialize the database, run the following command:

    airflow initdb

Creating your first DAG

In a previous article on INVIVOO's blog, we presented the main concepts that Airflow relies on. One of these concepts is the DAG, which is how Airflow organizes the multiple tasks and processes that it needs to run. In Airflow, a DAG is simply a Python script that contains a set of tasks and their dependencies. What each task does is determined by the task's operator: for example, using a PythonOperator to define a task means that the task will consist of running Python code.

To create our first DAG, let's start by importing the necessary modules:

    # We'll start by importing the DAG object
    from airflow import DAG
    # We need to import the operators used in our tasks
    from airflow.operators.bash_operator import BashOperator

We can then define a dictionary containing the default arguments that we want to pass to our DAG. These arguments will be applied to all of the DAG's operators. Airflow offers a large number of default arguments that make DAG configuration even simpler: for example, we can easily define the number of retries and the retry delay for the DAG's runs.

The DAG itself can then be instantiated with the DAG constructor, my_first_dag = DAG(...). The first parameter, "first_dag", represents the DAG's ID, and the schedule interval represents the interval between two runs of our DAG.

The next step consists of defining the tasks of our DAG, each one as an operator, e.g. task_1 = BashOperator(...). Lastly, we just need to specify the dependencies; in our case, we want task_2 to run after task_1:

    task_1.set_downstream(task_2)

The full script, with all of these pieces put together, is sketched below.
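Here is a minimal sketch of the complete DAG script, assuming Airflow 1.x. Only the DAG ID "first_dag", the variable name my_first_dag, the two BashOperator tasks and the set_downstream dependency come from the steps above; the owner, start date, retry settings, schedule interval and bash commands are placeholder values chosen for illustration.

    from datetime import datetime, timedelta

    # We'll start by importing the DAG object
    from airflow import DAG
    # We need to import the operators used in our tasks
    from airflow.operators.bash_operator import BashOperator

    # initializing the default arguments that we'll pass to our DAG
    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2019, 1, 1),   # placeholder start date
        'retries': 1,                         # number of retries for a failed task
        'retry_delay': timedelta(minutes=5),  # delay between two retries
    }

    my_first_dag = DAG(
        'first_dag',                  # the DAG's ID
        default_args=default_args,    # applied to all of the DAG's operators
        schedule_interval='@daily',   # placeholder interval between two runs
    )

    task_1 = BashOperator(
        task_id='task_1',
        bash_command='echo "running task_1"',  # placeholder command
        dag=my_first_dag,
    )

    task_2 = BashOperator(
        task_id='task_2',
        bash_command='echo "running task_2"',  # placeholder command
        dag=my_first_dag,
    )

    # task_2 should run after task_1
    task_1.set_downstream(task_2)

Saving this script as a Python file in Airflow's DAGs folder is enough for the scheduler to pick it up, as described in the next section.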
Running the DAG

By default, DAGs should be placed in the ~/airflow/dags folder. We can first test our individual tasks using the airflow test command, and once we've verified that everything is configured correctly, we can use the airflow backfill command to run our DAG for a specific range of dates:

    airflow backfill my_first_dag -s <start_date> -e <end_date>

Finally, we only need to launch Airflow's scheduler with the airflow scheduler command, and Airflow will make sure to run our DAG at the defined interval.

Templated scripts

Tasks are not limited to fixed commands: they can also run templated scripts, such as a SQL script that starts with DROP TABLE IF EXISTS event_stats_staging and filters rows with a WHERE date = ... clause. There are two key concepts in such a templated SQL script:

- Airflow macros: they provide access to the metadata that is available for each DAG run.
- Templated parameters: placeholders such as the value in the WHERE date = ... clause. Airflow replaces them with a variable that is passed in through the DAG script at run time or made available via Airflow metadata macros.

This may seem like overkill for our use case, but it becomes very helpful when we have more complex logic and want to dynamically generate parts of the script, such as WHERE clauses, at run time.
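As a sketch of what such a templated task could look like (continuing the DAG above): the event_stats_staging table name comes from the example, while the source events table, the column list, the {{ ds }} macro and the use of a BashOperator that simply echoes the rendered SQL are assumptions made for illustration.

    # A templated SQL script: the value in the WHERE clause is a templated
    # parameter, filled in at run time from the ds macro (the run's execution date).
    templated_sql = """
    DROP TABLE IF EXISTS event_stats_staging;
    CREATE TABLE event_stats_staging AS
    SELECT date, COUNT(*) AS nb_events
    FROM events                -- assumed source table
    WHERE date = '{{ ds }}'    -- templated parameter, rendered at run time
    GROUP BY date;
    """

    task_3 = BashOperator(
        task_id='templated_sql_example',
        # bash_command is a templated field: Airflow renders the Jinja template
        # before execution, so this command simply prints the rendered SQL.
        bash_command='echo "' + templated_sql + '"',
        dag=my_first_dag,
    )

In a real pipeline, the rendered script would be passed to a database operator rather than echoed, but the macro and templated-parameter mechanics are the same.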