Building a Running Pipeline

Let's look at another example: we need to fetch some data from a file hosted online and insert it into our local database. We also need to remove duplicate rows while inserting.

Initial setup

We need to have Docker installed as we will be using the Running Airflow in Docker procedure for this example. The steps below should be sufficient, but see the quick-start documentation for full instructions.

# Download the docker-compose.yaml file
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'

# Make expected directories and set an expected environment variable
mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env

# Initialize the database
docker-compose up airflow-init

# Start up all services
docker-compose up

After all services have started up, the web UI will be available at: http://localhost:8080. The default account has the username airflow and the password airflow.
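If you want to confirm from a script that the webserver is ready before moving on, one option is to poll the webserver's /health endpoint. A minimal sketch, assuming the default port mapping from the compose file above:

import requests

# Ask the Airflow webserver for its health report (default compose port mapping).
resp = requests.get("http://localhost:8080/health", timeout=10)
resp.raise_for_status()
print(resp.json())  # expect "healthy" statuses for the metadatabase and scheduler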

We will also need to create a connection to the postgres db. To create one via the web UI, from the “Admin” menu, select “Connections”, then click the Plus sign to “Add a new record” to the list of connections.

Fill in the fields as shown below. Note the Connection Id value, which we’ll pass as a parameter for the postgres_conn_id kwarg.

  • Connection Id: tutorial_pg_conn

  • Connection Type: postgres

  • Host: postgres

  • Schema: airflow

  • Login: airflow

  • Password: airflow

  • Port: 5432

Test your connection, and if the test is successful, save it.
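Once the connection is saved, you can also sanity-check it from Python with the provider's hook, using the same Connection Id. This is just an optional sketch, not part of the pipeline:

from airflow.providers.postgres.hooks.postgres import PostgresHook

# Run a trivial query through the connection we just created.
hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
print(hook.get_first("SELECT 1"))  # expect (1,)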

Table Creation Tasks

We can use the PostgresOperator to define tasks that create tables in our postgres db.

We’ll create one table to facilitate data cleaning steps (employees_temp) and another table to store our cleaned data (employees).

from airflow.providers.postgres.operators.postgres import PostgresOperator

create_employees_table = PostgresOperator(
    task_id="create_employees_table",
    postgres_conn_id="tutorial_pg_conn",
    sql="""
        CREATE TABLE IF NOT EXISTS employees (
            "Serial Number" NUMERIC PRIMARY KEY,
            "Company Name" TEXT,
            "Employee Markme" TEXT,
            "Description" TEXT,
            "Leave" INTEGER
        );""",
)

create_employees_temp_table = PostgresOperator(
    task_id="create_employees_temp_table",
    postgres_conn_id="tutorial_pg_conn",
    sql="""
        DROP TABLE IF EXISTS employees_temp;
        CREATE TABLE employees_temp (
            "Serial Number" NUMERIC PRIMARY KEY,
            "Company Name" TEXT,
            "Employee Markme" TEXT,
            "Description" TEXT,
            "Leave" INTEGER
        );""",
)

Optional: Using SQL From Files

If you want to abstract these SQL statements out of your DAG, you can move them into .sql files somewhere within the dags/ directory and pass the file path (relative to dags/) to the sql kwarg. For employees, for example, create a sql directory in dags/, put the employees DDL in dags/sql/employees_schema.sql, and modify the PostgresOperator() to:

create_employees_table = PostgresOperator(
    task_id="create_employees_table",
    postgres_conn_id="tutorial_pg_conn",
    sql="sql/employees_schema.sql",
)

and repeat for the employees_temp table.
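For instance, assuming the temp-table DDL is saved to dags/sql/employees_temp_schema.sql (a file name chosen here only for illustration), the second operator would become:

create_employees_temp_table = PostgresOperator(
    task_id="create_employees_temp_table",
    postgres_conn_id="tutorial_pg_conn",
    sql="sql/employees_temp_schema.sql",  # hypothetical file under dags/sql/
)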

Data Retrieval Task

Here we retrieve data, save it to a file on our Airflow instance, and load the data from that file into an intermediate table where we can execute data cleaning steps.

import os

import requests
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task
def get_data():
    # NOTE: configure this as appropriate for your airflow environment
    data_path = "/opt/airflow/dags/files/employees.csv"
    os.makedirs(os.path.dirname(data_path), exist_ok=True)

    url = "https://raw.githubusercontent.com/apache/airflow/main/docs/apache-airflow/tutorial/pipeline_example.csv"

    response = requests.request("GET", url)

    with open(data_path, "w") as file:
        file.write(response.text)

    postgres_hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
    conn = postgres_hook.get_conn()
    cur = conn.cursor()
    with open(data_path, "r") as file:
        cur.copy_expert(
            "COPY employees_temp FROM STDIN WITH CSV HEADER DELIMITER AS ',' QUOTE '\"'",
            file,
        )
    conn.commit()
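
As a side note, PostgresHook also exposes a copy_expert(sql, filename) helper that manages the connection, cursor, and file handle itself. A shorter sketch of the same load step under that assumption:

postgres_hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
# Let the hook open the file, run the COPY, and commit the transaction.
postgres_hook.copy_expert(
    "COPY employees_temp FROM STDIN WITH CSV HEADER DELIMITER AS ',' QUOTE '\"'",
    data_path,
)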

Data Merge Task

Here we select completely unique records from the retrieved data, then we check to see if any employee Serial Numbers are already in the database (if they are, we update those records with the new data).

from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task
def merge_data():
    query = """
        INSERT INTO employees
        SELECT *
        FROM (
            SELECT DISTINCT *
            FROM employees_temp
        ) t
        ON CONFLICT ("Serial Number") DO UPDATE
        SET
          "Employee Markme" = excluded."Employee Markme",
          "Description" = excluded."Description",
          "Leave" = excluded."Leave";
    """
    try:
        postgres_hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
        conn = postgres_hook.get_conn()
        cur = conn.cursor()
        cur.execute(query)
        conn.commit()
        return 0
    except Exception as e:
        return 1

Completing our DAG

We’ve developed our tasks; now we need to wrap them in a DAG, which lets us define when and how tasks should run and state any dependencies those tasks have on other tasks. The DAG below is configured to:

  • run every day at midnight starting on Jan 1, 2021,

  • only run once in the event that days are missed, and

  • timeout after 60 minutes

And from the last line in the definition of the process-employees DAG, we see:

[create_employees_table, create_employees_temp_table] >> get_data() >> merge_data()

  • the merge_data() task depends on the get_data() task,

  • the get_data() task depends on both the create_employees_table and create_employees_temp_table tasks, and

  • the create_employees_table and create_employees_temp_table tasks can run independently.
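The same dependencies could also be spelled with the chain helper instead of the bit-shift operators. A minimal sketch of that equivalent alternative (you would use one spelling or the other, not both, and this assumes a recent Airflow version that accepts TaskFlow outputs here):

from airflow.models.baseoperator import chain

# Both create-table tasks must finish before get_data(), which must finish before merge_data().
chain([create_employees_table, create_employees_temp_table], get_data(), merge_data())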

Putting all of the pieces together, we have our completed DAG.

import datetime
import os

import pendulum
import requests
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.postgres.operators.postgres import PostgresOperator


@dag(
    dag_id="process-employees",
    schedule_interval="0 0 * * *",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    dagrun_timeout=datetime.timedelta(minutes=60),
)
def ProcessEmployees():
    create_employees_table = PostgresOperator(
        task_id="create_employees_table",
        postgres_conn_id="tutorial_pg_conn",
        sql="""
            CREATE TABLE IF NOT EXISTS employees (
                "Serial Number" NUMERIC PRIMARY KEY,
                "Company Name" TEXT,
                "Employee Markme" TEXT,
                "Description" TEXT,
                "Leave" INTEGER
            );""",
    )

    create_employees_temp_table = PostgresOperator(
        task_id="create_employees_temp_table",
        postgres_conn_id="tutorial_pg_conn",
        sql="""
            DROP TABLE IF EXISTS employees_temp;
            CREATE TABLE employees_temp (
                "Serial Number" NUMERIC PRIMARY KEY,
                "Company Name" TEXT,
                "Employee Markme" TEXT,
                "Description" TEXT,
                "Leave" INTEGER
            );""",
    )

    @task
    def get_data():
        # NOTE: configure this as appropriate for your airflow environment
        data_path = "/opt/airflow/dags/files/employees.csv"
        os.makedirs(os.path.dirname(data_path), exist_ok=True)

        url = "https://raw.githubusercontent.com/apache/airflow/main/docs/apache-airflow/tutorial/pipeline_example.csv"

        response = requests.request("GET", url)

        with open(data_path, "w") as file:
            file.write(response.text)

        postgres_hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
        conn = postgres_hook.get_conn()
        cur = conn.cursor()
        with open(data_path, "r") as file:
            cur.copy_expert(
                "COPY employees_temp FROM STDIN WITH CSV HEADER DELIMITER AS ',' QUOTE '\"'",
                file,
            )
        conn.commit()

    @task
    def merge_data():
        query = """
            INSERT INTO employees
            SELECT *
            FROM (
                SELECT DISTINCT *
                FROM employees_temp
            ) t
            ON CONFLICT ("Serial Number") DO UPDATE
            SET
              "Employee Markme" = excluded."Employee Markme",
              "Description" = excluded."Description",
              "Leave" = excluded."Leave";
        """
        try:
            postgres_hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
            conn = postgres_hook.get_conn()
            cur = conn.cursor()
            cur.execute(query)
            conn.commit()
            return 0
        except Exception as e:
            return 1

    [create_employees_table, create_employees_temp_table] >> get_data() >> merge_data()


dag = ProcessEmployees()

Save this code to a Python file in the /dags folder (e.g. dags/process-employees.py). After a brief delay, the process-employees DAG will appear in the list of available DAGs in the web UI.
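If you would like to exercise the DAG outside the scheduler first, Airflow 2.5+ also lets you run a single in-process DAG run with dag.test(). A minimal sketch you could append to the bottom of the same file:

# Optional: run one full DAG run in-process for quick debugging (requires Airflow 2.5+).
if __name__ == "__main__":
    dag.test()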

[Screenshot: tutorial-pipeline-1.png]

You can trigger the process-employees DAG by unpausing it (via the slider on the left end) and running it (via the Run button under Actions).

[Screenshot: tutorial-pipeline-2.png]

In the process-employees DAG’s Grid view, we see that all tasks ran successfully in all executed runs. Success!

What’s Next?

You now have a pipeline running inside Airflow using Docker Compose. Here are a few things you might want to do next:
