End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker
This article is part of a project that’s split into two main phases. The first phase focuses on building a data pipeline. This involves getting data from an API and storing it in a PostgreSQL database. In the second phase, we’ll develop an application that uses a language model to interact with this database.
Ideal for those new to data systems or language model applications, this project is structured into two segments:
- This initial article guides you through constructing a data pipeline using Kafka for streaming, Airflow for orchestration, Spark for data transformation, and PostgreSQL for storage. To set up and run these tools we will use Docker.
- The second article, which will come later, will delve into creating agents using tools like LangChain to communicate with external databases.
This first part of the project is ideal for beginners in data engineering, as well as for data scientists and machine learning engineers looking to deepen their knowledge of the entire data handling process. Using these data engineering tools firsthand is valuable: it helps refine how machine learning models are built and scaled, ensuring they perform effectively in practical settings.
This article focuses on practical application rather than the theoretical aspects of the tools discussed. For a detailed understanding of how these tools work internally, there are many excellent resources available online.
Overview
Let’s break down the data pipeline process step-by-step:
- Data Streaming: Initially, data is streamed from the API into a Kafka topic.
- Data Processing: A Spark job then takes over, consuming the data from the Kafka topic and transferring it to a PostgreSQL database.
- Scheduling with Airflow: Both the streaming task and the Spark job are orchestrated with Airflow. In a real-world scenario the Kafka producer would listen to the API continuously, but for demonstration purposes we'll schedule the Kafka streaming task to run daily. Once the streaming is complete, the Spark job processes the data, making it ready for use by the LLM application.
All of these tools will be built and run using Docker, and more specifically docker-compose.
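To make the orchestration idea more concrete, here is a minimal, hypothetical sketch of a DAG that chains the two tasks. It is not the exact code from the repository (the real DAG lives in airflow/dags/dag_kafka_spark.py); the task ids, Spark image name, Docker URL, and network name are illustrative assumptions.

```python
# Minimal sketch of the orchestration, NOT the exact DAG from the repository.
# Task ids, the Spark image name, docker_url and network_mode are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.docker.operators.docker import DockerOperator


def stream_to_kafka():
    """Placeholder for the streaming logic in src/kafka_client/kafka_stream_data.py."""
    ...


with DAG(
    dag_id="kafka_spark_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # daily run, for demonstration purposes only
    catchup=False,
) as dag:
    # Step 1: push the latest API records to the Kafka topic.
    kafka_stream_task = PythonOperator(
        task_id="kafka_data_stream",
        python_callable=stream_to_kafka,
    )

    # Step 2: run the Spark job in its own container, through the docker-proxy.
    spark_job_task = DockerOperator(
        task_id="pyspark_consumer",
        image="rappel-conso/spark:latest",     # assumed image name
        command="./bin/spark-submit ./spark_streaming.py",
        docker_url="tcp://docker-proxy:2375",  # assumed docker-proxy address
        network_mode="airflow-kafka",          # assumed shared Docker network
        api_version="auto",
    )

    kafka_stream_task >> spark_job_task
```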
Now that we have a blueprint of our pipeline, let's dive into the technical details!
Local setup
First, you can clone the GitHub repo to your local machine using the following command:
git clone https://github.com/HamzaG737/data-engineering-project.git
Here is the overall structure of the project:
├── LICENSE
├── README.md
├── airflow
│ ├── Dockerfile
│ ├── __init__.py
│ └── dags
│ ├── __init__.py
│ └── dag_kafka_spark.py
├── data
│ └── last_processed.json
├── docker-compose-airflow.yaml
├── docker-compose.yml
├── kafka
├── requirements.txt
├── spark
│ └── Dockerfile
└── src
├── __init__.py
├── constants.py
├── kafka_client
│ ├── __init__.py
│ └── kafka_stream_data.py
└── spark_pgsql
└── spark_streaming.py
- The airflow directory contains a custom Dockerfile for setting up Airflow and a dags directory to create and schedule the tasks.
- The data directory contains the *last_processed.json* file, which is crucial for the Kafka streaming task. Further details on its role will be provided in the Kafka section.
- The *docker-compose-airflow.yaml* file defines all the services required to run Airflow.
- The *docker-compose.yml* file specifies the Kafka services and includes a docker-proxy. This proxy is essential for executing Spark jobs through Airflow's DockerOperator, a concept that will be elaborated on later.
- The spark directory contains a custom Dockerfile for the Spark setup.
- The src directory contains the Python modules needed to run the application.
To set up your local development environment, start by installing the required Python packages. The only essential package is psycopg2-binary. You have the option to install just this package or all the packages listed in the requirements.txt file. To install all packages, use the following command:
pip install -r requirements.txt
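As an optional sanity check, once a PostgreSQL instance is up you can verify that psycopg2 is installed and can reach it with a short snippet like the one below. The connection parameters are placeholders to adapt to your own setup.

```python
# Optional check that psycopg2 can reach a running PostgreSQL instance.
# Host, port, user, password and database name are placeholder assumptions.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="postgres",
    dbname="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```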
Next, let's dive step by step into the project details.
About the API
The API is RappelConso, from the French public services. It gives access to data relating to recalls of products declared by professionals in France. The data is in French and initially contains 31 columns (or fields). Some of the most important are:
- *reference_fiche* (reference sheet): Unique identifier of the recalled product. It will act as the primary key of our Postgres database later.
- *categorie_de_produit* (Product category): For instance food, electrical appliances, tools, means of transport, etc.
- *sous_categorie_de_produit* (Product sub-category): For instance, we can have meat, dairy products, or cereals as sub-categories of the food category.
- *motif_de_rappel* (Reason for recall): Self-explanatory, and one of the most important fields.
- *date_de_publication*, which translates to the publication date.
- *risques_encourus_par_le_consommateur*, which contains the risks that the consumer may encounter when using the product.
- There are also several fields that correspond to different links, such as the link to the product image, the link to the distributors list, etc.
You can see some examples and query the dataset records manually using this link.
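If you want to explore the data programmatically, a small script along these lines can pull a few records from the API. The endpoint URL and the dataset identifier below are assumptions based on the public Opendatasoft-style API, so double-check them on the dataset page.

```python
# Small exploration script: fetch a handful of recall records from the API.
# The endpoint and the dataset identifier are assumptions; verify them first.
import requests

API_URL = "https://data.economie.gouv.fr/api/records/1.0/search/"

response = requests.get(
    API_URL,
    params={"dataset": "rappelconso0", "rows": 5},  # assumed dataset identifier
    timeout=30,
)
response.raise_for_status()

for record in response.json().get("records", []):
    fields = record.get("fields", {})
    print(
        fields.get("reference_fiche"),
        fields.get("categorie_de_produit"),
        fields.get("motif_de_rappel"),
    )
```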
We refined the data columns in a few key ways:
- Columns like ndeg_de_version and rappelguid, which were part of a versioning system, have been removed as they aren't needed for our project.
- We combined the columns that deal with consumer risks, risques_encourus_par_le_consommateur and description_complementaire_du_risque, for a clearer overview of product risks.
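As an illustration, here is roughly how these two refinements could be expressed in PySpark. The actual transformation logic lives in src/spark_pgsql/spark_streaming.py and may differ; the merged column name and the separator are assumptions.

```python
# Illustrative PySpark version of the two refinements described above.
# The real logic in spark_streaming.py may differ; this is only a sketch.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def refine_columns(df: DataFrame) -> DataFrame:
    # Drop the versioning columns that are not needed for the project.
    df = df.drop("ndeg_de_version", "rappelguid")

    # Merge the two risk-related columns into a single, clearer description.
    df = df.withColumn(
        "risques_pour_les_consommateurs",  # assumed name for the merged column
        F.concat_ws(
            " ",
            F.col("risques_encourus_par_le_consommateur"),
            F.col("description_complementaire_du_risque"),
        ),
    ).drop(
        "risques_encourus_par_le_consommateur",
        "description_complementaire_du_risque",
    )
    return df
```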