jugnuarora/france-courses-enrollments

Project Overview

This project focuses on building an end-to-end data pipeline to analyze training course and enrollment data within the French market. The goal is to track course trends over time, understand enrollment patterns, and provide actionable insights through a dashboard.

Datasets Used:

  1. Courses Data: Mon Compte Formation Catalogue - Contains information about available training courses, including provider details, course titles, and regions.

  2. Enrollments Data: Entree Sortie Formation - Provides monthly enrollment data, showing the number of trainees enrolled in specific courses.

  3. Formacode Data: Formacode Centre Inffo - Classifies training programs by skills (using Formacode codes) and includes descriptions and semantic fields. Up to 5 formacodes can be assigned to a course. Examples with their descriptions are below:

    • 31025: Data Analytics
    • 31026: Data Science
    • 31028: Artificial Intelligence
    • 31035: Data Visualization

    Apart from the code, there is a column field, which contains 65 semantic fields that help group the formacodes at a higher level. For example, all of the above would be part of IT and information systems. Another example would be Financial Management account.

Problem Statement:

The primary objective is to develop a dashboard that visualizes:

  • The monthly evolution of training courses, allowing for trend analysis and identification of new course launches.
  • Enrollment patterns over time, showing the popularity of different courses and providers.

A key challenge is the lack of historical course launch dates: if the data today shows 7 providers offering data analytics courses, there is no way to find out what that number was 3 months ago. To address this, the pipeline captures monthly snapshots of the course data, enabling retrospective analysis.
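
The snapshot idea can be sketched in a few lines of Python. This is only an in-memory illustration of the design: the field names data_extracted, provider and formacode mirror the pipeline's columns, but the rows and helper functions are made up for the example.

```python
from datetime import date

snapshots = []  # stands in for the BigQuery courses table

def capture(rows, extracted: date):
    """Stamp every row of a monthly pull with its extraction date and keep it."""
    snapshots.extend({**row, "data_extracted": extracted} for row in rows)

def providers_on(snapshot_date: date, formacode: str) -> int:
    """Reconstruct how many providers offered a formacode in a past snapshot."""
    return len({row["provider"] for row in snapshots
                if row["data_extracted"] == snapshot_date
                and row["formacode"] == formacode})

# January snapshot: two providers offer data-analytics (31025) courses.
capture([{"provider": "A", "formacode": "31025"},
         {"provider": "B", "formacode": "31025"}], date(2025, 1, 5))
# April snapshot: a third provider has appeared; January's rows are retained.
capture([{"provider": "A", "formacode": "31025"},
         {"provider": "B", "formacode": "31025"},
         {"provider": "C", "formacode": "31025"}], date(2025, 4, 6))

print(providers_on(date(2025, 1, 5), "31025"))  # -> 2
```

Because every run appends stamped rows instead of overwriting, the January count stays answerable in April.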

Additionally, the enrollment dataset's potential for data resets necessitates an incremental data ingestion approach.

The formacode dataset, in French, requires translation and processing for broader stakeholder understanding.

Data Pipeline Architecture

Technologies Used:

  • Cloud: Google Cloud Platform (GCP)
  • Workflow Orchestration: Kestra
  • Data Lake: Google Cloud Storage (GCS)
  • Data Warehouse: BigQuery
  • Data Ingestion: dlt (data load tool) with Python
  • Batch Processing: Apache Spark
  • Data Transformation: dbt Cloud
  • Data Visualization: Looker Studio

High Level Architecture

graph LR
    subgraph "Data Sources (EL)"
        C[Courses]
        E[Enrollments]
        F[Formacode]
    end

    subgraph "Bash Script for complete automation"
        B[Bash Script: GCP Key creation & IAM Roles + Terraform Deployment + Kestra Dockerization & Dynamic Flows Import + Terraform Destroy]
    
        subgraph "Terraform"
            T[Terraform: Infrastructure as Code]
        end

        subgraph "Kestra Orchestration"
            subgraph "GCS"
                T --> GCS_Resources
                C --> D(DLT: Raw Data)
                E --> D
                F --> D
                D --> S[Spark: Initial Transformed Data]
            end

            subgraph "BigQuery"
                T --> BQ_Resources
                S --> P[Partitioned & Clustered source_tables]
                P --> DBT(dbt: Facts & Dimensions)
            end
        end
    end

    subgraph "Analytics Tools"
        DBT --> H[Visualization]
    end

    style T fill:#add8e6,stroke:#333,stroke-width:2px,color:#000;
    style D fill:#ccf,stroke:#333,stroke-width:2px,color:#000;
    style S fill:#ccf,stroke:#333,stroke-width:2px,color:#000;
    style P fill:#efe,stroke:#333,stroke-width:2px,color:#000;
    style DBT fill:#ccf,stroke:#333,stroke-width:2px,color:#000;
    style H fill:#ccf,stroke:#333,stroke-width:2px,color:#000;
    style GCS_Resources fill:#e0ffff,stroke:#333,stroke-width:1px,color:#000;
    style BQ_Resources fill:#e0ffff,stroke:#333,stroke-width:1px,color:#000;

Pipeline Components:

  1. Data Ingestion (Batch):

    • Courses Data: DLT extracts the course data from the source and stores it as raw data in GCS; Spark performs the initial transformations; and the result is loaded into BigQuery, partitioned by data_extracted and clustered by code-formacode-1. Every run appends any new incoming rows while retaining the old ones.
    • Enrollments Data: DLT extracts the enrollments data from the source and stores it as raw data in GCS; Spark performs the initial transformations; and the result is loaded into BigQuery, partitioned by year_month and clustered by provider. Every run appends any new incoming rows while retaining the old ones.
    • Formacode Data: Kestra downloads the Formacode ZIP file, extracts the relevant data, translates the description and field columns using Spark, and stores the processed data in BigQuery. Spark broadcast variables are used for the semantic-field translation. This flow performs a write-truncate load into BigQuery: all old records are deleted and the new ones are inserted.
  2. Workflow Orchestration (Kestra):

    • A single Kestra workflow is created to orchestrate the courses and enrollments data pipelines, scheduled to run on the first Sunday of each month at 3 AM. Please refer to the Courses Enrollments Kestra Workflow Topology for the whole flow.
    • A separate Kestra workflow is created to handle the Formacode download, extraction, translation and upload. This job runs only occasionally, as the Formacode data remains more or less constant. Please refer to the Formacode Kestra Workflow Topology for the whole flow.
    • The flows are also inter-dependent and set to trigger the next flow. There are 2 possible cases:
      • Bash script will trigger the start-to-end flow creating the dbt marts and fact tables. Please refer to the dependency graph.
      • 02_courses_enrollments_pipeline also has a time trigger set up, which in turn triggers 04_dbt_execution once it completes.
  3. Data Transformation (dbt Cloud):

    • dbt Cloud is used to build data models for the final fact and dimension tables in BigQuery.
    • Please refer to the DBT Lineage Graph to understand the workflow.
    • A dbt macro is implemented to remove leading numbers from the field column in the Formacode data. This macro could be leveraged in the future to remove leading digits in generic_term.
    • An environment variable DBT_ENVIRONMENT controls data limiting in the stg_courses model (development: limited to 100 rows; production: full dataset). This can be overridden via the local variable limit_data, and further by passing false for that variable in the command bar.
    • Tests are included in the dbt project to ensure data integrity. Warnings show that a few formacodes listed in courses are not part of the formacode dimension; these can be ignored for now.
    • The project includes both development and production environments, with Monthly job creation and CI/CD job for deployment.
  4. Data Warehouse (BigQuery):

    • BigQuery is used as the data warehouse, with tables partitioned and clustered for optimal query performance.
    • Local SQL queries are provided for data verification and reconciliation at different steps.

Dashboard

Tool: Google Data Studio

The dashboard includes two tiles:

  1. Distribution of Course Categories: A graph showing the distribution of courses across different Formacode categories, providing insights into the areas with trainings and providers. A KPI is introduced here: the Trainings Provider Ratio (TPR), the ratio of the total number of trainings to the total number of providers. If 10 providers offer 150 bootcamps for a technology, then TPR = 150/10 = 15. The interpretation is:

    • LOW: might indicate low demand (few trainings) or overcrowding (too many providers).
    • HIGH: might indicate oversaturation (too many trainings) or little competition (few providers). We suggest looking at TPR values between 5 and 15.
  2. Monthly Enrollment Trends: A line graph illustrating the monthly enrollment trends for selected courses or providers, highlighting temporal patterns and growth.

  • A screenshot of the visualization after selecting the formacode_field as IT and information systems and training_provider_ratio between 5 and 15 can be found here.
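
The TPR computation and the suggested band are simple enough to express directly; this is a sketch, and the function names are mine, not the dashboard's.

```python
def trainings_provider_ratio(total_trainings: int, total_providers: int) -> float:
    """Dashboard KPI: total number of trainings divided by total providers."""
    if total_providers == 0:
        raise ValueError("TPR is undefined when there are no providers")
    return total_trainings / total_providers

def in_suggested_band(tpr: float, low: float = 5, high: float = 15) -> bool:
    """The suggested focus range for TPR is between 5 and 15."""
    return low <= tpr <= high

tpr = trainings_provider_ratio(150, 10)  # the example above: 150 / 10 = 15.0
print(tpr, in_suggested_band(tpr))       # -> 15.0 True
```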

Reproducibility

Automation (Recommended)

This section outlines the automated setup using a Bash script, Terraform, and Kestra. If you prefer manual setup, see the Manual section below.

Prerequisites:

  • Google Cloud Platform (GCP) Account:

    • A GCP project with billing enabled. The script will handle service account creation, IAM role assignments, and resource provisioning.
  • Docker Desktop:

    • Docker Desktop installed and running. This is required for Kestra's containerized setup.
  • Google Cloud SDK (gcloud CLI):

    • Install the Google Cloud SDK, which includes the gcloud command-line tool.
    • Follow the installation instructions for your operating system: https://cloud.google.com/sdk/docs/install
    • After installation, initialize the gcloud CLI by running:
      gcloud init

Steps to Run:

  1. Clone the Repository:

    git clone git@github.com:jugnuarora/france_courses_enrollments.git
    cd france_courses_enrollments
  2. Execute the Setup Script:

    • In your terminal, run the setup script:
      ./setup.sh
    • Follow the on-screen instructions. The script will:
      • Authenticate with GCP.
      • Create necessary service accounts and assign IAM roles.
      • Deploy infrastructure using Terraform (GCS buckets, BigQuery datasets).
      • Start Kestra using Docker.
      • Import Kestra workflows.
      • Trigger the Kestra flow execution.
      • Finally, destroy all the Terraform resources it created.
  3. Verify Data in BigQuery: This step is for reconciliation.

    • The setup script will generate a new_local_queries.sql file. Import this file into the BigQuery console.
    • Execute the following queries for reconciliation:
      • Courses: Queries 1, 4, and 10.
      • Enrollments: Queries 11, 14, and 20.
      • Formacode: Queries 21 and 27.
    • Verify the final table counts:
      • Courses (Query 9): ~195K records
      • Enrollments (Query 19): ~181K records
      • Formacode (Query 26): ~3379 records
  4. Visualize in Looker Studio:

    • You can access the visualization here.
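
Since the expected counts above are approximate ("~195K"), the reconciliation can also be automated with a small helper. This is a sketch assuming the google-cloud-bigquery client library, Application Default Credentials, and the source_tables dataset named in the manual setup; the project ID is whatever your own setup uses.

```python
EXPECTED = {"courses": 195_000, "enrollments": 181_000, "formacode": 3_379}

def within_tolerance(actual: int, expected: int, rel: float = 0.10) -> bool:
    """The documented counts are approximate, so compare with a relative tolerance."""
    return abs(actual - expected) <= rel * expected

def verify_counts(project: str, dataset: str = "source_tables") -> dict:
    """Query BigQuery for each table's row count and check it against EXPECTED."""
    from google.cloud import bigquery  # assumes google-cloud-bigquery is installed
    client = bigquery.Client(project=project)
    results = {}
    for table, expected in EXPECTED.items():
        query = f"SELECT COUNT(*) AS n FROM `{project}.{dataset}.{table}`"
        n = next(iter(client.query(query).result())).n
        results[table] = (n, within_tolerance(n, expected))
    return results
```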

Manual

Prerequisites:

  • GCP account with BigQuery and GCS enabled.
  • Kestra instance running.
  • dbt Cloud account with BigQuery connection.
  • Service account keys for GCP authentication.

Steps to Run:

  1. Clone the Repository:

    git clone git@github.com:jugnuarora/france_courses_enrollments.git
    cd france_courses_enrollments
  2. Setting up the Cloud:

    Step 1: Create a project on Google Cloud. I named my project france-courses-enrollments, but it can be different.

    Step 2: Create a service account via IAM. I kept the service account name as france-courses-enrollments, but it can be different. Select the roles Storage Admin and BigQuery Admin. Also generate the JSON key and store it safely for the later connections to GCS and BigQuery.

    Step 3: Add a billing account to this project.

    Step 4: Create a Cloud Storage bucket. Mine is jugnu-france-course-enrollments, but it can be different. Select a suitable region; I selected europe-west1 (Belgium).

  3. Configure Kestra:

    STEP 1: Run docker compose up. Access Kestra on localhost:8080.

    STEP 2: Import the Kestra workflows

    . 01_gcp_kv.yaml
    
    . 02_formacode_pipeline.yaml
    
    . 03_courses_enrollments_pipeline.yaml
    
    . 04_dbt_execution.yaml
    

    STEP 3: Execute 01_gcp_kv.yaml to set up the key-value pairs. Later you can modify them with values that correspond to your set-up by going to Namespaces, selecting france-courses-enrollments, and then selecting KV Store. You will need the following key-value pairs:

    . GCP_CREDS - The same content as the JSON key file generated from Google Cloud.

    . GCP_DATASET - The name of the dataset (database) in BigQuery.

    . GCP_BUCKET_NAME - The name of the bucket in Google Cloud Storage.

    . GCP_PROJECT_ID - The project ID automatically generated when creating the new project on Google Cloud.

    . GCP_LOCATION - I chose europe-west1.
    
  4. Bigquery Set-up:

    Make sure you have the following datasets in BigQuery:

    . source_tables
    
    . external
    

    These are required for the Kestra workflows to generate the source tables.

  5. Configure dbt Cloud:

    Step 1: Create a new account france-market-research.

    Step 2: Create a new connection to BigQuery using the generated JSON key. Please note that the dataset location is europe-west1 (or whatever you specified for the warehouse and data lake set-up).

    Step 3: Create a new project, give the details of the repository, and set the project subdirectory to dbt. Add the repository using GitHub. The project subfolder is a must, since the whole dbt project resides in the dbt folder of the repository.

    Step 4: Create a development environment. Please note that the dataset location is europe-west1 (or as per your GCP set-up) and the connection is BigQuery. The dataset name is dbt_models, but it can be different.

    Step 5: Create the environment variable DBT_ENVIRONMENT, where the project default is production and the development value is development. Later, when we create the production environment, we will add the value for that environment as well.

    Step 6: Now, you can access the contents of the project repository in Develop Cloud IDE.

    Step 7: In stg_courses there is a variable limit_data, which is true if the environment variable DBT_ENVIRONMENT is development. To override it in development, run dbt run --vars '{"limit_data": false}'. Otherwise, by default in development, it will return only 100 rows.

    Step 8: Set up the Production environment.

    Step 9: Create a new job Monthly, which runs in the production environment on a cron schedule at 22:00 UTC on the 5th of every month.

    Step 10: Create a new CI/CD job deploy_dbt_prod, which runs as soon as there is a merge into the main branch of the associated GitHub repo. See the screenshot.

  6. Run Kestra Workflows:

    • Trigger the Kestra workflow 02_formacode_pipeline.yaml to start the data ingestion and processing pipeline for Formacode. This will take a while to execute (roughly 45-50 minutes) because of the translation.

    • Trigger the Kestra workflow 03_courses_enrollments_pipeline.yaml to start the data ingestion and processing pipelines for courses and enrollments. Execute it to generate the courses source table and the enrollments source table. The courses source table is partitioned by data_extracted and clustered by code_formacode_1. The enrollments table is partitioned by year_month and clustered by provider.

  7. Verify Data:

    • To reconcile the tables, some local queries are provided that can be run in BigQuery. Import local_queries.sql.
    • Execute Queries 1, 4 and 10 to see the reconciliation table for courses.
    • Execute Queries 11, 14 and 20 to see the reconciliation table for enrollments.
    • Execute Queries 21 and 27 to see the reconciliation table for formacode.
    • Run the provided SQL queries (local_queries) to verify the data. The final tables should have the counts below:
      • courses (Query 9): ~195K records
      • enrollments (Query 19): ~181K records
      • formacode (Query 26): ~3379 records
  8. Run dbt Models:

    • Execute the dbt models in dbt Cloud to transform the data. You can execute them in one of the ways listed below:
      • Using the dbt command bar: dbt run --vars '{"limit_data": false}'

      • Executing the Monthly job now. The monthly job once finished will generate the docs as well.

  9. Visualize in Dashboard:

    • You can access the visualization here.

Going the Extra Mile (Optional)

  • Tests: Included dbt tests for data integrity.
  • CI/CD: CI/CD pipelines set up for dbt Cloud deployment.
  • Documentation: This README provides detailed instructions and explanations. The production jobs (Monthly or deploy_dbt_prod), once run, also generate the documentation.

Future Improvements

  • Use Terraform for Virtualization
  • Explore real-time data ingestion for enrollment data.
  • Optimize the Formacode translation job.

Folders/Root Files Information

  • dbt- This includes the dbt cloud project.
  • lib- This contains the jar files required for spark job.
  • screenshots- Various screenshots referred to in this README are stored in this folder. These are highlighted as you read through the document.
  • scripts- All python scripts that are needed in Kestra workflows are stored in this folder.
    • 01_courses_data_upload.py- Python script to get courses data through API using dlt and store it in GCS data lake.
    • 02_enrollments_data_upload.py- Python script to get enrollments data through API using dlt and store it in GCS data lake.
    • 03_courses_spark.py- Python Script to modify column types in courses using spark and store it back in GCS data lake.
    • 04_enrollments_spark.py- Python Script to modify column types in enrollments using spark and store it back in GCS data lake.
    • 05_formacode_download.py- Python Script to get the selected columns from the selected file in zip folder for formacode.
    • 06_formacode_sparl_translation.py- Python Script to execute pyspark for the translation of 2 columns.
  • .gitignore - To ensure that no sensitive information or any kind of data is uploaded to GitHub.
  • 01_gcp_kv.yaml- KESTRA workflow for Kestra key value set-up
  • 02_courses_enrollments_pipeline.yaml- KESTRA workflow to get the courses and enrollments data.
  • 03_formacode_pipeline.yaml- KESTRA workflow to get the formacode data.
  • docker-compose.yml- To run Kestra in docker container.
  • local_queries.sql- SQL queries that can be used to verify the BigQuery tables.
  • requirements.txt- Required packages to execute the python scripts.

About

Data pipeline for France courses enrollments. Every month the providers report the enrollments in their programs. The idea is to fetch the listed courses as well as the enrollments every month, look at the trend of enrollments, and compare training providers across the different courses.
