4 changes: 4 additions & 0 deletions .gitignore
@@ -5,3 +5,7 @@
*.tfstate.*
**.terraform
**.terraform.lock.*
**google_credentials.json
**logs/
**.env
**__pycache__/
19 changes: 7 additions & 12 deletions project/README.md
@@ -1,14 +1,9 @@
(TBD)

### Project Modules
1. `terraform`: Creates project infrastructure (GCS & BigQuery)

2. Batch:
* `airflow`: Workflows (DAGs) for ingestion (extraction) of raw data to Data Lake (GCS)
* `spark`: Transformation of Raw Data (GCS) to DWH (BigQuery)
* `Docker` config to containerize Airflow & Spark

3. `dbt`: Workflows to transform DWH data to queryable views

4. Streaming:
* `kafka`:

* `terraform`: Creates project infrastructure (GCS & BigQuery)
* `docker` config to containerize Postgres, Airflow & Spark
* `airflow`: Workflows (DAGs) for ingestion (extraction) of raw data to Data Lake (GCS) & DWH (BigQuery)
* `dbt`: Workflows to transform DWH data to queryable views
* `spark`: Transformation of Raw Data (GCS) to DWH (BigQuery), orchestrated by Airflow
* `kafka`: Ingestion of streaming data
63 changes: 0 additions & 63 deletions project/terraform/README.md

This file was deleted.

56 changes: 44 additions & 12 deletions week_1_basics_n_setup/1_terraform_gcp/1_terraform_overview.md
@@ -1,18 +1,50 @@
(In Draft mode)

## Terraform Overview

### Concepts

1. Introduction
2. TF state & backend
3. Google Provider as source
* modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table
4. Code: main, resources, variables, locals, outputs
5. Demo
* GCP CLI client (gcloud) - setup & auth
* tf init, plan & apply
#### Introduction
1. What is Terraform?
* open-source tool by HashiCorp, used for provisioning infrastructure resources
* supports DevOps best practices for change management
* Managing configuration files in source control to maintain an ideal provisioning state
for testing and production environments

2. What is IaC?
* Infrastructure-as-Code
* build, change, and manage your infrastructure in a safe, consistent, and repeatable way
by defining resource configurations that you can version, reuse, and share.

3. Some advantages
* Infrastructure lifecycle management
* Version control commits
* Very useful for stack-based deployments, with cloud providers such as AWS, GCP, and Azure, and with platforms like Kubernetes
* State-based approach to track resource changes throughout deployments


#### Files
* `main.tf`
* `variables.tf`
* Optional: `resources.tf`, `output.tf`
* `.tfstate`

#### Declarations
* `terraform`
* `backend`: state
* `provider`:
* adds a set of resource types and/or data sources that Terraform can manage
* The Terraform Registry is the main directory of publicly available providers from most major infrastructure platforms.
* `resource`
* blocks to define components of your infrastructure
* Project modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table
* `variable` & `locals`
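
A minimal sketch of how these declarations fit together in a `main.tf` (the names and values here are illustrative, not the course's exact configuration):

```hcl
terraform {
  required_version = ">= 1.0"
  backend "local" {}  # stores .tfstate locally; switch to "gcs" for remote state
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = var.project
  region  = var.region
}

locals {
  data_lake_bucket = "dtc_data_lake"  # illustrative local value
}

variable "project" {
  description = "Your GCP Project ID"
  type        = string
}
```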


#### Execution steps
1. `terraform init`: Initialize & install
2. `terraform plan`: Match changes against the previous state
3. `terraform apply`: Apply changes to cloud
4. `terraform destroy`: Remove your stack from cloud


### Workshop
Continue [here](../../project/terraform): `data-engineering-zoomcamp/project/terraform`
### Terraform Workshop for GCP Infra
Continue [here](../terraform): `week_1_basics_n_setup/1_terraform_gcp/terraform`
43 changes: 38 additions & 5 deletions week_1_basics_n_setup/1_terraform_gcp/2_gcp_overview.md
@@ -1,9 +1,42 @@
(In Draft mode)

## GCP Overview

## Tools & Tech
- Cloud Storage
- BigQuery
### Project infrastructure modules in GCP:
* Google Cloud Storage (GCS): Data Lake
* BigQuery: Data Warehouse

(Concepts explained in Week 2 - Data Ingestion)

### Initial Setup

For this course, we'll use the free tier (up to EUR 300 in credits).

1. Create an account with your Google email ID
2. Setup your first [project](https://console.cloud.google.com/)
* e.g. "DTC DE Course", and note down the "Project ID"
3. Setup [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project
* Grant `Viewer` role to begin with.
* Download service-account-keys (.json) for auth.
4. Download [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup
5. Set environment variable to point to your downloaded GCP keys:
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token, and verify authentication
gcloud auth application-default login
```

### Setup for Access

1. [IAM Roles](https://cloud.google.com/storage/docs/access-control/iam-roles) for Service account:

Viewer + Storage Admin + Storage Object Admin + BigQuery Admin

2. Enable these APIs for your project:
* https://console.cloud.google.com/apis/library/iam.googleapis.com
* https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com

3. Please ensure `GOOGLE_APPLICATION_CREDENTIALS` env-var is set.
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
```
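
As an aside, the role bindings from step 1 could also be managed with Terraform itself via `google_project_iam_member` resources — a hypothetical sketch, not part of the course's config (the `service_account_email` variable is assumed):

```hcl
resource "google_project_iam_member" "storage_admin" {
  project = var.project
  role    = "roles/storage.admin"
  # hypothetical variable holding the service account's email
  member  = "serviceAccount:${var.service_account_email}"
}
```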

31 changes: 11 additions & 20 deletions week_1_basics_n_setup/1_terraform_gcp/README.md
@@ -1,26 +1,17 @@
(In Draft mode)

## Local Setup for Terraform and GCP

### Terraform

Installation: https://www.terraform.io/downloads
### Pre-Requisites
1. Terraform client installation: https://www.terraform.io/downloads
2. Cloud Provider account: https://console.cloud.google.com/

### GCP
### Terraform Concepts
[Terraform Overview](1_terraform_overview.md)

For this course, we'll use the free tier (up to EUR 300 in credits).
### GCP setup

1. Create an account with your Google email ID
2. Setup your first [project](https://console.cloud.google.com/), eg. "DTC DE Course", and note down the "Project ID"
3. Setup [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project, and download auth-keys (.json).
4. Download [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup
5. Set environment variable to point to your downloaded GCP auth-keys:
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token, and verify authentication
gcloud auth application-default login
```
1. [First-time setup](2_gcp_overview.md#initial-setup)
2. [IAM / Access specific to this course](2_gcp_overview.md#setup-for-access)

### Workshop
Continue [here](../../project/terraform): `data-engineering-zoomcamp/project/terraform`
### Terraform Workshop for GCP Infra
Continue [here](terraform).
`week_1_basics_n_setup/1_terraform_gcp/terraform`
23 changes: 23 additions & 0 deletions week_1_basics_n_setup/1_terraform_gcp/terraform/README.md
@@ -0,0 +1,23 @@

### Execution

```shell
# Refresh service-account's auth-token for this session
gcloud auth application-default login

# Initialize state file (.tfstate)
terraform init

# Check changes to new infra plan
terraform plan -var="project=<your-project-id>"
```

```shell
# Create new infra
terraform apply -var="project=<your-project-id>"
```

```shell
# Delete infra after your work, to avoid costs on any running services
terraform destroy
```
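
The `-var="project=<your-project-id>"` flag above supplies a value for a variable that the configuration presumably declares along these lines (a sketch, not the exact file contents):

```hcl
variable "project" {
  description = "Your GCP Project ID, passed at plan/apply time via -var"
  type        = string
}
```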
@@ -15,6 +15,7 @@ provider "google" {
}

# Data Lake Bucket
# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket
resource "google_storage_bucket" "data-lake-bucket" {
name = "${local.data_lake_bucket}_${var.project}" # Concatenating DL bucket & Project name for unique naming
location = var.region
@@ -39,24 +40,10 @@ resource "google_storage_bucket" "data-lake-bucket" {
force_destroy = true
}

// In-Progress
//
//# DWH
//resource "google_bigquery_dataset" "dataset" {
// dataset_id = var.BQ_DATASET
//}
//
//# May not be needed if covered by DBT
//resource "google_bigquery_table" "table" {
// dataset_id = google_bigquery_dataset.dw.dataset_id
// table_id = var.TABLE_NAME[count.index]
// count = length(var.TABLE_NAME)
//
// external_data_configuration {
// autodetect = true
// source_format = "CSV"
// source_uris = [
// "gs://${var.BUCKET_NAME}/dw/${var.TABLE_NAME[count.index]}/*.csv"
// ]
// }
//}
# DWH
# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset
resource "google_bigquery_dataset" "dataset" {
dataset_id = var.BQ_DATASET
project = var.project
location = var.region
}
@@ -12,13 +12,13 @@ variable "region" {
type = string
}

# Not needed for now
variable "bucket_name" {
description = "The name of the Google Cloud Storage bucket. Must be globally unique."
default = ""
}

variable "storage_class" {
description = "Storage class type for your bucket. Check official docs for more info."
default = "STANDARD"
}

variable "BQ_DATASET" {
description = "BigQuery Dataset that raw data (from GCS) will be written to"
type = string
default = "trips_data_all"
}

This file was deleted.

3 changes: 0 additions & 3 deletions week_1_basics_n_setup/2_docker_airflow/README.md

This file was deleted.

3 changes: 3 additions & 0 deletions week_1_basics_n_setup/2_docker_postgres_sql/README.md
@@ -0,0 +1,3 @@
(In Draft mode)

## Setup Postgres Env with Docker
12 changes: 7 additions & 5 deletions week_1_basics_n_setup/README.md
@@ -1,13 +1,15 @@
(In Draft mode)
### Architecture
[images/architecture/](images/architecture/)

## Technologies
#### Technologies used throughout this course

* Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
* Google Cloud Storage (GCS): Data Lake
* BigQuery: Data Warehouse
* Terraform: Infrastructure-as-Code (IaC), to create project infra on Google Cloud Platform
* Docker: Containerization, for Airflow environment
* Docker: Containerized environment for resources such as Postgres
* SQL: Data Analysis & Exploration
* Airflow: Pipeline Orchestration tool
* Spark: Distributed Processing
* DBT: Data Transformation tool
* SQL: Analysis
* Spark: Distributed Processing
* Kafka: Streaming
File renamed without changes.
1 change: 1 addition & 0 deletions week_2_data_ingestion/README.md
@@ -0,0 +1 @@
* Airflow: Pipeline Orchestration tool