🖥️
DE Zoomcamp Notes
Linkedin | Kayla TinkerGithub | Tinker0425Blog | From Clouds to CodeBlueSky | Cloudy Blue Wave
  • Welcome - Data Engineering Zoomcamp 2025 Notes
  • INTRODUCTION
    • Introduction & Set Up
      • Virtual Environments
  • MODULE 1
    • Introduction to Module 1
    • 1.1 - Google Cloud Platform GCP
      • 1.1.1 - Introduction to Google Cloud Platform
    • 1.2 - Docker & Docker-compose
      • 1.2.1 - Introduction to Docker
      • 1.2.2 - Ingesting NY Taxi Data to Postgres
      • 1.2.3 - Connecting pgAdmin and Postgres
      • 1.2.4 - Dockerizing the Ingestion Script
      • 1.2.5 - Running Postgres and pgAdmin with Docker-Compose
      • Docker-Compose Summary
      • 1.2.6 - SQL Refresher
      • Optional Docker Video
    • 1.3 - Setting up infrastructure on GCP with Terraform
      • 1.3.1 - Terraform Primer
      • 1.3.2 - Terraform Basics
      • 1.3.3 - Terraform Variables
    • Homework
  • Module 2
    • Introduction to Module 2
    • 2.1 - Introduction to Orchestration and Kestra
      • 2.1.1 - Workflow Orchestration Introduction
      • 2.1.2 - Learn Kestra
    • 2.2 - ETL Pipelines in Kestra: Detailed Walkthrough
      • 2.2.1 - Create an ETL Pipeline with Postgres in Kestra
      • 2.2.2 - Manage Scheduling and Backfills using Postgres in Kestra
      • 2.2.3 - Transform Data with dbt and Postgres in Kestra
    • 2.3 - ETL Pipelines in Kestra: Google Cloud Platform
      • 2.3.1 - Create an ETL Pipeline with GCS and BigQuery in Kestra
      • 2.3.2 - Manage Scheduling and Backfills using BigQuery in Kestra
      • 2.3.3 - Transform Data with dbt and BigQuery in Kestra
    • Bonus: Deploy to the Cloud
    • Homework
  • Module 3
    • Introduction to Module 3
    • 3.1 - Data Warehouse, Partitioning and Clustering
      • 3.1.1 - Data Warehouse and BigQuery
      • 3.1.2 - Partitioning and Clustering
    • 3.2 - BigQuery Internals and Best Practices
      • 3.2.1 - BigQuery Best Practices
      • 3.2.2 - Internals of Big Query
    • 3.3 - Machine Learning
      • 3.3.1 - BigQuery Machine Learning
      • 3.3.2 - BigQuery Machine Learning Deployment
    • Homework
  • Workshop
    • Workshop Week
    • Homework
  • Module 4
    • Introduction to Module 4
    • 4.1 - DBT the basics
      • 4.1.1 - Analytics Engineering Basics
      • 4.1.2 - What is dbt?
    • 4.2 - Creating your Project
      • 4.2.1 - Set Up Project
      • 4.2.2 - Start Your dbt Project BigQuery and dbt Cloud
      • 4.2.3 - Build the First dbt Models
      • 4.2.4 - Testing and Documenting the Project
    • 4.3 - Deployment & Visualizations
      • 4.3.1 - Deployment Using dbt Cloud
      • 4.3.2 - Visualising the data with Google Data Studio
    • Homework
  • Module 5
    • Introduction to Module 5
    • 5.1 - Install & Intro
      • 5.1.1 - Install
      • 5.1.2 - Intro to Batch Processing
      • 5.1.3 - Intro to Spark
    • 5.2 - Spark SQL and DataFrames
      • 5.2.1 - Spark & PySpark
      • 5.2.2 - Spark Dataframes
      • 5.2.3 - SQL with Spark
    • 5.3 - Spark Internals
      • 5.3.1 - Anatomy of a Spark Cluster
      • 5.3.2 - GroupBy in Spark
      • 5.3.3 - Joins in Spark
    • 5.4 - Running Spark in the Cloud
      • 5.4.1 - Connecting to Google Cloud Storage
      • 5.4.2 - Creating a Local Spark Cluster
      • 5.4.3 - Setting up a Dataproc Cluster
      • 5.4.4 - Connecting Spark to Big Query
    • Homework
  • Module 6
    • Introduction to Module 6
    • 6.1 - Stream Processing
      • 6.1.1 - Introduction
      • 6.1.2 - Intro to stream processing
      • 6.1.3 - What is Kafka?
      • 6.1.4 - Confluent cloud
      • 6.1.5 - Kafka producer consumer
      • 6.1.6 - Kafka configuration
    • Homework
  • Final Project
    • Final Project
    • How To!
      • 1 - Create a Google Cloud Project
      • 2 - API Key and Access Token Setup
      • 3 - Fork This Repo in Github
      • Ready to Run!
    • THE END
Powered by GitBook

Connect

  • Linkedin | Kayla Tinker
  • BlueSky | Cloudy Blue Wave
  • Blog | From Clouds to Code
  • Github | Tinker0425
On this page
  1. MODULE 1
  2. 1.2 - Docker & Docker-compose

1.2.6 - SQL Refresher

Last updated Jan 22, 2025

PreviousDocker-Compose SummaryNextOptional Docker Video

Last updated 4 months ago

Youtube Video | ~16min

In this video we go through a SQL refresher with a focus on JOIN and GROUP BY. This example using the yellow_taxi_data and zones is helpful for the homework set.

SQL Code difference between using JOIN vs. WHERE

SELECT 
	tpep_pickup_datetime, 
	tpep_dropoff_datetime, 
	total_amount, 
	CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pick_up_loc",
	CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "drop_off_loc"
FROM 
	yellow_taxi_data t JOIN zones zpu 
		ON t."PULocationID"=zpu."LocationID" 
	JOIN zones zdo 
		ON t."DOLocationID"=zdo."LocationID"
LIMIT 100;
SELECT 
	tpep_pickup_datetime, 
	tpep_dropoff_datetime, 
	total_amount, 
	CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pick_up_loc",
	CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "drop_off_loc"
FROM 
	yellow_taxi_data t, 
	zones zpu, 
	zones zdo
WHERE 
	t."PULocationID"=zpu."LocationID" AND 
	t."DOLocationID"=zdo."LocationID"
LIMIT 100;

GROUP BY 1 column vs. multi-column

SELECT 
	CAST(tpep_dropoff_datetime AS DATE) as "day", 
	COUNT(1) as "count",
	MAX(total_amount),
	MAX(passenger_count)
FROM 
	yellow_taxi_data t 
GROUP BY
	CAST(tpep_dropoff_datetime AS DATE)
ORDER BY "count" DESC;
SELECT 
	CAST(tpep_dropoff_datetime AS DATE) as "day", 
	"DOLocationID",
	COUNT(1) as "count",
	MAX(total_amount),
	MAX(passenger_count)
FROM 
	yellow_taxi_data t 
GROUP BY
	1, 2
ORDER BY "day" ASC, "DOLocationID" ASC;

Resources

Use an app like to review SQL or Python basics

🔗
Mimo
✍️
https://www.youtube.com/watch?v=QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10&pp=iAQB
Page cover image