DE Zoomcamp Notes
1.2.2 - Ingesting NY Taxi Data to Postgres

Last updated Jan 19, 2025


YouTube Video | ~29min: https://www.youtube.com/watch?v=2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5&pp=iAQB

In this video we learn how to configure and run Postgres in Docker. We download the NY taxi dataset as a csv file and read it into a Jupyter notebook. We also look at the data using pgcli, but will use other options moving forward.

I recommend going through this entire video in one sitting rather than pausing your workflow partway. You may run into a number of issues related to pgcli, so allow extra time here (maybe hours). Search Slack, search the FAQ, try a new virtual environment, and ask for help.

The NYC TLC now provides data in .parquet format instead of .csv. The TLC data website (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) gives directions on how to read .parquet files and convert them to a pandas DataFrame. For this course, we want to use the .csv backup located here: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
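
If you do want to try the parquet route instead, pandas can read it directly. A minimal sketch, assuming pyarrow is installed and using a hypothetical local file name:

import pandas as pd

# Read a TLC parquet file into a DataFrame (requires: pip install pyarrow)
# The file name below is hypothetical - use whatever you downloaded from the TLC site
df = pd.read_parquet('yellow_tripdata_2021-01.parquet')
print(df.head())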


Datasets

yellow_trip_data - https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz

zones_data - go to the TLC data website (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and click: Taxi Zone Lookup Table (CSV)

Now you should have 2 .csv files locally (gunzip the .csv.gz after downloading).

Running Postgres w/ Docker (Mac)

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13

Be sure to run this from your terminal on (base), and run it from the correct directory so that $(pwd)/ny_taxi_postgres_data points to your ny_taxi_postgres_data folder correctly.

After the container starts, you should see Postgres files appear in your ny_taxi_postgres_data directory (~minute 6 in the video).
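
To sanity-check that the container is up before moving on, you can try opening a connection from Python. This is just a quick sketch, assuming sqlalchemy and psycopg2-binary are installed in your environment:

# Quick connectivity check against the Dockerized Postgres above
# (pip install sqlalchemy psycopg2-binary)
from sqlalchemy import create_engine, text

engine = create_engine('postgresql+psycopg2://root:root@localhost:5432/ny_taxi')
with engine.connect() as conn:
    print(conn.execute(text('SELECT 1')).scalar())  # prints 1 if Postgres is reachable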


PGCLI

pip install pgcli

Be sure to run pip install pgcli on (base), not in your environment.

If you are having issues with the above command, try:

conda install -c conda-forge pgcli
pip install -U mycli

Using pgcli to connect to Postgres:

pgcli -h localhost -p 5432 -u root -d ny_taxi

-h hostname, -p port, -u username, -d database name

You can now explore your dataset in the terminal window (once you have some data loaded, ~minute 7 in the video).

If you run into issues, check out this video: https://www.youtube.com/watch?v=3IkfkTwqHx4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=6
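
Once connected, a couple of commands are enough to look around (the SELECT assumes the yellow_taxi_data table created by the ingestion further down):

\dt
SELECT COUNT(1) FROM yellow_taxi_data;

\dt lists the tables in the database, and the count should match the number of rows you ingested.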


Jupyter Notebook

Be sure jupyter is installed in your environment. Use pip install jupyter if not. Be sure your .csv taxi data is downloaded locally.

jupyter notebook

This should open a Jupyter notebook tab in your web browser. Follow along with the YouTube video to build out your notebook.

import pandas as pd
from sqlalchemy import create_engine
from time import time

# Peek at the first 100 rows and print the DDL pandas would generate
df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=100)
print(pd.io.sql.get_schema(df, name='yellow_taxi_data'))  # ddl

# The datetime columns are read as text; convert them to timestamps
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

conn_string = "postgresql+psycopg2://root:root@localhost:5432/ny_taxi"
engine = create_engine(conn_string)
engine.connect()  # raises if Postgres is not reachable

# With a connection, the DDL is generated in the Postgres dialect
print(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine))  # ddl

# The full file is too large to insert at once, so read it in chunks of 100k rows
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)
df = next(df_iter)
len(df)

df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

# Create the (empty) table from the header only, then append the first chunk
df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
%time df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')  # %time is a Jupyter magic

# Insert the remaining chunks; next() raises StopIteration when the iterator is exhausted
while True:
    try:
        df = next(df_iter)
    except StopIteration:
        print('completed')
        break
    t_start = time()
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
    t_end = time()
    print('inserted another chunk...took %.3f seconds' % (t_end - t_start))

# my addition for zones
df_zones = pd.read_csv('taxi_zone_lookup.csv')
df_zones.head()
df_zones.to_sql(name='zones', con=engine, if_exists='replace')
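
Once the loop finishes, you can verify the ingestion from the same notebook. A minimal check, reusing the engine created above:

# Sanity check: count the rows in both tables
print(pd.read_sql('SELECT COUNT(1) FROM yellow_taxi_data;', con=engine))
print(pd.read_sql('SELECT COUNT(1) FROM zones;', con=engine))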

Resources

My repo for this video can be found here: https://github.com/Tinker0425/de-zoomcamp-my-work/tree/master/module-01/docker/video_2

See also the Docker blog post on the Postgres official image: https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/


In future videos we use the zones csv data as well. I'm unsure if this was done in a video, but I added the steps in my repo's Jupyter notebook.

In 1.2.4 we convert our Python notebook into a Python script and test loading the data that way as well.
