1.2.2 - Ingesting NY Taxi Data to Postgres
Last updated Jan 19, 2025
YouTube Video | ~29 min
⛔ I recommend watching this entire video before pausing to work through the steps yourself. You may run into a number of issues related to pgcli, so allow extra time here (maybe hours). Search Slack, search the FAQ, try a new virtual environment, and ask for help.
✍️ In this video we learn how to configure and run Postgres in Docker. We download the NY taxi dataset as CSV files and read them into a Jupyter notebook. We also look at the data using pgcli, but will use other options moving forward.
Datasets
trip_data - yellow_tripdata_2021-01.csv (Yellow Taxi Trip Records for January 2021; this is the file the notebook below reads)
zones_data - go to https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and click:
Taxi Zone Lookup Table (CSV)
✅ Now you should have 2 .csv files locally
Running Postgres w/ Docker (Mac)
📼 ~minute 6
▪️ Terminal
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:13
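A quick rundown of the flags: -e sets environment variables inside the container (here the Postgres user, password, and database name), -v mounts the local ny_taxi_postgres_data folder into the container so the database files persist across restarts, -p 5432:5432 maps the container's Postgres port to the same port on your machine, and postgres:13 is the image and tag to run.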
✅ After running, you should see postgres files in your ny_taxi_postgres_data directory
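If you want to confirm Postgres is accepting connections before moving on, here's a minimal check from Python (a sketch, assuming you have sqlalchemy and psycopg2-binary installed; the credentials match the docker run flags above):

from sqlalchemy import create_engine, text

# connection details come from the docker run flags above
engine = create_engine("postgresql+psycopg2://root:root@localhost:5432/ny_taxi")
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if Postgres is reachable

Alternatively, skip ahead and test the connection with pgcli below.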
PGCLI
📼 ~minute 7
▪️ Terminal
pip install pgcli
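If pip install fails or pgcli misbehaves later (see the warning at the top of this page), installing it inside a fresh virtual environment often helps:
▪️ Terminal
python3 -m venv .venv
source .venv/bin/activate
pip install pgcli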
Using pgcli to connect to Postgres:
pgcli -h localhost -p 5432 -u root -d ny_taxi
-h - hostname
-p - port
-u - username
-d - database name
✅ You can now explore your dataset in the terminal window (once you have some data loaded)
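Once you've loaded data (see the notebook below), a few standard psql-style commands that pgcli also supports:

\dt
\d yellow_taxi_data
SELECT count(1) FROM yellow_taxi_data;

\dt lists the tables in the database, \d shows a table's schema, and the SELECT counts the ingested rows.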
Jupyter Notebook
▪️ Terminal
jupyter notebook
This should open a Jupyter notebook tab in your web browser. Follow along with the YouTube video to build out your notebook.
import pandas as pd
from sqlalchemy import create_engine
from time import time
df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=100)
print(pd.io.sql.get_schema(df, name="yellow_taxi_data"))  # generic DDL
# parse the datetime columns (read_csv loads them as plain text)
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
conn_string = "postgresql+psycopg2://root:root@localhost:5432/ny_taxi"
engine = create_engine(conn_string)
engine.connect()
print(pd.io.sql.get_schema(df, name="yellow_taxi_data", con=engine))  # DDL rendered in Postgres' dialect
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)
df = next(df_iter)
len(df)
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
# create the table with only the header (no rows), replacing it if it already exists
df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
# insert the first chunk and time it
%time df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
while True:
    try:
        t_start = time()
        df = next(df_iter)
    except StopIteration:
        print('finished ingesting all chunks')
        break
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
    t_end = time()
    print('inserted another chunk...took %.3f seconds' % (t_end - t_start))
# my addition for zones
df_zones = pd.read_csv('taxi_zone_lookup.csv')
df_zones.head()
df_zones.to_sql(name='zones', con=engine, if_exists='replace')
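As a quick sanity check (my addition, not something shown in the video), you can query both tables back from the notebook:

# row counts for both tables; assumes the ingestion cells above completed
print(pd.read_sql("SELECT count(1) FROM yellow_taxi_data;", con=engine))
print(pd.read_sql("SELECT count(1) FROM zones;", con=engine))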
⚒️ In future videos we use the zones CSV data as well. I'm not sure whether loading it was shown in a video, but I added the steps to the Jupyter notebook in my repo.
👀 In 1.2.4 we convert our Python notebook into a Python script and test loading the data that way as well.
Resources
My repo for this video can be found here
📚 https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/