1.2.2 - Ingesting NY Taxi Data to Postgres
Last updated Jan 19, 2025
YouTube Video | ~29 min
⛔ I recommend watching this entire video before pausing to work through the steps yourself. You may run into a number of issues related to pgcli, so allow extra time here (maybe hours). Search Slack, search the FAQ, try a new virtual environment, and ask for help.
✍️ In this video we learn how to configure and run Postgres in Docker. We download the NY taxi dataset as CSV files and read them into a Jupyter notebook. We also look at the data using pgcli, but will use other options moving forward.
Datasets
trip_data - yellow_tripdata_2021-01.csv (Yellow Taxi Trip Records for January 2021; this is the file the notebook below reads)
zones_data - go to https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and click:
Taxi Zone Lookup Table (CSV)
✅ Now you should have 2 .csv files locally
Running Postgres w/ Docker (Mac)
📼 ~minute 6
▪️ Terminal
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:13
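A quick rundown of the flags: -e sets environment variables inside the container (here the Postgres user, password, and database name), -v mounts the local ny_taxi_postgres_data folder into the container so the database files persist across restarts, -p 5432:5432 maps the container's Postgres port to the same port on your machine, and postgres:13 is the image and tag to run.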
✅ After running, you should see postgres files in your ny_taxi_postgres_data directory
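If you want to confirm Postgres is accepting connections before moving on, here's a minimal check from Python (a sketch, assuming you have sqlalchemy and psycopg2-binary installed; the credentials match the docker run flags above):

from sqlalchemy import create_engine, text

# connection details come from the docker run flags above
engine = create_engine("postgresql+psycopg2://root:root@localhost:5432/ny_taxi")
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if Postgres is reachable

Alternatively, skip ahead and test the connection with pgcli below.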
PGCLI
📼 ~minute 7
▪️ Terminal
pip install pgcli
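If pip install fails or pgcli misbehaves later (see the warning at the top of this page), installing it inside a fresh virtual environment often helps:
▪️ Terminal
python3 -m venv .venv
source .venv/bin/activate
pip install pgcli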
Using pgcli to connect to Postgres:
pgcli -h localhost -p 5432 -u root -d ny_taxi
-h - hostname
-p - port
-u - username
-d - database name
✅ You can now explore your dataset in the terminal window (once you have some data loaded)
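Once you've loaded data (see the notebook below), a few standard psql-style commands that pgcli also supports:

\dt
\d yellow_taxi_data
SELECT count(1) FROM yellow_taxi_data;

\dt lists the tables in the database, \d shows a table's schema, and the SELECT counts the ingested rows.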
Jupyter Notebook
▪️ Terminal
jupyter notebook
This should open a Jupyter notebook tab in your web browser. Follow along with the YouTube video to build out your notebook.
import pandas as pd
from sqlalchemy import create_engine
from time import time
df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=100)
print(pd.io.sql.get_schema(df, name="yellow_taxi_data"))  # generic DDL
# parse the datetime columns (read_csv loads them as plain text)
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
conn_string = "postgresql+psycopg2://root:root@localhost:5432/ny_taxi"
engine = create_engine(conn_string)
engine.connect()
print(pd.io.sql.get_schema(df, name="yellow_taxi_data", con=engine))  # DDL rendered in Postgres' dialect
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)
df = next(df_iter)
len(df)
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
# create the table with only the header (no rows), replacing it if it already exists
df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
# insert the first chunk and time it
%time df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
while True:
    try:
        t_start = time()
        df = next(df_iter)
    except StopIteration:
        print('finished ingesting all chunks')
        break
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
    t_end = time()
    print('inserted another chunk...took %.3f seconds' % (t_end - t_start))
# my addition for zones
df_zones = pd.read_csv('taxi_zone_lookup.csv')
df_zones.head()
df_zones.to_sql(name='zones', con=engine, if_exists='replace')
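As a quick sanity check (my addition, not something shown in the video), you can query both tables back from the notebook:

# row counts for both tables; assumes the ingestion cells above completed
print(pd.read_sql("SELECT count(1) FROM yellow_taxi_data;", con=engine))
print(pd.read_sql("SELECT count(1) FROM zones;", con=engine))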
⚒️ In future videos we use the zones CSV data as well. I'm not sure whether loading it was shown in a video, but I added the steps to the Jupyter notebook in my repo.
👀 In 1.2.4 we convert our Python notebook into a Python script and test loading the data that way as well.
Resources
My repo for this video can be found here
📚 https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/