1.2.2 - Ingesting NY Taxi Data to Postgres
Last updated Jan 19, 2025
Youtube Video | ~29min
I recommend going through this entire video without pausing your workflow. You may run into a number of issues related to pgcli, so allow extra time here (possibly hours). If you get stuck: search Slack, search the FAQ, try a new virtual environment, or ask for help.
The Taxi TLC data website now provides data in .parquet format instead of .csv. The website link gives directions on how to read .parquet files and convert them to a Pandas data frame. For this course, we want to use the .csv backup located here: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
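Once the backup file is downloaded, it can help to confirm that it is readable before opening it in a notebook. The following is a minimal sketch using only the Python standard library (not the pandas approach used in the video); the function name peek_csv_gz is hypothetical and introduced here only for illustration.

```python
import csv
import gzip
from itertools import islice


def peek_csv_gz(path, n_rows=5):
    """Return the header and the first n_rows data rows of a gzipped CSV."""
    # "rt" opens the gzip stream in text mode so csv.reader receives strings
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)               # first line holds the column names
        rows = list(islice(reader, n_rows))  # read only a few rows, not the whole file
    return header, rows
```

Assuming the file was saved in the current directory under its original name, calling peek_csv_gz("yellow_tripdata_2021-01.csv.gz") returns the column names and a handful of rows without loading the full dataset into memory.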
zones_data - go to https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and click: Taxi Zone Lookup Table (CSV)
Be sure to run this from your terminal (base), in the directory that contains your ny_taxi_postgres_data folder, so that the path to it resolves correctly.
Be sure to run pip install pgcli in (base), not in your environment.
If you are having issues with the above command, try:
conda install -c conda-forge pgcli
pip install -U mycli
Using pgcli to connect to Postgres
-h hostname
-p port
-u username
-d database name
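To show how the four flags fit together, here is a minimal sketch that assembles a pgcli command line in Python. The helper name pgcli_command is hypothetical, and the values localhost, 5432, root, and ny_taxi are assumptions chosen for illustration; substitute whatever your own Postgres container uses.

```python
import shlex


def pgcli_command(host, port, user, database):
    """Build a pgcli command string from the four connection settings."""
    args = ["pgcli", "-h", host, "-p", str(port), "-u", user, "-d", database]
    # shlex.join quotes each argument so the result is safe to paste into a shell
    return shlex.join(args)


# Assumed example values, not taken from these notes
print(pgcli_command("localhost", 5432, "root", "ny_taxi"))
# prints: pgcli -h localhost -p 5432 -u root -d ny_taxi
```

The printed string is what you would type in the terminal; pgcli then prompts for the password.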
If you run into issues, check out this video https://www.youtube.com/watch?v=3IkfkTwqHx4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=6
Be sure Jupyter is installed in your environment. Use pip install jupyter if not. Be sure your .csv taxi data is downloaded locally.
This should open a Jupyter notebook tab in your web browser. Follow along with the YouTube video to finalize your Jupyter notebook.
My repo for this video can be found here
In this video we will learn how to configure and run Postgres in Docker. We will download the NY taxi dataset as a .csv file and read it into a Jupyter notebook. We will also look at the data using pgcli, but will use other tools moving forward.
Now you should have 2 .csv files locally
~minute 6
Terminal
After running, you should see postgres files in your ny_taxi_postgres_data directory
~minute 7
Terminal
You can now explore your dataset in the terminal window (once you have loaded some data)
Terminal
In future videos we use the zone .csv data as well. I'm unsure if this was done in a video, but I added the steps to the Jupyter notebook in my repo.
In 1.2.4 we convert our Python notebook into a Python script and test loading the data that way as well.