> For the complete documentation index, see [llms.txt](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-1/1.2-docker-and-docker-compose/1.2.2-ingesting-ny-taxi-data-to-postgres.md).

# 1.2.2 - Ingesting NY Taxi Data to Postgres

Youtube Video | \~29min

:no\_entry:  I recommend not pausing your workflow and going through this entire video. You may run into a number of issues related to pgcli, so allow for extra time spent here (maybe hours). Search Slack, Search FAQ, try a new virtual environment, ask for help.

{% embed url="<https://www.youtube.com/watch?v=2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5&pp=iAQB>" %}
<https://www.youtube.com/watch?v=2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5&pp=iAQB>
{% endembed %}

{% hint style="info" %}
:hammer\_pick: The Taxi [TLC data website](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) now provides data in `.parquet` format instead of `.csv`. The website [link](https://www1.nyc.gov/assets/tlc/downloads/pdf/working_parquet_format.pdf) gives directions on how to read `.parquet` files and convert it to Pandas data frame. For this course, we want to use the `.csv`backup located [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz): <https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz>
{% endhint %}

:writing\_hand: In this video we will learn how to configure and run <mark style="background-color:purple;">Postgres</mark> in <mark style="background-color:blue;">Docker</mark>. We will download the taxi NY dataset as a csv file and read it into a jupyter notebook. We will also look at the data using pgcli, but will use other options moving forward.&#x20;

***

### Datasets

1. yellow\_trip\_data - <https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz>
2. zones\_data - <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page> and click:
   * [Taxi Zone Lookup Table](https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv) (CSV)

:white\_check\_mark: Now you should have 2 .csv files locally

***

{% embed url="<https://www.docker.com/wp-content/uploads/2022/09/postgres-official-docker-image-900x600-1.png>" %}
<https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/>
{% endembed %}

### Running Postgres w/ Docker (Mac)

:vhs: \~minute 6

:black\_small\_square: Terminal

```sh
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13
```

{% hint style="info" %}
Be sure to run this from your terminal (base) and that you run it in the correct directory to point to your ny\_taxi\_postgres\_data folder correctly
{% endhint %}

:white\_check\_mark: After running, you should see postgres files in your ny\_taxi\_postgres\_data directory

***

### PGCLI

:vhs: \~minute 7&#x20;

:black\_small\_square: Terminal

```bash
pip install pgcli
```

{% hint style="info" %}
Be sure to run pip install pgcli on (base) not your environment.&#x20;

If you are having issues with the above command, try:\
`conda install -c conda-forge pgcli`\
`pip install -U mycli`
{% endhint %}

Using `pgcli` to connect to <mark style="background-color:purple;">Postgres</mark>

```bash
pgcli -h localhost -p 5432 -u root -d ny_taxi
```

`h`hostname       `p` port         `u` username      `d` database name

:white\_check\_mark: You can now explore your dataset in therteminal window (once you have some)

{% hint style="info" %}
If you run into issues, check out this video <https://www.youtube.com/watch?v=3IkfkTwqHx4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=6>
{% endhint %}

***

### Jupyter Notebook

{% hint style="info" %}
Be sure jupyter is installed in your environment. Use pip install jupyter if not. Be sure your `.csv` taxi data is downloaded locally.
{% endhint %}

:black\_small\_square: Terminal&#x20;

```bash
jupyter notebook
```

This should open a jupyter notebook web browser tab. Follow along with the youtube video to finalize your jupyter notebook.

```python
import pandas as pd
from sqlalchemy import create_engine
from time import time

df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=100)
print(pd.io.sql.get_schema(df, name="yellow_taxi_data")) #ddl
pd.to_datetime(df.tpep_pickup_datetime)
pd.to_datetime(df.tpep_dropoff_datetime)

conn_string = "postgresql+psycopg2://root:root@localhost:5432/ny_taxi"
engine = create_engine(conn_string)
engine.connect()
print(pd.io.sql.get_schema(df, name="yellow_taxi_data", con=engine)) #ddl

df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)
df = next(df_iter)
len(df)

df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
%time df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')


while True:
    t_start = time()
    df=next(df_iter)
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
    t_end = time()
    print('inserted another chunk...took %.3f seconds' % (t_end - t_start))
    
# my addition for zones    
df_zones = pd.read_csv('taxi_zone_lookup.csv')
df_zones.head()
df_zones.to_sql(name='zones', con=engine, if_exists='replace')

```

:hammer\_pick: In future videos we use the zone csv data as well. I'm unsure if this was done in a video, but I added the steps in my repo jupyter notebook

:eyes: In 1.2.4 we convert our python notebook into a python script and test loading in the data that way as well.

***

### Resources

My repo for this video can be found here

{% @github-files/github-code-block url="<https://github.com/Tinker0425/de-zoomcamp-my-work/tree/master/module-01/docker/video_2>" %}

:books: <https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/>


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-1/1.2-docker-and-docker-compose/1.2.2-ingesting-ny-taxi-data-to-postgres.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
