# 1.2.2 - Ingesting NY Taxi Data to Postgres

Youtube Video | \~29min

:no\_entry:  I recommend not pausing your workflow and going through this entire video. You may run into a number of issues related to pgcli, so allow for extra time spent here (maybe hours). Search Slack, Search FAQ, try a new virtual environment, ask for help.

{% embed url="<https://www.youtube.com/watch?v=2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5&pp=iAQB>" %}
<https://www.youtube.com/watch?v=2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5&pp=iAQB>
{% endembed %}

{% hint style="info" %}
:hammer\_pick: The Taxi [TLC data website](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) now provides data in `.parquet` format instead of `.csv`. The website [link](https://www1.nyc.gov/assets/tlc/downloads/pdf/working_parquet_format.pdf) gives directions on how to read `.parquet` files and convert it to Pandas data frame. For this course, we want to use the `.csv`backup located [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz): <https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz>
{% endhint %}

:writing\_hand: In this video we will learn how to configure and run <mark style="background-color:purple;">Postgres</mark> in <mark style="background-color:blue;">Docker</mark>. We will download the taxi NY dataset as a csv file and read it into a jupyter notebook. We will also look at the data using pgcli, but will use other options moving forward.&#x20;

***

### Datasets

1. yellow\_trip\_data - <https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz>
2. zones\_data - <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page> and click:
   * [Taxi Zone Lookup Table](https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv) (CSV)

:white\_check\_mark: Now you should have 2 .csv files locally

***

{% embed url="<https://www.docker.com/wp-content/uploads/2022/09/postgres-official-docker-image-900x600-1.png>" %}
<https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/>
{% endembed %}

### Running Postgres w/ Docker (Mac)

:vhs: \~minute 6

:black\_small\_square: Terminal

```sh
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13
```

{% hint style="info" %}
Be sure to run this from your terminal (base) and that you run it in the correct directory to point to your ny\_taxi\_postgres\_data folder correctly
{% endhint %}

:white\_check\_mark: After running, you should see postgres files in your ny\_taxi\_postgres\_data directory

***

### PGCLI

:vhs: \~minute 7&#x20;

:black\_small\_square: Terminal

```bash
pip install pgcli
```

{% hint style="info" %}
Be sure to run pip install pgcli on (base) not your environment.&#x20;

If you are having issues with the above command, try:\
`conda install -c conda-forge pgcli`\
`pip install -U mycli`
{% endhint %}

Using `pgcli` to connect to <mark style="background-color:purple;">Postgres</mark>

```bash
pgcli -h localhost -p 5432 -u root -d ny_taxi
```

`h`hostname       `p` port         `u` username      `d` database name

:white\_check\_mark: You can now explore your dataset in therteminal window (once you have some)

{% hint style="info" %}
If you run into issues, check out this video <https://www.youtube.com/watch?v=3IkfkTwqHx4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=6>
{% endhint %}

***

### Jupyter Notebook

{% hint style="info" %}
Be sure jupyter is installed in your environment. Use pip install jupyter if not. Be sure your `.csv` taxi data is downloaded locally.
{% endhint %}

:black\_small\_square: Terminal&#x20;

```bash
jupyter notebook
```

This should open a jupyter notebook web browser tab. Follow along with the youtube video to finalize your jupyter notebook.

```python
import pandas as pd
from sqlalchemy import create_engine
from time import time

df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=100)
print(pd.io.sql.get_schema(df, name="yellow_taxi_data")) #ddl
pd.to_datetime(df.tpep_pickup_datetime)
pd.to_datetime(df.tpep_dropoff_datetime)

conn_string = "postgresql+psycopg2://root:root@localhost:5432/ny_taxi"
engine = create_engine(conn_string)
engine.connect()
print(pd.io.sql.get_schema(df, name="yellow_taxi_data", con=engine)) #ddl

df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)
df = next(df_iter)
len(df)

df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
%time df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')


while True:
    t_start = time()
    df=next(df_iter)
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
    t_end = time()
    print('inserted another chunk...took %.3f seconds' % (t_end - t_start))
    
# my addition for zones    
df_zones = pd.read_csv('taxi_zone_lookup.csv')
df_zones.head()
df_zones.to_sql(name='zones', con=engine, if_exists='replace')

```

:hammer\_pick: In future videos we use the zone csv data as well. I'm unsure if this was done in a video, but I added the steps in my repo jupyter notebook

:eyes: In 1.2.4 we convert our python notebook into a python script and test loading in the data that way as well.

***

### Resources

My repo for this video can be found here

{% @github-files/github-code-block url="<https://github.com/Tinker0425/de-zoomcamp-my-work/tree/master/module-01/docker/video_2>" %}

:books: <https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/>
