
1.2.4 - Dockerizing the Ingestion Script

Last updated Jan 19, 2025

YouTube Video | ~18 min

https://www.youtube.com/watch?v=B1WwATwf-vY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=8

✍️ In this video we will learn how to convert our Jupyter notebook into a Python script. Then we will learn a second way to ingest our data into Postgres using our new Python script.

♻️ Recall this information from 1.2.2, where we loaded our taxi CSV data using a Python notebook. This is another way to achieve the same goal, but likely a more common practice.

Convert Jupyter Notebook

◼️ Terminal - Converting the Jupyter notebook to a Python script

jupyter nbconvert --to=script {notebook_name.ipynb}

Run this from your project environment, in the folder where the .ipynb file lives.

🧹 Clean up your code as needed

Using argparse

The goal is to allow user inputs for different values such as the URL or the password. You can read more about how to use argparse at https://docs.python.org/3/library/argparse.html. The final Python script, called ingest_data.py, can be found here. I recommend using this version because at ~11 min in the YouTube video Alexey mentions needing to add an exception to the code, which this version has - the 'try-except' statement. There is a link in Resources if you are unsure what a 'try-except' statement is.
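
For reference, here is a minimal sketch of the argparse portion of the script. It only shows the argument parsing that matches the flags used below; the actual ingest_data.py linked above also contains the download and chunked-ingestion logic plus the try-except statement.

📝 Code editor

import argparse


def main(params):
    # The full script downloads the CSV from params.url and loads it into
    # Postgres in chunks; here we only echo the parsed arguments.
    print(f"Ingesting {params.url} into {params.db}.{params.table_name}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Ingest CSV data into Postgres")
    parser.add_argument("--user", help="user name for Postgres")
    parser.add_argument("--password", help="password for Postgres")
    parser.add_argument("--host", help="host for Postgres")
    parser.add_argument("--port", help="port for Postgres")
    parser.add_argument("--db", help="database name for Postgres")
    parser.add_argument("--table_name", help="name of the table to write the results to")
    parser.add_argument("--url", help="URL of the CSV file")

    args = parser.parse_args()
    main(args)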


Second way to ingest data

♻️ Recall that our first way was in 1.2.2 using a Python notebook. To complete this second method you will need to drop your table, following along with the YouTube video.

This second way is still a 'manual' method. You need to manually drop your table in pgAdmin at http://localhost:8080/ and then run a command.
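
If you prefer to drop the table from Python rather than the pgAdmin query tool, here is a minimal sketch assuming the same connection details used throughout this section (root/root, localhost:5432, database ny_taxi).

📝 Code editor

from sqlalchemy import create_engine, text

# Same connection details used elsewhere in this section
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# engine.begin() commits the transaction on exit, so the DROP takes effect
with engine.begin() as conn:
    conn.execute(text("DROP TABLE IF EXISTS yellow_taxi_trips"))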

⚒️ Note that our URL will look different because the NYC TLC website no longer provides the CSV files.

◼️ Terminal

URL="https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"

python ingest_data.py \
  --user=root \
  --password=root \
  --host=localhost \
  --port=5432 \
  --db=ny_taxi \
  --table_name=yellow_taxi_trips \
  --url=${URL}

This should print the progress of each chunk in your terminal window, and you should now be able to see the data at http://localhost:8080/ again (if you dropped your table).
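
As a quick check from Python (instead of pgAdmin), here is a small sketch assuming the same connection details as above:

📝 Code editor

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Count the rows that were just ingested
with engine.connect() as conn:
    count = conn.execute(text("SELECT count(*) FROM yellow_taxi_trips")).scalar()
    print(count)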


Third way to ingest data

This third way is 'Dockerizing' the Python script. This method will automatically 'replace' the table, because of how our Python script writes it.
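
The 'replace' behaviour comes from the pandas to_sql pattern inside the script: the first chunk recreates the table with if_exists='replace', and every later chunk is appended. A minimal sketch of that pattern (the file name and connection details here are illustrative; the real script takes them from the command-line arguments):

📝 Code editor

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Read the CSV in chunks so we never hold the whole file in memory
df_iter = pd.read_csv("yellow_tripdata_2021-01.csv.gz", iterator=True, chunksize=100_000)
first_chunk = next(df_iter)

# Recreate the table from scratch (drops any previous version), then append the data
first_chunk.head(0).to_sql("yellow_taxi_trips", con=engine, if_exists="replace")
first_chunk.to_sql("yellow_taxi_trips", con=engine, if_exists="append")

for chunk in df_iter:
    chunk.to_sql("yellow_taxi_trips", con=engine, if_exists="append")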

1

Update Dockerfile

📝 Code editor

FROM python:3.12

# wget is used to download the CSV file; update the package lists first
RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2

WORKDIR /app
COPY ingest_data.py ingest_data.py

ENTRYPOINT [ "python", "ingest_data.py" ]

◼️ Terminal window

docker build -t taxi_ingest:v001 .
2

Docker run

docker run -it \
  --network=pg-network \
  taxi_ingest:v001 \
    --user=root \
    --password=root \
    --host=pg-database \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --url=${URL}
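
Note that --host=pg-database refers to the name given to the Postgres container on pg-network in the earlier pgAdmin/network setup; localhost would point at the ingest container itself. Also make sure ${URL} is set in the shell session where you run this docker run command.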

Resources


📚 https://stackoverflow.com/questions/62062226/how-to-work-with-try-and-except-in-python

🔗 https://docs.python.org/3/library/argparse.html
