2.3.1 - Create an ETL Pipeline with GCS and BigQuery in Kestra
Last updated Jan 31, 2025
YouTube Video | ~20 min
Now that you've learned how to build ETL pipelines locally using Postgres, we are ready to move to the cloud. In this section, we'll load the same Yellow and Green Taxi data to Google Cloud Platform (GCP) using:
- Google Cloud Storage (GCS) as a data lake
- BigQuery as a data warehouse
To connect Kestra to GCP, we modify and execute flow 4, which stores our GCP configuration as key-value (KV) pairs in Kestra's KV Store. You can learn more about KVs here.
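A minimal sketch of what such a KV-setting flow could look like. The namespace, key names, and values below are placeholders to adapt to your own project, and the flow in the course repo may differ in detail:

```yaml
id: gcp_kv
namespace: zoomcamp            # placeholder namespace

tasks:
  # each task writes one key-value pair into Kestra's KV Store
  - id: gcp_project_id
    type: io.kestra.plugin.core.kv.Set
    key: GCP_PROJECT_ID
    value: my-project-id       # replace with your GCP project ID

  - id: gcp_location
    type: io.kestra.plugin.core.kv.Set
    key: GCP_LOCATION
    value: europe-west2        # replace with your preferred region

  - id: gcp_bucket_name
    type: io.kestra.plugin.core.kv.Set
    key: GCP_BUCKET_NAME
    value: my-unique-bucket-name   # GCS bucket names must be globally unique

  - id: gcp_dataset
    type: io.kestra.plugin.core.kv.Set
    key: GCP_DATASET
    value: zoomcamp            # BigQuery dataset name
```

The service account JSON key from the step below is typically stored the same way (for example under a key like GCP_CREDS), so later flows can reference it with an expression such as `{{ kv('GCP_CREDS') }}` instead of hard-coding credentials.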
We then run flow 5, which uses those KV values to create our GCS bucket and BigQuery dataset on GCP.
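A sketch of that setup flow, assuming the KV keys from the example above (the task types come from Kestra's GCP plugin; the exact course flow may look slightly different):

```yaml
id: gcp_setup
namespace: zoomcamp

tasks:
  # create the data lake bucket (skipped if it already exists)
  - id: create_gcs_bucket
    type: io.kestra.plugin.gcp.gcs.CreateBucket
    ifExists: SKIP
    name: "{{ kv('GCP_BUCKET_NAME') }}"

  # create the BigQuery dataset that will act as our warehouse
  - id: create_bq_dataset
    type: io.kestra.plugin.gcp.bigquery.CreateDataset
    ifExists: SKIP
    name: "{{ kv('GCP_DATASET') }}"

# apply credentials and project settings to every GCP task in this flow
pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{ kv('GCP_CREDS') }}"
      projectId: "{{ kv('GCP_PROJECT_ID') }}"
      location: "{{ kv('GCP_LOCATION') }}"
```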
Next, in the GCP console we do some setup work: create a service account, grant it the permissions it needs for GCS and BigQuery (for example, Storage Admin and BigQuery Admin), and download a JSON key for it.
Be sure to keep your JSON key private and out of GitHub.
I tried to change my location in flow 4, but that caused an error. Maybe the region was too large?
Your flows should now have connected your Kestra instance to GCP, and you should see the new bucket in GCS.
Our end goal is for the flow to send our data (CSV files in our case) to the GCS data lake bucket and then hand it over to BigQuery. The dataset is large enough that processing it locally would crash our computer, which is why we move this work to the cloud.
Run flow 6 to see how it sends our data to the GCS bucket and then loads it into BigQuery.
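A simplified sketch of that extract-upload-load pattern, again assuming the KV keys above. The download URL, file name, and table name here are illustrative, and the actual flow 6 in the course is more involved:

```yaml
id: gcp_taxi_sketch
namespace: zoomcamp

tasks:
  # download one month of green taxi data as a CSV
  # (URL pattern assumed from the course data releases)
  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz | gunzip > green_tripdata_2019-01.csv

  # push the CSV into our GCS data lake bucket
  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{ outputs.extract.outputFiles['green_tripdata_2019-01.csv'] }}"
    to: "gs://{{ kv('GCP_BUCKET_NAME') }}/green_tripdata_2019-01.csv"

  # expose the uploaded file to BigQuery as an external table over the GCS object
  - id: bq_external_table
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      CREATE OR REPLACE EXTERNAL TABLE `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.green_tripdata_ext`
      OPTIONS (
        format = 'CSV',
        uris = ['gs://{{ kv('GCP_BUCKET_NAME') }}/green_tripdata_2019-01.csv'],
        skip_leading_rows = 1
      );

# apply credentials and project settings to every GCP task in this flow
pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{ kv('GCP_CREDS') }}"
      projectId: "{{ kv('GCP_PROJECT_ID') }}"
      location: "{{ kv('GCP_LOCATION') }}"
```

Using an external table like this keeps the raw CSV in the data lake while still letting BigQuery query it; from there you can materialize it into a regular table if you want the data fully inside the warehouse.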