2.3.1 - Create an ETL Pipeline with GCS and BigQuery in Kestra
Last updated Jan 31, 2025
YouTube Video | ~20 min
Now that you've learned how to build ETL pipelines locally using Postgres, we are ready to move to the cloud. In this section, we'll load the same Yellow and Green Taxi data to Google Cloud Platform (GCP) using:
- Google Cloud Storage (GCS) as a data lake
- BigQuery as a data warehouse
To connect Kestra to GCP, we modify and execute flow 4, which stores our GCP configuration as key-value (KV) pairs in Kestra's KV Store. You can learn more about KVs here.
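A minimal sketch of what such a KV-setting flow could look like. The namespace, key names, and values below are placeholders to adapt to your own project, and the flow in the course repo may differ in detail:

```yaml
id: gcp_kv
namespace: zoomcamp            # placeholder namespace

tasks:
  # each task writes one key-value pair into Kestra's KV Store
  - id: gcp_project_id
    type: io.kestra.plugin.core.kv.Set
    key: GCP_PROJECT_ID
    value: my-project-id       # replace with your GCP project ID

  - id: gcp_location
    type: io.kestra.plugin.core.kv.Set
    key: GCP_LOCATION
    value: europe-west2        # replace with your preferred region

  - id: gcp_bucket_name
    type: io.kestra.plugin.core.kv.Set
    key: GCP_BUCKET_NAME
    value: my-unique-bucket-name   # GCS bucket names must be globally unique

  - id: gcp_dataset
    type: io.kestra.plugin.core.kv.Set
    key: GCP_DATASET
    value: zoomcamp            # BigQuery dataset name
```

The service account JSON key from the step below is typically stored the same way (for example under a key like GCP_CREDS), so later flows can reference it with an expression such as `{{ kv('GCP_CREDS') }}` instead of hard-coding credentials.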
We then run flow 5, which uses those KV values to create our GCS bucket and BigQuery dataset on GCP.
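A sketch of that setup flow, assuming the KV keys from the example above (the task types come from Kestra's GCP plugin; the exact course flow may look slightly different):

```yaml
id: gcp_setup
namespace: zoomcamp

tasks:
  # create the data lake bucket (skipped if it already exists)
  - id: create_gcs_bucket
    type: io.kestra.plugin.gcp.gcs.CreateBucket
    ifExists: SKIP
    name: "{{ kv('GCP_BUCKET_NAME') }}"

  # create the BigQuery dataset that will act as our warehouse
  - id: create_bq_dataset
    type: io.kestra.plugin.gcp.bigquery.CreateDataset
    ifExists: SKIP
    name: "{{ kv('GCP_DATASET') }}"

# apply credentials and project settings to every GCP task in this flow
pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{ kv('GCP_CREDS') }}"
      projectId: "{{ kv('GCP_PROJECT_ID') }}"
      location: "{{ kv('GCP_LOCATION') }}"
```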
Next, in the GCP console we do some setup work: create a service account, grant it the permissions it needs for GCS and BigQuery (for example, Storage Admin and BigQuery Admin), and download a JSON key for it.
Be sure to keep your JSON key private and out of GitHub.
I tried to change my location in flow 4, but that caused an error. Maybe the region was too large?
Your flows should now have connected your Kestra instance to GCP, and you should see the new bucket in GCS.
Our end goal is for the flow to send our data (CSV files in our case) to the GCS data lake bucket and then hand it over to BigQuery. The dataset is large enough that processing it locally would crash our computer, which is why we move this work to the cloud.
Run flow 6 to see how it sends our data to the GCS bucket and then loads it into BigQuery.
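A simplified sketch of that extract-upload-load pattern, again assuming the KV keys above. The download URL, file name, and table name here are illustrative, and the actual flow 6 in the course is more involved:

```yaml
id: gcp_taxi_sketch
namespace: zoomcamp

tasks:
  # download one month of green taxi data as a CSV
  # (URL pattern assumed from the course data releases)
  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz | gunzip > green_tripdata_2019-01.csv

  # push the CSV into our GCS data lake bucket
  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{ outputs.extract.outputFiles['green_tripdata_2019-01.csv'] }}"
    to: "gs://{{ kv('GCP_BUCKET_NAME') }}/green_tripdata_2019-01.csv"

  # expose the uploaded file to BigQuery as an external table over the GCS object
  - id: bq_external_table
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      CREATE OR REPLACE EXTERNAL TABLE `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.green_tripdata_ext`
      OPTIONS (
        format = 'CSV',
        uris = ['gs://{{ kv('GCP_BUCKET_NAME') }}/green_tripdata_2019-01.csv'],
        skip_leading_rows = 1
      );

# apply credentials and project settings to every GCP task in this flow
pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{ kv('GCP_CREDS') }}"
      projectId: "{{ kv('GCP_PROJECT_ID') }}"
      location: "{{ kv('GCP_LOCATION') }}"
```

Using an external table like this keeps the raw CSV in the data lake while still letting BigQuery query it; from there you can materialize it into a regular table if you want the data fully inside the warehouse.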