3.2.1 - BigQuery Best Practices

Last updated Feb 5, 2025


Estimated time spent on this lesson | ~15 min

Youtube Video | ~4 min

In this video, we will learn the do's and don'ts of using BigQuery.

BigQuery Best Practices

For cost reduction, be sure to:

Avoid SELECT * | instead, specify the names of the columns you need
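
As a quick sketch (the dataset and table names, like `nytaxi.yellow_tripdata`, are illustrative, following the NY taxi data used earlier in the course), the second query scans only the columns it names, so BigQuery bills far fewer bytes:

```sql
-- Scans every column in the table (expensive)
SELECT *
FROM `nytaxi.yellow_tripdata`;

-- Scans only the two columns it needs (much cheaper)
SELECT tpep_pickup_datetime, total_amount
FROM `nytaxi.yellow_tripdata`;
```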

Price your queries before running them | the estimated number of bytes to be processed is shown in the upper-right corner of the query editor

Use clustered or partitioned tables
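
For example, a partitioned and clustered copy of the trips table (names are illustrative, following the taxi data from lesson 3.1.2) could be created like this:

```sql
CREATE OR REPLACE TABLE `nytaxi.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)  -- prune whole days at query time
CLUSTER BY VendorID                      -- co-locate rows with the same vendor
AS
SELECT * FROM `nytaxi.yellow_tripdata`;
```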

Be cautious with streaming inserts, as they can increase your costs drastically

Materialize query results in stages, which allows BigQuery to allocate resources more efficiently. Only the necessary compute and storage resources are used at each stage of the query execution, reducing overall resource consumption.
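
A minimal sketch of staging (table names are illustrative): materialize an expensive intermediate result once, then point later queries at the much smaller staged table:

```sql
-- Stage 1: materialize the heavy aggregation once
CREATE OR REPLACE TABLE `nytaxi.daily_revenue` AS
SELECT
  DATE(tpep_pickup_datetime) AS trip_date,
  SUM(total_amount)          AS revenue
FROM `nytaxi.yellow_tripdata`
GROUP BY trip_date;

-- Stage 2: downstream queries read the small staged table instead
SELECT trip_date, revenue
FROM `nytaxi.daily_revenue`
WHERE trip_date >= '2021-01-01';
```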

For query performance, be sure to:

Filter on partitioned or clustered columns
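
For instance, if the table is partitioned by pickup date (as in the example above), a filter on that column lets BigQuery prune every partition outside the range:

```sql
-- Only the January 2021 partitions are scanned
SELECT COUNT(*)
FROM `nytaxi.yellow_tripdata_partitioned_clustered`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2021-01-01' AND '2021-01-31';
```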

Denormalize your data. Denormalization is a database design technique where redundant copies of data from one or more normalized tables are combined and stored together in a single table; retrieving data from a single denormalized table is typically faster than joining multiple normalized tables.

Use nested or repeated columns to denormalize your tables
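
A small sketch of what nested and repeated columns look like (the schema here is made up for illustration):

```sql
-- A STRUCT nests related fields in one column; an ARRAY stores repeated
-- values in a single row, so no separate normalized table is needed
SELECT
  1 AS trip_id,
  STRUCT('card' AS payment_type, 14.5 AS total_amount) AS payment,  -- nested
  ['airport', 'rush_hour'] AS tags;                                 -- repeated
```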

Use external data sources sparingly. Reading data directly from external sources such as Cloud Storage is slower than reading BigQuery's native storage, so overuse can introduce latency and hurt query performance.

Reduce data before using a JOIN by filtering or aggregating it first. This improves query performance and reduces resource usage by eliminating unnecessary data from the tables before executing the join.
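
A sketch of this with the taxi tables (names are illustrative): aggregate the big trips table down to one row per zone first, so the join touches far fewer rows:

```sql
-- Aggregate the large table first, then join the (small) result
WITH zone_pickups AS (
  SELECT PULocationID, COUNT(*) AS trips
  FROM `nytaxi.yellow_tripdata`
  GROUP BY PULocationID
)
SELECT z.Zone, p.trips
FROM zone_pickups AS p
JOIN `nytaxi.taxi_zone_lookup` AS z
  ON p.PULocationID = z.LocationID;
```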

Do NOT treat WITH clauses as prepared statements. WITH clauses exist primarily for readability and are not materialized, so if the same subquery is referenced in more than one place, it executes each time. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause.
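
A sketch of the misuse and one way around it (table names are illustrative; CREATE TEMP TABLE assumes you are running a multi-statement script):

```sql
-- Misuse: the same expensive CTE body runs once per reference
WITH expensive AS (
  SELECT PULocationID, COUNT(*) AS trips
  FROM `nytaxi.yellow_tripdata`
  GROUP BY PULocationID
)
SELECT * FROM expensive WHERE trips > 1000
UNION ALL
SELECT * FROM expensive WHERE trips <= 1000;  -- re-executes the CTE

-- Better: materialize the result once, then reuse it
CREATE TEMP TABLE expensive_once AS
SELECT PULocationID, COUNT(*) AS trips
FROM `nytaxi.yellow_tripdata`
GROUP BY PULocationID;
```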

Avoid oversharding tables. Sharding partitions data into smaller subsets (shards) distributed across multiple storage locations or nodes to improve scalability, performance, and availability in distributed systems; in BigQuery this typically takes the form of date-sharded tables, one table per day. (Partitioning, as we have already discussed, is not a form of sharding and is generally the preferred alternative.) Oversharding means creating too many of these small tables, and the per-table metadata and scheduling overhead can outweigh any performance benefit.
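
In BigQuery terms, the contrast looks like this (table names are illustrative): a wildcard query over date-sharded tables versus a single partitioned table:

```sql
-- Date-sharded: one physical table per day, queried with a wildcard
SELECT COUNT(*)
FROM `nytaxi.trips_2021*`
WHERE _TABLE_SUFFIX BETWEEN '0101' AND '0131';

-- Preferred: one date-partitioned table
SELECT COUNT(*)
FROM `nytaxi.yellow_tripdata_partitioned_clustered`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2021-01-01' AND '2021-01-31';
```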

Avoid JavaScript user-defined functions | use built-in SQL functions or SQL UDFs instead

Use approximate aggregation functions (HyperLogLog++)
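
For example, APPROX_COUNT_DISTINCT is the HyperLogLog++-based counterpart of COUNT(DISTINCT ...) (table name illustrative):

```sql
-- Exact distinct count: accurate but more expensive at scale
SELECT COUNT(DISTINCT PULocationID) AS exact_zones
FROM `nytaxi.yellow_tripdata`;

-- HyperLogLog++-based approximation: faster, with a small error bound
SELECT APPROX_COUNT_DISTINCT(PULocationID) AS approx_zones
FROM `nytaxi.yellow_tripdata`;
```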

Use ORDER BY last in your query operations to maximize performance
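
A minimal sketch: sort once in the outermost query rather than inside subqueries, so only the final, reduced result is ordered (table name illustrative):

```sql
SELECT *
FROM (
  SELECT PULocationID, COUNT(*) AS trips
  FROM `nytaxi.yellow_tripdata`
  GROUP BY PULocationID
  -- no ORDER BY here; ordering a subquery result is wasted work
) AS per_zone
ORDER BY trips DESC;  -- sort applied last, on the small aggregated result
```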

Optimize your JOIN patterns by placing the largest table first, then the smallest, and then the remaining tables (if any) in decreasing order of size. It is still worth experimenting with the table order before executing the query, using the bytes-processed estimate in the upper-right corner of the query editor.
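
A sketch of the pattern with the taxi tables (names are illustrative): the large fact table goes first and the small lookup table second, which lets BigQuery broadcast the small table to each worker:

```sql
SELECT t.total_amount, z.Zone
FROM `nytaxi.yellow_tripdata` AS t     -- largest table first
JOIN `nytaxi.taxi_zone_lookup` AS z    -- smallest table second
  ON t.PULocationID = z.LocationID;
```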

Resources

  • BigQuery query performance | Medium | https://sudsk.medium.com/bigquery-query-performance-525a7b1797a4
  • Lesson video | YouTube | https://www.youtube.com/watch?v=k81mLJVX08w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=29