Last updated Feb 5, 2025
Last updated 3 hours ago
🕓 Estimated time spent on this lesson | ~15 min
Youtube Video | ~4 min
✍️ In this video we will learn the do's and don'ts of using BigQuery
⚠️ Avoid SELECT * | ☑️ instead specify the name of the columns
🤑 Price your queries before running them, which is given in bytes to be processed in the upper right corner of the query screen
📓 Use clustered or partitioned tables
❗Be cautious with streaming inserts as they can increase your costs drastically
⚡ Materialize query results in stages, which allows BigQuery to allocate resources more efficiently. Only the necessary compute and storage resources are used at each stage of the query execution, reducing overall resource consumption.
🚮 Filter on partitioned or clustered columns
🗄️ Denormalizing your data, which is a database design technique where redundant copies of data from one or more normalized tables are combined and stored together in a single table. Retrieving data from a single denormalized table is typically faster than joining multiple normalized tables.
♻️ Use nested or repeated columns to denormalize your tables
❗Use external data sources appropriately, as overuse could introduce latency and impact query performance. If too many of these sources live on GCP, you may run out of space.
⬇️ Reduce data before using a JOIN, by a filter or aggregate the data.. This method should improve query performance and reduce resource usage by eliminating unnecessary data from the tables before executing the join.
❌ NOT treat WITH clauses as prepared statements, because WITH clauses are primarily for readablity and not materialized. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.
WITH
UNION ALL
👍 Avoid oversharding tables. Sharding a table involves partitioning the data into smaller subsets or shards and distributing them across multiple storage locations or nodes to improve scalability, performance, and availability in distributed database systems. (Partitioning as we have already discussed is not a form of sharding.) But if you overshard, that means you create too many partitions and may incur a performance drop. Sometimes the complexity and overhead of sharding outweigh the potential performance benefits, especially in distributed systems with a large number of shards.
❌ Avoid JavaScript user-defined functions | ✔️ Use native UDFs instead
✔️ Use approximate aggregation functions (HyperLogLog++)
📁 do ORDER BY last, for query operations to maximize performance
🔵 Optimize your JOIN patterns by placing the largest table first, then the smallest, and then order the rest of the tables (if any) by decreasing size. However, it's best to experiment with the order of the tables before executing the query, using the price given in bytes processed in the upper right of the query view.