# 3.1.1 - Data Warehouse and BigQuery

:clock4:  *Estimated time spent on this lesson | \~35 min*

Youtube Video | \~23 min

{% embed url="<https://www.youtube.com/watch?v=jrHljAoD6nM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=35>" %}
<https://www.youtube.com/watch?v=jrHljAoD6nM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=35>
{% endembed %}

:writing\_hand: In this video we learn about OLTP, OLAP, Data Warehouse, and <mark style="background-color:blue;">BigQuery</mark>. We will do some exploring in <mark style="background-color:blue;">BigQuery</mark> and then will have an introduction to partitioning and clustering.

### Definitions

{% tabs %}
{% tab title="OLTP" %}
OLTP: Online transaction processing

Online Transaction Processing (OLTP) refers to database systems designed to manage and process real-time transaction data, typically for operational applications. These systems handle tasks like order processing, inventory management, and customer interactions, where quick, reliable processing is crucial. In contrast to OLAP (Online Analytical Processing), which focuses on analyzing large datasets for decision-making, OLTP is all about immediate, transactional data processing.

**Example 1**: An e-commerce website using OLTP to process customer orders in real-time.\
**Example 2**: A bank’s ATM system that instantly processes withdrawals and deposits.
{% endtab %}

{% tab title="OLAP" %}
Online Analytical Processing (OLAP) is a computing approach designed for quickly answering multi-dimensional analytical queries, often used in business intelligence. Unlike OLTP, which focuses on transaction processing, OLAP helps in complex data analysis for areas like sales reporting, budgeting, and forecasting. It enables users to analyze large datasets and perform tasks such as trend analysis and data mining.

**Example 1**: A company using OLAP to analyze sales performance across different regions and time periods.\
**Example 2**: Financial analysts using OLAP to forecast next quarter's revenues based on historical data.
{% endtab %}

{% tab title="Data Warehouse" %}
A data warehouse (DW or EDW) is a centralized system used for reporting and data analysis, playing a key role in business intelligence. It stores both current and historical data from various operational systems, like sales or marketing, and organizes it for easy querying and reporting by analysts and managers. Data in the warehouse often undergoes cleansing and transformation before being used for insights and decision-making.

Examples of popular data warehouses include Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure Synapse Analytics.
{% endtab %}

{% tab title="BigQuery" %}
"<mark style="background-color:blue;">BigQuery</mark> is a managed, [serverless](https://en.wikipedia.org/wiki/Serverless) [data warehouse](https://en.wikipedia.org/wiki/Data_warehouse) product by [Google](https://en.wikipedia.org/wiki/Google), offering [scalable analysis](https://en.wikipedia.org/wiki/Analysis) over large quantities of data. It is a *Platform as a Service* ([PaaS](https://en.wikipedia.org/wiki/Cloud_computing#Platform_as_a_service_\(PaaS\))) that supports querying using a dialect of [SQL](https://en.wikipedia.org/wiki/SQL). It also has built-in [machine learning](https://en.wikipedia.org/wiki/Machine_learning) capabilities." Wiki <https://en.wikipedia.org/wiki/BigQuery>
{% endtab %}
{% endtabs %}

## BigQuery

:mag\_right: <mark style="background-color:blue;">BigQuery</mark> has some public tables and you can search for them:

<figure><img src="https://3375647567-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxgPnpaRKCInTeWOYlVUE%2Fuploads%2FIYFPZszAKe6Ms5hyprXD%2FScreen%20Shot%202025-02-04%20at%206.07.20%20PM.png?alt=media&#x26;token=ac3abb8c-7c59-4f1a-91f6-0caccfe6099d" alt=""><figcaption></figcaption></figure>

```sql
SELECT station_id, name
FROM bigquery-public-data.new_york_citibike.citibike_stations
LIMIT 100
```

:eyes: In Module-2, we used <mark style="background-color:purple;">Kestra</mark> to get out tables in <mark style="background-color:blue;">BigQuery</mark>. Another option is you can create a table if your data is in your GCP bucket:

```sql
-- Creating external table referring to gcs path
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://nyc-tl-data/trip data/yellow_tripdata_2019-*.csv', 'gs://nyc-tl-data/trip data/yellow_tripdata_2020-*.csv']
);
```

{% embed url="<https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver16&tabs=dedicated>" %}
<https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver16&tabs=dedicated>
{% endembed %}

### Partitioning

<figure><img src="https://lh7-rt.googleusercontent.com/slidesz/AGV_vUfvbluAQ8bNxMAWmNc_nBlvQQmxbJ1EF_ylSEupgQR3A-Y94ynrj1e2BTg-3AzL66UJlXSW8PWMDO7Ula-gubswj-eBHuOs8vDTrSsnm2qjvpwSseumPjDaNC5ViE9PX5wEpNSpTQDVrghtquqdBMHuqrSck4g=s2048?key=paeMfeNH865yUluq5ERR7g" alt=""><figcaption><p><a href="https://docs.google.com/presentation/d/1a3ZoBAXFk8-EhUsd7rAZd-5p_HpltkzSeujjRGB2TAI/edit#slide=id.g10eebc44ce4_0_41">https://docs.google.com/presentation/d/1a3ZoBAXFk8-EhUsd7rAZd-5p_HpltkzSeujjRGB2TAI/edit#slide=id.g10eebc44ce4_0_41</a></p></figcaption></figure>

```sql
-- Check yello trip data
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata limit 10;

-- Create a non partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_non_partitoned AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Create a partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitoned
PARTITION BY
  DATE(tpep_pickup_datetime) AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;
```

In <mark style="background-color:blue;">BigQuery</mark> Details, it will show you if the table is partitioned. This is very cool, if you are often examining a table based on one column (pick up date in our example), this would save you time and money.

### Clustering

:question:It looks like you cannot cluster unless you partition first, as clustering is a second grouping. So first group by partitioning (date in our example), and then group by clustering (vendor in our example).&#x20;

<figure><img src="https://lh7-rt.googleusercontent.com/slidesz/AGV_vUcvfuT2jPUX2YtNUZfQgxJUcCqhIAb7SUpW_qBlC6Vv9ynLMtCaVP9zj5zocKEwItoTnJTisvEyVk6s7RJgP6wQkFdz6BG8P_N32rxyUv5bf49xMDv4Txsm-59HLQ6E2QdbrAgoS-F3qKVeIkeuYMoVeOfYlvrl=s2048?key=paeMfeNH865yUluq5ERR7g" alt=""><figcaption><p>Course Slides</p></figcaption></figure>

:bulb: I really like how in <mark style="background-color:blue;">BigQuery</mark> it gives you a different table icon to show that the table had been partitioned or partitioned & clustered, and then the details tell you by which field.

<figure><img src="https://3375647567-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxgPnpaRKCInTeWOYlVUE%2Fuploads%2FEslnCLHQ0egw2Pbw2dUz%2FScreen%20Shot%202025-02-04%20at%207.24.22%20PM.png?alt=media&#x26;token=aef2c009-35f2-44c7-bac4-870097804242" alt=""><figcaption></figcaption></figure>

:white\_check\_mark: We should now understand how to create a table from an external source, and how to partition and cluster a table in <mark style="background-color:blue;">BigQuery</mark>

I really enjoyed this lesson :star::star::star::star::star:


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-3/3.1-data-warehouse-partitioning-and-clustering/3.1.1-data-warehouse-and-bigquery.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
