5.2.1 - Spark & PySpark

Last updated Feb 23, 2025

🕓 Estimated time spent on this lesson | ~30 min

Youtube Video | ~18 min

✍️ Introduction to using Spark / Pyspark using ipynb.

I ended up changing a few things, because of the location of the csv file and I am using google collab:

```notebook-python
from pyspark import SparkFiles

file_url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet'
spark.sparkContext.addFile(file_url)

# Read into Spark DF
df = spark.read.csv(SparkFiles.get('yellow_tripdata_2024-10.parquet'), header=True)

df.count()
```

In the video, we create the spark dataframe after using padas like spark.createDataFrame(df_pandas).schema

How to create Partitions?

How to save a parquet file?

Last updated