5.2.1 - Spark & PySpark
Last updated Feb 23, 2025
🕓 Estimated time spent on this lesson | ~30 min
YouTube video | ~18 min
✍️ Introduction to Spark / PySpark in a Jupyter notebook (ipynb).
I ended up changing a few things from the video, because the data file is in a different location and I am using Google Colab:
```notebook-python
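from pyspark import SparkFiles
from pyspark.sql import SparkSession
# Reuse the notebook's session if there is one, otherwise start a local one
spark = SparkSession.builder.master("local[*]").getOrCreate()
file_url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet'
# Download the file so it can be located through SparkFiles
spark.sparkContext.addFile(file_url)
# Read into a Spark DataFrame; the file is Parquet, so use read.parquet rather than read.csv
df = spark.read.parquet(SparkFiles.get('yellow_tripdata_2024-10.parquet'))
df.count()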
```
In the video, we create the Spark DataFrame from a pandas DataFrame, for example `spark.createDataFrame(df_pandas).schema`, which returns the schema Spark infers from the pandas data.
How to create Partitions?
How to save a parquet file?