Skip to content

Latest commit

 

History

History
221 lines (141 loc) · 5.99 KB

File metadata and controls

221 lines (141 loc) · 5.99 KB

Introduction to PySpark Using Python

Slide 1:

Introduction to PySpark

PySpark is the Python API for Apache Spark, a powerful open-source distributed computing framework. It allows you to process large datasets across multiple computers, making it a popular choice for big data analytics.

Code:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
                    .appName("PySpark Example") \
                    .getOrCreate()

Slide 2:

Reading Data in PySpark

PySpark provides various methods to read data from different sources, such as CSV, JSON, Parquet, and more. Here's an example of reading a CSV file into a DataFrame.

Code:

# Read a CSV file
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

Slide 3:

Exploring Data with PySpark

Once you have loaded your data into a DataFrame, you can explore and analyze it using various DataFrame methods. Here's an example of showing the first few rows and the schema of the DataFrame.

Code:

# Show the first few rows of the DataFrame
df.show(5)

# Print the schema of the DataFrame
df.printSchema()

Slide 4:

Data Transformation with PySpark

PySpark provides a rich set of transformations to manipulate and clean your data. Here's an example of filtering and selecting columns from a DataFrame.

Code:

# Filter rows based on a condition
filtered_df = df.filter(df["age"] > 30)

# Select specific columns
selected_df = df.select("name", "age")

Slide 5:

Grouping and Aggregating Data

PySpark allows you to group data and perform aggregations, such as sum, count, or average, on the grouped data. This is useful for data analysis and reporting.

Code:

# Group by "department" and count the number of employees
dept_counts = df.groupBy("department").count()

# Group by "department" and calculate the average salary
avg_salaries = df.groupBy("department").avg("salary")

Slide 6:

Joining DataFrames

In PySpark, you can join two or more DataFrames based on a common column, similar to SQL joins. This is useful for combining data from multiple sources.

Code:

# Join two DataFrames on a common column
joined_df = df1.join(df2, df1["id"] == df2["id"], "inner")

Slide 7:

User-Defined Functions (UDFs)

PySpark allows you to create custom User-Defined Functions (UDFs) to perform complex transformations on your data. UDFs can be written in Python and applied to DataFrame columns.

Code:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a UDF to convert strings to uppercase
upper_case = udf(lambda x: x.upper(), StringType())

# Apply the UDF to a DataFrame column
upper_df = df.select(upper_case("name").alias("uppercase_name"))

Slide 8:

Caching and Persistence

PySpark provides mechanisms to cache or persist intermediate results in memory or on disk, which can significantly improve performance for iterative or repeated computations.

Code:

# Cache a DataFrame in memory
df.cache()

# Persist a DataFrame on disk
df.persist(StorageLevel.DISK_ONLY)

Slide 9:

Partitioning and Repartitioning

Partitioning and repartitioning are techniques used in PySpark to optimize data processing by distributing data across multiple partitions. This can improve performance and reduce data skew.

Code:

# Repartition a DataFrame into 8 partitions
repartitioned_df = df.repartition(8)

# Partition a DataFrame by a column
partitioned_df = df.repartition("department")

Slide 10:

Spark SQL and DataFrames

PySpark allows you to use SQL-like syntax to query and manipulate DataFrames. This can be useful for those familiar with SQL or when working with complex data transformations.

Code:

# Register a DataFrame as a temporary view
df.createOrReplaceTempView("employees")

# Run SQL queries on the temporary view
results = spark.sql("SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department")

Slide 11:

PySpark Streaming

PySpark Streaming enables you to process real-time data streams from sources like Apache Kafka, Amazon Kinesis, or TCP sockets. It provides a high-level abstraction for streaming computations.

Code:

from pyspark.sql.functions import explode
from pyspark.sql.types import StructType, StructField, StringType

# Define the schema for the streaming data
schema = StructType([StructField("text", StringType(), True)])

# Create a streaming DataFrame
lines = spark.readStream.format("socket") \
                .option("host", "localhost") \
                .option("port", 9999) \
                .load()

# Split the lines into words
words = lines.select(explode(split("value", " ")).alias("word"))

Slide 12:

PySpark on YARN and Mesos

PySpark can run on various cluster managers, such as Apache YARN (Hadoop's resource manager) and Apache Mesos. This allows you to leverage existing cluster resources for distributed computing.

Code:

# Create a SparkSession with YARN as the master
spark = SparkSession.builder \
                    .appName("PySpark on YARN") \
                    .master("yarn") \
                    .getOrCreate()

# Create a SparkSession with Mesos as the master
spark = SparkSession.builder \
                    .appName("PySpark on Mesos") \
                    .master("mesos://host:port") \
                    .getOrCreate()

Slide 13:

Additional Resources

For further learning and exploration of PySpark, here are some additional resources:

Reference from ArXiv: