A Beginner’s Guide to Preprocess and Handle Data in PySpark | Azure DataBricks

This blog covers different steps to preprocess and handle data in PySpark. Preprocessing techniques can certainly vary from case to case, and many different methods can be used to massage the data into the desired form.

What is PySpark?

Apache Spark itself is written in the Scala programming language. PySpark is a tool developed by the Apache Spark community to let Python developers work with Spark. With PySpark you can also work with RDDs from the Python programming language, which is made possible by a library called Py4j.

PySpark provides the PySpark Shell, a tool that connects the Python API to the Spark core and initializes the Spark context. Most data scientists and analysts today work in Python because of its extensive library ecosystem, so Python integration with Spark is a godsend for them.
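
For example, a minimal sketch of working with an RDD from Python might look like the following (this assumes a plain PySpark installation rather than a Databricks notebook, and the app name rdd_demo is just an illustrative choice):

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('rdd_demo').getOrCreate()
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])    # distribute a Python list as an RDD
print(numbers.map(lambda x: x * 2).collect())                # [2, 4, 6, 8, 10]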

What is Azure Databricks?

  • Databricks + Apache Spark + enterprise cloud = Azure Databricks
  • It’s a fully managed version of the open-source Apache Spark analytics platform, with optimized storage connectors for the fastest data access.
  • It provides a notebook-oriented Apache Spark as-a-service workspace environment that makes interactive data exploration and cluster management simple.
  • It’s a cloud-based machine learning and big data platform that’s safe and secure.
  • Scala, Python, R, Java, and SQL are among the languages supported.

Preprocess and Handle Data in PySpark

All the following steps are written in Azure Databricks. If you have not yet created an Azure Databricks instance and cluster, you can create one from here.

1) To use Spark, the first step is to build a SparkSession object. We also import the necessary functions and data types from pyspark.sql.

# Create (or reuse) a SparkSession, the entry point to Spark functionality
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('data_processing').getOrCreate()
# Import the functions module and the data types used in the steps below
import pyspark.sql.functions as F
from pyspark.sql.types import *

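Note that an Azure Databricks notebook usually comes with a SparkSession named spark already created, so getOrCreate() simply returns the existing session. A quick sanity check might look like this (a minimal sketch, not part of the original steps):

print(spark.version)    # the Spark version the cluster is running
print(type(spark))      # <class 'pyspark.sql.session.SparkSession'>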

2) We now create a new dataframe with five columns of two different data types (string and integer).

# Define the schema (four string columns, one integer column) and build the dataframe
schema = StructType().add("user_id", "string").add("country", "string").add("browser", "string").add("OS", "string").add("age", "integer")
df = spark.createDataFrame([("A203", 'India', "Chrome", "WIN", 33),
                            ("A201", 'China', "Safari", "MacOS", 35),
                            ("A205", 'UK', "Mozilla", "Linux", 25)], schema=schema)
df.printSchema()

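Besides printSchema, a few standard DataFrame attributes can be used to confirm the structure; for example (a small sketch, with expected values shown as comments):

print(df.columns)    # ['user_id', 'country', 'browser', 'OS', 'age']
print(df.dtypes)     # [('user_id', 'string'), ('country', 'string'), ('browser', 'string'), ('OS', 'string'), ('age', 'int')]
print(df.count())    # 3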

3) When we call show on the new dataframe, we can see that it contains three rows and five columns with the values we supplied.

df.show()

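show also accepts optional arguments, and columns can be projected with select; for example (a small sketch using standard DataFrame methods):

df.show(2, truncate=False)            # display only the first two rows, without truncating long values
df.select('user_id', 'age').show()    # display only the selected columns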

4) The presence of null values in a dataset is quite prevalent. As a result, adding a step to the data processing pipeline to handle null values becomes crucial. Null values can be dealt with in Spark by either replacing them with a specific value or dropping the rows/columns that include them.

First, we create a new dataframe (df_na) that contains null values in two of its columns (the schema is the same as in the earlier dataframe).

# Same schema as before, but with nulls in the country and browser columns
df_na = spark.createDataFrame([("A203", None, "Chrome", "WIN", 33), ("A201", 'China', None, "MacOS", 35),
                               ("A205", 'UK', "Mozilla", "Linux", 25)], schema=schema)
df_na.show()

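Before deciding how to handle the nulls, it is often useful to count them per column. One common pattern, using the functions module imported earlier as F, is sketched below:

df_na.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_na.columns]).show()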

5) A simple workaround is to fill the null values with a default value. Here, the fillna function replaces all the null values in the dataframe with ‘0’.

df_na.fillna('0').show()

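Because ‘0’ is passed as a string, only string columns are affected. To fill numeric columns, a number (or a per-column dictionary) can be passed instead; a small sketch (the age column has no nulls in this example, so the first line produces no visible change):

df_na.fillna(0).show()                              # fills nulls in numeric columns with 0
df_na.fillna({'age': 0, 'country': 'NA'}).show()    # per-column defaults of matching types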

6) We can also replace the null values in specific columns (country, browser) with ‘USA’ and ‘Safari’, respectively, by passing a dictionary of column names and replacement values to fillna.

df_na.fillna({'country': 'USA', 'browser': 'Safari'}).show()


7) In order to drop the rows with any null values, we can simply use the na.drop functionality in PySpark.

df_na.na.drop().show()

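na.drop also accepts optional how and thresh arguments that control when a row is removed; a minimal sketch:

df_na.na.drop(how='all').show()    # drop a row only if every column is null
df_na.na.drop(thresh=5).show()     # keep only rows with at least 5 non-null values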

8) We can also restrict the check to specific columns by passing a column name (or a list of names) via the subset argument; rows are then dropped only when those columns contain nulls.

df_na.na.drop(subset='country').show()

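Several column names can be passed as a list, in which case a row is dropped if any of the listed columns is null; for example (a small sketch):

df_na.na.drop(subset=['country', 'browser']).show()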

9) Another very common step in data processing is to replace some data points with particular values. We can use the replace function for this.

df_na.replace("Chrome","Google Chrome").show()

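replace also takes a subset argument to limit the substitution to particular columns, and a dictionary form for several replacements at once; a minimal sketch:

df_na.replace('Chrome', 'Google Chrome', subset=['browser']).show()    # replace only within the browser column
df_na.replace({'WIN': 'Windows', 'MacOS': 'macOS'}).show()             # multiple replacements in one call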

10) To drop a column from a dataframe, we can use the drop functionality of PySpark.

df_na.drop('user_id').show()

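Multiple column names can be passed to drop in a single call. Note that, as with the other operations above, this returns a new dataframe and leaves df_na itself unchanged; a small sketch:

df_na.drop('user_id', 'OS').show()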

Conclusion

In this blog, we discussed different techniques to read, clean, and preprocess data in PySpark.


Next Task For You

In our Azure Data Engineer training program, we will cover all the exam objectives, 27 hands-on labs, and practice tests. If you want to begin your journey towards becoming a Microsoft Certified: Azure Data Engineer Associate, check out our FREE CLASS.



