Synapse Spark – Playing with Data Frames with Python

11 May 2021 | Kevin Conan | Azure Synapse Analytics

Background

It’s time to take the next step in our Synapse Spark journey. Today, we’re going to play with data frames, the in-memory tables where Spark temporarily holds our data while we work with it. Like last time, we are going to focus on using Python.

Setup

For our examples, we will build on what we learned previously about reading a .CSV file into a data frame.

Synapse Spark – Reading CSV files from Azure Data Lake Storage Gen 2 with Synapse Spark using Python – SQL Stijn (sql-stijn.com)

For my examples, I’m going to be using one of the New York Taxi sample data dimension files (dimlocation.csv).

So, let’s go ahead and read in a file using this code template:

from pyspark.sql import SparkSession
from pyspark.sql.types import *

account_name = "Your account name"
container_name = "Your container name"
relative_path = "Your path"

adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)

dataframe1 = spark.read.option('header', 'true') \
    .option('delimiter', ',') \
    .csv(adls_path + '/FileName.csv')

dataframe1.show()

Remember, the first time you run Spark code it will take longer, as the pool has to start up.

So Many Demos!

Let’s begin by seeing what the schema is for the data frame:

dataframe1.printSchema()
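
Since we did not pass a schema or the inferSchema option, every column comes back as a string. If you would like Spark to take a guess at the data types instead, here is a quick sketch building on the template above (it does cost Spark an extra pass over the file):

dataframe1 = spark.read.option('header', 'true') \
    .option('delimiter', ',') \
    .option('inferSchema', 'true') \
    .csv(adls_path + '/FileName.csv')

dataframe1.printSchema()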

Now that we know the schema, let’s return the first 5 records: 

dataframe1.limit(5).show()
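
As a side note, show() accepts a row count directly, so the call below returns the same first 5 records in one step; limit() shines when you want to keep working with the smaller data frame afterwards:

dataframe1.show(5)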

Next, let’s assume that we do not need the service_zone column, so let’s drop it to clean up our data frame a bit. We will also display the first 5 records again to see the difference:

dataframe1 = dataframe1.drop("service_zone")

dataframe1.limit(5).show()
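
If you ever need to remove more than one column, drop() accepts several names at once. A quick sketch (the column names here are hypothetical, just to illustrate the multi-column form):

# Hypothetical columns, shown only to demonstrate dropping several at once
dataframeTrimmed = dataframe1.drop("zone", "service_zone")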

Let’s create a new data frame that has just the records where the Borough is Queens, and return the first 10 records:

dataframeQueens = dataframe1.filter(dataframe1['borough'] == "Queens")

dataframeQueens.limit(10).show()
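
The bracket syntax above works fine, but you will often see the same filter written with the col() helper from pyspark.sql.functions; both versions produce the same result:

from pyspark.sql.functions import col

# Equivalent filter using the col() helper
dataframeQueens = dataframe1.filter(col('borough') == "Queens")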

What if we want more than one Borough?  No problem!  Let’s look at using “isin”: 

dataframeQueensManhattan = dataframe1.filter(dataframe1['borough'].isin(["Queens", "Manhattan"]))

dataframeQueensManhattan.limit(5).show()
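
And if we want every record except those boroughs, the ~ operator negates the condition. A minimal sketch:

# All records whose borough is NOT Queens or Manhattan
dataframeOthers = dataframe1.filter(~dataframe1['borough'].isin(["Queens", "Manhattan"]))

dataframeOthers.limit(5).show()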

Now that we are getting the hang of this, let’s try two new things! First, let’s create a new data frame that contains only records where the LocationID is 100 or higher. Then, let’s return a random sample of roughly 10% of those records:

dataframeLocationIDOver100 = dataframe1.filter(dataframe1['LocationID'] >= 100)

dataframeLocationIDOver100.sample(fraction=0.1).show()
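
Keep in mind that sample() is random, so each run can return different rows, and roughly (not exactly) 10% of them. If you need a repeatable sample, you can pass a seed as well:

# Fixing the seed brings back the same pseudo-random sample on every run
dataframeLocationIDOver100.sample(fraction=0.1, seed=42).show()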

For our last example today, let’s look at two things. The first is using count with a groupBy. Then we’ll use sum with a groupBy and rename the resulting column:

dataframeCount = dataframe1.groupBy('borough').count()

dataframeCount.show()

dataframeLocationSum = dataframe1.groupBy('borough') \
    .agg({"LocationID": "sum"}) \
    .withColumnRenamed("sum(LocationID)", "LocationIDSum")

dataframeLocationSum.show()
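
The dictionary style of agg() works, but renaming the column afterwards gets tedious. The same aggregation can be written with the functions module and alias(), which names the column in one step:

from pyspark.sql import functions as F

# Same sum per borough, but the output column is named up front
dataframeLocationSum = dataframe1.groupBy('borough') \
    .agg(F.sum('LocationID').alias('LocationIDSum'))

dataframeLocationSum.show()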

Conclusion

At this point, you might be wondering why we keep creating new data frames when we are selecting only some of the data. The reason is that Spark leverages something called Lazy Evaluation. That means Spark will not actually build these new data frames until the data is returned (with a show statement, or by writing it to a file or database). Spark keeps track of all the transformations and only executes them when the results are actually needed.

This means we need to separate the transformation step from the display/return step. There are some exceptions, like when we dropped a column and assigned the result back to the same data frame. The idea is to keep our transformations of the original data frame isolated, so that other Spark jobs can do whatever transformations they need on the same source data frame without impacting each other.
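
To make the lazy evaluation concrete, here is a minimal sketch: the first two lines only record transformations, and nothing actually runs on the pool until the show() action at the end:

# Transformations: recorded in the plan, but not executed yet
dataframeBrooklyn = dataframe1.filter(dataframe1['borough'] == "Brooklyn")
dataframeBrooklynTop = dataframeBrooklyn.limit(5)

# Action: only now does Spark read the data and run the whole plan
dataframeBrooklynTop.show()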

With that, let’s stop here for today. Next time, we’ll look at creating a helper function that uses arrays and creates a bar chart!


Tags: Azure Synapse Analytics, Azure Synapse Spark Pools, Azure Synapse Spark Python
