
Synapse Spark – Reading CSV files from Azure Data Lake Storage Gen 2 with Synapse Spark using Python

20 April 2021 | Kevin Conan | Azure Synapse Analytics

Background

I am learning Spark Pools, one of the newest features of Azure Synapse. If you are new to them too, I hope you will join me on this journey! Let’s start small, with something that many people will want to do: reading a CSV file that is stored in Azure Data Lake Storage Gen 2.

Getting into a Spark Notebook

When you set up your Synapse Workspace via the Portal, it guides you through creating a storage account that is linked for you (including the permissions needed to access it from Spark Pools). If you want to read more about how to do this manually, here is a link to the documentation on configuring access for Synapse Spark to Azure Data Lake Storage Gen 2: Introduction to Microsoft Spark utilities – Azure Synapse Analytics | Microsoft Docs
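
As a quick sanity check that your pool can actually reach the linked storage, the Spark utilities from that doc can list a container. Here is a minimal sketch, assuming placeholder account and container names:

from notebookutils import mssparkutils

# Placeholders - use your own container and linked storage account
path = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/'

# If this listing succeeds, the Spark Pool has access to the storage account
for f in mssparkutils.fs.ls(path):
    print(f.name, f.size)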

I’m going to assume that you have created a Synapse Workspace with a Spark Pool, that you have a storage account linked with access configured, and that you have a CSV file sitting in a container.

So now, let’s go to the Synapse Analytics Studio!  

From the main screen of the Synapse Analytics Studio, click on “Develop”. On the next screen, click on the “+” beside Develop and choose “Notebook”.

 

To clean up our screen, you can collapse the Properties pane on the right-hand side. You can do this by clicking on the second icon from the right, just above the pane. I’ve circled it in the screenshot below.

Next, you will want to click on the “Attach to” drop-down and select your Spark Pool.

 

Reading a CSV File

In our code block, let’s use this template: 

from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Placeholders - fill in your own storage account, container, and folder path
account_name = "Your account name"
container_name = "Your container name"
relative_path = "Your path"

# Build the abfss:// URI that points at the folder in ADLS Gen 2
adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)

# Synapse notebooks provide the SparkSession as `spark`; read the CSV,
# treating the first row as column headers
dataframe1 = spark.read.option('header', 'true') \
    .option('delimiter', ',') \
    .csv(adls_path + '/FileName.csv')

# show() prints the first 20 rows of the data frame
dataframe1.show()

This code is in Python, or PySpark. First, we build the abfss:// path that points to the folder holding our CSV file. Next, we create a data frame by reading the file (in a Synapse notebook the SparkSession is already available as the spark object), and finally show() displays the top 20 records from the data frame.
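
One thing to note: read this way, every column comes back as a string. The pyspark.sql.types import in the template becomes useful if you want to declare the columns yourself. Here is a minimal sketch; the column names are hypothetical:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema - replace the fields with your file's actual columns
schema = StructType([
    StructField('LocationID', IntegerType(), True),
    StructField('Borough', StringType(), True),
    StructField('Zone', StringType(), True)
])

# Passing a schema skips type inference and gives you proper column types
dataframe1 = spark.read.option('header', 'true') \
    .schema(schema) \
    .csv(adls_path + '/FileName.csv')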

I’m going to populate the above template to use one of the NY Taxi Data’s dimension CSV files in my data lake.  Next, I will click on the blue play button on the left side of the code block to execute the code.   

Please note that if this is the first code being run in a while, the Spark cluster might need to start. Starting a Spark Pool can take a few minutes.

Here is what my code block looks like and what my output looks like: 
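
The screenshot itself does not survive in text form, but the populated template looks roughly like this; the storage account, container, path, and file name below are hypothetical stand-ins for the values in my data lake:

# Hypothetical values - substitute your own storage account and container
account_name = "mydatalake"
container_name = "nyctaxi"
relative_path = "dimensions"

adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)

# Hypothetical file name for one of the NY Taxi dimension CSVs
dataframe1 = spark.read.option('header', 'true') \
    .option('delimiter', ',') \
    .csv(adls_path + '/taxi_zone_lookup.csv')

dataframe1.show()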


Conclusion and Further Reading

Now we have the CSV file in a data frame! With a data frame, we can filter the data, transform it, and write it back out to a data lake or a database.
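
For a taste of what that looks like, here is a minimal sketch that filters the data frame and writes the result back to the lake as Parquet; the column name and output folder are hypothetical:

from pyspark.sql.functions import col

# Hypothetical filter: keep only the rows for one borough
filtered = dataframe1.filter(col('Borough') == 'Manhattan')

# Write the result back to the data lake as Parquet (hypothetical output folder)
filtered.write.mode('overwrite').parquet(adls_path + '/output/manhattan_zones')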

Here is the link to documentation that helped me out – Synapse Studio notebooks – Azure Synapse Analytics | Microsoft Docs 

If you prefer C#, here is some documentation for doing this with C# – Introduction to Microsoft Spark utilities – Azure Synapse Analytics | Microsoft Docs 


Tags: Azure Synapse, Azure Synapse Spark Pools, Azure Synapse Spark Python

