• Home
  • About
  • Blog
  • Speaking
  • Contact

Azure Synapse Analytics – Serverless SQL Pools: Partitioning

12 January 2021SQL StijnAzure Synapse AnalyticsNo Comments

Welcome to the second post of the blog series I am writing on Azure Synapse Analytics. As you know Azure Synapse Analytics went GA December of 2020, and in the following blogposts I would like to walk you through the out-of-the-box capabilities you have to report easily on a vast dataset.

Azure Synapse Analytics series

  1. Azure Synapse Analytics – Serverless SQL Pools
  2. Azure Synapse Analytics – Serverless SQL Pools: Partitioning
  3. Azure Synapse Analytics – Serverless SQL Pools: Data Elimination

Setting the stage

In this blogpost we will again be using the TLC taxi dataset which contains over 1.4B rows.

In Today’s post I would like to show you how you can optimize your queries built by Serverless Synapse SQL Pools by partitioning your data. I will start by giving a bit of background on how Serverless SQL Pools handle queries in the background and tell you how partitioning fits in here.

Serverless SQL Pools has decoupled not only compute and storage, but it has also decoupled state. Decoupling state means that we are no longer depended on a compute node failure, which is the key difference with stateful architectures. The compute nodes in Serverless SQL Pools do not hold any state information, this allows it to partially restart queries in an event of a node failure AND This will also allow us to scale out without impacting any running transactions.

Another thing that makes the engine unique is that each compute node will be based on the SQL Engine which will then optimize the local data handling through the cost-based SQL System and help with scaling up the compute nodes.

This gives the Serverless SQL Pools the ability to not only scale-out, but also to scale-up nodes!

When a query gets send, what Serverless SQL Pools does, is create a logical list of “cells” from a dataset, these cells are then used to achieve parallelism by assigning them to multiple compute nodes. The data extraction of a cell is the responsibility of a single node, which is a SQL Server in the background.

The mapping of the “cells” to the compute for query execution is achieved through hash distribution, this will map the different distributions(“cells”) to compute nodes. With proper distribution, we can avoid data movement for the group by columns or join clause columns.

Data movement (Intermediate Steps) between compute nodes will slow down queries and is what we want to avoid. So, the Distribution of your data is key to having a fast-performing Serverless SQL Pool (Same as Stateful SQL Pools btw).

Next to distribution, we can also benefit from partition elimination by partitioning the data on equality or range predicates.

The Distribution and Partitioning strategy will allow the Serverless SQL Pools to reduce the amount of read action to the dataset and selectively reference grouping of “cells’ or single “cells” depending on the query you write.

Today’s post will focus on the partition elimination using range predicates or equality predicates on a parquet source.

Partitioning data

The dataset I used in the previous blog were parquet files located in a single folder without using any partitioning.

This means that we would not be able to benefit from partition elimination to optimize the queries we are using in our report. When we are not leveraging partitioning, queries will be slower, read more data and will also be more expensive.

I will start by partitioning my data by each year.

As you can see in the image above, I created a folder for each year and in those folders the data will reside for that specific year.

I will now create a new view in the ServerlessTaxi database I created in the previous blogpost, this view will reference the new partitioned parquet structure. I will be using a specific functionality for the openrowset which is called FilePath.


CREATE VIEW dbo.TaxiYearPartitioned
as
SELECT *,
cast(Taxi.filepath(1) as varchar(4)) as Year
FROM
OPENROWSET(
BULK 'https://xxx.dfs.core.windows.net/raw/Taxi/*/*.parquet',
FORMAT='PARQUET'
)
WITH
(
Passenger_Count int,
Trip_Distance float,
store_and_forward bit,
Payment_Type int,
mta_tax numeric(10,2),
VendorID int,
PULocationID int,
DOLocationID int,
tpep_pickup_datetime datetime,
tpep_dropoff_datetime datetime,
RatecodeID int,
Fare_amount numeric(10,2),
Extra float,
Tip_amount numeric(10,2),
Tolls_amount numeric(10,2),
Total_amount numeric(10,2),
improvement_surcharge numeric(10,2),
congestion_surcharge int,
PickupDate date,
DropoffDate date
)AS [Taxi]

Filepath(1) here specifies the first * in the folderpath of my openrowset which is the foldername containing the year, if I would like to specify the second part which would be the filename, I would have to write Filepath(2).

Reporting Impact

After creating this view, I would like to see what impact this would make on queries in my report using that year column.

I will start by creating a new Power BI Report. I will use Direct Query to get taxi dataset view unpartitioned and the taxi dataset partitioned by year. The unpartitioned dataset is a new view which has the year() function in there to fetch the year from the tpep_pickup_date.

On these two datasets I will create a year slicer for each dataset, and I will create a line graph showing me the number of sales by year and amount of passengers.

This is the small dashboard I build. The left side is the unpartitioned and the right side is the partitioned.

I opened up the performance analyzer and ran the refresh of the visuals.

The difference with the partitioned view is quite big, as for the year function we will need to analyze almost all of the data, while the partitioned data is able to just scan the partitions we want data from. This translates into the following queries in the query monitor in the Synapse Workspace.

Conclusion

By partitioning the data of the Taxi dataset, we can drastically reduce the amount of data being read. This minimizes the cost when running this specific report.

This will also give us a drastic performance increase as we can filter the number of “cells” pushed to compute.

In this specific case we read only 10% of the data when loading the graph, and almost read no data when loading the slicer. Next to this we reduced the query runtime to 10% of the original which are great performance improvements!

The filepath option you have in openrowset allows you to effectively tune your datasets, as this is not limited to one level, we can possibly create multiple partitioning layers and optimize queries even further.

This concludes this blogpost on Serverless SQL Pools and Partitioning, in the next blog of this series we will be talking about modeling your data with Serverless SQL pools.

Thank you for reading and stay tuned!

 5,846 total views,  9 views today

Tags: Azure Synapse Analytics, Partitioning, Power BI, Serverless SQL Pools

Related Articles

Azure Synapse SQL Pools: Loading Staging Tables

16 December 2020SQL Stijn

Azure Synapse Analytics – Serverless SQL Pools

4 January 2021SQL Stijn

Azure Synapse Analytics – Serverless SQL Pools: Data Elimination

9 March 2021SQL Stijn

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Recent Posts

  • Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving your Database to the lake
  • Azure Synapse Dedicated SQL Pools: Workload Management
  • Synapse Spark – Playing with Data Frames with Python
  • Synapse Spark – Reading CSV files from Azure Data Lake Storage Gen 2 with Synapse Spark using Python
  • Azure Synapse Analytics – Serverless SQL Pools: Data Elimination

Recent Comments

    Archives

    • June 2022
    • December 2021
    • May 2021
    • April 2021
    • March 2021
    • February 2021
    • January 2021
    • December 2020
    • August 2016
    • July 2016
    • March 2016
    • February 2016
    • December 2015
    • October 2015
    • August 2015
    • March 2015
    • December 2014
    • November 2014
    • July 2014
    • February 2014
    • January 2014
    • December 2013

    Categories

    • Azure Synapse Analytics
    • Building a Modern Lakehouse Data Warehouse
    • Geen categorie
    • MS PDW
    • SQL 2016
    • SQL Maintenance
    • SQL Server Migration
    • SQL Server Misconfiguration Chronicles
    • SQL2014
    • Troubleshooting

    Contact Stijn now


    © SQL Stijn 2017 - Privacy Policy - Cookie Policy
    Manage Cookie Consent
    To provide the best experiences, we use technologies like cookies to store and/or access device information.
    Functional Always active
    The technical storage or access is strictly necessary for the legitimate purpose of enabling the use of a specific service explicitly requested by the subscriber or user, or for the sole purpose of carrying out the transmission of a communication over an electronic communications network.
    Preferences
    The technical storage or access is necessary for the legitimate purpose of storing preferences that are not requested by the subscriber or user.
    Statistics
    The technical storage or access that is used exclusively for statistical purposes. The technical storage or access that is used exclusively for anonymous statistical purposes. Without a subpoena, voluntary compliance on the part of your Internet Service Provider, or additional records from a third party, information stored or retrieved for this purpose alone cannot usually be used to identify you.
    Marketing
    The technical storage or access is required to create user profiles to send advertising, or to track the user on a website or across several websites for similar marketing purposes.
    Manage options Manage services Manage vendors Read more about these purposes
    View preferences
    {title} {title} {title}