
Azure Synapse Analytics – Serverless SQL Pools

4 January 2021 | SQL Stijn | Azure Synapse Analytics

Welcome to the first post in the blog series I am writing on Azure Synapse Analytics. Azure Synapse Analytics went GA at the end of 2020, and in the following blog posts I would like to walk you through the out-of-the-box capabilities it gives you to report easily on a vast dataset.

Azure Synapse Analytics Series:

  1. Azure Synapse Analytics – Serverless SQL Pools
  2. Azure Synapse Analytics – Serverless SQL Pools: Partitioning
  3. Azure Synapse Analytics – Serverless SQL Pools: Data Elimination

Serverless SQL Pools allow you to build a logical structure on top of your files hosted in Azure Data Lake and query them directly. In this blog post I will show you how to create that structure and how to report on it with Power BI.

Setting the stage

First things first: I downloaded the TLC yellow taxi dataset from 2010 until 2020 and converted it from CSV to Parquet, landing all the Parquet files in one folder. This did require some cleaning, so please do not hesitate to leave a comment if you would like access to the cleansed set to test these things out yourself!

-- Count every row across all Parquet files in the folder
SELECT
    count(*)
FROM
    OPENROWSET(BULK 'https://xxx.dfs.core.windows.net/raw/tx_pq_full/', FORMAT = 'PARQUET') AS [result]

In the image below you can see the count I ran using Serverless SQL Pools in Azure Synapse Studio. The set we are working with has more than 1.4 billion rows!

Creating a logical structure

I will start off by creating a logical structure for the Parquet files in the folder.

The easiest way to get started is by querying the data set with OPENROWSET. This way you can explore your data and see what is in it. OPENROWSET supports CSV and Parquet files; in this case we are using Parquet files.
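To compare, reading a folder of CSV files only needs a different FORMAT and a few CSV-specific options. A minimal sketch (the URL is a placeholder, and PARSER_VERSION '2.0' is what enables the HEADER_ROW option):

-- Minimal sketch for CSV files; placeholder URL, parser version 2.0 enables HEADER_ROW
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://xxx.dfs.core.windows.net/raw/tx_csv_full/',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [result]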

Let’s start off by doing a “select top 100 *” to see what the data looks like. In Synapse Studio, I navigate to the Data tab, click on Linked, and select the Azure Data Lake account that contains the folder with all the data. I then open the folder hosting the Parquet files and right-click one of the files.
Hovering over New SQL script gives us the option to generate a select top 100 * query. Clicking this provides you with the syntax you need to select data from that specific file.

This is the code that will be generated:

SELECT
    TOP 100 *
FROM
    OPENROWSET(BULK 'https://XXX.dfs.core.windows.net/raw/tx_pq_full/XXX.snappy.parquet', FORMAT = 'PARQUET') AS [result]

This, however, will only select data from that one Parquet file, while we would like to query the full set of data in the folder. To query the full folder, we either remove the XXX.snappy.parquet at the end or specify the * wildcard instead of XXX.

SELECT
    TOP 100 *
FROM
    OPENROWSET(BULK 'https://openhackdai.dfs.core.windows.net/raw/tx_pq_full/', FORMAT = 'PARQUET') AS [result]

OR

SELECT
    TOP 100 *
FROM
    OPENROWSET(BULK 'https://openhackdai.dfs.core.windows.net/raw/tx_pq_full/*.snappy.parquet', FORMAT = 'PARQUET') AS [result]

In the image below you can see that we will fetch the first 100 rows from the full dataset.

Now that we are able to select on the folder, we can persist this query in a view to capture it in a logical structure. This way I can just reference the view instead of specifying the OPENROWSET call with the folder path.
To start creating this logical structure, we first need to create a database, and in that database we can then define the view referencing the OPENROWSET query. This new view will be used in the Power BI report later on.
(Next to this, you also have the option to create an external table on the files to build your logical structure; see the sketch after the view definition below, and more information at https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-external-tables)

-- Create a database to host the logical structure
CREATE DATABASE ServerlessTaxi
GO
USE ServerlessTaxi
GO
-- The view wraps the OPENROWSET query over the full folder
CREATE VIEW dbo.Taxi
AS
SELECT
    *
FROM
    OPENROWSET(BULK 'https://openhackdai.dfs.core.windows.net/raw/tx_pq_full/*.snappy.parquet', FORMAT = 'PARQUET') AS [result]

The statements above will create the ServerlessTaxi database containing the dbo.Taxi view.
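As an aside, a minimal sketch of the external-table alternative could look like this. The data source and file format names are mine, and the column list is an illustrative subset of the common TLC yellow taxi schema; adjust both to your own files.

-- External data source pointing at the container (name is an assumption)
CREATE EXTERNAL DATA SOURCE TaxiDataLake
WITH (LOCATION = 'https://openhackdai.dfs.core.windows.net/raw')
GO
-- Reusable Parquet file format definition
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET)
GO
-- External table over the folder; columns are an assumed subset of the TLC schema
CREATE EXTERNAL TABLE dbo.TaxiExternal
(
    VendorID INT,
    tpep_pickup_datetime DATETIME2,
    tpep_dropoff_datetime DATETIME2,
    passenger_count INT,
    trip_distance FLOAT,
    PULocationID INT,
    DOLocationID INT,
    total_amount FLOAT
)
WITH (
    LOCATION = 'tx_pq_full/',
    DATA_SOURCE = TaxiDataLake,
    FILE_FORMAT = ParquetFormat
)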

From here on we can use the view to query the data.
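For example, a quick sanity check straight on the view (tpep_pickup_datetime and total_amount are assumed from the TLC yellow taxi schema):

-- Trips and total sales per pickup year, queried through the view
SELECT
    YEAR(tpep_pickup_datetime) AS PickupYear,
    count(*) AS Trips,
    SUM(total_amount) AS TotalSales
FROM
    dbo.Taxi
GROUP BY YEAR(tpep_pickup_datetime)
ORDER BY PickupYear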

We will use this view to start building a report in Power BI.

Building a report

Open Power BI Desktop and click on Get Data. As Serverless SQL Pools expose a SQL endpoint, you can choose SQL Server and connect to your serverless SQL endpoint (you can find this on the overview page of the Synapse workspace in the Azure portal).

We will be using the ServerlessTaxi database we created, and as you can see below, we see the Taxi view we created. As we are building a report on 1.4 billion rows, we should use DirectQuery instead of Import!

After this I created a simple report and used the Performance analyzer to see how the report performs on this huge dataset.

I created a few cards which include a calculation, added some slicers, and built a matrix showing sales by LocationID. This report runs on more than 1.4 billion rows, and the slowest visual, the Total Sales amount card, returns its result in less than 11 seconds. Yes, you are reading this correctly: IN LESS THAN 11 SECONDS ON 1.4 BILLION ROWS, OUT OF THE BOX!

This just shows you the power behind Serverless SQL Pools: they can generate quick insights on very large datasets, all in native SQL.

Now there must be a downside to this, and the one thing that comes to mind is the cost of the solution.
You will need to watch the number of queries you run and the amount of data you process in those queries, because in Serverless SQL Pools your cost is based on the amount of data processed. There is a great blog post elaborating on Serverless SQL Pools costs by Nikola Ilic, which you can find here: https://data-mozart.com/cost-control-in-synapse-analytics-serverless-sql-easy-way/.
To reduce costs and increase performance, you can optimize the queries and the amount of data they process. And this is exactly what I will be showing you in the next blog post of this series!
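In the meantime, serverless SQL pools also come with built-in cost controls you can use as a safety net; a minimal sketch, run against the serverless endpoint:

-- Show how much data has been processed today, this week, and this month
SELECT type, data_processed_mb FROM sys.dm_external_data_processed

-- Cap the data processed per month (the limit is in MB; 1 TB shown here)
EXEC sp_set_data_processed_limit @type = N'monthly', @limit_mb = 1048576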

Summary

Serverless SQL Pools give you an easy and fast way to start reporting on and exploring very large datasets. You do not have to load and transform your data into a relational structure; with Serverless SQL Pools you can create your own logical structure using OPENROWSET views and external tables referencing the files directly. Once you have created your logical structure, you can query it in plain SQL. Serverless SQL Pools allow us to report on huge datasets in seconds and open up a whole new world of opportunities when analyzing and reporting on your data.

In the next blog post, I will show you a bit more about how Serverless SQL Pools interact with the underlying Parquet layer and how you can optimize your queries both for performance and cost.


Tags: Azure Synapse Analytics, Power BI, Serverless SQL Pools
