
Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving Your Database to the Lake

14 June 2022 | SQL Stijn | Azure Synapse Analytics, Building a Modern Lakehouse Data Warehouse

Hi and welcome to the first post in the Building a Modern Lakehouse Data Warehouse series. As you can see in the video [here], we will be building a modern data warehouse using Azure Synapse Analytics with a lakehouse strategy.

Today we will focus on moving your data from your database to the lake dynamically and into an organized folder structure.

Building a dynamic pipeline

Using the Integrate hub in Azure Synapse Studio we can build a single pipeline which handles all the copy activities using metadata queries from our source database. For this example we will be using an Azure SQL Database; however, you can also do this for other RDBMS (relational database management systems) and on-premises sources. There are some catches to building this for other RDBMS and on-premises sources, but the framework we will be showing you can remain the same.

Our pipeline will consist of three main components: a Lookup activity (or Script activity) to fetch our table list, a ForEach loop to iterate over the metadata, and a Copy Data activity inside the ForEach loop to dynamically copy our data.

Setting up the Lookup Activity

With the lookup we will fetch the list of tables and schemas to loop over. The lookup uses a dataset (query or table) that in turn uses a linked service (comparable to a connection string).

Let’s start by adding a Lookup activity, which you can find under the General tab in the activities pane of a new pipeline.

Click on settings, which is where we will configure the source dataset that the lookup will use to output a JSON (JavaScript Object Notation) result set in the pipeline.

In our case we will use an Azure SQL DB as source, so we will create a new dataset based on an Azure SQL DB linked service.

First, we will give our dataset a generic name because we will reuse the same dataset for all further actions in our pipeline.

The next item we need to configure is the linked service. This is where the first dynamic part comes into play. A linked service can point to one specific server, or you can use it for all your Azure SQL DB connections by parameterizing the connection string.

Start off by creating two parameters on the linked service level named “logicalservername” and “databasename”. You can find these at the bottom of the Edit linked service tab. Choose string as the type and leave the default value blank.

After specifying the parameters, move to the top of the tab, change the account selection method to “Enter manually” (instead of “From Azure subscription”) and add dynamic content.

Create and add a parameter for the logical server name, then repeat the same steps for the database name. This allows us to drive the connection string from metadata as well, and lets us reuse this pipeline for various Azure SQL DB sources.
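As a reference, the dynamic content for the two manually entered fields could look like the sketch below. This assumes the parameter names “logicalservername” and “databasename” from the previous step; the exact field labels may differ slightly depending on your Synapse Studio version.

    Fully qualified domain name:  @concat(linkedService().logicalservername, '.database.windows.net')
    Database name:                @linkedService().databasename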

If we look at the authentication method, we have chosen to use the system-assigned managed identity. This means that we will need to add the managed identity of the Synapse workspace (automatically created for each workspace) as a user to the Azure SQL DB, with sufficient rights to read the system tables and to read from all the tables we want to copy to the lake.
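As a minimal sketch, this could be done with the T-SQL below, run against the source Azure SQL Database. The workspace name is a placeholder, and db_datareader is just one way to grant read access to all user tables (and the metadata we query later); adjust it to your own security requirements.

    -- <synapse-workspace-name> is a placeholder for the name of your Synapse workspace
    CREATE USER [<synapse-workspace-name>] FROM EXTERNAL PROVIDER;
    ALTER ROLE db_datareader ADD MEMBER [<synapse-workspace-name>];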

Hit the Create button, then hit OK on the next screen, and our dataset and linked service will be created.

After creating the dataset, open it up again, as we need to pass the linked service parameters through to the actual pipeline.

Create two dataset-scoped parameters (under the Parameters tab) and add them to the linked service properties using dynamic content.
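Assuming the dataset parameters are also named “logicalservername” and “databasename”, the dynamic content for the linked service properties simply references the dataset parameters:

    logicalservername:  @dataset().logicalservername
    databasename:       @dataset().databasename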


After altering the dataset, go back to the pipeline and add two pipeline-level parameters (click anywhere in the pipeline except on an activity and you will see a tab with parameters).

Add these to the dataset properties of the lookup activity.
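Again assuming matching parameter names on the pipeline level, the dataset properties reference the pipeline parameters like this:

    logicalservername:  @pipeline().parameters.logicalservername
    databasename:       @pipeline().parameters.databasename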

As we are going to fetch all tables, we will not use a single table dataset, so we change this to a query. We will use sys.tables to fetch all schemas and tables of the Azure SQL DB.
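A query along the lines of the sketch below will do the job. The column aliases (schema_name and table_name) are my own choice and will be referenced again inside the ForEach loop, so keep them consistent.

    SELECT s.name AS schema_name,
           t.name AS table_name
    FROM sys.tables AS t
    INNER JOIN sys.schemas AS s
        ON t.schema_id = s.schema_id;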

After adding this, debug your pipeline and you should see the following output from the lookup activity.

Setting up the foreach loop

Now that we have the lookup activity set up, we need to use its output in a ForEach loop. In the activities pane, click on iteration and conditionals and drag in a ForEach loop. Link the lookup to the ForEach loop by dragging the green arrow onto it.

Click on the Items field and add the value node of the lookup activity output (as you can see in the image above containing the output of the lookup activity).
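Assuming the lookup activity kept its default name of “Lookup1”, the Items expression would look like this (the value node contains the array of rows returned by the lookup):

    @activity('Lookup1').output.value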

After this, click on the Activities tab in the ForEach task and click on the pencil, or just double-click the ForEach task in the GUI (graphical user interface).

Setting up the nested copy data task

From the activities pane click on move and transform and drag in a Copy Data task. Navigate to the Source tab and select the previously created Azure SQL DB dataset.
Add the two pipeline parameters to the dataset properties again and change the Use query option to Query instead of Table.

Click in the query field and choose to add dynamic content. We will now build a dynamic expression based on the items passed on by the value output of the lookup activity. (Watch out: these are case sensitive!)
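With the column aliases from the metadata query above, a sketch of that expression could be the following, where item() refers to the current row of the ForEach loop:

    @concat('SELECT * FROM [', item().schema_name, '].[', item().table_name, ']')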

This dynamic expression will select all columns from the table of the current iteration. This concludes the work we need to do on the Source tab.

Next up is our sink (destination). Our sink dataset does not exist yet, so we will need to specify the storage account, container, and folder we want to send our data to.

We also need to decide whether we will write to a CSV or a Parquet file. In most cases I prefer Parquet: it is a columnar file format that contains metadata (data types etc.) about the data that resides in it. It also gives you great compression, and the best IO is the IO you do not have to do.

Click on a new sink dataset and look for Azure Data Lake Storage Gen2.

For the file format, choose Parquet as stated above. As this is again a dataset, it will be linked to a linked service (connection string). Since we are using Azure Synapse, we can use the primary storage account linked to the Synapse workspace.

Leave the file path open for now, as we will be building that dynamically. Click OK and open the dataset again. Create two dataset-based parameters (on the Parameters tab of the dataset), one for the filesystem (container) and one for the directory. We will not create one for the file, as we don’t really care about the names of the files; we are working in a data lake, so we will be pointing to folders.

Head back over to the pipeline and create a pipeline parameter (click anywhere in the pipeline except on an activity to see the pipeline parameters tab) for the filesystem and add it to the dataset properties of the sink dataset.

For the directory path, we will use the items from the ForEach loop and the database name to build this dynamically.
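Assuming a pipeline parameter named “filesystem” and the same column aliases as before, the sink dataset properties could be filled in as follows, which yields a database/schema/table folder structure in the lake:

    Filesystem:  @pipeline().parameters.filesystem
    Directory:   @concat(pipeline().parameters.databasename, '/', item().schema_name, '/', item().table_name)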

After adding this, the pipeline is ready for publishing and can be used to move data from any Azure SQL DB source to a specific container in your Synapse-linked storage.

Additional notes

  1. Column names with spaces are not supported in Parquet files
  2. Certain data types are not supported in Parquet files (Geography, for example)

Result

If you followed all the steps above, you should have a database/schema/table folder structure in your data lake.

This concludes this first blog on moving your data to the lake. This framework can be used for various sources; however, most environments are unique, so it may be subject to some changes for edge cases in your environment!

Thank you for reading!

Tags: adls, adls gen 2, Azure Data Factory, data lake, lakehouse, parameterized pipeline, parquet, pipeline, Synapse, synapse integrate
