Today I want to talk a bit more about Distribution and Replication and how they are used in a PDW. I’m going to start by answering two questions.
What is Distribution?
Because MS PDW is an MPP system (http://swyssql.wordpress.com/2013/12/13/basic-introduction-to-ms-pdw/), it uses different nodes to ensure fast querying! With distribution we take our large tables and spread them over the different nodes, thus avoiding the I/O, Ethernet, and disk bottlenecks.
In a basic DWH you will choose your fact table as your distributed table. In the case of a quarter-rack PDW this means your fact table will be distributed across your two data nodes using the distribution key, while the control node contains metadata about the distribution of the fact table. There are cases where you will replicate your fact table, but only when your fact table is very small.
What is Replication?
With replication we copy a certain table to all your nodes to make your queries return results faster. In most data warehouses these will be your dimension tables, which are small compared to the fact table. Because these tables only need a tiny amount of space per node, it is better for your query-join performance to replicate them on all your nodes.
Now that we know what distribution and replication are, I’m going to continue by explaining how this affects your table design in PDW. The first thing you will have to do is identify which tables should be replicated and which distributed; this analysis can be done quite quickly by checking the size of the tables in your original system.
Once you have decided which tables are distributed & replicated, you can start on your table design.
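To get a quick overview of the table sizes in your original SQL Server system, you can query the standard sys.dm_db_partition_stats DMV. This is just a sketch of the kind of size check I mean; run it in the source database:

-- Approximate row counts and reserved space per table on the source system.
-- Big tables are distribution candidates, small ones replication candidates.
SELECT
    t.name                                 AS table_name,
    SUM(ps.row_count)                      AS row_count,
    SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM sys.tables t
JOIN sys.dm_db_partition_stats ps
    ON ps.object_id = t.object_id
GROUP BY t.name
ORDER BY reserved_mb DESC;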
Replicated table
I will start off with giving you an example create table statement for a replicated table.
CREATE TABLE DIM_FUNCTION(
[FUNCTION_ID] [int] NOT NULL,
[FUNCTION_NAME] [varchar](30) NOT NULL,
[FUNCTION_NAME_SHORT] [char](5) NOT NULL
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);
The first part of your CREATE TABLE statement is exactly the same as in a normal DWH, but the WITH section is something new. To define this table as a replicated table you put DISTRIBUTION = REPLICATE in your table design, and PDW will automatically replicate the table to the different nodes of your PDW.
You can choose to add the CLUSTERED COLUMNSTORE INDEX part of the statement. In my experience your PDW queries will work better with a clustered columnstore index. This is because the data is now stored column by column (versus row by row) and compressed in an efficient way. If your query only selects a few columns, e.g. SELECT LASTNAME, FIRSTNAME, you will not read the job name or function columns at all. You read less from disk, and your queries run a lot faster!
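As a small illustration against the DIM_FUNCTION table above: with a clustered columnstore index, only the column segments the query references are read.

-- Only the FUNCTION_NAME and FUNCTION_NAME_SHORT segments are read;
-- the FUNCTION_ID column stays on disk untouched.
SELECT [FUNCTION_NAME], [FUNCTION_NAME_SHORT]
FROM DIM_FUNCTION;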
That’s all there is to creating a replicated table. I will now continue with the distributed table, which is a bit more complex.
Distributed table
Again I will start by giving you an example statement for a distributed table.
CREATE TABLE [FACT](
[file_id] [int] NOT NULL,
[Function] [int] NOT NULL,
[Place] [int] NOT NULL
)
WITH
(
DISTRIBUTION = HASH (file_id),
CLUSTERED COLUMNSTORE INDEX
);
As said in the replicated table part, the first part of your statement is exactly the same as in a normal SQL Server, but the WITH part is something new! You can see that this time the DISTRIBUTION clause hashes a key. This is your distribution key, and it is the most important thing you will configure in your database: the right choice of key will result in lightning-fast querying!
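To show why this matters, here is a sketch of a join between the FACT table above and the replicated DIM_FUNCTION table. Because a replicated table exists in full on every node, this join is always distribution compatible and each node can do its share of the work locally:

-- FACT is distributed on file_id; DIM_FUNCTION is replicated on every node,
-- so each node joins its own FACT rows locally and no data is shuffled.
SELECT d.[FUNCTION_NAME], COUNT(*) AS row_cnt
FROM [FACT] f
JOIN DIM_FUNCTION d
    ON f.[Function] = d.[FUNCTION_ID]
GROUP BY d.[FUNCTION_NAME];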
How do you choose your Distribution Key?
When I choose my distribution key I analyze all my columns by asking four questions.
- Is this column being used in a lot of my queries?
- If your column is being used for many joins, it might be a good candidate for the distribution key. The most expensive operation while querying your PDW is the shuffle performed by the Data Movement Service (DMS), which recreates a distributed table on the nodes to make a join compatible; these shuffles will significantly slow down your queries. If you choose a column that is used in a lot of joins, you avoid incompatible joins and ensure excellent query performance.
- Does this column have a High Cardinality?
- This question is really about how to avoid skew. Skew appears when you distribute your table on the wrong column: if the column contains a lot of NULL values and does not have many distinct values, the system cannot spread your data evenly over your different nodes. One node ends up with a lot of data while the others hold only a small amount. When this happens you are effectively using only one node of your MPP system, and I don’t have to explain why that is bad for your query performance.
- Is this column compatible with my Aggregates?
- If your column is compatible with your aggregates (for example, it appears in your GROUP BY clauses), each node can compute its aggregates locally without moving data, which will improve your query performance.
- Will this column ensure a nice balance between workload & query behavior per node?
- This is what you ask yourself after the three previous questions; then you can determine the best distribution key for your DWH. I tend to use a point system to score the different columns and compare them, so I can easily choose the most logical distribution key.
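Once you have loaded data with a candidate key, you can check how evenly the rows landed across the distributions. On PDW the DBCC PDW_SHOWSPACEUSED command reports rows and space used per distribution (the exact output columns may differ between Appliance Update versions):

-- Shows rows and space used per distribution for the FACT table;
-- large differences between distributions indicate skew.
DBCC PDW_SHOWSPACEUSED ("FACT");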
In future posts I will explain the relation between your table design and your load speed (DWLOADER PART 2), and how to tune your queries for maximum performance.
Stay tuned!