In previous posts I talked about DWLoader and the theory behind distribution and replication. Today I'm going to explain how you can combine that knowledge to achieve the ultimate PDW load speed with DWLoader.
Three things determine your load speed:
- File size
- Destination Table Design
- Structure of the file
I will start with how file size affects your load speed. For example, say you have a fact table of about 1 000 000 000 rows. If you load this as a single file straight into your PDW with DWLoader, you will see that the load speed falls short of what you would expect over InfiniBand. How did I solve this? The fix is quite straightforward: use a file-splitting program to break the file into pieces with the desired number of rows. I tend to use files of 100 000 000 rows each. You can do this with GSplit or a similar tool. After splitting, I loaded the files in parallel from a batch file to max out the InfiniBand link.
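If you would rather script the split than use a GUI tool, the idea can be sketched in a few lines of Python. This is only a minimal sketch, not the method I used for the numbers in this post: the function names `split_by_rows` and `run_parallel` are made up for illustration, and it assumes a plain text file with one row per line and no header row.

```python
import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_by_rows(source, out_dir, rows_per_file):
    """Split a text file into numbered parts of at most rows_per_file lines each."""
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be at least 1")
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    # Binary mode keeps the original row delimiters and encoding untouched.
    with source.open("rb") as src:
        index = 0
        while True:
            # Lazily take the next rows_per_file lines; nothing is held in memory.
            rows = itertools.islice(src, rows_per_file)
            first = next(rows, None)
            if first is None:  # end of the source file
                break
            index += 1
            part = out_dir / f"{source.stem}_{index:04d}{source.suffix}"
            with part.open("wb") as dst:
                dst.write(first)
                dst.writelines(rows)
            parts.append(part)
    return parts


def run_parallel(commands, max_workers=4):
    """Run each command (a list of arguments) concurrently; return exit codes in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

You would call `split_by_rows` with 100 000 000 as the row count, then hand `run_parallel` one command per part. Each command is simply the DWLoader command line you already use for a single file, with only the input file changed. The number of workers is something to tune by experiment; the default of 4 is an arbitrary placeholder.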
Now that your file is split up, we can move on to the second thing that affects your load speed: your table design, and more specifically your choice of distribution key. If you want more information on how to select a good distribution key, I suggest you read my previous post (http://swyssql.wordpress.com/2014/01/21/pdw-distribution-replication/). If you have followed the four steps for selecting your distribution key, your load speed should not be affected by your table design. But if you notice that your load speed is a lot slower than expected, it is worth taking another look at the key you chose.
The last factor that determines your load speed is the structure of your file. As you know, PDW achieves incredible results when it comes to processing unstructured data; however, loading structured data into your PDW can hurt your load speed. The thing to avoid is choosing a distribution key on which your source file is structured. This brings me to the most important conclusion I reached while tuning DWLoader: sometimes it is better to load your data into a staging table with a different distribution key than your ideal distribution key. Afterwards you can use the CTAS magic to transform that table into one with the distribution key that performs best for your queries (the ELT principle: http://swyssql.wordpress.com/2013/12/13/basic-introduction-to-ms-pdw/ ).
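To make the staging-then-CTAS step concrete, here is a small Python helper that builds the CTAS statement for you. Treat it as a sketch under assumptions: the helper name `build_ctas` and the table and column names in the usage note are hypothetical, and the identifiers are pasted into the SQL as-is, so only pass names you trust.

```python
def build_ctas(target, staging, distribution_key):
    """Return a CTAS statement that copies the staging table into a new
    target table, hash-distributed on distribution_key."""
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (DISTRIBUTION = HASH({distribution_key}))\n"
        f"AS SELECT * FROM {staging};"
    )
```

For example, `build_ctas("dbo.FactSales", "dbo.FactSales_Stage", "CustomerKey")` gives you a statement that rebuilds a staging table (loaded on whatever key loads fastest) into a final table distributed on `CustomerKey`. You then run that statement against the appliance with your usual query tool.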
If you keep these three hints in mind, you should achieve your ultimate DWLoader speed (1.3 GBPS is what we have achieved so far :) ).
In future posts I will talk about the PolyBase feature of PDW and how to set up your very own Hadoop cluster to combine Big Data using external table commands.
Stay tuned!