Since I have been working with a Parallel Data Warehouse during the past month, it was only logical that my next post would be on this subject. Because a PDW is a relatively new product from Microsoft, I thought an introduction post would be in order. In this post I’ll give you a small amount of info about the hardware/architecture a PDW uses, and the new concept of ELT vs ETL
-
The hardware!
The PDW uses the INFINIBAND network connection, this great new piece of technology ensures about 25gbps, which makes loads and data traffic of the PDW lightning fast! With this new technology you will be able to upload giant amounts of data in your PDW in minutes instead of hours!
The PDW uses MPP, which means he splits up the work he is doing under different nodes, in my case, this is on 2 compute and 1 control node (which have their own server/processors/cores). (Quarter PDW) This kind of architecture makes your (complex) queries 10 times faster. Check the table for the different options of PDW configuration. (Source = http://saldeloera.wordpress.com/2012/07/09/lesson-1-of-parallel-data-warehouse-basic-architecture-overview/)
Configuration | Servers | Processors | Cores | Space (Tb) |
HP PDW Full Rack | 17 | 22 | 132 | 125 |
HP PDW Full Rack with 4 Data Racks | 47 | 82 | 492 | 500 |
HP PDW Half Rack | 11 | 8 | 48 | 15-60 (Optional disc sizes available) |
Since I’m not a real hardware expert I will not be going in deeper on this subject, if you want to know more about the MPP and the hardware PDW uses I suggest you read the blog post from Mark Stacey on http://markgstacey.net/2013/07/07/introduction-to-pdw-parallel-data-warehouse/, this was very helpful for me when I started working on PDW since it explains how it actually works, and this helps you understand how the PDW does the things it does so fast!
You can also find different pdf’s from Microsoft on this subject.
-
ELT , the new ETL!
PDW can easily load large volumes of data into your Data warehouse, that is why the basic ETL process might not be the best option for the PDW.
In most systems today, the fastest method for data manipulation/integration is to extract your data, then transform it and then load it into you Data Warehouse. This is not the case for a PDW, because of its incredible load speed on unstructured unaltered data.
This is why for PDW we choose to use ELT, so we first extract the data, then load it into our PDW, and then start transforming it on the PDW using CTAS statements (Create Table as Select à more info in a future blog post). Due to the MPP, these CTAS statements are a very fast process.
Using ELT will prove to be a way better method for your Data manipulation/integration while using PDW, it is way less time-consuming then the normal ETL process!
In my future post I will become a lot more technical, explaining different user(My own fault J ) errors I encountered while configuring my first PDW, these are easy mistakes to make and once you know about them you will never make them again. The next post will be explaining all the secrets that the DWLOADER.exe has.