Since I have been working with a Parallel Data Warehouse during the past month, it was only logical that my next post would be on this subject. Because a PDW is a relatively new product from Microsoft, I thought an introduction post would be in order. In this post I’ll give you a small amount of info about the hardware/architecture a PDW uses, and the new concept of ELT vs ETL
-
The hardware!
The PDW uses the INFINIBAND network connection, this great new piece of technology ensures about 25gbps, which makes loads and data traffic of the PDW lightning fast! With this new technology you will be able to upload giant amounts of data in your PDW in minutes instead of hours!
The PDW uses MPP, which means he splits up the work he is doing under different nodes, in my case, this is on 2 compute and 1 control node (which have their own server/processors/cores). (Quarter PDW) This kind of architecture makes your (complex) queries 10 times faster. Check the table for the different options of PDW configuration. (Source = http://saldeloera.wordpress.com/2012/07/09/lesson-1-of-parallel-data-warehouse-basic-architecture-overview/)
Configuration | Servers | Processors | Cores | Space (Tb) |
HP PDW Full Rack | 17 | 22 | 132 | 125 |
HP PDW Full Rack with 4 Data Racks | 47 | 82 | 492 | 500 |
HP PDW Half Rack | 11 | 8 | 48 | 15-60 (Optional disc sizes available) |
-
ELT , the new ETL!
PDW can easily load large volumes of data into your Data warehouse, that is why the basic ETL process might not be the best option for the PDW.
In most systems today, the fastest method for data manipulation/integration is to extract your data, then transform it and then load it into you Data Warehouse. This is not the case for a PDW, because of its incredible load speed on unstructured unaltered data.
This is why for PDW we choose to use ELT, so we first extract the data, then load it into our PDW, and then start transforming it on the PDW using CTAS statements (Create Table as Select à more info in a future blog post). Due to the MPP, these CTAS statements are a very fast process.
Using ELT will prove to be a way better method for your Data manipulation/integration while using PDW, it is way less time-consuming then the normal ETL process!
In my future post I will become a lot more technical, explaining different user(My own fault J ) errors I encountered while configuring my first PDW, these are easy mistakes to make and once you know about them you will never make them again. The next post will be explaining all the secrets that the DWLOADER.exe has.