Welcome to the new face of the Knowledge Sharing Network.
For older blog posts, refer to our historical knowledge sharing site, and you will find this site your destination for the SQL Server knowledge pool.



How can SQL Server 2008 help to build a Fast Track Data Warehouse?

At the beginning of this year I posted here about Build your BI skills in fast track data warehouse, and based on that I have had a few questions and feedback on how best we can use SQL Server 2008 features to build a Fast Track Data Warehouse.

Based on the material and resources available on the Microsoft site, the approach must focus on the data access layer, configuration (hardware and software), and maintaining normalization practices. When you talk about Business Intelligence, the data access layer is key: data warehouse workloads depend on sequential I/O access. The platform must be scalable in terms of performance and access methods on the hardware side as well as on the data layer side. By design, a data warehouse store performs large file reads with relatively few writes; this involves one-block operations at the disk level that obtain the data with only a minimal latency penalty from the disk drive head having to move to a new location. Random I/O, on the other hand, involves large numbers of seeks and rotations and is usually much slower.
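To see why that sequential/random distinction matters so much, here is a back-of-the-envelope sketch. The per-drive transfer rate and seek latency below are illustrative assumptions, not measurements of any particular hardware:

```python
# Rough comparison of sequential vs. random disk throughput.
# SEQ_MBPS and SEEK_MS are assumed figures for illustration only.

PAGE_KB = 8                      # SQL Server data page size
SEQ_MBPS = 100.0                 # assumed sustained sequential read rate of one spindle
SEEK_MS = 8.0                    # assumed average seek + rotational latency

def random_read_mbps(io_kb=PAGE_KB, seek_ms=SEEK_MS, seq_mbps=SEQ_MBPS):
    """Effective throughput when every read pays a head-movement penalty."""
    transfer_ms = io_kb / 1024 / seq_mbps * 1000   # time to move the data itself
    total_ms = seek_ms + transfer_ms               # seek dominates for small reads
    return (io_kb / 1024) / (total_ms / 1000)      # MB per second

print(f"Sequential scan: {SEQ_MBPS:.0f} MB/s")
print(f"Random 8 KB reads: {random_read_mbps():.1f} MB/s")
```

Under these assumptions the same spindle that scans at 100 MB/s delivers around 1 MB/s when every 8 KB page costs a seek, which is exactly the gap the Fast Track configuration is designed to avoid.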

As an analyst for a BI implementation, you need to understand the performance constraints related to configuration techniques on hardware, software, and the application (if needed) to maximize sequential I/O data access. By design, I/O from an instance of the SQL Server Database Engine includes logical and physical reads: a logical read occurs every time the Database Engine requests a page from the buffer cache, and if the page is not currently in the buffer cache, a physical read first copies the page from disk into the cache. You can help the Database Engine in that the relational engine determines the most effective access method (such as a table scan, an index scan, or a keyed read), while the access methods and buffer manager components of the storage engine determine the general pattern of reads to perform and optimize the reads required to implement the access method.

This performance optimization method is called read-ahead; BOL clarifies that it "....the data and index pages needed to fulfill a query execution plan and brings the pages into the buffer cache before they are actually used by the query. This allows computation and I/O to overlap, taking full advantage of both the CPU and the disk". This mechanism lets the Database Engine read up to 64 contiguous pages (512 KB) from one file. The read is performed as a single scatter-gather read into the appropriate number of (probably non-contiguous) buffers in the buffer cache. This is where you will observe table scans and index scans; that is an entirely different topic that will be covered in the performance tuning blog section of this site.
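The arithmetic behind that 64-page limit, and what it buys a large scan, looks like this (the helper function is mine, for illustration):

```python
# The read-ahead limit quoted above: up to 64 contiguous 8 KB pages
# fetched from one file in a single scatter-gather read.

PAGE_KB = 8
MAX_PAGES_PER_READAHEAD = 64
print(MAX_PAGES_PER_READAHEAD * PAGE_KB, "KB per read-ahead I/O")   # 512 KB

def readahead_batches(n_pages, batch=MAX_PAGES_PER_READAHEAD):
    """Number of read-ahead I/Os needed to scan n contiguous pages."""
    return -(-n_pages // batch)    # ceiling division

# A 1 GB table is 131,072 pages of 8 KB; with full-sized read-ahead it can
# be scanned in 2,048 large reads instead of 131,072 single-page reads.
print(readahead_batches(131_072))
```

Fewer, larger I/Os are exactly what keeps the disk heads streaming sequentially, provided the pages really are contiguous on disk.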

Coming back to the configuration model (Fast Track Data Warehouse) approach from Microsoft, the key to maintaining performance over time is preserving the sequential order of the data on disk by preventing and/or minimizing fragmentation of the data under normal operating conditions. The programme highlights objectives such as:

  • Providing an initial SSDW sequential I/O data access experience with known performance and scalability characteristics.
  • Developing a method that is focused on understanding the performance characteristics, limitations, and the cause-and-effect relationships of each component contained in the configuration.
  • Providing proven recommendations of SSDW best practices for sequential I/O data access that are simple, easy to implement, and include holistic recommendations for database setup, configuration, and data loading.

One of the big questions was: how is the Fast Track approach different?

Microsoft documentation highlights that the approach is specifically focused on building scalable, CPU core-balanced configurations to support SSDW sequential I/O data access workloads. A holistic and vendor-agnostic approach is used to address the CPU-to-I/O balance of the system, efficiently designing a configuration that is intentionally targeted at and optimized for sequential data access. This is designed to achieve the best possible performance by matching all internal components to their individual maximum performance levels while not exceeding the capacity of any component above them in the stack, and then applying principles of physical and logical layout that allow the hardware to deliver this maximum performance in a real environment.

So you need to be aware that attaining such a scalable platform means determining the right mix of hardware, software, layout, and configuration settings. This differs significantly from conventional reference architectures because the focus is on determining the right file and software configuration while maintaining maximum performance of all items in the hardware stack, starting with the CPU and its cores, and working back down to the storage.

The underlying optimization technique uses a building-block approach: to keep performance constant per MB of data processed, there must be enough complete building blocks to scale up the CPU and scale out the storage together. Scaling up the CPU cores in unison with the storage components provides near-constant performance for larger data volumes. It is also possible to increase storage capacity without adding more CPU cores by using larger drives or more drives per LUN; however, this implies that queries covering a wider range of data will have lower performance. It is important to note that adding more data while still querying the same-sized ranges per query will not decrease performance.
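The core-balancing idea above can be sketched as a simple supply-and-demand calculation. The per-core consumption rate and per-LUN delivery rate below are assumed placeholder figures, not Fast Track ratings:

```python
# Sketch of the core-balanced building-block idea: scale CPU demand and
# storage supply together so neither side becomes the bottleneck.
# Both rates are illustrative assumptions.

MBPS_PER_CORE = 200.0      # assumed scan rate one CPU core can consume
MBPS_PER_LUN = 100.0       # assumed sustained sequential rate one LUN can deliver

def luns_for_cores(cores):
    """LUNs needed so storage can feed every core at its full scan rate."""
    required_mbps = cores * MBPS_PER_CORE
    return -(-int(required_mbps) // int(MBPS_PER_LUN))   # ceiling division

for cores in (4, 8, 16):
    print(cores, "cores ->", luns_for_cores(cores), "LUNs")
```

Doubling the cores without doubling the LUNs leaves the CPUs starved of data; doubling the LUNs without the cores leaves bandwidth unused, which is the imbalance the building-block method is meant to prevent.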

Lastly, you can download the Spreadsheet Calculator, which has been developed to assist with optimizing the software configuration on the selected hardware, based upon specified component performance ratings and internal data feed requirements for SQL Server.

  • The spreadsheet advises on the number of storage arrays, disks, and LUNs required to achieve the specified throughput, given the hardware detailed in the calculator.
  • The spreadsheet has pre-populated ratings determined from component testing and research for both the EMC CX4-240 and HP MSA2000 storage options. See above for details on how to determine the SQL Server per-core requirement for any new compute node server configuration.
  • New configurations and new hardware can be added to the calculator by cloning an existing EMC or HP sheet, renaming it to the new configuration, and entering the specific factors for the hardware being reviewed. The new sheet will then use those factors in its calculations for LUNs, drives, and the number of required storage arrays.
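For a feel of the kind of arithmetic such a calculator performs, here is a deliberately simplified, hypothetical version. Every rating below is a made-up placeholder, not an EMC CX4-240 or HP MSA2000 figure:

```python
# Hypothetical, simplified version of a storage-sizing calculation:
# given a target scan throughput and per-component ratings, derive the
# LUNs, disks, and storage arrays required. All ratings are placeholders.

MBPS_PER_LUN = 100.0    # assumed sequential read rate per LUN
DISKS_PER_LUN = 2       # assumed mirrored pair backing each LUN
LUNS_PER_ARRAY = 8      # assumed LUNs one array enclosure can host

def storage_for(target_mbps):
    """Components needed to sustain target_mbps of sequential reads."""
    luns = -(-int(target_mbps) // int(MBPS_PER_LUN))   # ceiling division
    return {"luns": luns,
            "disks": luns * DISKS_PER_LUN,
            "arrays": -(-luns // LUNS_PER_ARRAY)}

print(storage_for(1600))   # e.g. {'luns': 16, 'disks': 32, 'arrays': 2}
```

The real spreadsheet layers vendor-specific ratings and data feed requirements on top of this, but the shape of the calculation is the same: work back from the throughput the CPUs demand to the storage components that can deliver it.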

Review the Implementing a SQL Server Fast Track Data Warehouse and An Introduction to Fast Track Data Warehouse Architectures links too.

Published Monday, November 2, 2009 1:41 PM by ssqa.net

