Processing data is like driving a truck. Assume there are no posted speed limits and that our example truck can travel at only 50 miles per hour. If a given destination is 100 miles away and the truck can only reach its maximum of 50 mph, it is going to take two hours to get there. If you need to arrive faster, you might spend money on a bigger, faster truck, or you might split your load across several smaller, faster trucks. Splitting the load lets the fleet deliver more cargo in the same amount of time.
Data processing works the same way. There are only 24 hours in a day, and once processing time exceeds the available processing window, we have a problem. One simple solution is to throw more hardware at it: add more processors and RAM to the server, make the storage faster, and buy more of it. This approach is called scaling the compute platform up.
Eventually the data processing problem grows so large that a single server can no longer accept more processors and RAM. Just like the truck that cannot go any faster, the only option left is to break up the load: add more servers until the problem can be completed in the allotted time window. Many computer science problems can be solved with this scale-out approach. Unfortunately, scaling out a data processing system (also known as distributed processing) will not solve every problem.
One seemingly ideal answer to this performance problem is a shared-disk approach, in which each server has direct access to every memory register and disk block in the processing grid. Unfortunately, this approach can be just as expensive and impractical as scaling up. How does the application keep data consistent across processing nodes? What happens in the event of a failure?
Microsoft SQL Server uses a shared-nothing architecture to ensure data consistency between servers, memory, and disks. Only one server has access to a data file at a time, and SQL Server sends and receives data only through the SQL engine over the network. SQL Server first writes data to a log file and into server memory; on a regular basis, that data is written from server memory to physical disk. This process is known as flushing unwritten (or dirty) data pages from server memory to physical disk, and it is what ensures data consistency. Other servers are prevented from accessing a data file at the same time because they could miss changes held in server memory but not yet written to disk.
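To make the dirty-page idea concrete, here is a minimal T-SQL sketch using the sys.dm_os_buffer_descriptors view and the CHECKPOINT command. The database name Sales is a hypothetical example, not anything from this series.

-- Count pages that have been modified in memory but not yet written to disk,
-- grouped by database (these are the "dirty" pages described above).
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) AS dirty_page_count
FROM sys.dm_os_buffer_descriptors
WHERE is_modified = 1
GROUP BY database_id;

-- Force the current database to flush its dirty pages to disk.
USE Sales;
CHECKPOINT;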
Fortunately, there is a solution to this thorny problem of sharing data between nodes. Microsoft has engineered SQL Server 2005, 2008, and 2008 R2 with a technology called the scalable shared database (SSDB). SSDB allows a read-only copy of a database to be mounted on one or more secondary reporting servers.
This SSDB hybrid approach allows processing to be completed on a single scale-up node while reporting is offloaded to one or more reporting nodes.
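At a high level, the cycle looks roughly like the sketch below: build the database on the processing server, mark it read-only, detach it, present the volume read-only to the reporting servers, and attach it there. The database name Sales and the file paths are hypothetical examples, and the full procedure (including refreshing the data) is covered in Part 2.

-- On the production (processing) server, once the data load is finished:
ALTER DATABASE Sales SET READ_ONLY;
EXEC sp_detach_db @dbname = N'Sales';

-- On each reporting server, after the shared volume has been mounted read-only:
CREATE DATABASE Sales
    ON (FILENAME = N'S:\Data\Sales.mdf'),
       (FILENAME = N'S:\Data\Sales_log.ldf')
    FOR ATTACH;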
The following posts will cover the implementation of the Scalable Shared Database in detail:
Part 1: Scalable Shared Database overview (what you are reading now).
Part 2: Setting up the Scalable Shared Database (next week).
Part 3: Managing Clones
Part 4: Configuring a load balancer
Part 5: Performance Considerations
For more information on the Scalable Shared Database, check out:
MSDN: Deploying a Scalable Shared Database (SQL Server 2008 R2): http://msdn.microsoft.com/en-us/library/ms345584.aspx
SQLCAT: Scale-Out Querying for Analysis Services with Read-Only Databases http://sqlcat.com/whitepapers/archive/2010/06/08/scale-out-querying-for-analysis-services-with-read-only-databases.aspx