Database Sharding
The Problem: Economic Database Scalability
It is not a secret that enterprise databases are growing in size as more data is collected. While many factors contribute to the problem, the primary factors are: a) the amount of information required to run critical systems only seems to grow over time; and b) the number of users and transaction volumes rise in proportion to the number of "on-line", Web-based and Software as a Service (SaaS) applications.
As databases grow linearly in size, response times often degrade much faster than linearly. This puts a tremendous burden on overall application performance, with the database often becoming the key bottleneck in mission-critical systems.
Further, as databases grow they become significantly more costly and difficult to manage, and their reliability schemes take far longer to "recover."
Many techniques have been tried to overcome these issues, often at great expense and complexity:
- "Buy a Bigger Box" – The most common initial approach to a database performance problem is to purchase a new database server, adding more and faster CPUs and more memory. This vertical scalability approach is often extremely costly, both in terms of the hardware itself plus the additional database CPU license fees. Often this solution provides only a short-term fix, resulting in only marginal performance improvements.
- Database Clustering – This technique is valuable – up to a point. Database clusters rely on shared resources that ultimately introduce new bottlenecks. This approach is also extremely costly in terms of high-end hardware requirements and commercial database license fees.
- Proprietary Data Warehousing Appliances – In recent years, many new Data Warehouse Appliances have been introduced. These solutions are both proprietary and expensive, and while they can assist with typical "read-only" analytics applications, they cannot support overloaded mission critical OLTP applications.
In summary, these techniques are complex, frequently demand significant spending and application development changes, and often fall short of expectations.
The Solution: Database Sharding
"Database Sharding" is a new concept in horizontally partitioning databases, not just across disks, but across servers. Many of the largest online service providers, most notably Google, have developed in-house implementations of this architecture to meet application scalability requirements.
In essence, Database Sharding allows databases to be split across servers using application-specific logic. By understanding the dynamics of a specific application, large, transaction-intensive databases can be divided across independent database servers in a way that is compatible with the requirements of the application itself.
When a database is "sharded", it is essentially broken down into multiple smaller databases across multiple (typically low-cost commodity) servers, each yielding far greater performance through this highly distributed model. In addition to delivering better performance, individual "shards" (or database "chunks") can be managed much more easily, including replication, fault tolerance, and cross-data center reliability.
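The idea of splitting data across servers with application-specific logic can be illustrated with a minimal sketch. It is not taken from any particular sharding product: the `orders` table, the choice of `customer_id` as the shard key, and the helper names `shard_for`, `insert_order`, and `orders_for` are all hypothetical, and in-memory SQLite databases stand in for independent database servers.

```python
import sqlite3
import zlib

# Assumption: four in-memory SQLite databases stand in for four
# independent, low-cost database servers (the "shards").
SHARD_COUNT = 4
shards = [sqlite3.connect(":memory:") for _ in range(SHARD_COUNT)]
for conn in shards:
    conn.execute("CREATE TABLE orders (customer_id TEXT, item TEXT)")


def shard_for(customer_id: str) -> int:
    """Map a customer ID (the hypothetical shard key) to a shard index."""
    # zlib.crc32 is stable across processes, unlike Python's built-in
    # hash() for strings, so the same key always lands on the same shard.
    return zlib.crc32(customer_id.encode("utf-8")) % SHARD_COUNT


def insert_order(customer_id: str, item: str) -> None:
    """Write an order to the single shard that owns this customer."""
    conn = shards[shard_for(customer_id)]
    conn.execute("INSERT INTO orders VALUES (?, ?)", (customer_id, item))
    conn.commit()


def orders_for(customer_id: str) -> list[str]:
    """Read a customer's orders from the one shard that holds them."""
    conn = shards[shard_for(customer_id)]
    rows = conn.execute(
        "SELECT item FROM orders WHERE customer_id = ? ORDER BY rowid",
        (customer_id,),
    )
    return [row[0] for row in rows]
```

Calling `insert_order("alice", "book")` and then `orders_for("alice")` touches only one of the four databases, which is why each shard stays small. Simple modulo hashing is used here only for brevity; it forces data to move when the shard count changes, and real deployments often choose a different mapping.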
As with all valid scalability and performance techniques, Database Sharding is not appropriate for every type of database application. However, a large segment of applications can work extremely well in this environment, most often those that are transaction-intensive and large in both transaction volume and data size. Some common target application categories include:
- SaaS applications
- Billing systems
- Batch processing
- Stock trading
- Travel and other reservation systems
- Data Warehousing
Although Database Sharding is gaining popularity, there has until now been a lack of implementation tools. Without such tools, it is often difficult to shard existing applications without extensive re-architecture and re-development. For many companies, re-development is completely infeasible, given the need to continue business operations.
The Origins of Database Sharding
There seems to be general agreement that the term "sharding" and its derivative "Database Sharding" were introduced by Google (there are examples of the term being used before Google adopted it, but not in the same context). It has been reported that the sharding terminology is in widespread and general use within Google's engineering teams.
Benefits of Database Sharding
There are many potential benefits of Database Sharding, some of which are application-specific and do not apply to all use cases. The most immediate benefit and the primary reason for Database Sharding is to provide horizontal scalability for applications and databases.
