dbShards
dbShards Live!


BootCamp Registration


Useful Links


Previous Posts

Archives

Categories

Syndication
Contact Us

And get a unique database solution that grows when you grow and never costs more than it should.

Featured product

 

Use our no-charge dbShards/Analyze driver to identify critical performance issues in your database tier. A dbShards consultant will then coordinate with you to identify key hotspots, helping you to develop a plan for optimizing your database.

Black Box vs. Application-Aware Sharding

Monday, September 20, 2010

There are many techniques for partitioning a database, all under the heading of “sharding.” But regardless of which technique you are referring to, the purpose is basically the same – spread your data out in a manner that allows parallel processing across multiple servers, allowing your database to scale with load. Done correctly, the results can be pretty amazing, frankly even surprising some of us on the dbShards team. In short, sharding really works.

Lately we have seen a lot of talk about the black box sharding capabilities offered by some products. The way this works is to have the data sharded mechanically by the product, so that the developer or user of the database doesn’t need to pay attention to it. In essence, the sharding product that “auto-magically” partitions your data.

When you drill down further, you discover that such techniques distribute data at the row level (for relational databases), or at the object level (for NoSQL databases). Each row request is then satisfied by the particular data store that contains it, retrieving the row (or rows) via a network. Most often there is a middle-tier server doing this work, so that your typical database client (e.g., the MySQL driver) can just connect to the middle-tier, and the whole thing looks just like a single monolithic database.

This concept understandably has lot of appeal, as who wouldn’t want a “plug and play” setup that does “auto-magic” scaling for your database?

The answer is that there is a cost to this type of approach. (In fact we investigated it early on for dbShards and concluded it was not the best long-term way to address sharding of a relational database.) Because rows for all tables are sharded mechanically, that means that any row can end up in any shard. The network access for each row adds overhead, even if queries are somehow optimized so that groups of rows are grabbed from a given partition with a single call. For single row reads/writes (just like the NoSQL databases) this is not much overhead, but when it comes to sets of data, the cost can be expensive in terms of application performance. Think of joins, aggregates, sorts – all of the things you do frequently in an application, and imagine that instead of a local disk or memory read (as is the case with MyISAM or InnoDB) you have to read across the network access instead. This is especially burdensome in cloud environments where you don’t have control of shared network I/O, which is generally slower than dedicated environments to begin with.

It follows then that when you add more partitions (dozens or 100s as these products claim) we believe the degradation for multi-row sets will be dramatic. This means that as your database gets bigger, you could see a “hockey stick” degradation curve, something that is hard to predict and the last thing you want to encounter mid-stream in a production application.

With dbShards we use a technique called Application-Aware Sharding, which really means that you as the developer decide up front how to shard your database, based on your application’s specific requirements.

The primary rationale for black box partitioning approaches is that “application-based sharding requires a lot of sophistication” to implement and operate. It’s true that doing this totally on your own can be challenging, but honestly the concepts of how to go about it are extremely simple and easy to grasp.

Here is why.

The truth is that in any given application, only a few tables grow to a large enough size with high transaction volumes to even justify sharding. From our experience this is usually 5% – 10% of the total number of tables, meaning that you really only need to shard 5 or 10 tables in an application that has 100 tables in its schema. Identifying them is trivial, just look at your largest, most active tables and the list will pop out at you. These tables naturally form a “shard tree” with a single parent table (e.g., a “user” table), and there you have all of the basics for your sharding strategy. (Sometimes we find applications with 2 or 3 shard trees, but still the number of tables is small).

By definition, the shard tree is made up of related tables (after all, that’s what a relational database is all about). With Application-Specific Sharding, all related data is located in the same shard (e.g., “order” data for a given user), and with optimal shard sizing, this technique uses the power of proven database engines to achieve incredibly fast read/write access. Just remember how fast your database was when your application first started out, and multiply that times a number of shards, and you immediately see the potential. Joins, multi-row result sets, aggregates perform extremely well because your most frequent accesses stay within a single shard. There is no middle-tier, with nothing between your application and your database. Sharding decisions are made by the database driver itself, with application requests going directly to the database engine (just like they always have). Less frequent “Go Fish” queries are performed in parallel across multiple shards when needed, again totally seamless to your application.

Herein lies the biggest difference between this technique and the black box sharding described above. With mechanical sharding techniques, you give up control of the partitioning logic, and all of your data is distributed (not just the tables you really need to shard). Because related data can be anywhere, over time and with more nodes in the data layer, the results can be very unpredictable, especially if you need multi-table joins, aggregates or other complex query support.

Once you have identified your sharding strategy, a product like dbShards makes the rest very easy to do. A simple dump from your current database can be loaded into the sharded environment, and other features such as Global Tables, reliable replication, continuous operation, re-sharding, and parallel “Go Fish” queries, make the whole environment virtually transparent to your application. The point is, however, that Application-Aware Sharding does take some up front planning and analysis to make it work (an hour or less if you know your app), and sometimes minimal application changes are required if you want to achieve the best performance possible. This is the case with any sensible performance tuning effort, so regardless or your direction you should plan for it.

Our tools guide you through the process, and based on the results our customers are experiencing, it’s well worth the effort. We have real production applications writing billions of rows, with incredibly fast read rates. For example, we have seen 650,000 row reads/second in cloud-based applications using dbShards in a MySQL environment with just a small number of shards.

Another nice benefit of Application-Aware Sharding is that you are in total control of your database performance, with simple tools and “hints” to further tune data access behavior when necessary, just as you would with any other database environment.

The result is predictable, linear (or better) scaling for your application for its entire lifespan.

In summary, if your application needs to scale, spend the time to drill down to the details, investigating all the alternatives and make sure that you choose the best solution for your particular needs.

Posted in News & Commentary by Cory Isaacson - 2 Comments »