Tuesday, December 21, 2010
Object Relational Mapping frameworks such as Hibernate (Java), PDO (PHP) and Django (Python) take care of persisting objects without requiring the developer to hand-code SQL for most tasks.
Primary Shard Tables
In most cases these frameworks generate SQL that works with dbShards. Taking a simple example of a top-level primary shard table ‘customer’ table that is sharded by a ‘customer_id’ column, which is also the primary key, these frameworks would generate SQL statements similar to the following examples:
INSERT INTO customer (customer_id, customer_name) VALUES (?, ?)
UPDATE customer SET customer_name = ? WHERE customer_id = ?
DELETE customer WHERE customer_id = ?
SELECT * FROM customer WHERE customer_id = ?
In each case, the SQL contains the ‘customer_id’ shard key and dbShards will be able to determine which shard to execute the SQL against.
Child and Grandchild Tables
For a table that is a direct child of ‘customer’, such as ‘customer_order’, there will already be a ‘customer_id’ column to support the foreign key relationship to the parent table and this column can also be used as the shard key. However, in this case the shard key is different from the primary key (the ‘customer_order’ table would most likely have its own auto increment ‘order_id’ column which would be used as the primary key).
For a table that is a direct child of ‘customer_order’ and therefore the grandchild of ‘customer’, such as ‘customer_order_item’, there would typically not be a ‘customer_id’ column in a normalized relational database design and there would just be a ‘customer_order_id’ column to support the foreign key relationship to the ‘customer_order’ table. To support sharding without adding undue complexity, it makes sense to denormalize the schema and add the ‘customer_id’ column to this table so that we have a natural shard key. It is desirable for performance reasons to keep all data for a single customer (including orders and order items) in a single shard.
ORM Mappings for Child Tables
When a table has a shard key that is different from the primary key, as is the case in the ‘customer_order’ and ‘customer_order_item’ child tables, then the ORM configuration must be updated to force the ORM to include the shard key in WHERE clauses, otherwise the WHERE clause will only contain the primary key and there will not be sufficient information for dbShards to determine the target shard.
Example: UPDATE on child table without shard key
UPDATE customer_order SET quantity = ? WHERE order_id = ?
This UPDATE will fail since there is no shard key (although, if the ORM allows comments to be inserted into the SQL then the dbShards Shard Hints feature can be used to specify the shard key).
The solution to this issue is actually very simple. The ORM configuration can be updated to include the shard key in the primary key definition (note that no schema changes are required since we are just tricking the ORM into generating the correct SQL). With the ORM configured to use ‘order_id’ and ‘customer_id’ as the primary key for the ‘customer_order’ table, the following SQL will now be generated:
UPDATE customer_order SET quantity = ? WHERE order_id = ? AND customer_id = ?
Each query now contains the ‘customer_id’ shard key and dbShards can once again automatically determine which shard or shards to execute the query against without requiring the use of Shard Hints.
Hibernate/JPA Code Example
Here is the original CustomerOrder class using JPA annotations:
@Entity
public class CustomerOrder implements Serializable {
@Id
private int orderId;
private int quantity;
private long totalPrice;
}
Here is the modified CustomerOrder class to support sharding by ‘customer_id’:
@Entity
public class CustomerOrder implements Serializable {
@EmbeddedId
private CustomerOrderPK pk;
private int quantity;
private long totalPrice;
}
@Embeddable
public class CustomerOrderPK implements Serializable {
@Id
private int orderId;
private int customerId;
}
Posted in Articles, Tutorial by Andy Grove - No Comments »

