The Sharded database structure

Making use of a good old technique in case of a DB meltdown

Posted by creatic on Thu 11 July 2019

When the DB grows

You have a database that you have been extending since the start of the project. Its schema is glued to the business logic, and the structure cannot be changed easily because of the amount of code that depends on it on the software side. It can be run by a single DB server or - if you want to speed up read operations - you can add slave server(s), and for a long time everything works just fine.

Sooner or later, however, either the table sizes or (depending on the schema and the correlation between the tables) the sizes of the views grow to the point where even separate read-only slaves cannot serve the data in the required timeframe - for real-time searches that means under 1-2 seconds. So what can you do? Refactor the database? Replace the SQL DB with a NoSQL one? Try out parallel SQL server solutions? Maybe. But in all these cases you have to refactor your code too, including your tests and possibly some of the program logic.

The Sharded solution

Sharding the data in this case seems to me the most cost-effective and least obtrusive solution. But what is sharding? Imagine you have 100K+ article records, 10K+ users that can each access a subset of the articles (so the user table correlates somehow with the article table), and dozens of tables belonging to the articles: features, the shops where these articles are available, various price information, etc. And you want to run a real-time search for a particular user.

If you run the search in the usual way, you have a correlation over a view of 100K x 10K x n records (n is a relatively small number here, roughly 10 or 20, depending on how you store the article data). The main problem is the correlation between tables which are already large by themselves. So what to do? Let's cut the correlation at the most meaningful point and simplify our view to something we can easily manage with a simple DB setup.
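Just to make this concrete, here is a tiny, hypothetical sketch of what I mean by the naive search. The table and column names are made up for this post, and SQLite stands in for whatever DB you actually run - the point is only the shape of the query: one big correlation over all the tables at once.

    # Hypothetical schema and the naive multi-join search.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users       (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE articles    (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE user_access (user_id INTEGER, article_id INTEGER);  -- who may see what
        CREATE TABLE features    (article_id INTEGER, name TEXT, value TEXT);
        CREATE TABLE prices      (article_id INTEGER, shop TEXT, price REAL);
    """)

    # The naive search: every table joined into one big view, so its size
    # grows with articles x users x the number of extra tables.
    NAIVE_SEARCH = """
        SELECT a.id, a.title, f.name, f.value, p.shop, p.price
        FROM users u
        JOIN user_access ua ON ua.user_id = u.id
        JOIN articles a     ON a.id = ua.article_id
        JOIN features f     ON f.article_id = a.id
        JOIN prices p       ON p.article_id = a.id
        WHERE u.id = ? AND f.value LIKE ?
    """
    rows = conn.execute(NAIVE_SEARCH, (42, "%red%")).fetchall()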

Measurements and the proper dimension

The key point is finding out what amount of data (how many records in a view) is feasible with your DB setup. That really depends on the schema, the underlying hardware or virtual server plan, etc., but with some tests you can find out. You also have to account for potential growth, so a 30-50% headroom can be a good idea. Say you find out that you can handle 400M records and you currently have a 4000M-record DB view. That means you have to slice your original view into at least 10 separate views (better 20 - think about the growth).
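The slicing math itself is trivial, but it is worth writing down, because the growth headroom is exactly what turns the minimal 10 slices into the more comfortable 20. A quick sketch with the numbers from above:

    # How many slices do we need, leaving some of each slice free for growth?
    import math

    def shard_count(current_rows, feasible_rows, growth_headroom=0.5):
        # feasible_rows: what one view can still serve fast enough (here: 400M)
        # growth_headroom: fraction of each slice kept free for future growth
        usable = feasible_rows * (1 - growth_headroom)
        return math.ceil(current_rows / usable)

    print(shard_count(4_000_000_000, 400_000_000, growth_headroom=0.0))  # 10 slices, no headroom
    print(shard_count(4_000_000_000, 400_000_000, growth_headroom=0.5))  # 20 slices with 50% headroom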

At this point you have to find the proper dimension along which you want to make this cut.

For example, if you cut along the article features (price, optional accessories, color, etc.), that would mean that when you load a particular article you have to make separate queries to read the separate article properties. This may be what you want, maybe not. Or we could cut the view by geographical area: which article is available in which shop, etc. In this case I would argue for slicing the view at the users. In most cases, when we read from this view we do it with one user id, so we can fix that one value and load all the other information at once.
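To make the trade-off a bit more tangible, here is roughly how the two cuts would look on the read path. This continues the made-up schema from the sketch above; user_slice is another hypothetical name for a per-user slice of the original view:

    # Illustrative only: what each cut dimension means for the read path.

    def load_article_sliced_by_feature(conn, article_id):
        # Cut at the article features: each property group lives in its own
        # slice, so loading one article needs several separate queries.
        base   = conn.execute("SELECT * FROM articles WHERE id = ?", (article_id,)).fetchone()
        feats  = conn.execute("SELECT * FROM features WHERE article_id = ?", (article_id,)).fetchall()
        prices = conn.execute("SELECT * FROM prices WHERE article_id = ?", (article_id,)).fetchall()
        return base, feats, prices

    def search_sliced_by_user(conn, user_id, pattern):
        # Cut at the users: every read already carries exactly one user id, so
        # we fix it up front and read everything else from that user's small
        # slice in a single query.
        return conn.execute(
            "SELECT article_id, title, shop, price FROM user_slice"
            " WHERE user_id = ? AND title LIKE ?",
            (user_id, pattern),
        ).fetchall()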

Storing the sharded data

So - in theory - we cut our table of 10K+ users into 10 smaller ones. At this point we have at least two possibilities. The first is to run 10 slave databases with identical data - except for the user table. Let's call this server-level sharding. This is the easier one (apart from the fact that we have to manage 10 extra DBs). The data is replicated in real time automatically; we only have to take care of connecting to the proper DB for a given user id.
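On the application side the only new moving part is the routing itself. Here is a minimal sketch, assuming a simple modulo rule over the user id - the only real requirement is that the same user id always maps to the same slave; whether you use modulo, ranges or a lookup table is up to you. The DSNs and the sqlite3 driver are placeholders for your real setup:

    # Server-level sharding: 10 slave DBs with identical article data,
    # each holding one tenth of the users. The DSNs below are made up.
    import sqlite3  # stand-in for your real DB driver

    SHARD_DSNS = [f"shard_{i}.db" for i in range(10)]   # one DSN per slave DB

    def shard_for_user(user_id, shard_count=10):
        # Deterministic routing: the same user id always lands on the same slave.
        return user_id % shard_count

    def connection_for_user(user_id):
        return sqlite3.connect(SHARD_DSNS[shard_for_user(user_id)])

    # Every read picks its connection by user id; the query itself is
    # unchanged, it just runs against a DB whose user table is 10x smaller.
    conn = connection_for_user(4242)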

The second option is a little bit more complex. At least part of our sliced views should be pre-calculated, and we use one database that holds all of these sliced views. Let's call this virtual sharding. How you do this pre-calculation depends on the business logic. For example, if you find out that the user address and the article availability correlate, then building a view that holds this information together (e.g. in a "switch table") and links it to the rest of the original view simplifies and speeds up the read process dramatically. And these sliced views, together with the "switch table", can be stored in one database. The dark side of this second solution is that you have to update your switch and sliced views regularly - they are not part of the original schema but are calculated from it, so as the data changes, the calculated views have to change with it, too. This of course means the search results can lag slightly behind the live data.
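Again a minimal sketch of how such a pre-calculation could look - one database, one made-up "switch table" linking users to the articles available in their region, and a refresh job that rebuilds it periodically. All the table names are illustrative, not from a real schema:

    # Virtual sharding: one database plus a pre-calculated "switch table".
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users               (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE articles            (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE shop_stock          (article_id INTEGER, shop_id INTEGER, region TEXT);
        CREATE TABLE user_article_switch (user_id INTEGER, article_id INTEGER, shop_id INTEGER);
    """)

    def refresh_switch_table(conn):
        # Rebuild the calculated view, e.g. from a cron job or after bulk
        # updates. Until it runs again, searches see slightly stale data -
        # that is the lag mentioned above.
        conn.executescript("""
            DELETE FROM user_article_switch;
            INSERT INTO user_article_switch (user_id, article_id, shop_id)
            SELECT u.id, s.article_id, s.shop_id
            FROM users u
            JOIN shop_stock s ON s.region = u.region;   -- availability near the user
        """)
        conn.commit()

    def search_for_user(conn, user_id):
        # The read path touches only the small switch table and the articles
        # table, instead of the full original correlation.
        return conn.execute("""
            SELECT a.id, a.title, s.shop_id
            FROM user_article_switch s
            JOIN articles a ON a.id = s.article_id
            WHERE s.user_id = ?
        """, (user_id,)).fetchall()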

Conclusion

The growth of business data can lead to a slowing database. Sharding the database can solve this issue, but it also adds some complexity to the setup. Here I described two distinct approaches to sharding: server-level and virtual sharding. Both have advantages and disadvantages, so which one is the better solution really depends on the actual schema and logic. But in the end we can handle a situation that would otherwise easily run out of control.