
Data Architecture Design Basics

Published Sep 16, 2019

What this study covers

Where does the user authentication data go? What about our regularly read content? What about images and regularly served files?

This study is meant to help a data architect decide where each piece of data in a project should go, based on the strengths and weaknesses of each DB. It is meant to simplify a software architect’s decision process when designing a new data architecture for a project.

What this study does not cover

How much will this cost? Can I find developers that can implement my design? How much data can this design handle? How can I monitor the efficiency of my choice?

This study does not cover the cost, ease of adoption, or size of the data in the chosen databases.

If your project is small, it is best to stick to a single database, as they are all capable of handling up to 10^5 records very efficiently. Once this threshold is exceeded, you should start looking for ways to make the design more efficient.

Monitoring is not addressed in this study as it is a topic of its own and each implementation requires separate considerations. Just keep in mind that as long as you set up proper indices, the DB is rarely the first culprit for efficiency problems.
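To make “proper indices” concrete, here is a minimal MySQL sketch. It assumes the hypothetical User table used in the test cases later in this article, with frequent lookups by email:

-- Assumed schema: a User table that is frequently filtered by email.
-- An index on the lookup column is usually enough to keep the DB off the suspect list.
CREATE INDEX idx_user_email ON User (email);

-- Check that the optimizer actually uses the index for the hot query.
EXPLAIN SELECT * FROM User WHERE email = 'tom@hardy.com';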

Performance tests in existing research

I have based my decisions on the following papers and articles:
2017: Microsoft Azure article “NoSQL vs SQL” (link)

2014: “Performance Evaluation of NoSQL Databases: A Case Study” (link)
John Klein, Ian Gorton, Neil Ernst, Patrick Donohoe (Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, USA)
Kim Pham, Chrisjan Matser (Telemedicine and Advanced Technology Research Center, US Army Medical Research and Material Command, Frederick, MD, USA)

2013: “A performance comparison of SQL and NoSQL databases” (link)
Yishan Li and Sathiamoorthy Manoharan (Department of Computer Science, University of Auckland, New Zealand)

General considerations for data architecture design

  1. At the design phase, measurable and immeasurable metrics impact the choices we make to optimize the efficiency of the services.
    Measurable metrics are:
    a. The calls that each API makes to the DB
    b. The number of complex queries in the APIs
    c. The number of writes and reads that each API makes
    Immeasurable metrics are:
    a. The frequency of each API
    b. Whether the load on the DB will increase gradually or in spikes
    c. The impact of data-locking operations on the efficiency of the system

  2. The immeasurable metrics can only be estimated through monitoring, and even then the estimates remain unreliable because they depend on averaged data that changes daily.

  3. Where is most of the delay going to come from: the DB, the web server, or third-party APIs?
    Third-party API calls will be slower than DB or web-server functions and can only be optimized in the sense that they can be queued and called asynchronously.

  4. Reports slow down the entire system significantly: they are often complicated to extract for analysis and they consume a large share of the system’s resources. Reporting data is almost exclusively read, and it will always require special handling to limit its impact on overall system performance (a sketch of one such approach follows this list).
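As an example of that special handling, the sketch below pre-aggregates reporting data into a dedicated summary table during off-peak hours, so report queries never touch the transactional tables. The DailySignupSummary table and the CreationDate column are hypothetical and not part of the use case later in this article:

-- Hypothetical summary table that reports read from instead of the live tables.
CREATE TABLE DailySignupSummary (
    SummaryDate DATE PRIMARY KEY,
    NewUsers INT NOT NULL
);

-- Run on a schedule (e.g. nightly) to rebuild yesterday's aggregate.
INSERT INTO DailySignupSummary (SummaryDate, NewUsers)
SELECT DATE(CreationDate), COUNT(*)
FROM User
WHERE CreationDate >= CURDATE() - INTERVAL 1 DAY
  AND CreationDate < CURDATE()
GROUP BY DATE(CreationDate)
ON DUPLICATE KEY UPDATE NewUsers = VALUES(NewUsers);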

Luckily, we no longer have to choose just one database to house all of our data. The main types are SQL and NoSQL databases, but there is a great degree of variance among the available brands. Add to that the different deployment options for each DB, such as master-slave setups, replication, sharding, and Hadoop-like services, and you have a plethora of options for deciding which data goes where.
Flavors of SQL: MySQL, Postgres, MariaDB, Oracle, SQL Server
Flavors of NoSQL: Couchbase, CouchDB, MongoDB, Cassandra, Azure DocumentDB and HBase

Scale

[Image: database sizes table]

Strictly speaking, you will have to consider all your DB options very carefully and almost all rules go out the window when it comes to very large data sets. This table is just a starting strategy to set a baseline for your design.
The important part is to monitor performance and be able to compare and test different configurations, so that when efficiency starts to drop you have reasoned alternatives instead of all-out panic! I recommend having no more than 5 basic load tests that measure the latency of your system on its core features under significant stress, such as many users trying to log in at the same time or many users trying to acquire a lock on the same record at the same time.
[Image: testing criteria]

Questions about the measurable metrics

Answer the following questions for your own project. I will answer them in a later section for my own project.

  1. What is the percentage of requests that run complex queries?
  2. What is the percentage of requests that run simple queries?
  3. What is the percentage of requests that lock tables?
  4. What is the percentage of requests that insert/update/delete from a single table?
  5. What is the percentage of requests that insert/update/delete from multiple tables?

DB Scenarios to Test

The following test cases should be run on a test environment that is similar to your expected production environment. The examples are written for MySQL.

The seed data should be as large as your expected number of records for the current architecture.

Lastly, they need to run through your backend code and not just on the DB for more realistic results.
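If you do not have production-sized data to copy, seed data can be generated directly in MySQL 8+ with a recursive CTE. This is only a sketch against the hypothetical User table from the test cases below; adjust the row count to match your expected volume:

-- Allow the CTE to recurse past the default limit of 1,000 iterations.
SET SESSION cte_max_recursion_depth = 1000000;

-- Generate 100,000 throwaway users in one statement.
INSERT INTO User (Name, email)
WITH RECURSIVE seq (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT CONCAT('User ', n), CONCAT('user', n, '@example.com')
FROM seq;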

  1. Query for 1 record in a single table. e.g.:

SELECT * FROM User LIMIT 1;

  2. Insert 1 record into a single table. e.g.:

INSERT INTO User (Name, email) VALUES ('Tom Hardy', 'tom@hardy.com');

  3. Query multiple tables in a single query for 1 record. e.g.:

SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
JOIN Image i ON i.UserID=u.ID
JOIN Block b ON b.PreferenceID=p.ID
LIMIT 1;

  4. Query for 10,000 records. e.g.:

SELECT * FROM User LIMIT 10000;

  5. Query for 10,000 records from 4 or more tables. e.g.:

SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
JOIN Block b ON b.PreferenceID=p.ID
LIMIT 10000;

  6. Query multiple tables in a single query with indexed and unindexed criteria for 1 record. e.g.:

SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
JOIN Image i ON i.UserID=u.ID
JOIN Block b ON b.PreferenceID=p.ID
WHERE u.email='tom@hardy.com'      -- indexed
AND p.InterestID=5                 -- indexed
AND p.CreationDate='2016-10-14'    -- unindexed
AND b.State=2;                     -- unindexed

  7. Insert into multiple tables. e.g.:

INSERT INTO User (Name, email) VALUES ('Tom Hardy', 'tom@hardy.com');
INSERT INTO Preference (Name, UserID, State) VALUES ('Groceries', 12, 1);
INSERT INTO Image (Name, Source) VALUES ('Taurus', 'http://img.google.com/qn104m.jpg');

  8. Query multiple tables for 1 record and insert into other tables. e.g.:

SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
LIMIT 1;
INSERT INTO Preference (Name, UserID, State) VALUES ('Pets', 13, 1);

  9. Select and lock multiple tables for 1 record and insert into one of them before releasing the lock. e.g.:

START TRANSACTION;
SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
WHERE u.ID=9001
FOR UPDATE;
INSERT INTO Preference (Name, UserID, State) VALUES ('Pets', 13, 1);
COMMIT;

Use Case

A project is expected to start with 10^7 records without a data spike upon its launch. After that, the rate of increase is expected to average 220,000 records per day for the first year.

The following is an obfuscated version of this project’s full API list, together with the tables that each API uses, grouped by Select, Update, and Insert statements. The percentage next to each API is its expected frequency of use.

  1. API1 (Used by all APIs) (24%)
    Insert: TableF
  2. API2 (Used by most APIs) (17%)
    Select: TableG, TableH
    Update: TableH
  3. API3 (11%)
    Select: TableL, TableM, TableJ, TableN, TableO, TableP, TableI, TableQ, TableR, TableE, TableS, TableT, TableU, TableV
    Insert: TableW
  4. API4 (9%)
    Update: TableA, TableB, TableC
    Insert: TableD
  5. API5 (5%)
    Select: TableE, TableI, TableK
  6. API6 (4%)
    Insert: TableE
  7. API7 (4%)
    Update: TableG
    Insert: TableG
  8. API8 (4%)
    Select: TableL, TableM, TableJ, TableN, TableO, TableP, TableI, TableQ, TableR, TableE, TableS, TableT, TableU, TableV
  9. API9 (4%)
    Select: TableL, TableO, TableY
    Insert: TableO
  10. API10 (4%)
    Select: TableG, TableX, TableH
    Update: TableG, TableX
    Insert: TableG, TableX, TableH
  11. API11 (2%)
    Select: TableI, TableJ
  12. API12 (2%)
    Select: TableL, TableM, TableJ, TableN, TableO, TableP, TableI, TableQ, TableR, TableE, TableS, TableT, TableU, TableV, TableK
  13. API13 (2%)
    Insert: TableM
  14. API14 (2%)
    Select: TableL, TableO, TableY, TableQ, TableR, TableE
    Insert: TableL
  15. API15 (2%)
    Update: TableG
  16. API16 (2%)
    Update: TableX
  17. API17 (1%)
    Select: TableQ, TableJ, TableI, TableX
  18. API18 (1%)
    Select: TableJ
    Update: TableJ, TableG
    Insert: TableJ
  19. API19 (0%)
    Update: TableG
  20. API20 (0%)
    Update: TableG

Analysis

The number of records in this DB starts at 10^7, with an increase of 220,000 per day for the first year:
10,000,000 + (365 x 220,000) = 90,300,000 = ~10^8 records in 1 year

The expected API frequencies for this project yield 135 writes for every 333 reads.

The following tables are the ones that will be written to. The number next to each is the sum of the frequencies of the APIs that write to it, counting each insert and update operation separately:
TableF: 24%
TableH: 21%
TableG: 19%
TableW: 11%
TableX: 10%
TableA: 9%
TableB: 9%
TableC: 9%
TableD: 9%
TableE: 4%
TableO: 4%
TableJ: 2%
TableL: 2%
TableM: 2%

The following tables are the ones that will be read from. The number next to each is the sum of the frequencies of the APIs that select from it:
TableI: 25%
TableE: 24%
TableL: 23%
TableO: 23%
TableG: 21%
TableH: 21%
TableJ: 20%
TableQ: 20%
TableR: 19%
TableM: 17%
TableN: 17%
TableP: 17%
TableS: 17%
TableT: 17%
TableU: 17%
TableV: 17%
TableK: 7%
TableY: 6%
TableX: 5%

Recommendations

This number is well below 10^9 records, so this is a large-sized DB. Our rate of increase is not expected to spike beyond what an SQL DB with memcache and a NoSQL DB can handle for the first year. So the recommendation is an SQL DB with memcache, plus Couchbase.

We should set up special handling for the two most heavily written tables (TableF at 24% and TableH at 21%), as anything above 20% should be separated from the read tables when possible.

About the author

Abdallah Chamas is a software architect with 11 years of experience as of this writing (April 2017). He is a Scrum Master and a backend developer using Java, PHP, MySQL, Microsoft SQL Server, and Oracle. He has been a core developer of ApstrataDB and several user-oriented technology startups.
