Design patterns for SQL databases
If you don't know the name of something, you will end up reinventing it and feeling very smart while doing so. One's pride is the comedy of others.
So, let's be humble and be aware of cool techniques with cool names that everyone working with SQL databases ends up needing at some point in a project.
Some names you have already heard, back in school. Some techniques you already use but didn't know the name of. Maybe there is something new here, but that is precisely why this article exists.
So, let's get started with the easy and popular ones first.
Normalization
This is what you learn back in school. The Edgar F. Codd normal forms are the basic technique for database modelling.
In short:
- First normal form is all about getting rid of repeating groups and duplicated data.
- Second normal form is about proper primary key definition.
- Third normal form is about making information that is expressed in terms of other information also rely on keys, rather than on non-key columns.
Higher normal forms are about how atomic, consistent data relates to other data in join operations.
The important goal of normalization is to get rid of inconsistencies and duplicated information.
For example, a careless way to model "Alice and Bob are married and live at Elm Street, 123" is:
┌──────────┐
│ Marriage │ ┌────────┐
├──────────┤ │ Person │
│ person1 │ ├────────┤
│ person2 │ │ name │
└──────────┘ └────────┘
┌─────────┐
│ Address │
├─────────┤
│ person1 │
│ person2 │
│ street │
└─────────┘
The schema above is bad in many ways:
- There is not a single primary key; two "Bobs" will make figuring out who is married to "Alice" an impossible task.
- Sharing the same address among more than two people is also troublesome.
- In some cultures, it's possible to marry more than one person.
Now, consider this one:
┌──────────┐
│ Marriage │ ┌───────────────┐
├──────────┤ │ Person │
│ id │ ├───────────────┤
└──────────┘ │ id │
│ name │
│ marriage_id │
│ address_id │
└───────────────┘
┌─────────┐
│ Address │
├─────────┤
│ id │
│ street │
└─────────┘
- Now people can have the same name and still be different persons.
- Marriages with more than two people are now possible, but this introduces the chance of a lone married person appearing, as well as marriages with no people at all.
- Addresses can now house as many people as needed.
Lots of new information can now be extracted with join operations.
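For instance, a query along these lines (a sketch against the normalized schema above; exact syntax may vary by dialect) answers "who is married to whom, and where do they live?" in one go:
select p1.name as spouse1,
       p2.name as spouse2,
       a.street
  from Person p1
  join Person p2
    on p1.marriage_id = p2.marriage_id
   and p1.id < p2.id -- avoid listing the same couple twice
  join Address a
    on p1.address_id = a.id;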
Denormalization
The second database design pattern is the opposite of the first one.
But how did it come to be considered a design pattern?
One important thing to consider when designing a database is performance. Data consistency is paramount, which is why the first step is always to design a consistent schema; but once the database is put to real work, bottlenecks appear and you start to add small, controlled inconsistencies to save yourself from lots of extra lookups.
In short, identify the critical points in your running database that demand extra operations to produce information in high demand, and then denormalize them.
Take this voting system scenario:
┌──────────────┐
│ Vote │ ┌─────────────┐
├──────────────┤ │ Candidate │
│ id │ ├─────────────┤
│ candidate_id │ │ id │
└──────────────┘ │ election_id │
┌──────────┐ └─────────────┘
│ Election │
├──────────┤
│ id │
│ year │
└──────────┘
This schema is pristine. All by the book. However:
- If I ask "who won the 2022 election?", I will always need to check all candidates of that election and, for each one, count the votes (see the sketch right after this list).
- Election results are not time-resilient: one flip in a past vote can change past results. "It should never happen" are famous last words written on many tombstones all over the world.
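To make the first point concrete, here is a sketch of what answering "who won the 2022 election?" looks like against the normalized schema (the limit clause is dialect-dependent):
select c.id as candidate_id,
       count(v.id) as total_votes
  from Candidate c
  join Election e on c.election_id = e.id
  left join Vote v on v.candidate_id = c.id
 where e.year = 2022
 group by c.id
 order by total_votes desc
 limit 1;
Every single call scans and counts every vote of that election, no matter how many times the answer has already been computed.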
To tackle such issues, the schema can be modified to something like this:
┌──────────────┐
│ Vote │ ┌─────────────┐
├──────────────┤ │ Candidate │
│ id │ ├─────────────┤
│ candidate_id │ │ id │
└──────────────┘ │ total_votes │
│ election_id │
┌───────────┐ └─────────────┘
│ Election │
├───────────┤
│ id │
│ year │
│ winner_id │
└───────────┘
The changes are simple: they just materialize derived data as first-class data in the database schema. In return, they potentially save us from performing expensive operations to deliver highly demanded information.
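The price is that the derived columns now have to be kept in sync by the application (or by triggers). A minimal sketch, assuming votes are counted as they arrive and the winner is frozen once the election closes (the ids 42 and 3 are illustrative, and limit is dialect-dependent):
begin;
insert into Vote (candidate_id) values (42);
-- keep the denormalized counter in sync with the new vote
update Candidate set total_votes = total_votes + 1 where id = 42;
commit;

-- once the election closes, freeze the winner as well
update Election
   set winner_id = (select id
                      from Candidate
                     where election_id = 3
                     order by total_votes desc
                     limit 1)
 where id = 3;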
Historical columns
The third design I present is one you will find when searching for the "past value" issue. Whenever you need to keep track of the changes a certain column suffers over time, this is the pattern to apply.
Take this simple example:
┌─────────────┐
│ Product │
├─────────────┤
│ id │ ┌────────────┐
│ description │ │ Sale │
│ price │ ├────────────┤
└─────────────┘ │ id │
│ product_id │
└────────────┘
This schema is barely enough to describe a product and its price.
However, what happens if we decide to change the product price? We update it, yes, but by doing so we not only lose information, we also corrupt existing sales entries.
To avoid this, add a third table to track prices:
┌──────────────────┐
│ Product │
├──────────────────┤ ┌──────────────┐
│ id │ ┌────────────┐ │ Sale │
│ description │ │ PriceTag │ ├──────────────┤
│ current_price_id │ ├────────────┤ │ id │
└──────────────────┘ │ id │ │ price_tag_id │
│ product_id │ └──────────────┘
│ start_date │
│ price │
└────────────┘
With this arrangement, it's now possible to update a product's price and still keep past sales consistent.
In the example, there is also a key on the product pointing to the current price tag. It is there just to avoid funky operations to discover the latest price tag available for a product. But this is a denormalization, so a careless operation can introduce inconsistencies into the data.
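Changing a price then becomes two steps, ideally inside one transaction: append a new PriceTag and repoint the product's current_price_id. A sketch, with a hypothetical product id 7 (limit is dialect-dependent):
begin;
-- record the new price as a new row instead of overwriting the old one
insert into PriceTag (product_id, start_date, price)
values (7, current_date, 19.90);
-- repoint the denormalized shortcut to the newest price tag
update Product
   set current_price_id = (select id
                             from PriceTag
                            where product_id = 7
                            order by start_date desc
                            limit 1)
 where id = 7;
commit;
Existing Sale rows keep pointing at their original price tags, so past earnings stay intact.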
Append-only ledger
Another design pattern focused on keeping track of changes over time, this one differentiates itself from historical columns by registering not only when a change happened, but also the nature of the change.
Additionally, as the name suggests, the append-only ledger forbids updates on the table storing the transactions.
Take this simple banking schema as example:
┌─────────┐
│ Account │
├─────────┤
│ id      │
│ balance │
└─────────┘
Common operations, like deposits, withdrawals, and transfers, can be translated into queries like these:
-- a transfer of 100 from account 2 to account 1
update Account set balance = balance + 100 where id = 1;
update Account set balance = balance - 100 where id = 2;
-- deposit
update Account set balance = balance + 200 where id = 2;
-- withdraw
update Account set balance = balance - 50 where id = 2;
However, there is a huge loss of information here, since those distinct banking operations are reduced to bare updates and then vanish for good.
Creating a ledger allows the proper representation of those events:
┌─────────┐
│ Account │
├─────────┤ ┌────────────────┐
│ id │ │ AccountLedger │
│ balance │ ├────────────────┤
└─────────┘ │ id │
│ account_id │
│ operation_type │
│ amount │
│ date │
└────────────────┘
Now the operations can be stored like this:
-- a transfer of 100 from account 2 to account 1
insert into AccountLedger (account_id,operation_type,amount) values (1,'deposit',100);
update Account set balance = balance + 100 where id = 1;
insert into AccountLedger (account_id,operation_type,amount) values (2,'withdraw',-100);
update Account set balance = balance - 100 where id = 2;
-- deposit
insert into AccountLedger (account_id,operation_type,amount) values (2,'deposit',200);
update Account set balance = balance + 200 where id = 2;
-- withdraw
insert into AccountLedger (account_id,operation_type,amount) values (2,'withdraw',-50);
update Account set balance = balance - 50 where id = 2;
The account balance becomes a derived property, based on all ledger operations involving that account.
Another side effect is the need for transactions, because the insert and the update must complete together or fail together.
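A sketch of the withdrawal above wrapped in a transaction, followed by the query that rebuilds balances from the ledger (a useful consistency check):
begin;
insert into AccountLedger (account_id, operation_type, amount) values (2, 'withdraw', -50);
update Account set balance = balance - 50 where id = 2;
commit;

-- every balance is just the sum of everything that ever happened to the account
select account_id, sum(amount) as computed_balance
  from AccountLedger
 group by account_id;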
How does it differ from historical columns?
While ledgers focus on the events acting on a single column, historical columns record the state of that column over time.
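As for the "append-only" part of the name, it can be enforced at the database level; a minimal sketch using standard SQL privileges (the role name app_user is hypothetical):
-- the application role may only ever add new entries to the ledger
revoke update, delete on AccountLedger from app_user;
grant select, insert on AccountLedger to app_user;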
Event sourcing
Now it's time to talk about event sourcing.
This pattern became widely popular because of all the microservice mania.
It is especially useful for distributed transactions, since event sourcing creates a single source of truth for several otherwise distinct tables.
Let's look at the following schema:
┌─────────────┐
│ Order │
├─────────────┤ ┌────────────┐
│ id │ │ OrderItem │
│ customer_id │ ├────────────┤
│ status │ │ id │
└─────────────┘ │ order_id │
│ product_id │
│ amount │
└────────────┘
The order journey could be like this:
insert into "Order"(customer_id,"status") values (123,'open');
-- returned id 11
insert into OrderItem(order_id,product_id,amount) values (11,121,10);
insert into OrderItem(order_id,product_id,amount) values (11,122,1);
insert into OrderItem(order_id,product_id,amount) values (11,123,200.4);
-- order was assembled in warehouse and shipped
update "Order" set "status" = 'shipped' where id = 11;
Again, we witness the loss of information regarding the events needed to build all those states.
Let's modify the schema:
┌─────────────┐
│ Order │
├─────────────┤ ┌────────────┐
│ id │ │ OrderItem │
│ customer_id │ ├────────────┤
│ status │ │ id │
└─────────────┘ │ order_id │
│ product_id │
│ amount │
└────────────┘
┌────────────┐
│ OrderEvent │
├────────────┤
│ id │
│ date │
│ type │
│ payload │
└────────────┘
Now, the order journey can be represented by those operations:
insert into "Order"(customer_id,"status") values (123,'open');
-- returned id 11
insert into OrderEvent("type",payload) values ('create_order','id:11;customer_id:123;status:open');
insert into OrderItem(order_id,product_id,amount) values (11,121,10);
-- returned id 1111
insert into OrderEvent("type",payload) values ('add_order_item','id:1111;order_id:11;product_id:121;amount:10');
insert into OrderItem(order_id,product_id,amount) values (11,122,1);
-- returned id 1112
insert into OrderEvent("type",payload) values ('add_order_item','id:1112;order_id:11;product_id:122;amount:1');
insert into OrderItem(order_id,product_id,amount) values (11,123,200.4);
-- returned id 1115
insert into OrderEvent("type",payload) values ('add_order_item','id:1115;order_id:11;product_id:123;amount:200.4');
-- order was assembled in warehouse and shipped
update "Order" set "status" = 'shipped' where id = 11;
insert into OrderEvent("type",payload) values ('ship_order','id:11;status:shipped');
Again, a transaction will be needed when sourcing the events.
Note that the payload may cause loss of information if some column does not participate in the set of values, if columns cease to exist, if new columns appear and get rules that do not apply to older versions of the schema, or if for some reason an event is deleted. But this is a different concern: schema evolution.
Simpler databases don't have the maturity or the age for this to be a problem.
Event sourcing differs from the append-only ledger because, while ledgers track one single column, events encode entire journeys, enabling us to reproduce the current system state by simply replaying them.
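Replaying, in its simplest form, is just reading the events back in the order they happened and applying each one; decoding the payload is left to the application:
-- walk every event in order to rebuild the state of the system
select id, "date", "type", payload
  from OrderEvent
 order by "date", id;
Filtering by a single order is already awkward here, since the order id lives inside the payload; that is part of the tradeoff discussed next.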
Note also the choice of an opaque structure for the payload format. It is opaque to the database because, from the database's point of view, the payload is just one big string.
The tradeoff is high, but it is common in event sourcing for two main reasons:
- Events must be flexible enough to produce a single timeline out of distinct operations involving distinct tables.
- Events are usually propagated over a network of systems acting on the same distributed transaction, so a kind of Esperanto is useful for communication in a highly heterogeneous field of (micro)services.
Longitudinal tables
This database design pattern is one of the most powerful: it stores raw data in a way that allows several levels of information refinement.
It is also known as a time-series table.
It introduces redundancies that might seem wasteful at first, but even in the repetition there is invaluable information encoded.
Let's explore this with a stock portfolio (alongside a Stock table holding each stock's id, code, and current value):
┌─────────────┐ ┌───────────────┐
│ Portfolio │ │ PortfolioItem │
├─────────────┤ ├───────────────┤
│ id │ │ id │
│ account_id │ │ portfolio_id │
│ description │ │ stock_id │
└─────────────┘ │ amount │
└───────────────┘
We can extract useful information from this.
For example, what are the top 3 most valuable stocks in our portfolio?
with most_valuables as
(select pi.portfolio_id,
st.code,
st.value,
pi.amount,
dense_rank()
over(partition by pi.portfolio_id
order by st.value * pi.amount desc) as rank_pos
from PortfolioItem pi
join Stock st on pi.stock_id = st.id
order by st.value * pi.amount desc)
select s1.rank_pos as pos1, s1.code as code1, s1.amount * s1.value as total1,
s2.rank_pos as pos2, s2.code as code2, s2.amount * s2.value as total2,
s3.rank_pos as pos3, s3.code as code3, s3.amount * s3.value as total3
from most_valuables s1,
most_valuables s2,
most_valuables s3
where s1.portfolio_id = s2.portfolio_id
and s1.portfolio_id = s3.portfolio_id
and s2.portfolio_id = s3.portfolio_id
and s1.rank_pos = 1
and s2.rank_pos = 2
and s3.rank_pos = 3;
Assuming a portfolio based on this:
| id | portfolio_id | stock_id | amount |
|---|---|---|---|
| 11 | 1 | 5 | 11 |
| 12 | 1 | 6 | 15 |
| 13 | 1 | 7 | 5 |
| 14 | 1 | 8 | 20 |
| 15 | 1 | 9 | 300 |
and these stock values:
| id | code | value |
|---|---|---|
| 5 | AAPL | 150 |
| 6 | GOOG | 100 |
| 7 | MSFT | 200 |
| 8 | TSLA | 50 |
| 9 | MGLU | 3 |
The presented query would produce something like this:
| pos1 | code1 | total1 | pos2 | code2 | total2 | pos3 | code3 | total3 |
|---|---|---|---|---|---|---|---|---|
| 1 | AAPL | 1650 | 2 | GOOG | 1500 | 3 | MSFT | 1000 |
But hey, this is a snapshot. How well does this portfolio do over time?
To answer this, let's modify our schema to incorporate the time dimension:
┌─────────────┐ ┌───────────────┐
│ Portfolio │ │ PortfolioItem │
├─────────────┤ ├───────────────┤
│ id │ │ id │
│ account_id │ │ portfolio_id │
│ description │ │ stock_id │
└─────────────┘ │ date │
│ amount │
└───────────────┘
This simple addition changes everything, since now it's possible to ask new questions, like "What are my top 3 stocks in the last 3 days?":
with most_valuables as
(select pi.portfolio_id,
pi.date,
st.code,
st.value,
pi.amount,
dense_rank()
over(partition by pi.portfolio_id,pi.date
order by st.value * pi.amount desc) as rank_pos
from PortfolioItem pi
join Stock st on pi.stock_id = st.id
order by st.value * pi.amount desc)
select s1.date,
s1.rank_pos as pos1, s1.code as code1, s1.amount * s1.value as total1,
s2.rank_pos as pos2, s2.code as code2, s2.amount * s2.value as total2,
s3.rank_pos as pos3, s3.code as code3, s3.amount * s3.value as total3
from most_valuables s1,
most_valuables s2,
most_valuables s3
where s1.portfolio_id = s2.portfolio_id
  and s1.portfolio_id = s3.portfolio_id
  and s2.portfolio_id = s3.portfolio_id
  and s1.date = s2.date
  and s1.date = s3.date
  and s1.rank_pos = 1
  and s2.rank_pos = 2
  and s3.rank_pos = 3
  and s1.date > '2025-12-01';
Now we get information over time, simple as that.
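The time dimension is fed by appending rather than updating: each day (or whatever granularity makes sense) the current positions are inserted again with the new date. A sketch, using the hypothetical portfolio above:
-- append today's positions instead of updating them in place
insert into PortfolioItem (portfolio_id, stock_id, "date", amount)
values (1, 5, '2025-12-03', 11),
       (1, 6, '2025-12-03', 15),
       (1, 7, '2025-12-03', 5),
       (1, 8, '2025-12-03', 20),
       (1, 9, '2025-12-03', 300);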
One drawback of longitudinal tables is growth over time. Long tuples with many relations (which may themselves be replicated over time) accumulate and make filtering slower the older the database gets.
Time-series tables are like an extra dimension over the entire database. The extra space can make queries more complex, but this design pattern produces very dense databases, capable of yielding lots of information.
Snapshot tables
The last design pattern for today is the snapshot table.
All patterns presented so far are meant to prevent loss of information and make the data denser. The drawback is that the schema becomes more complex. For example, with the longitudinal PortfolioItem table, recovering the latest items in a portfolio is no longer a simple select, but something like:
select portfolio_id,
       stock_id,
       "date",
       amount
  from (select portfolio_id,
               stock_id,
               "date",
               amount,
               row_number() over (partition by portfolio_id, stock_id
                                  order by "date" desc) as pos
          from PortfolioItem) ranked
 where pos = 1;
Enter the snapshot table pattern.
This pattern is a bit funny because it's like a return to innocence: we create an entire table, or even a set of tables, to represent just the latest state of the data.
The snapshot pattern is built on top of the time-series/longitudinal pattern. It is a complete copy of the longitudinal table, except for the time component.
Like this:
┌─────────────┐ ┌───────────────┐ ┌───────┐
│ Portfolio │ │ PortfolioItem │ │ Stock │
├─────────────┤ ├───────────────┤ ├───────┤
│ id │ │ id │ │ id │
│ account_id │ │ portfolio_id │ │ code │
│ description │ │ stock_id │ │ value │
└─────────────┘ │ amount │ └───────┘
└───────────────┘
┌───────────────────┐
│ PortfolioItemLog │
├───────────────────┤ ┌──────────┐
│ id │ │ StockLog │
│ portfolio_item_id │ ├──────────┤
│ portfolio_id │ │ id │
│ stock_log_id │ │ stock_id │
│ amount │ │ code │
│ date │ │ value │
└───────────────────┘ │ date │
└──────────┘
And just like that, the longitudinal tables keep growing and condensing more and more information, while the latest values live in the snapshot tables.
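Recovering the current portfolio is back to a trivial query against the snapshot tables:
select stock_id, amount
  from PortfolioItem
 where portfolio_id = 1;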
Note also that, just like with append-only ledgers, each update on the snapshot tables implies an insert into the longitudinal tables.
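In practice every change touches two places: the snapshot row is updated in place and a new row is appended to the log, inside one transaction. A sketch for a stock price change (ids and values are illustrative):
begin;
-- snapshot: overwrite with the latest known value
update Stock set value = 172.5 where id = 5;
-- longitudinal log: append, never update
insert into StockLog (stock_id, code, value, "date")
values (5, 'AAPL', 172.5, current_date);
commit;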
Conclusion
These approaches are quite common and may go by other names; such things keep being rediscovered, so multiple names are to be expected.
There are also many others; we may cover more in the future.
That's it and happy hacking!