Classic relational databases enable you to create highly normalized data models with schemas that may contain many tables. Logical entities are broken into several tables, and every complex property of the primary entity (for example, a list, array, or collection) is placed into a separate table.
An example of a simple Person entity that has many email addresses and phone numbers is shown in the following diagram:
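The tables in the diagram can be sketched in T-SQL as follows. This is a minimal sketch: the table names (Person, EmailAddress, PersonPhone, PhoneNumberType) follow the text, while the column definitions beyond the keys are illustrative assumptions.

```sql
-- Minimal sketch of the normalized schema from the diagram.
-- Column lists beyond the keys are illustrative assumptions.
CREATE TABLE Person (
    BusinessEntityID INT PRIMARY KEY,
    FirstName NVARCHAR(50) NOT NULL,
    LastName NVARCHAR(50) NOT NULL
);

CREATE TABLE EmailAddress (
    EmailAddressID INT PRIMARY KEY,
    BusinessEntityID INT NOT NULL REFERENCES Person(BusinessEntityID),
    EmailAddress NVARCHAR(128) NOT NULL
);

CREATE TABLE PhoneNumberType (
    PhoneNumberTypeID INT PRIMARY KEY,
    Name NVARCHAR(50) NOT NULL
);

CREATE TABLE PersonPhone (
    BusinessEntityID INT NOT NULL REFERENCES Person(BusinessEntityID),
    PhoneNumber NVARCHAR(25) NOT NULL,
    PhoneNumberTypeID INT NOT NULL
        REFERENCES PhoneNumberType(PhoneNumberTypeID),
    PRIMARY KEY (BusinessEntityID, PhoneNumber)
);
```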
This simple but still highly normalized data model is optimized for a large number of concurrent users who can update any of the tables in the diagram. We can have one user updating a Person row, another concurrently updating an EmailAddress for the same person, a third updating PersonPhone, and a fourth updating the PhoneNumberType table, all without affecting each other. Furthermore, several users can update different email addresses for the same person without blocking each other, because locks are applied at the row level. As we can see, this model is designed for highly concurrent updates of the same logical records.
The drawback of this approach is that it requires many JOINs or separate queries to retrieve all the data for a single person. Also, to insert a new person that is physically split across several tables, you need a transaction that updates many tables, and in many cases you must follow the parent/child order defined by referential integrity, read the primary key of the parent row before inserting child rows, lock several tables, and so on.
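For illustration, reading one person's complete record from the normalized schema might look like this (a sketch assuming tables named Person, EmailAddress, PersonPhone, and PhoneNumberType, as in the diagram):

```sql
-- Reading one logical entity requires joining (or separately querying)
-- every table that holds a piece of it.
SELECT p.BusinessEntityID, p.FirstName, p.LastName,
       e.EmailAddress, ph.PhoneNumber, pt.Name AS PhoneNumberType
FROM Person p
LEFT JOIN EmailAddress e ON e.BusinessEntityID = p.BusinessEntityID
LEFT JOIN PersonPhone ph ON ph.BusinessEntityID = p.BusinessEntityID
LEFT JOIN PhoneNumberType pt ON pt.PhoneNumberTypeID = ph.PhoneNumberTypeID
WHERE p.BusinessEntityID = @id;
```

Note that a single query like this returns the cross-product of emails and phones for the person, so in practice the application often issues several separate queries instead, which adds round trips.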
There are many workloads that cannot fully leverage this design because they have the following characteristics:
- There is no large number of separate users updating fine-grained pieces of information for the same entity at the same time.
- Users who read information about the primary/logical entity (Person) usually need to read all related information as well.
- Users update the logical entity in one transaction rather than individual pieces (emails, phones). Complex entities are updated in batches, with inserts/updates in both the primary table (Person in this case) and all related secondary tables, which might cause locking issues.
In the schema displayed above, we can notice that email addresses and phone numbers are not changed frequently (how many times have you changed your email address or phone number in the last month?). For these types of workloads, a highly normalized schema doesn't deliver its main benefit, high concurrency, but still imposes the cost of reading and updating highly normalized data.
Architects sometimes choose NoSQL databases for these kinds of models and workloads, because NoSQL collections let them physically store a logical entity as a single record (usually formatted as JSON). However, this choice becomes a problem if some parts of the schema must stay in SQL Database, because the application layer then needs to join relational and non-relational data from different sources.
SQL Server and Azure SQL Database are general-purpose multi-model databases that enable you to fine-tune your models and combine relational and non-relational design concepts in the same database. You can identify the parts of the database schema that do not benefit from a highly normalized design and de-normalize them into hybrid relational + semi-structured tables that are optimized for the workload.
In the example above, instead of dividing person-related data that is not frequently updated into separate tables, you can store this information in the Person table as collections formatted as JSON arrays.
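The hybrid Person table might be defined like this (a sketch; the NVARCHAR(MAX) column types and the ISJSON check constraints are the usual pattern for JSON columns in SQL Server 2016+, but the exact definitions here are assumptions):

```sql
-- Hybrid table: relational columns for the entity itself,
-- JSON arrays for the rarely-updated collections.
CREATE TABLE Person (
    BusinessEntityID INT PRIMARY KEY,
    FirstName NVARCHAR(50) NOT NULL,
    LastName NVARCHAR(50) NOT NULL,
    Emails NVARCHAR(MAX) CHECK (ISJSON(Emails) = 1),
    Phones NVARCHAR(MAX) CHECK (ISJSON(Phones) = 1)
);
```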
This is equivalent to the design approach used in NoSQL databases, where the primary logical entity and all related information are placed together. Access operations in this design become simple:
```sql
SELECT * FROM Person WHERE BusinessEntityID = @id;

INSERT INTO Person (FirstName, LastName, Emails, Phones)
VALUES (@firstName, @lastName, @emails, @phones);

UPDATE Person SET Emails = @emails WHERE BusinessEntityID = @id;

DELETE Person WHERE BusinessEntityID = @id;
```
Instead of transactions and locking across several tables, you have a single atomic request that updates the entity at the logical level.
With native JSON functions in SQL Server 2016, you can even perform fine-grained updates in the collections of entities:
```sql
UPDATE Person
SET Emails = JSON_MODIFY(Emails, '$[2].EmailAddress', @email)
WHERE BusinessEntityID = @id;
```

In this case, we update the third email address in the collection (JSON array indexes are zero-based, hence the path '$[2].EmailAddress') using the parameter @email. Although updating a JSON field with the JSON_MODIFY function cannot match the performance of updating a scalar column in a separate table, this might be a valid trade-off if most of the read/write queries run faster.
The FOR JSON clause in SQL Server 2016+ enables you to easily de-normalize parts of your highly normalized schema using a simple T-SQL query:
```sql
UPDATE Person
SET Emails = (SELECT EmailAddress
              FROM EmailAddresses e
              WHERE e.BusinessEntityID = Person.BusinessEntityID
              FOR JSON PATH);
```
SQL Server and Azure SQL Database enable you to combine relational and non-relational models in the same database. If you identify parts of the database schema that are suitable for NoSQL-style design, you don't need to export those parts into a separate NoSQL database. You can leverage the same design concepts without leaving the database.
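For example, when you do need the collections in relational form again, say, to join them with other tables, the OPENJSON function (also introduced in SQL Server 2016) can expand the JSON arrays back into rows. This is a sketch assuming the Emails column holds an array of objects with an EmailAddress property, as produced by the FOR JSON PATH query above:

```sql
-- Expand the JSON array in Emails back into a relational rowset.
SELECT p.BusinessEntityID, p.LastName, e.EmailAddress
FROM Person p
CROSS APPLY OPENJSON(p.Emails)
    WITH (EmailAddress NVARCHAR(128)) AS e;
```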
Source: Microsoft Blog – SQL Server Storage Engine – Simplify data access using de-normalized models