"No problem can be solved from the same level of consciousness that created it." ~ Albert Einstein

Relational & NoSQL - Data Modeling Principles

Relational Data Modeling is typically driven by the Structure of Available Data.
NoSQL Data Modeling is typically driven by Application-specific Access Patterns, i.e. the Queries to be Supported.

RDBMS Data Model

NoSQL (Key-Value) Data Model

1. Relational Data Modeling – Conceptual, Logical & Physical Models

1. Conceptual Data Model

• Establishes the Entities, their Attributes & their Relationships

2. Logical Data Model

• Defines Data Elements Structure
• Sets the Relationships Between Them

3. Physical Data Model

• Describes DB-specific Implementation
• PKs, FKs, NOT NULL & Other Constraints


Original Article – Data Modelling: Conceptual, Logical, Physical Data Model Types, Guru99

• Data modeling (data modelling) is the process of creating a data model for the data to be stored in a database. This data model is a conceptual representation of Data objects, the associations between different data objects, and the rules. Data modeling helps in the visual representation of data and enforces business rules, regulatory compliances, and government policies on the data. Data Models ensure consistency in naming conventions, default values, semantics, security while ensuring quality of the data.


• The Relational Data Model is defined as an abstract model that organizes data description, data semantics, and consistency constraints of data. The data model emphasizes what data is needed and how it should be organized rather than what operations will be performed on the data. A Data Model is like an architect’s building plan: it helps to build conceptual models and set relationships between data items.
• The two types of Relational Data Modeling Techniques are:
. 1. Entity Relationship (E-R) Model
. 2. UML (Unified Modelling Language)


• The primary goals of using a relational data model are:
. • It ensures that all data objects required by the database are accurately represented. Omission of data will lead to faulty reports and incorrect results.
. • A data model helps design the database at the conceptual, logical, and physical levels.
. • The data model structure helps to define the relational tables, primary and foreign keys, and stored procedures.
. • It provides a clear picture of the base data and can be used by database developers to create a physical database.
. • It is also helpful for identifying missing and redundant data.
. • Though the initial creation of a data model is labor-intensive and time-consuming, in the long run it makes IT infrastructure upgrades and maintenance cheaper and faster.


• Types of Data Models: There are mainly three different types of data models: conceptual data models, logical data models, and physical data models, and each one has a specific purpose. The data models are used to represent the data and how it is stored in the database and to set the relationship between data items.

. 1. Conceptual Data Model: This Data Model defines WHAT the system contains. This model is typically created by Business stakeholders and Data Architects. The purpose is to organize, scope and define business concepts and rules.

. 2. Logical Data Model: Defines HOW the system should be implemented regardless of the DBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to develop a technical map of rules and data structures.

. 3. Physical Data Model: This Data Model describes HOW the system will be implemented using a specific DBMS. This model is typically created by DBAs and developers. The purpose is the actual implementation of the database.


• A Conceptual Data Model is an organized view of database concepts and their relationships. The purpose of creating a conceptual data model is to establish entities, their attributes, and relationships. In this data modeling level, there is hardly any detail available on the actual database structure. Business stakeholders and data architects typically create a conceptual data model.
• The 3 basic tenets of a Conceptual Data Model are:
. • Entity: A real-world thing
. • Attribute: Characteristics or properties of an entity
. • Relationship: Dependency or association between two entities

. • Customer and Product are two entities. Customer number and name are attributes of the Customer entity
. • Product name and price are attributes of product entity
. • Sale is the relationship between the customer and product

. • Offers Organisation-wide coverage of the business concepts.
. • This type of Data Model is designed and developed for a business audience.
. • The conceptual model is developed independently of hardware specifications like data storage capacity, location or software specifications like DBMS vendor and technology. The focus is to represent data as a user will see it in the “real world.”
• Conceptual data models, also known as Domain models, create a common vocabulary for all stakeholders by establishing basic concepts and scope.


• The Logical Data Model is used to define the structure of data elements and to set relationships between them. The logical data model adds further information to the conceptual data model elements. The advantage of using a Logical data model is that it provides a foundation for the Physical model. However, the modeling structure remains generic.
• At this Data Modeling level, no primary or secondary key is defined. At this Data modeling level, you need to verify and adjust the connector details that were set earlier for relationships.

. • Describes data needs for a single project but could integrate with other logical data models based on the scope of the project.
. • Designed and developed independently from the DBMS.
. • Data attributes will have datatypes with exact precisions and length.
. • Normalization is typically applied to the model up to 3NF.

Relational Data Model - 3 Types


• A Physical Data Model describes a database-specific implementation of the data model. It offers database abstraction and helps generate the schema. This is because of the richness of meta-data offered by a Physical Data Model. The physical data model also helps in visualizing database structure by replicating database column keys, constraints, indexes, triggers, and other RDBMS features.

. • The physical data model describes the data needs of a single project or application, though it may be integrated with other physical data models based on project scope.
. • The data model contains relationships between tables, which address the cardinality and nullability of the relationships.
. • Developed for a specific version of a DBMS, location, data storage or technology to be used in the project.
. • Columns should have exact datatypes, lengths assigned and default values.
. • Primary and Foreign keys, views, indexes, access profiles, and authorizations, etc. are defined.
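
As a rough illustration of what a physical model adds, the sketch below declares the Customer, Product, and Sale example from the conceptual model using Python's built-in sqlite3 module. The choice of SQLite, the table and column names, the types, and the index are all assumptions made for illustration; none of them come from the original article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled

# Physical model: exact column types, keys, NOT NULL, defaults, and an index.
conn.executescript("""
CREATE TABLE customer (
    customer_number INTEGER PRIMARY KEY,
    name            TEXT NOT NULL
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      NUMERIC NOT NULL DEFAULT 0 CHECK (price >= 0)
);
CREATE TABLE sale (
    sale_id         INTEGER PRIMARY KEY,
    customer_number INTEGER NOT NULL REFERENCES customer(customer_number),
    product_id      INTEGER NOT NULL REFERENCES product(product_id),
    quantity        INTEGER NOT NULL DEFAULT 1
);
CREATE INDEX idx_sale_customer ON sale(customer_number);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO product VALUES (10, 'Notebook', 4.50)")
conn.execute("INSERT INTO sale (customer_number, product_id) VALUES (1, 10)")


def insert_orphan_sale():
    """Try to record a sale for a customer that does not exist."""
    try:
        conn.execute("INSERT INTO sale (customer_number, product_id) VALUES (99, 10)")
        return True
    except sqlite3.IntegrityError:
        return False  # rejected by the foreign key constraint
```

Calling insert_orphan_sale() shows the database itself, not the application, rejecting a Sale that refers to a missing Customer.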



. • The main goal of designing a data model is to make certain that data objects offered by the functional team are represented accurately.
. • The data model should be detailed enough to be used for building the physical database.
. • The information in the data model can be used for defining the relationship between tables, primary and foreign keys, and stored procedures.
. • A data model helps the business to communicate within and across organizations.
. • A data model helps to document data mappings in the ETL process.
. • It helps to recognize the correct sources of data to populate the model.


. • To develop a data model, one should know the characteristics of the physically stored data.
. • It is a navigational system, which makes application development and management complex. It therefore requires knowledge of the biographical truth of the data.
. • Even a small change made to the structure requires modification of the entire application.
. • There is no set data manipulation language in the DBMS.


. • Relational Data modeling is the process of developing a data model for the data to be stored in a Database.
. • Relational Data Models ensure consistency in naming conventions, default values, semantics, and security while ensuring quality of the data.
. • Relational Data Model structure helps to define the relational tables, primary and foreign keys, and stored procedures.
. • There are three types of data models: conceptual, logical, and physical.
. • The main aim of the conceptual model is to establish the entities, their attributes, and their relationships.
. • The logical data model defines the structure of the data elements and sets the relationships between them.
. • A Physical Data Model describes the database-specific implementation of the data model.
. • The main goal of designing a data model is to make certain that data objects offered by the functional team are represented accurately.
. • The biggest drawback is that even a small change made to the structure requires modification of the entire application.

2. NoSQL Data Modeling

Original Article – NoSQL Data Modeling Techniques by Ilya Katsov, 2012

• NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like the CAP theorem apply well to NoSQL systems. At the same time, NoSQL data modeling is not so well studied and lacks the systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

• To explore data modeling techniques, we have to start with a more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. The following figure depicts imaginary “evolution” of the major NoSQL system families, namely, Key-Value stores, BigTable-style databases, Document databases, Full Text Search Engines, and Graph databases:

• First, we should note that SQL and the relational model in general were designed a long time ago to interact with the end user. This user-oriented nature had vast implications:
• The end user is often interested in aggregated reporting information, not in separate data items, and SQL pays a lot of attention to this aspect.
• No one can expect human users to explicitly control concurrency, integrity, consistency, or data type validity. That’s why SQL pays a lot of attention to transactional guarantees, schemas, and referential integrity.
• On the other hand, it turned out that software applications are often not interested in in-database aggregation and are able, at least in many cases, to control integrity and validity themselves. Besides this, eliminating these features had an extremely important influence on the performance and scalability of the stores. And this was where a new evolution of data models began:
. • Key-Value storage is a very simplistic, but very powerful model. Many techniques that are described below are perfectly applicable to this model.
. • One of the most significant shortcomings of the Key-Value model is a poor applicability to cases that require processing of key ranges. Ordered Key-Value model overcomes this limitation and significantly improves aggregation capabilities.
. • Ordered Key-Value model is very powerful, but it does not provide any framework for value modeling. In general, value modeling can be done by an application, but BigTable-style databases go further and model values as a map-of-maps-of-maps, namely, column families, columns, and timestamped versions.
. • Document databases advance the BigTable model, offering two significant improvements. The first one is values with schemas of arbitrary complexity, not just a map-of-maps. The second one is database-managed indexes, at least in some implementations. Full Text Search Engines can be considered a related species in the sense that they also offer flexible schemas and automatic indexes. The main difference is that Document databases group indexes by field names, as opposed to Search Engines, which group indexes by field values. It is also worth noting that some Key-Value stores like Oracle Coherence gradually move towards Document databases via the addition of indexes and in-database entry processors.
• Finally, Graph data models can be considered a side branch of evolution that originates from the Ordered Key-Value models. Graph databases allow one to model business entities very transparently (this depends on that), but hierarchical modeling techniques make other data models very competitive in this area too. Graph databases are related to Document databases because many implementations allow one to model a value as a map or document.


• The rest of this article describes concrete data modeling techniques and patterns. As a preface, I would like to provide a few general notes on NoSQL data modeling:
. • NoSQL data modeling often starts from the application-specific queries as opposed to relational modeling:
. • Relational modeling is typically driven by the structure of available data. The main design theme is “What answers do I have?”
. • NoSQL data modeling is typically driven by application-specific access patterns, i.e. the types of queries to be supported. The main design theme is “What questions do I have?”
. • NoSQL data modeling often requires a deeper understanding of data structures and algorithms than relational database modeling does. In this article I describe several well-known data structures that are not specific for NoSQL, but are very useful in practical NoSQL modeling.
. • Data duplication and denormalization are first-class citizens.
. • Relational databases are not very convenient for hierarchical or graph-like data modeling and processing. Graph databases are obviously a perfect solution for this area, but most NoSQL solutions are actually surprisingly strong for such problems. That is why the current article devotes a separate section to hierarchical data modeling.

• Although data modeling techniques are basically implementation agnostic, this is a list of the particular systems that I had in mind while working on this article:
. • Key-Value Stores: Oracle Coherence, Redis, Kyoto Cabinet
. • BigTable-style Databases: Apache HBase, Apache Cassandra
. • Document Databases: MongoDB, CouchDB
. • Full Text Search Engines: Apache Lucene, Apache Solr
. • Graph Databases: neo4j, FlockDB

• This section is devoted to the basic principles of NoSQL data modeling.


• Denormalization can be defined as the copying of the same data into multiple documents or tables in order to simplify/optimize query processing or to fit the user’s data into a particular data model. Most techniques described in this article leverage denormalization in one or another form.
• In general, denormalization is helpful for the following trade-offs:
. • Query data volume or IO per query VS total data volume. Using denormalization one can group all data that is needed to process a query in one place. This often means that for different query flows the same data will be accessed in different combinations. Hence we need to duplicate data, which increases total data volume.
. • Processing complexity VS total data volume. Modeling-time normalization and consequent query-time joins obviously increase the complexity of the query processor, especially in distributed systems. Denormalization allows one to store data in a query-friendly structure to simplify query processing.
APPLICABILITY: Key-Value Stores, Document Databases, BigTable-style Databases


• All major genres of NoSQL provide soft schema capabilities in one way or another:
. • Key-Value Stores and Graph Databases typically do not place constraints on values, so values can have an arbitrary format. It is also possible to vary the number of records for one business entity by using composite keys. For example, a user account can be modeled as a set of entries with composite keys like UserID_name, UserID_email, UserID_messages and so on. If a user has no email or messages, then the corresponding entry is not recorded.
. • BigTable models support soft schema via a variable set of columns within a column family and a variable number of versions for one cell.
. • Document databases are inherently schema-less, although some of them allow one to validate incoming data using a user-defined schema.
• Soft schema allows one to form classes of entities with complex internal structures (nested entities) and to vary the structure of particular entities. This feature provides two major facilities:
. • Minimization of one-to-many relationships by means of nested entities and, consequently, reduction of joins.
. • Masking of “technical” differences between business entities and modeling of heterogeneous business entities using one collection of documents or one table.
• These facilities are illustrated in the figure below. This figure depicts modeling of a product entity for an eCommerce business domain. Initially, we can say that all products have an ID, Price, and Description. Next, we discover that different types of products have different attributes like Author for Book or Length for Jeans. Some of these attributes have a one-to-many or many-to-many nature like Tracks in Music Albums. Next, it is possible that some entities cannot be modeled using fixed types at all. For example, Jeans attributes are not consistent across brands and are specific to each manufacturer. It is possible to overcome all these issues in a relational normalized data model, but the solutions are far from elegant. Soft schema allows one to use a single Aggregate (product) that can model all types of products and their attributes:
• Embedding with denormalization can greatly impact updates both in performance and consistency, so special attention should be paid to update flows.
APPLICABILITY: Key-Value Stores, Document Databases, BigTable-style Databases
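
A minimal sketch of the product Aggregate described above, using plain Python dictionaries as stand-ins for documents in one collection. The field names, sample values, and the helper extra_attributes are hypothetical and only illustrate the idea of heterogeneous entities sharing a soft schema.

```python
# One "collection" holding heterogeneous product aggregates.
# Every product shares id/price/description; the rest varies by type.
COMMON_FIELDS = ("id", "price", "description")

products = {
    "p1": {"id": "p1", "price": 12.0, "description": "Novel",
           "type": "Book", "author": "Jane Doe"},
    "p2": {"id": "p2", "price": 9.5, "description": "Album",
           "type": "Music Album",
           # one-to-many relationship embedded as nested entities
           "tracks": [{"title": "Intro", "seconds": 62},
                      {"title": "Outro", "seconds": 180}]},
    "p3": {"id": "p3", "price": 40.0, "description": "Jeans",
           "type": "Jeans",
           # manufacturer-specific attributes with no fixed type
           "attributes": {"length": 32, "fit": "slim"}},
}


def extra_attributes(product):
    """Return the type-specific part of a product aggregate."""
    return {k: v for k, v in product.items()
            if k not in COMMON_FIELDS and k != "type"}
```

Reading a whole product, including its tracks, is a single lookup with no join. The trade-off noted above still applies: updating a nested entity means rewriting the aggregate that contains it.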


• Joins are rarely supported in NoSQL solutions. As a consequence of the “question-oriented” NoSQL nature, joins are often handled at design time as opposed to relational models where joins are handled at query execution time. Query time joins almost always mean a performance penalty, but in many cases one can avoid joins using Denormalization and Aggregates, i.e. embedding nested entities. Of course, in many cases joins are inevitable and should be handled by an application. The major use cases are:
. • Many to many relationships are often modeled by links and require joins.
. • Aggregates are often inapplicable when entity internals are the subject of frequent modifications. It is usually better to keep a record that something happened and join the records at query time, as opposed to changing a value. For example, a messaging system can be modeled as a User entity that contains nested Message entities. But if messages are often appended, it may be better to extract Messages as independent entities and join them to the User at query time:
APPLICABILITY: Key-Value Stores, Document Databases, BigTable-style Databases, Graph Databases
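
The messaging example can be sketched as follows. This is an in-memory illustration only: the users dictionary and messages list stand in for two independent tables or collections, and the function names are hypothetical.

```python
users = {"u1": {"name": "Ann"}, "u2": {"name": "Bob"}}
messages = []  # independent Message records, append-only


def post_message(user_id, text):
    """Appending a message inserts a new record; the User is never rewritten."""
    messages.append({"user_id": user_id, "text": text})


def user_with_messages(user_id):
    """Application-side join: combine the User with its Messages at query time."""
    user = dict(users[user_id])
    user["messages"] = [m["text"] for m in messages if m["user_id"] == user_id]
    return user


post_message("u1", "hello")
post_message("u2", "hi")
post_message("u1", "bye")
```

In a real store the filter inside user_with_messages would be a key-range or index lookup rather than a scan of every message.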

• In this section we discuss general modeling techniques that are applicable to a variety of NoSQL implementations.


• Many, although not all, NoSQL solutions have limited transaction support. In some cases one can achieve transactional behavior using distributed locks or application-managed MVCC, but it is common to model data using an Aggregates technique to guarantee some of the ACID properties.
• One of the reasons why powerful transactional machinery is an inevitable part of the relational databases is that normalized data typically require multi-place updates. On the other hand, Aggregates allow one to store a single business entity as one document, row or key-value pair and update it atomically:
• Of course, Atomic Aggregates as a data modeling technique is not a complete transactional solution, but if the store provides certain guarantees of atomicity, locks, or test-and-set instructions, then Atomic Aggregates can be applicable.
APPLICABILITY: Key-Value Stores, Document Databases, BigTable-style Databases
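
The sketch below shows one way an application might update an aggregate atomically on top of a test-and-set primitive. VersionedStore is a toy in-memory class written for this example, not the API of any real database, and the order structure is an assumption.

```python
import threading


class VersionedStore:
    """Toy key-value store offering a per-key test-and-set write."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))  # (version, value)

    def compare_and_set(self, key, expected_version, value):
        """Write only if nobody else has written since we read."""
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, value)
            return True


def add_order_line(store, order_id, line):
    """Update the whole order aggregate (lines and total) as one atomic write."""
    while True:
        version, order = store.get(order_id)
        order = order or {"lines": [], "total": 0}
        updated = {"lines": order["lines"] + [line],
                   "total": order["total"] + line["price"]}
        if store.compare_and_set(order_id, version, updated):
            return updated
        # Someone else won the race: re-read and retry.
```

Because the lines and the total live in one value, a reader never observes a total that disagrees with the lines, which in a normalized model would need a multi-row transaction.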


• Perhaps the greatest benefit of an unordered Key-Value data model is that entries can be partitioned across multiple servers by just hashing the key. Sorting makes things more complex, but sometimes an application is able to take some advantage of ordered keys even if the storage doesn’t offer such a feature. Let’s consider the modeling of email messages as an example:
. 1. Some NoSQL stores provide atomic counters that allow one to generate sequential IDs. In this case one can store messages using userID_messageID as a composite key. If the latest message ID is known, it is possible to traverse previous messages. It is also possible to traverse preceding and succeeding messages for any given message ID.
. 2. Messages can be grouped into buckets, for example, daily buckets. This allows one to traverse a mail box backward or forward starting from any specified date or the current date.
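
The first option above can be sketched with a plain dictionary standing in for an unordered key-value store. The counters dictionary simulates the store's atomic counter, and all names here are hypothetical.

```python
store = {}     # unordered key-value store: no range scans available
counters = {}  # stand-in for the store's atomic per-user counter


def next_message_id(user_id):
    """Simulates an atomic increment that yields sequential IDs."""
    counters[user_id] = counters.get(user_id, 0) + 1
    return counters[user_id]


def put_message(user_id, body):
    message_id = next_message_id(user_id)
    store[f"{user_id}_{message_id}"] = body  # composite key userID_messageID
    return message_id


def latest_messages(user_id, count):
    """Walk backward from the latest ID using only point lookups."""
    latest = counters.get(user_id, 0)
    ids = range(latest, max(latest - count, 0), -1)
    return [store[f"{user_id}_{i}"] for i in ids]


for text in ("a", "b", "c"):
    put_message("u1", text)
```

Since every key can be computed from the counter value, the application enumerates messages in order even though the store itself has no notion of key order.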


• Dimensionality Reduction is a technique that allows one to map multidimensional data to a Key-Value model or to other non-multidimensional models.
• Traditional geographic information systems use some variation of a Quadtree or R-Tree for indexes. These structures need to be updated in-place and are expensive to manipulate when data volumes are large. An alternative approach is to traverse the 2D structure and flatten it into a plain list of entries. One well-known example of this technique is a Geohash. A Geohash uses a Z-like scan to fill 2D space, and each move is encoded as 0 or 1 depending on direction. The bits for longitude and latitude moves are interleaved, just as the moves themselves alternate. The encoding process is illustrated in the figure below, where black and red bits stand for longitude and latitude, respectively:
• An important feature of a Geohash is its ability to estimate distance between regions using bit-wise code proximity, as is shown in the figure. Geohash encoding allows one to store geographical information using plain data models, like sorted key values preserving spatial relationships. The Dimensionality Reduction technique for BigTable was described in [6.1]. More information about Geohashes and other related techniques can be found in [6.2] and [6.3].
APPLICABILITY: Key-Value Stores, Document Databases, BigTable-style Databases
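
The bit interleaving can be sketched in a few lines. This follows the commonly published Geohash scheme (longitude bit first, base-32 alphabet without a, i, l, o); the function name and the precision parameter are choices made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat, lon, precision=8):
    """Encode a point by repeatedly halving the lon/lat ranges, interleaving bits."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # even bits encode longitude, odd bits latitude
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):  # 5 bits per base-32 character
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)
```

A longer hash only appends characters to a shorter hash of the same point, so points in the same cell share a key prefix. That is what lets an ordered key-value store answer "everything in this region" with a key-range scan.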


• Index Table is a very straightforward technique that allows one to take advantage of indexes in stores that do not support indexes internally. The most important class of such stores is the BigTable-style database. The idea is to create and maintain a special table with keys that follow the access pattern. For example, there is a master table that stores user accounts that can be accessed by user ID. A query that retrieves all users by a specified city can be supported by means of an additional table where city is a key:
• An Index table can be updated for each update of the master table or in batch mode. Either way, it results in an additional performance penalty and can become a consistency issue.
• Index Table can be considered as an analog of materialized views in relational databases.
APPLICABILITY: BigTable-style Databases
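
A minimal in-memory sketch of the master table plus a city index table, maintained on every write. The two dictionaries stand in for two tables, and the function names are hypothetical.

```python
from collections import defaultdict

users_by_id = {}                     # master table: user ID -> account
user_ids_by_city = defaultdict(set)  # index table: city -> user IDs


def put_user(user_id, name, city):
    """Write the master row and keep the index table in step with it."""
    old = users_by_id.get(user_id)
    if old is not None:
        # Remove the stale index entry, otherwise the index drifts.
        user_ids_by_city[old["city"]].discard(user_id)
    users_by_id[user_id] = {"name": name, "city": city}
    user_ids_by_city[city].add(user_id)


def users_in_city(city):
    """Answer 'all users by city' through the index, then fetch master rows."""
    return [users_by_id[uid] for uid in sorted(user_ids_by_city.get(city, ()))]


put_user("u1", "Ann", "Paris")
put_user("u2", "Bob", "Paris")
put_user("u1", "Ann", "Rome")  # a move touches both tables
```

The two writes inside put_user are not atomic in a real distributed store, which is exactly where the consistency issue mentioned above comes from.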


• Composite key is a very generic technique, but it is extremely beneficial when a store with ordered keys is used. Composite keys in conjunction with secondary sorting allow one to build a kind of multidimensional index which is fundamentally similar to the previously described Dimensionality Reduction technique. For example, let’s take a set of records where each record is a user statistic. If we are going to aggregate these statistics by the region the user came from, we can use keys in the format (State:City:UserID), which allows us to iterate over records for a particular state or city if the store supports the selection of key ranges by a partial key match (as BigTable-style systems do):
. SELECT Values WHERE state="CA:*"
. SELECT Values WHERE city="CA:San Francisco*"
APPLICABILITY: BigTable-style Databases
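
The partial key match can be imitated with a sorted list and binary search, which is roughly what an ordered store does internally. The sample records and the scan_prefix helper are assumptions made for this sketch.

```python
import bisect

# Records sorted by composite key State:City:UserID, as an ordered store keeps them.
records = sorted([
    ("CA:San Francisco:u1", 10),
    ("CA:San Francisco:u7", 3),
    ("CA:San Jose:u2", 5),
    ("NY:New York:u3", 8),
])
keys = [key for key, _ in records]


def scan_prefix(prefix):
    """Return all records whose key starts with prefix, via two binary searches."""
    start = bisect.bisect_left(keys, prefix)
    # "\uffff" sorts after any character used in these keys.
    end = bisect.bisect_left(keys, prefix + "\uffff")
    return records[start:end]


def total_for(prefix):
    return sum(value for _, value in scan_prefix(prefix))
```

Here total_for("CA:") aggregates a whole state and total_for("CA:San Francisco:") a single city, both as one contiguous range read.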


• Composite keys may be used not only for indexing, but for different types of grouping. Let’s consider an example. There is a huge array of log records with information about internet users and their visits from different sites (click stream). The goal is to count the number of unique users for each site. This is similar to the following SQL query:
. SELECT count(distinct(user_id)) FROM clicks GROUP BY site
• We can model this situation using composite keys with a UserID prefix:
• Counting Unique Users using Composite Keys
. • The idea is to keep all records for one user collocated, so it is possible to fetch such a frame into memory (one user cannot produce too many events) and to eliminate site duplicates using a hash table or a similar structure. An alternative technique is to have one entry for one user and append sites to this entry as events arrive. Nevertheless, entry modification is generally less efficient than entry insertion in the majority of implementations.
APPLICABILITY: Ordered Key-Value Stores, BigTable-style Databases
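
A small sketch of the counting pass, assuming the click records are already sorted by a key with a UserID prefix. The key layout "user:sequence" and the sample data are hypothetical.

```python
from collections import Counter
from itertools import groupby

# Click stream sorted by composite key "UserID:sequence", so each user's
# records are adjacent and can be processed as one small frame.
clicks = sorted([
    ("u1:0001", "a.com"), ("u1:0002", "a.com"), ("u1:0003", "b.com"),
    ("u2:0001", "a.com"),
    ("u3:0001", "b.com"), ("u3:0002", "b.com"),
    ("u4:0001", "a.com"),
])


def unique_users_per_site(sorted_clicks):
    """One sequential pass: dedupe sites per user, then count users per site."""
    counts = Counter()
    by_user = groupby(sorted_clicks, key=lambda kv: kv[0].split(":")[0])
    for _user, frame in by_user:
        sites = {site for _key, site in frame}  # in-memory frame for one user
        counts.update(sites)                    # each site counted once per user
    return dict(counts)
```

Only one user's frame is held in memory at a time, so the pass scales with the size of the largest frame rather than with the whole click stream.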


• Aggregation with inverted indexes is more a data processing pattern than a data modeling technique. Nevertheless, data models are also impacted by usage of this pattern. The main idea of this technique is to use an index to find data that meets some criteria, but to aggregate data using the original representation or full scans. Let’s consider an example. There are a number of log records with information about internet users and their visits from different sites (click stream). Let’s assume that each record contains a user ID, the categories this user belongs to (Men, Women, Bloggers, etc.), the city this user came from, and the visited site. The goal is to describe the audience that meets some criteria (site, city, etc.) in terms of unique users for each category that occurs in this audience (i.e. in the set of users that meet the criteria).
• It is quite clear that a search of users that meet the criteria can be efficiently done using inverted indexes like {Category -> [user IDs]} or {Site -> [user IDs]}. Using such indexes, one can intersect or unify corresponding user IDs (this can be done very efficiently if user IDs are stored as sorted lists or bit sets) and obtain an audience. But describing an audience which is similar to an aggregation query like
. SELECT count(distinct(user_id)) … GROUP BY category
• cannot be handled efficiently using an inverted index if the number of categories is big. To cope with this, one can build a direct index of the form {UserID -> [Categories]} and iterate over it in order to build a final report. This schema is depicted below:
• Counting Unique Users using Inverse and Direct Indexes
. • And as a final note, we should take into account that random retrieval of records for each user ID in the audience can be inefficient. One can grapple with this problem by leveraging batch query processing. This means that some number of user sets can be precomputed (for different criteria) and then all reports for this batch of audiences can be computed in one full scan of direct or inverse index.
APPLICABILITY: Key-Value Stores, BigTable-style Databases, Document Databases
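
The two-step flow (select the audience with inverted indexes, then describe it with a direct index) might look like this in miniature. All index contents and the function name are made-up sample data.

```python
from collections import Counter

# Inverted indexes: attribute value -> set of user IDs.
site_index = {"a.com": {1, 2, 3}, "b.com": {2, 4}}
city_index = {"Paris": {1, 2}, "Rome": {3, 4}}

# Direct index: user ID -> categories.
categories_by_user = {
    1: ["Men", "Bloggers"],
    2: ["Women"],
    3: ["Men"],
    4: ["Women", "Bloggers"],
}


def describe_audience(site, city):
    """Count unique users per category among users matching site AND city."""
    # Step 1: find the audience by intersecting inverted-index postings.
    audience = site_index.get(site, set()) & city_index.get(city, set())
    # Step 2: aggregate by iterating the direct index for those users.
    report = Counter()
    for user_id in audience:
        report.update(set(categories_by_user.get(user_id, [])))
    return dict(report)
```

Step 2 touches one direct-index entry per user in the audience, which is the random-retrieval cost that batch processing is meant to amortize.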



. • Trees or even arbitrary graphs (with the aid of denormalization) can be modeled as a single record or document.
. • This technique is efficient when the tree is accessed at once (for example, an entire tree of blog comments is fetched to show a page with a post).
. • Search and arbitrary access to the entries may be problematic.
. • Updates are inefficient in most NoSQL implementations (as compared to independent nodes).
APPLICABILITY: Key-Value Stores, Document Databases


• Adjacency Lists are a straightforward way of graph modeling – each node is modeled as an independent record that contains arrays of direct ancestors or descendants. It allows one to search for nodes by identifiers of their parents or children and, of course, to traverse a graph by doing one hop per query. This approach is usually inefficient for getting an entire subtree for a given node, for deep or wide traversals.
APPLICABILITY: Key-Value Stores, Document Databases
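
A brief sketch of an adjacency list held in a dictionary, with a counter that makes the "one hop per query" cost visible. The node names and the subtree helper are hypothetical.

```python
# Each node is an independent record that lists its direct descendants.
nodes = {
    "root": {"children": ["a", "b"]},
    "a": {"children": ["a1", "a2"]},
    "b": {"children": []},
    "a1": {"children": []},
    "a2": {"children": []},
}


def subtree(root_id):
    """Collect a whole subtree; every visited node costs one lookup."""
    visited, queries = [], 0
    stack = [root_id]
    while stack:
        current = stack.pop()
        queries += 1  # one hop = one round trip to the store
        visited.append(current)
        stack.extend(nodes[current]["children"])
    return visited, queries
```

Fetching the five-node tree takes five lookups, which illustrates why this layout is a poor fit for deep or wide traversals.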


• Materialized Paths is a technique that helps to avoid recursive traversals of tree-like structures. This technique can be considered a kind of denormalization. The idea is to annotate each node with the identifiers of all its parents or children, so that it is possible to determine all descendants or predecessors of the node without traversal:
• This technique is especially helpful for Full Text Search Engines because it allows one to convert hierarchical structures into flat documents. One can see in the figure above that all products or subcategories within the Men’s Shoes category can be retrieved using a short query which is simply a category name.
• Materialized Paths can be stored as a set of IDs or as a single string of concatenated IDs. The latter option allows one to search for nodes that meet a certain partial path criteria using regular expressions. This option is illustrated in the figure below (path includes node itself):
. • Query Materialized Paths using RegExp
APPLICABILITY: Key-Value Stores, Document Databases, Search Engines
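
Querying concatenated paths with a regular expression can be sketched as below. The category names, the "/" separator, and the helper functions are assumptions for illustration; each stored path includes the node itself.

```python
import re

# Node -> materialized path (the path includes the node itself).
paths = {
    "shoes-men": "/shoes/men",
    "boots": "/shoes/men/boots",
    "slippers": "/shoes/men/slippers",
    "sandals-women": "/shoes/women/sandals",
}


def descendants(ancestor_path):
    """All nodes strictly below ancestor_path, found without any traversal."""
    pattern = re.compile(re.escape(ancestor_path) + r"/.+")
    return sorted(name for name, path in paths.items()
                  if pattern.fullmatch(path))


def ancestors(node):
    """Every ancestor path is already spelled out inside the node's own path."""
    parts = paths[node].strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
```

Both directions are answered from the stored strings alone, at the cost of rewriting many paths when a subtree is moved.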


• Nested sets is a standard technique for modeling tree-like structures. It is widely used in relational databases, but it is perfectly applicable to Key-Value Stores and Document Databases. The idea is to store the leaves of the tree in an array and to map each non-leaf node to a range of leaves using start and end indexes, as is shown in the figure below:
• This structure is pretty efficient for immutable data because it has a small memory footprint and allows one to fetch all leaves for a given node without traversals. Nevertheless, inserts and updates are quite costly because the addition of one leaf causes an extensive update of indexes.
APPLICABILITY: Key-Value Stores, Document Databases
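
A compact sketch of the leaf array with per-node ranges, including an insert that shows how many ranges have to be rewritten. The sample tree, the end-exclusive range convention, and the parent_of map are assumptions of this example, and it assumes every range is non-empty.

```python
leaves = ["boots", "slippers", "sandals", "heels"]

# Non-leaf node -> (start, end) indexes into leaves, end exclusive.
ranges = {"shoes": (0, 4), "men": (0, 2), "women": (2, 4)}
parent_of = {"shoes": None, "men": "shoes", "women": "shoes"}


def leaves_under(node):
    """Fetch all leaves of a node with one slice, no traversal."""
    start, end = ranges[node]
    return leaves[start:end]


def insert_leaf(parent, leaf):
    """Append a leaf to parent's range; return how many ranges were rewritten."""
    pos = ranges[parent][1]
    leaves.insert(pos, leaf)
    chain = set()  # parent and all of its ancestors grow by one
    node = parent
    while node is not None:
        chain.add(node)
        node = parent_of[node]
    touched = 0
    for name, (start, end) in list(ranges.items()):
        if name in chain:
            ranges[name] = (start, end + 1)
            touched += 1
        elif start >= pos:  # ranges to the right shift by one
            ranges[name] = (start + 1, end + 1)
            touched += 1
    return touched
```

Reads stay a single slice, while one insert under "men" rewrites all three ranges in this tiny tree, which mirrors the cost described above.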


• Search Engines typically work with flat documents, i.e. each document is a flat list of fields and values. The goal of data modeling is to map business entities to plain documents, and this can be challenging if the entities have a complex internal structure. One typical challenge is mapping documents with a hierarchical structure, i.e. documents with nested documents inside. Let’s consider the following example:
• Nested Documents Problem
• Each business entity is some kind of resume. It contains a person’s name and a list of his or her skills with a skill level. An obvious way to model such an entity is to create a plain document with Skill and Level fields. This model allows one to search for a person by skill or by level, but queries that combine both fields are liable to result in false matches, as depicted in the figure above.
• One way to overcome this issue was suggested in [4.6]. The main idea of this technique is to index each skill and corresponding level as a dedicated pair of fields Skill_i and Level_i, and to search for all these pairs simultaneously (where the number of OR-ed terms in a query is as high as the maximum number of skills for one person):
• This approach is not really scalable because query complexity grows rapidly as a function of the number of nested structures.


• The problem with nested documents can be solved using another technique that was also described in [4.6]. The idea is to use proximity queries that limit the acceptable distance between words in the document. In the figure below, all skills and levels are indexed in one field, namely, SkillAndLevel, and the query indicates that the words “Excellent” and “Poetry” should follow one another:


• Graph databases like neo4j are exceptionally good for exploring the neighborhood of a given node or exploring relationships between two or a few nodes. Nevertheless, global processing of large graphs is not very efficient because general purpose graph databases do not scale well. Distributed graph processing can be done using MapReduce and the Message Passing pattern that was described, for example, in one of my previous articles. This approach makes Key-Value stores, Document databases, and BigTable-style databases suitable for processing large graphs.
APPLICABILITY: Key-Value Stores, Document Databases, BigTable-style Databases
