
Backend Development

Data Modeling How-to

I agree with Graeme Simsion's answer to the question "Why should we do data modeling?", which is that "Data modeling is not optional." (from his excellent book Data Modeling Essentials, 3rd edition)  You wouldn't build a house without first drawing a blueprint for it, would you?

The first edition of the book was published ten years before the third edition came out in 2007, and even then Simsion lamented that "[after ten years between the editions] the biggest issue in data modeling remains the basic one of recognizing it as a fundamental activity."  As of this writing in 2017, I would claim that the same issue sadly remains -- you can easily work as a software engineer in the industry for ten years without ever having to learn how to do data modeling!

At Qualtrics, we have made data modeling a required step in our enterprise integration projects and a highly recommended step in everyday feature development within our product. In this blog post, I would like to explain how to do basic data modeling, and hopefully make a compelling case for doing it.

Is Data Modeling a Part of Requirements Gathering?

The fundamental question of whether data modeling is required at all seems to reflect the difficulty in answering a question as basic as "where in the software development life cycle should data modeling occur?" There are two schools of thought on this.

At one extreme, practitioners make data modeling part of the requirements gathering process. In this view, one learns the data relationships at the same time that one gathers the business objectives and requirements. After all, the same set of stakeholders could provide inputs into both of these areas.

The other view is to regard business requirements gathering and data modeling as two distinct activities.  I personally subscribe to the latter view, and here is why:

  1. Data modeling is not only a fact-finding activity, but also a design activity.  Mixing the two during the requirements phase confuses the goal of the activity, and may introduce the designer's bias into how business requirements are expressed.
  2. If you capture the data model as part of the business requirements, you are locking yourself into a framework designed by the business owners and expressed as requirements; the business folks may best understand their business needs, but may not have the nuanced understanding of software design that software engineers possess.
  3. Trying to design the data model as early as the requirements phase may result in being locked into a premature design, as described in argument #2 above.  Once you have gone through full requirements gathering, and let the requirements settle in for a bit, you are more likely to come up with an elegant way to organize the data.

The following diagram illustrates the design steps we use for the enterprise integration projects at Qualtrics.

As you can see, we also clearly separate the process of architectural design from data modeling; that is, we don't get into architecture design until we have a clear understanding of what data and data relationships need to be represented in the application.  Too often I observe software engineers deciding on a specific technology (e.g. choosing a relational database like MySQL or PostgreSQL, or a NoSQL database like MongoDB or Cassandra), and attempting to force-fit the application and data into that technology.  This is, essentially, putting the cart before the horse.

A Simple Approach to Data Modeling

Let me digress a little bit and discuss my background, since I by no means consider myself a data modeling expert.  In fact, my experience has been mostly around operating systems, distributed applications, and storage systems.  However, I have been in the industry for 23 years, and I can say from experience that data modeling is essential when designing any application or system.

Traditionally, the practice of data modeling is associated with designing relational database schema.  In this day and age, relational databases are not the only game in town; we now have NoSQL stores, graph databases, object stores, and search engines, yet data modeling is still fundamental.

A traditional approach to data modeling consists of 3 phases:

  • Conceptual Data Modeling designs the data model independent of technology decisions.
  • Logical Data Modeling translates the conceptual model into a logical representation of the data specific to a broad architecture choice (traditionally an RDBMS).  We are talking tables, columns, and primary and foreign keys.
  • Physical Data Modeling further refines the logical data model to produce the actual schema to be used for a particular system of choice, such as MySQL, Oracle, or SQL Server.

Of these phases, I feel that conceptual data modeling is the most fundamental.  Logical and physical data modeling are still useful, but those phases involve well-defined techniques for translating the conceptual data model into efficient database schemas.
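To make the three phases concrete, here is a minimal sketch of the same Customer entity at each level. All names, columns, and the MySQL-flavored DDL are illustrative assumptions for this sketch, not part of the example worked through below:

```python
# Conceptual level: entities and relationships only; no technology decisions.
conceptual = {
    "entity": "Customer",
    "relationships": ["receives zero or more Invitations"],
}

# Logical level: tables, columns, and keys, assuming a broadly relational design.
logical = {
    "table": "customer",
    "columns": ["customer_id", "first_name", "last_name", "email"],
    "primary_key": "customer_id",
}

# Physical level: the actual schema for one specific system (MySQL here).
physical_ddl = """
CREATE TABLE customer (
    customer_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    first_name  VARCHAR(100),
    last_name   VARCHAR(100),
    email       VARCHAR(255) NOT NULL,
    PRIMARY KEY (customer_id)
) ENGINE=InnoDB;
"""

print(conceptual["entity"], "->", logical["table"])
```

Notice how each level adds detail the previous one deliberately left out; the conceptual model would survive a later decision to use, say, a document store instead of MySQL.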

In the following sections, I would like to present a basic approach to conceptual data modeling.

Define User Scenarios

First, we need a clear understanding of the business requirements.  We write user scenarios so that they represent the actual business scenarios described by the business owner -- rather than imagined, generalized, or abstracted versions written by an engineer.

Here is an example user scenario:

The analyst can collect 4,800 responses to the satisfaction survey in Washington state, and 3,200 responses in Oregon state, per month, while preventing the same customer from receiving more than 1 survey invitation within 90 days.  Survey invitations should be sent out at 1PM local time, and the number of invitations should be adjusted based on historical response rate calculated over the last 30 days, in order to smooth out the number of incoming responses per day. The survey invitations expire after 7 calendar days.

For the purpose of agile project management, the above scenario should be broken down into appropriate user stories, but most people know how to do that nowadays. The emphasis here is that we only work with scenarios coming straight from the business owner.
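One way to check that you have captured every number in the scenario is to restate it as configuration. The sketch below does exactly that; the key names are assumptions for illustration, not a real Qualtrics API:

```python
# Illustrative configuration restating the parameters of the user scenario.
survey_program = {
    "survey": "satisfaction_survey",
    "monthly_quotas": {"WA": 4800, "OR": 3200},  # desired responses per month, per state
    "contact_frequency_days": 90,                # min days between invitations per customer
    "send_local_time": "13:00",                  # invitations go out at 1PM local time
    "response_rate_window_days": 30,             # window for the historical response rate
    "invitation_expiry_days": 7,                 # invitations expire after 7 calendar days
}

# Quick sanity check: total desired responses per month across all states.
total_monthly = sum(survey_program["monthly_quotas"].values())
print(total_monthly)  # 8000
```

If a phrase in the scenario has no home in a structure like this, that is usually a sign you have not fully understood the requirement yet.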

Define Entities

Next, we identify the entities involved in the user story. For the above example, the apparent set of entities includes:

  • Survey: A set of questions.
  • Invitation: Invitation to take a given survey. Expires after N days.
  • Response: A set of answers to a survey.
  • Customer: Person taking the survey.
  • Analyst: Person defining the survey and running the program.

Notice that each entity is expressed in the singular form so that cardinality can be defined by the relationships. However, as software designers, we may also identify the following additional entities from the story.

  • Monthly Quota: The number of responses desired per month, per state.
  • Contact Frequency Rule: Constrains how frequently the system sends survey invitations to a customer.
  • Invitation Batch: Contains a set of invitations to process at 1PM on a certain date.

This second set of entities represents certain design decisions with regard to the data we want to track.  The business requirements may not have called for "Monthly Quota" or "Invitation Batch" as primary data entities, but as software engineers, we might choose to create them. We will examine the implications of these entities in our data model shortly.

By the way, one pitfall to avoid in this step is equating "conceptual entities" with relational database tables.  We are not there yet!  Just focus on what conceptual entities make sense from an application design perspective, without worrying about tables and columns at this stage.
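To keep that distinction honest, you can sketch the conceptual entities as plain types with no keys, tables, or column types at all. The field names here are illustrative assumptions:

```python
from dataclasses import dataclass

# Conceptual entities only -- deliberately no table names, keys, or columns.

@dataclass
class Survey:
    name: str

@dataclass
class Customer:
    name: str

@dataclass
class Invitation:
    survey: Survey
    customer: Customer
    expires_after_days: int = 7   # invitations expire after N days

@dataclass
class Response:
    invitation: Invitation        # a response always answers a specific invitation

survey = Survey("satisfaction survey")
alice = Customer("Alice")
inv = Invitation(survey, alice)
resp = Response(inv)
print(resp.invitation.survey.name)  # satisfaction survey
```

Note that a Response points at an Invitation, not directly at a Customer; we will see why that matters when we discuss the relationships below.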

Define Relationships

The next step in data modeling is to determine the complete set of relationships that exist among the entities.  I recommend using an Entity-Relationship (ER) diagram to express the relationships, as opposed to the UML equivalent.  UML is great for developers, but when you are conducting conceptual data modeling, you are still interfacing a lot with business owners, and it takes much less time to explain an ER diagram than a UML diagram to business folks.

The ER diagram provides a concise way to annotate the relationships among entities, along with optionality and cardinality involved.  Here are some optionality and cardinality symbols:

Using these symbols, the example scenario we are working with can be expressed as follows (by the way, the diagram was created using Lucidchart™, which seems to be a great tool for drawing ER diagrams):

So, why did we add Monthly Quota and Invitation Batch at all?  Here are a few reasons:

  1. The requirement that "Survey invitations should be sent out at 1PM local time" indicates that potentially thousands of invitations are scheduled to be sent at a certain time of day.  The Invitation Batch provides a convenient place to set the distribution schedule for those invitations.  Alternatively, we could schedule individual invitations at the same time, but that would make our job scheduler very busy.
  2. The requirement to "collect 4,800 responses ... in Washington state per month" implies that there is a concept of "quotas" set for different market segments per month.  These are probably configurable goals, and we should have a place to store them.
  3. The Monthly Quota entity is where we can define the identifier for each "market segment".  If tracking market segments is a goal of the business requirements, we'd better add it to the model.

But, wait a minute.  Monthly Quota feels more like an "attribute" of a market segment than the other way around.  If we had a different entity called "Market Segment," we could store the numeric monthly quota there!  We'd better rename Monthly Quota to Market Segment to address the requirement more directly. This is a good example of the insights that occur when one follows a formal data modeling process.

Let's discuss the red dotted lines in the diagram.  Should we add line [1]?  Although a Monthly Quota and an Invitation Batch always belong to exactly one Survey entity, without line [1] we will not know which Invitation Batch is related to which Monthly Quota.  Adding line [1] seems to be a good idea for two reasons:

  1. Survey response rate can be calculated for each invitation batch (recall the business requirement about calculating the historical response rate).  Now, market segments may have varying response rates, and if we are to predict responses accurately for each segment, we really need to track which invitation batch belonged to which market segment.
    1. By the way, tracking historical response rates per segment is not trivial.  You need to know which invitation the user has responded to, and update the response rate of the corresponding batch.
    2. Also, you can only calculate the rate after the invitations expire (in this case after 7 days), because the system will continue to collect responses for that configured time period, and we do not want to calculate the rate prematurely, before all responses are received.
  2. We can experiment with different distribution schedules per market segment, and pick the best schedule for each.  To do that, you need the relationship recorded.
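The bookkeeping described in reason #1 can be sketched as follows. The function and field names are assumptions for illustration; the one rule it encodes from the scenario is that a batch's rate is only final once its invitations have expired (7 days here):

```python
from datetime import date, timedelta

EXPIRY_DAYS = 7  # survey invitations expire after 7 calendar days

def batch_response_rate(sent_date, invitations_sent, responses_received, today):
    """Return the batch's response rate, or None if the batch has not expired yet."""
    if today <= sent_date + timedelta(days=EXPIRY_DAYS):
        return None                       # too early: responses may still arrive
    if invitations_sent == 0:
        return None                       # avoid dividing by zero for empty batches
    return responses_received / invitations_sent

today = date(2017, 6, 20)
print(batch_response_rate(date(2017, 6, 1), 1000, 230, today))   # 0.23 -- batch expired
print(batch_response_rate(date(2017, 6, 18), 1000, 40, today))   # None -- still open
```

With line [1] in the model, each batch also carries its market segment, so these per-batch rates can be rolled up per segment to drive the pacing logic.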

What about the red dotted line [2]?  Now, that relationship seems redundant and even misleading, since:

  1. The customer can only create a response after receiving an invitation.
  2. It would not add any new information; you can already derive the one-to-many relationship from the customer to responses, through the Invitation entity's relationships to its neighbors.

Based on the design considerations above, the next iteration of the data model should look as follows (see the renaming of Monthly Quota, and a new line between Market Segment and Invitation Batch):

To recap, this new data model allows tracking of response rate variations in market segments, and allows experimenting with different distribution schedules for each market segment.

A technique for verifying whether the data model makes sense is to turn each relationship into an assertion and verify that they appear correct. From our example, you can generate the following assertions:

  • An individual Survey has zero or more Invitation Batches.
  • An Invitation Batch belongs to exactly one Survey.
  • A Customer may receive zero or more Invitations.
  • An Invitation belongs to exactly one Invitation Batch.

The last assertion here says that you cannot invite a customer to a survey, unless you also create an invitation batch.  Is this what we want?  The answer really depends on the business requirements, and so far, the requirements don't seem to state that we need to schedule individual invitations.  Arguably, this constraint may even make our application more robust, because every invitation is accounted for whenever we calculate the historical response rate per market segment using the Invitation Batch entity!
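The technique of turning relationships into assertions is mechanical enough to sketch in code. The tuples below simply restate the assertions above; the representation itself is an illustrative assumption, not a standard notation:

```python
# Each relationship: (entity, verb phrase, cardinality, related entity).
relationships = [
    ("Survey", "has", "zero or more", "Invitation Batches"),
    ("Invitation Batch", "belongs to", "exactly one", "Survey"),
    ("Customer", "may receive", "zero or more", "Invitations"),
    ("Invitation", "belongs to", "exactly one", "Invitation Batch"),
]

def to_assertion(entity, verb, cardinality, related):
    # Pick "A" or "An" so the sentence reads naturally.
    article = "An" if entity[0] in "AEIOU" else "A"
    return f"{article} {entity} {verb} {cardinality} {related}."

assertions = [to_assertion(*r) for r in relationships]
for a in assertions:
    print(a)
```

Generating the sentences mechanically makes it cheap to print the full list and walk through it with the business owner, which is exactly when mistaken cardinalities tend to surface.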

Define Attributes

The last step is to fill in all the attributes of each entity. We won't go through this exercise in this blog, but as you can imagine, you might add first name, last name, gender, and email attributes to your customer entity, in order to address the customer in a friendly way in survey invitations (e.g. "Mr. Tim Allen").

As you go through adding the attributes, you naturally delve deeper into the meaning of each entity, and you might revise the data model yet again to either split an existing entity (for example, an invitation might go out via SMS or email) or combine entities.  This is a natural course of the design, and you can keep iterating until you have an elegant result.

Summary

In this article, we have looked at a really bare-bones example of using data modeling techniques.  This is probably 1% of what is covered in the book that I introduced at the beginning.  There are some excellent techniques that you can learn from the literature, such as normal forms, supertypes and subtypes, naming techniques, and how to choose primary keys.

Data modeling is one of the most creative aspects of software development, and it is an essential step in any complex application design.  In this article, I argued that:

  • Business requirements should be expressed as real user scenarios written by the business owner, rather than as stories imagined, generalized, or abstracted by a software engineer.
  • Data modeling is a design exercise that happens after business requirements are gathered.
  • Conceptual data modeling should be independent of technology choices.
  • Investment in data modeling up front will make your applications more elegant and more robust.

I hope you have found this article interesting, and engage in earnest data modeling during your next software development project!
