Backend Development
Characterizing your system’s behavior using design of experiments
Have you ever run into a situation where you need to predict or control an output, but don’t know where to set your system’s knobs appropriately? And you need to know how to set them before launching?
We have. On the Text Analytics team, we are making a number of changes to our text processing backend to improve scalability, latency, and facilitate new machine learning processes. Towards these ends, we evaluated Crate, a shared-nothing, distributed SQL database built on elements of Lucene and ElasticSearch, as a replacement data store for post-processed versions of text.
One of the criteria for adopting a new database technology is understanding its storage requirements. If we switched to Crate, we would be adopting a new indexing mechanism and schema. Further, Crate exposes a number of parameters with which we had no experience. We needed a new model to understand these behaviors before moving on to production deployment and migration. Because we had a known system output to predict (storage size), a set of system inputs with unknown interactions, and the need to predict the value ahead of time, we decided to use a Design of Experiments (DoE) approach.
In this article, we will introduce the DoE methodology, explain how we used it to predict our storage requirements for the new Crate database backend, and describe some cases where this process could generate a bad model.
Design of Experiments
The DoE methodology dates back at least as far as James Lind and his work on preventing scurvy at sea in the 18th century. However, formal understanding of the technique did not develop until the 20th century with advances in statistical process control and clinical trials. Design of experiments is usually taught in an advanced statistics courses or via programs like Six Sigma. For a deeper introduction, Pennsylvania State University offers an online course that goes into far more detail than this post.
The top-level process is:
- Identify factors and your system output
- Design and run the experiments
- Fit models and simplify
- Validate the model
- Monitor the model
For our application (Figure 1), we have one independent input, c, the raw text size in bytes, and three parameters:
- i - the number of full-text indices
- p - the number of partitions of the database table
- h - the number of shards for each partition
We want to predict the disk storage requirements (S).
Figure 1: System Model
In Design of Experiments terminology, we have four factors (k=4) that we believe may influence S. All system inputs are considered factors.
The standard and straight-forward way to test four factors is to use a two-level, full factorial design. In this approach, you assign a “low” and “high” level to each factor. If a factor is a boolean parameter, then high and low map to true and false. If a factor is numerical or ordinal, then choose two distinct values that are within your range of possible values. If possible, select values that are order of magnitude apart in order to drive the factor’s impact. However, this is not always possible or pragmatic. Finally, if the factor is categorical then you may be forced to run an experiment for each category.
Within a two-level, full factorial design, you will run 2^{k} experiments, one for each combination of the k factors. In a full factorial design with 3 factors, Table 1 enumerates the eight experiments that will need to be run. If the number of experiments gets too large relative to your budget, you can use a fractional factorial design, but at the cost of a more complex process.
Table 1: High/Low Design with 3 Factors
a | b | c |
- | - | - |
- | - | + |
- | + | - |
- | + | + |
+ | - | - |
+ | - | + |
+ | + | - |
+ | + | + |
At this point, we could have proceeded with an experiment using all four factors. However, we were suspicious about the influence of h, the number of shards. We didn’t think it would significantly impact S based on the Crate design documentation, and reducing the number of experiments from sixteen to eight would save us a lot of time, so we decided to first test h in isolation.
Testing h, Number of Shards per Partition
Controlling for the other factors, we compared disk space with five and ten shards (Table 2). Five and ten are relatively close in value, but are representative of the realistic bounds for this parameter. As you can see in the table below, a 2x change in h had almost no impact on S (less than 2% difference), so we decided to drop h from the experiments going forward. Normally, dropping a factor prematurely can potentially lead to a bad model if factors unexpectedly interact with each other, but both the documentation on how sharding was implemented, as well as a sanity experiment, indicated that it was safe to drop h from the design.
Table 2: h Experimental Results
5 Shards | 10 Shards | |
S (bytes) | 47 979 080 | 48 887 152 |
Testing c, i, and p --- a 2^{3} Design
Using some sample datasets, we tested c, raw text size, at two levels --- the first level at ten megabytes and the second level at one hundred megabytes. We varied i, the number of full-text indices, between two and five indices. We chose these two values based on our schema design. For our application, we index the raw text in multiple ways to implement various search options. For example, we have a separate index to support exact matches and another to support matches on spell-corrected text. Based on our feature set, we expected at least two indices and at most five. Although, two and five are close to each other, they covered the range of realistic values. For p, we chose five and five hundred partitions. These two values are two orders of magnitude apart from each other, but the range of possible values in this case was particularly broad because we had not decided on our partitioning strategy and were considering options that were in that range.
Then, we created tables using the parameters as shown and measured the resulting disk requirements as shown below:
Table 3: Full Experimental Results
Raw Data Size (c) bytes | # Indices (i) | # Partitions (p) | Actual Storage Required (S) bytes |
9,673,389 | 2 | 5 | 17,297,548 |
9,673,389 | 2 | 500 | 31,234,977 |
9,673,389 | 5 | 5 | 31,787,432 |
9,673,389 | 5 | 500 | 63,861,567 |
98,046,526 | 2 | 5 | 156,334,250 |
98,046,526 | 2 | 500 | 227,733,331 |
98,046,526 | 5 | 5 | 274,408,607 |
98,046,526 | 5 | 500 | 449,638,647 |
Model Fitting
Another name for model fitting is regression. Regression is a process for determining an equation for how the various variables relate to each other. In our case, we want an equation that computes S, the storage requirements, given the system inputs. There are various techniques for performing a regression, depending on the nature of the data. For this example, we used a simple linear model. In a 2^{k} design, a linear model is the default choice. Most systems we, as humans, interact with are linear in behavior, so you often start with this design until you find evidence to refute its assumptions.
As a simplification step, we noted that the parameters within the table vary widely in magnitude --- c and S will be in the millions while i is limited to single digits. For the model fitting, we converted c and S into megabytes which makes reasoning about the equations and their coefficients easier.
Fitting the data and each of the column interactions using a linear model (see the section below for the specific steps) , the resulting function (after simplifying the coefficients) is:
S=0.8c+i+0.4ci+0.004p+0.006ip+0.0007cip
While this equation fits the known data extremely well, it is overfit and includes some terms that are not significant. Once you have an equation, examine each term and try removing it if the influence is small. For example, i values will be in the single digits, so an i by itself is an excellent candidate to drop. The p value, although it may be in the hundreds, is being multiplied by 0.004, so we can throw that out too. After these simplifications, we fit another linear model:
S=0.8c+0.3ci+0.7ip+0.07cip
Where:
- S is the storage required, in megabytes (1 million bytes)
- c is the size of the input text, in megabytes
- i is the number of full-text indices
- p is the number of partitions
Model Fitting using Mathematica™
Mathematica is a commercial computer algebra system created by Wolfram Research. There are many tools that can aid you in DoE and model fitting (the commercial tool Minitab™ and the open-source Python library scikit-learn are both popular choices), but we use Mathematica because of its broad built-in functionality and symbolic manipulation capabilities.
To create a linear model, we first loaded the experimental data as a 8 x 4 matrix:
data = { {9673389, 2, 5, 17297548}, {9673389, 2, 500, 31234977}, {9673389, 5, 5, 31787432}, {9673389, 5, 500, 63861567}, {98046526, 2, 5, 156334250}, {98046526, 2, 500, 227733331}, {98046526, 5, 5, 274408607}, {98046526, 5, 500, 449638647} };
Secondly, to normalize the variables, we divided the bytes values by a million.
dataN = {#[[1]]/1000000, #[[2]], #[[3]], #[[4]]/1000000} & /@ data;
The above function, for each row of the matrix data, creates a new row with the first and fourth values divided by a million. Then, we called LinearModelFit to create a model. LinearModelFit fits a linear model of the form y=Β_{0}+Β_{1}f_{1}+Β_{2}f_{2}+... under the assumption the original data is independent and normally distributed with means y_{i}. The first argument is the data matrix, the second argument is a list of forms for the variables to take, and the last argument is a list of variable names.
LM = LinearModelFit[dataN, {c, i, p, c i, c p, i p, c i p}, {c, i, p}]; LM[“Function”] -0.0407618 + 0.791836 #1 + 1.021 #2 + 0.387444 #1 #2 + 0.00365643 #3 + 7.59292*10^-6 #1 #3 + 0.00589667 #2 #3 + 0.000652988 #1 #2 #3 &
The output shows the fitted function where #1 stands for c, #2 for i, and #3 for p. The trailing ampersand indicates that this is a function.
For the simplified version, we drop unused forms from the second argument, like so:
LM2 = LinearModelFit[dataN, {c, c i, i p, c i p}, {c, i, p}]; LM2[“Function”] -0.826988 + 0.793753 #1 + 0.0718978 #2 + 0.276974 #1 #2 + 0.43698 #3 + 0.62321 #2 #3 + 0.0703853 #1 #2 #3 &
Validation and Monitoring
To validate the simplified equation, we ran it against the data collected so far. Testing a fitted model only on the training data is a bad idea; it encourages overfitting and can lead you to miss unexpected factors. In the interest of space, we’ve added only one new experiment testing almost 300 megabytes of text, three indices, and a single partition. Looking over the results, the simplified model is a good predictor, predicting the actual value within +/- 5%.
Raw Data Size (bytes) | # Indices | # Partitions | Actual Storage Required (mb) | Predicted Storage (mb) | % Δ |
9,673,389 | 2 | 5 | 17.2 | 17.6 | +2.3 |
9,673,389 | 2 | 500 | 31.2 | 30.3 | -2.9 |
9,673,389 | 5 | 5 | 31.7 | 32.2 | +1.6 |
9,673,389 | 5 | 500 | 63.9 | 64.1 | +0.3 |
98,046,526 | 2 | 5 | 156.3 | 156.9 | +0.4 |
98,046,526 | 2 | 500 | 227.7 | 227.2 | -0.2 |
98,046,526 | 5 | 5 | 274.4 | 274.1 | -0.1 |
98,046,526 | 5 | 500 | 449.6 | 449.9 | +0.1 |
292,691,123 | 3 | 1 | 589.1 | 559.8 | -4.9 |
Conclusion
DoE and regression are a useful pair of techniques for predicting and controlling a system’s output. They have the advantages of being rigorous, are supported by various software tools, and produce models that can be embedded into spreadsheets and dashboards. Furthermore, they don’t require anything more than a black box knowledge of the system and, by following the process, you will document what you know about the system --- a useful side effect.
Be aware, though, that DoE can create a misleading model. One way it can mislead you is if the system has hidden factors. A hidden factor is a factor that is either unknown to you or uncontrollable within the experiment but influences the system’s output. For example, the above model does not include the file system block size as a factor, but block size and the filesystem implementation impact how much space is actually consumed on a disk and filesystems could differ between the test and production systems.
Another way you can be misled is if a design’s assumptions on the nature of the data are violated. For example, a 2^{k} design only needs to test factors at two levels because it assumes that the system reacts linearly and thus a line can be plotted from just two points. If the system is nonlinear, the process (without extra validation steps) will not tell you that the model is invalid. Similarly, regression techniques make different assumptions about the distribution of the data and of errors within the data and differ in their robustness to violations of those assumptions.
Finally, DoE does require an up-front investment in terms of designing and conducting your experiments. If an experiment is nondeterministic, then it will need to be run a sufficient number of times for you to be confident of the mean value. Paradoxically, the value of using DoE increases as the cost of experimentation grows. Alternative designs like fractional factorial, can provide robust models and eliminate the need to run a large number of experiments.
In our case study, we were fortunate to have a deterministic system, linear behavior for the individual factors, and a relatively small number of factors. Our time investment, across both the experiments and analysis, was a couple of days. Looking forward, because models shift over time, we will monitor the performance of our model against the deployed systems. Depending on how the predictions and reality diverge, that will suggest whether we need to recalibrate the existing model or diagnose a hidden factor.
The author has used DoE several times in his career and it has always proved useful in modeling a system and predicting its future behavior.