Regression
From QualtricsWiki
STEPWISE MULTIPLE LINEAR REGRESSION ANALYSIS
REQUIREMENTS
Regression is used to test the effects of n independent (predictor) variables on a single dependent (criterion) variable. Regression tests the deviation about the means, and all variables must be interval scaled. Computationally, regression analysis may be conducted using either a raw data matrix (respondents by variables) or a correlation matrix.
Regression analysis measures the degree of influence of the independent variables on a dependent variable. In the case of a single independent variable, the dependent variable could be predicted from the independent variable by the simple equation:
y = a + bx {where a is constant}
This could be extended to a multi-variable concept as follows:
y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + b_n x_n
It should be noted that whether it be for a single variable or for multiple variables, the relationship predicted is always linear.
A Graphical Explanation of Bi-Variate (2 Variable) Regression Analysis
A simple approach to approximate a regression equation for a single variable is to plot the relationship between the variables. The task requires that we first plot the dependent variable against the independent variable. This type of plotting is called the scatter diagram.
Next, identify the straight line that represents the trend through the middle of the data; this must be the line of "best fit". The regression line identifies the trend, or relationship, between the independent and dependent variables. Once identified, the relationship is used to predict values of the dependent variable for specific values of the independent variable. This predicted relationship is always in the form of a linear trend.
The table below identifies a set of values for an independent (X) and dependent (Y) variable.
| X | 39 | 43 | 21 | 64 | 57 | 47 | 28 | 75 | 34 | 52 |
| Y | 68 | 82 | 56 | 86 | 97 | 94 | 77 | 103 | 59 | 79 |
The scatter plot of the variables is given below:
Regression analysis provides an exact mathematical formulation of this line. The line of best fit is defined as the line that minimizes the sum of the squared deviations of the data points from the line, with the deviations measured along the Y axis. The regression line is therefore also referred to as the least squares line.
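The least squares property can be checked numerically. The sketch below is illustrative only: it uses the X and Y values from the table above, computes the least squares line from the centered sums, and confirms that shifting the intercept or slope slightly always increases the sum of squared deviations. The helper name `sse` and the sizes of the shifts are arbitrary choices, not part of the original text.

```python
import numpy as np

# Data from the table above
x = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52], dtype=float)
y = np.array([68, 82, 56, 86, 97, 94, 77, 103, 59, 79], dtype=float)

def sse(a, b):
    """Sum of squared vertical deviations of the points from y = a + b*x."""
    return float(np.sum((y - (a + b * x)) ** 2))

# Least squares slope and intercept from the centered sums
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()
best = sse(a_hat, b_hat)

# Nudge the line in eight directions; every nudged line should fit worse
nudged = [
    sse(a_hat + da, b_hat + db)
    for da in (-2.0, 0.0, 2.0)
    for db in (-0.05, 0.0, 0.05)
    if (da, db) != (0.0, 0.0)
]
print("SSE of least squares line:", best)
print("smallest SSE among nudged lines:", min(nudged))
```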
In case of a multi-variable regression, the analysis is a sequence of multiple linear regression equations that are developed in a stepwise manner. At each step of the sequence, one variable is added to the regression equation.
The variable added at each step is the one that produces the greatest reduction in the error sum of squares of the sample data. Equivalently, it is the variable with the largest F-to-enter value. Variables whose F-to-enter value falls below the inclusion threshold, typically those with no significant relationship to the dependent variable once the variables already in the equation are accounted for, are left out of the regression equation.
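A minimal sketch of the forward step described above, assuming NumPy is available. The function names (`sse_of_fit`, `forward_stepwise`) and the default threshold are hypothetical, and the sketch only adds variables; it does not implement the removal test that full stepwise programs also apply. The small data set is the Y, X1, X2 table that appears later on this page.

```python
import numpy as np

def sse_of_fit(X, y, cols):
    """Error sum of squares of a least squares fit on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_stepwise(X, y, f_to_enter=0.01):
    """Add one variable per step, always the one giving the largest drop in SSE."""
    n, p = X.shape
    selected, steps = [], []
    current = sse_of_fit(X, y, selected)  # intercept-only model
    while len(selected) < p:
        remaining = [c for c in range(p) if c not in selected]
        trial = {c: sse_of_fit(X, y, selected + [c]) for c in remaining}
        best = min(trial, key=trial.get)
        df_resid = n - (len(selected) + 1) - 1  # n - k - 1 with the candidate included
        if df_resid <= 0:
            break
        f_value = (current - trial[best]) / (trial[best] / df_resid)
        if f_value < f_to_enter:
            break
        selected.append(best)
        current = trial[best]
        steps.append((best, f_value, current))
    return steps

y = np.array([4.5, 22.5, 2.0, 0.5, 18.0, 2.0, 32.0, 4.5, 40.5, 2.0])
X = np.column_stack([
    [8.0, 40.5, 4.5, 0.5, 4.5, 7.0, 24.5, 4.5, 32.0, 0.5],   # X1
    [2.0, 24.5, 0.5, 2.0, 4.5, 8.0, 40.5, 2.0, 24.5, 4.5],   # X2
])
steps = forward_stepwise(X, y)
for col, f_value, sse in steps:
    print(f"entered X{col + 1}: F to enter = {f_value:.4f}, SSE = {sse:.4f}")
```

Each tuple in `steps` records the column entered, its F-to-enter value, and the error sum of squares after it was added.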
Mathematical Computation of the Regression Coefficients
I. With one independent variable: The computation of the regression coefficients for the case of a single independent variable is given below.
The slope (regression coefficient) of the least squares line is given by b, where
b = (n Σxy - Σx Σy) / (n Σx^2 - (Σx)^2)
The intercept of the line is given by a, where a = (mean of y) - b × (mean of x).
Written in terms of the same sums, the intercept is:
a = (Σy - b Σx) / n
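The Residual: The residual is defined as the difference between the actual and predicted values of the dependent variable. The standard error of the estimate is the standard deviation of the residuals, computed with n - 2 degrees of freedom because two coefficients (a and b) are estimated. With ŷ = a + bx denoting the predicted value, it can be calculated as follows:
s = sqrt( Σ(y - ŷ)^2 / (n - 2) )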
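As an illustration, the residuals and the standard error of the estimate can be computed for the X and Y values tabulated earlier. This is a sketch that assumes NumPy; it uses `np.polyfit` to obtain the least squares line, which is a convenience choice rather than the method the page describes.

```python
import numpy as np

x = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52], dtype=float)
y = np.array([68, 82, 56, 86, 97, 94, 77, 103, 59, 79], dtype=float)

# Degree-1 polyfit returns [slope, intercept] of the least squares line
b, a = np.polyfit(x, y, 1)

predicted = a + b * x
residuals = y - predicted          # actual minus predicted

n = len(y)
# Two coefficients were estimated, so the divisor is n - 2
std_error = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print("residuals:", np.round(residuals, 3))
print("standard error of the estimate:", std_error)
```

The result should agree with the "Std Err of Y Estimate" value in the spreadsheet output shown later for the same data.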
A Numerical Example: One Independent Variable
Let us use the data which produced the above graphical representation of a regression analysis.
| No. | y | x | xy | y^2 | x^2 |
| 1 | 68 | 39 | 2652 | 4624 | 1521 |
| 2 | 82 | 43 | 3526 | 6724 | 1849 |
| 3 | 56 | 21 | 1176 | 3136 | 441 |
| 4 | 86 | 64 | 5504 | 7396 | 4096 |
| 5 | 97 | 57 | 5529 | 9409 | 3249 |
| 6 | 94 | 47 | 4418 | 8836 | 2209 |
| 7 | 77 | 28 | 2156 | 5929 | 784 |
| 8 | 103 | 75 | 7725 | 10609 | 5625 |
| 9 | 59 | 34 | 2006 | 3481 | 1156 |
| 10 | 79 | 52 | 4108 | 6241 | 2704 |
| SUM | 801 | 460 | 38800 | 66385 | 23634 |
| AVG. | 80.1 | 46 | 3880 | 6638.5 | 2363.4 |
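Therefore, the slope is given by:
b = (10 × 38800 - 460 × 801) / (10 × 23634 - 460^2) = 19540 / 24740 = 0.789814
and the intercept is given by: a = (mean of Y) - b × (mean of X) = 80.1 - 0.789814 × 46 = 43.768553
Hence the line of best fit is given by: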
Y = 43.768553 + 0.789814 X
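The arithmetic above can be reproduced directly from the SUM row of the table. The short sketch below uses only those column totals; the variable names are illustrative.

```python
# Column totals from the SUM row of the table
n = 10
sum_x, sum_y = 460, 801
sum_xy, sum_x2 = 38800, 23634

# Slope from the computational formula, intercept from the two means
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n

print(f"Y = {intercept:.6f} + {slope:.6f} X")
```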
As an alternative method of deriving the regression equation, a spreadsheet can be used. The single-variable regression line was derived here with Excel. The output from Excel for the above data set is given below:
| Regression Output: | |
|---|---|
| Constant | 43.76855 |
| Std Err of Y Estimate | 9.230407 |
| R Squared | 0.693647 |
| No. of Observations | 10 |
| Degrees of Freedom | 8 |
| X Coefficient(s) | 0.788348 |
| X Coefficient(s) | 0.789814 |
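The R Squared and Degrees of Freedom entries in this output follow from splitting the total variation in Y into an explained and an unexplained part. The sketch below, which assumes NumPy and uses illustrative variable names, recomputes them for the same ten observations.

```python
import numpy as np

x = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52], dtype=float)
y = np.array([68, 82, 56, 86, 97, 94, 77, 103, 59, 79], dtype=float)

b, a = np.polyfit(x, y, 1)      # slope, intercept
fitted = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)        # total variation in Y
ss_resid = np.sum((y - fitted) ** 2)          # unexplained by the line
ss_regr = np.sum((fitted - y.mean()) ** 2)    # explained by the line

r_squared = ss_regr / ss_total
degrees_of_freedom = len(y) - 2               # n minus the two estimated coefficients

print("Constant:", a)
print("X Coefficient:", b)
print("R Squared:", r_squared)
print("Degrees of Freedom:", degrees_of_freedom)
```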
A Numerical Example: Multiple Regression
The computation of the regression coefficients for one or more independent variables involves matrix computations. A brief summary is given below:
Let X be the data matrix of the predictor (independent) variables, including a column of ones (X0) for the constant. Y is the data vector of the criterion (dependent) variable, and b is the vector of regression coefficients, including the constant. The vector of regression coefficients is computed as
b = (X'X)^-1 X'Y
where X' denotes the transpose of X. A sample data set in this layout is shown below:
| Y | X0 | X1 | X2 |
|---|---|---|---|
| 4.50 | 1 | 8.00 | 2.00 |
| 22.50 | 1 | 40.50 | 24.50 |
| 2.00 | 1 | 4.50 | 0.50 |
| 0.50 | 1 | 0.50 | 2.00 |
| 18.00 | 1 | 4.50 | 4.50 |
| 2.00 | 1 | 7.00 | 8.00 |
| 32.00 | 1 | 24.50 | 40.50 |
| 4.50 | 1 | 4.50 | 2.00 |
| 40.50 | 1 | 32.00 | 24.50 |
| 2.00 | 1 | 0.50 | 4.50 |
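For the data in this table, the coefficient vector can be obtained by solving the normal equations (X'X) b = X'Y. The sketch below assumes NumPy; it solves the system with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`. No coefficient values are quoted here because the original page does not report them.

```python
import numpy as np

Y = np.array([4.5, 22.5, 2.0, 0.5, 18.0, 2.0, 32.0, 4.5, 40.5, 2.0])
X1 = np.array([8.0, 40.5, 4.5, 0.5, 4.5, 7.0, 24.5, 4.5, 32.0, 0.5])
X2 = np.array([2.0, 24.5, 0.5, 2.0, 4.5, 8.0, 40.5, 2.0, 24.5, 4.5])

# X0 is the column of ones that carries the constant term
X = np.column_stack([np.ones(len(Y)), X1, X2])

# Normal equations: (X'X) b = X'Y
b_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with NumPy's least squares routine
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("b (constant, X1, X2):", b_normal)
print("methods agree:", np.allclose(b_normal, b_lstsq))
```

At the least squares solution the residual vector Y - Xb is orthogonal to every column of X, which is what the normal equations express.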
VARIABLE LABELS
- V1 'AGE'
- V2 'WEIGHT'
- V3 'VARIABLE 3'
- V4 'HEIGHT'
- V5 'STATUS'
- V6 'DEPENDENT'
| 00025 00025 02500 00150 00034 00064 |
| 01300 00021 02100 00087 00036 00065 |
| 00350 00022 02200 00043 00041 00082 |
| 00175 00009 00130 00180 00015 00023 |
| 00300 00023 02300 00200 00033 00064 |
| 00200 00010 00060 00330 00013 00016 |
| 00550 00007 00140 00340 00016 00012 |
| 00600 00006 00080 00500 00011 00027 |
| 00130 00008 00270 00150 00019 00048 |
| 00500 00018 00360 00180 00027 00050 |
| ...........CONTINUES........... |
| 00500 00022 02200 00120 00039 00100 |
| 00100 00015 00500 00080 00029 00050 |
| 01700 00009 00300 01300 00010 00080 |
| 00500 00030 03500 00090 00058 00065 |
| 00130 00010 00130 00900 00010 00025 |
The Output File: The output of most regression analysis programs contains the following:
- Stepwise input of independent variables
- R-squared value
- Standardized Regression Coefficients
- Unstandardized Regression Coefficients
- Sum of squares, mean squares, F-ratio
- F value to enter a variable in the equation
- F value to remove a variable from the equation
- Summary table of stepwise analysis.
STEPWISE REGRESSION ANALYSIS
NO. OF VARIABLES 6
DATA TREATED AS HAVING NO MISSING VALUES
VARIABLE LABELS
V1 'AGE'
V2 'WEIGHT'
V3 'VARIABLE 3'
V4 'HEIGHT'
V5 'STATUS'
V6 'DEPENDENT'
VARIABLE MEAN STAND. DEV. MINIMUM MAXIMUM
V1 6.9956 6.47375 .25000 30.00000
V2 15.2500 9.35753 2.00000 38.00000
V3 10.4251 11.62703 .40000 38.00000
V4 3.0996 5.99713 .01000 48.00000
V5 25.3971 12.47940 5.00000 58.00000
V6 56.7941 43.55013 7.00000 208.00000
VARIANCE-COVARIANCE MATRIX
V1 .41909E+02
V2 -.10690E+02 .87563E+02
V3 .38646E+00 .94438E+02 .13519E+03
V4 .99188E+01 .56522E+01 .87765E+01 .35966E+02
V5 -.15809E+02 .10266E+03 .10910E+03-.10510E+02 .15574E+03
V6 .23134E+02 .30549E+03 .39803E+03 .89470E+02 .35064E+03 .18966E+04
1 2 3 4 5 6
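Variance-Covariance Matrix: The diagonal entries are the variances of the six variables (the squares of the standard deviations listed above), and the off-diagonal entries are the covariances between each pair of variables. Only the lower triangle is printed because the matrix is symmetric.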
CORRELATION MATRIX
V1 .10000E+01
V2 -.17646E+00 .10000E+01
V3 .51343E-02 .86799E+00 .10000E+01
V4 .25548E+00 .10072E+00 .12587E+00 .10000E+01
V5 -.19569E+00 .87912E+00 .75194E+00-.14044E+00 .10000E+01
V6 .82056E-01 .74962E+00 .78606E+00 .34257E+00 .64517E+00 .10000E+01
1 2 3 4 5 6
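Note: The correlation matrix gives the correlation between every pair of variables. The last row (V6) shows the correlation of each independent variable with the dependent variable. The table is also useful for studying multicollinearity (correlation among the independent variables); here, for example, V2 is highly correlated with V3 (.868) and V5 (.879).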
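The two matrices are linked: each correlation is the corresponding covariance divided by the product of the two standard deviations. The sketch below, which assumes NumPy, rebuilds the correlation matrix from the printed variance-covariance matrix. Because the printed values carry only five significant digits, the results match the printed correlations only to about four decimals.

```python
import numpy as np

# Lower triangle of the variance-covariance matrix as printed above (V1..V6)
lower = [
    [41.909],
    [-10.690, 87.563],
    [0.38646, 94.438, 135.19],
    [9.9188, 5.6522, 8.7765, 35.966],
    [-15.809, 102.66, 109.10, -10.510, 155.74],
    [23.134, 305.49, 398.03, 89.470, 350.64, 1896.6],
]

# Fill in the full symmetric matrix
cov = np.zeros((6, 6))
for i, row in enumerate(lower):
    for j, value in enumerate(row):
        cov[i, j] = cov[j, i] = value

sd = np.sqrt(np.diag(cov))            # standard deviations from the diagonal
corr = cov / np.outer(sd, sd)         # r_ij = cov_ij / (sd_i * sd_j)

print(np.round(corr, 4))
```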
SUB-PROBLEM 1
DEPENDENT VARIABLE V6
MAXIMUM NUMBER OF STEPS 12
F-LEVEL FOR INCLUSION .010000
F-LEVEL FOR DELETION .005000
TOLERANCE LEVEL .001000
**********************************************************************
STEP NUMBER VARIABLE ACTION R-SQUARED
1 V3 ENTERED .6179
**********************************************************************
STEP NUMBER 1
VARIABLE ENTERED V3
MULTIPLE R .7861
STD. ERROR OF EST. 27.1238
ANALYSIS OF VARIANCE
DF[1] SUM OF SQUARES MEAN SQUARE[2] F-RATIO[3] F-PROB.[4]
REGRESSION 1 78516.9400 78516.9400 106.7242 .0000
RESIDUAL 66 48556.1800 735.6997
1. The degrees of freedom for the regression (the explained part of the model) are given by k, where k = number of independent variables in the regression equation.
For the residual (the error left unexplained by the regression model), the degrees of freedom are given by (n-k-1), where n = number of observations in the data set. Here the residual has 66 degrees of freedom with k = 1, so n = 68.
2. Mean Square = (Sum of Squares)/(DF)
3. F Ratio = (Mean Square of the Regression)/(Mean Square of the Residual)
4. F-Prob = Level of significance corresponding to the F Value
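Notes 1 to 3 can be verified against the step 1 analysis of variance table. The sketch below uses the sums of squares printed in that table; n = 68 is inferred from the residual degrees of freedom rather than stated in the output. The F-probability is left out because it requires an F-distribution routine from a statistics library.

```python
# Step 1 values from the ANALYSIS OF VARIANCE table
n = 68                      # inferred: residual DF (66) + k (1) + 1
k = 1                       # one independent variable (V3) in the equation
ss_regression = 78516.94
ss_residual = 48556.18

df_regression = k
df_residual = n - k - 1

ms_regression = ss_regression / df_regression     # mean square = SS / DF
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

r_squared = ss_regression / (ss_regression + ss_residual)
std_error_of_est = ms_residual ** 0.5

print("DF:", df_regression, df_residual)
print("F-ratio:", round(f_ratio, 4))
print("R-squared:", round(r_squared, 4))
print("Std. error of est.:", round(std_error_of_est, 4))
```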
VARIABLES IN EQUATION . VARIABLES NOT IN EQUATION
STD.[1] UNSTD.[2] . PARTIAL
VAR COEFF. COEFF STD. ERROR F TO REMOVE . VAR. CORR. TOLERANCE F TO ENTER
(CONST[3] 26.09981) .
V3 .78606 2.94426 .28500 106.7242 (2) . V1 .12622 1.0000 1.0523 (2)
. V2 .21932 .2466 3.2847 (2)
. V4 .39729 .9842 12.1821 (2)
. V5 .13276 .4346 1.1662 (2)
******************************************************************************
STEP NUMBER VARIABLE ACTION R-SQUARED[4]
2 V4 ENTERED .6782
******************************************************************************
STEP NUMBER 2
VARIABLE ENTERED V4
MULTIPLE R .8235
STD. ERROR OF EST. 25.0821
ANALYSIS OF VARIANCE
DF SUM OF SQUARES MEAN SQUARE F-RATIO F-PROB.
REGRESSION 2 86180.8500 43090.4300 68.4941 .0000
RESIDUAL 65 40892.2700 629.1118
.
VARIABLES IN EQUATION . VARIABLES NOT IN EQUATION
STD. UNSTD. . PARTIAL
VAR COEFF. COEFF STD. ERROR F TO REMOVE . VAR. CORR. TOLERANCE F TO ENTER
(CONST 21.74448) .
V3 .75490 2.82755 .26566 113.2843 (2) . V1 .02724 .9340 .0475 (2)
V4 .24755 1.79768 .51505 12.1821 (2) . V2 .24653 .2465 4.1414 (2)
. V5 .32179 .3784 7.3925 (2)
1. Std. Coeff = Standardized Coefficient of Regression for the independent variable.
2. Unstd. Coeff = Unstandardized Coefficient of Regression for the independent variable.
3. Const = The Intercept of the Regression Equation
4. The R-Squared value is the proportion of the variation in the dependent variable that is
explained by the regression equation. It is the square of MULTIPLE R (for step 2, .8235 squared = .6782).
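Notes 1 and 2 are related by a simple rescaling: a standardized coefficient equals the unstandardized coefficient multiplied by the ratio of the standard deviation of the independent variable to that of the dependent variable. The sketch below checks this for step 2, using the standard deviations from the descriptive statistics table; the dictionary names are illustrative.

```python
# Standard deviations from the descriptive statistics table
sd = {"V3": 11.62703, "V4": 5.99713, "V6": 43.55013}

# Unstandardized coefficients reported at step 2
unstd_step2 = {"V3": 2.82755, "V4": 1.79768}

# Standardized coefficient = unstandardized coefficient * sd(x) / sd(y)
std_step2 = {var: coef * sd[var] / sd["V6"] for var, coef in unstd_step2.items()}

for var, value in std_step2.items():
    print(var, round(value, 5))
```

The results should agree with the STD. COEFF. column at step 2 (.75490 for V3 and .24755 for V4).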
******************************************************************************
STEP NUMBER VARIABLE ACTION R-SQUARED
3 V5 ENTERED .7115
******************************************************************************
STEP NUMBER 3
VARIABLE ENTERED V5
MULTIPLE R .8435
STD. ERROR OF EST. 23.9328
ANALYSIS OF VARIANCE
DF SUM OF SQUARES MEAN SQUARE F-RATIO F-PROB.
REGRESSION 3 90415.1500 30138.3800 52.6177 .0000
RESIDUAL 64 36657.9700 572.7808
.
VARIABLES IN EQUATION . VARIABLES NOT IN EQUATION
STD. UNSTD. . PARTIAL
VAR COEFF. COEFF STD. ERROR F TO REMOVE . VAR. CORR. TOLERANCE F TO ENTER .
(CONST 2.91075) .
V3 .52285 1.95840 .40797 23.0429 (2) . V1 .11111 .8832 .7875 (2)
V4 .31843 2.31239 .52665 19.2786 (2) . V2 .01574 .1134 .0156 (2)
V5 .29673 1.03553 .38086 7.3925 (2) .
******************************************************************************
STEP NUMBER VARIABLE ACTION R-SQUARED
4 V1 ENTERED .7151
******************************************************************************
STEP NUMBER 4
VARIABLE ENTERED V1
MULTIPLE R .8456
STD. ERROR OF EST. 23.9727
ANALYSIS OF VARIANCE
DF SUM OF SQUARES MEAN SQUARE F-RATIO F-PROB.
REGRESSION 4 90867.7400 22716.9400 39.5291 .0000
RESIDUAL 63 36205.3800 574.6885
.
VARIABLES IN EQUATION . VARIABLES NOT IN EQUATION
STD. UNSTD. . PARTIAL
VAR COEFF. COEFF STD. ERROR F TO REMOVE . VAR. CORR. TOLERANCE F TO ENTER .
(CONST -1.25280) .
V1 .06350 .42721 .48139 .7875 (2) . V2 .05241 .1029 .1708 (2)
V3 .50640 1.89677 .41451 20.9389 (2) .
V4 .30755 2.23335 .53500 17.4266 (2) .
V5 .32000 1.11674 .39232 8.1027 (2) .
******************************************************************************
STEP NUMBER VARIABLE ACTION R-SQUARED
5 V2 ENTERED .7159
******************************************************************************
STEP NUMBER 5
VARIABLE ENTERED V2
MULTIPLE R .8461
STD. ERROR OF EST. 24.1320
ANALYSIS OF VARIANCE
DF SUM OF SQUARES MEAN SQUARE F-RATIO F-PROB.
REGRESSION 5 90967.1800 18193.4400 31.2412 .0000
RESIDUAL 62 36105.9400 582.3538
VARIABLES IN EQUATION . VARIABLES NOT IN EQUATION
STD. UNSTD. . PARTIAL
VAR COEFF. COEFF STD. ERROR F TO REMOVE . VAR. CORR. TOLERANCE F TO ENTER .
(CONST -1.84213) .
V1 .07303 .49129 .50880 .9323 (2) .
V2 .08720 .40584 .98214 .1708 (2) .
V3 .46858 1.75512 .54002 10.5633 (2) .
V4 .29432 2.13728 .58658 13.2759 (2) .
V5 .27179 .94847 .56726 2.7956 (2) .
F-LEVEL OR TOLERANCE INSUFFICIENT FOR FURTHER COMPUTATION
SUMMARY TABLE
STEP VARIABLE MULTIPLE INCREASE F VALUE TO SUM OF SQ.
NUMBER ENTERED REMOVED R RSQ IN RSQ[1] ENTER OR REMOVE ADDED
1 V3 .7861 .6179 .6179 106.7242 78516.9400
2 V4 .8235 .6782 .0603 12.1821 7663.9170
3 V5 .8435 .7115 .0333 7.3925 4234.2900
4 V1 .8456 .7151 .0036 .7875 452.6008
5 V2 .8461 .7159 .0008 .1708 99.4410
COMPLETION OF STEPWISE REGRESSION ANALYSIS
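1. A variable is added as long as its F-to-enter value exceeds the F-level for inclusion (.01 in this run) and its tolerance exceeds the tolerance level (.001). With thresholds this low, nearly any variable that increases the
R-Square value is entered, even when the increase is not statistically significant, as with V1 and V2 above.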
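The columns of the summary table can be derived from one another. In the sketch below, which uses only figures printed in the output above, the increase in R-squared at each step is the sum of squares added divided by the total sum of squares, and the F value to enter is the sum of squares added divided by the residual mean square of the enlarged model. The residual degrees of freedom are taken from the analysis of variance tables.

```python
# Total sum of squares = regression SS + residual SS at step 1
total_ss = 78516.94 + 48556.18

# (variable entered, sum of squares added, residual DF after entry)
summary = [
    ("V3", 78516.94, 66),
    ("V4", 7663.917, 65),
    ("V5", 4234.29, 64),
    ("V1", 452.6008, 63),
    ("V2", 99.441, 62),
]

explained = 0.0
results = []
for name, ss_added, df_resid in summary:
    explained += ss_added
    ms_resid = (total_ss - explained) / df_resid   # residual mean square after entry
    f_to_enter = ss_added / ms_resid
    rsq_increase = ss_added / total_ss
    results.append((name, rsq_increase, f_to_enter))
    print(f"{name}: increase in RSQ = {rsq_increase:.4f}, F to enter = {f_to_enter:.4f}")
```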












