Regression

From QualtricsWiki


STEPWISE MULTIPLE LINEAR REGRESSION ANALYSIS

REQUIREMENTS

Regression is used to test the effects of n independent (predictor) variables on a single dependent (criterion) variable. Regression tests the deviation about the means, and all variables must be interval scaled. Computationally, regression analysis may be conducted using either a raw data matrix (respondents by variables) or a correlation matrix.

Regression analysis measures the degree of influence of the independent variables on a dependent variable. In the case of a single independent variable, the dependent variable could be predicted from the independent variable by the simple equation:

y = a + bx {where a is the constant (intercept) and b is the slope}

This could be extended to a multi-variable concept as follows:

y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + b_n x_n

Whether there is one independent variable or several, the predicted relationship is always linear.

A Graphical Explanation of Bi-Variate (2 Variable) Regression Analysis

A simple way to approximate a regression equation for a single independent variable is to plot the relationship between the variables. First, plot the dependent variable against the independent variable. This type of plot is called a scatter diagram.

Next, identify the straight line that represents the trend through the middle of the data; this line must be the one with the 'best fit'. The regression line identifies the trend, or relationship, between the independent and dependent variables. Once identified, the relationship is used to predict values of the dependent variable for specific values of the independent variable. The predicted relationship always takes the form of a linear trend.

The table below identifies a set of values for an independent (X) and dependent (Y) variable.

X 39 43 21 64 57 47 28 75 34 52
Y 68 82 56 86 97 94 77 103 59 79

The scatter plot of the variables is given below:

Image:Reg002.gif

Regression analysis is used to develop an exact mathematical formulation of this line. The line of best fit is defined as the line that minimizes the sum of the squared deviations of the data points from the line. The regression line is therefore also referred to as the least squares line.

In case of a multi-variable regression, the analysis is a sequence of multiple linear regression equations that are developed in a stepwise manner. At each step of the sequence, one variable is added to the regression equation.

The variable added is the one that makes the greatest reduction in the error sum of squares of the sample data. Equivalently, it is the candidate variable with the largest F-to-enter value. Variables whose F-to-enter value falls below the inclusion threshold, typically those not significantly related to the dependent variable once the entered variables are accounted for, are left out of the regression equation.
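The selection rule described above can be sketched in Python with NumPy. This is a minimal forward-selection sketch, not the procedure of any particular statistics package: the function names `sse` and `forward_stepwise`, the `f_to_enter` threshold of 4.0, and the synthetic demo data are all assumptions made for illustration, and the removal step (F to remove) is omitted.

```python
import numpy as np

def sse(X, y):
    """Error sum of squares of the least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_stepwise(X, y, f_to_enter=4.0):
    """Add one variable per step; stop when the best F-to-enter is too small.

    f_to_enter is a hypothetical threshold chosen for this sketch.
    Assumes more observations than variables plus one.
    """
    n, p = X.shape
    selected = []
    current_sse = float(((y - y.mean()) ** 2).sum())  # intercept-only model
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        # Error sum of squares after adding each remaining candidate.
        trial = {j: sse(X[:, selected + [j]], y) for j in candidates}
        best = min(trial, key=trial.get)              # greatest reduction in SSE
        df_resid = n - (len(selected) + 1) - 1        # n - k - 1 after entry
        f_value = (current_sse - trial[best]) / (trial[best] / df_resid)
        if f_value < f_to_enter:
            break
        selected.append(best)
        current_sse = trial[best]
    return selected

# Synthetic demo data (an assumption): y depends strongly on column 2,
# more weakly on column 0, and not at all on columns 1 and 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 2] + 1.5 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)
order = forward_stepwise(X_demo, y_demo)
print("entry order:", order)
```

Because the variable with the smallest trial error sum of squares also has the largest F-to-enter value at that step, picking by either criterion gives the same entry order.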

Mathematical Computation of the Regression Coefficients

I. With One Independent Variable: The computation of the regression coefficients for the case of a single independent variable is given below:

The slope (regression coefficient) for the line of least squares is given by b, where

Image:Reg004.gif

The intercept of the line is given by a, where a = ȳ - b x̄, and ȳ and x̄ are the sample means of y and x.

The mathematical formula used for this computation is as follows:

Image:Reg006.gif

The Residual: The residual is defined as the difference between the actual and predicted values of the dependent variable. The standard error of the estimate is the standard deviation of the residuals. The standard error of the estimate can be calculated as follows:

Image:Reg008.gif
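For reference, the textbook least-squares expressions for the quantities described above can be written as follows. This is a reconstruction under the assumption that the formula images showed the standard forms; the symbols n (number of observations), \bar{x} and \bar{y} (sample means) and \hat{y}_i (predicted value for observation i) are notation introduced here.

```latex
\begin{aligned}
b   &= \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \\
a   &= \bar{y} - b\,\bar{x}, \\
\hat{y}_i &= a + b\,x_i, \\
s_e &= \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}}.
\end{aligned}
```

The divisor n - 2 in the standard error reflects the two estimated coefficients (a and b), which matches the 8 degrees of freedom reported for 10 observations in the numerical example.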

A Numerical Example: One Independent Variable

Let us use the data which produced the above graphical representation of a regression analysis.

SL.No     y     x     xy     y²     x²
1        68    39   2652   4624   1521
2        82    43   3526   6724   1849
3        56    21   1176   3136    441
4        86    64   5504   7396   4096
5        97    57   5529   9409   3249
6        94    47   4418   8836   2209
7        77    28   2156   5929    784
8       103    75   7725  10609   5625
9        59    34   2006   3481   1156
10       79    52   4108   6241   2704
SUM     801   460  38800  66385  23634
AVG.   80.1    46   3880 6638.5 2363.4

Image:Reg010.gif

Image:Reg012.gif

Image:Reg014.gif

Image:Reg016.gif

Image:Reg018.gif

Therefore, the slope is given by:

Image:Reg020.gif

and the intercept is given by: a = ȳ - b x̄ = 80.1 - 0.789814 × 46 = 43.768553

Hence the line of best fit is given by :

Y = 43.768553 + 0.789814 X
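The hand calculation can be checked with a short script. The sketch below assumes Python with NumPy and uses the standard sums-based slope formula; the variable names are illustrative only.

```python
import numpy as np

# Data from the table in this example.
x = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52], dtype=float)
y = np.array([68, 82, 56, 86, 97, 94, 77, 103, 59, 79], dtype=float)
n = len(x)

# Slope from the column sums, intercept from the means.
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
a = y.mean() - b * x.mean()

# Residuals, standard error of the estimate (n - 2 degrees of freedom), R squared.
residuals = y - (a + b * x)
std_err = np.sqrt((residuals ** 2).sum() / (n - 2))
r_squared = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(f"slope b      = {b:.6f}")
print(f"intercept a  = {a:.6f}")
print(f"std. error   = {std_err:.6f}")
print(f"R squared    = {r_squared:.6f}")
```

The slope and intercept agree with the values derived by hand (0.789814 and 43.768553).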

Alternatively, the regression equation can be derived with a spreadsheet. The Excel output for the above data set is given below:

Regression Output:
Constant 43.76855
Std Err of Y Estimate 9.230407
R Squared 0.693647
No. of Observations 10
Degrees of Freedom 8
X Coefficient(s) 0.788348
X Coefficient(s) 0.789814


A Numerical Example: Multiple Regression

The computation of the regression coefficients for one or more independent variables involves matrix computations. A brief summary is given below:

Let X be the data matrix of the predictor (independent) variables, including a column of ones (X0) for the constant. Y is the data vector representing the criterion (dependent) variable, and b is the vector of regression coefficients, including the constant. The vector of regression coefficients is computed as b = (X'X)^-1 X'Y, where X' denotes the transpose of X.


Y X0 X1 X2
4.50 1 8.00 2.00
22.50 1 40.50 24.50
2.00 1 4.50 0.50
0.50 1 0.50 2.00
18.00 1 4.50 4.50
2.00 1 7.00 8.00
32.00 1 24.50 40.50
4.50 1 4.50 2.00
40.50 1 32.00 24.50
2.00 1 0.50 4.50


Image:Reg-xprime.png

Image:Reg-xprimex.png Image:Reg-xprimexinv.png

Image:Reg-xprimey.png

Image:Reg-b.png
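The matrix steps (X', X'X, its inverse, X'Y and b) can be reproduced for the ten observations listed above. The sketch below assumes Python with NumPy and the usual normal-equations form b = (X'X)^-1 X'Y; the cross-check against `numpy.linalg.lstsq` is an addition for illustration.

```python
import numpy as np

# Data from the table above: Y, and the predictors X1 and X2.
y = np.array([4.5, 22.5, 2.0, 0.5, 18.0, 2.0, 32.0, 4.5, 40.5, 2.0])
x1 = np.array([8.0, 40.5, 4.5, 0.5, 4.5, 7.0, 24.5, 4.5, 32.0, 0.5])
x2 = np.array([2.0, 24.5, 0.5, 2.0, 4.5, 8.0, 40.5, 2.0, 24.5, 4.5])

# X0 is the column of ones that carries the constant term.
X = np.column_stack([np.ones_like(x1), x1, x2])

XtX = X.T @ X                  # X'X  (3 x 3)
XtX_inv = np.linalg.inv(XtX)   # (X'X)^-1
Xty = X.T @ y                  # X'Y  (3 x 1)
b = XtX_inv @ Xty              # constant, coefficient of X1, coefficient of X2

# Cross-check with a least-squares solver, which avoids the explicit inverse.
b_check, *_ = np.linalg.lstsq(X, y, rcond=None)

print("X'X =\n", XtX)
print("b   =", b)
```

For larger or nearly collinear problems a least-squares solver is generally preferred to forming the inverse explicitly, since it is numerically more stable.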

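The Input File: The input for the stepwise example below consists of the variable labels followed by the data records, six fields per record (one per variable):

VARIABLE LABELS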

V1 'AGE'
V2 'WEIGHT'
V3 'VARIABLE 3'
V4 'HEIGHT'
V5 'STATUS'
V6 'DEPENDENT'


00025 00025 02500 00150 00034 00064
01300 00021 02100 00087 00036 00065
00350 00022 02200 00043 00041 00082
00175 00009 00130 00180 00015 00023
00300 00023 02300 00200 00033 00064
00200 00010 00060 00330 00013 00016
00550 00007 00140 00340 00016 00012
00600 00006 00080 00500 00011 00027
00130 00008 00270 00150 00019 00048
00500 00018 00360 00180 00027 00050
...........CONTINUES...........
00500 00022 02200 00120 00039 00100
00100 00015 00500 00080 00029 00050
01700 00009 00300 01300 00010 00080
00500 00030 03500 00090 00058 00065
00130 00010 00130 00900 00010 00025


The Output File: The output of most regression analysis programs contains the following:

  1. Stepwise input of independent variables
  2. R-squared value
  3. Standard Regression Coefficients
  4. Unstandardized Regression Coefficients
  5. Sum of squares, mean squares, F-ratio
  6. F value to enter a variable in the equation
  7. F value to remove a variable from the equation
  8. Summary table of stepwise analysis.


                          STEPWISE REGRESSION ANALYSIS

 

NO. OF VARIABLES       6

DATA TREATED AS HAVING NO MISSING VALUES

VARIABLE LABELS

        V1        'AGE'

        V2        'WEIGHT'

        V3        'VARIABLE 3'

        V4        'HEIGHT'

        V5        'STATUS'

        V6        'DEPENDENT'

 

 VARIABLE        MEAN      STAND. DEV.    MINIMUM       MAXIMUM

 V1             6.9956      6.47375        .25000      30.00000

 V2            15.2500      9.35753       2.00000      38.00000

 V3            10.4251     11.62703        .40000      38.00000

 V4             3.0996      5.99713        .01000      48.00000

 V5            25.3971     12.47940       5.00000      58.00000

 V6            56.7941     43.55013       7.00000     208.00000

 

 VARIANCE-COVARIANCE MATRIX

 V1   .41909E+02

 V2  -.10690E+02 .87563E+02

 V3   .38646E+00 .94438E+02 .13519E+03

 V4   .99188E+01 .56522E+01 .87765E+01 .35966E+02

 V5  -.15809E+02 .10266E+03 .10910E+03-.10510E+02 .15574E+03

 V6   .23134E+02 .30549E+03 .39803E+03 .89470E+02 .35064E+03 .18966E+04

          1          2          3          4          5          6

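Variance-Covariance Matrix: The diagonal entries are the variances of the six variables (the squares of the standard deviations listed above), and the off-diagonal entries are the covariances between pairs of variables. Only the lower triangle is printed because the matrix is symmetric.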

 

 CORRELATION MATRIX

 V1  .10000E+01

 V2 -.17646E+00 .10000E+01

 V3  .51343E-02 .86799E+00 .10000E+01

 V4  .25548E+00 .10072E+00 .12587E+00 .10000E+01

 V5 -.19569E+00 .87912E+00 .75194E+00-.14044E+00 .10000E+01

 V6  .82056E-01 .74962E+00 .78606E+00 .34257E+00 .64517E+00 .10000E+01

         1          2          3          4          5          6

 

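Note: The correlation matrix gives the correlation between every pair of variables. The last row (V6) shows the correlation of each independent variable with the dependent variable. The table is also useful for studying multicollinearity (correlation among the independent variables); for example, V2 is correlated .868 with V3 and .879 with V5.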
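The two matrices are linked: each correlation is the corresponding covariance divided by the product of the two standard deviations. The sketch below, assuming Python with NumPy, rebuilds the correlation matrix from the printed variance-covariance matrix; small differences in the last digit are expected because the printed values are rounded.

```python
import numpy as np

# Lower triangle of the variance-covariance matrix, copied from the output above.
lower = [
    [41.909],
    [-10.690, 87.563],
    [0.38646, 94.438, 135.19],
    [9.9188, 5.6522, 8.7765, 35.966],
    [-15.809, 102.66, 109.10, -10.510, 155.74],
    [23.134, 305.49, 398.03, 89.470, 350.64, 1896.6],
]

# Fill a full symmetric 6 x 6 matrix from the lower triangle.
cov = np.zeros((6, 6))
for i, row in enumerate(lower):
    for j, value in enumerate(row):
        cov[i, j] = cov[j, i] = value

sd = np.sqrt(np.diag(cov))          # standard deviations of V1..V6
corr = cov / np.outer(sd, sd)       # r_ij = cov_ij / (sd_i * sd_j)

np.set_printoptions(precision=4, suppress=True)
print(corr)
```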

   SUB-PROBLEM                     1

   DEPENDENT VARIABLE          V6       

   MAXIMUM NUMBER OF STEPS        12

   F-LEVEL FOR INCLUSION     .010000

   F-LEVEL FOR DELETION      .005000

   TOLERANCE LEVEL           .001000

 

**********************************************************************         

 STEP NUMBER   VARIABLE   ACTION   R-SQUARED

      1          V3       ENTERED     .6179

 

**********************************************************************   

 STEP NUMBER     1

 VARIABLE ENTERED  V3       

 MULTIPLE R                .7861

 STD. ERROR OF EST.      27.1238

 

    ANALYSIS OF VARIANCE

              DF[1]  SUM OF SQUARES   MEAN SQUARE[2] F-RATIO[3]  F-PROB.[4]

REGRESSION    1       78516.9400      78516.9400   106.7242      .0000

RESIDUAL      66      48556.1800        735.6997

 

1. The degrees of freedom for the regression model, also called the explained model, are given by k, where k = number of independent variables in the regression equation.

For the residual (the error left unexplained by the regression model), the degrees of freedom are given by (n-k-1), where n = number of observations in the data set. Here the residual has 66 degrees of freedom with k = 1, so n = 68.

2. Mean Square = (Sum of Squares)/(DF)

3. F Ratio = (Mean Square of the Regression)/(Mean Square of the Residual)

4. F-Prob = Level of significance corresponding to the F Value
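Notes 1 to 3 can be checked against the step 1 table with a few lines of arithmetic. The sketch below is plain Python; n = 68 is inferred from the 66 residual degrees of freedom, and the R-squared and standard-error lines use the standard definitions (regression sum of squares over total, and the square root of the residual mean square).

```python
# Sums of squares from the step 1 analysis of variance table.
ss_regression = 78516.94
ss_residual = 48556.18

k = 1    # independent variables in the equation (V3 only)
n = 68   # observations, inferred from residual DF = n - k - 1 = 66

df_regression = k
df_residual = n - k - 1

ms_regression = ss_regression / df_regression   # mean square = SS / DF
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

r_squared = ss_regression / (ss_regression + ss_residual)
std_error_of_estimate = ms_residual ** 0.5

print(f"residual mean square = {ms_residual:.4f}")
print(f"F ratio              = {f_ratio:.4f}")
print(f"R squared            = {r_squared:.4f}")
print(f"std. error of est.   = {std_error_of_estimate:.4f}")
```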

 

                  VARIABLES IN EQUATION       .   VARIABLES NOT IN EQUATION

     STD.[1] UNSTD.[2]                       .      PARTIAL

VAR  COEFF.  COEFF   STD. ERROR  F TO REMOVE  . VAR.  CORR. TOLERANCE  F TO ENTER

(CONST[3] 26.09981)                          .

V3  .78606  2.94426   .28500     106.7242 (2) .   V1  .12622  1.0000  1.0523 (2) 

                                              .   V2  .21932   .2466   3.2847 (2)

                                              .   V4  .39729   .9842  12.1821 (2)

                                              .   V5  .13276   .4346   1.1662 (2)

 

 ******************************************************************************

       STEP NUMBER   VARIABLE   ACTION   R-SQUARED[4]

          2             V4      ENTERED     .6782

 ******************************************************************************

    STEP NUMBER     2

    VARIABLE ENTERED  V4       

    MULTIPLE R                .8235

    STD. ERROR OF EST.      25.0821

 

    ANALYSIS OF VARIANCE

                DF    SUM OF SQUARES    MEAN SQUARE    F-RATIO    F-PROB.

REGRESSION      2       86180.8500      43090.4300    68.4941      .0000

RESIDUAL       65       40892.2700        629.1118

                

                                              .

                  VARIABLES IN EQUATION       .   VARIABLES NOT IN EQUATION

      STD.   UNSTD.                           .       PARTIAL

VAR  COEFF.  COEFF   STD. ERROR  F TO REMOVE  .  VAR. CORR. TOLERANCE F TO ENTER

(CONST 21.74448)                              .

V3  .75490  2.82755  .26566      113.2843 (2) .  V1   .02724  .9340    .0475 (2)

V4  .24755  1.79768  .51505       12.1821 (2) .  V2   .24653  .2465   4.1414 (2)

                                              .  V5   .32179  .3784   7.3925 (2) 

 

1.  Std. Coeff = Standardized Coefficient of Regression for the independent variable.

2.  Unstd. Coeff = Unstandardized Coefficient of Regression for the independent variable.

3.  Const = The Intercept of the Regression Equation

4.  The R-Squared Value gives the proportion of the variance in the dependent variable that is

    explained by the regression equation.
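The relationship between notes 1 to 3 can be illustrated with the step 2 figures. A standardized coefficient is the unstandardized coefficient multiplied by the ratio of the predictor's standard deviation to that of the dependent variable, and the constant is the mean of the dependent variable minus the coefficient-weighted means of the predictors. The sketch below is plain Python using the means and standard deviations printed earlier in the output; these are the standard identities, assumed here to be what the program applies.

```python
# Means and standard deviations from the descriptive statistics in the output.
mean = {"V3": 10.4251, "V4": 3.0996, "V6": 56.7941}
sd = {"V3": 11.62703, "V4": 5.99713, "V6": 43.55013}

# Unstandardized coefficients at step 2 (V3 and V4 in the equation).
unstd = {"V3": 2.82755, "V4": 1.79768}

# Standardized coefficient = unstandardized * sd(predictor) / sd(dependent).
std_coeff = {var: coef * sd[var] / sd["V6"] for var, coef in unstd.items()}

# Constant = mean(dependent) - sum of coefficient * mean(predictor).
constant = mean["V6"] - sum(coef * mean[var] for var, coef in unstd.items())

for var in unstd:
    print(f"{var}: standardized coefficient = {std_coeff[var]:.5f}")
print(f"constant = {constant:.4f}")
```

The results agree with the step 2 table (.75490, .24755 and 21.74448) up to rounding of the printed inputs.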

 

 

******************************************************************************

         STEP NUMBER   VARIABLE   ACTION   R-SQUARED

                3        V5       ENTERED     .7115

******************************************************************************

    STEP NUMBER     3

    VARIABLE     ENTERED  V5       

    MULTIPLE R                .8435

    STD. ERROR OF EST.      23.9328

 

    ANALYSIS OF VARIANCE

                DF    SUM OF SQUARES    MEAN SQUARE    F-RATIO    F-PROB.

REGRESSION      3       90415.1500      30138.3800    52.6177      .0000

RESIDUAL       64       36657.9700        572.7808

                 

                                              .

                  VARIABLES IN EQUATION       .   VARIABLES NOT IN EQUATION

      STD.   UNSTD.                           .       PARTIAL

VAR  COEFF.  COEFF   STD. ERROR  F TO REMOVE  .  VAR. CORR. TOLERANCE F TO ENTER

(CONST 2.91075)                               .

V3  .52285  1.95840   .40797      23.0429 (2) .  V1   .11111  .8832   .7875 (2) 

V4  .31843  2.31239   .52665      19.2786 (2) .  V2   .01574  .1134   .0156 (2) 

V5  .29673  1.03553   .38086       7.3925 (2) .

 

 ******************************************************************************

          STEP NUMBER   VARIABLE   ACTION   R-SQUARED

                 4        V1       ENTERED     .7151

 ******************************************************************************  

    STEP NUMBER     4

    VARIABLE     ENTERED  V1       

    MULTIPLE R                .8456

    STD. ERROR OF EST.      23.9727

 

    ANALYSIS OF VARIANCE

                DF    SUM OF SQUARES    MEAN SQUARE    F-RATIO    F-PROB.

REGRESSION      4       90867.7400      22716.9400    39.5291      .0000

RESIDUAL       63       36205.3800        574.6885

                                              .

                  VARIABLES IN EQUATION       .   VARIABLES NOT IN EQUATION

      STD.   UNSTD.                           .       PARTIAL

VAR  COEFF.  COEFF   STD. ERROR  F TO REMOVE  .  VAR. CORR. TOLERANCE F TO ENTER

(CONST -1.25280)                              .

V1  .06350   .42721   .48139        .7875 (2) .  V2   .05241   .1029   .1708 (2) 

V3  .50640  1.89677   .41451      20.9389 (2) .

V4  .30755  2.23335   .53500      17.4266 (2) .

V5  .32000  1.11674   .39232       8.1027 (2) .

 

******************************************************************************

         STEP NUMBER   VARIABLE   ACTION   R-SQUARED

                5        V2       ENTERED     .7159

******************************************************************************

    STEP NUMBER     5

    VARIABLE     ENTERED  V2       

    MULTIPLE R                .8461

    STD. ERROR OF EST.      24.1320

 

    ANALYSIS OF VARIANCE

                DF    SUM OF SQUARES    MEAN SQUARE    F-RATIO    F-PROB.

REGRESSION      5       90967.1800      18193.4400    31.2412      .0000

RESIDUAL       62       36105.9400        582.3538

 

                  VARIABLES IN EQUATION       .   VARIABLES NOT IN EQUATION

      STD.   UNSTD.                           .       PARTIAL

VAR  COEFF.  COEFF   STD. ERROR  F TO REMOVE  .  VAR. CORR. TOLERANCE F TO ENTER

(CONST  -1.84213)                             .

V1   .07303  .49129     .50880      .9323 (2) .

V2   .08720  .40584     .98214      .1708 (2) .

V3   .46858 1.75512     .54002    10.5633 (2) .

V4   .29432 2.13728     .58658    13.2759 (2) .

V5   .27179  .94847     .56726     2.7956 (2) .

F-LEVEL OR TOLERANCE INSUFFICIENT FOR FURTHER COMPUTATION

 

  SUMMARY TABLE

STEP        VARIABLE       MULTIPLE       INCREASE   F VALUE TO      SUM OF SQ.

NUMBER  ENTERED  REMOVED  R         RSQ    IN RSQ[1] ENTER OR REMOVE           ADDED

  1      V3             .7861      .6179    .6179      106.7242      78516.9400

  2      V4             .8235      .6782    .0603       12.1821        7663.9170  

  3      V5             .8435      .7115    .0333        7.3925        4234.2900  

  4      V1             .8456      .7151    .0036         .7875         452.6008  

  5      V2             .8461      .7159    .0008         .1708          99.4410 

 

  COMPLETION OF STEPWISE REGRESSION ANALYSIS

 

1.  A variable is added as long as its F-to-enter value meets the F-level for inclusion (.01 in this run).
    With so low a threshold, nearly any variable that increases the R-Square value is admitted: V1 and V2 enter even though they raise R-Square by only .0036 and .0008.
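The columns of the summary table can be derived from the residual sums of squares in the five analysis of variance tables. The sketch below is plain Python; it assumes the usual definitions, namely that the sum of squares added is the drop in the residual sum of squares, the increase in R-Square is that drop divided by the total sum of squares, and F-to-enter is the drop divided by the residual mean square after the variable has entered.

```python
# Residual sum of squares and residual degrees of freedom after steps 1 to 5,
# copied from the analysis of variance tables in the output.
entered = ["V3", "V4", "V5", "V1", "V2"]
ss_residual = [48556.18, 40892.27, 36657.97, 36205.38, 36105.94]
df_residual = [66, 65, 64, 63, 62]

# Total sum of squares = regression + residual at step 1.
ss_total = 78516.94 + 48556.18

rows = []
for i in range(1, len(entered)):
    ss_added = ss_residual[i - 1] - ss_residual[i]
    rsq_increase = ss_added / ss_total
    f_to_enter = ss_added / (ss_residual[i] / df_residual[i])
    rows.append((entered[i], ss_added, rsq_increase, f_to_enter))
    print(f"{entered[i]}: SS added = {ss_added:9.2f}  "
          f"increase in RSQ = {rsq_increase:.4f}  F to enter = {f_to_enter:.4f}")
```

The computed values match steps 2 to 5 of the summary table up to rounding, and they show how little V1 and V2 contribute once V3, V4 and V5 are in the equation.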