Regression, Prediction, and Model Building
© 1998 by Dr. Thomas W. MacFarland -- All Rights Reserved
************ reg_sion.doc ************

Background:

Statistical tests are used to examine prior activities carefully and then to use these analyses to make informed predictions about future activities. Regardless of the statistical test, data are examined in a systematic manner so that decisions can be made with some degree of certainty.

It is very common to use accepted data to offer a prediction of the future. Using existing data to predict future outcomes is viewed as model-building. That is to say, existing data are used to build a model of the future, with a predetermined degree of error built into the model. Multiple regression is a common and useful tool for model building.

Scenario:

This study will demonstrate how historical data on Math and Verbal SAT scores can be used to predict University GPA. That is to say:

   -- Provided that you know a student's Math SAT score and Verbal SAT score,

   -- Can you use these two scores to predict this student's University GPA?

This study will attempt to resolve the following equation:

   University GPA = Constant +or- (x * Math_SAT) +or- (y * Verb_SAT)

Data are from the 105 students who graduated from a local state university, earning a B.S. in Computer Science. Data on these students were previously identified in the tutorial on the use of Pearson's Product-Moment Coefficient of Correlation.

Because the data are all interval data (i.e., the data are parametric, with the difference between a 3.87 GPA and a 3.88 GPA equal to the difference between a 4.03 GPA and a 4.04 GPA), Pearson's Coefficient of Correlation is the correct test to determine the degree of association between these variables.

Note: The data file 'reg_sion.dat' is an exact copy of the file used to conduct the Pearson's Coefficient of Correlation analysis.

Data for this study are summarized in Table 1.
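For readers who want to see what resolving the equation above involves, the following is a minimal sketch in Python of ordinary least squares with two predictors. It is not part of the original SPSS or MINITAB materials. It uses only the first ten students listed in Table 1, so the constant and coefficients it prints will differ from those reported for all 105 students. The names ROWS, fit_two_predictors, and residuals are assumptions made for illustration.

```python
# Sketch only: ordinary least squares with two predictors, standard library only.
# ROWS holds (Math_SAT, Verb_SAT, Univ_GPA) for students 001-010 of Table 1.
ROWS = [
    (643, 589, 3.52),
    (558, 512, 2.91),
    (583, 503, 2.40),
    (685, 602, 3.47),
    (592, 538, 3.47),
    (562, 486, 2.37),
    (573, 548, 2.40),
    (559, 536, 2.24),
    (552, 583, 3.02),
    (617, 591, 3.32),
]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def fit_two_predictors(rows):
    """Return (constant, b1, b2) that minimize the sum of squared errors."""
    x1 = [r[0] for r in rows]
    x2 = [r[1] for r in rows]
    y = [r[2] for r in rows]
    m1, m2, my = mean(x1), mean(x2), mean(y)
    # Center every variable so the constant drops out of the normal equations.
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    # Solve the 2 x 2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    constant = my - b1 * m1 - b2 * m2
    return constant, b1, b2


def residuals(rows, constant, b1, b2):
    """Observed minus predicted value for every row."""
    return [y - (constant + b1 * x1 + b2 * x2) for x1, x2, y in rows]


if __name__ == "__main__":
    constant, b_math, b_verb = fit_two_predictors(ROWS)
    print(f"Univ_GPA = {constant:.4f} + ({b_math:.6f} * Math_SAT)"
          f" + ({b_verb:.6f} * Verb_SAT)")
```

Centering the variables before solving keeps the arithmetic well behaved when the predictors are in the hundreds, as SAT scores are. With more than two predictors a statistical package remains the practical choice.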
Table 1

Summary Statistics of Computer Science Graduates: High School Grade Point Average (High_GPA), Math Scholastic Aptitude Test Score (Math_SAT), Verbal Scholastic Aptitude Test Score (Verb_SAT), Computer Science Grade Point Average (Comp_GPA), and Overall University Grade Point Average (Univ_GPA)

========================================================
Student
Number   High_GPA  Math_SAT  Verb_SAT  Comp_GPA  Univ_GPA
--------------------------------------------------------
 001       3.45      643       589       3.76      3.52
 002       2.78      558       512       2.87      2.91
 003       2.52      583       503       2.54      2.40
 004       3.67      685       602       3.83      3.47
 005       3.24      592       538       3.29      3.47
 006       2.10      562       486       2.64      2.37
 007       2.82      573       548       2.86      2.40
 008       2.36      559       536       2.03      2.24
 009       2.42      552       583       2.81      3.02
 010       3.51      617       591       3.41      3.32
 011       3.48      684       649       3.61      3.59
 012       2.14      568       592       2.48      2.54
 013       2.59      604       582       3.21      3.19
 014       3.46      619       624       3.52      3.71
 015       3.51      642       619       3.41      3.58
 016       3.68      683       642       3.52      3.40
 017       3.91      703       684       3.84      3.73
 018       3.72      712       652       3.64      3.49
 019       2.15      564       501       2.14      2.25
 020       2.48      557       549       2.21      2.37
 021       3.09      591       584       3.17      3.29
 022       2.71      599       562       3.01      3.19
 023       2.46      607       619       3.17      3.28
 024       3.32      619       558       3.01      3.37
 025       3.61      700       721       3.72      3.61
 026       3.82      718       732       3.78      3.81
 027       2.64      580       538       2.51      2.40
 028       2.19      562       507       2.10      2.21
 029       3.34      683       648       3.21      3.58
 030       3.48      717       724       3.68      3.51
 031       3.56      701       714       3.48      3.62
 032       3.81      691       684       3.71      3.60
 033       3.92      714       706       3.81      3.65
 034       4.00      689       673       3.84      3.76
 035       2.52      554       507       2.09      2.27
 036       2.71      564       543       2.17      2.35
 037       3.15      668       604       2.98      3.17
 038       3.22      691       662       3.28      3.47
 039       2.29      573       591       2.74      3.00
 040       2.03      568       517       2.19      2.74
 041       3.14      607       624       3.28      3.37
 042       3.52      651       683       3.68      3.54
 043       2.91      604       583       3.17      3.28
 044       2.83      560       542       3.17      3.39
 045       2.65      604       617       3.31      3.28
 046       2.41      574       548       3.07      3.19
 047       2.54      564       500       2.38      2.52
 048       2.66      607       528       2.94      3.08
 049       3.21      619       573       2.84      3.01
 050       3.34      647       608       3.17      3.42
 051       3.68      651       683       3.72      3.60
 052       2.84      571       543       2.17      2.40
 053       2.74      583       510       2.42      2.83
 054       2.71      554       538       2.49      2.38
 055       2.24      568       519       3.38      3.21
 056       2.48      574       602       2.07      2.24
 057       3.14      605       619       3.22      3.40
 058       2.83      591       584       2.71      3.07
 059       3.44      642       608       3.31      3.52
 060       2.89      608       573       3.28      3.47
 061       2.67      574       538       3.19      3.08
 062       3.24      643       607       3.24      3.38
 063       3.29      608       649       3.53      3.41
 064       3.87      709       688       3.72      3.64
 065       3.94      691       645       3.98      3.71
 066       3.42      667       583       3.09      3.01
 067       3.52      656       609       3.42      3.37
 068       2.24      554       542       2.07      2.34
 069       3.29      692       563       3.17      3.29
 070       3.41      684       672       3.51      3.40
 071       3.56      717       649       3.49      3.38
 072       3.61      712       708       3.51      3.28
 073       3.28      641       608       3.40      3.31
 074       3.21      675       632       3.38      3.42
 075       3.48      692       698       3.54      3.39
 076       3.62      684       609       3.48      3.51
 077       2.92      564       591       3.09      3.17
 078       2.81      554       509       3.14      3.20
 079       3.11      685       694       3.28      3.41
 080       3.28      671       609       3.41      3.29
 081       2.70      571       503       3.02      3.17
 082       2.62      582       591       2.97      3.12
 083       3.72      621       589       4.00      3.71
 084       3.42      651       642       3.34      3.50
 085       3.51      673       681       3.28      3.34
 086       3.28      651       640       3.32      3.48
 087       3.42      672       607       3.51      3.44
 088       3.90      591       587       3.68      3.59
 089       3.12      582       612       3.07      3.28
 090       2.83      609       555       2.78      3.00
 091       2.09      554       480       3.68      3.42
 092       3.17      612       590       3.30      3.41
 093       3.28      628       580       3.34      3.49
 094       3.02      567       602       3.17      3.28
 095       3.42      619       623       3.07      3.17
 096       3.06      691       683       3.19      3.24
 097       2.76      564       549       2.15      2.34
 098       3.19      650       684       3.11      3.28
 099       2.23      551       554       2.17      2.29
 100       2.48      568       541       2.14      2.08
 101       3.76      605       590       3.74      3.64
 102       3.49      692       683       3.27      3.42
 103       3.07      680       692       3.19      3.25
 104       2.19      617       503       2.98      2.76
 105       3.46      516       528       3.28      3.41
--------------------------------------------------------

Files:    1. reg_sion.doc
          2. reg_sion.dat
          3. reg_sion.r01
          4. reg_sion.o01
          5. reg_sion.con
          6. reg_sion.lis

Command:  At the Unix prompt (%), key:

          %spss -m < reg_sion.r01 > reg_sion.o01

************ reg_sion.dat ************

           001       3.45      643       589       3.76      3.52
           002       2.78      558       512       2.87      2.91
           003       2.52      583       503       2.54      2.40
           004       3.67      685       602       3.83      3.47
           005       3.24      592       538       3.29      3.47
           006       2.10      562       486       2.64      2.37
           007       2.82      573       548       2.86      2.40
           008       2.36      559       536       2.03      2.24
           009       2.42      552       583       2.81      3.02
           010       3.51      617       591       3.41      3.32
           011       3.48      684       649       3.61      3.59
           012       2.14      568       592       2.48      2.54
           013       2.59      604       582       3.21      3.19
           014       3.46      619       624       3.52      3.71
           015       3.51      642       619       3.41      3.58
           016       3.68      683       642       3.52      3.40
           017       3.91      703       684       3.84      3.73
           018       3.72      712       652       3.64      3.49
           019       2.15      564       501       2.14      2.25
           020       2.48      557       549       2.21      2.37
           021       3.09      591       584       3.17      3.29
           022       2.71      599       562       3.01      3.19
           023       2.46      607       619       3.17      3.28
           024       3.32      619       558       3.01      3.37
           025       3.61      700       721       3.72      3.61
           026       3.82      718       732       3.78      3.81
           027       2.64      580       538       2.51      2.40
           028       2.19      562       507       2.10      2.21
           029       3.34      683       648       3.21      3.58
           030       3.48      717       724       3.68      3.51
           031       3.56      701       714       3.48      3.62
           032       3.81      691       684       3.71      3.60
           033       3.92      714       706       3.81      3.65
           034       4.00      689       673       3.84      3.76
           035       2.52      554       507       2.09      2.27
           036       2.71      564       543       2.17      2.35
           037       3.15      668       604       2.98      3.17
           038       3.22      691       662       3.28      3.47
           039       2.29      573       591       2.74      3.00
           040       2.03      568       517       2.19      2.74
           041       3.14      607       624       3.28      3.37
           042       3.52      651       683       3.68      3.54
           043       2.91      604       583       3.17      3.28
           044       2.83      560       542       3.17      3.39
           045       2.65      604       617       3.31      3.28
           046       2.41      574       548       3.07      3.19
           047       2.54      564       500       2.38      2.52
           048       2.66      607       528       2.94      3.08
           049       3.21      619       573       2.84      3.01
           050       3.34      647       608       3.17      3.42
           051       3.68      651       683       3.72      3.60
           052       2.84      571       543       2.17      2.40
           053       2.74      583       510       2.42      2.83
           054       2.71      554       538       2.49      2.38
           055       2.24      568       519       3.38      3.21
           056       2.48      574       602       2.07      2.24
           057       3.14      605       619       3.22      3.40
           058       2.83      591       584       2.71      3.07
           059       3.44      642       608       3.31      3.52
           060       2.89      608       573       3.28      3.47
           061       2.67      574       538       3.19      3.08
           062       3.24      643       607       3.24      3.38
           063       3.29      608       649       3.53      3.41
           064       3.87      709       688       3.72      3.64
           065       3.94      691       645       3.98      3.71
           066       3.42      667       583       3.09      3.01
           067       3.52      656       609       3.42      3.37
           068       2.24      554       542       2.07      2.34
           069       3.29      692       563       3.17      3.29
           070       3.41      684       672       3.51      3.40
           071       3.56      717       649       3.49      3.38
           072       3.61      712       708       3.51      3.28
           073       3.28      641       608       3.40      3.31
           074       3.21      675       632       3.38      3.42
           075       3.48      692       698       3.54      3.39
           076       3.62      684       609       3.48      3.51
           077       2.92      564       591       3.09      3.17
           078       2.81      554       509       3.14      3.20
           079       3.11      685       694       3.28      3.41
           080       3.28      671       609       3.41      3.29
           081       2.70      571       503       3.02      3.17
           082       2.62      582       591       2.97      3.12
           083       3.72      621       589       4.00      3.71
           084       3.42      651       642       3.34      3.50
           085       3.51      673       681       3.28      3.34
           086       3.28      651       640       3.32      3.48
           087       3.42      672       607       3.51      3.44
           088       3.90      591       587       3.68      3.59
           089       3.12      582       612       3.07      3.28
           090       2.83      609       555       2.78      3.00
           091       2.09      554       480       3.68      3.42
           092       3.17      612       590       3.30      3.41
           093       3.28      628       580       3.34      3.49
           094       3.02      567       602       3.17      3.28
           095       3.42      619       623       3.07      3.17
           096       3.06      691       683       3.19      3.24
           097       2.76      564       549       2.15      2.34
           098       3.19      650       684       3.11      3.28
           099       2.23      551       554       2.17      2.29
           100       2.48      568       541       2.14      2.08
           101       3.76      605       590       3.74      3.64
           102       3.49      692       683       3.27      3.42
           103       3.07      680       692       3.19      3.25
           104       2.19      617       503       2.98      2.76
           105       3.46      516       528       3.28      3.41

************ reg_sion.r01 ************

SET WIDTH  = 80
SET LENGTH = NONE
SET CASE   = UPLOW
SET HEADER = NO
TITLE      = Multiple Regression
COMMENT    = This file is used to build a regression model
             for University Grade Point Average and SAT
             scores. That is to say:

             -- Provided that you know a student's Math
                SAT score and Verbal SAT score,

             -- Can you use these two scores to predict
                this student's University GPA?

             Data are from the 105 students who graduated
             from a local state university, earning a B.S.
             in Computer Science. As entering freshmen,
             these students need a Math SAT score of 550
             or greater to major in Computer Science.

             Because the data are all interval data (i.e.,
             the data are parametric, with the difference
             between a 3.87 GPA and a 3.88 GPA equal to the
             difference between a 4.03 GPA and a 4.04 GPA,
             Pearson's Coefficient of Correlation is the
             correct test to determine the degree of
             association between these variables.

             Note: The data file 'reg_sion.dat' is an
             exact copy of the file used to conduct the
             Pearson's Coefficient of Correlation analysis.
DATA LIST FILE = 'reg_sion.dat' FIXED
     / Stu_Code   12-14
       High_GPA   22-25
       Math_SAT   32-34
       Verb_SAT   42-44
       Comp_GPA   52-55
       Univ_GPA   62-65

FORMAT High_GPA (F4.2)
FORMAT Comp_GPA (F4.2)
FORMAT Univ_GPA (F4.2)

COMMENT    = By using the "FORMAT" command in this way,
             the three GPA scores are restricted to
             four columns, with the last two columns to
             the right of the decimal point.

Variable Labels
       Stu_Code   "Student Code"
     / High_GPA   "High School GPA"
     / Math_SAT   "Mathematics SAT Score"
     / Verb_SAT   "Verbal SAT Score"
     / Comp_GPA   "GPA in Computer Science Courses"
     / Univ_GPA   "GPA in All University Courses"

REGRESSION VARIABLES = Univ_GPA Math_SAT Verb_SAT
     / DEPENDENT     = Univ_GPA
     / METHOD        = ENTER

COMMENT    = Notice how Univ_GPA is declared as the
             dependent variable.

************ reg_sion.o01 ************

   1  SET WIDTH  = 80
   2  SET LENGTH = NONE
   3  SET CASE   = UPLOW
   4  SET HEADER = NO
   5  TITLE      = Multiple Regression
   6  COMMENT    = This file is used to build a regression model
   7               for University Grade Point Average and SAT
   8               scores. That is to say:
   9
  10               -- Provided that you know a student's Math
  11                  SAT score and Verbal SAT score,
  12
  13               -- Can you use these two scores to predict
  14                  this student's University GPA?
  15
  16               Data are from the 105 students who graduated
  17               from a local state university, earning a B.S.
  18               in Computer Science. As entering freshmen,
  19               these students need a Math SAT score of 550
  20               or greater to major in Computer Science.
  21
  22               Because the data are all interval data (i.e.,
  23               the data are parametric, with the difference
  24               between a 3.87 GPA and a 3.88 GPA equal to the
  25               difference between a 4.03 GPA and a 4.04 GPA,
  26               Pearson's Coefficient of Correlation is the
  27               correct test to determine the degree of
  28               association between these variables.
  29
  30               Note: The data file 'reg_sion.dat' is an
  31               exact copy of the file used to conduct the
  32               Pearson's Coefficient of Correlation analysis.
  33  DATA LIST FILE = 'reg_sion.dat' FIXED
  34       / Stu_Code   12-14
  35         High_GPA   22-25
  36         Math_SAT   32-34
  37         Verb_SAT   42-44
  38         Comp_GPA   52-55
  39         Univ_GPA   62-65
  40

This command will read 1 records from reg_sion.dat

Variable   Rec   Start     End   Format

STU_CODE     1      12      14   F3.0
HIGH_GPA     1      22      25   F4.0
MATH_SAT     1      32      34   F3.0
VERB_SAT     1      42      44   F3.0
COMP_GPA     1      52      55   F4.0
UNIV_GPA     1      62      65   F4.0

  41  FORMAT High_GPA (F4.2)
  42  FORMAT Comp_GPA (F4.2)
  43  FORMAT Univ_GPA (F4.2)
  44
  45  COMMENT    = By using the "FORMAT" command in this way,
  46               the three GPA scores are restricted to
  47               four columns, with the last two columns to
  48               the right of the decimal point.
  49
  50  Variable Labels
  51         Stu_Code   "Student Code"
  52       / High_GPA   "High School GPA"
  53       / Math_SAT   "Mathematics SAT Score"
  54       / Verb_SAT   "Verbal SAT Score"
  55       / Comp_GPA   "GPA in Computer Science Courses"
  56       / Univ_GPA   "GPA in All University Courses"
  57
  58  REGRESSION VARIABLES = Univ_GPA Math_SAT Verb_SAT
  59       / DEPENDENT     = Univ_GPA
  60       / METHOD        = ENTER
  61
  62  COMMENT    = Notice how Univ_GPA is declared as the
  63               dependent variable.

1404 bytes of memory required for REGRESSION procedure.
   0 more bytes may be needed for Residuals plots.

      * * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Listwise Deletion of Missing Data

Equation Number 1    Dependent Variable..   UNIV_GPA   GPA in All University Cou

Block Number  1.  Method:  Enter

Variable(s) Entered on Step Number
   1..    VERB_SAT     Verbal SAT Score
   2..    MATH_SAT     Mathematics SAT Score

Multiple R           .68573
R Square             .47022
Adjusted R Square    .45983
Standard Error       .32867

Analysis of Variance
                    DF      Sum of Squares      Mean Square
Regression           2             9.77974          4.88987
Residual           102            11.01840           .10802

F =       45.26669       Signif F =  .0000

------------------ Variables in the Equation ------------------

Variable              B        SE B       Beta         T  Sig T

MATH_SAT        .003291     .001090    .395622     3.019  .0032
VERB_SAT        .002272  9.3082E-04    .319867     2.441  .0164
(Constant)     -.237534     .375038                -.633  .5279

End Block Number   1   All requested variables entered.

************ reg_sion.con ************

Outcome:

The following information from the SPSS output file is used to develop the model, or the prediction equation:

------------------ Variables in the Equation ------------------

Variable              B        SE B       Beta         T  Sig T

MATH_SAT        .003291     .001090    .395622     3.019  .0032
VERB_SAT        .002272  9.3082E-04    .319867     2.441  .0164
(Constant)     -.237534     .375038                -.633  .5279

Although there is an abundance of information in this part of the SPSS printout, only the values in the B column are needed to build the prediction equation:

   Univ_GPA = Constant +or- (x * Math_SAT) +or- (y * Verb_SAT)

   Univ_GPA = -.237534 + (.003291 * Math_SAT) + (.002272 * Verb_SAT)

It is always best to try a sample calculation to see if the model is reasonable. Imagine a student with a Math_SAT of 650 and a Verb_SAT of 625. Using the prediction formula for this study:

   Univ_GPA = -.237534 + (.003291 * Math_SAT) + (.002272 * Verb_SAT)
   Univ_GPA = -.237534 + (.003291 * 650)      + (.002272 * 625)
   Univ_GPA = -.237534 + (2.13915)            + (1.42)
   Univ_GPA = 3.32162

And it is certainly reasonable to think that a student with a Math_SAT score of 650 and a Verb_SAT score of 625 would graduate from university with a GPA of 3.32162 (GPA = 4.0 is all A's).

Again, always try a sample calculation to verify that the model gives reasonable results.

As you will notice, the prediction equation is much easier to read in the attached MINITAB printout.
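The sample calculation above can also be checked in a few lines of code. The following is only a sketch in Python, which is not part of the original SPSS and MINITAB materials. The coefficients are copied from the B column of the SPSS output; the function name predict_univ_gpa is an assumption made for illustration.

```python
# Sketch only: apply the prediction equation from the SPSS output.
# Coefficients are copied from the B column of "Variables in the Equation".
CONSTANT = -0.237534
B_MATH_SAT = 0.003291
B_VERB_SAT = 0.002272


def predict_univ_gpa(math_sat, verb_sat):
    """Return the University GPA predicted by the fitted equation."""
    return CONSTANT + B_MATH_SAT * math_sat + B_VERB_SAT * verb_sat


if __name__ == "__main__":
    # The sample student from the text: Math_SAT = 650, Verb_SAT = 625.
    print(round(predict_univ_gpa(650, 625), 5))  # prints 3.32162
```

Keeping the coefficients as named constants makes it easy to substitute the values from a different regression run.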
************ reg_sion.lis ************

% minitab

MTB > outfile 'reg_sion.lis'
Collecting Minitab session in file: reg_sion.lis
MTB > # MINITAB addendum to 'reg_sion.dat'
MTB > #
MTB > read 'reg_sion.dat' c1 c2 c3 c4 c5 c6
Entering data from file: reg_sion.dat
    105 rows read.
MTB > name c1 'Stu_Code'
MTB > name c2 'High_GPA'
MTB > name c3 'Math_SAT'
MTB > name c4 'Verb_SAT'
MTB > name c5 'Comp_GPA'
MTB > name c6 'Univ_GPA'
MTB > #
MTB > # Before I conduct the regression analysis, I like to
MTB > # plot the two predictor variables.
MTB > #
MTB > plot 'Math_SAT' 'Verb_SAT'

[Character plot of Math_SAT (vertical axis, 560 to 700) against Verb_SAT (horizontal axis, 500 to 750)]

MTB > # And as you see, there is a generally positive association
MTB > # between Verb_SAT and Math_SAT.
MTB > #
MTB > regress 'Univ_GPA' on 2 predictor variables 'Math_SAT' 'Verb_SAT'

The regression equation is
Univ_GPA = - 0.238 + 0.00329 Math_SAT + 0.00227 Verb_SAT

Predictor        Coef       Stdev    t-ratio        p
Constant      -0.2375      0.3750      -0.63    0.528
Math_SAT     0.003291    0.001090       3.02    0.003
Verb_SAT    0.0022718   0.0009308       2.44    0.016

s = 0.3287      R-sq = 47.0%     R-sq(adj) = 46.0%

Analysis of Variance

SOURCE       DF          SS          MS         F        p
Regression    2      9.7797      4.8899     45.27    0.000
Error       102     11.0184      0.1080
Total       104     20.7981

SOURCE       DF      SEQ SS
Math_SAT      1      9.1363
Verb_SAT      1      0.6435

Continue? y
MTB > # Unlike SPSS, MINITAB actually prints out the regression
MTB > # formula:
MTB > #
MTB > #   Univ_GPA = - 0.238 + 0.00329 Math_SAT + 0.00227 Verb_SAT
MTB > #
MTB > # I will test this formula, using 650 on Math SAT and 589
MTB > # for the Verbal SAT score:
MTB > #
MTB > #   Univ_GPA = -0.238 + (0.00329 * 650) + (0.00227 * 589)
MTB > #   Univ_GPA = -0.238 + (2.1385) + (1.33703)
MTB > #   Univ_GPA = 3.23753
MTB > #
MTB > # And it is perfectly reasonable to expect a student with
MTB > # a Math SAT score of 650 and a Verbal SAT score of 589
MTB > # to later achieve a University Grade Point Average of
MTB > # approximately 3.24.
MTB > #
MTB > # As a note, you may want to look into the issue of
MTB > # multicollinearity when determining which predictor
MTB > # variables to select for the regression model. But
MTB > # this topic is beyond the scope of this tutorial.
MTB > stop

--------------------------

Disclaimer: All care was used to prepare the information in this tutorial. Even so, the author does not and cannot guarantee the accuracy of this information. The author disclaims any and all injury that may come about from the use of this tutorial. As always, students and all others should check with their advisor(s) and/or other appropriate professionals for any and all assistance on research design, analysis, selected levels of significance, and interpretation of output file(s).

The author is entitled to exclusive distribution of this tutorial. Readers have permission to print this tutorial for individual use, provided that the copyright statement appears and that there is no redistribution of this tutorial without permission.

Prepared 980316
Revised  980914

end-of-file 'reg_sion.ssi'