Statistics Tutorial: Chi-Square Test or the "Goodness of Fit" Test

************
chi_squa.doc
************
Background:  Parametric data are measurements on an interval or
             ratio scale, such as scores on a final examination
             that can range from 0 to 100.  Parametric tests
             summarize such data with parameters (e.g., the mean
             and standard deviation) and make assumptions about
             how the data are distributed.

             As opposed to parametric data, nonparametric data
             are data that are instead typically counted and
             then put into groups or categories.  Using test
             scores again as an example, nonparametric data
             could be viewed as the "number of passing test
             scores on a C++ programming final examination,"
             with pass defined as all scores of 70 (out of 100)
             or greater.

             With this brief background on parametric and
             nonparametric data, the Chi-square test is perhaps
             the most frequently used (if not overused)
             nonparametric statistical test.  The Chi-square
             test, named for the Greek letter "Chi," is used to
             test for differences in proportions between two or
             more groups.  You may also see the Chi-square test
             called a "goodness of fit" test.  That is to say,
             the Chi-square test is used to see if grouped data
             "fit" the counts expected for the declared groups,
             or if the data instead do not "fit" those expected
             counts.

             A Chi-square test typically involves:

             -- the assignment of frequency data (i.e., head
                counts) into a 2 by 2, 2 by 3, etc., table.

                The following figure represents a 2 by 2
                (always in the order of ROWS by COLUMNS)
                table, with this table composed of four
                cells:

                         N Scores >= 70    N Scores < 70
                         |--------------|--------------|
                Male     |  N = 45      |    N = 38    |
                         |--------------|--------------|
                Female   |  N = 42      |    N = 27    |
                         |--------------|--------------|

             -- application of the actual Chi-square formula 
                and subsequent decision-making as to whether
                differences between observed and expected
                counts in each cell are due to chance, or
                if the differences are instead due to true
                differences between the declared groups.
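
             For reference, the Chi-square formula mentioned
             above is commonly written as shown below.  The
             symbols are this note's own notation and do not
             come from the SPSS or MINITAB printouts:  O is the
             observed count in a cell, E is the expected count
             in that cell, R and C are the row and column
             totals, N is the grand total, and r and c are the
             number of rows and columns.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
df = (r - 1)(c - 1)
```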

             A typical scenario for a Chi-square test would be
             the organization of data into discrete categories.
             Imagine, for example, the responses of male and
             female workers to a question that allowed "yes,"
             "no," or "undecided" as possible responses.  The
             data would be organized into a 2 by 3 table (i.e.,
             2 rows by 3 columns, consisting of 2*3 = 6 cells):


                                      Response

                               Yes       No     Undecided
                           _______________________________
                           |         |         |         |
                  Male     | N = 12  | N = 07  | N = 08  |
                           |         |         |         |
             Gender        |---------|---------|---------|
                           |         |         |         |
                  Female   | N = 09  | N = 11  | N = 14  |
                           |_________|_________|_________|
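
             As a rough illustration, the totals, number of
             cells, and expected counts for this 2 by 3 table
             can be worked out with a few lines of Python.  This
             is only a sketch:  the variable names are made up
             for this example, and the expected counts assume
             the usual rule of (row total * column total) /
             grand total.

```python
# Observed counts from the 2 by 3 response table above.
observed = [
    [12, 7, 8],    # Male:   Yes, No, Undecided
    [9, 11, 14],   # Female: Yes, No, Undecided
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for a cell = (row total * column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Number of cells and degrees of freedom for a rows-by-columns table.
cells = len(observed) * len(observed[0])
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("row totals:", row_totals)
print("column totals:", col_totals)
print("grand total:", n, " cells:", cells, " df:", df)
for row in expected:
    print(["%.2f" % e for e in row])
```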


             When using Chi-square, there are a few criteria
             that must be observed:

             -- Data must be presented as frequency (i.e.,
                counted) data, such as the number of "yes"
                responses to a survey statement.  Please recall,
                however, that parametric data can be organized
                into categories such as "The number of students
                with IQ > 100" or "The number of students with
                IQ <= 100." 

             -- Ideally, the expected frequency for each cell
                should be five or more.  Otherwise, for a 2 by 2
                table, it may be necessary to use Yates'
                correction formula to account for low cell
                counts.

             -- Regardless of the organization scheme, the data
                must be organized in a logical manner:  the
                categories must not overlap, and each subject
                must be counted in one and only one cell.
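
             The low-count criterion and Yates' correction can
             be sketched in Python as follows, using the 2 by 2
             male/female table shown earlier.  The function
             names are hypothetical, and the correction shown
             is the common textbook form (subtract 0.5 from each
             absolute difference between observed and expected
             counts before squaring); check your own textbook
             for the form your advisor(s) expect.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / grand total."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]


def chi_square(table, yates=False):
    """Pearson Chi-square for a table of counts.

    With yates=True, 0.5 is subtracted from each |O - E| (never going
    below zero) before squaring.  This correction is intended for
    2 by 2 tables only.
    """
    adjust = 0.5 if yates else 0.0
    total = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            total += max(abs(o - e) - adjust, 0.0) ** 2 / e
    return total


# Male / Female by Scores >= 70 / Scores < 70.
table = [[45, 38], [42, 27]]

smallest_expected = min(min(row) for row in expected_counts(table))
print("smallest expected count: %.2f" % smallest_expected)
print("uncorrected Chi-square:  %.4f" % chi_square(table))
print("with Yates' correction:  %.4f" % chi_square(table, yates=True))
```

             The corrected statistic is always the smaller of
             the two, so the correction makes the test more
             conservative.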


Scenario:    In this study, the Chi-square test will be used
             to determine if a passing grade on a high school
             mathematics mastery test, administered during the
             senior year of high school, can be used as a
             later indicator of the pass/fail rate in a C++
             programming course among freshmen students in a
             community college.  
             
             In this example, Dr. Dunbar teaches at Warren
             County Community College.  She knows from local
             contacts that all 60 freshmen students in her most
             recent C++ programming class were required to sit
             for a mathematics mastery test during their senior
             year in high school.  Dr. Dunbar decides to use
             the Chi-square test, comparing the proportion of
             students who passed the C++ course among those who
             passed the mathematics mastery test against the
             proportion among those who failed it.  Correctly,
             Dr. Dunbar assumes that the Chi-square test is the
             most appropriate statistical test for this problem
             since it is used to test for differences in
             proportions between two or more groups (i.e.,
             students who passed the mathematics mastery test
             and students who failed it).

             After a few phone calls to high school guidance
             counselors, Dr. Dunbar is able to assemble the
             mathematics mastery test scores (pass/fail status
             only) of these 60 students from her prior C++
             programming class.

             Dr. Dunbar organizes the mathematics mastery test
             scores (pass/fail) and end-of-term grades (pass/
             fail) in her C++ programming course into a 2 by 2
             table:
             
                                              
                                C++ Programming Class

                                 Pass       Fail   
                              _______________________
                              |          |          |
                     Pass     |  N = 31  |  N = 12  |   Row 1
                              |          |          |
             Mathematics Test |----------|----------|
                              |          |          |
                     Fail     |  N = 09  |  N = 08  |   Row 2
                              |__________|__________|

                                Column 1    Column 2


             To summarize this 2 by 2 table:

             -- 31 students passed the mathematics mastery test
                and also passed the C++ programming course

             -- 12 students passed the mathematics mastery test
                but failed the C++ programming course

             -- 09 students failed the mathematics mastery test
                but passed the C++ programming course

             -- 08 students failed the mathematics mastery test
                and also failed the C++ programming course

             Then, Dr. Dunbar prepares a table (Table 1) that
             identifies student number, pass/fail status on
             the mathematics mastery test, and pass/fail status
             for end-of-term grade in the C++ programming class.
             

             Table 1

             Pass/Fail Scores For Freshmen Students at Warren
             County Community College:  Mathematics Mastery Test
             Score by End-of-Term Grade in a C++ Programming
             Course
             ===================================================
                                     Math       C++
                                   =================
             Student Number        Pass = 1 Fail = 2
             ---------------------------------------------------

                   01                 1          1 
                   02                 2          1 
                   03                 1          1 
                   04                 1          1 
                   05                 1          2 
                   06                 1          1 
                   07                 1          2 
                   08                 2          1 
                   09                 2          1 
                   10                 1          2 
                   11                 1          1 
                   12                 1          1 
                   13                 1          1 
                   14                 2          2 
                   15                 2          2 
                   16                 1          1 
                   17                 1          1 
                   18                 1          1 
                   19                 1          2 
                   20                 1          1 
                   21                 1          2 
                   22                 1          1 
                   23                 1          2 
                   24                 1          1 
                   25                 1          1 
                   26                 1          2 
                   27                 2          2 
                   28                 1          1 
                   29                 1          1 
                   30                 1          1 
                   31                 1          2 
                   32                 1          1 
                   33                 2          2 
                   34                 2          1 
                   35                 1          1 
                   36                 1          2 
                   37                 1          1 
                   38                 1          1 
                   39                 2          2 
                   40                 1          1 
                   41                 1          1 
                   42                 2          1 
                   43                 1          2 
                   44                 1          1 
                   45                 1          1 
                   46                 1          1 
                   47                 1          2 
                   48                 2          2 
                   49                 1          1 
                   50                 2          1 
                   51                 2          1
                   52                 1          1
                   53                 1          1
                   54                 2          1
                   55                 2          2
                   56                 1          2
                   57                 1          1
                   58                 2          1
                   59                 2          2
                   60                 1          1
             ---------------------------------------------------
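
             The four cell counts in the 2 by 2 table can be
             recovered from Table 1 by tallying the coded
             values.  A minimal Python sketch follows; the
             variable names are made up for this example, and
             the two strings simply repeat the Math and C++
             columns of Table 1 in student order.

```python
from collections import Counter

# Pass/fail codes for students 01-60, copied in order from Table 1
# (1 = pass, 2 = fail), fifteen students per string.
PASS_M = (
    "121111122111122"
    "111111111112111"
    "112211112112111"
    "112122112211221"
)
PASS_CPP = (
    "111121211211122"
    "111212121122111"
    "212112112111211"
    "122111111221121"
)

# Count how many students fall into each (math, C++) combination.
cells = Counter(zip(PASS_M, PASS_CPP))

# Rows are math pass/fail; columns are C++ pass/fail.
table = [[cells[(m, c)] for c in "12"] for m in "12"]

for label, row in zip(("Math pass", "Math fail"), table):
    print(label, row)
```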
             

             As you review this table, be sure to ask your
             advisor(s) for the proper form and style for
             the construction of a table.  I prefer to place
             the title "flush left," but centered headings
             are also common.

             A new topic presented in this table is the use of
             numerical codes for pass/fail.  In this template,
             I will use the following codes for pass/fail:

             -- pass = 1

             -- fail = 2

             You will also see this numerical coding scheme
             used in this template's SPSS run file.  The exact
             code from the SPSS run file chi_squa.r01 follows:

             Value Labels
                 Pass_M    1 'Passed' 
                           2 'Failed'

               / Pass_Cpp  1 'Passed' 
                           2 'Failed'



Ho:          Null Hypothesis:  There is no association between
             pass/fail status for senior high school students
             on a mathematics mastery test and their later
             pass/fail status as freshmen college students in a
             C++ programming course (p <= .05).

             Notice how the Null Hypothesis (Ho) uses p <= .05.
             If this term is new to you, then you should know
             that the term p <= .05 is used to declare that a
             difference will be called significant only if
             there is a five percent or less probability that
             a difference this large would come about by chance
             alone when the Null Hypothesis is true.  That is
             to say, the risk of declaring a difference that is
             not real (a Type I error) is held to five percent
             or less.

             Most inferential analyses in the social sciences
             are conducted at p <= .05.  However, you will see
             some problems set at the more restrictive p <= .01.
             I suggest that you consult with your advisor(s) for
             guidance on the most appropriate level of
             probability to use when you conduct your own
             analyses.
             
             Along with the use of "p," you will also see the
             term "alpha" to describe the level of probability.
             Personally, I prefer to use "alpha," but "p" is so
             common that I will use this term throughout this set
             of templates.


Files:       1.  chi_squa.doc

             2.  chi_squa.dat  

             3.  chi_squa.r01  

             4.  chi_squa.o01

             5.  chi_squa.con

             6.  chi_squa.lis


Command:     At the Unix prompt (%), key:

             %spss -m < chi_squa.r01 > chi_squa.o01


************
chi_squa.dat
************
                   01                 1          1 
                   02                 2          1 
                   03                 1          1 
                   04                 1          1 
                   05                 1          2 
                   06                 1          1 
                   07                 1          2 
                   08                 2          1 
                   09                 2          1 
                   10                 1          2 
                   11                 1          1 
                   12                 1          1 
                   13                 1          1 
                   14                 2          2 
                   15                 2          2 
                   16                 1          1 
                   17                 1          1 
                   18                 1          1 
                   19                 1          2 
                   20                 1          1 
                   21                 1          2 
                   22                 1          1 
                   23                 1          2 
                   24                 1          1 
                   25                 1          1 
                   26                 1          2 
                   27                 2          2 
                   28                 1          1 
                   29                 1          1 
                   30                 1          1 
                   31                 1          2 
                   32                 1          1 
                   33                 2          2 
                   34                 2          1 
                   35                 1          1 
                   36                 1          2 
                   37                 1          1 
                   38                 1          1 
                   39                 2          2 
                   40                 1          1 
                   41                 1          1 
                   42                 2          1 
                   43                 1          2 
                   44                 1          1 
                   45                 1          1 
                   46                 1          1 
                   47                 1          2 
                   48                 2          2 
                   49                 1          1 
                   50                 2          1 
                   51                 2          1
                   52                 1          1
                   53                 1          1
                   54                 2          1
                   55                 2          2
                   56                 1          2
                   57                 1          1
                   58                 2          1
                   59                 2          2
                   60                 1          1

************
chi_squa.r01
************
SET WIDTH      = 80
SET LENGTH     = NONE
SET CASE       = UPLOW
SET HEADER     = NO
TITLE          = Chi-Square 
COMMENT        = This file examines pass/fail scores on
                 a mathematics mastery test and the 
                 potential that students who pass this 
                 mastery test have a greater chance of 
                 passing a C++ programming class than   
                 students who did not pass this mastery
                 test
DATA LIST FILE = 'chi_squa.dat' FIXED
     / Stu_Code   20-21
       Pass_M        39
       Pass_Cpp      50

Variable Labels
       Stu_Code   "Student Code"    
     / Pass_M     "Passed the Mathematics Competency Test"
     / Pass_Cpp   "Passed the C++ Programming Course"

Value Labels
       Pass_M    1 'Passed' 
                 2 'Failed'

     / Pass_Cpp  1 'Passed' 
                 2 'Failed'

CROSSTABS TABLES  = Pass_M by Pass_Cpp    
     / STATISTICS = CHISQ
************
chi_squa.o01
************
   1  SET WIDTH      = 80
   2  SET LENGTH     = NONE
   3  SET CASE       = UPLOW
   4  SET HEADER     = NO
   5  TITLE          = Chi-Square
   6  COMMENT        = This file examines pass/fail scores on
   7                   a mathematics mastery test and the
   8                   potential that students who pass this
   9                   mastery test have a greater chance of
  10                   passing a C++ programming class than
  11                   students who did not pass this mastery
  12                   test
  13  DATA LIST FILE = 'chi_squa.dat' FIXED
  14       / Stu_Code   20-21
  15         Pass_M        39
  16         Pass_Cpp      50
  17

This command will read 1 records from chi_squa.dat

Variable   Rec   Start     End         Format

STU_CODE     1      20      21         F2.0
PASS_M       1      39      39         F1.0
PASS_CPP     1      50      50         F1.0

  18  Variable Lables
  19         Stu_Code   "Student Code"
  20       / Pass_M     "Passed the Mathematics Competency Test"
  21       / Pass_Cpp   "Passed the C++ Programming Course"
  22
  23  Value Labels
  24         Pass_M    1 'Passed'
  25                   2 'Failed'
  26
  27       / Pass_Cpp  1 'Passed'
  28                   2 'Failed'
  29
  30  CROSSTABS TABLES  = Pass_M by Pass_Cpp
  31       / STATISTICS = CHISQ

Memory allows for 11,915 cells with 2 dimensions for general CROSSTABS.


PASS_M  Passed the Mathematics Competency Test
by  PASS_CPP  Passed the C++ Programming Course

                    PASS_CPP     Page 1 of 1
            Count  |
                   |Passed   Failed
                   |                    Row
                   |     1  |     2  | Total
PASS_M     --------+--------+--------+
                1  |    31  |    12  |    43
  Passed           |        |        |  71.7
                   +--------+--------+
                2  |     9  |     8  |    17
  Failed           |        |        |  28.3
                   +--------+--------+
            Column      40       20       60
             Total    66.7     33.3    100.0

      Chi-Square                  Value           DF               Significance
--------------------          -----------        ----              ------------

Pearson                          2.01094           1                  .15617
Continuity Correction            1.24145           1                  .26519
Likelihood Ratio                 1.95531           1                  .16202
Mantel-Haenszel test for         1.97743           1                  .15966
      linear association

Minimum Expected Frequency -    5.667



Number of Missing Observations:  0


************
chi_squa.con
************
Outcome:     In this example, the SPSS output file chi_squa.o01
             has a great deal of information.  However, for your
             interest in the use of the Chi-square test, you
             only need to concentrate on the part of the
             printout that shows that Pearson's Chi-square
             value = 2.01094:

             Computed Chi-square  = 2.01

             Knowing the computed Chi-square statistic, you
             should then compare it to the criterion Chi-
             square statistic:

             Criterion Chi-square = 3.84 (p <= .05, df = 1)

             The criterion Chi-square statistic of 3.84 was
             gained from the table of Chi-square statistics
             found in the appendix of nearly all leading
             statistics textbooks.  (Be sure to notice in your
             own textbook how degrees of freedom are usually
             placed as row values and probability/alpha levels
             are placed as column headers.)  Using this scheme,
             notice how the criterion Chi-square statistic
             equals 3.84 with the column header of p <= .05
             and the row value of 1 degree of freedom.  You may
             need to consult your statistics textbook for
             background information if this topic is totally
             new to you.
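
             If a textbook table is not at hand, the criterion
             value for this particular problem can also be
             sketched in Python.  The shortcut below is an
             assumption-laden illustration, not a general
             method:  it relies on the fact that a Chi-square
             variable with 1 degree of freedom is the square of
             a standard normal variable, so it applies only when
             df = 1.  The function name is hypothetical.

```python
from statistics import NormalDist


def criterion_chi_square_df1(alpha):
    """Criterion (critical) Chi-square value for 1 degree of freedom.

    A Chi-square variable with df = 1 is the square of a standard
    normal variable, so the criterion is the squared two-tailed normal
    cutoff.  This shortcut does NOT apply when df is greater than 1.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


# Degrees of freedom for a 2 by 2 table: (rows - 1) * (columns - 1).
rows, columns = 2, 2
df = (rows - 1) * (columns - 1)

print("df =", df)
print("criterion at p <= .05: %.2f" % criterion_chi_square_df1(0.05))
print("criterion at p <= .01: %.2f" % criterion_chi_square_df1(0.01))
```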

             After comparing the computed Chi-square statistic
             to the criterion Chi-square statistic, you will
             notice that:

             Computed Chi-square (2.01) < Criterion Chi-
             square (3.84)

             Because the computed Chi-square statistic (2.01)
             is less than the criterion Chi-square statistic
             (3.84), do not reject the Null Hypothesis (Ho).
             That is to say, these data show no statistically
             significant relationship between pass/fail status
             for senior high school students on a mathematics
             mastery test and their later pass/fail status as
             freshmen college students in a C++ programming
             course (p <= .05).
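
             The computed statistic and the comparison against
             the criterion can be checked by hand or with a
             short Python sketch.  The version below uses a
             well-known shortcut for 2 by 2 tables rather than
             whatever SPSS does internally; the function and
             variable names are made up for this example.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square for a 2 by 2 table laid out as

           a | b
           c | d

    using the shortcut N(ad - bc)^2 divided by the product of the two
    row totals and the two column totals.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


CRITERION = 3.84   # table value for p <= .05, df = 1

# Dr. Dunbar's table: rows are math pass/fail, columns are C++ pass/fail.
computed = chi_square_2x2(31, 12, 9, 8)
reject_ho = computed >= CRITERION

print("computed Chi-square: %.5f" % computed)
print("reject Ho at p <= .05:", reject_ho)
```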

             Another way to interpret this problem, without
             looking up criterion Chi-square statistics in
             tables, is to look at the significance value
             (i.e., the probability of a Chi-square statistic
             this large or larger if the Null Hypothesis is
             true) presented as part of the SPSS output file:


             Chi-Square     Value        DF   Significance
             ----------     -----------  --   ------------

             Pearson        2.01094      1      .15617


             By looking at this section of the printout, you
             can see that the significance value is .15617
             (approximately 16 percent), which exceeds the
             previously declared value of p <= .05 (i.e., 5
             percent).  Because the observed significance value
             (16 percent) is greater than the declared level of
             significance (5 percent or less), you have another
             measure that confirms that there is no
             statistically significant relationship (at
             p <= .05) between pass/fail status on the
             mathematics mastery test and the later pass/fail
             status in a C++ programming course.
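
             The significance value on the printout can be
             approximated from the computed Chi-square statistic
             with Python's standard library.  This sketch is
             limited to 1 degree of freedom, where the upper-
             tail probability reduces to the complementary error
             function; the function name is hypothetical.

```python
import math


def significance_df1(chi_square):
    """Upper-tail probability of a Chi-square statistic with df = 1.

    For df = 1 this equals erfc(sqrt(chi_square / 2)).  Other degrees
    of freedom need a different calculation.
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


ALPHA = 0.05                       # declared level, p <= .05
p_value = significance_df1(2.01094)
significant = p_value <= ALPHA

print("significance value: %.5f" % p_value)
print("significant at p <= .05:", significant)
```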

Conclusion:  Dr. Dunbar, a college-level computer science
             teacher, had 60 freshmen students in her C++
             programming course who sat for a mathematics
             mastery test during the senior year in high
             school.  Based on the set of data in this one
             problem, Dr. Dunbar cannot conclude that pass/fail
             status on the mathematics mastery test is a useful
             indicator of pass/fail status in her C++
             programming class.

             That is to say, approximately 72 percent of the
             students who passed the mathematics mastery test
             (31 of 43) and approximately 53 percent of the
             students who failed it (9 of 17) passed the C++
             programming course.  With p <= .05, the difference
             between these two passing rates is not
             statistically significant.  (Overall,
             approximately 72 percent of all students passed
             the mathematics mastery test and approximately 67
             percent of all students passed the C++ programming
             course.)
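
             The percentages in this conclusion can be
             recomputed from the 2 by 2 table.  The Python
             sketch below (variable names made up for this
             example) gives the overall pass rates from the row
             and column totals, and also the C++ pass rate
             within each mathematics group, which is the
             comparison the Chi-square test on this table
             actually makes.

```python
# Observed counts from Dr. Dunbar's 2 by 2 table.
#                         C++ pass  C++ fail
observed = {"math_pass": (31, 12),
            "math_fail": (9, 8)}

n = sum(sum(row) for row in observed.values())

# Overall pass rates (from the row and column totals), in percent.
math_pass_pct = 100 * sum(observed["math_pass"]) / n
cpp_pass_pct = 100 * (observed["math_pass"][0] + observed["math_fail"][0]) / n

# C++ pass rate within each mathematics group, in percent.
cpp_pct_if_math_pass = 100 * observed["math_pass"][0] / sum(observed["math_pass"])
cpp_pct_if_math_fail = 100 * observed["math_fail"][0] / sum(observed["math_fail"])

print("passed math (all students):   %.1f" % math_pass_pct)
print("passed C++ (all students):    %.1f" % cpp_pass_pct)
print("passed C++, passed math:      %.1f" % cpp_pct_if_math_pass)
print("passed C++, failed math:      %.1f" % cpp_pct_if_math_fail)
```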

Review:      As you review this template, be sure to give
             attention to the following new concepts:

             -- Numerical codes were used to identify
                pass/fail (1 = pass and 2 = fail).

             -- Data are often organized in 2 by 2, 3 by
                4, etc. tables.  As data are organized
                into these tables, it is standard to state
                these organizational schemes as rows by
                columns.

                Further, a 2 by 3 table, with two rows and three
                columns, has six cells (2*3 = 6).

                You may sometimes see the term 2 x 3 instead
                of 2 by 3 when referencing these tables.  I
                suggest that you avoid "2 x 3" and instead
                write "2 by 3" to avoid the confusion that may
                come about as some readers incorrectly see the
                "x" character as the name for a new variable.

             -- The computed Chi-square statistic (2.01),
                provided in the SPSS output file, was compared
                to a criterion Chi-square statistic (3.84).

             -- Consult your textbook for the table(s) that
                identify these criterion statistics.  Be
                sure to notice that the criterion value is
                dependent on the declared p level (level of
                significance, or alpha).  Most tests
                in the social sciences are conducted at the 5
                percent level of significance (p <= .05).  You
                may occasionally see p <= .10 (10 percent level
                of significance) and p <= .01 (1 percent level
                of significance) in the literature.

             -- The SPSS printout in the output file also
                includes the calculated significance value
                (.15617).  By using this statistic, you may
                not need to review the table value of the
                criterion Chi-square statistic.

             -- When considering significance, an important
                point to mention here is that there is no
                basis to the incorrect statement "almost
                significant."  The p level (whether .10,
                .05, or .01) is determined in advance and
                then included in the Null Hypothesis.
                Whichever test you use, differences are
                either significant at the declared p value
                or they are not significant.  This is a
                discrete activity (e.g., the light bulb is
                either on or it is off; the light bulb is
                not "almost" on or "almost" off).

             The MINITAB addendum for this analysis follows,
             as the file chi_squa.lis.  This file represents
             a MINITAB-based analysis of this problem in
             interactive mode, as opposed to the previous
             analysis of this problem with SPSS in batch
             mode.


************
chi_squa.lis
************
% minitab
 
 MTB > outfile 'chi_squa.lis'
 Collecting Minitab session in file: chi_squa.lis
 MTB > # MINITAB addendum to chi_squa.dat
 MTB > read 'chi_squa.dat' c1 c2 c3
 Entering data from file: chi_squa.dat
      60 rows read.
 MTB > name c1 'Stu_Code' c2 'Pass_M' c3 'Pass_Cpp'
 MTB > print c1
 
 
 Stu_Code
     1     2     3     4     5     6     7     8     9    10    11    12    13 
    14    15    16    17    18    19    20    21    22    23    24    25    26 
    27    28    29    30    31    32    33    34    35    36    37    38    39 
    40    41    42    43    44    45    46    47    48    49    50    51    52 
    53    54    55    56    57    58    59    60 
 
 MTB > print c2
 
 
 Pass_M  
    1    2    1    1    1    1    1    2    2    1    1    1    1    2    2 
    1    1    1    1    1    1    1    1    1    1    1    2    1    1    1 
    1    1    2    2    1    1    1    1    2    1    1    2    1    1    1 
    1    1    2    1    2    2    1    1    2    2    1    1    2    2    1 
 
 MTB > print c3
 
 
 Pass_Cpp  
    1    1    1    1    2    1    2    1    1    2    1    1    1    2    2 
    1    1    1    2    1    2    1    2    1    1    2    2    1    1    1 
    2    1    2    1    1    2    1    1    2    1    1    1    2    1    1 
    1    2    2    1    1    1    1    1    1    2    2    1    1    2    1 
 
 MTB > table 'Pass_M' by 'Pass_Cpp';
 SUBC> chisquare 2.
   
  
  
  
  ROWS: Pass_M     COLUMNS: Pass_Cpp
  
            1        2      ALL
   
   1       31       12       43
        28.67    14.33    43.00
   
   2        9        8       17
        11.33     5.67    17.00
   
  ALL      40       20       60
        40.00    20.00    60.00
  
 CHI-SQUARE =     2.011   WITH D.F. =    1
   CELL CONTENTS --
                   COUNT
                   EXP FREQ
 
 MTB > stop

--------------------------
Disclaimer:  All care was used to prepare the information in this 
tutorial.  Even so, the author does not and cannot guarantee the 
accuracy of this information.  The author disclaims any and all 
injury that may come about from the use of this tutorial.  As 
always, students and all others should check with their advisor(s) 
and/or other appropriate professionals for any and all assistance 
on research design, analysis, selected levels of significance, and 
interpretation of output file(s).

The author is entitled to exclusive distribution of this tutorial. 
Readers have permission to print this tutorial for individual use, 
provided that the copyright statement appears and that there is no 
redistribution of this tutorial without permission.

Prepared 980316
Revised  980914
end-of-file 'chi_squa.ssi'
Please send comments or suggestions to Dr. Thomas W. MacFarland
