Correlation Analysis

 CORRTEST.83p CORRTEST.86p corrtest.89p

Suppose we have two populations X and Y, at least one of which is normally distributed, and that we have a random sample of paired data from the populations: {(x1, y1), (x2, y2), ..., (xn, yn)}. We wish to test the null hypothesis that the correlation r between the populations is equal to 0. To perform this test, we study the sample correlation coefficient R.

It is always the case that -1 <= R <= 1. If R is near 1, then there is a strong positive linear correlation: as the values of one set of data increase, so do the values of the other set. If R is near -1, then there is a strong negative linear correlation: as one set of values increases, the other set of values decreases.

If the populations are independent, then the true correlation r between populations will be 0. So if the sample correlation R is not sufficiently close to 0, then we not only reject that r = 0, but we can also reject a claim that the populations are independent.

To measure the closeness of R to 0, we use the test statistic

x = R * Sqrt[n - 2] / Sqrt[1 - R^2],

which, under the null hypothesis r = 0, follows a t-distribution with n - 2 degrees of freedom: T ~ t(n - 2). If a is our level of significance, then we reject the claim that r = 0 if the right or left tail probability is too small. That is, we reject if P(T <= x) < a / 2 or if P(T >= x) < a / 2. For this two-sided test, the p-value is twice the smaller of the two tail values.

For the one-sided alternative hypothesis Ha: r > 0, the p-value is given by the right tail value. For the one-sided alternative hypothesis Ha: r < 0, the p-value is given by the left tail value.
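As a cross-check away from the calculator, the test statistic and tail probabilities can be sketched in Python. This is not the CORRTEST program itself: the function names (t_pdf, t_right_tail, corr_test) are hypothetical, and the tail probability is approximated here by Simpson's rule integration of the t density, which is an assumption of this sketch rather than the calculator's method. The sample values R = 0.976534 and n = 24 are those of the cigarette example in this section.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_right_tail(x, df, steps=4000):
    """Approximate P(T >= x) by integrating the density over [0, |x|]."""
    b = abs(x)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        # Simpson's rule weights: 4 at odd nodes, 2 at even interior nodes
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3            # P(0 <= T <= |x|)
    upper = max(0.0, 0.5 - area)    # P(T >= |x|), clamped against rounding
    return upper if x >= 0 else 1.0 - upper

def corr_test(R, n):
    """Return (test statistic, left tail, right tail) for H0: r = 0."""
    x = R * math.sqrt(n - 2) / math.sqrt(1 - R * R)
    right = t_right_tail(x, n - 2)
    return x, 1.0 - right, right

x, left, right = corr_test(0.976534, 24)
p_two_sided = 2 * min(left, right)   # twice the smaller tail value
print("test statistic:", round(x, 3))
print("left tail:", left, "right tail:", right)
print("two-sided p-value:", p_two_sided)
```

The one-sided p-values are simply `right` (for Ha: r > 0) and `left` (for Ha: r < 0).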

In addition to this correlation analysis, we also can plot the data points along with the least squares regression line through the data. This line, having equation y = ax + b, provides an approximate linear functional relationship between the values of xi and yi.

Using the CORRTEST Program

The CORRTEST program can be used to perform these hypothesis tests either on the statistics R and n or on data stored in lists. On the TI-83, enter the data into lists L1 and L2 in the STAT Edit menu. On the TI-86, enter the data into lists xStat and yStat in the LIST EDIT menu. On the TI-89, enter the data in columns c1 and c2 in a list called dist in the Data/Matrix Editor; this list is the current list after running many of the programs on this site.

The program displays the result of the test (based on an entered level of significance), the test statistic, the right and left tail probabilities, and the sample correlation.

Example. Assume that the tar level among all cigarette brands is normally distributed. We wish to test whether or not the correlation between the nicotine level and the tar level equals 0. The data below, produced by the Federal Trade Commission, list the levels of tar and nicotine (in milligrams) for a sample of various brands.

At the 0.05 level of significance, test the hypothesis that the correlation equals 0. Find the equation of the least squares line and use it to estimate a nicotine level for a tar level of 15.

Brand              Tar (mg)   Nicotine (mg)
Alpine             14.1       0.86
Benson & Hedges    16.0       1.06
Bull Durham        29.8       2.03
Camel Lights       8.0        0.67
Carlton            4.1        0.4
Chesterfield       15.0       1.04
Golden Lights      8.8        0.76
Kent               12.4       0.95
Kool               16.6       1.12
L&M                14.9       1.02
Lark Lights        13.7       1.01
Marlboro           15.1       0.9
Merit              7.8        0.57
MultiFilter        11.4       0.78
Newport Lights     9.0        0.74
Now                1.0        0.13
Old Gold           17.0       1.26
Pall Mall Light    12.8       1.08
Raleigh            15.8       0.96
Salem Ultra        4.5        0.42
Tareyton           14.5       1.01
True               7.3        0.61
Virginia Slims     15.2       1.02
Winston Light      12.0       0.82

Solution. After entering the data into the appropriate lists, call up the CORRTEST program and use the option for DATA. (On the TI-83, enter 1. On the TI-86, press F1 for the tool bar item. On the TI-89, press 1 to access the drop down menu, then press 1.) Then for LEVEL OF SIG., enter 0.05.

We receive a right tail value of 0 from a test statistic of 21.268, so we reject the claim that r = 0. The sample correlation is given as 0.976534.

Since the smaller tail value is 0 to the displayed precision, there is essentially no chance of R being as large as 0.976534 with a sample of this size if the true correlation were equal to 0. Thus we reject that r = 0 and also reject that tar levels and nicotine levels are independent. We conclude that r > 0 and that the levels of tar and nicotine are dependent.

We see that there is a strong correlation with R near 1. We may conclude further that the level of nicotine generally increases as the tar level increases. We can see this relationship by graphing a scatterplot of the data along with the regression line. To do so, simply press GRAPH. (The settings have already been adjusted in the program.)

This scatterplot is a graph of nicotine on the y-axis (from list L2, yStat, or c2) vs. tar on the x-axis (from list L1, xStat, or c1). The line through the data is the least squares regression line, which has been stored in Y1. To estimate the nicotine level for a tar level of 15, enter Y1(15) (or y1(15) on the TI-86 and 89). We obtain an estimated nicotine level of 1.04532.

Here's how to obtain the equation of the regression line separately when data is entered into lists:

On the TI-83: Press STAT, press the right arrow cursor to CALC, scroll down to LinReg(ax+b) and press ENTER. Then enter the command LinReg(ax+b) L1,L2.

On the TI-86: Press STAT, press F1 for CALC, then F3 for LinR. Then press LIST, then F3 for NAMES, then access xStat and yStat to enter the command LinR xStat,yStat.

On the TI-89: Press MATH, then 6 for Statistics, then 3 for Regressions, then 1 for LinReg. Enter the command LinReg c1,c2. Then press MATH, then 6 for Statistics, then 8 for ShowStat, and press ENTER.

We obtain the following results (with the values of a and b reversed on the TI-86).

LinReg
y = ax + b
a = .0611991508
b = .1273371681
r = .9765340135

Hence, y ~ 0.061199 x + 0.127337. So again, for a tar level of x = 15, the nicotine level should be around 0.061199 * 15 + 0.127337 = 1.045322.
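For readers without a calculator at hand, the same least squares line can be reproduced from the usual sums of squares. The following Python sketch is only an illustration: the function name least_squares and the list names tar and nicotine are hypothetical, and the data are copied from the table in the example.

```python
import math

# Tar and nicotine levels (mg) copied from the example's table, in order.
tar = [14.1, 16.0, 29.8, 8.0, 4.1, 15.0, 8.8, 12.4, 16.6, 14.9, 13.7, 15.1,
       7.8, 11.4, 9.0, 1.0, 17.0, 12.8, 15.8, 4.5, 14.5, 7.3, 15.2, 12.0]
nicotine = [0.86, 1.06, 2.03, 0.67, 0.4, 1.04, 0.76, 0.95, 1.12, 1.02, 1.01,
            0.9, 0.57, 0.78, 0.74, 0.13, 1.26, 1.08, 0.96, 0.42, 1.01, 0.61,
            1.02, 0.82]

def least_squares(xs, ys):
    """Return slope a, intercept b, and sample correlation for y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                    # slope
    b = mean_y - a * mean_x          # intercept
    corr = sxy / math.sqrt(sxx * syy)
    return a, b, corr

a, b, corr = least_squares(tar, nicotine)
estimate = a * 15 + b                # estimated nicotine for a tar level of 15
print("a =", a)
print("b =", b)
print("r =", corr)
print("estimate at x = 15:", estimate)
```

The printed slope, intercept, and correlation should agree with the LinReg output above up to rounding.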

Exercises

1. The data below give the high school GPA and the verbal SAT score of a random sample of students:

High School GPA vs. Verbal SAT Score
GPA     Verbal SAT
2.6     460
3.7     500
3.2     450
3.6     480
2.5     510
3.7     490
3.1     450
3.25    560
3       500
2.5     510
2.6     410
2.7     450
3.4     510
3.4     540
3.5     520
3.5     500
3.5     330
3.3     570
3.1     400
2.7     480
3.2     620
3.1     550
2.6     470
3.5     450
3.4     420
3.5     530
3.2     460
3       560
3.5     540
3.7     600
3.8     500
3.8     480
3.1     420
3.2     520
3.7     550
2.7     410

(a) With level of significance 0.10, test the hypothesis that there is no correlation between verbal SAT score and high school GPA. (b) With the same level of significance, what conclusion do you draw with regard to the one-sided alternative Ha: r > 0?

2. A study was performed on 40 subjects comparing height and IQ. The sample correlation was found to be R = -0.0346. At the 0.03 level of significance, test the hypothesis that there is no correlation between height and IQ.

3. If a person has high body density, then he should have less body fat. The following data lists measurements of body densities and percentages of body fat from a random sample of men aged 20 - 29.

Body Density vs. Body Fat
Density   Body Fat
1.0708    12.6%
1.0853    6.9%
1.0414    24.6%
1.0751    10.9%
1.034     27.8%
1.0502    20.6%
1.0549    19%
1.0704    12%
1.09      5.1%
1.0722    12%
1.083     7.5%
1.0812    8.5%
1.0622    16.1%
1.0551    19%
1.064     15.3%
1.0668    14.2%
1.0911    4.6%
1.091     4.7%
1.079     9.4%
1.0862    6.5%
1.0719    13.4%
1.0775    9.9%
1.0754    10.8%
1.0664    14.4%
1.055     19%
1.0322    28.6%
1.0873    6.1%
1.0416    24.5%
1.0776    9.9%
1.0542    19.1%

Find the sample correlation, the least squares line, and estimate the body fat for a body density of 1.045.

Solutions

1. (a) Executing the CORRTEST program as in the example above with the GPAs in list L1 (xStat or c1) and the SAT scores in list L2 (yStat or c2), we obtain a right tail value of 0.0929 and a sample correlation of 0.2256. With a two-sided test, we do not reject that r = 0 at the 0.10 level of significance, since the p-value is 2 * 0.0929 = 0.1858.

(b) For the one-sided alternative Ha: r > 0, the p-value is 0.0929 (from the right tail value). If r = 0, then there would be only a 9.29% chance of R being as large as 0.2256 with a sample of this size. Since this (one-sided) p-value is lower than the desired level of significance 0.10, we now can reject r = 0 in favor of the alternative r > 0.
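The two conclusions in parts (a) and (b) come from the same pair of tail values; only the alternative hypothesis changes. A minimal Python sketch of that decision rule follows. The function name p_value and the labels "two-sided", "greater", and "less" are hypothetical, the right tail value 0.0929 is taken from the CORRTEST output quoted above, and the left tail is assumed to be its complement.

```python
def p_value(left_tail, right_tail, alternative):
    """p-value for H0: r = 0 against the stated alternative."""
    if alternative == "two-sided":   # Ha: r != 0
        return 2 * min(left_tail, right_tail)
    if alternative == "greater":     # Ha: r > 0
        return right_tail
    if alternative == "less":        # Ha: r < 0
        return left_tail
    raise ValueError("unknown alternative: " + alternative)

ALPHA = 0.10                 # level of significance from the exercise
right_tail = 0.0929          # right tail value reported by CORRTEST
left_tail = 1 - right_tail   # assumed complement of the right tail

for alt in ("two-sided", "greater"):
    p = p_value(left_tail, right_tail, alt)
    decision = "reject" if p < ALPHA else "do not reject"
    print(f"Ha {alt}: p-value = {p:.4f}, {decision} r = 0")
```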

2. Bring up the CORRTEST program and use the option for STATS. Then enter 40 for SAMPLE SIZE, enter -.0346 for SAMPLE CORR., and enter .03 for LEVEL OF SIG.

We obtain a left tail value of 0.4161 from a test statistic of -0.2134165. Therefore we do not have significant evidence to reject the hypothesis that r = 0. If r were equal to 0, then there would still be a 41.61% chance of obtaining a sample correlation as low as R = -0.0346 with a sample of size 40.

3. We enter the densities under list L1 (xStat or c1) and the percentages of body fat under list L2 (yStat or c2). Then we compute the linear regression line as explained at the end of the original example above. We obtain an almost perfect negative correlation with a value of -0.9989998745.

The negative correlation means that as body density increases, body fat decreases.

Given a density x, the percentage of body fat is approximated by the least squares line y = ax + b = -403.4617559 x + 444.7041681. For x = 1.045, the body fat is estimated to be 23.087%.