This page introduces 3 commonly used parametric correlation / regression procedures.
Two nonparametric procedures are handled elsewhere: Spearman's correlation is calculated in the Chi Square for Large Contingency Tables and Spearman Correlation Coefficient Program Page, and regression using proportions is calculated in the Compare Two Regression Lines (Covariance Analysis) Program Page. As both are discussed in the Unpaired Proportions Explained Page, they are not covered further on this page.
Regression defines the relationship between measurements hierarchically.
There is an assumption of order, that the variable x comes first, and is
considered independent, while the variable y comes second, and is dependent upon x.
The model therefore implies either that knowing x allows y to be predicted, or that
changing x also changes y.
Statistically, only the y variable needs to be parametric (continuous and normally distributed). The x variable must be at least ordered (3>2>1), so it can be a binary, ordinal, interval, or ratio measurement. The result of the analysis is the formula y = a + bx, where a is the constant and b is the regression coefficient.
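As a minimal sketch of how a and b in y = a + bx can be obtained, the least-squares estimates are computed below. The function name fit_line and the five data points are hypothetical, chosen only for illustration; they are not the example data from the program page.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates of a (constant) and b (regression coefficient) in y = a + b*x."""
    mx, my = mean(x), mean(y)
    # Sum of squares of x and sum of cross products of x and y, about the means
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx          # regression coefficient (slope)
    a = my - b * mx        # constant (intercept)
    return a, b

# Hypothetical data: gestation in weeks (x) and birth weight in grams (y)
gest = [36, 37, 38, 39, 40]
bwt = [2900, 3100, 3300, 3450, 3650]

a, b = fit_line(gest, bwt)
print(f"BWt(g) = {a:.1f} + {b:.1f} Gest(weeks)")
```

Note that a is the predicted y when x is 0, which for gestation lies far outside the observed range, so a large negative constant is not unusual.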
Example : Interpretation is best demonstrated using the results of analysis of the example data in the Correlation and Regression Program Page. The data relate how birth weight (y in grams) depends on gestation (x in weeks). The data points, the regression line, and the 95% confidence interval around the regression line are shown in the diagram to the right.
Sample Size : Theoretically, regression differs from correlation in that only the y variable is assumed to be normally distributed, while the x variable only needs to be ordered. Sample size calculation for regression should therefore follow the Analysis of Variance model, based on the F distribution. However, the sample sizes so calculated are in most cases similar to those for correlation, and as correlation and regression are often calculated at the same time, the sample size calculated for correlation is also used for regression.
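One common approximation for the sample size needed to detect a correlation is based on Fisher's Z, and is sketched below. This is an assumption made for illustration and is not necessarily the method used by the StatTools programs; the function name and the default error rates (two-sided alpha of 0.05, power of 0.8) are also assumptions.

```python
import math
from statistics import NormalDist

def sample_size_for_correlation(r, alpha=0.05, power=0.8):
    """Approximate n needed to detect a correlation of size r (r != 0, |r| < 1).

    Uses the Fisher's Z approximation n = ((z_alpha + z_beta) / atanh(r))^2 + 3.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided Type I error
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - Type II error
    c = math.atanh(r)                               # Fisher's Z of the expected r
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

# Smaller expected correlations need larger samples
for r in (0.3, 0.5, 0.7):
    print(r, sample_size_for_correlation(r))
```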
Pearson's Correlation Coefficient (ρ) is a measure of how two normally distributed measurements are related, as shown in the diagram on the left.
When the relationship is precise and in the same order (green dots), ρ=1. When the relationship is precise but in reverse order (red dots), ρ=-1. When there is no relationship at all (blue dots), ρ=0.
The traditional results from analysis are the coefficient (ρ) and its Standard Error, assuming the coefficient to be a normally distributed variable. The statistical significance (whether the coefficient differs from the null value of 0) can then be evaluated using the t test.
However, it is increasingly accepted that the correlation coefficient is not truly normally distributed. As it cannot have a value outside of ±1, the variance on the extreme side is always narrower than the variance facing the zero value. The correlation coefficient is therefore only truly parametric when it has a value of 0, and the error increases as its value becomes nearer to ±1, as shown in the diagram to the right. Increasingly, therefore, Fisher's Z transformation of the coefficient is used, so that Z is normally distributed. After the Standard Error and confidence interval of Z are calculated, the results are transformed back to the original ρ unit. The algorithm for Fisher's Z transformation is as follows.
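A minimal sketch of these calculations in Python is given here. It uses the standard definitions Z = 0.5 ln((1+ρ)/(1-ρ)), Standard Error of Z = 1/sqrt(n-3), and the back-transformation ρ = (e^(2Z)-1)/(e^(2Z)+1), together with the traditional t = ρ sqrt((n-2)/(1-ρ^2)). The function names and the values ρ=0.8 and n=22 are hypothetical and are not taken from the program page.

```python
import math
from statistics import NormalDist

def t_statistic(r, n):
    """Traditional t for testing r against 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def fisher_ci(r, n, level=0.95):
    """Confidence interval of r via Fisher's Z transformation."""
    z = math.atanh(r)                     # same as 0.5 * ln((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)             # Standard Error of Z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    z_low, z_high = z - crit * se, z + crit * se
    # Back-transform to the original unit; tanh(z) = (e^(2z) - 1) / (e^(2z) + 1)
    return math.tanh(z_low), math.tanh(z_high)

# Hypothetical coefficient and sample size
r, n = 0.8, 22
low, high = fisher_ci(r, n)
print(f"t = {t_statistic(r, n):.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

The interval is symmetric in Z but not in ρ: for a positive coefficient the limit nearer to 1 lies closer to the estimate than the limit facing 0, which is the asymmetry described above.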
The algorithm is best demonstrated by the results of analysis of the example data from the Correlation and Regression Program Page, which is a correlation between gestation (in weeks) and birth weight (in grams).
Comparing two regression lines is the simplest model of covariance analysis.
It uses the independent variable x as covariate and dependent variable y as outcome in a 2 group analysis of covariance.
Two procedures are carried out.
Firstly, it computes the two regression lines y_{1} = a_{1} + b_{1}x and y_{2} = a_{2} + b_{2}x, then compares the two regression coefficients to see if they are significantly different. This is equivalent to evaluating whether interaction between groups and covariates exists.
Secondly, it assumes that the two regression coefficients are not significantly different (that there is no significant interaction), calculates a common regression slope for the whole set of data, then compares the mean dependent y values (adjusted for the common regression slope). This second part is the same as covariance analysis, where the dependent variable is y and the covariate is x.
Collectively, the first procedure, comparing the two regression coefficients, is equivalent to evaluating the presence of interaction between the covariate and the groups, and the second procedure, comparing the adjusted means of the two groups, is a simple analysis of covariance for two groups with one covariate, assuming no interaction.
Example
The program in the Compare Two Regression Lines (Covariance Analysis) Program Page is best understood by following the default example in the program. We wish to compare the birthweight of boys and girls, but we need to take into consideration the gestational age at birth. The data are therefore as follows. Sex is the sex of the baby (G for girls and B for boys), Gest is the gestational age (in weeks) at birth, and BWt is the birthweight in grams. The data are shown in the table to the left.
The initial results are shown in the table to the right. There were 11 boys and 11 girls in the study. Means and standard deviations for gestation (x) are 38 and 1.8 weeks for boys and 38 and 2.0 weeks for girls respectively. Means and standard deviations for birthweight (y) are 3268 and 351.4 g for boys and 3119 and 380.3 g for girls respectively. Correlation coefficients are high for both sexes, and the regressions are BWt(g) = -3772 + 185.3 Gest (weeks) for boys and BWt(g) = -3998 + 186.9 Gest (weeks) for girls.
The two slopes (b) are then compared. The results show that the two slopes are not significantly different, so that assuming a common slope is valid. In other words, there is no significant interaction between sex and gestation.
Assuming a common slope, the adjusted means are as follows. Using a common slope as a correction, the difference between the adjusted means is 165g, and this is statistically significant (p<0.05).
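The two procedures described above (comparing the slopes, then comparing the adjusted means under a common slope) can be sketched as follows. This is a minimal illustration using the standard textbook formulae, not the program's own code; the function names and the small data set are hypothetical and are not the example data from the program page. Only t statistics are returned, with their degrees of freedom, as probabilities would require a t distribution table or library.

```python
import math
from statistics import mean

def sums(x, y):
    """Group size, means, and sums of squares and products about the means."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return len(x), mx, my, sxx, sxy, syy

def compare_lines(x1, y1, x2, y2):
    n1, mx1, my1, sxx1, sxy1, syy1 = sums(x1, y1)
    n2, mx2, my2, sxx2, sxy2, syy2 = sums(x2, y2)
    b1, b2 = sxy1 / sxx1, sxy2 / sxx2

    # Procedure 1: difference between the two slopes (test of interaction),
    # using the residual variance pooled from the two separate regressions
    rss_sep = (syy1 - b1 * sxy1) + (syy2 - b2 * sxy2)
    df_slope = n1 + n2 - 4
    s2_sep = rss_sep / df_slope
    t_slope = (b1 - b2) / math.sqrt(s2_sep * (1 / sxx1 + 1 / sxx2))

    # Procedure 2: common slope, then difference between adjusted means
    bc = (sxy1 + sxy2) / (sxx1 + sxx2)
    rss_com = (syy1 + syy2) - bc * (sxy1 + sxy2)
    df_adj = n1 + n2 - 3
    s2_com = rss_com / df_adj
    adjusted_diff = (my1 - my2) - bc * (mx1 - mx2)
    se_adj = math.sqrt(s2_com * (1 / n1 + 1 / n2 + (mx1 - mx2) ** 2 / (sxx1 + sxx2)))

    return {
        "b1": b1, "b2": b2, "t_slope": t_slope, "df_slope": df_slope,
        "common_slope": bc, "adjusted_diff": adjusted_diff,
        "t_adjusted": adjusted_diff / se_adj, "df_adjusted": df_adj,
    }

# Hypothetical data: gestation (weeks) and birthweight (g) for two groups
gest_b, bwt_b = [36, 37, 38, 39, 40], [2950, 3150, 3300, 3500, 3700]
gest_g, bwt_g = [36, 37, 38, 39, 40], [2800, 2950, 3200, 3300, 3500]

res = compare_lines(gest_b, bwt_b, gest_g, bwt_g)
for name, value in res.items():
    print(f"{name}: {value:.3f}")
```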
We can therefore draw the following conclusions from this study.
These results are best illustrated in the diagram to the left.
Please note : These conclusions are only correct if the growth of babies near term is linear (in a straight line at 185g per week). Looking at the data carefully, this appears not to be true, as growth at earlier gestation appears faster and growth rates flatten nearer to 40 weeks. Users should therefore be aware that, regardless of how elegant the results seem to be, the validity of the conclusions ultimately depends on the validity of the model's assumptions.
Please also note : The data in this example are computer generated, based on a published growth model. They are constructed to demonstrate how the program works, and are not meant to represent actual growth physiology.
A note on graphic plotting
The program is accompanied by a plot, with the following default settings.
Comparing two regression coefficients using summary data
The Compare Two Summary Regression Coefficients Program Page is a similar program, but calculations are carried out using summary data. The purpose of this program is to enable covariance analysis, comparing the two regression coefficients and adjusted means, using results published by others, without the need to use the raw data itself. The input is a two column table
Correlation and regression
Significance and 95% confidence interval of correlation coefficient
t test : Armitage P. (1971) Statistical Methods in Medical Research. Blackwell Scientific Publications, Oxford. p. 156-163.
Confidence Interval : Altman DG, Machin D, Bryant TN and Gardner MJ. (2000) Statistics with Confidence, Second Edition. BMJ Books. ISBN 0 7279 1375 1. p. 89-92.
http://www2.sas.com/proceedings/sugi31/17031.pdf SAS paper discussing the need for Fisher's Transformation
Spearman's Correlation Coefficient
Compare Two Regression Lines
