Contents_3. Comparing Two Groups

Please note : the data presented in all course material for the statistical module are generated by computer to demonstrate the methodologies, and should not be confused with actual clinical information.
Comparing data from two groups is the most basic research model in epidemiology and controlled trials, and produces the effects and their Standard Errors that are used in meta-analysis to develop Evidence Based Practice.
This page supports the multiple programs that compare two groups in StatPgm 3a. Compare Two Measurements, StatPgm 3b. Compare Two Proportions, and StatPgm 3c. Compare Two Regressions.
This panel supports the programs in StatPgm 3a. Compare Two Measurements, which compare measurements from two groups of data.
Introduction
Sample size estimation is carried out in the planning stage of research. Discussion in this panel will use the default example data provided in StatPgm 3a. Compare Two Measurements.
We wish to examine the effects of two diets given to the mother during pregnancy on the birth weight of the babies. The subjects for the study are randomly allocated into two groups.
- Group 1 : pregnant women provided with a high protein diet
- Group 2 : pregnant women serving as controls, given no specific instructions about diet
- From past records, we know that the Standard Deviation of birth weight is 400g
- We will conclude that high protein affects birth weight (either way), if the difference of birth weight between the two groups is 150g or more.
- The probability of Type I Error (α) to be used to determine statistical significance is the module default value of p=0.05, or the 95% confidence interval of the difference
- The power used is also the module default value of 0.8
- The known Standard Deviation is 400
- The difference to be detected or tested for is 150
- If the question is "Are babies of mothers given protein supplement heavier than those of mothers not given protein supplement?", then we are only interested in one direction of the results, as we are not interested in whether giving protein supplement reduces birth weight. In this case the one tail model is appropriate, and we will need 89 cases for each of the two groups, a total of 178 cases.
- If the question is "Is there any difference in birth weight between those who did and did not receive protein supplements?", then we are interested in a difference in both directions, whether receiving the supplement results in bigger or smaller babies. In this case, we will need 113 cases per group, a total of 226 cases.
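As a sketch, the calculation above can be written out with Python's statistics module. The normal-approximation formula below gives 88 and 112 cases per group; the program's 89 and 113 presumably add the small-sample t correction. The function name is ours, not from StatPgm 3a.

```python
import math
from statistics import NormalDist

def sample_size_two_means(sd, diff, alpha=0.05, power=0.8, tails=2):
    """Cases needed per group to detect `diff` given a known SD
    (normal approximation: n = (z_alpha + z_beta)^2 * 2*SD^2 / diff^2)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / tails)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * 2 * sd ** 2 / diff ** 2
    return math.ceil(n)

# Birth weight example: SD = 400g, difference to detect = 150g
print(sample_size_two_means(400, 150, tails=1))  # 88 (StatPgm reports 89)
print(sample_size_two_means(400, 150, tails=2))  # 112 (StatPgm reports 113)
```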
Although sample size can be calculated to great precision, the parameters required cannot be accurately estimated. The main problem is the background or population Standard Deviation. This value is very difficult to obtain, as nearly all values found in publications or previous observations are from samples, and are therefore estimates with variations. There is no assurance that the data collected will have the same Standard Deviation, so there is no assurance that the sample size is correct. If it is accepted that sample size estimation is at best an approximation, then short cut approximations can be used, bypassing calculations altogether. The following approximate sample size algorithm is therefore often used, based on the default α=0.05 and power=0.8.
- In a two group comparison where the difference to be detected is very large, in other words the researcher is only looking for the clinically obvious, a difference between the means equal to or bigger than the Standard Deviation (Difference/SD>=1.0) can be assumed, and the sample size required per group is 14 for the one tail model and 17 for the two tail model
- In a two group comparison where the difference to be detected is moderate, clinically meaningful but not obvious (the most common situation in clinical research), a difference between the means half the Standard Deviation (Difference/SD=0.5) can be assumed, and the sample size required per group is 51 for the one tail model and 64 for the two tail model
- In a two group comparison where the difference to be detected is small, as in some physiological or forensic projects, a difference between the means a fifth of the Standard Deviation (Difference/SD=0.2) can be assumed, and the sample size required per group is 310 for the one tail model and 394 for the two tail model
In clinical research, data loss is common: cases depart from the project before data collection is completed, and there are errors in recruitment, group allocation, data recording, and so on. A data loss of 5% to 30% is not uncommon. The planned sample size must therefore be adjusted upwards to compensate for anticipated data loss.
Introduction
Comparisons of the means of two groups are made under the following assumptions
- That the measurement used is parametric : continuous measurements that are Normally distributed
- That, other than the parameter that separates them (difference in treatment or classification), the cases in the two groups are similar, with similar variance (distribution around the mean)
Analysis of Results
Discussion in this panel uses the default example data from StatPgm 3a. Compare Two Measurements, a comparison of maternal height in two groups of pregnant mothers, 24 who required delivery by Caesarean Section, and 25 who delivered normally.
- In Group 1, Caesarean Section, sample size n1=24, mean height mean1=153.9cms, and Standard Deviation SD1=3.1cms.
- In Group 2, Normal Delivery, sample size n2=25, mean height mean2=157.1cms, and Standard Deviation SD2=2.8cms.
- The difference (mean1 - mean2) = -3.2cms : those requiring Caesarean Section were 3.2cms shorter than those delivered normally
- The Standard Error of the difference = 0.84cms
- degrees of freedom = n1 + n2 -2 = 24 + 25 - 2 = 47
**Statistical significance using the t test**, assuming variances in the two groups are similar
- Student's t = difference / Standard Error = -3.2 / 0.84 = -3.8 (values are rounded; the program works at 4 decimal precision)
- Probability of t=-3.8 and df=47 is 0.0002 for the one tail model, and 0.0004 for the two tail model, both statistically significant
**Statistical significance using the 95% confidence interval of the difference**
- For the 95% confidence interval, 5% (0.05) is excluded
- Student's t for p=0.05 and degrees of freedom=47 is t=1.68 for the one tail model and t=2.01 for the two tail model
- The 95% confidence interval is therefore
- For the 1 tail model, -∞ to mean+t*SE = -∞ to -3.2 + 1.68*0.84 = -∞ to -1.8
- For the 2 tail model, mean±t*SE = -3.2±2.01*0.84 = -4.9 to -1.5
- The 95% confidence interval for both models did not overlap the null value (0), so the difference is statistically significant
- We can draw the conclusion that those requiring Caesarean Section were significantly shorter than those delivered normally
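The summary-statistic arithmetic above can be checked with a short script. This is a sketch using the pooled-variance t formula; the variable names are ours, not from StatPgm 3a.

```python
import math

# Pooled-variance t test from summary statistics (maternal height example)
n1, m1, sd1 = 24, 153.9, 3.1   # Group 1 : Caesarean Section
n2, m2, sd2 = 25, 157.1, 2.8   # Group 2 : Normal Delivery

df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (m1 - m2) / se

print(f"df={df}, SE={se:.2f}, t={t:.2f}")
# two tail 95% CI, using t(0.05, 47) = 2.01 as quoted in the text
print(f"95% CI: {(m1 - m2) - 2.01 * se:.1f} to {(m1 - m2) + 2.01 * se:.1f}")
```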
Data Plot
The data are plotted as in the diagram to the right. Each circle represents a data point, the vertical lines the 95% confidence intervals of the measurements, the horizontal lines the means, and the boxes the 95% confidence intervals of the means.
Introduction
Nonparametric data are those where normal distribution cannot be assumed, so that powerful and convenient procedures such as the difference between the means cannot be used. In many situations, the data can be transformed into another set of measurements which is normally distributed, so that parametric calculations can proceed. How this can be done is discussed in Contents_1. Probability.
There are however data where no amount of transformation or mathematical manipulation of the numbers will result in normal distribution. These are usually close ended ordinal measurements with a narrow range, where the measurements are stepwise and the intervals undefined. Examples are pain measurements (0=none, 1=some, 2=severe) and Likert Scales (1=strongly disagree, 2=disagree, 3=neutral, 4=agree, 5=strongly agree).
There are numerous nonparametric statistical procedures, catering to a wide range of research models. This panel however will discuss only the most commonly used procedure, the Robust Rank Ordered Test, provided in StatPgm 3a. Compare Two Measurements. The Robust Rank Ordered Test used to be called the Mann-Whitney U Test, named after the statisticians who developed it. In recent years, the Mann-Whitney U Test has evolved into a much more complex test, and the original Mann-Whitney U Test is now generally referred to as the Robust Rank Ordered Test. Both names refer to the same test in this module, but students should be aware that, in recent text books, on the Internet, and in some statistical packages, some confusion exists as to which test the Mann-Whitney U Test refers to.
Discussion in this panel will use the default example data from that program. This is from a clinical trial, where low risk women in labour were randomly allocated to be cared for by midwives (Group 2) or doctors (Group 1).
The data are collected from women after birth, the Likert Scale response to the statement "I am satisfied with the standard of care I received during labour". The measurements are 1=Strongly Disagree, 2=Disagree, 3=Neutral, 4=Agree, 5=Strongly Agree
The program first converts the data into ranks, and produces a table of counts as shown to the right, giving the number of cases in each group for each rank. The program then calculates Mann-Whitney's U, which depends on the difference in ranks between the groups (group 1 - group 2). A negative value means that group 1 cases are ranked lower than group 2, and a positive value means group 1 cases are ranked higher. In this example U=-1.77, indicating that group 1 (Drs.) ranks lower than group 2 (Mws): this group of women ranked doctors lower than midwives in satisfaction with the quality of care. The program then calculates the Probability of Type I Error (α or p), and finds the difference between the two groups not statistically significant (n.s.). The conclusion is that there is no significant difference in satisfaction between those looked after by doctors and midwives.
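The ranking arithmetic behind the test can be sketched as below. The Likert responses are invented for illustration, and are NOT the program's default data set; the function name is ours.

```python
def mann_whitney_u(group1, group2):
    """Return U for group1 (U1 = R1 - n1(n1+1)/2), using midranks for ties."""
    pooled = sorted(group1 + group2)
    # midrank of each distinct value = average rank of its tied block
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        i = j
    r1 = sum(ranks[x] for x in group1)       # rank sum of group 1
    n1 = len(group1)
    return r1 - n1 * (n1 + 1) / 2

doctors  = [1, 2, 2, 3, 3, 4, 5]   # hypothetical Likert scores, group 1
midwives = [2, 3, 4, 4, 5, 5, 5]   # hypothetical Likert scores, group 2
u1 = mann_whitney_u(doctors, midwives)
n1, n2 = len(doctors), len(midwives)
print(u1, n1 * n2 / 2)   # compare U1 with its null expectation n1*n2/2
```

Here U1 falls below its null expectation, so group 1 tends to rank lower; a significance test would then apply the normal approximation with a tie correction.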
## Sample size for Mann-Whitney U Test
There is no precise method of estimating the sample size requirement for nonparametric tests, because the distribution of the data is unstated. However, nonparametric procedures have approximately 95% of the power of parametric procedures, so for similar types of research the same sample size, plus an additional 10% of cases, can be used. Using the approximate sample sizes discussed in parametric comparisons, the following rule of thumb can be used.
- In a two group comparison where the difference to be detected is very large, in other words the researcher is only looking for the clinically obvious, 15 cases per group for the one tail model and 18 for the two tail model are required
- In a two group comparison, where the difference to be detected is moderate, clinically meaningful but not obvious, the most common situation in clinical research, then 56 per group for the one tail model and 70 for the two tail model are required
- In a two group comparison, where the difference to be detected is small, as in some physiological or forensic projects, then 340 per group for the one tail model and 440 for the two tail model are required
Introduction
This panel supports the programs that compare two proportions in StatPgm 3b. Compare Two Proportions.
Binary data indicate positive or negative in an attribute of interest. These are no/yes, positive/negative, false/true, male/female, or anything else that can be one or the other. Numerically, each case can be either 0 for negative or 1 for positive. The summary of binary data is the number of positives (NPos) and the number of negatives (NNeg) in a data set.
A proportion is the statistical effect of binary data. The following terms are often used to represent proportions
- **Proportion** is the mathematical representation, the ratio of positives to the total number in the data. Proportion = NPos / (NPos + NNeg), a number between 0 for no positives and 1 for all positives.
- **Percent** is an alternative to proportion that is intuitively easier for non-mathematicians to understand. Percent = proportion * 100.
- **Probability** has the same calculation as proportion, but conveys a different concept: how likely a positive case is to be encountered
- **Risk** is the same as probability, except that it is used in the clinical and insurance domains, denoting how likely the event of interest is to be encountered
- Proportion, probability, and risk are usually expressed as a number between 0 and 1, so that 0.25 and 25% are different expressions of the same numerical value.
For example, if a hospital delivers 1000 babies in a month, and 90 of those babies died during or after birth, then the proportion, risk, and probability of perinatal death is 90 / 1000 = 0.09 or 9%.
The odd is another statistical effect of binary data, the ratio of positives to negatives in the data. Odd = NPos / NNeg, and can be any number greater than 0. In our perinatal death example, the odd of perinatal death is 90 / 910 = 0.099.
Odd originated in gambling, as it defines the ratio of winning against losing. Its relatively simple calculation allows flexibility, and the odd is extensively used in the multivariate calculation involving binary data, logistic regression. The comparison of odds in two groups is the Odds Ratio. This calculation differs from the risk ratio or risk difference, because it makes no presumption that the group variable precedes and influences the outcome variable. The odds ratio is therefore the preferred method for retrospective matched paired studies, where the cases are grouped according to outcome, and the frequencies of the causal attribute compared.
- Risk = NPos / (NPos+NNeg)
- Odd = NPos / NNeg
- Odd = Risk / (1-Risk)
- Risk = Odd / (1+Odd)
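The identities above can be sketched as small helper functions (the names are ours), applied to the perinatal death example.

```python
# Risk and odd from counts of positives and negatives (illustrative helpers)
def risk(n_pos, n_neg):
    return n_pos / (n_pos + n_neg)

def odd(n_pos, n_neg):
    return n_pos / n_neg

# Perinatal death example: 90 deaths, 910 survivors
r = risk(90, 910)   # 0.09
o = odd(90, 910)    # 0.099 to 3 decimal places
print(round(r, 3), round(o, 3))
print(round(o / (1 + o), 3))   # Risk recovered from Odd
print(round(r / (1 - r), 3))   # Odd recovered from Risk
```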
## 95% confidence interval of proportions
- Proportion is the summary of individual positive and negative cases in the data, represented as a number between 0 and 1 (0.25=25%), and treated as a mean
- The Standard Error of a proportion (p) with a sample size (n) is SE = sqrt(p(1-p)/n)
- It is assumed to be population based, and normally distributed
- In calculations of confidence interval, the population z is used instead of t
- For the 1 tail model, the 95% confidence interval is < proportion+1.65SE or > proportion-1.65SE, depending on the direction of interest
- For 2 tail model, 95% confidence interval = proportion±1.96SE
- The same applies for odd, risk difference, Log(risk ratio), and Log(odds ratio)
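A minimal sketch of the Standard Error and 2 tail confidence interval formulas above, applied to the perinatal death example (90 of 1000 births):

```python
import math

# 95% confidence interval of a proportion (z = 1.96 for the 2 tail model)
def prop_ci(p, n, z=1.96):
    se = math.sqrt(p * (1 - p) / n)   # SE = sqrt(p(1-p)/n)
    return p - z * se, p + z * se

lo, hi = prop_ci(0.09, 1000)
print(f"95% CI: {lo:.3f} to {hi:.3f}")
```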
All tests comparing two proportions use 4 numbers: NPos1 and NPos2, the numbers of cases with positive attributes in groups 1 and 2, and NNeg1 and NNeg2, the numbers of cases with negative attributes in groups 1 and 2. This module presents 5 tests that compare two proportions
- Two older tests based on mathematical theories
- The Chi Square Test for 2x2 contingency tables
- Fisher's Exact Probability Test
- Three tests orientated more towards needs of clinical studies
- Risk Difference between two groups
- Risk Ratio between two groups
- Odds Ratio between two groups
Sample size for the Odds Ratio, when used in retrospective matched pair controlled studies, requires a different approach to estimation. As this is an advanced subject it will not be covered in the statistics module. Otherwise, the sample size calculations are valid for all other comparisons of two proportions.
Discussion in this panel will use the default example in the sample size procedure StatPgm 3b i. This is for a controlled trial comparing two groups of women in labour, where one group (group 1, treatment group) received oxytocin in the third stage, and the other group (group 2, control group) did not receive oxytocin. The outcome to compare is the proportion of women that developed post-partum haemorrhage in the two groups.
From past experience, we expect that 6% (0.06) of women without any medication in the third stage would suffer a post-partum haemorrhage. In this trial, we would be happy to accept that oxytocics are useful if they can halve the post-partum haemorrhage rate to 3% (0.03). The two proportions being compared are therefore 0.06 and 0.03. We use the default values of probability of Type I Error α=0.05 and power of 0.8 in this estimation.
The results are that we need 590 cases per group for the 1 tail model, and 749 cases per group for the 2 tail model. As we are only interested in whether giving oxytocics reduces PPH, and not concerned with whether it increases PPH, the 1 tail model is used. The sample size is therefore 590 cases in each group, a total of 1180 cases.
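The estimate can be sketched with the pooled-variance normal approximation for two proportions, which reproduces the 590 and 749 quoted above. The function name is ours, not from StatPgm 3b i.

```python
import math
from statistics import NormalDist

def sample_size_two_props(p1, p2, alpha=0.05, power=0.8, tails=1):
    """Cases needed per group to detect a difference between proportions
    p1 and p2 (pooled-variance normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / tails)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Oxytocin trial: PPH rates of 6% (control) vs 3% (treated)
print(sample_size_two_props(0.06, 0.03, tails=1))  # 590
print(sample_size_two_props(0.06, 0.03, tails=2))  # 749
```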
Introduction
Both the Chi Square Test and Fisher's Exact Probability Test are based on an estimate of the probability of obtaining the numbers in the 4 cells if the two groups have the same proportions. Fisher's Exact Probability Test is more powerful, in that it is more likely to reject the null hypothesis than the Chi Square Test. However, Fisher's Exact Probability is based on the exact (hypergeometric) distribution of the table counts (not covered in this module), which requires extensive iterations (repetitions) in calculation, more so when the sample size is large. Even a powerful modern computer has difficulty coping with Fisher's Exact Probability if the total sample size exceeds 500, and the program will either crash or produce unpredictable errors in the results.
Using the default example data from StatPgm 3b. Compare Two Proportions, the Chi Square Test results in a Probability of Type I Error (α) of p=0.053, marginally not statistically significant, while Fisher's Exact Probability Test results in p=0.044, statistically significant.
Introduction
Risk Difference is used mostly in controlled trials, where cases are randomly allocated to two groups to receive different treatments, and the difference between the two risks, its Standard Error, and its 95% confidence interval are calculated. An addition to the risk difference is the Number Needed to Treat (NNT), the reciprocal of the risk difference. Using the default example data from StatPgm 3b. Compare Two Proportions, the results of the calculation for risk difference are as follows
- Risk_{Grp 1} = 0.1667, Risk_{Grp 2} = 0.075
- Risk Difference = 0.0917, Standard Error (SE) = 0.051
- 95% Confidence Interval of Difference (1 tail) = >0.0077
- 95% Confidence Interval of Difference (2 tail) = -0.0083 to 0.1917
- Number Needed to Treat (NNT) = 11
We conduct a controlled trial on the use of oxytocin in the management of the third stage of labour, where women in labour were randomly allocated to receive oxytocic (group 1) or nothing (group 2) at the end of the second stage, and the risk of post-partum haemorrhage (PPH) compared. The results are as follows
- Group 1 (Oxytocic group) : Total = 94, PPH = 2, Risk r1 = 2/94 = 0.021 (2.1%)
- Group 2 (control) : Total 104, PPH = 15, Risk r2 = 15/104 = 0.144 (14.4%)
- Risk Difference rd = 0.021 - 0.144 = -0.123, and the Standard Error = 0.038.
- The 95% confidence interval of the difference (2 tail) is -0.123±1.96*0.038 = -0.197 to -0.049.
- The Number Needed to Treat NNT = 1/0.123 = 8.13, rounded up to 9.
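The arithmetic above can be reproduced with a short script (a sketch, not the StatPgm code; variable names are ours):

```python
import math

# Risk difference, its 95% CI, and NNT for the oxytocin trial
n1, pos1 = 94, 2      # oxytocic group : total, PPH cases
n2, pos2 = 104, 15    # control group : total, PPH cases
r1, r2 = pos1 / n1, pos2 / n2
rd = r1 - r2
se = math.sqrt(r1 * (1 - r1) / n1 + r2 * (1 - r2) / n2)
lo, hi = rd - 1.96 * se, rd + 1.96 * se
nnt = math.ceil(1 / abs(rd))   # rounded up, as in the text
print(f"rd={rd:.3f}, SE={se:.3f}, 95% CI {lo:.3f} to {hi:.3f}, NNT={nnt}")
```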
The 95% confidence interval of the risk difference did not overlap the null (0) value, so the difference is statistically significant. We can therefore conclude that those receiving oxytocic had 4.9% to 19.7% less PPH compared with those receiving no oxytocic. Also, 9 women need to receive oxytocic to prevent a single case of PPH.
Introduction
Risk Ratio (rr) was initially developed by epidemiologists to examine causal relationships between binary variables, such as whether hypertension causes early death or obesity causes diabetes. The advantage of the risk ratio over the risk difference is that, being a ratio, it is not constrained by the assumption of normal distribution. This is particularly convenient when group sizes are not similar, when the risk levels are close to 0 or 1, or when the risks in the two groups are widely different. Increasingly, therefore, risk ratios are also used to analyse data from controlled trials.
It is important to realize that ratios have a log-normal distribution. This means that the log value of the ratio is normally distributed. The effect is therefore log(risk ratio) and its Standard Error. The log values are then converted back to the normal scale using the exponential function. The final 95% confidence interval is therefore skewed because of this conversion.
Using the default example data from StatPgm 3b. Compare Two Proportions, the results of the calculation for risk ratio are as follows
- Risk_{Grp 1} = 0.1667, Risk_{Grp 2} = 0.075
- Risk Ratio = 0.1667 / 0.075 = 2.2222
- Log(Risk Ratio) = 0.7985, Standard Error (SE_{Log(Risk Ratio)}) = 0.367
- 95% Confidence Interval of Risk Ratio (1 tail) = >1.215
- 95% Confidence Interval of Risk Ratio (2 tail) = 1.0823 to 4.5626
We conducted an epidemiological study, comparing the Caesarean Section (CS) rate of women from the Mainland (group 1) against that of women resident in Hong Kong (group 2). The analysis is as follows
- Mainland women (group 1) : total = 98, CS = 8, Risk r1 = 8/98 = 0.082 (8.2%)
- Hong Kong women (Group 2) : total = 220, CS = 40, risk r2 = 40/220 = 0.182 (18.2%)
- The risk ratio rr = 0.082 / 0.182 = 0.449
- The logarithm of risk ratio Lrr = log(rr) = Log(0.449) = -0.8008
- The Standard Error of Lrr = 0.3678.
- The 95% confidence interval of Lrr = -0.8008-1.96(0.3678) to -0.8008+1.96(0.3678) = -1.5216 to -0.08
- Converting back to non-log values, the 95% confidence interval is exp(-1.5216) to exp(-0.08) = 0.2184 to 0.9232, rounded to 0.22 - 0.92.
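The log-scale arithmetic can be reproduced as follows (a sketch; group 1 uses 8 CS cases, the count consistent with the quoted risk of 0.082 and SE of 0.3678, and the SE uses the standard 1/a - 1/n1 + 1/c - 1/n2 formula):

```python
import math

# Risk ratio with log-scale 95% CI : Mainland vs Hong Kong CS rates
a, n1 = 8, 98     # CS cases / total, group 1 (Mainland)
c, n2 = 40, 220   # CS cases / total, group 2 (Hong Kong)
rr = (a / n1) / (c / n2)
lrr = math.log(rr)
se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
lo, hi = math.exp(lrr - 1.96 * se), math.exp(lrr + 1.96 * se)
print(f"rr={rr:.3f}, log(rr)={lrr:.4f}, SE={se:.4f}, 95% CI {lo:.2f} to {hi:.2f}")
```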
The 95% confidence interval of log(risk ratio) did not overlap the null value (0), which is the same as the 95% confidence interval of the risk ratio not overlapping the null value (1). The ratio is therefore statistically significant. We can therefore conclude that the risk of Caesarean Section in women from the Mainland is 0.22 (22%) to 0.92 (92%) of that in women resident in Hong Kong.
## Introduction
The main advantage of the odds ratio is that there is no assumption of causal sequence: the group and dependent variables can each be either cause or outcome. In other words, the statistical model assesses the binary relationship between two groups, so it can be used flexibly. Although the odds ratio can be used to analyse data from controlled trials and epidemiological surveys, a particularly useful model is the retrospective matched paired controlled trial. This model is used to determine causal factors in rare conditions, where each case with the condition is matched with one or more normal cases that are similar in every aspect except the suspected causal factor, and the relative frequencies of exposure to the causal factor in the two groups are compared. Such models were used to establish the association between thalidomide and limb defects.
The default example data, from StatPgm 3b. Compare Two Proportions, is computer generated to represent an imaginary retrospective matched paired controlled study examining whether exposure to a cigarette smoking environment during pregnancy is associated with the child's subsequent poor school performance. We were able to find 60 primary school children categorized as poor learners (group 1); each case is matched with 4 normal children from a similar background born at about the same time (group 2 sample size = 4x60 = 240). We asked each mother to recall whether, during the pregnancy for that child, she smoked or a member of the household smoked.
The analysis is as follows - Children with learning difficulty (Group 1) : exposure to smoke = 10, non exposure = 50, odd of exposure O1 = 10 / 50 = 0.2
- Normal Children (Group 2) : exposure to smoke = 18, non exposure = 222, odd of exposure O2 = 18 / 222 = 0.0811
- Odds Ratio or = 0.2 / 0.0811 = 2.47
- Log Odds Ratio Lor=Log(or)=Log(2.47)=0.9
- Standard Error=0.42.
- The 95% confidence interval of Lor = 0.9±1.96(0.42) = 0.07 to 1.73.
- These were converted back to non-log values by exp(0.07) = 1.07 to exp(1.73) = 5.67
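The calculation above can be sketched as follows (using the standard 1/a + 1/b + 1/c + 1/d formula for the SE of the log odds ratio):

```python
import math

# Odds ratio with log-scale 95% CI : smoking-exposure example
a, b = 10, 50     # group 1 (poor learners) : exposed, not exposed
c, d = 18, 222    # group 2 (matched controls) : exposed, not exposed
or_ = (a / b) / (c / d)
lor = math.log(or_)
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = math.exp(lor - 1.96 * se), math.exp(lor + 1.96 * se)
print(f"OR={or_:.2f}, log(OR)={lor:.2f}, SE={se:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```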
Given that the 95% confidence interval of Lor did not overlap the null value of 0, which is the same as the 95% confidence interval of the odds ratio not overlapping the null value (1), the result is statistically significant. The result therefore allows us to conclude that exposure to cigarette smoke during pregnancy is associated with subsequent learning difficulty of the child.
Introduction
Comparing two regression lines allows the comparison of two groups to be adjusted for a covariate. The default example data from StatPgm 3c. Compare Two Regressions is used to demonstrate how this is carried out. The data are from 22 newborns, 10 boys and 12 girls, arranged in a 3 column table.
- Column 1 is the group designation, in this example boy or girl
- Column 2 is the covariate (x in regression), in this example gestation in weeks
- Column 3 is the dependent variable (y in regression), in this example birth weight in g.
- The two regression lines are computed, y_{1} = a_{1} + b_{1}x and y_{2} = a_{2} + b_{2}x. In this example
  - Group 1 (Boys) : y_{1} = -3772 + 185.3x, or Birth weight (g) = -3772 + 185.3 Gestation (week)
  - Group 2 (Girls) : y_{2} = -3999 + 186.9x, or Birth weight (g) = -3999 + 186.9 Gestation (week)
- The regression coefficients from the two groups are compared to see whether the difference between them is statistically significant. In this example, 185.3 from group 1 (Boys) and 186.9 from group 2 (Girls): the difference of 1.6g/week was statistically not significant
If the regression coefficients are significantly different, it means that boys and girls grow at different rates near term, so gestation cannot be used as a covariate. The technical term is that **a significant interaction** exists between grouping and covariate. Only when the two regression coefficients are not significantly different can gestation be used as a covariate to adjust the values for comparison
- Assuming that both groups have a similar regression coefficient, a new regression coefficient is calculated, combining the data from both groups. In this example, the combined regression coefficient is 186.2g/week
- Using the combined regression coefficient, the birth weights are recalculated in terms of differences between the actual birth weight and the expected birth weight defined by gestation and the combined regression line
- The recalculated birth weights from the two groups are then compared
- The result is the difference in the means from the two groups, adjusted for the influence of gestation. In this example, the adjusted difference is 165g, with a 95% confidence interval of 79g to 251g
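The procedure above can be sketched numerically on small invented data (NOT the program's 22 newborns). The slopes are ordinary least squares; the combined slope pools the within-group sums of squares; and the adjusted difference follows the covariate adjustment described.

```python
# Within-group sums for a least-squares slope: Sxy and Sxx about the group means
def sums(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy, sxx

# Invented gestation (weeks) and birth weight (g) data for illustration
x1 = [36, 37, 38, 39, 40]; y1 = [2900, 3080, 3260, 3450, 3640]  # "boys"
x2 = [35, 36, 37, 38, 39]; y2 = [2600, 2780, 2970, 3150, 3330]  # "girls"

sxy1, sxx1 = sums(x1, y1)
sxy2, sxx2 = sums(x2, y2)
b1, b2 = sxy1 / sxx1, sxy2 / sxx2       # separate regression coefficients
b = (sxy1 + sxy2) / (sxx1 + sxx2)       # combined (pooled) coefficient

# Difference in mean birth weight adjusted for gestation:
# (mean_y1 - mean_y2) - b * (mean_x1 - mean_x2)
adj = (sum(y1) / len(y1) - sum(y2) / len(y2)) - b * (sum(x1) / len(x1) - sum(x2) / len(x2))
print(f"b1={b1:.1f}, b2={b2:.1f}, combined b={b:.1f}, adjusted diff={adj:.0f}g")
```

Here the unadjusted difference in means (300g) shrinks to 116g once the one-week difference in mean gestation is removed, illustrating why the adjustment matters.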
The data points and the two regression lines (blue for group 1, boys, and red for group 2, girls) are plotted as shown to the right. The combined regression line is shown in black. The difference in birth weight between the two groups (boys and girls) after adjusting for the covariate (gestation) is more precise, as the adjustment reduces the random variation caused by the babies being born at different gestational ages.