This page explains 6 sets of concordance programs available from StatTools. Although each program is
discussed in some detail in its own panel, this introductory page provides a summary and overview.
Concordance is a measure of agreement between opinions or measurements, so it concerns the measurements and judgements rather than the subjects being measured or the judges. The programs from StatTools differ in the type of measurement they handle.
Kappa for nominal data was first described by Fleiss in 1969. Fleiss went on to describe another Kappa for ordinal data,
and his name is often associated with this second Kappa. The first Kappa, which is discussed in this panel, is generally known
as Kappa for Nominal Data. The program is in the Kappa for Nominal Data Program Page.
Kappa is a measurement of concordance or agreement between two or more judges, in the way they classify or categorise subjects into different groups or categories. The following terms are often used
We use a class of 10 students in their final school year. Each of the 5 counsellors interviews every student and classifies them into the following categories.
The alternative data entry is a table of counts. The table has 10 rows, representing the 10 students, and 3 columns, representing the 3 classifications. Each cell contains the number of times that student is classified in that category. In this example, the first column is for caring profession, the second engineering, and the third business, so that
Kappa in this example is 0.41, with a Standard Error of 0.08 and a 95% confidence interval of 0.27 to 0.57. This Kappa is a measurement of agreement between the 5 counsellors. Conventionally, a Kappa of <0.2 is considered poor agreement, 0.21 to 0.4 fair, 0.41 to 0.6 moderate, 0.61 to 0.8 strong, and more than 0.8 near complete agreement. As Kappa is an estimate from a sample, the Standard Error (se) provides an estimate of its error. The 95% confidence interval is Kappa +/- 1.96 se. If a different confidence interval is required, the table for probability of z in the Probability of z Explained and Table Page can be consulted. Although concordance is usually used as a scalar measurement of agreement, a 95% confidence interval of Kappa that does not cross the zero value does allow a conclusion that significant concordance exists.
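The calculation of Kappa from a table of counts can be sketched in a few lines of Python. This is a minimal illustration of the commonly published Fleiss formula, not the StatTools program itself; the function name `fleiss_kappa` and the counts in `table` are made up for the illustration and are not the data used on this page. The Standard Error is not calculated here.

```python
def fleiss_kappa(counts):
    """Kappa for nominal data from a subjects-by-categories table of counts.

    counts[i][j] is the number of raters who placed subject i in category j.
    Every row must sum to the same number of raters.
    """
    n_subjects = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject must be rated by the same number of raters")

    # Proportion of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]

    # Agreement for each subject: proportion of rater pairs that agree.
    p_subj = [
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_obs = sum(p_subj) / n_subjects      # mean observed agreement
    p_exp = sum(p * p for p in p_cat)     # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical counts: 10 students, 5 counsellors, 3 categories
# (caring profession, engineering, business). Not the page's data.
table = [
    [5, 0, 0], [4, 1, 0], [0, 5, 0], [1, 3, 1], [0, 1, 4],
    [2, 2, 1], [0, 0, 5], [3, 1, 1], [1, 4, 0], [0, 2, 3],
]
print(round(fleiss_kappa(table), 3))
```

Each row of `table` sums to 5, one count for each counsellor, which is the check the function performs before calculating.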
Cohen's Kappa
Cohen's Kappa is a measurement of concordance or agreement between two raters or methods of measurement. The method can be applied to data that are not normally distributed, even binary (no/yes) data, but is best suited to a closed-ended ordinal scale, such as the 5 point Likert Scale.

The algorithm is well described in the original papers, in text books, and on the Internet (see references). The book by Fleiss is particularly useful as it combines all the developments and enhancements, including the algorithm for estimating variance.

There are two ways of calculating Cohen's Kappa, and these produce different results. The first is Cohen's original 1960 algorithm, now generally known as the unweighted Kappa. The second is the weighted method, described by Cohen later, in 1968, which includes a weight for each cell, where the weight for cell i,j is w_{ij} = 1 - |i - j| / (g - 1), g being the number of categories of scores. Cohen argued that the weighted Kappa should be used particularly if the variables have more than two categories (more than yes and no), because the distance from agreement should be taken into consideration. The results of both calculations are presented, and the recommendation is to use the weighted value.

Fleiss's Kappa

Fleiss's Kappa is an extension of Cohen's Kappa to evaluate concordance or agreement between multiple raters, but no weighting is applied. Therefore, Fleiss's Kappa is similar to Cohen's unweighted Kappa (except for rounding errors) if the same data from two raters are submitted to the Fleiss algorithm.

Nomenclature

Ordinal data : These are data sets where the numbers are in order, but the distances between numbers are unstated. In other words 3 is bigger than 2 and 2 is bigger than 1, but 3 - 2 is not necessarily the same as 2 - 1. A common example of ordinal data is the Likert scale, where 1=strongly disagree, 2=disagree, 3=neutral, 4=agree, and 5=strongly agree. Although these numbers are in order, the difference between strongly agree and agree (5 - 4) is not necessarily the same as that between disagree and strongly disagree (2 - 1).

Instrument : Any method of measurement. For example, a ruler, a Likert Scale (5 point scale from strongly disagree to strongly agree), or a machine (e.g. ultrasound measurement of bone length).

Raters : One or more of the instruments. This is usually a person, hopefully trained, who determines what score should be given to a subject. A carer evaluating a pain score is a rater, and a judge in a beauty contest is a rater.

Subjects : The subjects of the measurements. They are patients, school children, members of the public, monkeys, rats, and so on.

Scores or measurements : The quantities produced by the instruments/raters. Measurements usually means something that is measured physically or chemically. Scores usually means results produced by human decision.

Concordance : This usually means how much the scores or measurements produced by different instruments agree. Commonly, concordance is expressed as a number between 0 and 1, where 0 represents no agreement at all, and 1 represents complete replication. In some concordance measurements a negative value may be produced, which signifies opposite results.

Examples

Example 1. Cohen's Kappa : Data entry option 1

The data come from two obstetricians (raters) palpating the abdomens of 30 pregnant women (subjects) and rating each baby as growth retarded (0), normal (1) or macrosomic (2). The data is a table, where each row represents a pregnant abdomen palpated, the two columns represent the two obstetricians, and the value in each cell is the classification given to that baby by that obstetrician.

The result consists firstly of a display of the count matrix, with rows representing obstetrician 1's scoring, columns obstetrician 2's scoring, and each cell the number of cases so scored by the two obstetricians. The diagonal cells represent where the two agree, the other cells where the two do not agree.

Weighted Cohen's Kappa = 0.28, 95%CI = -0.01 to 0.57. As the 95% confidence interval overlaps the null value (0), the conclusion is that there is no significant agreement between the two obstetricians.

Example 2. Cohen's Kappa : Data entry option 2

In this case, the data is a square matrix of counts, where two midwives reviewed 85 women at the beginning of labour as to how likely the delivery would require a Caesarean Section: no risk at all (1), minimal risk (2), high risk (3) and almost certain (4). Rows represent the evaluation of midwife 1, and columns the evaluation of midwife 2. The diagonal cells are where the two agreed: no risk (25), minimal risk (9), high risk (12), and certain (21). The cells below the diagonal are the counts where midwife 1 evaluated a higher risk than midwife 2, and those above the diagonal the other way around.

Weighted Cohen's Kappa = 0.82, 95% confidence interval = 0.74 to 0.90. As this interval does not overlap the null value (0), the conclusion that the risk assessments of these two midwives significantly agree can be made.

Example 3. Fleiss's Kappa : Data entry options 1 and 2

The data table is similar to that for option 1 in Cohen's Kappa, except that more than two raters are involved. In this example, we have 5 midwives examining 10 pregnant abdomens and classifying each baby as growth retarded (0), normal (1) or macrosomic (2). The data is therefore a 5 column table, each row representing a baby being assessed, each column one of the midwives, and each cell containing the score.

The program first creates a count array, where the rows represent babies, each column represents a score (in this case the 3 scores of 0, 1, and 2), and the cells the number of times that baby received that score. The sum of each row must therefore be 5 for the 5 rating midwives. In data entry option 2, the counting table can be entered directly.

For both data entry options, Fleiss's Kappa for this example is 0.42, 95% confidence interval = 0.28 to 0.56. As this interval does not overlap the null value (0), the conclusion that the 5 midwives significantly agree can be made.
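The two ways of calculating Cohen's Kappa, and the two data entry options, can be sketched as follows. This is an illustrative Python sketch under the assumption that the weighted version uses the linear weights w_{ij} = 1 - |i - j| / (g - 1); the function names `cohen_kappa` and `count_matrix` and the example scores are hypothetical, and the variance and confidence interval are not calculated.

```python
def count_matrix(scores_1, scores_2, g):
    """Build the g-by-g count matrix from two parallel lists of scores 0..g-1.

    Rows are rater 1's scores and columns are rater 2's scores.
    """
    matrix = [[0] * g for _ in range(g)]
    for a, b in zip(scores_1, scores_2):
        matrix[a][b] += 1
    return matrix


def cohen_kappa(matrix, weighted=True):
    """Cohen's Kappa from a square count matrix.

    With weighted=True, cell i,j is weighted by 1 - |i - j| / (g - 1);
    otherwise only exact agreement (the diagonal) counts.
    """
    g = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_p = [sum(row) / total for row in matrix]
    col_p = [sum(matrix[i][j] for i in range(g)) / total for j in range(g)]

    def weight(i, j):
        if weighted:
            return 1 - abs(i - j) / (g - 1)
        return 1.0 if i == j else 0.0

    cells = [(i, j) for i in range(g) for j in range(g)]
    p_obs = sum(weight(i, j) * matrix[i][j] / total for i, j in cells)
    p_exp = sum(weight(i, j) * row_p[i] * col_p[j] for i, j in cells)
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical scores from two raters on 8 subjects (0, 1 or 2). Not the page's data.
rater_1 = [0, 1, 1, 2, 2, 1, 0, 2]
rater_2 = [0, 1, 2, 2, 1, 1, 1, 2]
m = count_matrix(rater_1, rater_2, g=3)      # data entry option 1 -> count matrix
print(round(cohen_kappa(m, weighted=False), 3))
print(round(cohen_kappa(m, weighted=True), 3))
```

With only two categories the weights are 1 on the diagonal and 0 elsewhere, so the weighted and unweighted values coincide; they differ only when there are three or more categories.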
Kendall's coefficient of concordance for ranks (W) calculates agreement
between 3 or more rankers according to the ranking order each places on the individuals being ranked.
The idea is that n subjects are ranked (0 to n - 1) by each of the rankers, and the statistic evaluates how much the rankers agree with each other.

The program from the Kendall's W for Ranks Program Page modifies the input, so that the values entered by each ranker are ranked before calculation. This means the program can be used when the input data are scores, measurements, or ranks, and even if the scales of measurement used by different rankers are different (provided that they are ranking the same issues and in the same direction). For example, in cases of thyroid dysfunction, how much levels of T3, T4, and TSH agree with each other can be evaluated, as each measurement is converted to ranks before comparison.

Kendall's W is therefore useful in that it provides a calculation of concordance for many measurements without any assumption of distribution pattern.

Data entry and interpretation are best demonstrated using the example data from the Kendall's W for Ranks Program Page.

Example

In a beauty contest with 10 finalists, 3 judges are to evaluate their relative beauty, with the least beautiful scoring 0 and the most beautiful scoring 9. The data is therefore a table, each row representing a finalist, each column one of the judges, and each cell containing the rank in beauty that judge gives to that contestant.

The results are Kendall W = 0.43, Chi Square = 11.65, degrees of freedom = 9, p = 0.23.

Please note : Here the test of statistical significance is for significant agreement and not significant difference. This means that this set of results indicates that the 3 judges have no significant agreement on their ranking of beauty.
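The ranking step and the calculation of W described above can be sketched in Python. This is an illustration only: it uses the usual textbook formula W = 12S / (m^2 (n^3 - n)) with Chi Square = m(n - 1)W, ranks from 1 to n rather than 0 to n - 1 (which does not change W), and applies no correction for ties, all of which are assumptions rather than a description of the StatTools program. The function names and the example scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1   # mean of the 1-based positions start..end
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def kendall_w(table):
    """Kendall's W for a table with one row per subject and one column per ranker.

    Returns (W, chi_square, degrees_of_freedom). No correction for ties.
    """
    n = len(table)        # subjects
    m = len(table[0])     # rankers
    # Convert each ranker's column of values to ranks before comparison.
    columns = [average_ranks([row[j] for row in table]) for j in range(m)]
    rank_sums = [sum(col[i] for col in columns) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    return w, m * (n - 1) * w, n - 1


# Hypothetical scores from 3 judges on 5 contestants, each judge using a different scale.
scores = [
    [7.5, 60, 3],
    [9.0, 85, 5],
    [6.0, 70, 1],
    [8.0, 90, 4],
    [5.5, 40, 2],
]
w, chi_square, df = kendall_w(scores)
print(round(w, 3), round(chi_square, 2), df)
```

The probability for the Chi Square value is then read from a Chi Square table or function with n - 1 degrees of freedom.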
The Kuder Richardson Coefficient of reliability (KR 20) is used to test the reliability of binary measurements
such as exam questions, to see if the items within the instruments obtained the same binary (no/yes, right/wrong) results
over a population of testing subjects.
The formula for the coefficient can easily be obtained from Wikipedia on the Internet.

Please note that the KR 20 was first described in 1937. Hoyt in 1941 modified the formula so that it can be applied to measurements that are not binary. Hoyt's modification eventually was popularised and is now known as Cronbach's Alpha. Cronbach's Alpha, when applied to binary data, will therefore produce the same result as KR 20. Cronbach's Alpha is now much preferred, and will be discussed in its own panel on this page.

Example

Data input and interpretation of results are best demonstrated using the default example in the Kuder Richardson Coefficient for Binary Data Program Page.
We have 4 multiple choice questions (T1 to T4), administered to 5 students. 0 represents a wrong answer and 1 a correct answer, as shown in the table on the left.
The data set to be used is as shown in the table to the right, and the result is KR 20 = 0.75. The interpretation of the KR 20 value is similar to that of Kappa. A KR 20 of <0.2 is considered poor agreement, 0.21 to 0.4 fair, 0.41 to 0.6 moderate, 0.61 to 0.8 strong, and more than 0.8 near complete agreement. The original descriptions of KR 20 provided no test of statistical significance or confidence interval, although these can be obtained using the Cronbach's Alpha algorithm.
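The KR 20 calculation can be sketched in Python using the commonly published formula KR 20 = k/(k - 1) x (1 - sum(p x q) / variance of total scores), where k is the number of items, p the proportion of correct answers for an item, and q = 1 - p. The sketch assumes the variance of the total scores is calculated with divisor n, so that it matches p x q; the function name `kr20` and the answers below are made up for the illustration and are not the default example data.

```python
def kr20(data):
    """KR 20 for a table with one row per subject and one 0/1 column per item."""
    n = len(data)
    k = len(data[0])

    # p is the proportion answering an item correctly; p * (1 - p) is its variance.
    item_var = 0.0
    for j in range(k):
        p = sum(row[j] for row in data) / n
        item_var += p * (1 - p)

    # Variance of each subject's total score (divisor n, an assumption of this sketch).
    totals = [sum(row) for row in data]
    mean_total = sum(totals) / n
    total_var = sum((t - mean_total) ** 2 for t in totals) / n

    return k / (k - 1) * (1 - item_var / total_var)


# Hypothetical answers from 5 students to 4 questions (T1 to T4). Not the page's data.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(answers), 2))   # 0.8
```

In this made-up table the sum of p x q is 0.8 and the variance of the total scores is 2, so KR 20 = 4/3 x (1 - 0.4) = 0.8.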
Intraclass Correlation Coefficient (ICC) is a general measurement of agreement
or consensus, where the measurements used are assumed to be parametric
(continuous and Normally distributed). The Coefficient
represents agreement between two or more raters or evaluation methods on the
same set of subjects.
ICC has advantages over the correlation coefficient, in that it is adjusted for the effects of the scale of measurements, and that it can represent agreement between more than two raters or measuring methods.

The calculation involves an initial Two Way Analysis of Variance, so the program can also be used to conduct a parametric Two Way Analysis of Variance.

Data input and interpretation of results are best demonstrated using the default example data in the Intraclass Correlation Program Page.

Example

We are testing different methods of measuring blood pressure, and wish to know if the readings from the mercury and electronic manometers agree with each other. The data is therefore a two column table, where each row represents a patient, and the two columns the two methods of measurement.

Please note : This is a tiny made up data set to demonstrate the method. In reality there may be many more methods to compare (more than 2 columns), and the data set should contain many more cases for the results to be stable.
The initial Two Way Analysis of Variance produces the table to the right. It shows that variations between patients (rows) are significantly greater than random measurement error (p=0.0006), but the differences between methods of measurement (columns) are not statistically significant (p=0.78). We can therefore draw the conclusion that, although blood pressure varies from patient to patient, there is no significant difference between the methods of measurement in any patient.

Please note : The significance test in this case is a test of significant difference and not of agreement, and no significant difference between columns indicates that there is no significant disagreement.

The program now proceeds to provide a coefficient of agreement. It produces 6 in fact, described as follows. There are three models for Intraclass Correlation
In most cases, unless the methodology involves special arrangements, Model 2, individual, is used. From our example data therefore the Intraclass Correlation Coefficient is 0.98.

Intraclass Correlation Coefficient can be interpreted as follows: 0 to 0.2 indicates poor agreement; 0.3 to 0.4 indicates fair agreement; 0.5 to 0.6 indicates moderate agreement; 0.7 to 0.8 indicates strong agreement; and >0.8 indicates almost perfect agreement.
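The sequence of calculations, a Two Way Analysis of Variance followed by a coefficient of agreement, can be sketched in Python. The sketch assumes that "Model 2, individual" corresponds to the coefficient commonly written as ICC(2,1) in the Shrout and Fleiss notation (two way random effects, single measurement); that correspondence, the function name `icc_two_way`, and the blood pressure readings are assumptions made for the illustration. The significance tests and the other five coefficients are not calculated.

```python
def icc_two_way(data):
    """Two way ANOVA mean squares and ICC(2,1) for a subjects-by-methods table."""
    n = len(data)         # subjects (rows)
    k = len(data[0])      # raters or methods (columns)
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Partition the total sum of squares into rows, columns and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # ICC(2,1): agreement of single measurements, columns treated as random.
    icc = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    return {"ms_rows": ms_rows, "ms_cols": ms_cols, "ms_error": ms_error, "icc": icc}


# Hypothetical systolic readings: one row per patient, columns = mercury, electronic.
readings = [
    [120, 122],
    [135, 133],
    [110, 111],
    [150, 149],
    [128, 130],
]
result = icc_two_way(readings)
print(round(result["icc"], 3))
```

The F ratios of the Analysis of Variance table are `ms_rows / ms_error` for patients and `ms_cols / ms_error` for methods.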
Historical notes :
In 1937, Kuder and Richardson proposed a coefficient to evaluate the reliability of measurements composed of multiple binary items. In 1941 Hoyt modified this coefficient, adjusting it for continuity, and named this the Kuder Richardson Hoyt Coefficient. Cronbach in 1951 showed that this coefficient can be used generally in all scaled measurements. As he intended this to be a starting point for developing even better indices, he named it Coefficient Alpha. This index is now known as Cronbach's Alpha, and is a widely accepted measurement of internal consistency (reliability) of a multivariate measurement composed of correlated items. If Cronbach's Alpha is applied to binary data, the result is the same as the Kuder Richardson Coefficient (KR 20).

The initial Cronbach's Alpha, calculated from the covariance matrix, is now known as the Unstandardized Alpha. This value tends to be unstable, and is influenced by the scales of measurement used. A better Alpha is considered to be the Standardized Alpha, calculated from the correlation matrix. This is thought to be better because, as all variables are standardized to a mean of 0 and a Standard Deviation of 1, the resulting Alpha is independent of the scales used.

Both indices can be used to measure the internal consistency of multiple-item measurements, representing the averaged correlation between the items. As multiple-item measurements are in theory repeated measurements of the same thing, these indices represent the reliability of the overall set of measurements.

Indices of reliability are often used in the early stages of developing a multiple-item measurement, to ensure that all the items measure a common concept. Items are added, removed, and modified according to whether the indices of reliability improve, usually until Alpha is greater than 0.7.

A recent development is the calculation of the Standard Error of Alpha, and from it the confidence interval. This algorithm, by Duhachek and Iacobucci, is now included in StatTools. The development of the Standard Error measurement allows statistical comparison and significance testing. As well as the 95% confidence interval, z = Alpha / SE can be calculated, and the probability that this does not differ from zero follows the normal z distribution.

Example

Data entry and interpretation of results are best demonstrated using the default example data from the Cronbach's Alpha Program Page. In this example, we administered 4 multiple choice questions to 20 students, using 0 for a wrong answer and 1 for a correct answer. The data is therefore a table of 20 rows, each from a student, and 4 columns, each for one of the tests. We wish to know if the tests are similar in difficulty, that is, if the correct and incorrect answers agree.

The program first produces the covariance matrix, the diagonal of which is the variance of each measurement (test), and the off diagonal cells the covariance between the measurements (tests). Please note that the covariance matrix can also be used as a second option for data entry.

The program then calculates the unstandardized Alpha, which is

Unstandardized Alpha = 0.61, n=20, SE=0.14, 95%CI=0.33 to 0.89

The program then converts the covariance matrix to a correlation matrix, from which the standardized Alpha is produced.

Standardized Alpha = 0.60, n=20, SE=0.16, 95%CI=0.28 to 0.91

Sample Size Calculations for Cronbach's Alpha

StatTools provides two sets of sample size programs related to Cronbach's Alpha
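The unstandardized and standardized Alpha can be sketched in Python from the covariance matrix. This is an illustration using the commonly published formulae, Alpha = k/(k - 1) x (1 - sum of item variances / variance of the total) and Standardized Alpha = k x r / (1 + (k - 1) x r), where r is the mean correlation between items; the function names and the answers below are hypothetical, and the Standard Error by Duhachek and Iacobucci is not calculated.

```python
def covariance_matrix(data):
    """Sample covariance matrix (divisor n - 1) of the columns of data."""
    n = len(data)
    k = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    return [
        [
            sum((row[a] - means[a]) * (row[b] - means[b]) for row in data) / (n - 1)
            for b in range(k)
        ]
        for a in range(k)
    ]


def unstandardized_alpha(cov):
    """Alpha from the covariance matrix."""
    k = len(cov)
    item_var = sum(cov[i][i] for i in range(k))   # diagonal: item variances
    total_var = sum(sum(row) for row in cov)      # variance of the summed score
    return k / (k - 1) * (1 - item_var / total_var)


def standardized_alpha(cov):
    """Alpha from the correlation matrix derived from the covariance matrix."""
    k = len(cov)
    corr = [
        cov[i][j] / (cov[i][i] * cov[j][j]) ** 0.5
        for i in range(k)
        for j in range(k)
        if i != j
    ]
    mean_r = sum(corr) / len(corr)                # mean off diagonal correlation
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hypothetical answers from 6 students to 4 questions (0 = wrong, 1 = correct).
answers = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
cov = covariance_matrix(answers)
print(round(unstandardized_alpha(cov), 2), round(standardized_alpha(cov), 2))
```

Because both functions take the covariance matrix as input, the same sketch covers the second data entry option, where the covariance matrix is entered directly. An item answered identically by every student has zero variance and would need to be removed before the correlation step.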
Kappa for Nominal Data
