Reliability

For a measurement tool or instrument to be considered reliable, its results must be reproducible and stable under the different conditions in which it is likely to be used. Reliability is decreased by measurement error, the main components of which are systematic bias and random error. Reliability may be considered as the amount of measurement error that has been deemed acceptable for the effective practical use of a measurement tool (Atkinson & Nevill, 1998).

The following terms have been used interchangeably with reliability in the literature:

  • Reproducibility
  • Repeatability
  • Consistency
  • Stability
  • Agreement
  • Concordance

Definitions
Relative reliability is the degree to which individuals maintain their position in a sample with repeated measurements. 

Three aspects of reliability have been defined with respect to measurement instruments and tools:

  • test-retest reliability: the degree to which a result obtained with an instrument on one occasion is equivalent to the result obtained with the same, or a parallel, instrument across days; also known as stability;
  • internal consistency reliability: the degree to which items within an instrument correlate with each other, or the consistency of an assessment tool across multiple trials within a single administration;
  • inter-rater reliability: the degree to which the measuring instrument yields similar results at the same time with more than one assessor; this is particularly important in investigator-determined subjective methods, e.g. interview or observation.

Test-retest methods may be problematic in the assessment of diet and physical activity due to inherent systematic errors, random errors and intra-individual variation.  Some methods are likely to have particularly high day-to-day variation e.g. 24-hour dietary or physical activity recall.   

Absolute reliability is the degree to which repeated measurements vary for individuals.  Food intake and physical activity vary widely with time, so precision at an individual level is poor even if there is good agreement of mean intakes or levels of activity or energy expended. 

It should be noted that high reproducibility may simply reflect consistent over- or under-reporting of diet (or activity), i.e. the method repeatedly reports the same incorrect data (Gibson, 2005).

Generalisability is the degree to which the study population is representative of other populations. Extrapolating the results of a test-retest correlation to a new sample is problematic if differences exist between population groups.

Statistical methods for assessing reliability
A comprehensive and critical review of the statistical tests commonly used to assess reliability is provided by Atkinson & Nevill (1998).

It is important to understand that hypothesis-driven tests such as paired t-tests and ANOVA should not be relied on solely to measure reliability, as the amount of random variation determines the detection of a significant difference; significant systematic bias will be less likely to be detected if it is accompanied by large amounts of random error between tests (Altman, 1991; Bland & Altman, 1995); the sketch below illustrates this.
Researchers should report whether a one-way or two-way ANOVA has been used, as the error term is defined differently for these two models (Patterson, 2000). ANOVA has been suggested as the preferred statistical approach for assessing the reproducibility of any dietary assessment method (Gibson, 2005). However, the quantity to be estimated must be specified, and different models will estimate it differently depending on the assumptions that underpin them.
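
To illustrate the point about paired t-tests, the following sketch (in Python, with hypothetical values; numpy and scipy are assumed to be available) applies a paired t-test to simulated test-retest data containing a fixed systematic bias, under small and then large random error:

    # A minimal sketch showing why a paired t-test alone is a poor
    # reliability check: a genuine systematic bias between test and retest
    # can go undetected when the random error between trials is large.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 20
    true_score = rng.normal(2000, 300, n)   # hypothetical energy intake, kcal/day
    bias = 150                              # genuine systematic bias on retest

    for sd in (30, 400):                    # small vs large random error (SD)
        test = true_score + rng.normal(0, sd, n)
        retest = true_score + bias + rng.normal(0, sd, n)
        t, p = stats.ttest_rel(test, retest)
        print(f"error SD={sd:3d}: mean diff={np.mean(retest - test):6.1f}, p={p:.3f}")

With small random error the bias is flagged as significant; with large random error the same bias is often non-significant, despite being just as real.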

Test-retest reliability
The extent to which a method is repeatable is most commonly expressed as the coefficient of variation (CV), i.e. the standard deviation of the results of repeated analyses of the same parameter expressed as a percentage of their mean. It can also be expressed as a mean difference, a correlation, or by classification into fourths.
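
As a concrete illustration, a within-subject CV can be computed as below (a minimal Python sketch with hypothetical values; numpy assumed available):

    # Within-subject coefficient of variation: the SD of each subject's
    # repeated measurements as a percentage of that subject's mean,
    # averaged across subjects.
    import numpy as np

    # rows = subjects, columns = repeated measurements (hypothetical kcal/day)
    x = np.array([
        [2100.0, 2250.0, 1980.0],
        [1750.0, 1690.0, 1820.0],
        [2600.0, 2480.0, 2550.0],
    ])

    subject_cv = x.std(axis=1, ddof=1) / x.mean(axis=1) * 100
    print(f"mean within-subject CV = {subject_cv.mean():.1f}%")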

If the data are continuous and normally distributed, test-retest reliability may be measured by the intraclass correlation coefficient (ICC), reported either in the actual units of measurement or as a proportion of the measured values (Baumgarter, 1989). The Pearson correlation is often used; if the data are not normally distributed they may be normalised by log transformation, or the non-parametric Spearman correlation may be used. Although widely used, correlations are not an ideal statistical method for measuring test-retest reliability: a correlation only measures how closely the observations lie on any straight line, not how well the repeated measurements agree. For ordinal data the weighted kappa should be used; the kappa statistic should be used if the data are binary.
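
A two-way random-effects ICC can be computed from the classical ANOVA mean squares, as in this minimal sketch (hypothetical values; numpy and scipy assumed available, and ICC(2,1) chosen as one common form):

    # ICC(2,1): two-way random-effects, single-measurement, absolute agreement.
    import numpy as np
    from scipy import stats

    # rows = subjects, columns = test and retest (hypothetical kcal/day)
    x = np.array([
        [2100.0, 2180.0],
        [1750.0, 1650.0],
        [2600.0, 2540.0],
        [1900.0, 2010.0],
        [2300.0, 2240.0],
    ])
    n, k = x.shape
    grand = x.mean()

    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()    # between trials
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    rho, _ = stats.spearmanr(x[:, 0], x[:, 1])
    print(f"ICC(2,1) = {icc:.2f}, Spearman rho = {rho:.2f}")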

Time interval between test and retest
When testing tools or instruments on repeated occasions the length of interval between measurements should be considered in the light of the reference timeframe of the instrument.  It has been argued that the time interval between the administrations of physical activity self-report measures should generally be 1-3 days and not greater than 7 days (Patterson, 2000).  There is a risk that overlapping periods may induce a correlation between the random error of the two (or more) measurements; most methods assume this does not happen.  

In dietary assessment, the recommended time period has been longer for interviews, 4-8 weeks, to reduce the chance of the second measure being influenced by the first.

In a short time frame (1-7 days) it is difficult to disentangle behavioural and analytical variation; in a longer time frame, the variation is mainly analytical.

Internal consistency
Internal consistency is measured by coefficient alpha (Cronbach's alpha) or the Spearman-Brown prophecy formula.
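
Coefficient alpha compares the sum of the item variances with the variance of the total score. A minimal sketch (hypothetical questionnaire data; numpy assumed available):

    # Cronbach's coefficient alpha:
    # alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    import numpy as np

    # rows = respondents, columns = items (hypothetical 4-item scale)
    items = np.array([
        [3, 4, 3, 4],
        [2, 2, 3, 2],
        [4, 5, 4, 4],
        [3, 3, 2, 3],
        [5, 4, 5, 5],
    ], dtype=float)

    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    alpha = k / (k - 1) * (1 - item_vars / total_var)
    print(f"Cronbach's alpha = {alpha:.2f}")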

Inappropriate testing methods
Significance testing is not relevant to reliability testing. A Pearson product-moment correlation (PPMC) is also not appropriate, as it assumes a bivariate situation whereas reliability is univariate. Additionally, reliability is concerned with the consistency of scores across trials or days, and the PPMC is unable to detect trends (Patterson, 2000).

Absolute reliability
Methods used to describe ‘absolute reliability’ include the Bland-Altman limits of agreement and the kappa statistic.

Bland and Altman limits of agreement have clinical meaning, are easy to interpret, and allow biases to be demonstrated.
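
The limits of agreement are simply the mean difference (bias) plus or minus 1.96 standard deviations of the differences, as in this minimal sketch (hypothetical values; numpy assumed available):

    # Bland-Altman 95% limits of agreement between test and retest.
    import numpy as np

    test = np.array([2100.0, 1750.0, 2600.0, 1900.0, 2300.0])
    retest = np.array([2180.0, 1650.0, 2540.0, 2010.0, 2240.0])

    diff = retest - test
    bias = diff.mean()
    sd = diff.std(ddof=1)
    print(f"bias = {bias:.0f}, 95% limits = ({bias - 1.96 * sd:.0f}, {bias + 1.96 * sd:.0f})")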

The Kappa statistic is used to express agreement in the classification of individuals. 
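
For binary classifications, Cohen's kappa corrects the observed agreement for the agreement expected by chance. A minimal sketch (hypothetical classifications; numpy assumed available):

    # Cohen's kappa for two binary classifications of the same individuals:
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    import numpy as np

    a = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # e.g. 'meets guideline' at test
    b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])  # same classification at retest

    p_obs = (a == b).mean()
    p_chance = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    kappa = (p_obs - p_chance) / (1 - p_chance)
    print(f"kappa = {kappa:.2f}")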

Increasing reliability
The reproducibility of a diet or physical activity assessment measure depends on the time frame of the method, the population under study, the nutrient or aspect of physical activity of interest, the measurement tool, and the between- and within-individual variation.

In order to increase the reliability of an assessment, the sources and types of error must be identified as arising either from the instrument or tool, or from the individual. Random error may be reduced by more frequent calibration of instruments, improved interviewer training, increased knowledge about the placement of motion sensors, and increasing the number of days of assessment. The reduction of random error will increase reliability (Rikli, 2000).
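
The Spearman-Brown prophecy formula mentioned above gives one way to quantify the gain from extra days of assessment: if a single day has reliability r, the mean of k days has predicted reliability kr / (1 + (k - 1)r). A minimal sketch (the single-day reliability here is hypothetical):

    # Spearman-Brown prophecy formula: predicted reliability of the mean
    # of k repeated days, given a single-day reliability r.
    def spearman_brown(r: float, k: int) -> float:
        return k * r / (1 + (k - 1) * r)

    r_single = 0.4  # hypothetical single-day reliability of a 24-hour recall
    for k in (1, 3, 7, 14):
        print(f"{k:2d} days: predicted reliability = {spearman_brown(r_single, k):.2f}")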
