DAPA Measurement Toolkit

 

Validity

The aim of any assessment of diet, physical activity or anthropometry is to estimate the true value accurately. This estimate consists of the true value plus error, even for the most accurate tool or overall method. Validity is the extent to which the estimated value matches the true value or, put another way, the extent to which a method measures what it is supposed to measure.

Estimated Value = True Value + Total Error

Since we cannot know the true value with absolute certainty, it follows that interpretation of validity cannot be simplified to the question: is this method valid or not? Instead, validity differs according to variable of interest, study design, population and context. Validity can vary when:

  • Two different methods are used to assess the same phenomenon (e.g. self-report vs. laboratory weighing scale measures of adolescent body mass)
  • The same method is used to assess two different phenomena (e.g. accelerometer estimates of activity intensity during running vs. cycling)
  • The same method is applied in different contexts or populations (e.g. self-report of body mass in adolescent vs. adult populations)

Poor validity is typically the result of systematic error, which causes the estimated value to be distorted in a particular direction away from the true value. One example would be measurement of the height of study participants with their shoes still on their feet, which would have the consistent effect of producing values systematically greater than their true height, decreasing the truthfulness of the resulting data.

Validity is closely linked to reliability. However, whilst reliability relates to the consistency of a method, validity relates to its accuracy. It is therefore possible for a highly reliable method to have limited validity.

In the height example above, replicate measurements of the height of the same individual would be the same each time and therefore reliable. However, the measurement would not be valid because the shoes cause poor agreement with the true height. The relationship between reliability and validity is described visually by the target example in Figure C.2.1 below.

Figure C.2.1 Relationships between reliability and validity at individual level.
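
The shoe example can be made concrete with a small simulation. The sketch below is illustrative only: the true height, the 2.5 cm shoe offset and the 0.2 cm random error are assumed values, not figures from this toolkit. It shows replicate measurements that agree closely with each other (reliable) while all sitting above the true value (not valid).

```python
import random
import statistics

# Assumed values for illustration only.
TRUE_HEIGHT_CM = 170.0   # the (normally unknown) true value
SHOE_OFFSET_CM = 2.5     # systematic error added by the shoes
RANDOM_SD_CM = 0.2       # small random error between replicates

rng = random.Random(1)   # fixed seed so the run is repeatable

# Estimated value = true value + total error (systematic + random).
replicates = [
    TRUE_HEIGHT_CM + SHOE_OFFSET_CM + rng.gauss(0.0, RANDOM_SD_CM)
    for _ in range(200)
]

# Small spread between replicates: the method is reliable.
spread_cm = statistics.stdev(replicates)
# Consistent offset from the true value: the method is not valid.
bias_cm = statistics.mean(replicates) - TRUE_HEIGHT_CM

print(f"spread between replicates: {spread_cm:.2f} cm")
print(f"mean error versus true height: {bias_cm:.2f} cm")
```

Taking more replicates narrows the estimate of the mean but does nothing to remove the offset, which is why reliability alone cannot establish validity.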

It is possible for a method to be unreliable at individual level, but provide valid estimates at the group level using the mean, as shown by Figure C.2.2 below. Such a method would not be valid at individual level.

Figure C.2.2 Relationships between reliability and validity at group level. Not reliable and not valid at individual level, but valid at group level using mean of all values.
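
The group-level case can be sketched in the same way. The example below assumes a hypothetical method with no systematic error but a large random error for each individual; all numbers are invented for illustration. Individual estimates are far from the true values, yet the group mean is close to the true group mean.

```python
import random
import statistics

rng = random.Random(2)   # fixed seed so the run is repeatable

# Assumed true values for 2000 people (arbitrary units).
true_values = [rng.gauss(50.0, 5.0) for _ in range(2000)]

# Hypothetical method: unbiased, but with large random error per person.
estimates = [t + rng.gauss(0.0, 20.0) for t in true_values]

# Individual level: typical size of the error for one person.
mean_abs_individual_error = statistics.mean(
    abs(e - t) for e, t in zip(estimates, true_values)
)

# Group level: error of the group mean.
group_mean_error = statistics.mean(estimates) - statistics.mean(true_values)

print(f"typical individual error: {mean_abs_individual_error:.1f}")
print(f"error of the group mean:  {group_mean_error:.2f}")
```

The random errors largely cancel when averaged, which is why such a method can support statements about the group but not about any one participant.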

Validity is a broad concept which has been defined in different ways and for different purposes. Some of the more commonly used forms of validity are described below.

Face validity

The degree to which a method appears to provide the desired information about the variable it has been designed to measure. This is typically a more qualitative judgement which, given the multi-dimensional nature of diet, physical activity and anthropometry, can be an important step in determining whether a method is fit for purpose.

Content validity (also known as logical validity)

The extent to which the method is considered to assess specific aspects of the phenomenon it is designed to assess. This is important when measuring health behaviours since they can be broken down into various dimensions and domains. Like face validity, this is a more qualitative judgement made by considering the target variable to be measured alongside the dimensions captured by the method.

Construct validity

The extent to which a method measures the theoretical construct it is designed to measure. It is demonstrated when the method yields data as might be expected, given its intended purpose.

  • For example, a questionnaire assessing occupational physical activity could be expected to produce higher values for bus conductors than for bus drivers, or lumberjacks as compared with office workers
  • If the resulting data were to correlate well with an assessment of physical fitness – such as maximal oxygen consumption – this too would be considered evidence of construct validity

Criterion-related validity

The extent to which estimated values compare with those derived from a comparison or ‘criterion’ method, preferably one of very high validity and thought to provide the closest approximation of the true value, commonly referred to as a ‘gold-standard’ method. For example:

  • The criterion validity of a new method to assess total body fat, such as a novel set of skinfold thickness equations, could be evaluated by comparing its data against scores derived by the 4-component body composition model, which has been used as the gold-standard method
  • By measuring the extent to which data derived by the new skinfold equations relate to and/or agree with those from the criterion method, we can better understand how to interpret data from the new method

Convergent validity

Like criterion validity, this is the extent to which estimated values match those derived from a comparison method, but one not generally accepted to be the gold-standard.

The extent to which a method produces estimated values which are consistent with ‘true’ values can be assessed by a validity study. Typically, the method being examined and another - ideally gold-standard - method are used to assess the same phenomenon, followed by evaluation of the data from each.

1. Validity for absolute and relative measures

The relationship between two measures can be expressed in absolute or relative terms:

  • Absolute validity refers to the agreement between two sets of data measuring the same phenomenon with the same units.
  • Relative validity is the degree to which two methods, irrespective of units, rank individuals in the same order.

A given type of measurement may not be valid for capturing absolute levels of exposure, yet still be valid for capturing relative differences between individuals in a study population. For example, food frequency questionnaires, which assess how often selected foods are consumed, are often used without any assessment of portion sizes. Absolute levels of nutrient intake derived from them therefore cannot be valid.

Despite the absolute measures of nutrient intakes not being valid, ranking individuals by levels of nutrient intakes can be valid and thus be adopted in a study of a lifestyle-disease association. Depending on the research question, validity for absolute measures is not always necessary.
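
The distinction between absolute and relative validity can be illustrated with simulated data. The sketch below assumes a hypothetical tool that under-estimates a reference intake by 30% with some random error; the intake values are invented. Absolute agreement is summarised by the mean difference and 95% limits of agreement (one common choice), and relative agreement by a Spearman rank correlation computed from scratch.

```python
import random
import statistics

def ranks(values):
    """Rank from 1 (smallest) to n; assumes no ties (continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

rng = random.Random(3)   # fixed seed so the run is repeatable

# Assumed reference intakes (kcal/day) and a hypothetical tool that
# reports about 70% of the reference value plus random error.
reference = [rng.gauss(2000.0, 400.0) for _ in range(300)]
tool = [0.7 * r + rng.gauss(0.0, 100.0) for r in reference]

# Absolute validity: agreement in the same units.
differences = [t - r for t, r in zip(tool, reference)]
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)
limits_of_agreement = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

# Relative validity: do the two methods rank people in the same order?
rank_agreement = spearman(reference, tool)

print(f"mean difference: {mean_diff:.0f} kcal/day")
print(f"rank correlation: {rank_agreement:.2f}")
```

In this invented scenario the tool has a large negative mean difference, so it is poor in absolute terms, while the rank correlation stays high, so it could still order individuals usefully.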

Absolute validity can be separated according to whether the interpretations are to be made about groups or individuals.

2. What if no gold-standard is available?

There are often circumstances in which no gold-standard method is available for use as the criterion. This may be because:

  • No widely accepted gold-standard method exists (e.g. for habitual dietary energy intake).
  • Gold-standard methods which do exist are inaccessible, impracticable, or unethical. For example, the use of computed tomography to quantify adipose tissue in childhood research studies is limited due to ionising radiation exposure.

In these instances, the validity of a method can only be estimated by comparing its data with those of another method which is itself known to have systematic errors and biases. This type of comparison is known as convergent validity.

When no gold-standard method is available, it is desirable that the comparison method relies on a different type of measurement to obtain data in order to avoid the risk of correlated error. For example, comparing a 24-hour dietary recall to an estimated food diary carries the risk of similar under-reporting from both methods.
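
The risk of correlated error can be sketched with a simulation. The example below assumes that two hypothetical self-report methods share the same person-specific reporting error (for instance a tendency to under-report); all quantities are invented for illustration. The two methods agree closely with each other, yet each agrees much less well with the true intake.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

rng = random.Random(4)   # fixed seed so the run is repeatable
n_people = 1000

# Assumed true intakes (kcal/day).
truth = [rng.gauss(2000.0, 300.0) for _ in range(n_people)]

# Person-specific reporting error shared by BOTH self-report methods
# (an assumption made to illustrate correlated error).
shared_error = [rng.gauss(-300.0, 400.0) for _ in range(n_people)]

# Each method adds its own small independent error on top.
recall = [t + s + rng.gauss(0.0, 100.0) for t, s in zip(truth, shared_error)]
diary = [t + s + rng.gauss(0.0, 100.0) for t, s in zip(truth, shared_error)]

corr_between_reports = pearson(recall, diary)
corr_recall_truth = pearson(recall, truth)

print(f"recall vs diary: {corr_between_reports:.2f}")
print(f"recall vs truth: {corr_recall_truth:.2f}")
```

Under these assumptions, the close agreement between the two self-reports overstates how well either reflects the truth, which is the reason for preferring a comparison method with a different type of error.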

Even if the validity of a tool has been assessed through comparison with a gold-standard, it should not be assumed that it is appropriate for use in every research scenario. In practice this is rarely the case; validity is tied to the overall method, plus the intended purpose, population and context where it is applied.

The following should be considered when assessing the validity of a method to measure any aspect of diet, physical activity or anthropometry:

  • The characteristics of the sample used in the validation study
  • The scientific rigour of the validity study
  • The dimension(s) measured in the validity study and those of interest for the research question – i.e. face/content validity
  • The study design being used to answer the research question, and whether absolute or relative validity is necessary
  • Agreement (absolute) or association (relative) between the comparison and the method being assessed – i.e. criterion/convergent validity

Internal and external validity

The sample used in validation or other types of study should be reviewed to ascertain if the results are likely to be generalisable to other populations or contexts. This is known as external validity. Sample characteristics such as age, sex, ethnic origin and socio-economic status may all limit generalisability. For example, a physical activity questionnaire which is valid for use in adults may not be suitable for use in a youth population.

In contrast, internal validity is the extent to which the study or estimate is free from bias or systematic error – i.e. the appropriateness and rigour of the study design, data collection protocols and/or analysis.

Face and content validity

Another important consideration should be whether the criterion used to evaluate a method would be suitable for use in answering your research question. For example, validity reported in comparison with doubly labelled water (the gold-standard estimate of overall energy expenditure) would not be sufficient evidence to support use of a questionnaire to estimate subcategories of activity such as active commuting. A method with acceptable validity for one dimension of behaviour may not be relevant or generalisable to another dimension.

Suitability for study design and research question

It is very important to recognise that the degree of validity of a method may be more or less acceptable for studies designed for different purposes. Table C.2.1 illustrates different validity for different outcomes assuming use of a ‘gold-standard’ method, such as:

  • Assessment of 24-hour urinary sodium excretion that precisely captures exposure to dietary sodium
  • 24-hour calorimetry that examines energy expenditure

Table C.2.1 illustrates that even if a perfect method is used, validity of such methods varies by their application.

Table C.2.1 Theoretical validity of a ‘gold-standard’ measurement by exposure type.

| Number of assessments (number of participants) | Once (n = 5) | 1000 times* (n = 5) | Once (n = 50,000)† | 1000 times* (n = 50,000)† |
|---|---|---|---|---|
| Internal validity | | | | |
| Exposure on a specific day of each person | Valid‡ | Valid‡ | Valid‡ | Valid‡ |
| Habitual exposure* of each person | ? | Valid‡ | ? | Valid‡ |
| External validity | | | | |
| Average habitual exposure* of the population | ? | ? | Valid‡ | Valid‡ |
| Variation of habitual exposure* of the population | ? | ? | ? | Valid‡ |
| % of the population meeting a certain public guideline or clinical cut-off | ? | ? | ? | Valid‡ |

* Assumed to be sufficient to represent a habitual condition over a long period in a person.
† Assumed to be sufficient to represent the source population.
‡ Assumed to have no change in participant’s characteristics in response to each measurement and to have no errors in measurement, processing, and analysis.

For example, single gold-standard 24-hour calorimetry measurements in 50,000 people can capture the energy expenditure of each person on a specific day. Although energy expenditure varies over time, the average of the 50,000 measurements can also be a valid estimate of the average habitual energy expenditure of the parent population.

However, those 50,000 measures do not provide a valid measure of the variability of habitual energy expenditure between different individuals. This limitation arises because the variability observed across single measurements mixes between-person and within-person variability (a reliability issue), which precludes the study of between-person variability alone. If there is little or no within-person variability in a measurement (e.g. knee height), measuring many individuals just once allows inference of between-individual variability.
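
The point about single versus repeated measurements can be sketched numerically. The simulation below uses a smaller sample than the table (2000 people, 100 days each) to keep it quick, and the between-person and within-person standard deviations are assumed values. It treats the measurement itself as error-free, as the table does.

```python
import random
import statistics

# Assumed values for illustration only (kcal/day).
POPULATION_MEAN = 2500.0
BETWEEN_PERSON_SD = 300.0   # spread of habitual expenditure across people
WITHIN_PERSON_SD = 400.0    # day-to-day spread within one person
N_PEOPLE = 2000
N_DAYS = 100

rng = random.Random(5)   # fixed seed so the run is repeatable

# Habitual (long-term average) expenditure of each person.
habitual = [rng.gauss(POPULATION_MEAN, BETWEEN_PERSON_SD) for _ in range(N_PEOPLE)]

# Error-free daily measurements: habitual level plus day-to-day variation.
daily = [
    [h + rng.gauss(0.0, WITHIN_PERSON_SD) for _ in range(N_DAYS)]
    for h in habitual
]

single_day = [days[0] for days in daily]               # measured once
person_means = [sum(days) / len(days) for days in daily]  # measured repeatedly

mean_single = statistics.mean(single_day)
sd_single = statistics.stdev(single_day)      # between + within mixed together
sd_repeated = statistics.stdev(person_means)  # close to between-person only
sd_habitual = statistics.stdev(habitual)      # the target quantity

print(f"population mean from single days: {mean_single:.0f}")
print(f"SD from single days: {sd_single:.0f}")
print(f"SD from repeated measures: {sd_repeated:.0f}")
print(f"SD of habitual expenditure: {sd_habitual:.0f}")
```

With independent between-person and within-person components, the variance of single measurements is approximately the sum of the two variances, so the single-day spread overstates the spread of habitual expenditure even though the single-day mean remains a good estimate of the population average.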
