Reliability of categorical versus continuous scoring of welfare indicators: lameness in cows as a case study
Many animal welfare traits vary on a continuous scale but are commonly scored using an ordinal scale with few categories, presumably because this is believed to increase data reliability. Using 54 observers of varying levels of expertise, inter-observer reliability (IOR) and user satisfaction were compared between a 3-point ordinal scale (OS) and a continuous modified visual analogue scale with multiple anchors (VAS) for scoring cattle lameness from video. Half the participants scored the first 20 videos using the VAS and the last 20 videos using the OS, whereas the other half used the OS first and then the VAS. Each video showed a different cow walking over a 6 m mat in the same setting. Each video was shown 4 times but scored only once per observer. ANOVA models indicated that IOR was significantly better for the VAS (r = 0.44) than for the OS (r = 0.35; P = 0.016). Such a low IOR may not be surprising given the short training session (8 videos were scored communally prior to the trial), the difficulty of scoring from video, and the lack of experience of most observers. IOR increased with self-reported level of expertise for the VAS (P < 0.001), whereas for the OS it was highest for moderately experienced observers (P < 0.001). The mean continuous and categorical scores were highly correlated (r = 0.93, P < 0.0001). Three times as many observers stated a preference for the VAS as for the OS for investigating differences in lameness between herds. These results illustrate that it is possible for a continuous score to be more reliable and to have greater user acceptability than a simple categorical scale. As continuous scales are also potentially more sensitive, and produce data more amenable to algebraic processing and more powerful analyses, the scepticism towards their application for assessing animal welfare traits should be reconsidered.
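The abstract does not state which reliability coefficient the ANOVA models produced. As an illustration only, the sketch below computes one common ANOVA-based measure of inter-observer reliability, the two-way random-effects intraclass correlation for single scores, ICC(2,1) in the Shrout and Fleiss notation, from a videos-by-observers table. The choice of coefficient, the function name `icc_two_way_random`, and the example scores are all assumptions made for this sketch and are not taken from the study.

```python
def icc_two_way_random(scores):
    """ICC(2,1): scores[i][j] is the score given to video i by observer j.

    Assumes a complete table (every observer scored every video).
    """
    n = len(scores)        # number of videos (targets)
    k = len(scores[0])     # number of observers (raters)
    grand = sum(sum(row) for row in scores) / (n * k)

    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-video mean square
    ms_cols = ss_cols / (k - 1)                 # between-observer mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Made-up scores for illustration: 4 videos (rows) scored by 3 observers (columns).
example_scores = [
    [10.0, 12.0, 11.0],
    [35.0, 30.0, 40.0],
    [55.0, 60.0, 50.0],
    [80.0, 85.0, 78.0],
]
print(round(icc_two_way_random(example_scores), 3))
```

Because the function takes plain numeric scores, it could be applied both to VAS readings and to ordinal categories coded as numbers, although treating ordinal codes as interval data is itself an assumption.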