Visual search behaviour in skeletal radiographs: a cross-speciality study
J.J.H. Leong,M. Nicolaou,R.J. Emery,A.W. Darzi,G.-Z. Yang
DOI: https://doi.org/10.1016/j.crad.2007.05.008
IF: 3.389
2007-01-01
Clinical Radiology
Abstract:
Results
Total time spent studying the radiograph was not significantly different between the groups. However, the expert groups had a higher number of true positives (p < 0.001), with less dwell time on the fracture site (p < 0.001) and smaller KL distance (r = 0.062, p < 0.001) between trials. The Gaussian mixture model revealed smaller mean squared error in the expert groups in hand radiographs (r = 0.162, p = 0.07); however, the reverse was true in shoulder radiographs (r = −0.287, p < 0.001). The relative duration of the reflective phase decreased as the confidence level increased (r = 0.266, p = 0.074).
Conclusions
Expert search behaviour exhibited higher accuracy and consistency whilst using less time fixating on fracture sites. This strategy conforms to the discovery and reflective phases of the global–focal model, where the reflective search may be implicated in the cross-referencing and conspicuity of the target, as well as the level of decision-making process involved. The effect of specialization appears to change the search strategy more than the effect of the length of training.

Introduction
The error rate in the interpretation of radiographs in accident and emergency (A&E) departments is estimated to be 1.5%. 1 In a busy unit, most radiographs are read by the treating physicians, as well as radiologists, to ensure a good consensus. However, there is usually a significant delay between the two interpretations. In a 1-year review of 671 cases with discrepancies in radiographic reports between the A&E and radiological staff, 286 cases required further action. 2 Factors influencing accuracy include the duration of training, 3,4 and the difference in training methods between the specialities. 5 It has been estimated that up to 40% of radiographs taken in hospitals are musculoskeletal images. 6 Most of these radiographs have duplicate readings by radiologists and orthopaedic surgeons, and the discrepancies between their interpretations are significant.
7 The use of eye tracking provides a possible means of understanding how these discrepancies have originated. One of the first documented studies of eye tracking was published in a psychology journal in 1901. 8 The technology has since evolved from being invasive, involving the use of a scleral contact lens with embedded search coil, to accurate and non-invasive, video-based eye-tracking devices with bilateral video-oculography methods. 9 Existing research has shown that eye-tracking data imply visual attention and can provide further insight into the cognitive process of image understanding and aberrant or idiosyncratic visual search behaviours. 10–14 Kundel and Nodine postulated the global–focal model for describing the behaviour of radiograph interpretation. They suggested four stages of search that include (1) global impression, which is defined as the initial search using mainly peripheral vision guidance and lasts for less than 200 ms; (2) discovery search, which uses the information from step one and involves a detailed inspection of the target; (3) reflective search, which involves gathering evidence from cross-referencing other potential targets; and (4) post-search recall, which describes the period when the image is no longer available, and is recalled from memory. 15 The first and last phases of the model are difficult to capture by eye tracking, whereas for ambiguous targets, the reflective stage of the visual search was more pronounced. This was reflected by low-contrast lung nodules studies. 15 The purpose of this study was to provide a detailed quantitative analysis of the discovery and reflective stages of the visual search involved in identifying focal fracture sites in skeletal radiographs. It also aimed to establish a numerical framework for the practical application of the global–focal model and evaluating the effect of specialization and training duration on search behaviour. 
Materials and methods

Selection of radiographs
A total of 33 digital radiographic images were obtained from a London hospital: 12 images of hands (including two practice images), nine images of knees, and 12 images of shoulders. All images were converted from the DICOM standard to TIFF format using lossless conversion, and only the anteroposterior view was used. The images were standardized in size to fit a screen resolution of 1280 × 1024, and patient information was removed. All images were reported by a consultant radiologist before the study. One shoulder, two knee, and two hand radiographs had no fractures, and four images had more than one fracture. Data from three images were discarded due to ambiguity of the diagnosis.

Eye tracking experiment set-up
A Tobii 1750 eye tracker (Tobii Technology, Stockholm, Sweden) was used to display the images. It is a remote eye-tracking device using the standard binocular video-oculography technique with an accuracy of 0.5° and a sampling rate of 30 Hz, integrated with a 17″ TFT display with a resolution of 1280 × 1024 pixels. It can tolerate moderate head movement within a 30 × 15 × 20 cm volume at 60 cm in front of the device, thus providing a relatively natural environment for radiograph interpretation.

A total of 25 participants (five consultant radiologists, six consultant orthopaedic surgeons, five orthopaedic specialist registrars (SpRs), four orthopaedic senior house officers (SHOs) and five accident and emergency department (A&E) SHOs) were recruited for the study. Ethical approval was obtained from St Mary's Local Research Ethics Committee, and all participants signed written consents before the study. The instructions were explained in writing and displayed on screen. All experiments were carried out in a darkened room with minimal noise disturbance, and the participants were positioned 60 ± 10 cm in front of the screen, as illustrated in Fig. 1.
After written consent and a standardized five-point calibration on the Tobii eye tracker, the instructions were displayed again on screen, and two slides were used for familiarization at the beginning of each session. Thirty-three images were displayed sequentially. The participants were asked to search for the fracture(s), fix their gaze on the fracture(s), and press a button. The participants were then required to report aloud a number from 1 to 5 (with 5 being most confident), indicating the confidence level for the diagnosis after each button click. The image was changed when the participants were satisfied that there was no further fracture and said “next”. No other interactions were available to the participants.

Pixel coordinates of the eye-tracking data were acquired using the software provided with ClearView 2.2.0. Fixations were calculated when gaze points fell within a 1.5° visual angle with a minimum duration of 100 ms. The location of the fracture identified by the observer was taken as the fixation point that coincided with a button click to within 200 ms. Time taken to interpret the radiograph, diagnostic performance, and eye-tracking data were analysed for each participant. The dwell time was the total time during which a participant's fixations fell on the fracture site, and the group medians were compared using non-parametric tests. Further analysis was performed by plotting the Cartesian distance of the gaze positions from the centre of the fracture as a function of time. The shape of the curves generated was used to assess the consistency of the visual search strategy of the observer and the effect of experience on the search patterns, as described below.

Statistical analysis
There were 19 radiographs with single fractures, and they were used in the subsequent analyses.
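To make the quantities defined in the eye-tracking methods above concrete, the following is a minimal Python sketch; it is not the software used in the study. It converts the 1.5° fixation criterion into screen pixels using the stated 17″, 1280 × 1024 display and 60 cm viewing distance, and computes the distance-to-fracture curve and the dwell time ratio from a list of fixations. The fixation tuple layout, the function names, and the radius of the region counted as the "fracture site" are assumptions made for illustration; the paper does not specify them.

```python
import math

def visual_angle_to_pixels(angle_deg, distance_mm=600.0,
                           diagonal_in=17.0, res=(1280, 1024)):
    """On-screen extent (pixels) subtended by a visual angle at a viewing distance."""
    # Pixel pitch derived from the display diagonal and resolution.
    pitch_mm = diagonal_in * 25.4 / math.hypot(res[0], res[1])
    extent_mm = 2.0 * distance_mm * math.tan(math.radians(angle_deg) / 2.0)
    return extent_mm / pitch_mm

def distance_curve(fixations, fracture_xy):
    """(start_ms, distance_px) of each fixation from the fracture centre."""
    fx, fy = fracture_xy
    return [(start, math.hypot(x - fx, y - fy))
            for (start, duration, x, y) in fixations]

def dwell_time_ratio(fixations, fracture_xy, roi_radius_px, total_time_ms):
    """Dwell time on the fracture site divided by total viewing time."""
    fx, fy = fracture_xy
    dwell = sum(duration for (start, duration, x, y) in fixations
                if math.hypot(x - fx, y - fy) <= roi_radius_px)
    return dwell / total_time_ms

# Hypothetical fixations: (start_ms, duration_ms, x_px, y_px).
fixations = [(0, 200, 100, 100), (200, 300, 640, 512), (500, 500, 650, 520)]
fracture = (640, 512)
radius_px = visual_angle_to_pixels(1.5)   # assumed size of the fracture region
curve = distance_curve(fixations, fracture)
ratio = dwell_time_ratio(fixations, fracture, radius_px, total_time_ms=1000)
```

With these made-up numbers the last two fixations fall inside the assumed region, so the dwell time ratio is 800/1000. Tying the region radius to the fixation criterion is only one possible choice; a fixed anatomical region of interest would be equally defensible.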
The Kullback–Leibler (KL) distance was used to calculate the intra-observer similarity of the curves derived from the fixation distance to the target 16 (see Appendix for details). The x-axis was scaled to a standardized time, and the datasets were interpolated to the same number of data points. For each participant, the dataset from each radiograph was compared with the other 18 datasets from the same participant, and the KL distance was calculated. Having discarded the multiple-fracture, normal and ambiguous images, this resulted in a 19 × 19 matrix of pair-wise comparisons for each participant.

Observation of the fixation data by two of the authors (J.J.H.L. and G.Z.Y.) revealed a distinct bimodal distribution in some of the datasets. The fixation distance data were hence fitted with a two-mode Gaussian mixture model, and the parameters were derived from the expectation–maximization (EM) algorithm 17,18 (see Appendix). The “goodness of fit” was then calculated as the mean squared error of the curve fitting. The above analyses were done using bespoke software written in C++.

As the data were not normally distributed, non-parametric tests were used. The Kruskal–Wallis test was used to demonstrate the difference between more than two groups, and the Mann–Whitney test was used to compare two groups. Non-parametric correlations were calculated using Spearman's rank test. SPSS 11.5 (Chicago, IL, USA) was used for statistical calculations.

Results
Fig. 2 shows an example of the fixation distributions of a consultant orthopaedic surgeon and an orthopaedic SHO. It illustrates qualitatively the difference in fixation patterns. To illustrate the quality of the fixation distance-to-fracture data used for Gaussian mixture model fitting, Fig. 3 provides three example plots of a consultant radiologist (a), an orthopaedic surgeon (b), and an A&E SHO (c) examining a hand radiograph.
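The two-mode Gaussian mixture fit and its mean squared error, as described in the methods and formalized in the Appendix, can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the authors' C++ software: the initial parameter values, the iteration count, the use of the square root of the weighted squared deviation as the width of each component, and the rescaling of the mixture to the area of the curve before computing the error are all choices made here for the example. The curve heights y(x) act as weights on the time axis, and the spread is taken about the previous mean as in the Appendix update.

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density with mean mu and width (standard deviation) sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def fit_two_gaussians(xs, ys, n_iter=200):
    """Weighted EM for a two-component mixture over xs, with ys as weights."""
    lo, span = min(xs), max(xs) - min(xs)
    # Assumed initialisation: equal weights, means at 1/4 and 3/4 of the range.
    c = [0.5, 0.5]
    mu = [lo + 0.25 * span, lo + 0.75 * span]
    sigma = [0.25 * span, 0.25 * span]
    total = sum(ys)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each x.
        resp = []
        for x in xs:
            w = [c[i] * gaussian(x, mu[i], sigma[i]) for i in range(2)]
            s = sum(w)
            resp.append([wi / s for wi in w] if s > 0 else [0.5, 0.5])
        # M-step: updates weighted by the curve height y(x).
        for i in range(2):
            mass = sum(y * r[i] for y, r in zip(ys, resp))
            if mass <= 0:
                continue
            new_mu = sum(y * r[i] * x for x, y, r in zip(xs, ys, resp)) / mass
            # Spread measured about the previous mean.
            var = sum(y * r[i] * (x - mu[i]) ** 2
                      for x, y, r in zip(xs, ys, resp)) / mass
            c[i], mu[i] = mass / total, new_mu
            sigma[i] = max(math.sqrt(var), 1e-6 * span)
    return c, mu, sigma

def mixture_mse(xs, ys, c, mu, sigma):
    """Mean squared error between ys and the mixture rescaled to the area of ys."""
    dx = xs[1] - xs[0]                      # assumes equally spaced xs
    scale = sum(ys) * dx
    fitted = [scale * sum(c[i] * gaussian(x, mu[i], sigma[i]) for i in range(2))
              for x in xs]
    return sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(xs)

# Synthetic bimodal curve on standardized time [0, 1] (made-up data).
xs = [k / 100.0 for k in range(101)]
ys = [0.3 * gaussian(x, 0.25, 0.05) + 0.7 * gaussian(x, 0.70, 0.10) for x in xs]
c, mu, sigma = fit_two_gaussians(xs, ys)
mse = mixture_mse(xs, ys, c, mu, sigma)
```

On a curve that really is bimodal, the two fitted means settle near the two humps and the error is small; on a disorganized curve the same procedure returns a larger error, which is the sense in which the error serves as a "goodness of fit" measure.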
In Fig. 3, the data points (black diamonds) are gaze measurements, and the lines illustrate the two Gaussian components fitted over the data points. Fig. 3a and b seem to fit well with the two Gaussian components; in this example the first component is smaller in Fig. 3a and larger in Fig. 3b. However, Fig. 3c displays the lack of organization of the eye fixations in this particular example.

Table 1 shows the number of true positives (TP), or identified fractures. It is evident that the senior clinicians had a much higher number of TP (p < 0.001) and fewer false negatives (FN), or missed fractures (p < 0.001). Although the consultant orthopaedic surgeon group had a higher median than the radiologists (24 versus 20 out of a total of 29), this was not statistically significant (p = 0.108). The total time taken to examine all the images was not significantly different between the subject groups (p = 0.72), apart from a radiograph of a fracture of the shoulder with immature bone, where the senior group took longer (p = 0.02).

Fig. 4 shows the dwell time ratio (defined as the dwell time on the fracture site divided by the total time spent on the radiograph) among the five groups. For the TP radiographs, there was a significant difference in dwell time ratios between groups across all the radiographs (p < 0.001), with an overall median of 0.23 (consultant radiologists 0.21, consultant orthopaedic surgeons 0.22, orthopaedic SpRs 0.20, orthopaedic SHOs 0.30, A&E SHOs 0.29). In FN radiographs, the overall median was 0.02, with no significant difference between the groups.

KL distance comparison within each participant's scan paths yielded an overall median of 0.35, and there was a significant difference between the groups (p < 0.001). The consultant radiologist group had a significantly lower KL distance than the A&E SHO group (p < 0.001).
There was also a significant difference between the consultant radiologist and consultant orthopaedic surgeon groups (p < 0.001), and between the orthopaedic SHO and A&E SHO groups (p < 0.001). The mean ranks of the KL distance for the different groups are illustrated in Fig. 5a. The variance of the KL distance was compared within each subject, and there was no significant difference between the groups; however, when radiologists were compared with non-radiologists (groups 2–5), there was a significant difference (p = 0.042).

Fig. 5b and c show the mean squared error (MSE) of the Gaussian mixture model of the five groups in examining the hand and shoulder radiographs, respectively. The MSE of hand radiographs correlated with experience levels, in the order of consultant radiologists, consultant orthopaedic surgeons, orthopaedic SpRs, A&E SHOs and orthopaedic SHOs (r = 0.162, p = 0.07). The consultant radiologist group had lower MSE than the A&E SHO group (p = 0.09). However, in Fig. 5c, where shoulder radiographs were used, there was a negative correlation with experience (r = −0.287, p < 0.001). There were no significant correlations in knee radiographs. Table 1 shows a summary of the results.

In Fig. 5d, where only hand radiographs were used, the covariance of the first Gaussian curve correlated with the confidence level (r = 0.266, p = 0.074; level 1 mean rank 19, level 2 mean rank 17.75, level 3 mean rank 20.27, level 4 mean rank 22.58, level 5 mean rank 27.61). This can be referred back to Fig. 3, where Fig. 3a shows an example in which the second Gaussian curve had a much higher covariance (wider distribution) than the first when compared with Fig. 3b. The confidence level was negatively correlated with experience level, in the order given above (r = −0.303, p = 0.041).
Discussion

Statement of principal findings
Based on the global–focal model, this study aimed to provide a quantitative framework for assessing subtle differences in visual search behaviours in locating focal lesions in musculoskeletal images. It confirmed quantitatively that more experienced observers have higher accuracy in fracture identification than less experienced ones, along with an explanation of the plausible causes.

Dwell time
This study showed that it was the distribution of time spent interpreting each image that was significantly different, not the total time taken. The dwell-time analysis showed that in identified fractures (TP), less time was spent on the fracture site by experts than by novices, as shown in Fig. 4. This implied that with experience less time was needed at the fracture site for identification and decision processing, but more time was spent on cross-referencing or identification of further abnormalities.

Two-stage search
The Gaussian mixture model fitting was used to dissect the search pattern into two stages; this was decided experimentally after observation of all the raw data. It appeared that expert search strategy (especially in hand radiographs) was more consistent with the two-stage search pattern (see Fig. 3a–c). Another interesting observation was that the covariance of the first curve increases with confidence, whereas the covariance of the second exhibits the opposite behaviour, as illustrated in Fig. 5d. The covariance of the Gaussian mixture model is proportional to the width of the curve, and hence more time was spent in the second stage of search when the diagnosis was less obvious. This further confirmed that the second stage was used for cross-referencing other potential targets, as described in the global–focal model. A strikingly different approach was observed in more conspicuous targets, namely shoulder radiographs.
Here the number of potential fracture sites is limited, and the sites are generally more obvious when compared with hand radiographs. 19 Experts in fracture search should be able to detect the targets in the first stage of search (using only peripheral vision), and the next two stages of detailed search would become redundant. The contrast with hand radiographs, as displayed in Fig. 5b and c, in fact further confirmed that the model best describes search behaviour for subtle targets only.

Search consistency
Consistency in search strategy was quantified using the KL distance in this study; this distance describes the amount of difference between gaze distributions. For each subject, the KL distance was calculated between all the images, so a shorter distance meant that similar strategies (or scan paths) were used throughout the study. The variance of the KL distance also showed a similar trend, which reiterated the consistency in the expert groups.

Effect of training and specialization
Previous reports have shown that A&E doctors have inferior diagnostic performance when compared with radiologists. The first aim of the study was to compare A&E SHOs with radiology consultants (recognized as the “gold standard” of radiographic interpretation), as these groups should display the greatest difference in search behaviour. Indeed, it was found that radiologists are significantly more consistent in their search pattern, and seem to adhere to the two-stage search strategy. The comparison of different specialities at similar stages of training provides an interesting contrast that may be explained by the difference in training and the primary aim of reading a radiograph. Although radiology and orthopaedic consultants have similar diagnostic performances, radiologists are more consistent in their approach, and also adhered more closely to the two-stage search model. A&E and orthopaedic SHOs also had similar accuracies in their interpretations.
Interestingly, the A&E SHO group was less consistent in its search behaviour. Furthermore, the effect of training was evaluated using the three orthopaedic groups at various levels of training. Orthopaedic surgery is the only speciality where all three grades regularly review radiographs. It seemed that training changed neither the consistency of search nor the search strategy towards the proposed two-stage model.

Weaknesses of the study
It should be noted that this study only used a relatively small number of single-view radiographs, with no clinical information given to the participants. We felt that the use of two views would add to the complexity of the analysis, due to participants glancing between the two radiographs, and that clinical information might have influenced their search behaviours and biased the results. Reducing the length of the study helped to minimize factors such as fatigue and boredom, which would also have made the analysis more complex. The method used to identify fractures in this study required the participants to prolong their gaze at the fracture site whilst pressing a button. This would artificially increase the dwell time at the fracture; however, this increase should be similar in all groups, or perhaps even be prolonged in the senior groups. This method does allow more accurate assessment of their diagnostic performance.

Contrasts with previous studies
Although extensive previous research has been conducted in visual search scan path analysis in radiological images, the most commonly used metrics have been dwell-time analysis and the time to first hit of the targets, comparing identified lesions with missed ones. 10,12,13 Spatial patterns do not always convey information regarding the intrinsic visual search behaviour, hence other studies have concentrated on feature extraction in radiological images. 11,20,21 Skeletal radiographs are less studied using eye tracking, as they are more heterogeneous in nature.
A study on hand and wrist radiographs revealed that radiologists used four different scan paths (circular, radial, zigzag, complex); however, this was judged subjectively and qualitatively. 22 The search pattern of radiographs with multiple fractures was also studied by Berbaum et al., with the aim of exhibiting the effect of premature termination due to satisfaction of search. 23 In terms of comparison of performance between specialities, two studies looked at skeletal radiographs and eye movements of radiologists and orthopaedic surgeons. However, the studies focused on comparing the presenting media of the radiographs only. 24,25 One study did show that radiologists have higher TPs than orthopaedic surgeons, but each group only had three participants. 24

Meaning and implications of this study
The comparison of the radiologists with A&E doctors provided possible explanations of the differences in accuracy in diagnosis. The contrast between orthopaedic surgeons and radiologists may be explained by their usual clinical practice. Orthopaedic surgeons tend to examine the patients before reading the radiographs, hence their search strategy is heavily influenced by prior knowledge and clinical judgement; this is known as the “top-down” approach. In contrast, radiologists usually receive the radiographs with a brief, insufficient summary of the clinical picture; they also search for all abnormalities in the radiograph (not just fractures). This is called the “bottom-up” approach. 26 The difference between A&E and orthopaedic SHOs in their search consistency is interesting. A&E junior doctors usually have a mixed interest in their future careers; however, the majority of orthopaedic SHOs will have an interest in developing a surgical career, and usually have stronger background knowledge in anatomy and surgical pathology.
The postgraduate training of radiologists is very different from that of orthopaedic surgeons: radiologists tend to be taught formally how to interpret radiographs and usually have their results audited periodically, which is not the case in orthopaedics. This may explain the relatively unchanged search strategies between the three experience groups in orthopaedics. Formal education of orthopaedic surgeons in radiographic interpretation may be beneficial to their search consistency.

Future research
This study included radiographs with multiple fractures, but their behavioural analysis proved to be complex and is certainly worth further study. The effect of satisfaction of search may be further quantified mathematically. 23 The use of eye tracking may prove to be useful for training in radiographic interpretation. However, its routine use will require further improvement of the eye-tracking technology to make it less intrusive and reduce its effect on the usual behaviour of the observers. Further development of the analysis framework is also necessary to cater for the idiosyncrasy of the cognitive visual search strategies used, as this paper is only concerned with the analysis of spatio-temporal scan path patterns.

Acknowledgements
The authors thank Xiao-Peng Hu and Marcus Ellington for their help with the statistical analysis.

Appendix
The KL distance was calculated using the formula below for each pair of images, across the 19 images for each participant. This resulted in a 19 × 19 matrix.

$$ d = \sum_{k} p_k \log_2\left(\frac{p_k}{q_k}\right) $$

where $d$ is the KL distance from $p$, the “true” probability distribution, to $q$, the “target” probability distribution.
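As an illustration of the formula above, the following minimal Python sketch resamples distance-to-fracture curves onto a standardized time axis, normalizes each to a probability distribution, and fills a pair-wise KL matrix. It is a sketch under assumptions rather than the study's implementation: the paper does not state how the curves were converted to distributions or how zero values were handled, so the normalization, the small constant `eps`, the number of resampled points, and all function names are hypothetical.

```python
import math

def resample(times, values, n_points=50):
    """Linearly interpolate a curve onto n_points equally spaced in standardized time."""
    t0, t1 = times[0], times[-1]
    norm = [(t - t0) / (t1 - t0) for t in times]   # times must be increasing
    out, j = [], 0
    for k in range(n_points):
        s = k / (n_points - 1)
        while j < len(norm) - 2 and norm[j + 1] < s:
            j += 1
        left, right = norm[j], norm[j + 1]
        frac = 0.0 if right == left else (s - left) / (right - left)
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return out

def to_distribution(values, eps=1e-9):
    """Normalize non-negative values to sum to 1; eps avoids log of zero (assumption)."""
    shifted = [max(v, 0.0) + eps for v in values]
    total = sum(shifted)
    return [v / total for v in shifted]

def kl_distance(p, q):
    """d = sum_k p_k * log2(p_k / q_k)."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q))

def pairwise_kl(curves, n_points=50):
    """Matrix of KL distances between every pair of (times, values) curves."""
    dists = [to_distribution(resample(t, v, n_points)) for t, v in curves]
    return [[kl_distance(p, q) for q in dists] for p in dists]

# Made-up curves: (times in ms, distance to fracture in pixels).
curves = [([0, 500, 1000], [300.0, 100.0, 10.0]),
          ([0, 400, 2000], [250.0, 120.0, 20.0]),
          ([0, 900, 1200], [50.0, 400.0, 30.0])]
matrix = pairwise_kl(curves)
example = kl_distance([0.5, 0.5], [0.25, 0.75])
```

The matrix has zeros on its diagonal and is not symmetric, since the KL distance from p to q generally differs from that from q to p; a per-participant summary such as the median or variance of the off-diagonal entries can then be compared between groups.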
The fixation distance data were then fitted with a two-mode Gaussian mixture model:

$$ p_g(x \mid c, \theta) = \sum_{i=1}^{2} c_i \, g_i(x \mid \theta_i), \qquad \sum_{i=1}^{2} c_i = 1, $$

and

$$ g_i(x \mid \theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x - \mu_i)^2}{2\sigma_i^2}\right), \qquad \theta_i = (\mu_i, \sigma_i). $$

In the above equation, $\mu_i$ is the mean of the Gaussian component and $\sigma_i$ the covariance. The values were derived from the expectation–maximization (EM) algorithm, 17,18 which was solved iteratively through the following set of equations (with $G = 2$ components):

$$ P_i(x) = \frac{c_i^{\mathrm{old}} \, g_i(x \mid \mu_i^{\mathrm{old}}, \sigma_i^{\mathrm{old}})}{\sum_{k=1}^{G} c_k^{\mathrm{old}} \, g_k(x \mid \mu_k^{\mathrm{old}}, \sigma_k^{\mathrm{old}})} $$

$$ c_i^{\mathrm{new}} = \frac{\sum_x y(x) \, P_i(x)}{\sum_x y(x)} $$

$$ \mu_i^{\mathrm{new}} = \frac{\sum_x y(x) \, P_i(x) \, x}{\sum_x y(x) \, P_i(x)} $$

$$ \sigma_i^{\mathrm{new}} = \frac{\sum_x y(x) \, P_i(x) \left[ (x - \mu_i^{\mathrm{old}})^{T} (x - \mu_i^{\mathrm{old}}) \right]}{\sum_x y(x) \, P_i(x)} $$

where $y(x)$ is the original $y$ value at time $x$ (see Fig. 3 for example).

References
1. J.R. Benger, I.D. Lyburn. What is the effect of reporting all emergency department radiographs? Emerg Med J 2003;20:40–43.
2. S.M. Williams, D.J. Connelly, S. Wadsworth. Radiological review of accident and emergency radiographs: a 1-year audit. Clin Radiol 2000;55:861–865.
3. S. Tachakra. Level of diagnostic confidence, accuracy, and reasons for mistakes in teleradiology for minor injuries. Telemed J E Health 2002;8:111–121.
4. J.T. Rhea, M.S. Potsaid, S.A. DeLuca. Errors of interpretation as elicited by a quality audit of an emergency radiology facility. Radiology 1979;132:277–280.
5. J. Eng, W.K. Mysko, G.E. Weller. Interpretation of emergency department radiographs: a comparison of emergency medicine physicians with radiologists, residents with faculty, and film with digital display. AJR Am J Roentgenol 2000;175:1233–1238.
6. C.V. Cimmino. The radiologist and the orthopedist. Radiology 1970;97:690–691.
7. J. Anglen, K. Marberry, J. Gehrke. The clinical utility of duplicate readings for musculoskeletal radiographs. Orthopedics 1997;20:1015–1019.
8. R. Dodge, T.S. Cline. The angle velocity of eye-movements. Psychol Rev 1901;8:145–157.
9. G.Z. Yang, L. Dempere-Marco, X.P. Hu. Visual search: psychophysical models and practical applications. Image Vis Comput 2002;20:291–305.
10. C.F. Nodine, C. Mello-Thoms, S.P. Weinstein. Blinded review of retrospectively visible unreported breast cancers: an eye-position analysis. Radiology 2001;221:122–129.
11. E.A. Krupinski, W.G. Berger, W.J. Dallas. Searching for nodules: what features attract attention and influence detection? Acad Radiol 2003;10:861–868.
12. H.L. Kundel, C.F. Nodine, D. Carmody. Visual scanning, pattern recognition and decision-making in pulmonary nodule detection. Invest Radiol 1978;13:175–181.
13. H.L. Kundel, C.F. Nodine, E.A. Krupinski. Searching for lung nodules. Visual dwell indicates locations of false-positive and false-negative decisions. Invest Radiol 1989;24:472–478.
14. E.A. Krupinski. Visual search of mammographic images: influence of lesion subtlety. Acad Radiol 2005;12:965–969.
15. H.L. Kundel, C.F. Nodine. The cognitive side of visual search. In: J.K. O'Regan, A. Levy-Schoen (eds). Eye movements: from physiology to cognition. New York: Elsevier; 1987. p. 573–582.
16. S. Kullback, R.A. Leibler. On information and sufficiency. Ann Math Stat 1951;1:79–86.
17. A. Dempster, N. Laird, D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc B 1977;39:1–38.
18. S. Akaho. The EM algorithm for multiple object recognition. Proceedings of IEEE International Conference on Neural Networks (ICNN'95), Perth, Western Australia, 1995;5:2426–2431.
19. E.B. van Onselen, R.B. Karim, J.J. Hage. Prevalence and distribution of hand fractures. J Hand Surg [Br] 2003;28:491–495.
20. X.P. Hu, L. Dempere-Marco, G.Z. Yang. Hot spot detection based on feature space representation of visual search. IEEE Trans Med Imaging 2003;22:1152–1162.
21. L. Dempere-Marco, X.P. Hu, S.L. MacDonald. The use of visual search for knowledge gathering in image decision support. IEEE Trans Med Imaging 2002;21:741–754.
22. C.H. Hu, H.L. Kundel, C.F. Nodine. Searching for bone fractures: a comparison with pulmonary nodule search. Acad Radiol 1994;1:25–32.
23. K.S. Berbaum, E.A. Brandser, E.A. Franken. Gaze dwell times on acute trauma injuries missed because of satisfaction of search. Acad Radiol 2001;8:304–314.
24. P.J. Lund, E.A. Krupinski, S. Pereles. Comparison of conventional and computed radiography: assessment of image quality and reader performance in skeletal extremity trauma. Acad Radiol 1997;4:570–576.
25. E.A. Krupinski, P.J. Lund. Differences in time to interpretation for evaluation of bone radiographs with monitor and film viewing. Acad Radiol 1997;4:177–182.
26. W. van Zoest, M. Donk. Bottom-up and top-down control in visual search. Perception 2004;33:927–937.