Author Response: Human Attention During Goal-Directed Reading Comprehension Relies on Task Optimization
Jiajie Zou,Yuran Zhang,Jialu Li,Xing Tian,Nai Ding
DOI: https://doi.org/10.7554/elife.87197.3.sa4
2023-01-01
Abstract: The computational principles underlying attention allocation in complex goal-directed tasks remain elusive. Goal-directed reading, that is, reading a passage to answer a question in mind, is a common real-world task that strongly engages attention. Here, we investigate what computational models can explain attention distribution in this complex task. We show that the reading time on each word is predicted by the attention weights in transformer-based deep neural networks (DNNs) optimized to perform the same reading task. Eye tracking further reveals that readers separately attend to basic text features and question-relevant information during first-pass reading and rereading, respectively. Similarly, text features and question relevance separately modulate attention weights in shallow and deep DNN layers. Furthermore, when readers scan a passage without a question in mind, their reading time is predicted by DNNs optimized for a word prediction task. Therefore, we offer a computational account of how task optimization modulates attention distribution during real-world reading.

eLife assessment

This study provides a valuable contribution to the study of eye-movements in reading, revealing that attention-weights from a deep neural network show a statistically reliable fit to the word-level reading patterns of humans. Its evidence is convincing and strengthens a line of research arguing that attention in reading reflects task optimization. The work would be of interest to psychologists, neuroscientists, and machine learning researchers.
https://doi.org/10.7554/eLife.87197.3.sa0

Introduction

Attention profoundly influences information processing in the brain (Posner and Petersen, 1990; Treisman and Gelade, 1980; Rayner, 1998), and a large number of studies have been devoted to studying the neural mechanisms of attention. From the perspective of David Marr, the attention mechanism can be studied from three levels, that is, the computational, algorithmic, and implementational levels (Marr, 1982). At the computational level, attention is traditionally viewed as a mechanism to allocate limited central processing resources (Kahneman, 1973; Franconeri et al., 2013; Lennie, 2003; Carrasco, 2011; Borji and Itti, 2012). More recent studies, however, propose that attention is a mechanism to optimize task performance, even in conditions where the processing resource is not clearly constrained (Dayan et al., 2000; Gottlieb et al., 2014; Legge et al., 2002; Liu and Reichle, 2010; Najemnik and Geisler, 2005). The optimization hypothesis can explain the attention distribution in a range of well-controlled learning and decision-making tasks (Najemnik and Geisler, 2005; Navalpakkam et al., 2010), but is rarely tested in complex processing tasks for which the optimal strategy is not obvious. Therefore, the computational principles that underlie the allocation of human attention during complex tasks remain elusive. Nevertheless, complex tasks are critical conditions to test whether the attention mechanisms abstracted from simpler tasks can truly explain real-world attention behaviors.

Reading is one of the most common and most sophisticated human behaviors (Li et al., 2022; Gagl et al., 2022), and it is strongly regulated by attention: Since readers can only recognize a couple of words within one fixation, they have to overtly shift their fixation to read a line of text (Rayner, 1998).
Thus, eye movements serve as an overt expression of attention allocation during reading (Rayner, 1998; Clifton et al., 2016). Computational modeling of the eye movements has mostly focused on normal reading of single sentences. At the computational level, it has been proposed that the eye movements are programmed to, for example, minimize the number of eye movements (Legge et al., 2002). At the algorithmic and implementational level, models such as the E-Z reader (Reichle et al., 2003) can accurately predict the eye movement trajectory with high temporal and spatial resolution. Everyday reading, however, often involves reading a multiline passage and generally has a clear goal, for example, information retrieval or inference generation (White et al., 2010). Few models have considered how the reading goal modulates reading behaviors. Here, we address this question by analyzing how readers allocate attention when reading a passage to answer a specific question in mind. The question may require, for example, information retrieval, inference generation, or text summarization (Figure 1). We investigate whether the task optimization hypothesis can explain the attention distribution in such goal-directed reading tasks.

Figure 1 (with 2 supplements). Experiment and performance. (A) Experimental procedure for Experiments 1–3. In each trial, participants saw a question before reading a passage. After reading the passage, they chose the answer to the question from four options. (B) Accuracy of question answering for humans and computational models. The question type is color coded and an example question is shown for each type. trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task. (C) Time spent on reading each passage.
The box plot shows the mean (horizontal lines inside the box), 25th and 75th percentiles (box boundaries), and 25th/75th percentiles ±1.5× interquartile range (whiskers) across participants (N = 25). (D) Illustration of the training process for transformer-based models. The pre-training process aims to learn general statistical regularities in a language based on large corpora, while the fine-tuning process trains models to perform the reading comprehension task. Finding an optimal solution for the goal-directed reading task, however, is computationally challenging since the information related to question answering is sparsely located in a passage and their orthographic forms may not be predictable. Recent advances in DNN models, however, provide a potential tool to solve this computational problem since DNN models equipped with attention mechanisms have approached and even surpassed mean human performance on goal-directed reading tasks (Lan et al., 2020; Liu et al., 2019). Attention in DNN also functions as a mechanism to selectively extract useful information, and therefore, attention may potentially serve a conceptually similar role in DNN. Furthermore, recent studies have provided strong evidence that task-optimized DNN can indeed explain the neural response properties in a range of visual and language processing tasks (Yamins et al., 2014; Kell et al., 2018; Goldstein et al., 2022; Schrimpf et al., 2021; Hasson et al., 2020; Donhauser and Baillet, 2020; Rabovsky et al., 2018; Heilbron et al., 2022). Therefore, although the DNN attention mechanism certainly deviates from the human attention mechanism in terms of its algorithms and implementation, we employ it to probe the computational-level principle underlying human attention distribution during real-world goal-directed reading. Here, we investigated what computational principles could generate human-like attention distribution during a goal-directed reading task. 
We employed DNNs to derive a set of attention weights that are optimized for the goal-directed reading task and tested whether such optimal weights could explain human attention measured by eye tracking. Furthermore, since both human and DNN processing is hierarchical, we also investigated whether the human attention distribution during different processing stages, which are characterized through different eye-tracking measures, and the DNN attention weights in different layers may be differentially influenced by visual features, text properties, and the top-down task. Additionally, we recruited both native and non-native readers to probe how language proficiency contributed to the computational optimality of attention distribution. Results Experiment 1: Task and performance In Experiment 1, the participants (N = 25 for each question) first read a question and then read a passage based on which the question should be answered (Figure 1A). After reading the passage, the participants chose from four options which option was the most suitable answer to the question. In total, 800 question/passage pairs were adapted from the RACE dataset (Lai et al., 2017), a collection of English reading comprehension questions designed for Chinese high school students who learn English as a second language. The questions fell into six types (Figure 1B and C): three types of questions required attention to details, for example, retrieving a fact or generate inference based on a fact, which were referred to as local questions. The other three types of questions concerned the general understanding of a passage, for example, summarizing the main idea or identifying the purpose of writing, which were referred to as global questions. None of the question directly appeared in the passage, and the longest string that overlapped in the passage and question was 1.8 ± 1.5 words on average. 
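The overlap statistic reported above (the longest string shared by a passage and its question, counted in words) can be illustrated with a short sketch. It assumes whitespace tokenization and case-insensitive matching, neither of which is specified in the text, and the function name `longest_shared_run` is hypothetical rather than taken from the authors' code.

```python
def longest_shared_run(passage: str, question: str) -> int:
    """Length, in words, of the longest contiguous word sequence
    that appears in both texts (assumed: whitespace tokens, lowercased)."""
    p = passage.lower().split()
    q = question.lower().split()
    best = 0
    # prev[j] holds the length of the shared run ending at p[i-2] and q[j-1]
    prev = [0] * (len(q) + 1)
    for i in range(1, len(p) + 1):
        cur = [0] * (len(q) + 1)
        for j in range(1, len(q) + 1):
            if p[i - 1] == q[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run that ended one word earlier
                best = max(best, cur[j])
        prev = cur
    return best


# Toy example: the shared run is "sat on the" (3 words).
overlap = longest_shared_run("The cat sat on the mat today", "Who sat on the chair")
print(overlap)
```

A small value of this measure for every question/passage pair means that a question cannot be answered by locating a verbatim copy of it in the passage.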
Participants in Experiment 1 were Chinese college or graduate students who had relatively high English proficiency. The participants correctly answered 77.94% questions on average and the accuracy was comparable across the six types of questions (Figure 1B). We employed computational models to analyze what kinds of computations were required to answer the questions. The simplest heuristic model chose the option that best matched the passage orthographically (Figure 1—figure supplement 1A). This orthographic model achieved 25.6% accuracy (Figure 1B). Another simple heuristic model only considered word-level semantic matching between the passage and option, and achieved 27.3% accuracy (Figure 1B). The low accuracy of the two models indicated that the reading comprehension questions could not be answered by word-level orthographic or semantic matching. Next, we evaluated the performance of four context-dependent DNN models, that is, Stanford Attentive Reader (SAR) (Chen et al., 2016), BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2019), which could integrate information across words to build passage-level semantic representations. The SAR used the bidirectional recurrent neural network (RNN) to integrate contextual information (Figure 1—figure supplement 1B) and achieved 47.6% accuracy. The other three models, that is, BERT, ALBERT, and RoBERTa, were transformer-based models that were trained in two steps, that is, pre-training and fine-tuning (Figure 1D). Since the three models had similar structures, we averaged the performance over the three models (see Figure 1—figure supplement 2 for the results of individual models). The model performance on the reading task was 37.08 and 73%, respectively, after pre-training and fine-tuning (Figure 1B). Computational models of human attention distribution In Experiment 1, participants were allowed to read each passage for 2 min. 
Nevertheless, to encourage the participants to develop an effective reading strategy, the monetary reward the participant received decreased as they spent more time reading the passage (see ‘Materials and methods’ for details). The results showed that the participants spent, on average, 0.7 ± 0.2 min reading each passage (Figure 1C), corresponding to a reading speed of 457 ± 142 words/min when divided by the number of words per passage. The speed was almost twice the normal reading speed for native readers (Rayner, 1998), indicating a specialized reading strategy for the task. Next, we employed eye tracking to quantify how the readers allocated their attention to achieve effective reading and analyze which computational models could explain the reading time on each word, that is, the total fixation duration on each word during passage reading. In other words, we probed into what kind of computational principles could generate human-like attention distribution during goal-directed reading. A simple heuristic strategy was to attend to words that were orthographically or semantically similar to the words in the question (Figure 1—figure supplement 1A). The predictions of the heuristic models were not highly correlated with the human word reading time, and the predictive power, that is, the Pearson correlation coefficient between the predicted and real word reading time, was around 0.2 (Figure 3—figure supplement 1A). The DNN models analyzed here, that is, SAR, BERT, ALBERT, and RoBERTa, all employed the attention mechanism to integrate over context to find optimal question answering strategies. 
Roughly speaking, the attention mechanism applied a weighted integration across all input words to generate a passage-level representation and decide whether an option was correct or not, and the weight on each word was referred to as the attention weight (see Figure 1—figure supplement 1B and Figure 2B for illustrations about the attention mechanisms in the SAR and transformer-based models, respectively). When the attention weights of the SAR were used to predict the human word reading time, the predictive power was about 0.1 (Figure 3A, Supplementary file 1a).

Figure 2. Human attention distribution and computational models. (A) Examples of human attention distribution, quantified by the word reading time. The histograms on the right showed the mean reading time on each line for both human data and model predictions. trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task. (B) The general architecture of the 12-layer transformer-based models. The model input consists of all words in the passage and an integrated option. Output of the model relies on the node CLS (Legge et al., 2002), which is used to calculate a score reflecting how likely an option is the correct answer. The CLS node is a weighted sum of the vectorial representations of all words and tokens, and the attention weight for each word in the passage, that is, α, is the deep neural network (DNN) attention analyzed in this study.

Figure 3 (with 3 supplements). Model word reading time in Experiment 1. (A, B) Predict the word reading time based on the attention weights of deep neural network (DNN) models, text features, or question relevance. The predictive power is the correlation coefficient between the predicted word reading time and the actual word reading time. Predictive power significantly higher than chance is denoted by stars on the top of each bar. **p<0.01.
trans_rand: transformer-based models with randomized parameters; trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task. (C) Relationship between the word reading time and line index. The word reading time is longer near the beginning of a passage and the effect is stronger for global questions than local questions. (D) Relationship between the word reading time and question relevance. Line 0 refers to the line with the highest question relevance. The word reading time is higher for the question-relevant line. Color indicates the question type. The shaded area indicates 1 standard error of the mean (SEM) across participants (N = 25).

In contrast to assigning a single weight on a word, the transformer-based model employed a multihead attention mechanism: each of the 12 layers had 12 parallel attention modules, that is, heads. Consequently, each word had 144 attention weights (12 layers × 12 heads), which were used to model the word reading time of humans based on linear regression. Since the attention weights of three transformer-based models showed comparable power to predict human word reading time, we reported the predictive power averaged over models (see Figure 3—figure supplement 1A for the results of individual models). The attention weights of randomly initialized transformer-based models could predict the human word reading time and the predictive power, which was around 0.3, was significantly higher than the chance level and the SAR (Figure 3A, Supplementary file 1a). The attention weights of pre-trained transformer-based models could also predict the human word reading time, with a predictive power of around 0.5, significantly higher than the predictive power of heuristic models, the SAR, and randomly initialized transformer-based models (Figure 3A, Supplementary file 1a).
The predictive power was further boosted for local but not global questions when the models were fine-tuned to perform the goal-directed reading task (Figure 3A, Supplementary file 1a). The weights assigned to attention heads in the linear regression are shown in Figure 3—figure supplement 2. For the fine-tuned models, we also predicted the human word reading time using an unweighted average of the 144 attention heads and the predictive power was 0.3, significantly higher than that achieved by the attention weights of SAR (p=4 × 10–5, bootstrap). These results suggested that the human attention distribution was consistent with the attention weights in transformer-based models that were optimized to perform the same goal-directed reading task.

Factors influencing human word reading time

The attention weights in transformer-based DNN models could predict the human word reading time. Nevertheless, it remained unclear whether such predictions were purely driven by basic text features that were known to modulate word reading time. Therefore, in the following, we first analyzed how basic text features modulated the word reading time during the goal-directed reading task, and then checked whether transformer-based DNNs could capture additional properties of the word reading time that could not be explained by basic text features. Here, we further decomposed text features into visual layout features, that is, position of a word on the screen, and word features, for example, word length, frequency, and surprisal. Layout features were features that were mostly induced by line changes, which could be extracted without recognizing the words, while word features were finer-grained features that could only be extracted when the word or neighboring words were fixated. Linear regression analyses revealed that layout features could significantly predict the word reading time (Figure 3B, Supplementary file 1b).
Furthermore, the predictive power was higher for global than local questions (p=4 × 10–5, bootstrap, false discovery rate [FDR] corrected for comparisons across three features, i.e., layout features, word features, and question relevance), suggesting a question-type-specific reading strategy. Word features could also significantly predict human reading time, even when the influence of layout features was regressed out. Additionally, a linear mixed effect model revealed significant fixed effects for question type and all text/task-related features, as well as significant interactions between question type and these text/task-related features (Supplementary file 1c; Pinheiro and Bates, 2006; Kuznetsova et al., 2017). The predictive power of the layout and word features, however, was lower than the predictive power of attention weights of transformer-based models (p=4 × 10–5, bootstrap, FDR corrected for comparisons across two features, i.e., layout and word features). When the layout and word features were regressed out, the residual word reading time was still significantly predicted by the attention weights in transformer-based models (Figure 3—figure supplement 1B, predictive power about 0.3). This result indicated that what the transformer-based models extracted were more than basic text features. Next, we analyzed whether the transformer-based models, as well as the human word reading time, were sensitive to task-related features. To characterize the relevance of each word to the question answering task, we asked another group of participants to annotate which words contributed most to question answering. The annotated question relevance could significantly predict word reading time, even when the influences of layout and word features were regressed out (Figure 3B, Supplementary file 1b). 
When the question relevance was also regressed out, the residual word reading time was still significantly predicted by the attention weights in transformer-based models (Figure 3—figure supplement 1C, p=0.003, bootstrap, FDR corrected for comparisons across 12 models × 6 question types), but the predictive power dropped to about 0.2. Furthermore, a linear mixed effect model also revealed that more than 85% of the DNN attention heads contribute to the prediction of human reading time when considering text features and question relevance as covariates (Supplementary file 1c). These results demonstrated that the DNN attention weights provided additional information about the human word reading time than the text-related and task-related features analyzed here. Further analyses revealed two properties of the distribution of question-relevant words. First, for local questions, the question-relevant words were roughly uniformly distributed in the passage, while for global questions, the question-relevant words tended to be near the passage beginning (Figure 3—figure supplement 3A). The eye-tracking data showed that readers also spent more time reading the passage beginning for global than local questions (Figure 3C), explaining why layout features more strongly influenced the answering of global than local questions. Second, few lines in the passage were question relevant (Figure 3—figure supplement 3B), and the eye-tracking data showed that readers spent more time reading the line with the highest question relevance (Figure 3D), confirming the influence of question relevance on word reading time. Attention in different processing stages for humans and DNNs Next, we investigated whether humans and DNNs attended to different features in different processing stages. The early stage of human reading was indexed by the gaze duration, that is, duration of first-pass reading of a word, and the later stage was indexed by the counts of rereading. 
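The two stage-specific measures can be made concrete with a small sketch that derives them from a fixation sequence. It assumes fixations are given as (word index, duration in ms) pairs in temporal order, that the first pass on a word is the first uninterrupted run of fixations on it, and that each later return counts as one rereading. These operational details are assumptions for illustration and may differ from the definitions in the paper's Materials and methods.

```python
from itertools import groupby


def stage_measures(fixations):
    """fixations: list of (word_index, duration_ms) in temporal order.
    Returns {word_index: (gaze_duration_ms, rereading_count)}."""
    measures = {}
    # Group consecutive fixations on the same word into "runs".
    for word, run in groupby(fixations, key=lambda f: f[0]):
        run_duration = sum(duration for _, duration in run)
        if word not in measures:
            # First run on this word: its summed duration is the gaze duration.
            measures[word] = (run_duration, 0)
        else:
            # Any later run is a return to the word, counted as one rereading.
            gaze, rereads = measures[word]
            measures[word] = (gaze, rereads + 1)
    return measures


# Toy sequence: word 1 is fixated twice in the first pass, then revisited twice.
fixations = [(0, 200), (1, 150), (1, 100), (2, 180), (1, 220), (3, 160), (1, 90)]
measures = stage_measures(fixations)
print(measures[1])
```

Separating the first run from later returns is what lets the early measure (gaze duration) and the late measure (rereading count) be analyzed independently.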
Results showed that the influence of layout features increased from early to late reading stages for global but not local questions (Figure 4A, Supplementary file 1d). Consequently, the passage beginning effect differed between global and local questions only for the late reading stage (Figure 4—figure supplement 1A). The influence of word features did not strongly change between reading stages, while the influence of question relevance significantly increased from early to late reading stages (Figure 4A, Figure 4—figure supplement 1B). These results suggested that attention to basic text features developed early, while the task mainly influenced late reading processes.

Figure 4 (with 3 supplements). Factors influencing attention distribution in different processing stages for humans and deep neural networks (DNNs). (A) Human attention in early and late reading stages is differentially modulated by text features and question relevance. The early and late stages are separately characterized by gaze duration, that is, duration for the first reading of a word, and counts of rereading, respectively. **p<0.01; ***p<0.001. (B) DNN attention weights in different layers are also differentially modulated by text features and question relevance. Each attention head is separately modeled and averaged within each layer, and the results are further averaged across the three transformer-based models. Shallow layers of both fine-tuned and pre-trained models are more sensitive to text features. Deep layers of fine-tuned models are sensitive to question relevance. trans_rand: transformer-based models with randomized parameters; trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task.

In the following, we further investigated whether transformer-based DNN attended to different features in different layers, which represented different processing stages.
This analysis did not include layout features that were not available to the models. The attention weights in shallow layers were sensitive to word features in randomized, pre-trained, and fine-tuned models (Figure 4B and C). Only in the fine-tuned models, however, the attention weights in deep layers were sensitive to question relevance (see Figure 4—figure supplements 2 and 3 for results of individual models). Therefore, the shallow and deep layers separately evolved text-based and goal-directed attention, and goal-directed attention was induced by fine-tuning on the task. Experiment 2: Question type specificity of the reading strategy In Experiment 1, different types of questions were presented in blocks which encouraged the participants to develop question type-specific reading strategies. Next, we ran Experiment 2, in which questions from different types were mixed and presented in a randomized order, to test whether the participants developed question type-specific strategies in Experiment 1. Since it was time consuming to measure the response to all 800 questions, we randomly selected 96 questions for Experiment 2 (16 questions per type). In Experiment 2, the reading speed was on average 298 ± 123 words/min, lower than the speed in Experiment 1 (p=6 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments), but still much faster than normal reading speed (Rayner, 1998). The word reading time was better predicted by fine-tuned than pre-trained transformer-based models (Figure 5A, Supplementary file 1e). For the influence of text and task-related features, compared to Experiment 1, the predictive power in Experiment 2 was higher for layout and word features, but lower for question relevance (Figure 5B, Supplementary file 1f). 
For local questions, consistent with Experiment 1, the effects of question relevance significantly increased from early to late processing stages that are separately indexed by gaze duration and counts of rereading (Figure 5—figure supplement 1A, Supplementary file 1d). The passage beginning effect was higher for global than local questions (Figure 5C, second column, p=2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments), but the difference was smaller than in Experiment 1 (Figure 5C, Figure 5—figure supplement 2A, p=2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). The question relevance effect was also smaller in Experiment 2 than Experiment 1 (Figure 5D, Figure 5—figure supplement 2B, p=2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). All these results indicated that the readers developed question type-specific strategies in Experiment 1, which led to faster reading speed and stronger task modulation of word reading time.

Figure 5 (with 2 supplements). Influence of task and language proficiency on word reading time. (A, B) Predict the word reading time using attention weights of deep neural network (DNN) models, text features, and question relevance for all four experiments. Predictive power significantly higher than chance is marked by stars of the same color as the bar. Significant differences between experiments are denoted by black stars. trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task. *p<0.05; **p<0.01; ***p<0.001. (C, D) Passage beginning and question relevance effects for all four experiments. The shaded area indicates 1 SEM across participants (N = 25 for Exp 1; N = 20 for Exps 2-4).

Experiment 3: Effect of language proficiency

Experiments 1 and 2 recruited L2 readers.
To investigate how language proficiency influenced task modulation of attention and the optimality of attention distribution, we ran Experiment 3, which was the same as Experiment 2 except that the participants were native English readers. In Experiment 3, the reading speed was on average 506 ± 155 words/min, higher than that in Experiment 2 (p=6 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). The question answering accuracy was comparable to L2 readers (Figure 1B). The word reading time for native readers was slightly better predicted by fine-tuned than pre-trained transformer-based models (Figure 5A, Supplementary file 1e). For the influence of text and task-related features, compared to Experiment 2, the predictive power in Experiment 3 was higher for word features, but lower for layout features and question relevance (Supplementary file 1f). For local questions, the layout effect was more salient for gaze duration than for counts of rereading. In contrast, the effect of word-related features and task relevance was more salient for counts of rereading than gaze duration (Figure 5—figure supplement 1B, Supplementary file 1d). The passage beginning effect was higher for global than local questions, but the difference was smaller than in Experiment 2 (Figure 5C, Figure 5—figure supplement 2A, p = 2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). The question relevance effect was also smaller for Experiment 3 than Experiment 2 (Figure 5D, Figure 5—figure supplement 2B, p=2 × 10–4, bootstrap, FDR cor