We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: state-of-the-art LLMs with in-context learning, RAG, and in-domain training, as well as commercial DeepResearch services, lag behind humans by 15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
1 Introduction

The emergence of new LLM-driven applications, such as multi-document analysis (Google DeepMind, 2024; Wang et al., 2024d; Gutierrez et al., 2024), personal assistants with chat histories (Wu et al., 2024; Xu et al., 2025a), autonomous agents (Hu et al., 2025; OpenAI, 2025; Wang et al., 2025), and repository-level coding tools (Jimenez et al., 2024), has created increasing demands for robust long-context understanding and reasoning.
To better support long inputs, many techniques have been proposed, primarily focusing on efficient attention mechanisms (Xiong et al., 2021 ; 2023 ) and retrieval-augmented generation (RAG) (Lewis et al., 2020 ; Xu et al., 2024a ; Edge et al., 2024 ; Gutierrez et al., 2024 ; Asai et al., 2023 ) . Alongside these technical advances, there is a growing need for effectively evaluating long context understanding and reasoning capabilities. To this end, several benchmarks have recently been introduced (see Section 2 ). Building on this progress, recent work has extensively discussed the criteria that a strong benchmark for long-context understanding and reasoning should satisfy (Press et al., 2023 ; Yu et al., 2023 ; Liu et al., 2024 ; Yen et al., 2024 ; Fang et al., 2024 ; Wu et al., 2025b ) . To rigorously evaluate a model’s capabilities in this domain, several key criteria have emerged as essential:
Beyond Memorization. LLMs memorize content from pretraining, especially for popular texts (Tirumala et al., 2022; Delétang et al., 2023; Sutskever, 2023), enabling answers without true comprehension. The existence of this shortcut blurs the line between long-context understanding and mere activation of parametric knowledge memorized during pretraining. As training data grows, this issue worsens. As a necessary condition, a robust benchmark must prevent solutions based on memorization alone, ensuring that full-context reasoning remains essential.
Global Dependency. The task should require aggregating evidence scattered across the context or exhibiting global dependencies; otherwise, it reduces to a short-context problem focused on retrieval rather than true long-text understanding.
Depth of Reasoning. Long-context reasoning should inherently require synthesizing multiple pieces of evidence and multi-step deduction. However, many existing benchmarks focus on shallow reasoning, such as decomposition or enumeration ( e.g. , counting facts (Xu & Ma, 2025 ) , or sequential sub-questions (Yang et al., 2018 ) ), which reduces the need for global, multi-step inference and oversimplifies the task.
Human-Machine Gap. To highlight essential capabilities that general-purpose intelligent systems should possess, a benchmark should show a significant gap between humans and machines. Low human accuracy or agreement typically signals poor annotation quality or high subjectivity, undermining reliable evaluation.
Beyond Summarization/Salience. Often overlooked but crucial, a strong benchmark should require attention to fine-grained details beyond high-level abstraction to remain challenging and meaningful. Otherwise, it risks reducing to a summarization task that is solvable without long-context understanding (Chang et al., 2023 ) .
Figure 1 evaluates several widely-used benchmarks for story understanding across the aforementioned dimensions. It shows that all existing benchmarks fall short in at least one aspect, with a particularly notable limitation in mitigating the effects of Memorization and encouraging Deep Reasoning . Detailed evaluation metrics and experimental settings are provided in Appendix A .
To address these limitations, we propose a novel task format that enables comprehensive assessment across all the identified criteria. The task presents the model with a list of hypothetical prequels for important supporting book characters and asks it to determine whether each prequel is consistent with the canonical story. Each hypothesis is presented as a concise bullet point (see Figure 1) summarizing a proposed setting. These hypotheses are annotated by human experts who have read the canonical stories multiple times and have previously performed literary analysis. This annotation process is both efficient and high-quality, ultimately yielding a dataset of 1K labeled examples. Intuitively, our task design naturally mitigates the key limitations observed in existing benchmarks:
First, the Memorization shortcut is alleviated by construction, as the input prequels are newly generated and do not appear in the training data of any existing LLMs. The limited performance of OpenAI’s Deep Research further supports this, suggesting that it is difficult to locate human-summarized evidence on the internet to solve our task.
Second, our task encourages global reasoning . This is because (1) determining whether a consistent prequel aligns with the canonical story typically requires aggregating evidence across the whole character story; and (2) contradictory prequels often involve inconsistencies that span several scattered events due to the narrative structure of the original work. Empirically, our annotation analysis reveals that 88% of the examples in PRELUDE require non-local evidence to resolve.
Finally, our task encourages deep reasoning, because the canonical story reflects non-immediate consequences of the prequels. To solve our task, LLMs must unfold the implications of a prequel and align them with the story, often requiring multi-step inference. For instance, the second example in Figure 1 involves reasoning that Faria was arrested when Napoleon was still emperor, and then inferring a contradiction from the fact that the Bourbon Restoration removed Napoleon from power. This kind of non-immediate causality resists shallow reasoning shortcuts that decompose the problem into subquestions.
We conduct extensive experiments on our proposed task using ICL, RAG, and DeepResearch across state-of-the-art commercial and open-source LLMs. The results reveal several key findings: (1) The best-performing system lags human performance by over 15%; (2) LLMs often arrive at correct predictions with flawed reasoning, resulting in a reasoning accuracy gap of 30% compared to humans; (3) Our task cannot be solved by searching for information on the web, making the advanced DeepResearch systems struggle and underperform RAG; (4) Supervised training and many-shot ICL yield no performance gains, highlighting LLMs’ intrinsic limitations in long-context reasoning.
These findings highlight that PRELUDE requires deeper long-context reasoning capabilities beyond what current methods offer, pointing to important directions for future research.
Following the needle-in-a-haystack task that examines LLMs' in-context searching (Kamradt, 2023), a line of work has since focused on probing the LLM's ability to trace and utilize pieces of information in stretched long contexts (Hsieh et al., 2024; Li et al., 2025b; Yu et al., 2025b), while others also fuse reasoning into their task design, such as sorting or relation identification, beyond the mere retrieval aspect (Kuratov et al., 2024; Wang et al., 2024a; Dong et al., 2024; Wang et al., 2024c; Lee et al., 2025). Besides these works that specifically stress-test long-context utilization, other related LLM tasks can also reflect such ability, e.g., many-shot in-context learning (Agarwal et al., 2024; Xu et al., 2024b; Li et al., 2025c).
Orthogonal to those synthetic stress-testing tasks, another line of work targets more natural question-answering settings for realistic long-context evaluation, primarily utilizing stories in various domains, such as NarrativeQA (Kočiskỳ et al., 2018), NovelQA (Wang et al., 2024b), DetectiveQA (Xu et al., 2025b), and CharToM-QA (Zhou et al., 2025). Realistic long-context QA has gained particular attention in many recent LLM benchmarks, such as LongBench (Bai et al., 2024a;b), XLBench (Ni et al., 2024), CLongEval (Qiu et al., 2024), LooGLE (Li et al., 2024), ∞Bench (Zhang et al., 2024), LaRA (Li et al., 2025a), etc.
Our task is closely related to fact verification over multiple documents or web sources, as exemplified by FEVER (Thorne et al., 2018 ) and its extensions (Wadden et al., 2020 ; Yin et al., 2021 ; Schlichtkrull et al., 2023 ) .
Among this line of work, NoCha (Karpinska et al., 2024 ) is the most relevant, as it also uses book narratives as context. However, a key distinction lies in the nature of the hypotheses: NoCha uses summaries or conclusions of the original story, which often share semantic overlap with the canonical book. Therefore, this task design is vulnerable to memorization or summarization shortcuts, as shown in Figure 1 . To mitigate this, NoCha uses recently published books. Yet as training corpora expand, newer LLMs inevitably become familiar with these texts, reducing the task’s effectiveness. As shown in our experiments, while the public subset of NoCha has been largely conquered by LLMs, our subset comprising works from the same period or earlier remains challenging. This shows that our task is not becoming easier for models over time.
Our task is formulated as binary classification. The input consists of a book that is split into consecutive chunks, together with a prequel for a character: a short text describing an experience of the character prior to where the story of the book begins. The task is then to predict whether the prequel aligns with the book. The labels to predict are Consistent and Contradict.
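To make the task format concrete, a minimal sketch of one instance is shown below; the field names and Python representation are our own illustration and are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema for one PRELUDE instance; field names are illustrative only.
@dataclass
class PreludeInstance:
    book_chunks: List[str]   # the canonical book, split into consecutive chunks
    character: str           # the supporting character the prequel is written for
    prequel: str             # a short bullet-point prequel hypothesis
    label: str               # gold label: "Consistent" or "Contradict"

def is_valid_label(instance: PreludeInstance) -> bool:
    """The task is binary classification over these two labels."""
    return instance.label in {"Consistent", "Contradict"}
```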
3.2 Why Prequels?

Our prequel entailment task is naturally a long-context understanding task and a form of everyday research task. To solve the task, a model needs to judge whether a character's prequel remains consistent with the character's behaviors and experiences throughout the narrative, and to perform counterfactual reasoning when necessary. These requirements make our task well-suited for benchmarking long-context reasoning, owing to the following desirable properties:
Cognitive research practices in daily life: While formal research is often confined to scientific domains, its core cognitive components, such as gathering evidence, forming hypotheses, and drawing conclusions, are deeply embedded in daily reasoning. Our task scenario mirrors this real-life cognition, as humans frequently make similar judgments while watching films, reading novels, or engaging in social interactions.
Light dependency on background knowledge: This task requires little reliance on external or specialized knowledge. A reader with a full understanding of the story, even as a middle school student, can often make accurate judgments. As a result, the task emphasizes fluid intelligence rather than crystallized knowledge acquired through prior learning.
Annotators are instructed to follow the definitions and examples provided in Table 1. They are guided by a flowchart that first checks for Contradict-Local, Contradict-Global I, and Contradict-Global II in turn. If no contradiction is identified, they then determine whether the consistent prequel provides key missing information. During annotation, annotators are required to carefully consult the original book to identify any contradictions.
Annotation following this flowchart is generally sufficient for people familiar with the book. However, during trial annotation, we identified three issues and therefore introduced the following rules:
First, judgments must be based solely on the content of the original novel. Adaptations, derivative works, or historical inspirations behind the characters should not be considered. Otherwise, annotators might incorrectly flag a contradiction on a case consistent with the novel based on, for example, a historical figure’s biography.
Second, annotators should assume that the prequel is followed immediately by the original story, with no additional text or events in between. This rule addresses a tendency among human annotators to over-interpret. For instance, when facing a clear contradiction, they might imagine that some unseen event occurred between the prequel and the original story to make it eventually consistent. This rule reduces such subjectivity.
Finally, unless a character’s statements are later explicitly revealed to be deceptive, they should be treated as factual, akin to narrative exposition. Otherwise, one could dismiss any contradiction with the canonical text as intentional deceit.
We prompt DeepSeek-R1 and GPT-4o to generate prequel samples, using the prompt provided in Appendix B.1 . In the prompt, we explicitly instruct the LLMs to generate a prequel for each character in Markdown list format. Each bullet point is treated as an individual prequel example for annotators to label according to the types defined in Table 1 .
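Below is a hedged sketch of this generation step, assuming an OpenAI-compatible chat API; the model name is a placeholder and the prompt is abridged (the full prompt is given in Appendix B.1).

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key in the environment

def generate_prequel_candidates(book_name: str, character: str) -> list[str]:
    """Ask the LLM for a prequel outline in Markdown list format and split it
    into individual bullet points, each of which is annotated separately."""
    prompt = (
        f'Please write a prequel story for the character "{character}" based on '
        f"the plot of the novel {book_name}. ... "  # abridged; full prompt in Appendix B.1
        "Please present your answer in a Markdown list format."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; DeepSeek-R1 is queried analogously
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Treat each Markdown bullet as one prequel example.
    return [m.strip() for m in re.findall(r"^\s*[-*]\s+(.+)$", text, flags=re.MULTILINE)]
```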
Four annotators labeled the prequels for 40 characters across 13 books (see Appendix B.2), as shown in Table 2. These books were selected to represent diversity in genre, popularity, and original language. Two annotators are graduate students majoring in literature or related fields, while the other two major in computer science. Each annotator worked on books they were familiar with to ensure high-quality annotations. (By "familiar", we require that the annotator has read the book multiple times and can recall the overall plot in reasonable detail.) The label distribution in Table 3 shows that humans identified contradictions in nearly half of the generated prequels.
The annotation process resulted in a total of 795 instances, with each case taking approximately 10 minutes to complete. After training, the annotators reached substantial agreement, with a Kappa score of 0.7828, though some subjectivity in interpretation remained. Most unresolved cases were due to differing interpretations of characters, ambiguities left by the original authors, or inherently fuzzy logic. Representative examples are shown in Appendix D.1 (Examples I and II).
4 Compared Methods

We compare multiple state-of-the-art LLMs under the following settings. The implementation details can be found in Appendix C.
This is the vanilla prompting approach that first presents the LLMs with the task instruction and then a small number of in-context examples. It does not provide book context in the input, so the LLMs need to rely on their inherent parametric knowledge to solve the task. We use the prompts shown in Figures 9 and 11 in the appendix, but with the field Original Excerpt omitted from the input.
For open-source LLMs, we use the Instruct versions of the Qwen2.5/3 models and DeepSeek-R1. We also compare against commercial LLMs, including GPT-4o, o3-mini, and the Gemini 2.5 family, accessed via API calls.
We enable the LLMs to access the canonical novels via retrieval-augmented generation (RAG) (Lewis et al., 2020 ; Guu et al., 2020 ) . We experiment with various embedding models and hyperparameters, as detailed in Section 5.3 . Our final system uses the Qwen3-Embedding-8B model to retrieve the top 40 chunks, each with a length of 500 tokens.
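As a rough illustration of this retrieval setup, the sketch below assumes the sentence-transformers library as the embedding backend and a simple token-window chunker; the actual pipeline may differ in these details.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding backend; the paper's exact retrieval stack may differ.
embedder = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

def chunk_book(tokens: list[str], chunk_len: int = 500) -> list[str]:
    """Split the book into consecutive chunks of roughly 500 tokens each."""
    return [" ".join(tokens[i:i + chunk_len]) for i in range(0, len(tokens), chunk_len)]

def retrieve_top_chunks(prequel: str, chunks: list[str], k: int = 40) -> list[str]:
    """Embed the prequel and all chunks, then return the top-k chunks by cosine similarity."""
    q = embedder.encode([prequel], normalize_embeddings=True)
    c = embedder.encode(chunks, normalize_embeddings=True)
    scores = (c @ q.T).squeeze(-1)      # cosine similarity (embeddings are normalized)
    top_idx = np.argsort(-scores)[:k]
    return [chunks[i] for i in top_idx]
```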
Prior work has shown that fine-tuning with as few as 1K examples can elicit specific capabilities in LLMs, such as mathematical reasoning or general instruction following (Zhou et al., 2023; Zhao et al., 2024; Muennighoff et al., 2025). These successes rely on the assumption that such capabilities are already present in the model, acquired during pretraining, and thus can be activated with minimal supervision.
In contrast, if an LLM lacks the potential for a given capability, training with a small number of examples is unlikely to produce meaningful gains. Thus, the effectiveness of low-resource in-domain training can serve as a diagnostic tool to assess the intrinsic difficulty of a task, as demonstrated in (Yu et al., 2025a ) . Following this idea, we fine-tune on our labeled dataset (excluding the human study subset, 700 examples) and evaluate on the held-out human study subset.
Similar to the in-domain training approach, many-shot ICL (Agarwal et al., 2024 ; Bertsch et al., 2025 ) provides a large number of examples in the input context to elicit the latent capabilities of the LLM. We use the same data split as in the in-domain training experiment.
Commercial deep research services are offered by several companies, notably OpenAI DeepResearch (https://openai.com/index/introducing-deep-research/) and Google Gemini DeepResearch (https://gemini.google/overview/deep-research/). These services showcase the ability to retrieve and synthesize information from multiple sources to generate reports using an agentic approach. We use the web interface of OpenAI DeepResearch, which has demonstrated strong performance across a wide range of everyday tasks.
5.1 Human Performance

We selected 100 examples to compute human performance. Three participants who had not been involved in our task annotation and have similar backgrounds to our annotators were asked to annotate examples from books they were familiar with. The results show strong performance, with an F1 score of 81.7% (an accuracy of 82%), indicating that the task is largely solvable by humans.
Upon examining the disagreements, we found that most could be resolved, as they were often due to annotators overlooking information (either from fatigue after extended work or the fallibility of human memory). The remaining unresolved cases are of similar types to those analyzed in Section 3.3.
Our study further reveals that humans tend to adopt a DeepResearch-style approach, which involves iteratively generating hypotheses and resolving them by locating relevant supporting evidence.
Table 4 compares state-of-the-art open-source and commercial LLMs on our task. The Gemini-2.5-Pro model shows a clear advantage over the others, yet still falls short of human performance by 15%.
Another key finding is that, except for the Gemini-2.5 models, all other LLMs tend to overpredict the Consistent label when not given access to the original books, resulting in unbalanced performance. This suggests that these models struggle to encode details of the novels within their parametric knowledge, underscoring the importance of incorporating canonical books in our task.
A per-book breakdown of the results from this study and the following RAG study is provided in Appendix D.2 .
Table 4 also presents the performance of various LLMs under the RAG setting. For most models, RAG improves the F1 score on the Contradict class. However, we also observe a tendency toward over-rejection, where LLMs predict Contradict more frequently, with hypercritical reasoning that focuses on minor or debatable inconsistencies. One illustrative example can be found in Appendix D.1, Example III.
Notably, RAG results in worse performance for the Gemini-2.5-Pro model on both classes. This suggests that the retrieved contexts do not effectively contribute new or useful information for the strong Gemini-Pro model. It also highlights a broader limitation: despite recent advancements, long-context reasoning remains a persistent challenge for LLMs.
As shown in Table 5, on our held-out subset, neither in-domain fine-tuning nor many-shot ICL improves performance over the baseline usage of the same LLMs. This indicates that current LLMs are still fundamentally limited in the type of reasoning required for our task.
Finally, Table 5 shows that DeepResearch performs worse than the best LLMs, both with and without RAG. It is also less effective at identifying contradictory evidence compared to most RAG-based systems. Since DeepResearch primarily relies on retrieving human-written analyses from the Internet, these results suggest that our task cannot be solved using existing external commentary or interpretations alone.
We manually verify the outputs of the LLMs and find that, although they correctly answer a large portion of the tasks, they often fail to arrive at the correct answers through valid reasoning. Due to the labor-intensive nature of this evaluation, we select the best outputs from systems that achieve the highest performance on at least one metric in each block of Table 5 . We also include DeepSeek-R1 with RAG, as its best run yields the highest answer accuracy.
The final column of Table 5 and Figure 2 present the results, revealing a clear gap between answer accuracy and reasoning accuracy. One such example can be found in Appendix D.1. In contrast, human annotators generally agree on their reasoning, suggesting that current models still lack true comprehension when solving our task.
It is noteworthy that, despite its lower answer accuracy, DeepResearch exhibits the smallest performance drop when considering reasoning accuracy. This suggests that combining a strong reasoning model with reflective mechanisms can lead to more reliable reasoning traces.
Figure 3 (a) investigates the impact of retrieved context lengths. The RAG system achieves its best performance when the input length is around 20k tokens. With shorter contexts, the retriever often fails to include important evidence due to limited capacity. Conversely, with longer contexts, the excess information can overwhelm the LLM and hinder its ability to effectively utilize the extended context.
Table 6 provides an ablation study of our retrieval method using the following variations (a configuration sketch is given after the list):
Sorting the retrieved chunks according to their order in the books : While this intuitively provides a more coherent context, it overlooks chunk relevance, resulting in decreased performance.
Doubling the chunk size while keeping the input length : This causes a slight performance drop, likely due to the reduced effectiveness of embedding models on longer chunks (Wu et al., 2025a ) .
Replacing the embedding model with BGE-M3 : This slightly reduces performance.
Replacing with BGE-M3 while doubling the chunk size : This results in a further decrease in performance, likely due to BGE-M3’s weaker handling of long inputs.
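To make the variations concrete, the sketch below expresses the baseline and the four ablations as configurations; the names and structure are ours, not from any released code.

```python
# Illustrative ablation configurations; the baseline keeps ~20k tokens of retrieved context.
BASELINE = {"embedder": "Qwen/Qwen3-Embedding-8B", "chunk_tokens": 500, "top_k": 40,
            "order": "by_relevance"}

ABLATIONS = {
    "sorted_by_book_order": {**BASELINE, "order": "by_position_in_book"},
    "double_chunk_size":    {**BASELINE, "chunk_tokens": 1000, "top_k": 20},  # same total input length
    "bge_m3_embedder":      {**BASELINE, "embedder": "BAAI/bge-m3"},
    "bge_m3_double_chunk":  {**BASELINE, "embedder": "BAAI/bge-m3",
                             "chunk_tokens": 1000, "top_k": 20},
}
```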
Figure 3 (b) shows how model performance changes with increasing model size. We experiment with both the Qwen2.5 and Qwen3 series. As shown in Table 4 , Qwen models rely on RAG to incorporate additional knowledge and improve performance. Therefore, all experiments in this section are conducted under the RAG setting, with each model evaluated over three runs.
The results indicate that for both Qwen series, performance consistently improves as model size increases. However, this improvement begins to plateau beyond the 32B model. Notably, the 235B model even outperforms the larger 671B R1 model on certain metrics, suggesting that simply scaling up model size is not sufficient for our task.
Our results in Table 4 show that while some LLMs benefit from retrieved contexts, advanced reasoning models, such as DeepSeek-R1 and Gemini-2.5, exhibit a notable performance drop when context is provided. This gap becomes more pronounced as the base models grow stronger.
These findings suggest that recent improvements in LLMs' general reasoning capabilities do not necessarily translate to better long-context reasoning. One possible explanation is that as models become more powerful, their internal (parametric) knowledge becomes more effective for solving tasks, making them prone to ignoring external inputs during post-training. This highlights the need for improved training data and strategies specifically designed to encourage long-context reasoning.
If our dataset construction method introduced bias, we would expect a model to perform worse on examples it generated itself, assuming that LLMs inherently trust their own outputs. However, the results in Table 4 show that this is not the case, indicating that our construction process does not introduce significant bias toward any particular model family. This also suggests that state-of-the-art LLMs do not inherently trust their own generations and remain susceptible to hallucination, even when evaluating content they previously produced.
Combined with the observation that DeepResearch performs poorly on our task, it becomes evident that the task cannot be solved simply by retrieving existing information from the web. Instead, it requires generating new knowledge through reasoning based on learned rules, aligning with the notion of fluid intelligence tests (Chollet, 2019 ; Chollet et al., 2025 ; Yu et al., 2025a ) . Unlike prior work, our task represents the first fluid intelligence assessment conducted entirely in the natural language space.
7 Conclusion

We introduce PRELUDE, a new benchmark for evaluating long-context comprehension and reasoning in LLMs. Our task design addresses several key shortcuts present in prior long-context benchmarks. Experiments show that state-of-the-art models still fall significantly short of human performance, particularly in generating valid reasoning. PRELUDE calls for further research into robust long-context understanding and the development of models with stronger global reasoning capabilities.
Our task empirically mitigates the shortcuts observed in prior work (as shown in Figure 1). However, due to the inherent complexity of long-context reasoning and the subjective nature of interpreting literary narratives, human performance on our task is also imperfect. In future work, we aim to improve the annotation framework to further enhance inter-annotator consistency and reduce the annotation and human-study workload of the task.
For each criterion discussed in Section 1, we first go through the definition and examples of the criterion, then propose its proxy measure:
LLMs are known to memorize large amounts of training data (Tirumala et al., 2022 ; Delétang et al., 2023 ; Sutskever, 2023 ) . For popular texts ( e.g. , widely read books), models may recall content or associated analysis from pretraining, bypassing the need for actual comprehension. As training datasets continue to expand, this issue becomes increasingly problematic. Therefore, a benchmark should be designed such that it cannot be solved purely through memorized knowledge, ensuring that the full length and structure of the context remain necessary for reasoning.
Measurement: We first evaluate the memorization performance of GPT-4o by having it answer questions without using RAG. We then compute its quantile within the range defined by human and random performance. This measurement is illustrated in Figure 4 .
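A minimal sketch of this quantile computation, as we read the description above, is given below; the function is our own formulation.

```python
def quantile_within(score: float, lower: float, upper: float) -> float:
    """Position of a score within the interval [lower, upper], clipped to [0, 1].
    For Beyond Memorization: lower = random baseline, upper = human performance,
    score = GPT-4o accuracy when answering without RAG."""
    if upper == lower:
        return 0.0
    return min(max((score - lower) / (upper - lower), 0.0), 1.0)
```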
The task should require aggregating information from multiple pieces of evidence that are scattered across the context or exhibit global dependencies. Otherwise, it reduces to a short-context problem solvable by retrieving the relevant passage for a given question. In such cases, the task becomes more about improving retrieval quality than long-context understanding ability.
Measurement: This dimension differs slightly from the previous one in how the performance interval is defined. To evaluate how much better a model performs compared to using a single document, the reference point should be the model’s performance with sufficient evidence, rather than human performance. To approximate this, we retrieve the top-20 documents using both the question and the answer, and treat this as the upper bound of performance with sufficient evidence. We then ask GPT-4o to select the best supporting document from the top-20 retrieved using both the question and answer, treating it as the strongest single piece of evidence . We compute the quantile of the model’s RAG performance using only this selected document, relative to the interval defined by the QA-top-20 RAG performance and the random baseline.
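This dimension uses the same quantile formula with different interval endpoints and a different score; the numbers below are placeholders for illustration only.

```python
# Placeholder numbers for illustration only.
random_baseline = 0.50   # lower bound of the interval
qa_top20_rag    = 0.85   # upper bound: RAG with top-20 docs retrieved using question + answer
single_doc_rag  = 0.70   # score: RAG with only the single best supporting document

# Same quantile formula as in the Beyond Memorization sketch, with different endpoints.
global_dependency_quantile = (single_doc_rag - random_baseline) / (qa_top20_rag - random_baseline)
print(f"{global_dependency_quantile:.2f}")  # 0.57 with these placeholder values
```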
By nature, long-context reasoning involves synthesizing multiple pieces of evidence across the input. The complexity of reasoning, especially multi-step deduction, is closely tied to task difficulty. Many existing benchmarks emphasize shallow reasoning, primarily requiring decomposition or enumeration (e.g., counting mentions of a fact (Xu & Ma, 2025), or multi-hop QA where the questions are often constructed as sequences of relatively simple sub-questions (Yang et al., 2018)). This limits the need for global reasoning and makes tasks easier than intended.
Measurement: Similar to the previous dimension, the key in this measurement lies in identifying a representative performance interval and a suitable reference model for computing the quantile. We choose QwQ-32B as a strong reasoning model and treat its RAG performance using the top-20 documents retrieved with both the question and answer (as defined in the previous measurement) as the upper bound. We then compute the quantile of Qwen2.5-7B under the same RAG setting. This gap reflects the potential for improvement attributable to stronger reasoning capabilities. Thus, the deeper the reasoning required by a dataset, the larger this gap is expected to be.
Note that this proxy becomes less informative if the dataset is so challenging that neither model significantly outperforms the random baseline. However, as shown in Table 7 , this is not currently the case. Therefore, this measurement remains a meaningful indicator.
This often-overlooked criterion is crucial: tasks that can be resolved simply by generating a summary of a long input are less challenging and may no longer probe deep understanding, especially given recent advances in summarization (e.g., hierarchical or iterative methods (Chang et al., 2023)). In such cases, the challenge of long-context understanding is effectively reduced to a summarization task over short segments, creating a shortcut. A high-quality benchmark should instead require attention to fine-grained details that go beyond high-level abstraction; otherwise, it reduces to the relatively easier task of summarization and becomes less meaningful.
Measurement: This dimension is measured by comparing the performance of RAG using original text chunks against RAG using chunk summaries of the same input length. We do not use quantiles here but instead directly report the relative improvement, as some datasets actually perform worse when using the original texts, indicating that the task questions primarily target salient events, making summarization sufficient for answering.
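A minimal sketch of the relative-improvement computation follows; the exact normalization is our assumption.

```python
def relative_improvement(rag_on_chunks: float, rag_on_summaries: float) -> float:
    """Relative gain of RAG over original text chunks vs. RAG over chunk summaries
    of the same input length; negative values indicate that summaries suffice."""
    return (rag_on_chunks - rag_on_summaries) / rag_on_summaries
```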
To highlight essential capabilities that general-purpose intelligent systems ought to possess, a meaningful benchmark should show a significant gap between humans and machines. At the same time, poor human accuracy or inter-annotator agreement typically signals low annotation quality or high subjectivity — both detrimental to robust evaluation.
Measurement: We identify the best machine performance from the evaluations in the Beyond Memorization and Beyond Summarization dimensions and compute its quantile within the interval defined by human performance and the random baseline.
We compare our PRELUDE with representative benchmarks from prior work, including (Kočiskỳ et al., 2018 ; Karpinska et al., 2024 ; Zhao et al., 2024 ; Xu et al., 2025b ) . These widely used datasets span multiple genres such as classic and detective novels, support both English and Chinese languages, and cover a range of task formats including free-form QA, multiple choice, and true-or-false questions. For a fair comparison, we sample 100 questions from each benchmark: InfBench.MC (25 books), DetectiveQA (10 books), and NarrativeQA (10 books from the development set), ensuring a similar number of examples across datasets.
B.1 Prompt for Generating Prequels

We simply prompt the LLMs, with no additional context, to generate the prequels for human annotation. Figure 8 shows the English translation of our prompt.
Please write a prequel story for the character "{{ char }}" based on the plot of the novel {{ book_name }}. The story should provide a background that does not contradict the original work. It should explain all of the character's major actions in the original novel, their key personality traits, and their major relationships. Your prequel should focus on reasonably answering unresolved questions about the character that are raised in the original novel. These include: important issues that cannot be explained using the original content alone, plot points that appear inconsistent with the character's established traits or internal logic, and significant past experiences that are mentioned but not elaborated upon. You only need to write an outline. The outline should summarize key elements of the character's backstory and major life events. Each item in the outline should include specific story-level details, rather than general concepts or attributes. The total length of the outline should be within 1,000 words. Please present your answer in a Markdown list format.

We apply LoRA (Hu et al., 2021) to all available target modules for in-domain training in both the RAG and non-RAG settings. The training is conducted on the dataset excluding the human-study subset, resulting in a total of 695 instances. Since the average token length in the RAG setting is above 20k, we perform LoRA tuning in both settings using 360-LLaMA-Factory (Haosheng Zou & Zhang, 2024). We fine-tune each model for 2 epochs with a batch size of 1 on a single machine with 8 H20 GPUs. The LoRA rank for the attention layers is set to 64, while the scaling factor alpha is configured to 128. In addition, the learning rate and warmup ratio are set to 1e-4 and 0.1, respectively.
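For illustration, the reported hyperparameters can be expressed with the Hugging Face peft API as in the hedged sketch below; the paper uses 360-LLaMA-Factory, so the actual configuration format differs, and the base model name here is a placeholder.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative only: the reported hyperparameters expressed as a peft config.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder base model
lora_cfg = LoraConfig(
    r=64,                         # LoRA rank reported in the paper
    lora_alpha=128,               # scaling factor reported in the paper
    target_modules="all-linear",  # "all available target modules"
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
# Training setup from the paper: 2 epochs, batch size 1 on 8 H20 GPUs,
# learning rate 1e-4, warmup ratio 0.1.
```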
C.2 Prompt Templates

We use the prompt templates in Figures 9 and 11 for the RAG experiment, depending on whether the instance originates from a Chinese book or an English book. For the "No Context" experiment, we remove the original-excerpt part from both the ICL examples and the actual instance. The exact content of the ICL examples in both languages is shown in Figures 10 and 12.
Subjective Example – Fuzzy Logic: Example I shows a case in which our annotators found it difficult to reach agreement due to fuzzy reasoning. The prequel setting implies that the character appears in the story at around 120 years old. The novel does feature several characters in their 100s who remain healthy and skilled in combat. Still, as the most advanced martial artist in the story, whether 120 is a reasonable age is a matter of fuzzy logic.
Example I. An example of disagreement among human annotators due to fuzzy reasoning.
Book: Demi-Gods and Semi-Devils | Char: Sweeper Monk | Groundtruth: Consistent

Subjective Example – Different Interpretation of Persona (Example II): In the story, the Sweeper Monk is not a formal disciple of Shaolin; he is more like a hermit. The disagreement therefore arises over whether, when others mention that many Shaolin disciples had learned Xiaowuxiang Gong, the Sweeper Monk would interpret that as referring to himself. The annotators project themselves into the character's perspective when making such predictions, which introduces subjectivity.
Example II. An example of disagreement among human annotators due to subjective interpretation of character persona.
Example that RAG Performs Worse: Example III gives a case in which the vanilla LLM predicts the correct answer but fails when equipped with RAG. The prequel is consistent because it does not violate any part of the story. Specifically, in a flashback scene from the original novel, it is revealed that Murong Bo's mother subjected him to harsh training in order to raise him as a future monarch for national restoration. When the RAG system fails to recognize that this passage describes Murong Bo's childhood, it tends to rely solely on the more prominent information in the input (e.g., Document 0) to make its judgment.
Example III. An example in which RAG makes the performance worse.
Book: Demi-Gods and Semi-Devils | Char: Murong Bo | Groundtruth: Consistent

Example with Correct Answer but Incorrect Reasoning: In the example below, the model correctly answers the question, but bases its answer on incorrect reasoning. The actual basis for judging this case as a contradiction is that both characters are portrayed as emphasizing personal martial prowess, with no demonstration of strategic ability or any depiction of possessing military treatises in the original novel. The act of gifting The Six Secret Teachings (Liu Tao), a text regarded in Chinese tradition as one of the most profound and authoritative works on military strategy, stands in sharp contrast to their characterization.
The model’s explanation is invalid because it fails to grasp that Wu Anguo was representing Beihai’s forces (under Kong Rong) in the battle at Hulao Pass. Instead, it superficially interprets the text and incorrectly assumes that Wu Anguo’s sphere of activity was limited to Hulao Pass.
This reflects the same underlying issue as in the RAG failure example: LLMs are heavily influenced by the provided context and tend to overanalyze non-informative clues, often becoming overly literal and fixating on minor wording differences.
Book: Romance of the Three Kingdoms | Char: Taishi Ci | Groundtruth: Contradict