Do LLMs Benefit from Their Own Words?

Jenny Y. Huang · Leshem Choshen · Ramon Astudillo · Tamara Broderick · Jacob Andreas

MIT EECS · MIT-IBM Watson AI Lab · IBM Research | arXiv:2602.24287v1 [cs.CL] 27 Feb 2026

Abstract · 摘要

Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses.

Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model.

To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10×.

To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. Only a subset of user prompts (33.1%) reference an earlier assistant response without giving actionable feedback.

When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns.

Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.

摘要批注

段落功能提出核心研究问题与反直觉结论

逻辑角色：全文论点的高度浓缩，为读者建立期望框架

核心论点（黄色）：保留助手历史并非必然有益，在大量对话轮次中可安全省略。

论证技巧："To our surprise"是刻意修辞，暗示该结果违反领域共识，增强论文贡献感。

潜在漏洞摘要未区分"不同模型间差异"——实验结果在不同模型上并不一致（某些模型FC仍更优），摘要的表述略显过度概括。

Section 1 · 引言

1 Introduction

As large language models (LLMs) are deployed in increasingly complex multi-turn interactions, context management becomes an important challenge. Long contexts increase computational costs, slow inference speeds, and can impair a model's capacity to attend to relevant information. In response, agentic systems like Claude Code and Cursor have adopted context-editing strategies.

Despite increasing efforts to compress and prune older segments of conversation history, one key assumption remains largely underexamined: that retaining past model outputs reliably improves downstream response quality in real-world multi-turn conversations.

In this work, we analyze in-the-wild multi-turn chats from WildChat and ShareLM to ask: Do current models benefit from conditioning on their own prior responses?

1.1 Context Management

Single-turn prompt compression. A line of work studies prompt compression in the context of single-turn retrieval-augmented generation (RAG), where retrieved documents are filtered or compressed before being provided to the model. These approaches typically operate at the token- or sentence-level rather than at the turn level.

Multi-turn context editing. Other work studies context editing of multi-turn conversation histories. More recently, ERGO attempts to dynamically realign conversation context in multi-turn settings by rewriting all prior user inputs into a single prompt and omitting past assistant responses. They find that the combination of consolidating user prompts and omitting assistant responses increases performance over full context on multi-turn math and coding. Notably, their findings are based solely on synthetic conversations.

We identify two gaps: First, there is a lack of evaluation on real-world multi-turn conversational data. Second, both research and deployed systems often treat the storage of prior assistant responses as a default design choice, without examining when user-side history alone is sufficient.

Following this observation, we make two key observations: (1) Multi-turn dependence is not inherent in real-world multi-turn chats — 36.4% of turns are self-contained. (2) Models can sometimes over-condition on their past responses, resulting in context pollution.

第1段批注

建立背景确立问题重要性

论证链起点：从已知痛点（计算成本、注意力退化）引出未被检验的假设

使用已有文献背书（Liu et al. 2024; Lee et al. 2026）强化问题真实性。列举业界产品（Claude Code、Cursor）提升现实感。

核心问题批注

提出研究问题点明论文核心

"underexamined assumption"是全文论证支点——论文价值在于质疑"默认设计"

注意将"默认保留历史"界定为"未检验假设"是带有修辞性的框架设置，实际上该做法有其合理动机。

1.1节批注

2 Do LLMs Benefit from Their Own Words?

2.1 Experimental Setup

To evaluate whether retaining prior assistant responses provides measurable benefits, we conduct a controlled experiment across four LLMs: Qwen3-4B, DeepSeek-R1-Distill-Llama-8B, GPT-OSS-20B, and GPT-5.2.

We conduct our experiments on real-world multi-turn conversations drawn from WildChat-4.8M and ShareLM. We focus on technical conversations (coding and mathematics), as in-the-wild datasets often contain toxic, off-topic, or loosely structured dialogues.

2.1.1 Generating Responses

For each model, we generate responses under two context configurations: Full Context (FC), in which the model is prompted with both prior user and assistant turns, and Assistant-Omitted (AO) context, in which the model is prompted with only prior user turns. To construct the AO-context, all past assistant turns are replaced with the placeholder phrase [Response provided].

2.1.2 Evaluating Responses

To evaluate responses, we use GPT-5 as an LLM-judge. For each conversation round starting from round 2, the LLM-judge receives both the FC and AO responses alongside the full conversation history. It then selects a winner or declares a tie for each of two evaluation dimensions: response quality and task adherence.

To mitigate position bias, we randomize response ordering for each comparison. Since we set out to investigate the impact of distraction from accumulated assistant responses, one natural concern is that the LLM-judge itself may be susceptible to distraction. To address this concern, we supplement the full-context LLM-judge with a variant that receives only the prior user turns during evaluation.

方法论批注

方法说明建立实验可信度

方法论节点：为后续所有结论提供"合法性担保"

设计亮点：选取4个规模跨度较大的模型（4B→frontier），增强结论的普遍性。

数据聚焦：专注技术类对话（coding/math）是务实选择——此类对话对正确性更敏感，便于评判；但也限制了结论的泛化范围（创意写作、情感对话等场景可能不同）。

生成策略批注

操作定义精确界定变量

将抽象问题转化为可操作的双变量（FC vs AO）对照实验

值得注意用"[Response provided]"占位而非直接删除，是为维持对话格式的合理工程决策，但该占位符本身是否引入偏差（模型可能推断出存在先前回答）未被讨论。

评估设计批注

反驳预期异议处理评估偏差

作者主动预判"法官模型本身也可能受干扰"的批评并给出双重评估方案

论证优势用两种judge配置相互校验，是方法论上的一大亮点——展现了研究者对自身局限性的自觉。

Section 2.2–2.4 · 主要发现

2.2 Storing Assistant-Side History Is Not Uniformly Beneficial

We find that storing prior assistant responses in context is not uniformly beneficial across models. Under the full-context LLM-judge, average response quality is maintained for DeepSeek-R1-Distill-Llama-8B and GPT-OSS-20B. In contrast, for Qwen3-4B and GPT-5.2, the average response quality decreases to some extent with the omission of assistant-side history.

Under the LLM-judge that sees only the prior user-side history, omitting past assistant responses leads to improved response quality across all four models.

Finally, we find that user-turn only prompting substantially reduces context length consumption. The full-context histories grow linearly with conversation depth, reaching approximately 25,000–55,000 characters by round 8. In contrast, the user-turn-only context remains nearly constant, consuming only 5,000–10,000 characters — a 5 to 10× reduction.

2.3 Assistant-Side History Is Less Beneficial for New Asks

Such new ask prompts constitute a substantial fraction (36.4%) of user turns. From manual inspection of a random sample of fifty chats, we find conversations can be categorized into: (i) sequences of loosely-related standalone prompts, (ii) single main prompt followed by related queries, and (iii) conversations centered on a single evolving intent.

Moving to the prompt level, we categorize prompts:

New Ask: non-initial user prompts that introduce a new, self-contained request. These can be addressed without dependence on prior conversation rounds.

Follow-up with Feedback: user prompts that provide concrete, actionable feedback on a prior assistant response (e.g., "Use Python instead of Java").

Follow-up without Feedback: user prompts that reference a prior round without concrete feedback (e.g., "Reflect on your response").

In our dataset, new-ask prompts account for 36.4% of user turns, follow-up with feedback for 30.5%, and follow-up without feedback for 33.1%.

2.4 Many Follow-Ups Remain Answerable Without Assistant History

Upon manually inspecting 50 follow-up prompts that perform better under AO-context, we find that many provide sufficiently concrete instruction to be addressed from scratch. The current user prompt together with a prior user prompt, commonly the initial prompt or the immediately preceding one, often provides the needed information.

The prevalence of follow-ups that provide concrete, self-contained feedback or rely solely on user-side context helps explain why AO-context still achieves win rates of roughly 40% for Qwen3-4B and 30% for GPT-5.2 across both follow-up categories.

2.2节批注

呈现证据核心数据结果

论证主体：用实验数据支撑"历史并非均匀有益"这一核心论点

重要细节两种judge评估结论方向不同：FC-judge下Qwen3-4B/GPT-5.2更倾向FC；AO-judge下所有模型均倾向AO。这意味着结论依赖于evaluation设定，论文选择"FC-judge为主"是相对保守的做法。

10×上下文压缩是最实用、最具说服力的附加发现。

2.3节批注

概念建构提出分析框架

引入三分类（New Ask / Follow-up w/ Feedback / Follow-up w/o Feedback）是论文的核心概念贡献之一

分类标准：基于"是否依赖先前助手回答"，逻辑清晰。

潜在漏洞分类由GPT-5自动完成，分类器本身的准确性未被充分验证（仅凭研究者定性判断）。且类别边界模糊（部分Follow-up兼具Feedback与New Ask特征）。

2.4节批注

深化解释解释反直觉结果

为"Follow-up下AO也能有较高胜率"提供机制性解释

论证优势通过50条样本的人工审查进行质性补充，是对纯量化结果的有效辅助，增加了可信度。但样本量偏小。

Section 2.5 · 上下文污染

2.5 Context Pollution: When Seeing Past Responses Becomes Counterproductive

We find cases where earlier assistant turns introduce errors, hallucinations, or stylistic artifacts that propagate into future turns. We call this phenomenon context pollution. Past works have also observed that models can over-condition on their past outputs.

To identify instances of context pollution, we identify cases where AO-context largely outperforms full context by running an additional judging configuration in which the LM-judge assigns a 1–10 score to both the FC and AO-context responses at each conversation round. We then sort conversations by the score difference (AO minus FC) in descending order and manually examine them, starting from those with the largest positive gaps.

Representative examples of context pollution include:

• t-SNE vs. UMAP Code: The model incorrectly carries over UMAP-specific arguments (metric="jaccard") from an earlier turn when asked to rewrite code using t-SNE, introducing a bug. FC score: 3.0, AO score: 8.0.

• Book Recommendations: The model hallucinates book recommendations and persists in mentioning them in later turns.

• Hallucinated Citation: The model misattributes authorship of a research paper by carrying forward details from a different closely-related paper.

• Stylistic Inertia: Instead of following a new user instruction ("Reflect on your response"), the model continues generating content in the same tutorial style as an earlier turn. FC score: 3.0, AO score: 7.0.

Notably, we also observe instances of context pollution in GPT-5.2, indicating that state-of-the-art models are also susceptible to being misled by their past responses.

2.5节批注

命名概念核心概念提出

"Context Pollution"是论文最具原创性的术语贡献，将抽象现象具象化

修辞效果："污染"一词带有负面价值判断，暗示这是需要修复的问题而非权衡取舍，引导读者接受论文的解决方案框架。

论证优势具体案例（代码参数泄漏、幽灵引用、风格惯性）极具说服力，是全文最生动的部分，与数字结果相互印证。

识别方法批注

方法透明说明案例发现过程

通过score差值排序+人工审查的混合方法，兼顾规模与深度

选择性偏差案例来自"AO大幅优于FC"的子集，是有意的最优案例展示（cherry-picking），不能代表污染的普遍频率，但对于现象存在性的论证是有效的。

GPT-5.2让步批注

扩展论点强化结论普遍性

通过"即使最先进模型也不免疫"来抬高研究stakes

这是有效的让步处理：不是说"小模型有问题"（容易被反驳），而是说"frontier模型也受影响"——使结论难以用"用更好的模型就行了"来反驳。

Section 3 · 自适应策略

3 Adaptive Assistant Response Omission

In Section 2, we observed that some user prompts benefit from access to prior assistant responses, while others are unaffected or even negatively impacted. In this section, we explore a strategy to selectively choose a context configuration. All experiments in this section use GPT-5.2, where the AO-context performs significantly worse than full context.

3.1 Learning the Preferred Context Configuration

To predict the preferred context configuration, we use (i) metadata on the current round; (ii) the prompt category (new ask, follow-up with or without feedback); and (iii) dense vector embeddings of the user prompt as well as the past conversation history, obtained from a pretrained text embedding model. We fit an L1-regularized logistic regression model.

3.2 Selectively Omitting Assistant Responses

Several adaptive configurations retain over 95% of FC-only performance while substantially reducing context usage (the adaptive performs similarly to FC-only at 70% of the context consumption).

We also evaluate a simple heuristic baseline that omits assistant responses only on "New Ask" turns. This "Omit on New Ask" rule performs substantially worse than the learned classifier.

Notably, our current adaptive strategy makes a binary choice between full-context and user-turn-only prompting. A natural extension of this work is to develop a finer-grained approach for context filtering that preserves only the specific past assistant responses relevant to a given prompt.

第3节批注

提出解决方案从诊断到治疗

论证结构转折：从"发现问题"转向"提供工具"，完成论文的工程贡献

模型选择：专注GPT-5.2（FC明显优于AO的情形）是合理的——在最难的case上验证自适应策略才有意义。

分类器局限跨验证F1仅0.61，作者在附录中承认"特征与judge偏好的关系较弱"，说明任务本身难度较高。

3.2节批注

呈现关键结果效率-质量权衡

"95%性能 + 70%上下文"是论文最实用的量化贡献

论证优势用Pareto前沿图展示阈值τ的参数空间，使工程师可以根据需求自行选择质量-效率权衡点，实用价值高。

局限性批注

承认局限指向未来工作

将二元选择的局限性转化为未来工作方向，是成熟的学术写作策略

指出"细粒度过滤（只保留相关轮次）"作为扩展方向，实际上暗示了当前方法的简化性，但以积极语气呈现。

Section 4 · 讨论与结论

4 Discussion

In this work, we analyze real-world multi-turn chat logs and uncover a surprising finding: omitting past assistant responses often maintains comparable downstream response quality, while substantively reducing cumulative context lengths.

While one cannot rule out the possibility that a future query may depend on an earlier assistant response, we observe that such dependence occurs less frequently than one might expect in real-world conversation logs, and that follow-up queries can often be answered from seeing the user-side history alone.

We hope that these findings motivate further research into context management systems that more carefully weigh the consequences of preserving past assistant responses. Future work may look into designing context management systems that predict, from user-side behaviors alone, whether retaining past assistant responses is likely to benefit a downstream conversation. For example, (1) when the user poses a sequence of largely independent queries, generous filtering of assistant responses may be beneficial; or (2) when there is a clear topic shift, assistant responses related to earlier topics can be safely discarded.

We note that our evaluation relies on an LLM-as-judge framework, which means that these findings depend on the reliability of the automated evaluator. While we perform a human-alignment analysis and observe that the LM-judge achieves ≥ 90% alignment, future work should extend this evaluation using a larger-scale human study.

Given our finding that multi-turn dependence is not inherent in multi-turn chats, we suggest the need for more carefully-curated real-world conversation benchmarks that reflect true multi-turn dependence, to allow for accurate future benchmarking of models' long-context reasoning capabilities.

讨论节批注

重申结论强化核心主张

回溯摘要的核心论点，形成论证闭环

以"surprising finding"再次强调反直觉性，首尾呼应，增强读者对贡献的印象。

局限声明批注

处理异议主动承认局限

双重让步：承认LLM-as-judge的局限性，以及未来依赖场景的可能性

诚实透明承认评估框架依赖自动化评判，并提供≥90%的人工对齐数据作为部分缓解。

数据规模300条对话（WildChat 150 + ShareLM 150）对于得出如此宽泛结论而言样本量偏保守，是论文的主要局限。

基准建议批注

提出宏观建议影响力扩展

将研究发现升华为对整个领域的基准建设呼吁，扩大论文影响面

指出现有多轮对话基准（如Lost-in-Conversation）是合成的，真实多轮依赖性的缺失会导致能力评估失真——这是一个重要的元科学观察。