Do LLMs Benefit from Their Own Words?
MIT EECS · MIT-IBM Watson AI Lab · IBM Research | arXiv:2602.24287v1 [cs.CL] 27 Feb 2026
Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses.
Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model.
To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10×.
To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. Only a subset of user prompts (33.1%) reference an earlier assistant response without giving actionable feedback.
When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns.
Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.
论证技巧:"To our surprise"是刻意修辞,暗示该结果违反领域共识,增强论文贡献感。
潜在漏洞摘要未区分"不同模型间差异"——实验结果在不同模型上并不一致(某些模型FC仍更优),摘要的表述略显过度概括。
1 Introduction
As large language models (LLMs) are deployed in increasingly complex multi-turn interactions, context management becomes an important challenge. Long contexts increase computational costs, slow inference speeds, and can impair a model's capacity to attend to relevant information. In response, agentic systems like Claude Code and Cursor have adopted context-editing strategies.
Despite increasing efforts to compress and prune older segments of conversation history, one key assumption remains largely underexamined: that retaining past model outputs reliably improves downstream response quality in real-world multi-turn conversations.
In this work, we analyze in-the-wild multi-turn chats from WildChat and ShareLM to ask: Do current models benefit from conditioning on their own prior responses?
1.1 Context Management
Single-turn prompt compression. A line of work studies prompt compression in the context of single-turn retrieval-augmented generation (RAG), where retrieved documents are filtered or compressed before being provided to the model. These approaches typically operate at the token- or sentence-level rather than at the turn level.
Multi-turn context editing. Other work studies context editing of multi-turn conversation histories. More recently, ERGO attempts to dynamically realign conversation context in multi-turn settings by rewriting all prior user inputs into a single prompt and omitting past assistant responses. They find that the combination of consolidating user prompts and omitting assistant responses increases performance over full context on multi-turn math and coding. Notably, their findings are based solely on synthetic conversations.
We identify two gaps: First, there is a lack of evaluation on real-world multi-turn conversational data. Second, both research and deployed systems often treat the storage of prior assistant responses as a default design choice, without examining when user-side history alone is sufficient.
Following this observation, we make two key observations: (1) Multi-turn dependence is not inherent in real-world multi-turn chats — 36.4% of turns are self-contained. (2) Models can sometimes over-condition on their past responses, resulting in context pollution.
2 Do LLMs Benefit from Their Own Words?
2.1 Experimental Setup
To evaluate whether retaining prior assistant responses provides measurable benefits, we conduct a controlled experiment across four LLMs: Qwen3-4B, DeepSeek-R1-Distill-Llama-8B, GPT-OSS-20B, and GPT-5.2.
We conduct our experiments on real-world multi-turn conversations drawn from WildChat-4.8M and ShareLM. We focus on technical conversations (coding and mathematics), as in-the-wild datasets often contain toxic, off-topic, or loosely structured dialogues.
2.1.1 Generating Responses
For each model, we generate responses under two context configurations: Full Context (FC), in which the model is prompted with both prior user and assistant turns, and Assistant-Omitted (AO) context, in which the model is prompted with only prior user turns. To construct the AO-context, all past assistant turns are replaced with the placeholder phrase [Response provided].
2.1.2 Evaluating Responses
To evaluate responses, we use GPT-5 as an LLM-judge. For each conversation round starting from round 2, the LLM-judge receives both the FC and AO responses alongside the full conversation history. It then selects a winner or declares a tie for each of two evaluation dimensions: response quality and task adherence.
To mitigate position bias, we randomize response ordering for each comparison. Since we set out to investigate the impact of distraction from accumulated assistant responses, one natural concern is that the LLM-judge itself may be susceptible to distraction. To address this concern, we supplement the full-context LLM-judge with a variant that receives only the prior user turns during evaluation.
数据聚焦:专注技术类对话(coding/math)是务实选择——此类对话对正确性更敏感,便于评判;但也限制了结论的泛化范围(创意写作、情感对话等场景可能不同)。
2.2 Storing Assistant-Side History Is Not Uniformly Beneficial
We find that storing prior assistant responses in context is not uniformly beneficial across models. Under the full-context LLM-judge, average response quality is maintained for DeepSeek-R1-Distill-Llama-8B and GPT-OSS-20B. In contrast, for Qwen3-4B and GPT-5.2, the average response quality decreases to some extent with the omission of assistant-side history.
Under the LLM-judge that sees only the prior user-side history, omitting past assistant responses leads to improved response quality across all four models.
Finally, we find that user-turn only prompting substantially reduces context length consumption. The full-context histories grow linearly with conversation depth, reaching approximately 25,000–55,000 characters by round 8. In contrast, the user-turn-only context remains nearly constant, consuming only 5,000–10,000 characters — a 5 to 10× reduction.
2.3 Assistant-Side History Is Less Beneficial for New Asks
Such new ask prompts constitute a substantial fraction (36.4%) of user turns. From manual inspection of a random sample of fifty chats, we find conversations can be categorized into: (i) sequences of loosely-related standalone prompts, (ii) single main prompt followed by related queries, and (iii) conversations centered on a single evolving intent.
Moving to the prompt level, we categorize prompts:
New Ask: non-initial user prompts that introduce a new, self-contained request. These can be addressed without dependence on prior conversation rounds.
Follow-up with Feedback: user prompts that provide concrete, actionable feedback on a prior assistant response (e.g., "Use Python instead of Java").
Follow-up without Feedback: user prompts that reference a prior round without concrete feedback (e.g., "Reflect on your response").
In our dataset, new-ask prompts account for 36.4% of user turns, follow-up with feedback for 30.5%, and follow-up without feedback for 33.1%.
2.4 Many Follow-Ups Remain Answerable Without Assistant History
Upon manually inspecting 50 follow-up prompts that perform better under AO-context, we find that many provide sufficiently concrete instruction to be addressed from scratch. The current user prompt together with a prior user prompt, commonly the initial prompt or the immediately preceding one, often provides the needed information.
The prevalence of follow-ups that provide concrete, self-contained feedback or rely solely on user-side context helps explain why AO-context still achieves win rates of roughly 40% for Qwen3-4B and 30% for GPT-5.2 across both follow-up categories.
10×上下文压缩是最实用、最具说服力的附加发现。
潜在漏洞分类由GPT-5自动完成,分类器本身的准确性未被充分验证(仅凭研究者定性判断)。且类别边界模糊(部分Follow-up兼具Feedback与New Ask特征)。
2.5 Context Pollution: When Seeing Past Responses Becomes Counterproductive
We find cases where earlier assistant turns introduce errors, hallucinations, or stylistic artifacts that propagate into future turns. We call this phenomenon context pollution. Past works have also observed that models can over-condition on their past outputs.
To identify instances of context pollution, we identify cases where AO-context largely outperforms full context by running an additional judging configuration in which the LM-judge assigns a 1–10 score to both the FC and AO-context responses at each conversation round. We then sort conversations by the score difference (AO minus FC) in descending order and manually examine them, starting from those with the largest positive gaps.
Representative examples of context pollution include:
• t-SNE vs. UMAP Code: The model incorrectly carries over UMAP-specific arguments (metric="jaccard") from an earlier turn when asked to rewrite code using t-SNE, introducing a bug. FC score: 3.0, AO score: 8.0.
• Book Recommendations: The model hallucinates book recommendations and persists in mentioning them in later turns.
• Hallucinated Citation: The model misattributes authorship of a research paper by carrying forward details from a different closely-related paper.
• Stylistic Inertia: Instead of following a new user instruction ("Reflect on your response"), the model continues generating content in the same tutorial style as an earlier turn. FC score: 3.0, AO score: 7.0.
Notably, we also observe instances of context pollution in GPT-5.2, indicating that state-of-the-art models are also susceptible to being misled by their past responses.
论证优势具体案例(代码参数泄漏、幽灵引用、风格惯性)极具说服力,是全文最生动的部分,与数字结果相互印证。
3 Adaptive Assistant Response Omission
In Section 2, we observed that some user prompts benefit from access to prior assistant responses, while others are unaffected or even negatively impacted. In this section, we explore a strategy to selectively choose a context configuration. All experiments in this section use GPT-5.2, where the AO-context performs significantly worse than full context.
3.1 Learning the Preferred Context Configuration
To predict the preferred context configuration, we use (i) metadata on the current round; (ii) the prompt category (new ask, follow-up with or without feedback); and (iii) dense vector embeddings of the user prompt as well as the past conversation history, obtained from a pretrained text embedding model. We fit an L1-regularized logistic regression model.
3.2 Selectively Omitting Assistant Responses
Several adaptive configurations retain over 95% of FC-only performance while substantially reducing context usage (the adaptive performs similarly to FC-only at 70% of the context consumption).
We also evaluate a simple heuristic baseline that omits assistant responses only on "New Ask" turns. This "Omit on New Ask" rule performs substantially worse than the learned classifier.
Notably, our current adaptive strategy makes a binary choice between full-context and user-turn-only prompting. A natural extension of this work is to develop a finer-grained approach for context filtering that preserves only the specific past assistant responses relevant to a given prompt.
分类器局限跨验证F1仅0.61,作者在附录中承认"特征与judge偏好的关系较弱",说明任务本身难度较高。
4 Discussion
In this work, we analyze real-world multi-turn chat logs and uncover a surprising finding: omitting past assistant responses often maintains comparable downstream response quality, while substantively reducing cumulative context lengths.
While one cannot rule out the possibility that a future query may depend on an earlier assistant response, we observe that such dependence occurs less frequently than one might expect in real-world conversation logs, and that follow-up queries can often be answered from seeing the user-side history alone.
We hope that these findings motivate further research into context management systems that more carefully weigh the consequences of preserving past assistant responses. Future work may look into designing context management systems that predict, from user-side behaviors alone, whether retaining past assistant responses is likely to benefit a downstream conversation. For example, (1) when the user poses a sequence of largely independent queries, generous filtering of assistant responses may be beneficial; or (2) when there is a clear topic shift, assistant responses related to earlier topics can be safely discarded.
We note that our evaluation relies on an LLM-as-judge framework, which means that these findings depend on the reliability of the automated evaluator. While we perform a human-alignment analysis and observe that the LM-judge achieves ≥ 90% alignment, future work should extend this evaluation using a larger-scale human study.
Given our finding that multi-turn dependence is not inherent in multi-turn chats, we suggest the need for more carefully-curated real-world conversation benchmarks that reflect true multi-turn dependence, to allow for accurate future benchmarking of models' long-context reasoning capabilities.
数据规模300条对话(WildChat 150 + ShareLM 150)对于得出如此宽泛结论而言样本量偏保守,是论文的主要局限。