Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch
A Microsoft study introduces the DELEGATE-52 benchmark, which finds that frontier large language models (LLMs), including GPT 5.4 and Gemini 3.1 Pro, corrupt an average of 25% of document content when given multi-step editing workflows, raising concerns about their reliability for knowledge work. The models struggled most with complex editing tasks and with irrelevant distractor documents mixed into the working set.
The findings carry practical guidance for deploying AI in professional settings: build in incremental human review rather than checking only final outputs, favor short, transparent, reversible editing tasks over long autonomous processes, and develop domain-specific tools to catch and mitigate errors in automated workflows.