Why AI Struggles With Real Analyst Work
Due diligence requires evidence across firms and time, which is exactly where current systems fail, according to new research.
AI looks impressive when you ask a narrow question about a single company filing, such as revenue last quarter.
Ask it to compare two firms’ risk disclosures or track strategy over several years, and performance drops fast. Fin-RATE, a new benchmark from researchers at Yale and Goldman Sachs, measures that gap and identifies where models break.
Most financial benchmarks reduce SEC filings to lookup tasks: find a number in a 10-K and repeat it back accurately. That design misses how analysts actually work. Real due diligence requires synthesizing disclosures across companies, time periods, and filing types simultaneously. A pass/fail system doesn’t tell you whether errors came from retrieval, hallucination, or broken reasoning chains.
Here’s what they did:
Built a corpus of 15,311 document segments from 2,472 SEC filings (10-K, 10-Q, 8-K, DEF 14A, and others) covering 43 companies across 36 industries, 2020–2025. Filings were sourced from EDGAR, segmented at official SEC item boundaries, and converted to structured Markdown.
Designed three task types:
Single-document questions
Cross-company comparisons
Multi-year analysis within one firm
Created 7,500 question-answer pairs with numbers manually verified against source filings.
Evaluated 17 models, including closed-source systems, major open-source models, and finance-tuned variants.
Tested performance with passages provided directly versus retrieved using four RAG methods.
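To make the segmentation step above concrete, here is a minimal Python sketch of splitting a filing at item headings. It is an illustration only, not the researchers' pipeline: the regular expression, the function name segment_by_item, and the sample text are all assumptions, and the pattern covers only 10-K-style headings such as "Item 1A." (8-K headings like "Item 2.02" would need a broader pattern).

```python
import re

# Assumed heading pattern: "Item 1.", "ITEM 1A.", "Item 7." at the start of a line.
ITEM_HEADING = re.compile(r"^\s*item\s+(\d+[a-z]?)\.", re.IGNORECASE | re.MULTILINE)


def segment_by_item(filing_text):
    """Split filing text into (item_label, segment_text) pairs at item headings.

    Any text before the first heading is dropped in this sketch.
    """
    matches = list(ITEM_HEADING.finditer(filing_text))
    segments = []
    for i, match in enumerate(matches):
        start = match.start()
        # A segment runs until the next heading, or to the end of the filing.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        segments.append((match.group(1).upper(), filing_text[start:end].strip()))
    return segments


# Hypothetical sample text, not taken from a real filing.
sample = """Item 1. Business
We sell widgets.
Item 1A. Risk Factors
Demand may fall.
Item 7. Management's Discussion and Analysis
Revenue grew.
"""

segments = segment_by_item(sample)
for label, text in segments:
    print(label, "->", text.splitlines()[0])
```

Each segment keeps its item label, so a question can later be traced back to the specific section of the filing it draws on.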
The findings:


