Why AI Struggles With Real Analyst Work

Due diligence requires evidence across firms and time, where current systems fail, according to new research.

Feb 25, 2026

AI looks impressive when you ask a narrow question about a single company filing, such as revenue last quarter.

Ask it to compare two firms’ risk disclosures or track strategy over several years, and performance drops fast. Fin-RATE, a new benchmark from researchers at Yale and Goldman Sachs, measures that gap and identifies where models break.

Most financial benchmarks reduce SEC filings to lookup tasks: find a number in a 10-K and repeat it back accurately. That design misses how analysts actually work. Real due diligence requires synthesizing disclosures across companies, time periods, and filing types simultaneously. A pass/fail system doesn’t tell you whether errors came from retrieval, hallucination, or broken reasoning chains.
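To make the pass/fail point concrete, here is a minimal sketch of the difference between a single accuracy number and a per-category error tally. The `GradedAnswer` record, the three error labels, and the sample results are all made up for illustration; they are not Fin-RATE's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradedAnswer:
    """One graded model answer (hypothetical record layout)."""
    question_id: str
    correct: bool
    # Hypothetical labels: "retrieval", "hallucination", "reasoning"
    error_type: Optional[str] = None


def pass_rate(results: list[GradedAnswer]) -> float:
    """Pass/fail view: a single number, no hint about what went wrong."""
    return sum(r.correct for r in results) / len(results)


def error_breakdown(results: list[GradedAnswer]) -> Counter:
    """Diagnostic view: count wrong answers by error category."""
    return Counter(r.error_type for r in results if not r.correct)


# Illustrative results only, not taken from the research.
results = [
    GradedAnswer("q1", True),
    GradedAnswer("q2", False, "retrieval"),
    GradedAnswer("q3", False, "retrieval"),
    GradedAnswer("q4", True),
    GradedAnswer("q5", False, "reasoning"),
    GradedAnswer("q6", True),
]

print(f"pass rate: {pass_rate(results):.0%}")
print(dict(error_breakdown(results)))
```

Two systems with the same pass rate can show very different breakdowns, which is the information a pass/fail score hides.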

Here’s what they did:

- Built a corpus of 15,311 document segments from 2,472 SEC filings (10-K, 10-Q, 8-K, DEF 14A, and others) covering 43 companies across 36 industries from 2020 to 2025. The filings were sourced from EDGAR, segmented at official SEC item boundaries, and converted to structured Markdown.
- Designed three task types:
  - Single-document questions
  - Cross-company comparisons
  - Multi-year analysis within one firm
- Created 7,500 question-answer pairs, with numbers manually verified against source filings.
- Evaluated 17 models, including closed-source systems, major open-source models, and finance-tuned variants.
- Tested performance with passages provided directly versus retrieved using four RAG methods.
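For readers curious what "segmented at official SEC item boundaries" might look like in practice, here is a rough sketch that splits plain filing text wherever a line begins with an "Item N." heading and renders each piece as Markdown. The regex, the function names, and the sample text are assumptions for illustration; the researchers' actual pipeline is not described in this post.

```python
import re

# Assumed heading pattern: lines starting with "Item 1.", "ITEM 1A.", "Item 7:" and so on.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d+[a-z]?)\s*[.:]", re.IGNORECASE | re.MULTILINE
)


def segment_by_item(filing_text: str) -> list[dict]:
    """Split filing text into one segment per Item heading."""
    matches = list(ITEM_HEADING.finditer(filing_text))
    segments = []
    for i, match in enumerate(matches):
        # A segment runs from its heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        body = filing_text[match.start():end].strip()
        segments.append({"item": match.group(1).upper(), "text": body})
    return segments


def to_markdown(segment: dict) -> str:
    """Render a segment with its heading line as a Markdown header."""
    heading, _, rest = segment["text"].partition("\n")
    return f"## {heading.strip()}\n\n{rest.strip()}\n"


# Toy excerpt, not a real filing.
sample = """Item 1. Business
We sell widgets.
Item 1A. Risk Factors
Supply chains may be disrupted.
Item 7. Management's Discussion
Revenue grew.
"""

segments = segment_by_item(sample)
for seg in segments:
    print(to_markdown(seg))
```

Real filings are messier: a table of contents repeats the same item headings, and HTML filings need to be converted to text first, so a production version would need extra handling for both.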

The findings:

Continue reading this post for free, courtesy of Matt Robinson.

AI Street
