Hey, it’s Matt. This week on AI Street:

📃 Research: IBM Uses Small AI for Accuracy

📰 News: The latest in AI on Wall Street

🎙 Interview: The Fin AI on Trading Agents

Forwarded this? Subscribe here. Join readers from McKinsey, JPMorgan, BlackRock & more.

RESEARCH

Stopping AI from Making Things Up

Made with Ideogram

The biggest issue with large language models is that they can give different answers to the same question. IBM researchers recently showed that a small model with a strict structure can deliver the same output every time.

LLMs are probabilistic. Unlike traditional software that relies on explicit rules, they learn from data. Consistency is hard because there’s no fixed path from input to output.

IBM’s team found that smaller models, when engineered for determinism, can produce stable, repeatable answers.

Here’s what they did:

  • They evaluated five architectures: Qwen2.5-7B, Granite-3-8B, Llama-3.3-70B, Mistral-Medium-2505, and GPT-OSS-120B.

  • Locked the model into its most predictable setting by turning temperature to zero and fixing every parameter that could introduce randomness.

  • Forced the retrieval step to behave the same way every time, so the model always looked at the same parts of the 10-Ks in the same order.
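The last step is the key trick: retrieval only behaves the same way every time if ties in relevance scores always break the same way. Here's a minimal sketch of that idea, with a toy lexical-overlap scorer standing in for the paper's actual retriever (the scorer, chunks, and query are all illustrative, not from the study):

```python
import hashlib

def score(query: str, chunk: str) -> float:
    """Toy lexical-overlap score; a stand-in for a real retriever."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Deterministic top-k retrieval: sort by (score desc, stable
    content hash) so ties always break the same way, run after run."""
    keyed = sorted(
        chunks,
        key=lambda ch: (-score(query, ch),
                        hashlib.sha256(ch.encode()).hexdigest()),
    )
    return keyed[:k]

chunks = [
    "Revenue grew 12% year over year.",
    "Risk factors include interest rate exposure.",
    "Revenue guidance was raised for next year.",
]
# Repeat the query 16 times; a deterministic retriever yields one unique result.
runs = {tuple(retrieve("revenue growth next year", chunks)) for _ in range(16)}
assert len(runs) == 1
```

Without the hash tie-breaker, two equally scored passages can come back in either order, and the model ends up reading slightly different context on each run.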

The findings:

  • Small models were consistent. In all 16 runs of each test, they produced the exact same answer.

  • Large models drifted even with temperature at zero, making them unreliable for sensitive work.

  • RAG was the weakest link. Structured tasks stayed stable, but retrieval made answers much more likely to change.

Most of the inconsistency wasn’t in the model itself — it came from which parts of the document the model looked at. Once they forced the model to always look at the same paragraphs in the same order, small models proved consistent.

“The architecture matters more than the constraints was the takeaway,” Raffi Khatchadourian, one of the paper’s coauthors, tells AI Street. “It’s like smaller models (7-8B) have simpler attention mechanisms that actually follow deterministic paths, while larger models have inherent randomness from batch effects that you can’t control away - even at zero temp.”

Zero temperature is a setting that removes randomness from a model’s output.

One thing to note: This study wasn’t testing correctness, only whether an LLM could produce the same output over and over.

That’s important because regulated industries need systems that behave the same way today, tomorrow, and next year.

Khatchadourian is presenting the paper at the ACM ICAIF conference in Singapore.

Takeaway

A year ago, if you asked an LLM to do pretty basic math, it likely bungled the answer, but now models are “smart” enough to use a calculator.

With the right architecture, they might soon retrieve the same answer every time.

Further Reading

  • LLM Output Drift: Cross-Provider Validation and Mitigation for Financial Workflows | arXiv

NEWS

I’m doing the news roundup differently this week. The same themes keep coming up, so I organized the latest happenings by topic.

AI Is Not to Blame for Job Losses

The current conventional wisdom is that AI is inevitably going to take people’s jobs, but there’s no meaningful data to support that. It was good to see some stories push back on this narrative:

  • Morgan Stanley: AI is unlikely to cause near-term job losses | Investing

  • Goldman survey: Firms use AI for growth, not layoffs | Axios

AI is a Work in Progress

While your LinkedIn feed is full of “gamechanger!” posts and rocket emojis 🚀, the reality is that this technology has lots of bugs. It’s only a few years old, and the people who came up with it weren’t exactly expecting a trillion-dollar impact.

Lots of things still need to be worked out, and its capabilities are often oversold.

  • Researchers say inconsistent benchmarks inflate AI claims | NBC

AI Regulation Is Scant

Very little regulation on Wall Street deals with AI directly. Regulations are generally written broadly so they can cover emerging technologies.

  • Fed’s Barr Urges AI Guardrails For Finance | BBG

Barr highlights concerns over biases in AI. That is certainly an issue given that no one knows exactly how LLMs work. (You can’t exactly pop open the hood to see their inner workings.) The Fed official also mentions concerns over AI colluding in trading. I’ve not heard of anyone using agents for trading. Maybe I’m naive, but I don’t think the tech is mature enough for someone to feel confident to trade with real money.

AI Adoption Is Growing

The weird thing with AI is that hallucinations, or making things up, are a pretty well understood risk by now. But adoption keeps growing:

Lloyds Banking Group is currently piloting a new tool with employees that gives customers a personal AI assistant for spending insights, savings guidance, and investment support. I’m not aware of any other bank this close to launching something with these features.

  • Lloyds Tasks 7,000 Staffers With Testing Out AI Assistant | BBG

AI Agents Are Rolling Out

AI agents are software that breaks down tasks, makes decisions, uses tools, and adapts without constant human prompting. Early adopters like BNY and Walmart are already seeing measurable productivity gains from AI agents.

  • Companies Begin to See a Return on AI Agents | WSJ
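The definition above can be sketched as a tiny loop: break a task into steps, decide which tool fits each step, call it, and collect the results. The tools and the rule-based “policy” here are hypothetical stand-ins for an LLM making those decisions:

```python
# Toy agent loop: decompose a task, route each step to a tool, act.
# The tools and the step format are illustrative, not a real framework.

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))  # toy math tool

def lookup(term: str) -> str:
    kb = {"EPS": "earnings per share"}            # toy knowledge base
    return kb.get(term, "unknown")

TOOLS = {"calculator": calculator, "lookup": lookup}

def run_agent(task: str) -> str:
    """Break the task into steps, dispatch each to a tool, join results."""
    steps = [s.strip() for s in task.split(";")]
    results = []
    for step in steps:
        tool, _, arg = step.partition(":")
        results.append(TOOLS[tool](arg.strip()))
    return "; ".join(results)

print(run_agent("lookup: EPS; calculator: 12*4"))
# -> earnings per share; 48
```

In a real agent framework the routing decision comes from a model rather than a string prefix, but the plan-act-collect structure is the same.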

nCino, a banking technology provider, is releasing role-based AI agents called Digital Partners that handle routine tasks for executives, analysts, and account holders.

New Constructs, a financial data and research firm, announced FinSights, an AI research agent that draws on the firm’s forensic accounting data to help investors assess performance and valuation.

  • New Constructs Partners with Google Cloud to Launch FinSights | PR

SPONSORSHIPS

Reach Wall Street’s AI Decision-Makers

Advertise on AI Street to reach a highly engaged audience of decision-makers at firms including JPMorgan, Citadel, BlackRock, Skadden, McKinsey, and more. Sponsorships are reserved for companies in AI, markets, and finance. Email me ([email protected]) for more details.

INTERVIEW

Benchmarking AI Agents in Live Markets

AI doesn’t have enough real scoreboards.

It’s hard to tell which model or framework is better without real results, which is why I was pleased to see the folks at Fin AI build a platform that runs multiple AI trading agents in live markets, feeds them the same data, and tracks how they actually perform.

They’re not using real money, and transaction costs aren’t factored in, but it’s a helpful showcase of the pros and cons of the technology.

What it is

Agent Market Arena (AMA) is a live, open benchmark that paper-trades multiple AI trading agents across stocks and crypto, logging results on a public leaderboard. It evaluates agent frameworks under the same data, rules, and timing so results are comparable. 

  • Assets covered: TSLA, BMRN, BTC, ETH, updated continuously. 

  • Agents included: InvestorAgent (single-agent baseline), TradeAgent, HedgeFundAgent, DeepFundAgent — developed in collaboration with PAAL.AI and DEEPKIN.AI. Each is also tested with different LLMs like GPT-4o, GPT-4.1, Claude 3.5, Gemini 2.0. 

  • Timebase: Daily trading under a unified execution protocol with identical start capital and rules for all entrants. 

  • Results: Early results show that agent design matters more than AI model. Most agent investors still struggle to beat a simple buy-and-hold strategy.
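The unified-protocol idea, and the finding that agents struggle to beat buy-and-hold, can be sketched in a toy paper-trading harness. Both strategies start with identical capital and see the same daily prices; the price series and the naive momentum rule are illustrative, not AMA data:

```python
# Toy paper-trading comparison under one shared protocol:
# same starting capital, same price series for every strategy.

def buy_and_hold(prices: list[float], capital: float) -> float:
    """Buy on day one, hold to the end."""
    shares = capital / prices[0]
    return shares * prices[-1]

def momentum_agent(prices: list[float], capital: float) -> float:
    """Naive agent: buy after an up day, sell after a down day."""
    cash, shares = capital, 0.0
    for prev, today in zip(prices, prices[1:]):
        if today > prev and shares == 0:
            shares, cash = cash / today, 0.0
        elif today < prev and shares > 0:
            cash, shares = shares * today, 0.0
    return cash + shares * prices[-1]

prices = [100, 102, 101, 105, 104, 108]
start = 10_000.0
print(round(buy_and_hold(prices, start), 2))    # 10800.0
print(round(momentum_agent(prices, start), 2))  # trails buy-and-hold here
```

On this (deliberately choppy) series, the agent buys after every up day and sells after every down day, churning away value relative to simply holding, which mirrors the leaderboard finding. Real evaluation would also need transaction costs, which AMA currently omits.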

I was happy to speak with Jimin Huang, the founder of Fin AI, which advocates for open source financial AI, and Lingfei Qian, first author of the paper behind the benchmark, about why they created a real-time testing ground to see if AI agents can process news, prices, and trends and make profitable trading decisions.

This interview has been edited for clarity and length.

Matt: Why did you build this?

Lingfei Qian: Right now, there are a lot of large language model applications — in finance, medical, everywhere. People are using them to understand financial news and market data to see if LLMs can handle complex information and make correct decisions.

We wanted to go further — to focus on agents, because agents are a more comprehensive framework. They can combine different LLMs to work together, handle different information sources and modalities, and understand things like news, prices, and historical trends.

We built Agent Arena to test how well agent frameworks perform with different kinds of information and whether they can make correct decisions. We provide the same news, historical data, and price information to different agents and see how they perform in real time — whether they can make profits in the market.

Jimin Huang: We wanted to combine different agents together and see how they perform in real time.

If you notice the current leaderboards, most models show similar patterns — similar ways of thinking. That’s because they’re not trained for decision-making. So, what we’re trying to show is that it’s not the model that matters — it’s the agent framework. The models are like engines, but the agent is like the car frame. The same engine can perform differently depending on the car’s design.

We want to show that the orchestration of agents — whether single or multi-agent — makes the difference in how decisions are made.

ROUNDUP

What Else I’m Reading

  • Balyasny hires CIA's chief AI data scientist | eFinancialCareers

  • AQR’s Asness Sees Expensive Stock Market But No Bubble | BBG

  • Anthropic on Track to Turn a Profit Faster Than OpenAI | WSJ

  • Companies Unite Tech and HR to Manage AI Jobs Impact | WSJ

  • ‘Total Portfolio Approach’ Is Shifting How Trillions Get Managed | BBG

  • Brevan Howard hires Man Group’s Mace as head of AI | Financial News

  • Big Minds: How top investors are investing in AI | AInvestor

CALENDAR

Upcoming AI + Finance Conferences

  • ACM ICAIF 2025 – November 15–18, 2025 • Singapore

    Top-tier academic/industry conference on AI in finance and trading.

  • Momentum AI Finance 2025 – November 17–18, 2025 • New York

    Reuters summit featuring execs from major banks, asset managers, and fintechs, with sessions on AI infrastructure, ROI, agentic systems, and agent demos.

  • AI for Finance – November 24–26, 2025 • Paris

    Artefact’s AI for Finance summit, focused on generative AI, the future of finance, digital sovereignty, and regulation.

  • NeurIPS Workshop: Generative AI in Finance – Dec. 6/7 • San Diego

    One-day academic workshop at NeurIPS focused on generative AI applications in finance, organized by ML researchers.

Is there a conference I missed? Reach out: [email protected]
