INTERVIEW
Benchmarking AI Agents in Live Markets
AI doesn’t have enough real scoreboards.
It’s hard to tell which model or framework is better without real results, which is why I was pleased to see the folks at Fin AI build a platform that runs multiple AI trading agents in live markets, feeds them the same data, and tracks how they actually perform.
They’re not using real money, and transaction costs aren’t factored in, but it’s a helpful showcase of the pros and cons of the technology.
What it is
Agent Market Arena (AMA) is a live, open benchmark that paper-trades multiple AI trading agents across stocks and crypto, logging results on a public leaderboard. It evaluates agent frameworks under the same data, rules, and timing so results are comparable.
Assets covered: TSLA, BMRN, BTC, ETH, updated continuously.
Agents included: InvestorAgent (single-agent baseline), TradeAgent, HedgeFundAgent, DeepFundAgent — developed in collaboration with PAAL.AI and DEEPKIN.AI. Each is also tested with different LLMs like GPT-4o, GPT-4.1, Claude 3.5, Gemini 2.0.
Timebase: Daily trading under a unified execution protocol with identical start capital and rules for all entrants.
Results: Early results show that agent design matters more than the underlying AI model. Most agents still struggle to beat a simple buy-and-hold strategy.
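To make the setup concrete, here is a minimal sketch of what a unified daily paper-trading loop could look like. The names (Snapshot, Account, run_arena) are my own illustrative assumptions, not the actual AMA code; the point is only that every agent sees identical inputs, starts from identical capital, and is scored the same way, with no transaction costs modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Snapshot:
    """One day of shared market data: every agent receives the same snapshot."""
    date: str
    prices: Dict[str, float]   # e.g. {"TSLA": 251.3, "BTC": 67000.0}
    news: List[str]            # headlines shared with every agent

@dataclass
class Account:
    """Paper-trading account; all agents start with the same capital."""
    cash: float
    positions: Dict[str, float] = field(default_factory=dict)

    def value(self, prices: Dict[str, float]) -> float:
        # Assumes every held asset is priced every day.
        return self.cash + sum(q * prices[a] for a, q in self.positions.items())

# An "agent" is anything that maps the shared snapshot plus its own account
# to target portfolio weights per asset (non-AI baselines fit this too).
Agent = Callable[[Snapshot, Account], Dict[str, float]]

def run_arena(agents: Dict[str, Agent], days: List[Snapshot],
              start_capital: float = 100_000.0) -> Dict[str, float]:
    """Run every agent over the same daily data and report final equity."""
    accounts = {name: Account(cash=start_capital) for name in agents}
    for snap in days:
        for name, agent in agents.items():
            acct = accounts[name]
            weights = agent(snap, acct)        # identical inputs for everyone
            total = acct.value(snap.prices)
            # Rebalance to the requested weights; no transaction costs modeled.
            acct.positions = {a: w * total / snap.prices[a]
                              for a, w in weights.items() if a in snap.prices}
            acct.cash = total - sum(q * snap.prices[a]
                                    for a, q in acct.positions.items())
    return {name: acct.value(days[-1].prices) for name, acct in accounts.items()}
```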
I was happy to speak with Jimin Huang, the founder of Fin AI, which advocates for open source financial AI, and Lingfei Qian, first author of the paper behind the benchmark, about why they created a real-time testing ground to see if AI agents can process news, prices, and trends and make profitable trading decisions.
This interview has been edited for clarity and length.
Matt: Why did you build this?
Lingfei Qian: Right now, there are a lot of large language model applications — in finance, in medicine, everywhere. People are using them to understand financial news and market data to see if LLMs can handle complex information and make correct decisions.
We wanted to go further — to focus on agents, because agents are a more comprehensive framework. They can combine different LLMs to work together, handle different information sources and modalities, and understand things like news, prices, and historical trends.
We built Agent Arena to test how well agent frameworks perform with different kinds of information and whether they can make correct decisions. We provide the same news, historical data, and price information to different agents and see how they perform in real time — whether they can make profits in the market.
Jimin Huang: We wanted to combine different agents together and see how they perform in real time.
If you look at the current leaderboards, most models show similar patterns — similar ways of thinking. That’s because they’re not trained for decision-making. So what we’re trying to show is that it’s not the model that matters — it’s the agent framework. The models are like engines, but the agent is like the car frame. The same engine can perform differently depending on the car’s design.
We want to show that the orchestration of agents — whether single or multi-agent — makes the difference in how decisions are made.
Matt: You’ve mentioned before that even different large language models tend to behave similarly. Why is that?
Jimin: Because they aren’t trained for decision-making. Most LLMs, even when fine-tuned for finance, end up showing similar thinking patterns — they make forecasts, but not real decisions. That’s why the agent framework matters more than the model itself.
Matt: So when you have one agent, how does that fit in — is it based on a small or large language model?
Lingfei: It depends. When handling large amounts of news and data from different sources, larger language models tend to perform better. But for narrow tasks, like predicting whether a price will go up or down once the information has been summarized, smaller trained models can perform just as well.
Matt: You mentioned FinCon earlier — can you talk about how agents actually make decisions?
Jimin: FinCon was our first multi-agent trading system. It’s modeled like a human investment organization: a group of analyst agents works on different information sources, and their outputs are passed to manager agents who make trading decisions.
We also built a verbal reinforcement learning system that allows managers and analysts to learn from market feedback. Later research built on that, exploring how to optimize models based on the consequences of their decisions — not just predicting the next token or sentence.
We don’t want models that only generate text. We want models that can make decisions, receive feedback, and improve — just like humans. That’s the goal of agent systems in finance.
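To make the analyst-to-manager pattern concrete, here is a minimal sketch of how such an orchestration might be wired up. The role prompts and the llm() helper are illustrative assumptions rather than FinCon’s actual code; the key idea is that each analyst summarizes one information source, a manager combines their reports with verbal feedback from past trades, and outcomes are written back into memory as plain-language lessons.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # any text-in, text-out model call

def analyst(llm: LLM, source_name: str, raw_data: str) -> str:
    """One analyst agent summarizes a single information source."""
    return llm(f"You analyze {source_name}. Summarize the trading-relevant "
               f"signal in this data:\n{raw_data}")

def manager(llm: LLM, reports: Dict[str, str], memory: List[str]) -> str:
    """The manager reads all analyst reports plus recent lessons and returns
    a decision: BUY, SELL, or HOLD."""
    briefing = "\n".join(f"[{src}] {rep}" for src, rep in reports.items())
    lessons = "\n".join(memory[-5:])  # the most recent verbal feedback
    return llm(f"Analyst reports:\n{briefing}\n\nLessons from past trades:\n"
               f"{lessons}\n\nDecide BUY, SELL, or HOLD and justify briefly.")

def record_feedback(memory: List[str], decision: str, pnl: float) -> None:
    """Verbal 'reinforcement': store the market outcome as text that the
    manager re-reads before its next decision."""
    verdict = "Keep" if pnl > 0 else "Reconsider"
    memory.append(f"Decision {decision} produced PnL {pnl:+.2f}. "
                  f"{verdict} the reasoning that led to it.")
```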
Matt: How do you handle issues like hallucination or agents picking up wrong information?
Lingfei: There are different steps. For example, a checker verifies whether outputs come from the actual data or from hallucinations. The manager agent also gets feedback from the market — if it makes a wrong decision, it reflects on why that happened and sends feedback to worker agents to improve future performance.
Jimin: But not all agents can do that. Some can only perform single-agent reflection. They can generate decisions but can’t think back on why they went wrong or update themselves. That’s what we’re working on — adding a full backward process where agents can truly reflect, not just act. Without that, you can get lucky for a while, but you won’t be consistent in the long run.
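As a rough illustration of those two safeguards, a grounding checker and a post-loss reflection step might look like the sketch below. Both steps are written here as plain LLM calls; the function names and prompts are assumptions for illustration, not the benchmark’s actual implementation.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]

def grounding_check(llm: LLM, claim: str, source_data: str) -> bool:
    """Checker agent: accept a worker's claim only if it is supported by the
    data the worker was actually given, rejecting likely hallucinations."""
    verdict = llm(f"Source data:\n{source_data}\n\nClaim:\n{claim}\n\n"
                  f"Answer SUPPORTED or UNSUPPORTED only.")
    return verdict.strip().upper().startswith("SUPPORTED")

def reflect_on_loss(llm: LLM, decision: str, pnl: float,
                    reports: Dict[str, str]) -> Dict[str, str]:
    """Manager reflection after a losing trade: explain what went wrong and
    produce one corrective instruction per worker agent."""
    report_text = "\n".join(f"[{src}] {rep}" for src, rep in reports.items())
    critique = llm(f"The decision '{decision}' lost {abs(pnl):.2f}. Given these "
                   f"analyst reports:\n{report_text}\nFor each source name, give "
                   f"one short instruction to improve future analysis, one per "
                   f"line, formatted as 'source: advice'.")
    advice = {}
    for line in critique.splitlines():
        if ":" in line:
            src, tip = line.split(":", 1)
            advice[src.strip().strip("[]")] = tip.strip()
    return advice
```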
Matt: Tell me about your collaboration — who’s powering the live trading setup?
Jimin: We’re collaborating with our partners at PAAL.AI and DEEPKIN.AI, two innovative AI trading platforms that support the project by covering API and compute resources for our experiments.
We invite everyone to “join the arena.” Anyone can test their own framework, even non-AI ones. We provide the same data and take in their predictions. Our goal is transparency — to see what actually works in live markets.
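As a concrete example of a non-AI entrant, the simplest possible baseline, buy-and-hold, fits the hypothetical Agent interface sketched earlier: put everything into one asset on day one and never trade again.

```python
def buy_and_hold_btc(snapshot, account):
    """Non-AI baseline: hold 100% BTC for the whole run, the benchmark the
    AI agents reportedly struggle to beat."""
    return {"BTC": 1.0}  # target weight; everything else stays flat
```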
Matt: So in a way, you’re creating a kind of public lab or testing ground.
Jimin: Exactly. We want people to have a fair view of the technology’s real state — whether these systems can truly trade. No current solution shows consistent improvement over even a simple buy-and-hold strategy. It’s still early, but this gives everyone a common ground to test and compare.
In the future, we imagine an agent marketplace where anyone can deploy their own trading agents, not just large financial institutions. The idea is to lower the barrier to building and using professional-grade trading systems.
Matt: What’s your long-term goal for Agent Arena?
Jimin: We plan to host it indefinitely — it’s meant to be a long-term benchmark that people can keep testing against. We’re also expanding the community: researchers, developers, and financial experts are collaborating on related projects.
Ultimately, we want to build models that can make sequences of financial decisions, not just answer questions. It’s like a new kind of Turing test — not about whether a model can talk like a human, but whether it can decide like one.

