AI Street

Man Group Details its AlphaGPT

Matt Robinson
Nov 20, 2025

Hey, it’s Matt. This week on AI Street:

🤖 Man Group Details its AlphaGPT

🧰 AI Struggles with Deductive Logic 

🤠 News Roundup: Adoption Trends


NEXT WEEK
  • AI Street will be out on Wednesday instead of Thursday because of Thanksgiving 🦃 


ANALYSIS

Hedge Funds & AI Frameworks

Over the last year, we've seen well-known hedge funds talk publicly about how AI is impacting returns — at least in broad strokes. Here are a couple:

  • In March, Bridgewater Associates CEO Nir Bar Dea said its $2 billion AI fund is generating "unique alpha that is uncorrelated to what our humans do."

  • In July, Man Group’s quant equity unit said it’s using an internal tool, called AlphaGPT, to generate, code and backtest trading ideas, mimicking how researchers develop new trading signals.

Both hedge funds recently published research about using LLMs in investing.

AI in investing looks less like a single model answering questions and more like a small organization. Coming up with trade ideas is broken into roles: one agent handles search, another interprets the information, another challenges the reasoning, and a final agent signs off. That structure ends up behaving a lot like a real firm: a group of specialized teams.

Man Group on What AI Can (and Can't Yet) Do for Alpha

Man says the technology helps address a growing challenge in quantitative investing: the surge of available data and possible market relationships that outstrip human bandwidth. Early tests show the system can identify connections that researchers may overlook and produce a broader set of ideas. The big picture: AI gives them scale to test more ideas.

The hedge fund has built a workflow that mirrors a quant research pod:

  • Idea generator that proposes hypotheses at scale

  • Code implementer that writes production-grade Python against their internal research stack

  • Evaluator that performs significance, risk and economic-sense checks
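In code terms, the pod has roughly the shape sketched below. This is my own illustration: Man's post names the three roles but not how they are implemented, so the stub functions and the t-stat cutoff here are made-up stand-ins.

```python
# Illustrative sketch of a three-role research pipeline in the spirit of
# Man's description. All role bodies and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Idea:
    hypothesis: str
    code: str = ""
    passed: bool = False

def idea_generator() -> list[Idea]:
    # Stand-in for an LLM prompted to propose hypotheses at scale.
    return [Idea("momentum persists in small caps"),
            Idea("earnings-call tone predicts next-day drift")]

def code_implementer(idea: Idea) -> Idea:
    # Stand-in for an LLM writing backtest code against a research stack.
    idea.code = f"backtest({idea.hypothesis!r})"
    return idea

def evaluator(idea: Idea, t_stat: float) -> Idea:
    # Significance / risk / economic-sense gate; the 2.0 cutoff is illustrative.
    idea.passed = abs(t_stat) > 2.0
    return idea

# Fake t-stats stand in for real backtest results.
ideas = [evaluator(code_implementer(i), t)
         for i, t in zip(idea_generator(), [2.7, 1.1])]
survivors = [i.hypothesis for i in ideas if i.passed]
```

The point of the structure is the gate at the end: ideas are cheap to generate at scale, so only the ones clearing the evaluation step survive.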

The firm also highlights risks. Running more tests increases the odds of finding spurious signals. Man says it uses safeguards such as prompt controls, validation checks and human review to limit errors and to prevent hallucinations and p-hacking, that is, landing on a result you want purely by chance.

The system has been most effective so far in equity research, though Man says its modular design should allow it to expand to other asset classes. The company expects similar AI tools to spread across the industry but argues its long-built data, infrastructure and research processes will remain a competitive advantage.

“To date, the system has produced signals that meet our standards and pass the same evaluation thresholds required for human-generated research,” Ziang Fang, a senior portfolio manager at Man Numeric, said in the post.

Bridgewater’s AIA Forecaster

The firm released research on an AI forecasting system that also uses a multi-agent framework rather than relying on a single model. One agent runs adaptive search across curated news sources, others generate and debate forecasts, and a supervisor agent resolves disagreements before a final statistical calibration step. In tests on standard forecasting benchmarks, the system matched human superforecasters.
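The aggregation shape can be sketched in a few lines. The specific rules below, a median vote to reconcile agents and shrinkage toward 0.5 to temper overconfidence, are my illustrative assumptions, not the method from Bridgewater's report:

```python
# Toy version of "agents forecast, supervisor reconciles, then calibrate".
# Both the reconciliation rule and the calibration rule are assumptions.
from statistics import median

def supervisor(forecasts: list[float]) -> float:
    # Resolve disagreement between agents; the median is robust to one outlier.
    return median(forecasts)

def calibrate(p: float, shrink: float = 0.8) -> float:
    # Pull extreme probabilities toward 0.5 to correct overconfidence.
    return 0.5 + shrink * (p - 0.5)

agent_forecasts = [0.70, 0.75, 0.40]  # two agents roughly agree, one dissents
final = calibrate(supervisor(agent_forecasts))  # ≈ 0.66
```

The dissenting 0.40 forecast is simply outvoted here; in the system Bridgewater describes, the supervisor agent weighs the debate rather than applying a fixed rule.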

I plan to do a deeper dive on this paper in the coming weeks but, unlike AI that can work 24/7, I cannot.

Takeaway

Taken together, these projects point in the same direction. AI in investing is less of a single model and more of a framework. Hedge funds are tapping AI’s pattern-matching firepower and applying it to their investment strategies with bespoke agents.

Further Reading

  • What AI Can (and Can't Yet) Do for Alpha | Man Numeric

  • AIA Forecaster: Technical Report | Bridgewater AIA Labs


RESEARCH

AI Struggles with Logic 

LLMs Rely on Memory, Not Deductive Reasoning

It’s hard to test a technology that’s been trained on the Internet. Research shows that LLMs can “remember” training data even if the model has only seen it a handful of times.

To get around this, researchers came up with a creative solution: a nonsensical but logically sound test. In other words, it’s a logic test not based on actual facts.

Because the arguments do not match real-world facts, the model cannot rely on what it’s seen in its training data.

It must rely on the formal logic of the premises alone.
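To see why form alone is enough, here is a toy checker of my own (not from the paper) that decides whether a conclusion follows from premises built out of made-up terms, so memorized facts can't help:

```python
# Tiny forward-chaining checker: derive everything the premises entail,
# then ask whether the conclusion is among the derived atoms.
def follows(rules, facts, conclusion):
    """rules: (antecedent, consequent) pairs over single atoms."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for ante, cons in rules:
            if ante in derived and cons not in derived:
                derived.add(cons)
                changed = True
    return conclusion in derived

# "All wugs are blick; all blick things are fendle; Fep is a wug."
rules = [("wug", "blick"), ("blick", "fendle")]
print(follows(rules, {"wug"}, "fendle"))  # True: valid two-step chain
print(follows(rules, {"blick"}, "wug"))   # False: the rule can't be reversed
```

None of the terms mean anything, yet the first argument is valid and the second is not, which is exactly the kind of judgment the benchmark asks models to make.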

What was tested

  • The study evaluated models from OpenAI, Meta, DeepSeek, and Alibaba, covering both reasoning-focused systems and general instruction-tuned models, to compare their deductive reasoning performance.

Results

  • DeepSeek R1 performs best at 80.9%, beating the human average but still well below the ceiling.

  • OpenAI o1 scores 72.9%, roughly matching typical human performance.

  • GPT-4o, Llama 3, and other non-reasoning models fall lower, often in the 50 to 65% range.

  • Humans (18 online volunteers) scored 73% with some participants reaching 100%.

Models do well when the argument requires only one to three steps, but their accuracy drops quickly as the reasoning becomes deeper. DeepSeek R1 holds up better on these harder problems, while smaller reasoning models such as o1-mini eventually perform about as poorly as models that were not trained for reasoning.

The clearest performance gains come from teaching models to walk through problems step by step and from prompting them to lay out their thinking in detail. These changes make a much bigger difference than simply building a larger model, which tends to produce only small improvements on its own.

The researchers note that human performance is likely higher than the measured average, because some participants appeared to skim through the hardest questions instead of solving them carefully.

These gaps in formal logic may matter less than they sound. LLMs are trained on real facts and human-written reasoning, so they can often shortcut to the correct answer by exploiting statistical regularities in the data instead of working through a symbolic proof.

That tension is what led me to this paper in the first place: after seeing ChatGPT repeatedly describe what it was doing as “thinking,” I wanted to know what that actually means.

“We definitely shouldn’t overhype LLMs: they are still lacking/different in many areas,” Michael Chen, the paper’s lead author, told me in an email. “I also don’t think we have settled on a good definition for ‘thinking’ and ‘reasoning’ with respect to LLMs.”

Takeaway
