
AI Stack with Sparkline Capital's Kai Wu

Hey, it’s Matt. In this AI Street Markets:

🎙️ An interview with Kai Wu, Sparkline Capital founder & CIO, on how he’s using LLMs to better quantify intangible assets.


INTERVIEW

Quantitative investors have historically relied on accounting data and price metrics.

Kai Wu thinks they're missing the soft factors that drive stock performance today.

As Sparkline Capital founder and CIO, he uses AI to analyze patents, corporate communications, and other unstructured data to identify what he calls "intangible value"—the intellectual property, brand strength, and human capital that he believes traditional financial statements understate.

He started his career at GMO working on Jeremy Grantham's $40 billion asset allocation team, helping manage a $2.5 billion global macro hedge fund. In 2014, he co-founded Kaleidoscope Capital, a quantitative hedge fund in Boston that grew to $350 million in assets; he sold his stake in 2018.

He founded Sparkline Capital that same year, spending time exploring where the investment industry was headed and discovering large language models—well before they became mainstream. He launched his first ETF in 2021 and has since built a suite of active ETFs centered on his intangible value framework.

In our conversation, Wu explains how he applies AI to centuries of patent data and culture indicators, why he thinks the line between quantitative and fundamental investing is blurring, and why transfer learning made text-based factor investing viable. He also shares his view on what investors can learn from Renaissance Technologies’ use of unstructured data.

This interview has been edited for clarity and length. 

"The four largest companies today by market value do not need any net tangible assets. They are not like AT&T, GM, or Exxon Mobil, requiring lots of capital to produce earnings. We have become an asset-light economy."

Tell me about Sparkline Capital.

The main business at Sparkline is asset management through ETFs. We’re still trying to create alpha using quantitative techniques but in terms of structure we are trying to skate to where the puck is going. A lot of assets and investor interest are moving into ETFs, specifically active ETFs.

Historically, ETFs were synonymous with index funds. But due to a variety of changes, we’re now seeing more active strategies put into ETF wrappers. That provides efficiency, operational benefits, and tax advantages compared with traditional hedge funds. There’s a lot of interest in that category.

I launched my first fund four years ago, a second one about a year ago, and now I’m building out a suite of products centered on the concept of intangible value. I believe that if value investing, in the Ben Graham and Warren Buffett sense, is going to thrive in the digital economy, then we need to adapt the definition of intrinsic value to include intangible assets.

The techniques we use—LLMs and unstructured data—are what make this possible. If you just look at accounting data, you’re missing out on the most valuable information on intangible assets. There’s simply not enough information. Why wouldn’t you also look at the 80-plus percent of data that’s unstructured? And why wouldn’t you use the latest tools to analyze it?

Nobody I know is really trying to solve this problem.

How did you end up focusing on intangible value?

Historically, quants have excelled in some dimensions, right? We have the ability to process larger amounts of data faster and in a more disciplined way. We're less emotional, so we're not going to just sell all our stocks in ‘08.

The downside of being a quant is that historically only a small percentage of the potential universe of information on companies has been accessible. Until recently, quants were restricted to accounting-based information: price, volume, P/E ratios, asset turnover ratios, all that kind of stuff.

But a lot of information isn't even digital. And even that which is digital has historically been very difficult for quants to ingest because you can't take these textual documents and put them through linear regression.

And that’s where LLMs are a huge breakthrough for us, because now we can start saying, let’s base things on text. I wrote a paper called Text-Based Factor Investing, and you can probably guess what that means. The idea was: can we create factors—like Value, Carry, Momentum—but derived from textual data instead? Using NLP, we can generate culture scores or innovation scores and turn those into factors that can be incorporated alongside traditional ones in an investment process.
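
To make the idea concrete, here is a minimal sketch of how a text-derived factor might be built. This is not Sparkline's actual pipeline; the embedding model, seed phrase, and toy data are assumptions for illustration only. The gist: embed each company's text, score it against a theme such as innovation, and turn the cross-sectional scores into a simple long/short factor.

```python
# Minimal sketch of a text-based factor (illustrative assumptions throughout;
# not Sparkline's actual methodology).
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer  # pre-trained embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")

# Seed description of the theme we want to score (e.g., an "innovation" factor).
theme = model.encode(["investment in research, novel technology, and new product development"])[0]

# Toy universe: one text snippet per company (in practice: filings, patents, job posts, etc.).
docs = {
    "AAA": "We expanded our R&D labs and filed several patents for new chip designs.",
    "BBB": "Cost discipline and dividend growth remain our primary focus this year.",
    "CCC": "Our engineers launched three experimental AI products to early customers.",
}
emb = model.encode(list(docs.values()))

# Score each company by cosine similarity to the theme, then z-score cross-sectionally.
sims = emb @ theme / (np.linalg.norm(emb, axis=1) * np.linalg.norm(theme))
scores = pd.Series(sims, index=list(docs.keys()))
factor = (scores - scores.mean()) / scores.std()

# A simple long/short portfolio: overweight high scores, underweight low scores.
weights = factor / factor.abs().sum()
print(weights.sort_values(ascending=False))
```

The same recipe could produce a culture score or any other theme by swapping the seed text and the underlying documents.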

I think we’re seeing a convergence. Quants are starting to encroach on the discretionary investor’s world, and are now able to incorporate information that historically wouldn’t have been accessible. At the same time, it’s moving in the other direction too. Discretionary investors are being given tools they can use without needing to be master coders. A lot of what we’ve mentioned is increasingly available off the shelf—though of course, there’s still the challenge of sorting through all the different vendors.

They theoretically enable an analyst with no programming experience to benefit from many of the insights AI can provide. Over time, I think these things are going to meet in the middle, where the distinctions between quant and fundamental will matter less.

What tools do you use?

One of the challenges today is that there's been a proliferation in the number of vendors. If you're a fund manager, you're being pitched a million things from different startups.

We can count the number of foundation model companies on one hand, but on top of that there's a whole layer claiming to offer specialized services to investors. It's just really difficult to diligence.

What do you actually use?

I generally try to go homegrown, although I'm probably unique because I've been working with large language models since about 2019.

What were you doing back then? Not too many people knew about LLMs at that time.

I had a career transition. I sold my last hedge fund and was starting my business. It gave me some time to reset and say: Where are the big industry trends? And that's where I discovered large language models and natural language processing techniques. My goal was to quantify intangible assets from the perspective of a value investor. I used to work for GMO, a quant value investment manager, and the problem I recognized was that a lot of the intangible assets were not accurately measured by accounting statements.

The question became, how can we go about quantifying the value hidden in patents or trademarks, these unstructured data sets? It became clear to me that LLMs and AI provided the key to unlocking these data.

When did you first hear about LLMs?

Obviously the [Attention Is All You Need] paper and BERT were the big breakthroughs in 2017 and 2018. But I think the bigger breakthrough was actually less about the models and more about the data.

Deep neural networks were invented decades ago. It was just that computers weren't fast enough and there wasn't enough data to train them in an effective way.

So the transformer architecture was better than the alternatives at the time, but I don't think that was the actual game changer. The game changer was transfer learning. At the time, you could develop a specialized model trained on 10-Ks, but the problem was there just weren’t many 10-Ks. You’re talking about an extremely small sample to train a large model on, so the results wouldn’t be very accurate.

I actually wrote a paper called Deep Learning in Investing in 2020. The takeaway was that training a deep learning model on domain-specific financial data produced results that were worse than a logistic regression or standard dictionary-based approaches. It was only once you took a pre-trained language model, trained on a larger corpus of general-purpose text like the internet or books, and fine-tuned it for our use case that it became more powerful. That's the key insight: the pre-training, which I think was the big breakthrough. People oftentimes credit OpenAI or Google with the architectures, but I think it's really the less glamorous component of being able to scrape all this information and put it into a format that can then be ingested for training purposes.
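
As a rough sketch of that fine-tuning step, the snippet below starts from a model pre-trained on general text and adapts it to a small financial text task using the Hugging Face stack. The base model, labels, and example sentences are illustrative assumptions, not the setup from the paper.

```python
# Sketch: fine-tune a general-purpose pre-trained model on a small financial text task,
# rather than training a deep model from scratch on scarce domain data.
# Base model, labels, and example sentences are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny toy sample of labeled sentences (1 = constructive tone, 0 = negative).
train = Dataset.from_dict({
    "text": [
        "Management expects margin expansion driven by new product launches.",
        "The company disclosed a material weakness in internal controls.",
        "R&D spending rose as we accelerated our patent filings.",
        "Guidance was cut sharply amid weakening demand.",
    ],
    "label": [1, 0, 1, 0],
})

# Start from weights pre-trained on general-purpose text: the transfer-learning step.
base = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=64)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-demo", num_train_epochs=3,
                           per_device_train_batch_size=2, logging_steps=1),
    train_dataset=train,
)
trainer.train()  # with real filings data, this adapts general language knowledge to the domain
```

The contrast with the paper's finding is the point: the domain sample alone is far too small to train a deep model from scratch, but it is enough to nudge a model that already understands general language.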

And then just the insight that you're better off training on all data, even if it's general and not specific to your industry, as opposed to trying to be an expert at one thing and training only on your narrow data set without ever cross-training.

It's kind of like sports, right?

You want to be a generally athletic person who does lots of different things and dabbles; that's more important than spending all day hitting forehands if you play tennis.

What were the conversations like with your peers about this tech back in 2019-2020?

I think a lot of the attention was on the broader use case of deep neural networks.

We had faster computers, machine learning was becoming more prevalent, and many quants were trying to use those things to do portfolio construction or signal selection. So given a panel of a million alphas, different signals, can you select which are the most robust based on historical data? They were trying to use it as basically a fancier version of a regression or boosting-type model. I wrote in 2020 that I thought that was a dead end. The problem is that the data sets in finance are too small and too noisy. Maybe if you're doing high frequency it's different, but at my frequency, which is medium- to long-term investing, it didn't make sense. Markets are dynamic; anomalies get arbitraged away. Instead, I argued researchers should focus on a subclass of deep learning, specifically LLMs, which excel at processing text and other unstructured data.

The best example, I think, is Renaissance Technologies, the world's best hedge fund. They became really good when they hired the speech recognition team from IBM; I think that's more than a coincidence. My guess is that they were onto a lot of the stuff we're now doing. Who knows if the architectures are the same, but the idea of taking unstructured data and creating signals from it was probably a core insight they had decades ago, and that's why they're the best hedge fund.

When you say unstructured data, what data sets are you looking at?

I tend to use publicly available information. My edge is analytical, not necessarily access to proprietary data. Today, there are hundreds of data vendors with proprietary datasets, but the ability to purchase those datasets is not really a durable edge. My thought is to instead focus on publicly available data sets, such as patents, that are large, messy, and intractable for the average investor.

That's a huge database. I've tried looking through it. It’s not intuitive.

There's a lot of technical language, and it goes back to 1790, the first patent. It's a super long data set with a lot of breadth, and a lot of the assignees are actually private companies or individuals. So yeah, it's a messy data set to look through, and that's the stuff I love. The work is in going through it and trying to make sense of it. And that's where, as someone who has spent time working with a lot of text and unstructured data, I feel I have an edge.
