RESEARCH
Treating Limit Order Books as a “Language”

LLMs are prediction machines. They excel by forecasting the next token, typically thought of as the next word.
But you can train the same transformer architecture on weather, genomic, or financial-transaction data if you have a large enough dataset.
And there's growing research showing that, just as transformer-based models excel at language, they can excel in other domains.
You just need a lot of computing power and terabytes of data. And financial markets are brimming with exactly these data streams.
I recently spoke with Juho Kanniainen, a professor at Tampere University’s Data Science Research Centre, on how he and his colleagues trained a transformer-based model on limit order book messages with superior predictive results.
We talked about his new pre-print, “LOBERT: Generative AI Foundation Model for Limit Order Book Messages,” co-authored with Eljas Linna, Kestutis Baltakys, and Alexandros Iosifidis and presented at QuantMinds and the NeurIPS Workshop on Generative AI in Finance. Here’s what they did:
Building a Market Foundation Model
Dataset and Timeframe: The researchers used 470 million messages from the Nasdaq ITCH feed covering four major stocks (AAPL, INTC, MSFT, and FB). The data spanned from May 11, 2015, to September 30, 2015, and was divided into roughly 919,000 sequences for training and testing. The final version of the paper will be published with a more extensive set of securities and more recent data.
Architecture Design: They adapted the BERT architecture to handle Limit Order Book (LOB) data by creating a "one-token-per-message" scheme. This allowed the model to process a unified representation of discrete trade types (like "new order" or "execution") alongside continuous values for price, volume, and time.
Pre-training Technique: The model underwent Masked Message Modeling (MMM), where it learned the "language" of the market by predicting missing pieces of a message sequence before being fine-tuned for specific tasks like price forecasting.
The fine-tuned, task-specific predictors run efficiently over long horizons and eliminate the need for iterative generation, thereby meeting strict latency constraints.
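To make the "one-token-per-message" idea concrete, here is a minimal Python sketch of how a full LOB message might be packed into a single token and how Masked Message Modeling could hide some of those tokens during pre-training. The event types, bin counts, and masking rate below are illustrative assumptions, not the paper's actual vocabulary or hyperparameters.

```python
import random

# Illustrative event types and coarse bins; LOBERT's real vocabulary
# and discretization are more elaborate (these are assumptions).
EVENT_TYPES = ["new_order", "cancel", "execution"]
N_PRICE_BINS = 8
N_VOLUME_BINS = 8

def message_to_token(event_type, price_bin, volume_bin):
    """Pack one full LOB message into a single token id."""
    e = EVENT_TYPES.index(event_type)
    return (e * N_PRICE_BINS + price_bin) * N_VOLUME_BINS + volume_bin

def token_to_message(token):
    """Invert the packing: recover (event_type, price_bin, volume_bin)."""
    volume_bin = token % N_VOLUME_BINS
    rest = token // N_VOLUME_BINS
    return EVENT_TYPES[rest // N_PRICE_BINS], rest % N_PRICE_BINS, volume_bin

# Reserved mask id, one past the largest real token.
MASK = len(EVENT_TYPES) * N_PRICE_BINS * N_VOLUME_BINS

def mask_sequence(tokens, mask_prob=0.15, rng=None):
    """Masked Message Modeling: hide some messages, keep them as targets."""
    rng = rng or random.Random(0)
    masked, targets = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(t)   # the model must reconstruct this message
        else:
            masked.append(t)
            targets.append(-1)  # position ignored in the training loss
    return masked, targets
```

Because every message collapses to one token, a sequence of 50 messages costs 50 tokens rather than the hundreds a field-by-field tokenization would require, which is where the roughly 20x token reduction comes from.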
Results: Improved Forecasting Accuracy and Efficiency
27.8% Accuracy: The paper’s main objective is not to generate predictions directly with the pre-trained MMM, but to learn the grammar of market microstructure and then use this model to fine-tune prediction heads for downstream tasks. As a side product, one-step message prediction accuracy was substantially improved: In predicting the next full market message, LOBERT (when combined with a book snapshot) achieved 27.8% accuracy, compared to just 6.1% for previous leading models. This represents a massive leap in the model's ability to understand the complex sequence of market events.
20x Fewer Tokens: By consolidating entire messages into single tokens, LOBERT processes approximately 20 times fewer tokens per sequence than previous methods. This makes the model significantly more efficient at handling the vast context of historical data needed for financial forecasting.
>82% Selective Performance: LOBERT demonstrates strong calibration when filtering for confidence. When the confidence threshold is raised above 0.9, the F1 score increases from 0.51-0.55 to 0.82-0.88.
Outperformance: The model clearly outperforms the DeepLOB baseline in mid-price prediction, particularly in high-confidence scenarios. For example, for a 10-step horizon, the average F1 score increases from 0.48 to 0.82 when using a 90% confidence threshold.
What follows is a conversation about the potential future of “foundation models for markets.”
This interview has been edited for clarity and length.
Matt Robinson: Could you explain the core idea behind using these models for market data and how you address the question of market efficiency?
Juho Kanniainen: The models themselves don't really care if the sequences are about words or limit order book messages, as long as there are patterns. When it comes to limit order book messages and the whole stream of data, patterns emerge because trading and market-making algorithms react to past events, continuously creating chains of reactions. This is why (illegal) spoofing exists in markets, where manipulators induce desired reactions using orders that are never intended to be executed. The key point is that certain patterns emerge in the limit order book, whether unintentionally or intentionally created, enabling short-term prediction with ML models that are capable of capturing them.
Matt Robinson: How did you design the model architecture to capture these patterns?
Juho Kanniainen: We trained a general-purpose encoder-only model to learn the grammar of microstructure market data. On top of that, we can fine-tune the model for specific downstream tasks. In our paper, we focused on mid-price prediction, which is very relevant for market making. You can place your bid and ask orders in relation to the predicted mid-price rather than the current one, allowing you to manage your inventory better. We could also predict things like whether the bid and ask prices will cross within a given time horizon or if the spread will increase.
Matt Robinson: When did you first get started with this research?
Juho Kanniainen: We have a long history in this field, as we have published on limit order book modeling with machine learning for years. My first papers were published in 2017, and we happened to be the first researchers in that field. In 2018, we published the TABL model.
Matt Robinson: Beyond alpha generation, what are the primary use cases for these models in the industry?
Juho Kanniainen: In addition to market making and arbitrage trading, these models are relevant for optimal execution. You can use these models to quantify market impact in a purely data-driven way. Imagine you have a visible sequence of order book messages and you add one more message representing your own transaction. As a large asset management company, you can estimate how the market reacts to your transaction over the next 50 milliseconds, 100 milliseconds, or even 10 seconds. You can make a prediction with your transaction and one without it, and the difference quantifies the expected market impact. This allows you to save money by executing large transactions optimally.
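The with/without comparison Kanniainen describes can be sketched generically: run a sequence model twice, once on the observed message stream and once with your own hypothetical order appended, and difference the two mid-price forecasts. Everything below is a stand-in, not LOBERT; `toy_model` is a deliberately trivial forecaster used only to show the counterfactual mechanics.

```python
def expected_market_impact(model, messages, own_message, horizon):
    """Estimate the impact of `own_message` as the difference between
    the model's mid-price forecast with and without it appended.

    `model(messages, horizon)` is any forecaster returning a predicted
    mid-price `horizon` events ahead (a stand-in for a fine-tuned head).
    """
    baseline = model(messages, horizon)
    with_order = model(messages + [own_message], horizon)
    return with_order - baseline

def toy_model(messages, horizon):
    """Toy stand-in: mid-price drifts with net signed volume."""
    net = sum(m["volume"] if m["side"] == "buy" else -m["volume"]
              for m in messages)
    return 100.0 + 0.001 * net  # horizon ignored in this toy

stream = [{"side": "buy", "volume": 200}, {"side": "sell", "volume": 150}]
impact = expected_market_impact(
    toy_model, stream, {"side": "buy", "volume": 500}, horizon=50)
```

The same differencing works for any forecaster, which is what makes the impact estimate "purely data-driven": no structural impact model is assumed, only the learned message dynamics.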
Matt Robinson: Does this technology have applications in regulatory surveillance as well?
Juho Kanniainen: Yes. Market supervisors can use this to identify possible manipulation. You can continuously construct a market impact index for every message by comparing the model’s predictions with and without the trader’s message. If a supervisor identifies a sequence of messages from a given trader that are canceled rather than executed, but all of them have a very high market impact, that could indicate spoofing—especially if that trader then takes advantage of the price movements caused by those messages. Our model allows supervisors to track exactly how much messages from a given trader are expected to change the market.
Matt Robinson: Given the speed of high-frequency trading, how do you manage the latency involved in running these models?
Juho Kanniainen: We are currently working on projects with Business Finland and the Nasdaq Nordic Foundation involving generative models, and we are preparing a live demo using crypto data. Crypto data allows us to access Level 3 limit order book data for free. The question is whether the models are fast enough that the signals they provide remain valid within the prediction horizon. The unoptimized inference speed of LOBERT is around 3 milliseconds, and the prediction horizon in our paper was at most 100 events, corresponding to approximately 2 seconds. This makes the approach very promising. The idea is that the outputs of these models could be used directly with existing trading algorithms to improve performance. It is all about latency; the data processing and the inference take time, and we will verify this with the live demo next spring.
Matt Robinson: You were able to create a high level of efficiency with a relatively low-parameter model. Was that achieved by packing more information into fewer tokens?
Juho Kanniainen: We focused on minimizing the number of tokens. Volumes can range from zero to any amount, which could explode the number of tokens. Instead, we used bins and placed regression heads on top to predict the exact volume at a bid. This was one way of reducing the number of tokens, which is critical for latency. What also seems to work well is how we model the "delta t," or the time intervals between messages. This is very important information for market making and arbitrage because it helps predict when the next message will arrive. We need to minimize computation latency, even at the cost of the size of the model.
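The bin-then-refine idea can be sketched in a few lines: a coarse bin keeps the token vocabulary small, and a regression head recovers an exact value within the bin. The bin edges and the within-bin interpolation below are assumptions for illustration; the paper's discretization may differ.

```python
from bisect import bisect_right

# Illustrative log-spaced volume bin edges (an assumption, not the
# paper's actual binning); the last bin is open-ended.
BIN_EDGES = [0, 10, 100, 1000, 10000]

def volume_to_bin(volume):
    """Coarse token: the index of the bin a raw volume falls into."""
    return bisect_right(BIN_EDGES, volume) - 1

def refine_volume(bin_idx, regression_output):
    """A regression head refines the coarse bin to an exact volume:
    regression_output in [0, 1] interpolates within the bin."""
    lo = BIN_EDGES[bin_idx]
    hi = BIN_EDGES[bin_idx + 1] if bin_idx + 1 < len(BIN_EDGES) else 10 * lo
    return lo + regression_output * (hi - lo)
```

With this scheme, arbitrary volumes cost only `len(BIN_EDGES) - 1` classification targets instead of an unbounded vocabulary, which is exactly the token-count pressure Kanniainen describes.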
Matt Robinson: I understand your use of a "confidence threshold" has received positive feedback. How does that improve the model’s utility?
Juho Kanniainen: It is a simple idea. Mid-price prediction is essentially a classification problem: will the price go up, stay the same, or go down? We use the model only when it is very confident, meaning the softmax value for one class is much higher than the others. If none of the classes meet the confidence threshold, the model does not make a prediction. For example, you might only make a prediction on every third event and ignore the rest, but your prediction performance becomes much better. Often, the reason a mid-price changes is completely unrelated to past limit order book dynamics, so the model should not attempt a prediction every time.
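The selective-prediction rule is simple to state in code: act only when one softmax class clears the threshold, otherwise abstain. The probabilities here are placeholders for a model head's output.

```python
def selective_prediction(probs, threshold=0.9):
    """Return the predicted mid-price move only when the model is
    confident; otherwise abstain with None.

    probs: softmax distribution over ("down", "flat", "up").
    """
    classes = ("down", "flat", "up")
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best] if probs[best] >= threshold else None
```

Abstaining on low-confidence events is what drives the jump in F1 from the 0.51-0.55 range to 0.82-0.88 reported in the paper: the model simply skips the moves it has no basis to call.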
Matt Robinson: Does accuracy decline as the prediction horizon increases?
Juho Kanniainen: Interestingly, if we increase the prediction horizon, the model performance does not necessarily decrease; it can actually increase slightly before it starts to decline. We observed that for some stocks, like Microsoft, we achieved better accuracy at a 100-step horizon than at 10 steps.
Matt Robinson: What has been the most surprising finding in your recent research compared to previous models?
Juho Kanniainen: The most surprising thing is the performance gap between this new model and older ones like DeepLOB or the TABL model we published in 2018. Those older models used snapshots of the limit order book as input but did not handle the message data itself. Our current model uses a sequence of messages combined with snapshots, which clearly improves predictions. When I talk to market makers, they look at the message flow because trading algorithms react to transactions, cancellations, and other messages that just took place.
Matt Robinson: How do you see these advancements affecting the different segments of the financial industry?
Juho Kanniainen: The needs are very different across segments, from high-frequency trading to asset management. Large firms are continuously developing and training new models because the markets change every few weeks. Training these models is the expensive part. Mid-size hedge funds may not have the same resources for exploiting these possibilities. There is also a fundamental difference in frequency: in mid- and low-frequency trading, there is typically a human in the loop, but in high-frequency trading, you must trust the models 100%. Our research aims to fill the gap where information on high-frequency strategies is often kept private by large firms.

SPONSORSHIPS
Reach Wall Street’s AI Decision-Makers
Advertise on AI Street to reach a highly engaged audience of decision-makers at firms including JPMorgan, Citadel, BlackRock, Skadden, McKinsey, and more. Sponsorships are reserved for companies in AI, markets, and finance. Email me ([email protected]) for more details.

