Poolside's Model Factory: Honest Engineering, and the Data It Can't Show You
May 27, 2026
DISCLAIMER: The views and opinions expressed in this blog are solely my own and do not reflect those of my employer or any of its affiliates.
TL;DR: A lab now headquartered in Europe (Poolside) shipped a seriously strong open model, XS.2, in five weeks, and unlike most “Teaser Papers,” they actually showed us the assembly line: the Model Factory, end-to-end lineage, hash-checked distributed training. This is the opposite of the open-washing I complained about last time. The one thing they don’t reveal, where the data comes from, isn’t them being cagey. It’s the visible scar of a regulatory regime that forces a European champion to choose between being commercially viable and being scientifically open. You can’t do both here. That’s the real story.
A nice surprise, for once
A while back I wrote an angry post about “Teaser Papers”: gorgeous PDFs from Big Tech that read like science but behave like advertisements. Beautiful results, poetic intuition sections, and then a single line: “we trained on a high-quality dataset of N trillion tokens.” End of paragraph. I called it a denial-of-service attack on academic research, and I stand by every word.
So you can imagine my reaction when I sat down with the XS.2 technical report (the “Laguna” report) from Poolside[1], and found… the opposite. XS.2 is a Mixture-of-Experts model built for long-horizon agentic coding (33.4B parameters total, ~3B active), with open weights released under Apache 2.0[2]. They built it in five weeks[3] and then spent pages telling you how the machine works, not just how well it scored.
I want to talk about it, because it’s the kind of report I keep asking for. And because the one place where it goes quiet is, for once, not the lab’s fault, and that’s worth a whole section on its own.
(Full disclosure on bias: I’m an EU-based AI researcher, my PhD is in reinforcement learning and language modeling[4], and I’m congenitally happy whenever a European lab does something good. Calibrate accordingly.)
The part European labs should photocopy: the Model Factory
The thesis of the report is almost boring in how sensible it is: treat foundation-model development as an industrial process. They call their stack the Model Factory: a set of components and pipelines whose entire job is to automate the plumbing so the researchers can spend their scarce hours on actual research questions instead of babysitting infrastructure.
Two things in here are, to me, the whole game.
1. End-to-end lineage. Every change (data, config, code) is tracked with full provenance. This sounds like a DevOps footnote. It is not. When you track everything, your ablations fall out of the experiment log instead of being a separate, soul-crushing campaign of re-runs. Even better: you can come back months later and ask questions you didn’t know to ask at the time, and answer them with proper statistical tools instead of squinting at a graveyard of spreadsheets. This is the difference between a lab that has data about itself and a lab that merely produces data and then forgets it. Honestly, if you take one thing from the report, take this.
2. They’re transparent about distributed training. Optimization is the part everyone does and almost nobody writes about; it usually gets dismissed as plumbing, beneath the real science. Poolside spends real ink on it, including a detail I loved: hash checks to catch silent (and non-silent) failures in large-scale runs.[5] If you’ve ever had a multi-node run quietly corrupt a shard and poison a checkpoint without throwing a single error, you know this unglamorous paragraph is worth more than half the benchmark tables in the field. Their custom systems work has clear market value, and they shared it anyway.
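To make those two points concrete, here is a minimal sketch, not Poolside's implementation. The report does not publish this code, so the manifest layout and the helper names (`fingerprint`, `build_manifest`, `verify_manifest`) are hypothetical. It records one SHA-256 per artifact of a run (the lineage part) and re-checks those hashes later (the hash-check part), so a corrupted shard fails loudly instead of quietly poisoning a checkpoint.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of an artifact's bytes."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Record one hash per artifact (data shard, config, code snapshot)."""
    return {name: fingerprint(blob) for name, blob in artifacts.items()}


def verify_manifest(manifest: dict[str, str], artifacts: dict[str, bytes]) -> list[str]:
    """Return the names of artifacts that are missing or whose hash changed."""
    bad = []
    for name, expected in manifest.items():
        blob = artifacts.get(name)
        if blob is None or fingerprint(blob) != expected:
            bad.append(name)
    return bad


# Hypothetical run: in-memory bytes stand in for files on disk.
run = {
    "data/shard-0000": b"tokens tokens tokens",
    "config.json": json.dumps({"lr": 3e-4, "seed": 0}, sort_keys=True).encode(),
    "code.rev": b"abc123",
}
manifest = build_manifest(run)

# Simulate a silent single-byte corruption of the shard.
corrupted = dict(run)
corrupted["data/shard-0000"] = b"tokens tokens tokenz"

clean_report = verify_manifest(manifest, run)
corrupt_report = verify_manifest(manifest, corrupted)
print(clean_report, corrupt_report)
```

In a real run the manifest would be written once when the artifacts are produced and verified before every resume or checkpoint load; where exactly Poolside places its checks is not something this sketch claims to know.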
This is exactly what I begged for in the SOTA Trap post: mechanism, not just metrics. Recipe, not just the finished meal. Poolside delivered the recipe. Credit where it’s due.
The one thing they don’t share, and why I’m not mad about it
Here’s the gap. The report is detailed about the machinery of data curation, including a clean decomposition of document quality into independently learnable properties (a noise axis and an information axis) recombined into a composite score.[6] But it stays quiet on where the training data actually comes from.
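A toy sketch of what such a decomposition could look like. The combination rule below (informational value discounted by the noise probability) is an assumption for illustration, not the report's formula, and `DocScores` and `composite_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocScores:
    noise: float        # probability the document is mostly junk, in [0, 1]
    information: float  # estimated informational value, in [0, 1]


def composite_score(s: DocScores) -> float:
    """Assumed rule: informational value discounted by the noise probability."""
    if not (0.0 <= s.noise <= 1.0 and 0.0 <= s.information <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return (1.0 - s.noise) * s.information


# Invented example documents with invented scores.
docs = {
    "tutorial": DocScores(noise=0.05, information=0.9),
    "spam": DocScores(noise=0.95, information=0.6),
    "boilerplate": DocScores(noise=0.1, information=0.1),
}
ranked = sorted(docs, key=lambda name: composite_score(docs[name]), reverse=True)
print(ranked)
```

The appeal of keeping the axes separate is that each classifier can be trained and audited on its own, and the recombination rule can change without relabeling anything.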
In my old post, that silence would have earned a full rant. Here it doesn’t, because the situation is different.
A US or Chinese lab that stays vague about data origins is usually choosing opacity as strategy: commoditize the complement, keep the recipe, win. A European lab faces a different incentive structure. Under the current European regime (the AI Act’s transparency duties for general-purpose models, plus a copyright framework built around text-and-data-mining opt-outs[7]), spelling out your sources mostly buys you copyright litigation. A lab established in the EU sits squarely in that blast radius; a competitor in San Francisco has a fair-use defense to fall back on and feels it far less, even though the Act reaches it too once it serves the EU market.
So Poolside did the responsible thing inside the box it’s stuck in: it opened up the process, which advances the science, and stayed careful about the provenance, which would invite a lawsuit. I can’t be angry at a company for refusing to fall on a sword only European companies are asked to fall on.
Which brings me to who I am a little angry at. But first, some gifts.
Ideas the report sparked
Reading a good report is generative; it makes you think. So here are a handful of directions XS.2 nudged me toward. None of these are criticisms; they’re the things I’d be excited to argue about over coffee.
Make the data mixture adaptive, not static. Poolside’s AutoMixer learns a surrogate model mapping data-mixture proportions to downstream capability metrics, then optimizes the mixture over that learned surface.[8] It’s elegant. My one itch: the mixture it produces is essentially static, solved once and decoupled from the model’s current knowledge state. But what a model needs from its data at 200B tokens isn’t what it needs at 2T tokens. I’d love to see the mixture recomputed periodically on cheap distilled proxy models during the run, treating data mixing as curriculum scheduling rather than a one-shot allocation. This connects to something the report itself observes: once tokens are abundant, the bottleneck shifts from “maximizing precision under scarcity” to “controlling repetition and diversity under long-horizon training.”[9] When you have enough tokens, the moment at which you feed something starts to matter as much as whether you include it. That’s a curriculum problem hiding in a data-mixing coat. Geometrically, AutoMixer optimizes a single point on the data-mixture simplex (x ∈ Δᵈ in the paper); a curriculum is a path across that simplex, which is what the figure below traces, with training time as the vertical axis.
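[Figure: a curriculum path across the data-mixture simplex, with training time on the vertical axis. Illustrative: the corners are three example data buckets and the path shape is invented, not Poolside's published mixture.]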
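The point-versus-path distinction fits in a few lines of code. To be clear, this is not AutoMixer: `toy_surrogate` is an invented stand-in for a learned surrogate, the three buckets are made up, and a coarse grid search replaces whatever optimizer the report actually uses. The only thing the sketch shows is the difference between solving for one mixture and re-solving as training progresses.

```python
from itertools import product
import math

BUCKETS = ("code", "docs", "web")  # hypothetical data buckets


def simplex_grid(dim: int, steps: int):
    """Yield every mixture on the simplex whose weights are multiples of 1/steps."""
    for counts in product(range(steps + 1), repeat=dim - 1):
        if sum(counts) <= steps:
            last = steps - sum(counts)
            yield tuple(c / steps for c in counts + (last,))


def toy_surrogate(mix, progress: float) -> float:
    """Invented stand-in for a learned surrogate: predicted metric for a mixture.

    `progress` in [0, 1] is the fraction of training done. By assumption, web
    data is worth more early and code is worth more late. The square roots
    model diminishing returns within each bucket.
    """
    value = {"code": 0.5 + progress, "docs": 1.0, "web": 1.5 - progress}
    return sum(value[b] * math.sqrt(w) for b, w in zip(BUCKETS, mix))


def best_mixture(progress: float, steps: int = 10):
    """Grid-search the simplex for the mixture the surrogate scores highest."""
    return max(simplex_grid(len(BUCKETS), steps),
               key=lambda m: toy_surrogate(m, progress))


# Static allocation: one point on the simplex, solved once at the start.
static_point = best_mixture(progress=0.0)

# Curriculum: a path on the simplex, re-solved as training progresses.
curriculum = [best_mixture(p) for p in (0.0, 0.5, 1.0)]

for p, mix in zip((0.0, 0.5, 1.0), curriculum):
    print(p, dict(zip(BUCKETS, mix)))
```

In this toy the web share shrinks and the code share grows along the path, purely because the surrogate was written that way. The open research question is whether a surrogate refit on cheap proxy models mid-run would show a real shift of that kind.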
Multi-objective RL might finally be production-ready. Their reward-design section made me sit up. I spent my PhD on reinforcement learning, and “multi-objective RL” has long lived in the “great in theory, cursed in practice” drawer. Seeing serious reward engineering at this scale makes me wonder if it’s time to drag it out of that drawer and treat multi-objective RL as something that genuinely works in production, not just in a paper with three toy environments.
Welcome back, centralized-vs-decentralized RL. I had to laugh at the trainer-to-inference weight sync machinery. We are, in a very real sense, back to the old centralized/decentralized RL debates, except this time the axis isn’t algorithmic elegance, it’s GPU efficiency. Everything old is new again, just more expensive.
And the one I can’t help wanting: theory of mind. This is my bias showing, but the half of my thesis that wasn’t RL was theory of mind: a model’s ability to represent and match the understanding and context of the human it’s working with.[10] For a model whose whole purpose is collaborating with developers, I’d have loved to see even a gesture toward ToM in the agent design. Matching the agent’s model of the task to the human’s model of the task is, I’d argue, a defining quality of a genuinely useful agent, not just a capable one. Maybe next report.
(Poolside, if you’re reading this: those are free. You know where to find me. 😉)
The real subject: the false choice
Now, the part I’m actually annoyed about, and notice it isn’t pointed at Poolside.
European researchers and European companies are being forced onto two diverging tracks, moving at completely different speeds:
- The open track: document your sources in enough detail to expose yourself, honor every machine-readable opt-out, and absorb the copyright risk that comes with it. You can mitigate it (filtering, takedown pipelines, opt-out detection), but that’s months of work and lawyers a startup racing to ship in five weeks does not have.
- The viable track: stay quiet about provenance, ship, survive, and accept that your “science” is now only partly falsifiable from the outside.
The asymmetry comes down to one thing: a fair-use defense. In the US, a lab can argue that training on copyrighted material is fair use. That argument is unsettled and being fought hard right now (billion-dollar settlements, the New York Times suing OpenAI),[11] but it is a real legal shield, and Europe offers no equivalent. Under EU copyright law there is no fair-use escape hatch: if a rightsholder reserves their work you have to honor it, and the AI Act makes you publish a summary of what you trained on.
Those disclosure rules do reach any lab that sells a model into the EU, so this is not Europe binding only its own. But a European lab is the most exposed of all: based where enforcement is closest, open to local copyright suits, and with no fair-use defense to fall back on. That pressure splits European AI in two: publicly funded efforts that can afford to be fully open, and commercial labs that stay closed to survive. I keep coming back to that divide in these posts, where the two sides lean in different directions even while sharing the same public compute.
Let me be fair to the EU here. GDPR, the AI Act’s transparency rules, and the copyright opt-out all come from a defensible place: protecting people’s privacy, giving creators a real say over whether their work trains a model, and pulling AI out of the black box so the public and rightsholders can see what went in. The goal is to protect citizens and creators, and that goal is right. My argument is narrower: the net effect on European builders cuts against the very competitiveness Europe keeps saying it wants.
You cannot fault a company for choosing to survive. You can fault a structure that makes openness commercially suicidal for its own champions, and then wonders why those champions go quiet, or get acquired, or relocate.
I want more European labs like Poolside, not fewer. That is why the regulatory environment around them needs to stop forcing this trade-off.[12] We are building out AI Factories to give Europe compute. Good. Now we need a data-and-disclosure regime that lets a European lab be both fast and open without betting the company on it. Otherwise we will keep producing exactly this: excellent work, transparently engineered, with a polite silence where the most scientifically interesting paragraph should be, and we’ll keep blaming the labs for a choice we made for them.
Closing
The XS.2 report is one of the more honestly engineered reports I’ve read in a while. It gave me the mechanism I keep demanding from the field, it sparked half a dozen research arguments I’d happily lose, and it was transparent about everything its lawyers would allow.
To be fair about the word “honest,” though: the model that impressed me most on openness was Apertus, the Swiss LLM from EPFL, ETH Zurich and CSCS that released its training data and full recipe, not just its weights. That contrast is the whole point, and no knock on Poolside. Apertus could open its data because it did the slow compliance work a publicly-funded academic project has time for: filtering to respect opt-outs, stripping personal data, documenting everything. A company shipping in five weeks does not have that runway. Same continent, two sets of rules, and only one of them gets to be fully open.
Fix that asymmetry, and Europe might have a shot.
As always: if I got something wrong, or you want to fight me about adaptive data mixtures, theory of mind, or the AI Act, tell me. I can’t get better without feedback.
Footnotes
1. Poolside was founded in 2023 in San Francisco by Jason Warner (ex-GitHub CTO) and Eiso Kant; it later relocated its headquarters to Paris and is backed by, among others, French investor Xavier Niel. So “European” here means established and operating in the EU, which (as we’ll see) is exactly what puts it in the regulatory crosshairs, not “European-born.”
2. XS.2 weights are on Hugging Face under Apache 2.0. The report positions it as “competitive with state-of-the-art open models in their respective weight classes”, so “strong open model,” not “GPT-5 at home.” Credit to them for shipping the weights, not just the benchmarks.
3. Their words: “Building XS.2 from inception to delivery in five weeks was only possible because we treat foundation model development as an industrial process.” Five weeks. I’ve spent longer than that fighting a single SLURM queue.
4. For the curious: “Conversational agents in human-machine interaction: reinforcement learning and theory of mind in language modeling.” Yes, that “theory of mind” bit becomes relevant later.
5. “Critically, we found hash checks important to prevent silent and non-silent failures of large-scale training runs.” The word “silent” is carrying a lot of trauma.
6. A “noise axis” capturing whether a document is mostly junk, and an “information axis” capturing educational / informational / pre-training value, recombined into a composite contribution score. Clean idea. I’d still love to see the labeling distribution (maybe a v2?).
7. The relevant pieces are the EU AI Act’s transparency requirements for general-purpose AI (including a public summary of training content via the Commission’s template, with GPAI obligations applying from August 2025) and the DSM Directive’s (2019/790, Art. 4) text-and-data-mining exception with rights-holder opt-out. One honest caveat: the training-summary duty is extraterritorial; it applies to anyone placing a GPAI model on the EU market, not only EU providers. So the asymmetry I’m describing is really about copyright-litigation exposure and enforcement reach landing hardest on the company with a European address, not a clean EU-only obligation. I’m compressing a lot of nuance here.
8. Formally: learn a surrogate ℳ from data mixtures to downstream evaluation metrics, then optimize the mixture against the learned surrogate. The static-vs-adaptive trade-off is real (recomputing burns compute), but proxy models are cheap, and the prize (less wasted training) is large.
9. Their framing: “The challenge shifted from maximizing precision under scarcity to controlling repetition and diversity under long-horizon training.” This is one of the most quietly important sentences in the report.
10. I will take any excuse to talk about theory of mind in agents, and I’m not sorry.
11. The Bartz v. Anthropic author class action settled for $1.5 billion, and The New York Times v. OpenAI remains in active litigation through 2026. US fair use for AI training is a real defense, but far from settled.
12. A contested read, worth flagging. The Draghi report argues the EU regulatory stack hampers innovation; others counter that capital access and market fragmentation, not the AI Act, are the binding constraint. The EU’s May 2026 Digital Omnibus has already simplified and delayed parts of the Act, so the regime is moving, not frozen.