A year ago, running a capable model on your own hardware meant babysitting a Python script. Today it means installing a package and forgetting about it. This is the story of how Lemonade grew up — and what we built on top of it once it did.
Earlier in this series: Strix Halo on Ubuntu: From Zero to Local Tokens introduced the hardware. This post is what came next.
TL;DR
- Lemonade went from a Python SDK to a packaged appliance — a real service you install with one command, not a virtualenv you nurse.
- It now serves a multi-model fleet over an OpenAI-, Anthropic- and Ollama-compatible API, does image generation, speech-to-text and text-to-speech, and drops straight into 100+ apps — including Claude Code.
- It’s no longer AMD-only: alongside ROCm and Vulkan, it now runs CUDA and Apple MLX. If you passed on it because you don’t own a Strix Halo, this is your invitation back.
- Run big models overnight, burst when you need speed. 128 GB of memory holds dense 27B-class models for zero-marginal-cost batch work — and the same server bursts to a private, hourly-billed GPU instance (think GLM-5.2 open weights) when a job needs to be fast.
- The payoff is privacy and permanence: your data never leaves the building, and the standing bill is a few cents of electricity instead of a metered cloud invoice. My kids named ours DadGPT.
Where this started
Last fall we wrote a long, command-dense guide to getting an AMD Strix Halo box — 128 GB of unified memory, the Ryzen AI Max+ platform — serving tokens on Linux. It helped us win an AMD developer contest and kicked off a world of new AI industry relationships. But re-reading it now, the install section feels like a relic. Page after page of install commands, dependency pins, and “if this fails, try that.” It worked, but it was a procedure, not a product.
That gap — between something that works and something you can hand to a business — is exactly what closed this year.
From script to appliance
The old Lemonade was a Python SDK. You created an environment, resolved dependencies, and kept the terminal open. The new Lemonade is an appliance: a compiled background service (lemond), a small command-line tool (lemonade), and a tray app, shipped as native packages for Debian, Ubuntu, Fedora, Arch, plus Docker, Snap, macOS and Windows. You install it, it runs as a systemd service, and it’s there after every reboot whether or not anyone is logged in.
That shift sounds mundane. It isn’t. “Runs as a service” is the line between a demo and infrastructure. It means declarative configuration (drop in a config file, restart, done), idempotent model loads that don’t race each other, live model management that pins the models you want kept hot and auto-unloads the rest when you’re tight on VRAM, and pre-flight checks that fail fast on a full disk instead of retrying forever. These are unglamorous, and they are exactly what you want from something that has to stay up.
It also introduced recipes and backends as first-class ideas. A backend is the engine and device target — llama.cpp on Vulkan, ROCm, CUDA, CPU, or Apple Metal/MLX; the NPU runtimes for Ryzen AI; whisper.cpp for audio; sd.cpp for images. A recipe is a saved, named configuration so the machine comes back up exactly the way you tuned it. You stop hand-assembling launch flags and start declaring intent.
The distro story: Debian proper, Ubuntu for the cutting edge
Here’s a choice most local-AI write-ups skip: which Debian-based distro. We run two of them — there are plenty more — on purpose.
Ubuntu Server is the easiest on-ramp — and the one that ships with commercial support you can buy if your business needs a number to call. AMD builds and validates against it, so its repo — pockets, channels, the ROCm and Vulkan backends — is where the newest support lands first. If you want to be aligned with where the vendor is actively working, Ubuntu is the path of least surprise. (Ubuntu Desktop is a joy in its own right — we’ve been tempted to daily-drive a Strix Halo box more than once. And when the laptops land, a Gorgon Halo machine would be our pick for university students: the chip has already turned up in benchmarks with up to 192 GB of unified memory — nearly double Strix Halo — so the laptop form factor can’t be far behind.)
Debian proper is our preference for the box that has to last. It’s tighter and leaves more RAM for models, the stable cycle is long, and backports put it on a current 7-series kernel without dragging the rest of the system forward. AMD didn’t initially consider Strix Halo supported on Debian — so we worked it out anyway, and now run it there in production. (Lemonade publishes a Debian 13 package with every release.)
The two distros aren’t a fork in the road for us; they’re a pair. Both halos share one cache over LVM — the model checkpoints and the Lemonade recipe store — so a 40 GB checkpoint pulled on the Ubuntu box, and the recipes tuned around it, are instantly available to the Debian one. Debian is where we run production; Ubuntu is where we stay current with AMD and where we do our upstream Lemonade project work.
One more thing, because it changes who this is for: we’re describing a headless appliance — a server in a closet, on a UPS, with no monitor attached — but Lemonade itself runs just about anywhere. There’s a one-click installer and tray app on Windows (and it runs happily under WSL on a workplace laptop), a macOS package, Docker and Snap images, an Arch build that drops straight onto CachyOS for the tweakers, and native Debian, Ubuntu and Fedora packages — plus embeddable binaries you can bundle into your own app. Pick whichever surface fits your shop; the API on the other side is identical.

This is where the evergreening discipline we’ve practiced for years pays off. A stable base, a clean dependency graph, security patches that arrive without drama — that’s not nostalgia, it’s what makes a local AI box deployable in a commercial environment rather than a science project that rots in six months. None of this maturation happens by accident: a particular shout-out to Mario Limonciello, whose upstream work has done much of the heavy lifting to take Lemonade from enthusiast-grade to something you can responsibly put in front of a client, from a Debian packaging perspective.
What the appliance does now
Once it’s a service, it stops being “a model” and starts being a platform:
- A multi-model fleet on one endpoint. Everything answers on a single OpenAI-compatible endpoint, so any tool that speaks OpenAI, Anthropic or Ollama — chat UIs, editors, agents, Claude Code — just points at it and works. A built-in MCP gateway lets agents call it directly. No code changes, no SDK lock-in.
- More than text. Image generation (Stable Diffusion / Qwen Image), speech-to-text (Whisper and Moonshine), text-to-speech (Kokoro), and true omni-modal models that return multimedia, not just words. One box, every modality.
- Fast enough for the day-to-day. Built-in speculative decoding (MTP / EAGLE3 draft models) roughly doubles throughput, so the local box comfortably handles the steady stream of small jobs. You won’t give up your frontier coding subscriptions for the hard problems — but you stop spending paid tokens on the routine: our harnesses (hermes and openclaw) hand the small tasks to Lemonade, and Qwen3.6-27B makes a genuinely good overnight code generator and reviewer.
- Operational manners. Resumable downloads, a native Prometheus metrics endpoint, secure WebSockets behind HTTPS, crash-safe installs, and a built-in benchmark (lemonade bench) that gives apples-to-apples numbers across engines. Boring on purpose.
Cutting edge, but governed
The fast-moving part of this world is llama.cpp: new quantization formats and speed tricks land almost daily, and a hot new model quant — say an Eagle / MTP draft for speculative decoding — often needs a specific, very recent llama.cpp build to run at all. The usual way to chase that is to compile from source and hope. We don’t have to: Lemonade ships its llama.cpp builds through Ubuntu PPA pockets — a stable channel and a bleeding-edge one — so “the newest build” is still just a package.
And most tuning is a tunable plus a restart: change a recipe or a per-backend argument, restart the service, done — no environment surgery. Which is exactly the kind of chore you can hand to an agent.
So we did. One of our workflows watches for updated model quants, follows the model card to the llama.cpp build it requires, waits for that build to land in the bleeding-edge pocket, then deploys the new quant, updates the recipe, and tests it through the API before it goes live. And when we’re really hungry, we don’t even wait for the pocket: point the configuration at a specific llama.cpp build and lemonade-server pulls that binary straight from upstream — running it ahead of the CI and quality checks the Lemonade team (led by Jeremy Fowers) would normally put it through. That’s a knob you only reach for on a box you own and verify yourself. The result is a contradiction that shouldn’t work but does: you ride the absolute cutting edge of models and runtimes while never leaving a package-managed, reproducible environment — and an agent does the watching, deploying, and verifying. That’s the Netstatz thesis in a single loop: frontier capability, governed, on hardware you own.
More on the harness ecosystem that drives that in a later post.
What we built on it
This is the part the last post promised. Once you have a private, always-on inference appliance, the interesting question isn’t “can it run a model” — it’s “what do you point at it?”

DadGPT. Back in September 2025 we stood up OpenWebUI on top of Lemonade running a 120-billion-parameter open-weight model (gpt-oss-120b), and pointed our home DNS so that dadgpt.com resolves to it on the network. Each family member gets a partitioned, Gmail-authenticated space — private chats and the kind of questions nobody actually wants to hand to a cloud provider. The kids named it at dinner, and for a short while it was the most-used AI service in the house — not one token of it leaving the property.
Push-to-talk on everything. With OpenWhispr in the mix — and Lemonade’s newly added Moonshine backend, a streaming speech-to-text engine built for real-time, on-device use (it starts transcribing while you’re still talking, and scales from tiny 26 MB models for constrained devices up to versions that beat Whisper Large V3 on accuracy) — every aging laptop and spare machine in the house becomes a real-time voice endpoint. Old hardware that was destined for a drawer now talks to the appliance.
A beacon for discovery. The appliance advertises itself on the network, so other on-net devices can find it without anyone hand-configuring an IP.
Wired into our harnesses. The same endpoint feeds the automation we run for real work — content pipelines, research, and operational agents — orchestrated through frameworks like openclaw and hermes. The appliance is the muscle; the harnesses are the judgment. (That’s its own story, and we’ll tell it properly soon.)
The business case: run it overnight, burst when you need to
Strip away the fun and there’s a serious argument here, and it’s the one we make to clients every week.
Start with what 128 GB of unified memory actually buys you: room to run big, dense models — a 27-billion-parameter dense model, not just the small or sparse ones — entirely in local memory. The catch is honest: on a desk-side Strix Halo box, those big dense models are slow (though built-in speculative decoding now claws a lot of that speed back). But “slow” stops mattering when the work doesn’t need an answer in the next ten seconds. You have the RAM, you own the hardware, and the marginal cost is electricity — so you let the heavy jobs run overnight. Summarize the day’s documents, classify a backlog, run an audit across a whole corpus: queue it at 6 p.m. and read the results with your coffee. Throughput you own beats latency you rent, for everything that can wait.
And that’s what makes the appliance a stepping stone, not a dead end. Once a workload proves its value overnight — and once you need it interactive, or at a scale the desk-side box can’t reach — you don’t rewrite anything. Lemonade has built-in cloud offload, so the same server can route to an OpenAI-compatible provider right alongside its local models. Spin up a private Hot Aisle instance running larger open weights like GLM-5.2, billed by the hour, run the on-demand burst, and shut it down. Same API, same prompts, same privacy posture — your own instance, your data — just dialed up for the hours you actually need the speed.
That’s the entry-level version of the move. A more maturing use of the same box runs the same engine end to end: prototype a workload with vLLM on your RDNA hardware, then deploy that identical vLLM stack on a CDNA MI300X in the cloud — you re-tune the performance recipe for the bigger iron, not the code or the API. That bridge from desk to data center is its own story, and it’s coming next.
So the appliance plays two roles at once: the always-on, private, zero-marginal-cost workhorse for everything that can run overnight, and the proving ground that tells you exactly which workloads justify on-demand cloud-GPU spend before you commit a dollar to it. (That last part isn’t hand-waving — lemonade bench gives you real, comparable numbers across engines and devices.) For a small or mid-sized business weighing AI infrastructure, that’s the whole game: keep the data in the building, keep the standing cost near zero, and rent burst capacity deliberately instead of paying a per-token meter that grows with every employee who discovers it.
The honest caveat: getting a box from “it runs” to “it’s reliable, private, and supportable” is real work — distro choices, hardening, the boring operational manners. But that work is finite and repeatable, which is the whole point of an appliance.
If that tiered, privacy-first approach fits your organization — or you’re weighing where local inference and on-demand GPU each belong in your stack — that’s a conversation worth having, and it’s one we have with clients every week. The earlier you have it, the more it’s worth: the window where this kind of advice compounds is narrow, and it’s open now. And if you’re a builder rather than a buyer — standing this up yourself and want to compare notes — you’ll find us as imac1024 in the Lemonade Discord.
Takeaways
- Packaging is a feature. The jump from script to service is what turned a capable model into infrastructure you can depend on.
- It’s not AMD-only anymore. CUDA and MLX support widen the door — the hardware you already own may be enough.
- Own the overnight, rent the on-demand. Big RAM makes batch work free; the same API bursts to a private, hourly GPU instance when speed matters.
- Pick your distro on purpose. Ubuntu to ride with upstream AMD; Debian to last. Share the cache and have both.
- Local-first is a privacy strategy, not just a cost play. DadGPT is a toy; the principle behind it is not.
What’s next
We’re writing the deep version of this: our actual build — LVM-shared caches, TPM, Debian hardening, the recipes we run, and the small TOON harness we use to stage and upgrade the appliance safely. If you want the copy-pasteable version with all the sharp edges labeled, that one’s for you.
In the meantime: install the package (the Lemonade docs cover every platform), point a tool at it, and see how little babysitting it takes now.