Run GPT‑OSS‑120B on Strix Halo (Ubuntu 25.04) — 40 tok/s, no containers
OS: Ubuntu 25.04 (kernel 6.14 series)
Drivers: AMDGPU 30.10_rc1 + ROCm 7.0_rc1 (via APT, includes ROCk module)
Python: 3.13 managed by uv (APT repos), lockfile‑reproducible venv
Hardware: Bosgame M5 128 GB
Serving: Lemonade + llama.cpp ROCm
gpt-oss-120b-GGUF perf: ~40 tokens/s
On current Strix Halo boxes (e.g., Ryzen AI MAX+), Ubuntu 25.04 “just works”: the stock kernel recognizes the AMDXDNA NPU, and AMD’s preview AMDGPU 30.10_rc1 + ROCm 7.0_rc1 packages install entirely via APT, with no compiling, git pulls, or tarballs. With Lemonade on Strix Halo, you can serve gpt‑oss‑120b (GGUF) on the iGPU through llama.cpp‑ROCm and expose an OpenAI‑compatible API. The setup is fully reproducible with uv, runs headless, and takes very little time to get going.
All steps below use Ubuntu/Debian tooling (apt, bash, vi) and prioritize forward‑compatibility and easy rollback.
Dual Boot Setup
- Cloned the original SSD to a 2nd M.2 NVMe drive using gparted from the Ubuntu USB live installer.
- Resized C: and moved the Recovery partition to create free space at the end of the disk, preserving Win11 recovery functionality.
- In the system BIOS: enable SR-IOV/IOMMU and leave Secure Boot ON (this lets us enroll a MOK for DKMS). On the Bosgame M5, press DEL to enter the BIOS and F7 for boot selection.
imac@ai2:~$ uname -a
Linux ai2 6.14.0-29-generic #29-Ubuntu SMP PREEMPT_DYNAMIC Thu Aug 7 18:32:38 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
imac@ai2:~$ journalctl -k | grep -i amdxdna
Sep 03 11:39:43 ai2 kernel: amdxdna 0000:c6:00.1: enabling device (0000 -> 0002)
Sep 03 11:39:44 ai2 kernel: [drm] Initialized amdxdna_accel_driver 0.0.0 for 0000:c6:00.1 on minor 0
The 2TB NVMe drive that came with this box is shown below. It was partitioned with gparted from the Ubuntu USB live boot prior to installation. New Linux users starting from a shipped device with Windows 11 may opt to create a single new p5 partition for the Ubuntu 25.04 instance and skip the additional partitioning exercise. Creating a second partition (p6) is not required for running Lemonade on Strix Halo and has no impact on any steps described in this post. The Ubuntu installer will allocate all free space to a selected partition during installation, and its “Install Ubuntu alongside Windows” option handles all resizing on its own.
Disk /dev/nvme0n1: 1.86 TiB, 2048408248320 bytes, 4000797360 sectors
Disk model: KINGSTON OM8PGP42048N-A0
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 79708580-B666-424E-8D0D-C785190FA328

Device              Start        End    Sectors   Size Type
/dev/nvme0n1p1       2048     206847     204800   100M EFI System
/dev/nvme0n1p2     206848     239615      32768    16M Microsoft reserved
/dev/nvme0n1p3     239616  616009727  615770112 293.6G Microsoft basic data
/dev/nvme0n1p4  616009728  618057727    2048000  1000M Windows recovery environm
/dev/nvme0n1p5  618057728 1071108095  453050368   216G Linux filesystem
/dev/nvme0n1p6 1071108096 4000794623 2929686528   1.4T Linux filesystem
The Strix Halo crypto performance is excellent. I wrap nvme0n1p6 with LUKS encryption and the system hardly blinks, hitting over 13 GB/s in hardware-accelerated decode (see the benchmark below; a sketch of creating this layout follows it). The second M.2 slot creates an opportunity for a RAID 1 or RAID 0 reconfiguration for added redundancy or performance, should those become considerations for a longer-term deployment plan.
imac@ai2:~$ lsblk
...
nvme0n1            259:0    0   1.9T  0 disk
├─nvme0n1p1        259:1    0   100M  0 part  /boot/efi
├─nvme0n1p2        259:2    0    16M  0 part
├─nvme0n1p3        259:3    0 293.6G  0 part
├─nvme0n1p4        259:4    0  1000M  0 part
├─nvme0n1p5        259:5    0   216G  0 part  /
└─nvme0n1p6        259:6    0   1.4T  0 part
  └─lvm_crypt      252:0    0   1.4T  0 crypt
    └─nvme1-models 252:1    0   500G  0 lvm   /mnt/models
imac@ai2:~$ cryptsetup benchmark
...
aes-xts   256b   13151.8 MiB/s   13010.5 MiB/s
...
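For reference, here is a minimal sketch of how a layout like the one above can be created. This is destructive to nvme0n1p6; the volume group and logical volume names (nvme1, models) mirror my lsblk output and are otherwise arbitrary.

sudo cryptsetup luksFormat /dev/nvme0n1p6       # encrypt the partition (prompts for a passphrase)
sudo cryptsetup open /dev/nvme0n1p6 lvm_crypt   # map it as /dev/mapper/lvm_crypt
sudo pvcreate /dev/mapper/lvm_crypt             # LVM on top of LUKS
sudo vgcreate nvme1 /dev/mapper/lvm_crypt
sudo lvcreate -L 500G -n models nvme1           # 500G logical volume for model storage
sudo mkfs.ext4 /dev/nvme1/models
sudo mkdir -p /mnt/models
sudo mount /dev/nvme1/models /mnt/models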
Add AMDGPU & ROCm (preview) APT repos
- Add the key and repositories for both components, plus an APT pin so they take priority over the distribution packages. You can browse for newer releases under https://repo.radeon.com/amdgpu/ and https://repo.radeon.com/rocm/apt/.
# Key (/etc/apt/keyrings for user managed vs. /usr/share/keyrings where packages deploy)
curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key \
  | sudo gpg --dearmor -o /etc/apt/keyrings/rocm.gpg

# AMDGPU 30.10_rc1
echo 'deb [arch=amd64,i386 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/30.10_rc1/ubuntu noble main' \
  | sudo tee /etc/apt/sources.list.d/amdgpu.list

# ROCm 7.0_rc1
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/7.0_rc1 noble main' \
  | sudo tee /etc/apt/sources.list.d/rocm.list

# apt preferences
echo 'Package: *
Pin: release o=repo.radeon.com
Pin-Priority: 600' \
  | sudo tee /etc/apt/preferences.d/rocm-pin-600

sudo apt update
Install the graphics + compute stack (no compiling)
Secure Boot note: During DKMS install you’ll set a one‑time MOK password and enroll it on the next reboot so the kernel can load signed modules.
# AMDGPU kernel bits + userspace
sudo apt install amdgpu-dkms amdgpu

# ROCm runtime
sudo apt install rocm rocminfo
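If you want to confirm the Secure Boot state and see the enrolled MOK after the post-install reboot, mokutil can show both:

mokutil --sb-state                 # should report: SecureBoot enabled
mokutil --list-enrolled | head     # the DKMS signing key should appear here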
DKMS will build the kernel modules for you (okay, technically there is some compiling and linking here, but none initiated by the user). The rocm install pulls in a stack of math libraries and runtime packages: rocblas, rocsparse, rocfft, rocrand, miopen-hip, rocm-core, hip-runtime-amd, rocminfo, and rocm-hip-libraries.
It took me a minute to realize that AMDGPU 30.10_rc1 is newer than 6.4.3/latest and is aligned with ROCm 7.0_rc1. As shown below, the CLI reports version 6.14.14 for both.
imac@ai2:~$ dkms status
amdgpu/6.14.14-2193512.24.04, 6.14.0-29-generic, x86_64: installed
imac@ai2:~$ modinfo amdgpu | head -n 2
filename:       /lib/modules/6.14.0-29-generic/updates/dkms/amdgpu.ko.zst
version:        6.14.14
imac@ai2:~$ rocminfo | head -n 1
ROCk module version 6.14.14 is loaded
imac@ai2:~$ apt show rocm-libs -a
Package: rocm-libs
Version: 7.0.0.70000-17~24.04
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <rocm-dev.support@amd.com>
Installed-Size: 13.3 kB
Depends: hipblas (= 3.0.0.70000-17~24.04), hipblaslt (= 1.0.0.70000-17~24.04), hipfft (= 1.0.20.70000-17~24.04), hipsolver (= 3.0.0.70000-17~24.04), hipspars>
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1,056 B
APT-Sources: https://repo.radeon.com/rocm/apt/7.0_rc1 noble/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack
Kernel & memory tunables
Strix Halo’s iGPU is RDNA 3.5 and can use GTT to dynamically allocate system memory to the GPU. However, oversizing the GTT window can affect stability and may trigger OOM kills if the system lacks enough memory. I have not loaded anything large enough to cause this, but if you are curious about seeing oom-kill in your kernel logs, it can be triggered with memtest_vulkan. The AMD docs describe a couple of environment variables that appear to control thresholds for GTT allocations to prevent this; in limited testing, enabling them on a GTT-backed setup can prevent models from loading under memory pressure.
One notable issue with Lemonade’s current llama.cpp+ROCm stack arises when VRAM is set to 96GB in the BIOS. With this setting, an attempt to load gpt-oss-120b just waits forever. This appears to be the mmap-enabled loading strategy in llama.cpp misbehaving, as noted in Lemonade issues here and here. The underlying llama-server works when --no-mmap is passed directly, so that option is expected to come to Lemonade, which would make 96GB of BIOS-pinned VRAM usable. When testing this scenario, I stopped waiting for the model to load after 230 minutes in one session. There were no errors or crashes, but llama-server sat at 100% CPU and kswapd at 50% without a swap file, while the GPU sat at 0% according to rocm-smi.
You have two choices for gpt-oss-120b with Lemonade v8.1.8: either a) tune the BIOS VRAM setting down and allocate GTT as you like (amdttm.pages_limit=27648000 for 105GB, 16777216 for 64GB), or b) set your BIOS VRAM to 64GB and skip GTT. It is unclear to this author what benefit there is to leaving GTT allocated when VRAM is set in the BIOS, so I set GTT to a low value of 512M simply to avoid the default allocation of 16GB. The conversions for these values are shown below.
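For reference, amdttm.pages_limit and amdttm.page_pool_size count 4 KiB pages, so the values quoted in this post convert as follows:

# amdttm values are 4 KiB pages:
#   27648000 pages x 4 KiB = 108000 MiB (~105 GiB)
#   16777216 pages x 4 KiB =  65536 MiB (64 GiB)
#     131072 pages x 4 KiB =    512 MiB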
One note from my experience: using GTT results in slightly lower gpt-oss-120b TPS (38-41) compared to VRAM (40-45). However, I have not tested this in a structured manner, or as extensively as Leonard Lin. YMMV. It looks like ROCWMMA support is right around the corner, which should show up in a uv lock --upgrade-package shortly.
a) Low VRAM, High GTT
The VRAM is set to 512MB in the BIOS and GTT is set to 105GB via kernel parameters. The 105GB value is based on an observation by Jeff Geerling. I have not tested for potential instability at higher values, but allocating more could help run even larger models.
sudo vi /etc/default/grub

# GTT to 105GB:
GRUB_CMDLINE_LINUX="transparent_hugepage=always numa_balancing=disable amdttm.pages_limit=27648000 amdttm.page_pool_size=27648000"

sudo update-grub

imac@ai2:~$ sudo dmesg | egrep "amdgpu: .*memory"
[    3.375071] [drm] amdgpu: 512M of VRAM memory ready
[    3.375074] [drm] amdgpu: 108000M of GTT memory ready.
With GTT set to 105GB, I can now load models beyond the 96GB BIOS limit. I successfully loaded GLM-4.5-Air-UD-Q6_K_XL (-ngl 100, landing at GTT Total Used Memory (B): 98704523264). This platform’s ability to go beyond 96GB is exciting and unique, and I expect GTT performance to become equivalent to VRAM, despite being a few TPS slower today. No more BIOS tinkering.
b) High VRAM, Low GTT
The VRAM is set to 64GB in the BIOS and GTT is set to 512M via kernel parameters. For Lemonade on Strix Halo running gpt-oss-120b, this is currently more performant than Low VRAM, High GTT by a few TPS.
sudo vi /etc/default/grub

# GTT reduced to 512M:
GRUB_CMDLINE_LINUX="transparent_hugepage=always numa_balancing=disable amdttm.pages_limit=131072 amdttm.page_pool_size=131072"

sudo update-grub
If the VRAM is set higher, to 96GB in the BIOS, gpt-oss-120b will not load even with low GTT settings (512M). I have not done significant diagnostics, but the symptom is that the remaining 32GB of system memory struggles to stage the model. To load a 62GB model like gpt-oss-120b through Lemonade, which runs llama-server with mmap enabled, you currently need about 64GB of system memory in addition to 64GB of VRAM for the model. The system memory is not a strict requirement; the trouble loading large models with only 32GB of system memory appears to be an mmap issue. A discussion tracked here indicates 32GB can work with mmap disabled, and a fix for this scenario is expected.
With 32GB of system memory, I have seen the following kernel message during model loading, which may indicate a memory-related CPU bottleneck. As mentioned, the model did not load after hours. This was with BIOS VRAM pinned to 96GB, leaving 32GB of system memory.
kernel: workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 19 times, consider switching to WQ_UNBOUND

imac@ai2:~$ sudo dmesg | egrep "amdgpu: .*memory"
[    3.212260] [drm] amdgpu: 98304M of VRAM memory ready
[    3.212261] [drm] amdgpu: 512M of GTT memory ready.
[    3.816123] amdgpu: HMM registered 98304MB device memory
gpt-oss-120b loads immediately if VRAM is set to 64GB in the BIOS and GTT to 512M via kernel parameters (amdttm.pages_limit=131072 amdttm.page_pool_size=131072 is shown as 512M below). Pinning VRAM at 64GB in the BIOS also seems to deliver slightly higher TPS than running from GTT with VRAM at 512M.
imac@ai2:~$ sudo dmesg | egrep "amdgpu: .*memory"
[    3.333155] [drm] amdgpu: 65536M of VRAM memory ready
[    3.333156] [drm] amdgpu: 512M of GTT memory ready.
[    3.892481] amdgpu: HMM registered 65536MB device memory
Swap File
I have disabled the swap file on my system; with swap enabled, the kernel emits SVM messages, usually during model load. With the current mmap issue, these arrive in endless numbers on a 32GB-system-memory box trying to load gpt-oss-120b if the swapfile is present. No swapfile, no “SVM mapping failed” messages.
kernel: amdgpu: SVM mapping failed, exceeds resident system memory limit
This is achieved by simply commenting the swapfile entry out of /etc/fstab, as shown in the last line below.
imac@ai2:~$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
# / was on /dev/nvme0n1p5 during curtin installation
/dev/disk/by-uuid/cea76f55-f802-4ef3-a1cd-ebda84150293 / ext4 defaults 0 1
# /boot/efi was on /dev/nvme0n1p1 during curtin installation
/dev/disk/by-uuid/7E3F-BB4F /boot/efi vfat defaults 0 1
#/swap.img      none    swap    sw      0       0
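To stop using the swapfile immediately, without waiting for a reboot, swapoff does the trick:

sudo swapoff -a     # deactivate all active swap now
swapon --show       # empty output confirms swap is off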
Environment Variables
There are a lot of environment variables that affect ROCm at runtime. Although none are used in this Lemonade on Strix Halo recipe, it is worth keeping track of what is out there. Leaving them enabled in your .bashrc can lead to unexpected behavior if you forget about them.
GPU_MAX_ALLOC_PERCENT=100
GPU_SINGLE_ALLOC_PERCENT=100
HSA_OVERRIDE_GFX_VERSION=11.0.0
HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
HIP_VISIBLE_DEVICES=0
ROCR_VISIBLE_DEVICES=0
ROCBLAS_USE_HIPBLASLT=1
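If you do need one of these, scoping the variable to a single invocation avoids the forgotten-.bashrc problem. For example:

# one-shot override; nothing persists in your shell profile
HSA_OVERRIDE_GFX_VERSION=11.0.0 rocminfo | head -n 1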
Set up a Python venv with uv
I prefer uv over pyenv/poetry and use a packaged version from debian.griffo.io.
curl -fsSL https://debian.griffo.io/EA0F721D231FDD3A0A17B9AC7808B4DD62C41256.asc \
  | sudo gpg --dearmor -o /etc/apt/keyrings/debian.griffo.io.gpg

echo 'deb [signed-by=/etc/apt/keyrings/debian.griffo.io.gpg] https://debian.griffo.io/apt trixie main' \
  | sudo tee /etc/apt/sources.list.d/debian.griffo.io.list

sudo apt update
sudo apt install uv
Head to wherever you want to store your lemonade project.
cd ~/src/lemonade   # replace with your own project location
uv init
uv venv
uv python pin 3.13
uv add torch --index https://download.pytorch.org/whl/rocm6.4
uv sync
uv pip install 'lemonade-sdk[dev]'   # try uv add 'lemonade-sdk[dev]' first
Lemonade on Strix Halo does not require Torch for GGUF+ROCm, but it is useful to have. Pinning the ROCm wheel index in your pyproject.toml helps resolve the dependency extras cleanly when you pull lemonade-sdk, and avoids about 1GB of extra NVIDIA tools and libraries. Using uv pip install also works around lemonade-sdk not yet guarding its Windows-only dependencies. That issue, which produces the error below, will be fixed soon, allowing the use of uv add. The uv resolver is much stricter than pip, which just does what it is told.
error: Distribution `pywin32==311 @ registry+https://pypi.org/simple` can't be installed because it doesn't have a source distribution or wheel for the current platform
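For reference, the relevant pyproject.toml fragment ends up looking roughly like this (a sketch; the project name and version pin are placeholders and will differ in your tree):

[project]
name = "lemonade"
requires-python = ">=3.13"
dependencies = ["torch>=2.8.0"]

[[tool.uv.index]]
url = "https://download.pytorch.org/whl/rocm6.4"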
Run Lemonade headless (screen)
Running in screen lets you start Lemonade, leave it running in the background, and close your terminal window. I picked a reasonable context size, which is configurable, and set the host so that Lemonade listens on all interfaces, not just localhost. This system is on a private network; please do not port forward or put a system in this configuration on a public IP.
cd ~/src/lemonade
screen -S lemony

# inside screen:
source .venv/bin/activate
lemonade-server-dev run gpt-oss-120b-GGUF \
  --ctx-size 8192 \
  --llamacpp rocm \
  --host 0.0.0.0 \
  --log-level debug |& tee -a ~/src/lemonade/lemonade-server.log
Detach from screen with CTRL-a d and reattach with screen -r lemony. Access http://STRIX_HALO_LAN_IP_ADDRESS:8000 from a browser on any device on the same network. The debug log level outputs TPS and other useful information, but can be removed when not needed.
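Since Lemonade exposes an OpenAI-compatible API, a quick smoke test from any machine on the LAN might look like the following. The /api/v1 base path reflects Lemonade’s defaults as I understand them; adjust if your install differs.

curl http://STRIX_HALO_LAN_IP_ADDRESS:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b-GGUF", "messages": [{"role": "user", "content": "Say hello in five words."}]}'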
Lock it down
Secure the interfaces once you have it working. In this case, only two ports are required. SSH port 22 is for administration. HTTP port 8000 is for web access to the model manager and API.
sudo ufw allow 22/tcp
sudo ufw allow 8000/tcp
sudo ufw enable
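Verify the result:

sudo ufw status verbose   # only 22/tcp and 8000/tcp should be allowed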
Screen is used here, but a systemd service may be better for long-term use, serving the API to something like Open WebUI. When this Strix Halo is not tied up with other workloads, a separate Debian trixie instance of mine serves Open WebUI to provide memory (look at your old chats) and advanced features to other local network users. It is a mature tool, great for engaging with private data, and a clear alternative to enterprise AI chat subscriptions for enhancing household and small business productivity.
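A minimal sketch of such a unit, assuming the user and paths from this recipe (imac, ~/src/lemonade); treat it as a starting point, not a tested configuration:

# /etc/systemd/system/lemonade.service
[Unit]
Description=Lemonade server (gpt-oss-120b GGUF via llama.cpp ROCm)
After=network-online.target

[Service]
User=imac
WorkingDirectory=/home/imac/src/lemonade
ExecStart=/home/imac/src/lemonade/.venv/bin/lemonade-server-dev run gpt-oss-120b-GGUF --ctx-size 8192 --llamacpp rocm --host 0.0.0.0
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl daemon-reload && sudo systemctl enable --now lemonade.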
Basic Local Monitoring
Messages about VRAM and GTT allocations and ongoing SVM mapping failures
journalctl -b | egrep "amdgpu: .*memory"
Follow logs live to see errors in realtime
journalctl -f
Watch the GPU memory use
watch -n1 /opt/rocm/bin/rocm-smi --showuse
Inspect VRAM capacity
rocm-smi --showvbios --showmeminfo vram --showuse
rocm-smi --showvbios --showmeminfo gtt --showuse
rocm-smi --showvbios --showmeminfo all --showuse
Evergreening
sudo apt update && sudo apt upgrade
Check for version bumps in the APT repositories under https://repo.radeon.com/amdgpu/ and https://repo.radeon.com/rocm/apt/. Move to the stable or latest streams when the preview releases become final.
uv lock --upgrade                        # make a copy of uv.lock first for rollback, if not using a git repo with a tag
uv pip install --upgrade lemonade-sdk    # today: use if you installed with uv pip install
uv lock --upgrade-package lemonade-sdk   # soon, when lemonade-sdk properly manages dependencies for linux vs windows
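A concrete rollback flow for the lockfile, per the comment above:

cp uv.lock uv.lock.bak    # snapshot before upgrading
uv lock --upgrade
uv sync
# if the upgrade misbehaves, restore and re-sync:
cp uv.lock.bak uv.lock && uv sync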
Also keep an eye on your torch wheel if you are using torch; either update the index in your pyproject.toml or remove it so its dependencies cannot conflict with lemonade.
[[tool.uv.index]]
url = "https://download.pytorch.org/whl/rocm6.4"
Initial State 9/3/25 (At Publication)
apt managed
ii  amdgpu-dkms  1:6.14.14.30100000-2193512.24.04  all    amdgpu driver in DKMS format.
ii  rocm         7.0.0.70000-17~24.04              amd64  Radeon Open Compute (ROCm) software stack meta package
ii  uv           0.8.14-1+trixie                   amd64  An extremely fast Python package and project manager, written in Rust.
uv managed
lemonade-sdk  8.1.8
torch         2.8.0+rocm6.4
Current State 9/5/25 (Evergreening)
apt managed
ii  amdgpu-dkms  1:6.14.14.30100000-2193512.24.04  all    amdgpu driver in DKMS format.
ii  rocm         7.0.0.70000-17~24.04              amd64  Radeon Open Compute (ROCm) software stack meta package
ii  uv           0.8.15-1+trixie                   amd64  An extremely fast Python package and project manager, written in Rust.
uv managed
lemonade-sdk  8.1.8
torch         2.8.0+rocm6.4
Open WebUI
If you do want to spawn Open WebUI, the steps below should work on Debian trixie, and probably on any Ubuntu Plucky instance as well.
sudo apt install pkg-config python3-dev build-essential libpq-dev
uv init
uv venv
uv python pin 3.11
uv sync
source .venv/bin/activate
uv pip install setuptools wheel
uv add open-webui
open-webui serve
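To point Open WebUI at the Lemonade server, its OpenAI-compatible endpoint can be supplied via environment variables. The variable names below follow Open WebUI’s documented configuration; the /api/v1 base path is an assumption, so adjust to your install:

# point Open WebUI at the Lemonade endpoint before serving
export OPENAI_API_BASE_URL="http://STRIX_HALO_LAN_IP_ADDRESS:8000/api/v1"
export OPENAI_API_KEY="none"
open-webui serve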
Ready to Build on This?
Building for the future without creating technical debt is a powerful paradigm. But the real business advantage comes from mapping your unique business logic into multi-agent AI workflows that solve real problems and create real scalability.
At Netstatz Ltd., this is our focus. We leverage our enterprise experience to build intelligent agent systems on stable, secure, and cost-effective edge platforms like Strix Halo. If you are a small or medium-sized business looking to prototype or deploy local AI solutions, contact us to see how we can help. If you’re in the Toronto area, we can grab a coffee (or a beer) and talk shop.