
If you’ve been curious about running AI locally but found most guides either hand-wavy or clearly written by someone whose “budget” starts at a datacenter lease, Hardware Asylum’s new series is worth your time. Dennis Garcia, the site’s Editor in Chief, spent the past several months building and iterating on a multi-GPU local AI workstation in his lab and documented the entire journey across four detailed articles published over the past month.
Part 1
The series kicks off with Balancing Model Quality and Hardware Demands in AI Workstations, which is essentially the prerequisite class you didn’t know you needed before throwing money at hardware. Dennis breaks down the four memory components an LLM actually requires to run: model weights, the KV Cache (what he accurately describes as the model’s “short term memory”), activation memory, and framework overhead. He explains quantization in plain terms, noting that going below INT4 begins to introduce meaningful divergence in the model’s outputs, and provides a useful table showing how precision format maps to memory consumption for an 8B model: FP32 needs 32GB, INT4 gets you down to 4GB. The takeaway is that VRAM is the constraint, not raw GPU speed, and anything spilling into system DRAM for inference is going to crawl.
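The precision-to-memory table Dennis presents is simple enough to sketch as arithmetic: bytes per parameter times parameter count, which is the weights portion only (KV cache, activations, and framework overhead come on top of it). A minimal illustration, using the article's 8B-model numbers:

```python
# Rough weight-memory estimate per precision format, matching the
# article's table for an 8B-parameter model. This covers weights only;
# KV cache, activation memory, and framework overhead are extra.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory in GB to hold the model weights at the given precision."""
    return params_billion * BYTES_PER_PARAM[precision]

for fmt in ("FP32", "FP16", "INT8", "INT4"):
    print(f"8B @ {fmt}: {weight_memory_gb(8, fmt):g} GB")
```

Running it reproduces the article's endpoints: 32GB at FP32 down to 4GB at INT4, which is why a 20GB card can comfortably hold a quantized model that would never fit at full precision.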
Part 2
Part two, Building a Multi-GPU AI Workstation on a Budget, is where the rubber meets the road. Dennis’s build centers on an ASUS ROG Strix TRX40-E Gaming motherboard paired with an AMD Ryzen Threadripper 3960X and a pair of RTX A4500 workstation GPUs. Each A4500 carries 20GB of VRAM and uses a blower-style cooler, which matters when you’re stacking two cards in one system. He notes that consumer RTX 3090 cards were tried and rejected, as the Founders Edition’s passthrough cooler design causes the lower card’s heat to flood into the upper card and trigger throttling on both.
The A4500s were sourced used on eBay for roughly $1,000 each at the time of writing. On inference performance, two A4500s running Gemma3 12B at Q8 delivered 38.14 tokens per second versus 33.18 tok/s from a single RTX 5080, with the older professional cards actually edging ahead on that workload due to their higher aggregate VRAM. The 5080 flipped the script on image generation, completing a Qwen Image 2512 render in 155 seconds versus 237 for the A4500s, so the use case really does determine which hardware wins.
Part 3
Part three, Software Choices for a Multi-GPU AI Workstation, runs through the full software stack that Dennis has assembled. The lineup covers Ollama and Llama.cpp for inference and model management, OpenWebUI as a self-hosted ChatGPT-style front end, ComfyUI for image and video generation workflows, N8N for automation and agent pipelines, OpenClaw for tool-calling, Aphrodite Engine for high-throughput content generation, and Augment Toolkit for dataset creation. The Ollama section is particularly useful, detailing three distinct ways to import models: downloading from ollama.com, importing from SafeTensors for Q8 builds, or converting from GGUF.
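For the curious, those three import paths map onto a handful of Ollama CLI commands. The sketch below assumes current Ollama conventions; the model names and file paths are placeholders, not anything from Dennis's setup:

```shell
# 1) Pull a prebuilt model from the ollama.com registry:
ollama pull gemma3:12b

# 2) Import from SafeTensors: point a Modelfile at the model directory,
#    then build it. Modelfile contents would be, e.g.:
#        FROM ./my-safetensors-model/
ollama create my-model -f Modelfile

# 3) Import a GGUF conversion the same way, with a Modelfile like:
#        FROM ./my-model-q8_0.gguf
ollama create my-gguf-model -f Modelfile

# Either way, the result runs locally:
ollama run my-model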
Dennis’s take on OpenClaw is entertainingly candid. He describes Jensen Huang calling it “the operating system for AI” at GTC 2026 and then notes that it ships with zero guardrails and is “no better than talking with a pre-teen” if you aren’t extremely precise with your instructions. His advice is to sandbox its file system access tightly and monitor its web dashboard closely, which seems like very reasonable advice for a tool that will happily delete your email if you forget to tell it not to.
Part 4
The final installment, Fine-Tuning LLMs on a Local Multi-GPU AI Workstation, tackles the most hardware-intensive aspect of local AI. Training requires roughly 20x the memory of inference, meaning an 8B model that inferences in 16GB needs around 160GB to train. Dennis's approach uses Axolotl LLM for the actual training runs combined with DeepSpeed's ZeRO optimization and its CPU-offloaded Adam optimizer, which allows weights and tensor data to be offloaded from VRAM to system DRAM, making it possible to train a 4B model on dual RTX A4500s with 128GB of system memory doing most of the heavy lifting. He's also testing Phison's aiDAPTIV platform, which substitutes an extremely high-endurance NVMe drive (100 DWPD) for the DRAM offload layer, theoretically allowing even larger model training on limited hardware. His assessment of aiDAPTIV is measured: it works, but Phison is slow to add model support, multimodal training isn't there yet, and the whole platform becomes inert if Phison stops selling the proprietary SSD.
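The DRAM-offload trick described above is configured through DeepSpeed's JSON config, which Axolotl can point at. A minimal sketch of the relevant section follows; the specific values are illustrative defaults, not Dennis's actual settings:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8
}
```

The `offload_optimizer` block is what shifts optimizer state out of VRAM and into system DRAM, which is exactly the mechanism that lets 128GB of system memory absorb the bulk of a training run that two 20GB cards could never hold on their own.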
Dennis closes the series with a perspective worth sitting with. He draws a parallel to the CPU era, where single-core frequency gains eventually plateaued and performance improvements shifted to core counts. He argues AI is approaching a similar inflection point: incremental model quality improvements are slowing down, mixture-of-experts (MoE) architectures are emerging as a way to scale knowledge without scaling hardware demands proportionally, and the real remaining constraint at the top end is datacenter power capacity. Local AI hobbyists are carving out a useful space well below that ceiling, and this series is one of the better practical guides to doing it without burning cloud credits or buying into the hype that you need the latest and fastest to accomplish something worthwhile. Three-generation-old workstation GPUs running a custom-trained 14B model from your spare bedroom is, apparently, entirely on the table.
Check out the full series over at Hardware Asylum.
