Parallel store.
Parallel recall.
The first algorithm to step past Von Neumann — running today on standard GPUs.
20 W brain · 700 W GPU · 35× efficiency gap
Today's AI runs on an 80-year-old idea.
Von Neumann separated memory from compute. Every operation moves data across that boundary. Every chip built since — CPU, GPU, TPU — inherits this structure.
Frontier scale is now colliding with the wall this design implies. The cost of moving data exceeds the cost of computing on it. Memory costs more than the silicon that uses it. The next watt of efficiency is no longer hiding in the transistor.
A different architecture has been running for hundreds of millions of years.
The human brain operates on roughly twenty watts. About three percent of a single GPU. It holds memory in the same place it computes. It activates only the neurons a task requires. It retrieves what it needs in parallel, cued by what it is already thinking about.
Not a metaphor. A literal architectural alternative — validated by hundreds of millions of years of evolution.
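The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the two wattages quoted in the text; the function names are illustrative, and the ratio compares power draw alone.

```python
# Figures quoted in the text: roughly 20 W for the brain, 700 W for one GPU.
BRAIN_WATTS = 20.0
GPU_WATTS = 700.0

def power_ratio(gpu_watts, brain_watts):
    """How many times more power the GPU draws than the brain."""
    return gpu_watts / brain_watts

def brain_share_percent(gpu_watts, brain_watts):
    """Brain power draw as a percentage of GPU power draw."""
    return 100.0 * brain_watts / gpu_watts

ratio = power_ratio(GPU_WATTS, BRAIN_WATTS)
share = brain_share_percent(GPU_WATTS, BRAIN_WATTS)
print(f"{ratio:.0f}x gap, brain is {share:.1f}% of one GPU")
# prints: 35x gap, brain is 2.9% of one GPU
```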
How memory finds the answer in one shot.
Hear “red round fruit” — and “apple” arrives before the sentence finishes. The brain has not searched a database. It has not iterated through entries. Storage and retrieval happen in the same substrate, in the same operation.
Two layers cooperate. The hippocampus holds what is active right now: the working cue, the current context. It is small, fast, expensive. The cortex holds the rest of a lifetime: vast, slower, distributed across billions of neurons. The cue from the hippocampus radiates outward in parallel. Every memory the cortex contains is evaluated at the same instant. Only the neurons whose stored patterns match the cue fire back. The answer assembles itself from that selective firing.
Von Neumann cannot do this. In the conventional architecture, recalling N pieces of evidence means N round trips across the memory bus. Each lookup waits for the previous one to finish, because there is only one channel between memory and compute. Faster clocks shrink the cost of each trip but never collapse the trips into a single operation. The bottleneck is not a speed problem — it is the architecture itself.
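The argument above can be put as a toy cost model. This is a simplified sketch, not a measurement: the lookup count and per-trip latencies are hypothetical, and the single-shot case assumes an idealized store that evaluates every entry in one trip.

```python
def sequential_recall_seconds(n_lookups, trip_seconds):
    """Dependent lookups: each trip starts only after the previous one returns."""
    return n_lookups * trip_seconds

def single_shot_recall_seconds(trip_seconds):
    """One cue broadcast to the whole store: a single trip, whatever N is."""
    return trip_seconds

N = 1_000                  # hypothetical number of pieces of evidence
for trip_ns in (100, 50):  # hypothetical per-trip latency, before and after a 2x speedup
    trip = trip_ns * 1e-9
    seq = sequential_recall_seconds(N, trip)
    one = single_shot_recall_seconds(trip)
    print(f"trip={trip_ns} ns  sequential={seq * 1e6:.1f} us  "
          f"single-shot={one * 1e6:.2f} us  ratio={seq / one:.0f}x")
```

Halving the per-trip latency halves both totals, but the ratio between them stays at N. In this model, that is the sense in which faster clocks do not remove the bottleneck.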
We extracted that mechanism into an algorithm. The mathematics turns out to be elegant: retrieved = softmax(β · Q · Kᵀ) · V, the same operation that runs inside every transformer attention head. We mapped it onto the right hardware hierarchy. HBM became the hippocampus: fast, small, holding the cue. NVMe SSD became the cortex: vast, deeper, holding the patterns. A single query no longer walks the memory bus N times. It fires across the entire content store at once and the matching memories return together.
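As a minimal sketch of the formula above, here is softmax-weighted retrieval for a single query vector in plain Python. It illustrates the mathematics only and is not the production implementation: the function names and the toy store are hypothetical, and β is treated as an inverse temperature, where larger values concentrate the weight on the best-matching key.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cued_retrieve(query, keys, values, beta=1.0):
    """Compute softmax(beta * query . keys^T) . values for one query vector."""
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    retrieved = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return retrieved, weights

# Toy store: three stored patterns (keys), each with a one-number payload (values).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]

retrieved, weights = cued_retrieve([1.0, 0.0], keys, values, beta=8.0)
print(weights)    # nearly all weight lands on the first key
print(retrieved)  # close to the first payload, 10.0
```

Every key is scored in the same pass, and the softmax decides how strongly each stored pattern contributes to the result.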
Six thousand books. One GPU. Zero hallucination.
We loaded the entire English Wikipedia — 6.4 million articles, roughly six thousand books of text — onto a single GPU node. Then we asked it questions.
Each query went through the parallel cued retrieval mechanism described above. The cue radiated across all 6.4 million articles simultaneously. Only the matching neurons fired. The answer assembled from that selective firing — the way the brain does it, but on silicon, at production scale.
Across one hundred queries spanning five categories and eight languages: 97% answered correctly. 0% hallucinated. In a further 2%, the system correctly refused to answer because no relevant memory existed: the absence of confabulation rather than the absence of intelligence. Average recall latency: 10.45 milliseconds. Against the standard vector-search baseline (FAISS): 124× faster.
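One common way to get refusal behaviour like this is a similarity threshold: if no stored pattern matches the cue strongly enough, nothing fires and the system declines to answer. The sketch below assumes that approach; it is not confirmed as the mechanism used here. The threshold of 0.8, the function names, and the three-dimensional toy vectors (standing in for real embeddings) are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_or_refuse(cue, memory, min_similarity=0.8):
    """Score every stored pattern against the cue. Return the labels that
    clear the threshold, best match first, or None when nothing does."""
    scored = [(cosine(cue, vec), label) for label, vec in memory.items()]
    fired = sorted((pair for pair in scored if pair[0] >= min_similarity), reverse=True)
    return [label for _, label in fired] or None

# Hand-made toy vectors, not real embeddings.
memory = {
    "apple":  [0.9, 0.8, 0.1],
    "cherry": [0.9, 0.3, 0.2],
    "ocean":  [0.0, 0.1, 0.9],
}

print(recall_or_refuse([1.0, 1.0, 0.0], memory))   # ['apple', 'cherry']
print(recall_or_refuse([-1.0, 0.0, 0.0], memory))  # None: no relevant memory, so refuse
```

Returning None instead of the nearest neighbour is the design choice that matters: a weak best match is treated as no match at all.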
The substrate is what makes this possible. NVIDIA GH200 — 480 GB unified memory, 96 GB HBM3 — provides the hippocampus tier. NVMe SSD provides the cortex tier. The same two-layer memory hierarchy the human brain has been refining for hundreds of millions of years — now running at data-center scale, on hardware that already ships. The system is live and publicly accessible at brain.umparumpa.com.
NVMe · cortex
BGE-M3 embeddings
The same principle, applied to AI inference itself.
The Wikipedia experiment is the knowledge layer — brain-inspired retrieval applied to enterprise documents. The same principle applies one layer deeper, inside the inference loop of large language models themselves.
We built that as a production-grade algorithm — running on the GPUs that exist today, on the SSDs that ship today, with no firmware changes and no new hardware. The same hardware now does roughly twice the work. Output is preserved bit-exact. Measured on real workloads, not synthetic benchmarks. Not a paper. Not a prototype.
The paradigm change that follows Von Neumann.
Today: the first reference algorithm in the new inference ecosystem. Same socket every frontier model uses. Zero code changes for adopters.
Five years out: AI inference at a fraction of today's power per token. Memory and compute, integrated. Cued retrieval at scale. The first commercial architecture that escapes the limit Von Neumann set in 1945.
Sunnyvale, California.
Sunnyvale, CA 94085