Brain-inspired computing

Parallel store.
Parallel recall.

The first algorithm to step past Von Neumann — running today on standard GPUs.

20 W brain  ·  700 W GPU  ·  35× efficiency gap

The limit

Today's AI runs on an 80-year-old idea.

Von Neumann separated memory from compute. Every operation moves data across that boundary. Every chip built since — CPU, GPU, TPU — inherits this structure.

Frontier scale is now colliding with the wall this design implies. The cost of moving data exceeds the cost of computing on it. Memory costs more than the silicon that uses it. The next watt of efficiency is no longer hiding in the transistor.

700 W
per GPU chip
1 %
memory bus utilized
Today — Von Neumann
Memory (HBM, 80 GB)  →  PCIe bus (1 % utilized · bottleneck)  →  Compute (GPU, 700 W)
Every operation moves data across this boundary. Most of the bus sits idle.
The brain's answer

A different architecture has been running for millions of years.

The human brain operates on roughly twenty watts. About three percent of a single GPU. It holds memory in the same place it computes. It activates only the neurons a task requires. It retrieves what it needs in parallel, cued by what it is already thinking about.

Not a metaphor. A literal architectural alternative — validated by hundreds of millions of years of evolution.

01
Memory and compute, integrated
Each neuron is simultaneously storage and processor. No bus. No bottleneck.
02
Sparse parallel activation
Of a hundred billion neurons, only a few hundred fire together — selected by the task.
03
Cued parallel retrieval
Memory returns in one shot, triggered by what is already in attention.
The brain — integrated, sparse, parallel
100 B nodes
memory and compute, both
~ few hundred fire
sparse, task-selected
One shot
parallel cued retrieval
The mechanism

How memory finds the answer in one shot.

Hear “red round fruit” — and “apple” arrives before the sentence finishes. The brain has not searched a database. It has not iterated through entries. Storage and retrieval happen in the same substrate, in the same operation.

Two layers cooperate. The hippocampus holds what is active right now: the working cue, the current context. It is small, fast, expensive. The cortex holds the rest of a lifetime: vast, slower, distributed across billions of neurons. The cue from the hippocampus radiates outward in parallel. Every memory the cortex contains is evaluated at the same instant. Only the neurons whose stored patterns match the cue fire back. The answer assembles itself from that selective firing.

Von Neumann cannot do this. In the conventional architecture, recalling N pieces of evidence means N round trips across the memory bus. Each lookup waits for the previous one to finish, because there is only one channel between memory and compute. Faster clocks shrink the cost of each trip but never collapse the trips into a single operation. The bottleneck is not a speed problem — it is the architecture itself.

We extracted that mechanism into an algorithm. The mathematics turns out to be elegant: retrieved = softmax(β · Q · Kᵀ) · V, the same operation that runs inside every transformer attention head. We mapped it onto the right hardware hierarchy. HBM became the hippocampus: fast, small, holding the cue. NVMe SSD became the cortex: vast, deeper, holding the patterns. A single query no longer walks the memory bus N times. It fires across the entire content store at once and the matching memories return together.
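
As a minimal sketch of the formula above, the following pure-Python snippet computes softmax(β · Q · Kᵀ) · V for a single cue. The toy keys and values and the helper names `softmax` and `retrieve` are illustrative assumptions, not the production implementation. The loops here run sequentially; on a GPU the same scores would come from one matrix product.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, keys, values, beta=1.0):
    """Return (softmax(beta * q . K^T) . V, weights) for one query vector."""
    # One similarity score per stored pattern: the cue is compared with every key.
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Weighted blend of the stored values; a large beta approaches a hard lookup.
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return blended, weights


# Toy store (hypothetical data): three patterns, each with a one-hot value.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cue = [0.9, 0.1, 0.0]  # noisy cue, closest to the first pattern

soft, soft_w = retrieve(cue, keys, values, beta=1.0)
sharp, sharp_w = retrieve(cue, keys, values, beta=20.0)
```

With β = 1 the result is a soft blend of all three patterns. With β = 20 nearly all of the weight lands on the closest pattern, which is the one-shot behavior the text describes.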

Two layers, one parallel mechanism
HIPPOCAMPUS · the cue ≈ HBM · μs · small
parallel cue · only the matching neurons fire
CORTEX · the content ≈ SSD · ms · vast
The mathematics of one-shot recall
retrieved = softmax(β · Q · Kᵀ) · V
The same operation that drives attention inside every transformer model — extracted, scaled, and run on the hardware hierarchy that mirrors the brain's own.
The proof — at scale

Six thousand books. One GPU. Zero hallucination.

We loaded the entire English Wikipedia — 6.4 million articles, roughly six thousand books of text — onto a single GPU node. Then we asked it questions.

Each query went through the parallel cued retrieval mechanism described above. The cue radiated across all 6.4 million articles simultaneously. Only the matching neurons fired. The answer assembled from that selective firing — the way the brain does it, but on silicon, at production scale.

Across one hundred queries spanning five categories and eight languages: 97% answered correctly. 0% hallucinated. The remaining queries the system correctly refused to answer because no relevant memory existed: the absence of confabulation rather than the absence of intelligence. Average recall latency: 10.45 milliseconds. Against the standard vector-search baseline (FAISS): 124× faster.
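
For readers who want to reproduce this kind of scorecard, here is a small sketch of the arithmetic involved. The outcome labels and the ten-query sample are hypothetical, not the reported evaluation data. The baseline figure assumes the 124× speedup compares mean per-query latency, which the text does not state explicitly.

```python
from collections import Counter


def summarize(outcomes):
    """Return the percentage of each outcome label in a list of graded queries."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: 100.0 * n / total for label, n in counts.items()}


def implied_baseline_ms(latency_ms, speedup):
    """Baseline latency implied by a measured latency and a speedup factor."""
    return latency_ms * speedup


# Hypothetical grading of ten queries, NOT the reported results.
toy = ["correct"] * 8 + ["refused"] * 2
rates = summarize(toy)

# Under the mean-latency assumption, 10.45 ms at 124x implies roughly 1.3 s.
baseline = implied_baseline_ms(10.45, 124)
```

A refusal is counted separately from a wrong answer here, which is the distinction the paragraph above draws between declining to answer and confabulating.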

The substrate is what makes this possible. NVIDIA GH200 (480 GB unified memory, 96 GB HBM3) provides the hippocampus tier. NVMe SSD provides the cortex tier. The same two-layer memory hierarchy that evolution has been refining for hundreds of millions of years now runs at data-center scale, on hardware that already ships. The system is live and publicly accessible at brain.umparumpa.com.
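
The two-tier idea can be sketched in a few lines. In this toy version, a Python list stands in for the fast tier (HBM) and a file on disk stands in for the slow tier (NVMe). The class name `TwoTierStore`, the JSON-lines layout, and the dot-product scoring are all assumptions made for illustration; the real system's data layout is not described here.

```python
import json
import os
import tempfile


class TwoTierStore:
    """Keys stay in RAM (stand-in for HBM); payloads live in a file (stand-in for NVMe)."""

    def __init__(self, path):
        self.path = path
        self.keys = []     # fast tier: one small vector per record
        self.offsets = []  # byte offset of each payload in the slow tier

    def add(self, key, payload):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            self.offsets.append(f.tell())
            f.write((json.dumps(payload) + "\n").encode("utf-8"))
        self.keys.append(key)

    def recall(self, cue, top_k=1):
        # Score every key against the cue using the fast tier only.
        scores = [sum(c * k for c, k in zip(cue, key)) for key in self.keys]
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        # Touch the slow tier only for the best-matching records.
        results = []
        with open(self.path, "rb") as f:
            for i in order[:top_k]:
                f.seek(self.offsets[i])
                results.append(json.loads(f.readline().decode("utf-8")))
        return results


with tempfile.TemporaryDirectory() as tmp:
    store = TwoTierStore(os.path.join(tmp, "cortex.jsonl"))
    store.add([1.0, 0.0], {"title": "apple"})
    store.add([0.0, 1.0], {"title": "banana"})
    hit = store.recall([0.9, 0.2])[0]["title"]
```

The design point is that the cue is compared against small keys in fast memory, and the large payloads are read from slow storage only after the match is known.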

The proof — Wikipedia, in one place
Corpus  ·  6.4 M Wikipedia articles  ·  ≈ 6,000 books  ·  8 languages  ·  5 categories
Environment  ·  NVIDIA GH200  ·  480 GB  ·  96 GB HBM3  ·  HBM as hippocampus  ·  NVMe as cortex  ·  BGE-M3 embeddings
Result  ·  0 % hallucination  ·  97 % accuracy  ·  124 × vs FAISS  ·  10.45 ms recall
Live · publicly accessible  ·  brain.umparumpa.com
Open in browser. Ask anything.
The inference algorithm

The same principle, applied to AI inference itself.

The Wikipedia experiment is the knowledge layer: brain-inspired retrieval applied to a document corpus, as it would be to enterprise documents. The same principle applies one layer deeper, inside the inference loop of large language models themselves.

We built that as a production-grade algorithm — running on the GPUs that exist today, on the SSDs that ship today, with no firmware changes and no new hardware. The same hardware now does roughly twice the work. Output is preserved bit-exact. Measured on real workloads, not synthetic benchmarks. Not a paper. Not a prototype.
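
One way to check output preservation is to compare a baseline run with a candidate run at two levels: exact agreement of token ids, and a continuous similarity over numeric outputs such as logits. The sketch below shows both measures. The token ids, the vectors, and the function names are hypothetical, and the text does not say which quantities its similarity score was computed over.

```python
import math


def token_agreement(baseline_tokens, candidate_tokens):
    """Return (matching positions, total positions) for two token-id sequences."""
    matches = sum(a == b for a, b in zip(baseline_tokens, candidate_tokens))
    return matches, max(len(baseline_tokens), len(candidate_tokens))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical outputs from a baseline run and a candidate run.
base_ids = [101, 7592, 2088, 102]
cand_ids = [101, 7592, 2088, 102]
matched, total = token_agreement(base_ids, cand_ids)

# Tiny numeric drift can leave every token identical while similarity dips below 1.
sim = cosine_similarity([0.2, 0.5, 0.3], [0.2, 0.5, 0.3001])
```

Token agreement is the stricter user-facing test, since it is the emitted tokens that a reader actually sees.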

2 ×
concurrent users, same hardware
0.99987
output similarity vs baseline
24 / 24
bit-exact token agreement
0
added GPU cost at decode
Measured — same H100, same SSD
Baseline · vLLM stock: 4 users → out of memory
UmpaRumpa · same hardware: 8 users · stable · 2 ×
Latency unchanged. Output preserved bit-exact.
Why no one solved this before
Approach | Measured outcome | Result
Naive SSD offload | 20.04× slowdown | FAIL
KIVI / KVQuant | Attention output corrupted | FAIL
SAGE-KV / H2O | +5–10% GPU overhead | FAIL
Always prefetch | 5× slowdown | FAIL
UmpaRumpa | 2× capacity · bit-exact · 0 GPU cost | PASS
All prior approaches added GPU work to relieve memory pressure — and broke either speed or accuracy. We don't touch the GPU.
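
To see why memory pressure caps concurrent users, it helps to work through the standard key-value cache budget for a transformer. The sketch below uses a hypothetical model configuration and a hypothetical amount of free GPU memory, chosen only to make the arithmetic easy. The `resident_fraction` parameter is an assumed way of modeling a cache that is partly held off the GPU; it is not a description of how the product works.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of key plus value cache that one token occupies across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_users(free_bytes, context_tokens, per_token_bytes, resident_fraction=1.0):
    """Users that fit when only `resident_fraction` of each cache stays on the GPU."""
    per_user = context_tokens * per_token_bytes * resident_fraction
    return int(free_bytes // per_user)


# Hypothetical configuration, not the measured setup.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # 128 KiB
free = 16 * 1024**3   # 16 GiB left for the cache after weights
context = 32768       # tokens of context per user

all_resident = max_users(free, context, per_token)
half_resident = max_users(free, context, per_token, resident_fraction=0.5)
```

In this toy setup each user needs 4 GiB of cache, so four users fill the budget; keeping half of each cache elsewhere doubles the count to eight. The hard part, as the table above indicates, is doing that without slowing decode or changing the output.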
Five-year vision

The paradigm change that follows Von Neumann.

Today: the first reference algorithm in the new inference ecosystem. Same socket every frontier model uses. Zero code changes for adopters.

Five years out: AI inference at a fraction of today's power per token. Memory and compute, integrated. Cued retrieval at scale. The first commercial architecture that escapes the limit Von Neumann set in 1945.

Team & Contact

Sunnyvale, California.

Andy Lee, Ph.D.
Founder & CEO
Dr. Gun Lee
CTO
Sehong Min
VP
Bala
CSO
710 Lakeway Drive, Suite 200
Sunnyvale, CA 94085