Memory intelligence

We make memory think.

The brain doesn't win by storing more. It wins by deciding — what to remember, what to forget, what to recall. We build that decision layer for today's memory hierarchy, in software. No new hardware required.

Between compute and storage · running on silicon that ships today


The gap

Memory got faster. It never got smarter.

Thirty years of memory engineering built a magnificent hierarchy of HBM, DRAM and NVMe, each layer faster, denser, cheaper per bit. But the decision of what goes where is still made by decades-old rules: least-recently-used, first-in-first-out. Blind rules, managing the most expensive real estate in computing.

AI pays the bill. GPUs sit idle waiting on memory. HBM — the costliest silicon ever shipped — spends its capacity holding data nobody will touch again. The compute layer is crowded. The storage layer is crowded. The layer that decides is empty.

The brain's answer

A 20-watt brain manages memory three ways. We moved all three onto silicon you already own.

The brain's genius is not capacity. It is management — three mechanisms, refined over half a billion years. These are not metaphors. They are algorithms, and they run on the memory hierarchy you already own.

Mechanism 01

Consolidate in sleep

A day's memories accumulate in the hippocampus; sleep moves them to the cortex. The brain never reorganizes while it is thinking.

becomes XHBM

Mechanism 02

Recall by cue

Memory is not searched. One scent of tangerine and a grandmother's house arrives whole — related memories surface together, in parallel.

becomes Recall

Mechanism 03

React before thinking

A ball flies at your face and your hand is already up. The signal never reaches the brain — the spinal cord answers first.

becomes React

Product — mechanism 01 · consolidate in sleep

XHBM — HBM as the hippocampus. SSD as the cortex.

KV-cache residency policy · idle-window demotion · V-only 4-bit · NIXL-native · endurance-aware QLC write-shaping

The brain moves the day's memories from hippocampus to cortex while you sleep — never while you think. XHBM applies the same discipline to the most expensive silicon on earth. It watches what an LLM is actually using, keeps the hot working set in HBM, and demotes the cold remainder to high-density QLC SSDs only inside idle windows, where the GPU is already waiting.

During decode it moves nothing — not one byte — for the same reason the brain doesn't reorganize mid-thought. It attaches to NVIDIA's NIXL / Dynamo rails with no code changes on the adopter's side, and it is fully reversible: switch it off and the stack is exactly what it was.
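As a rough illustration of the idle-window discipline described above, here is a minimal sketch of a residency policy. It is not XHBM's implementation: the class names, the `cold_after_steps` threshold and the step-based notion of coldness are all assumptions made for the example. The point it shows is structural: the decode path only records usage, and tier changes happen only in a separate call made during an idle window.

```python
from dataclasses import dataclass, field


@dataclass
class KVBlock:
    block_id: int
    last_used_step: int   # decode step at which the block was last read
    tier: str = "HBM"     # "HBM" or "SSD"


@dataclass
class ResidencyPolicy:
    cold_after_steps: int                      # hypothetical coldness threshold
    blocks: dict = field(default_factory=dict)

    def touch(self, block_id: int, step: int) -> None:
        """Decode path: record that a block was read. Nothing is moved here."""
        block = self.blocks.setdefault(block_id, KVBlock(block_id, step))
        block.last_used_step = step

    def on_idle_window(self, step: int) -> list:
        """Idle path: demote HBM blocks that have not been read recently."""
        demoted = []
        for block in self.blocks.values():
            is_cold = step - block.last_used_step >= self.cold_after_steps
            if block.tier == "HBM" and is_cold:
                block.tier = "SSD"
                demoted.append(block.block_id)
        return demoted


policy = ResidencyPolicy(cold_after_steps=100)
policy.touch(0, step=0)      # block 0 is read once, then goes quiet
policy.touch(1, step=0)
policy.touch(1, step=150)    # block 1 is still in active use
demoted = policy.on_idle_window(step=150)
```

In this toy run only block 0 is demoted, and only because `on_idle_window` was called. Promotion back to HBM, prefetch and write-shaping are deliberately left out of the sketch.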

Same mechanism, same geometry — sleep consolidation | idle-window demotion

Left: the brain lets a day's memories accumulate in the hippocampus and moves them to the cortex only during sleep. Right: XHBM lets the KV cache accumulate in HBM and demotes the cold remainder to SSD only inside idle windows. During thought — during decode — nothing moves. The structure is identical; only the labels change.

0 bytes

decode-path I/O · B200 measured

−37.5 %

KV bytes · V-only compression

2×

concurrent users · same HBM
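The −37.5 % figure is what simple arithmetic gives if the baseline stores both K and V at 16 bits and only V is reduced to 4 bits. The 16-bit baseline and the head dimension of 128 are assumptions for this sketch, and it ignores any per-group scale or zero-point metadata a real quantizer would add.

```python
def kv_bytes_per_token(head_dim: int, k_bits: int, v_bits: int) -> float:
    """Bytes for one token's K and V vectors in one head (no metadata)."""
    return head_dim * (k_bits + v_bits) / 8


# Assumed baseline: K and V both stored at 16 bits.
baseline = kv_bytes_per_token(head_dim=128, k_bits=16, v_bits=16)
# V-only 4-bit: K stays at 16 bits, V drops to 4 bits.
v_only_4bit = kv_bytes_per_token(head_dim=128, k_bits=16, v_bits=4)

reduction = 1 - v_only_4bit / baseline   # fraction of KV bytes saved
```

Under these assumptions K keeps half of the original bytes and V shrinks to one eighth of them, so 62.5 % of the bytes remain. The head dimension cancels out of the ratio.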

Industry impact

The same HBM, serving twice the users.

HBM is the most expensive and scarcest resource in the AI datacenter. XHBM lets the same HBM serve roughly twice the concurrent users without adding a single GPU. At industry scale, that is a software layer standing in for billions of dollars of memory and GPU buildout, and it cuts every LLM operator's serving bill with the layer they didn't know was missing.

Works with — NVIDIA NIXL / Dynamo · high-density QLC SSDs

Roadmap — mechanism 02 · recall by cue

Recall — memory that surfaces itself.

parallel cued retrieval · associative memory · one-shot readout · bit-exact decode preservation

Hear “red round fruit” — and “apple” arrives before the sentence finishes. The brain has not searched a database or walked a graph. One cue radiates across everything stored, and only the memories that match fire back.

Recall is that mechanism as an algorithm: retrieved = softmax(β · Q · K^T) · V — the same operation inside every transformer attention head, mapped onto the right memory hierarchy. HBM becomes the hippocampus, holding the cue. NVMe becomes the cortex, holding the patterns. No index walk, no vector-database pilgrimage: one associative readout across everything the machine has ever stored. And a memory that doesn't exist cannot surface — so it doesn't hallucinate.
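The readout formula above can be sketched in a few lines for a single cue. This is a toy, in-memory version in plain Python; the three stored patterns, the cue and the value of β are made up for illustration, and nothing here reflects how Recall lays keys and values out across HBM and NVMe.

```python
import math


def associative_readout(query, keys, values, beta):
    """Return softmax(beta * q . K^T) . V for a single query vector."""
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    readout = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return readout, weights


# Three stored patterns (keys) and their associated contents (values).
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

cue = [0.9, 0.1, 0.0]                        # a noisy version of the first key
readout, weights = associative_readout(cue, keys, values, beta=20.0)
```

Because the weights are non-negative and sum to one, the readout is always a blend of stored values. With a large β it collapses toward the single best-matching pattern, here the first one.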

Same mechanism, same geometry — cued recall | parallel readout

Left: one cue radiates across the cortex and the matching memories light together — recall is not a search. Right: one query fires across the whole content store in a single parallel readout — 10.45 ms measured at Wikipedia scale. A memory that does not exist cannot light up, which is why retrieval cannot hallucinate.

124×

retrieval vs FAISS baseline

0 %

hallucination · 97% accuracy

10.45 ms

6.4M documents · one readout

Industry impact

One session's knowledge: from 19 books to 6,000.

Today's LLM session holds roughly nineteen books' worth of context. Recall handles six thousand books in the same session with zero hallucination — the entire English Wikipedia, 6.4 million articles, on a single GPU node. This turns every terabyte of commodity NAND into usable AI memory: long-term memory stops being a database problem and becomes what it is in the brain — just another tier.

The proof — Wikipedia, in one place

Corpus

6.4 M

Wikipedia articles

≈ 6,000 books

8 languages · 5 categories

Environment

NVIDIA GH200

480 GB LPDDR5X · 96 GB HBM3

HBM · hippocampus
NVMe · cortex
BGE-M3 embeddings

Result

0 %

Hallucination

97 %

Accuracy

124 ×

vs FAISS · 10.45 ms recall

Live demo

Ask us for a session — we spin it up on demand.

info@umparumpa.com

Runs on — standard GPUs. No new hardware.

Proof — mechanism 03 · react before thinking

If the principle is real, it should survive on one watt. It does.

event-based vision (DVS) · spiking inference @ neuromorphic NPU · Kalman CA tracking · looming TTC τ=θ/θ̇ · constant-bearing miss estimate

When a ball flies at your face, the signal never travels to the brain: the spinal cord answers first, because the round trip would arrive too late. React is that reflex arc in silicon. Nothing is sent to a cloud or a GPU. An event camera, an eye that sees only change, and a 91 KB reflex model respond inside the machine itself.

No GPS, no radio, and a power target under one watt. In simulation it evades objects at speeds it never trained on, 94%+ of the time. It is the same decision layer that manages a datacenter's memory, compressed six orders of magnitude in power, and it is now being validated on real drones.

Same mechanism, same geometry — reflex arc | on-machine reflex

Left: the signal from the eye answers at the spinal cord — it never travels to the brain, because the round trip would arrive after the ball. Right: React answers inside the machine — nothing travels to a cloud or a GPU, because the round trip would arrive after the collision. The short path is the technology.

Reflex pipeline & latency budget

A frame camera has to wait for the next frame. An event has already arrived the instant something changed. The reflex is fast not because of compute — but because of structure.
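The structural difference can be made concrete with the usual idealized model of an event-camera pixel: it fires whenever its log-intensity has moved by a fixed contrast threshold since it last fired, and stays silent otherwise. The sketch below simulates one such pixel. The threshold value and the sample data are assumptions, and a real sensor reports events asynchronously with timestamps rather than from a sampled list.

```python
import math


def events_from_samples(samples, threshold=0.2):
    """Emit (index, polarity) each time log-intensity moves by >= threshold
    from the level at which the pixel last fired."""
    events = []
    reference = math.log(samples[0])
    for i, intensity in enumerate(samples[1:], start=1):
        delta = math.log(intensity) - reference
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1
            events.append((i, polarity))
            reference += polarity * threshold   # move the reference one step
            delta = math.log(intensity) - reference
    return events


static = events_from_samples([100.0] * 5)                 # nothing changes
step = events_from_samples([100.0, 100.0, 200.0, 200.0])  # brightness doubles once
```

The unchanged pixel produces no events at all, while the doubling produces a short burst of positive events at the sample where the change happens and nothing afterward. Downstream work scales with change in the scene, not with frame rate.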

94 %+

detection @ unseen speeds (sim)

91 KB

on-chip model weights

< 1 W

target inference power

Demo — reflex on a walking robot

What happens when you throw a ball at a robot?

Most robots simply get hit — their eyes are too slow. We took a commercially sold quadruped robot, changed not a single line of its walking AI, and attached only the reflex. All we send is one velocity command — “step aside” — in the exact format the robot’s own SDK already accepts, so it drops straight onto the real robot.

Physics simulation (MuJoCo). Robot model, mass and locomotion policy all match the real robot’s spec. TTC is shown as estimate and ground truth side by side, because we would rather measure honestly.
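The estimate-versus-ground-truth comparison can be reproduced on paper with the looming relation τ = θ/θ̇ listed for React. The sketch below is a toy check, not the demo's code: the ball size, distance, closing speed and sampling interval are hypothetical, and θ̇ is taken from a plain finite difference instead of a tracked, filtered estimate.

```python
import math


def angular_size(diameter, distance):
    """Visual angle in radians subtended by a ball of the given diameter."""
    return 2.0 * math.atan(diameter / (2.0 * distance))


def ttc_from_looming(theta_prev, theta_now, dt):
    """Looming estimate tau = theta / theta_dot, theta_dot by finite difference."""
    theta_dot = (theta_now - theta_prev) / dt
    if theta_dot <= 0.0:
        return math.inf          # image is not expanding: no approach detected
    return theta_now / theta_dot


# Hypothetical scenario: 0.2 m ball, 5 m away, closing at 10 m/s, 1 ms sampling.
diameter, distance, speed, dt = 0.2, 5.0, 10.0, 0.001
theta_prev = angular_size(diameter, distance + speed * dt)  # one sample earlier
theta_now = angular_size(diameter, distance)

tau_estimate = ttc_from_looming(theta_prev, theta_now, dt)
tau_truth = distance / speed     # ground truth: 0.5 s
```

The estimate uses only how large the ball looks and how fast that is growing. It needs neither the ball's true size nor its distance, which is what makes it usable from a single camera.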

Demo — drone swarm · no GPS · no radio

A swarm that never collides — on reflexes alone.

No GPS. No radio link between drones. Each drone avoids every other using nothing but its own event-camera reflex — the same engine, running per machine. When positioning and comms drop out, the reflex is what is still working.

Industry impact

One principle, from a megawatt datacenter to a battery-powered drone.

Because the reflex is built to run under a watt (brain power, not a tens-of-watts GPU), a battery-powered machine can keep evasion switched on all the time instead of only when the budget allows. And because it is the same selective-processing principle that runs XHBM, that range of six orders of magnitude in power is itself the proof of the technology.

Works with — commercial event cameras · neuromorphic NPUs · your robot

Why this works

The brain never processes everything. Neither should silicon.

Compute belongs to the GPU makers. Storage belongs to the memory makers. We are the layer between — the one that decides. It consolidates when idle, recalls by association, and reacts before thinking.

The next watt of efficiency is no longer hiding in the transistor. It is hiding in the structure. In biology and in silicon alike, attention is cheaper than capacity.

Compute

GPUs · NPUs — theirs

UmpaRumpa — memory intelligence

the layer that decides · XHBM · Recall · React

Storage

HBM · DRAM · NVMe — theirs
