Next-Gen Intelligence: Building the Architecture — Dev Update
Most AI systems don’t have a sense of the world. They have a context window. Feed them input, get output, discard everything. That works well for question-answering. It doesn’t work for a system that needs to track what’s changing, anticipate what comes next, and respond when it decides something is worth saying — not just when a user hits Enter.
Building something that genuinely maintains a model of its environment rather than reacting to each prompt in isolation turns out to be an interesting engineering problem. Not just algorithmically — structurally. You need components that speak to each other, a shared representation of the current state of the world, and a loop that runs whether or not anyone is watching.
In our previous post we described what we wanted to build: a system built around a JEPA vision encoder, a DreamerV3 RSSM world model, a Mamba-based reasoning core, and a continuous perceive→predict→decide→act loop. This post covers what we’ve actually built — the first implementation milestone, working end-to-end on CPU.
Laying the Groundwork
Before writing a single neural network layer, we needed a package structure that could hold a system this complex without turning into a tangle. The guiding principle was strict separation: configuration lives apart from components, components live apart from the orchestration core, and nothing reaches across those boundaries implicitly.
The package follows a config/models/core split. Each major component — vision encoder, world model, reasoning core — gets its own typed configuration sub-dataclass loaded from YAML. A single IntelligenceConfig.from_yaml("config.yaml") call constructs the full config tree, with type errors surfacing immediately at startup rather than causing mysterious failures mid-run. That matters for a system that’s meant to run continuously in the background.
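As a rough illustration of the fail-fast idea, the sketch below builds a nested dataclass tree from a plain dictionary, which is what a YAML loader would hand back after parsing config.yaml. Only IntelligenceConfig.from_yaml appears in this post; the sub-config names, their fields, the build helper, and from_dict are all hypothetical, and the strict type check is one possible policy rather than the project's actual one.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Mapping


def build(cls, raw: Mapping[str, Any]):
    """Build dataclass `cls` from a mapping; fail fast on unknown keys or wrong types."""
    known = {f.name: f.type for f in fields(cls)}
    unknown = sorted(set(raw) - set(known))
    if unknown:
        raise TypeError(f"{cls.__name__}: unknown keys {unknown}")
    kwargs = {}
    for name, value in raw.items():
        expected = known[name]
        if is_dataclass(expected):
            kwargs[name] = build(expected, value)  # recurse into a sub-config
        elif type(value) is expected:              # strict match, no silent coercion
            kwargs[name] = value
        else:
            raise TypeError(
                f"{cls.__name__}.{name}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return cls(**kwargs)


@dataclass
class VisionConfig:  # hypothetical fields
    patch_size: int = 16
    image_size: int = 224


@dataclass
class WorldStateConfig:  # hypothetical fields
    num_latents: int = 512
    latent_dim: int = 256


@dataclass
class IntelligenceConfig:
    vision: VisionConfig = field(default_factory=VisionConfig)
    world_state: WorldStateConfig = field(default_factory=WorldStateConfig)

    @classmethod
    def from_dict(cls, raw: Mapping[str, Any]) -> "IntelligenceConfig":
        """Hypothetical stand-in for from_yaml, taking already-parsed data."""
        return build(cls, raw)
```

With this shape, a quoted number such as latent_dim: "256" in the YAML file is rejected at startup with the offending field named in the error, instead of surfacing later as a shape mismatch.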
Logging is structured throughout — every log call emits JSON with a consistent schema: component name, level, timestamp, and arbitrary key-value context. Structured logs are grep-able and trivially parseable, which is less about elegance and more about not flying blind when the system is running unattended.
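One way to get that schema with only the standard library is a custom logging.Formatter that serializes each record to JSON. This is a sketch under assumptions: the JsonFormatter and get_logger names, the "context" key, and the choice to carry context through the extra argument are illustrative, not a description of the project's logger.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of top-level keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "component": record.name,
            "level": record.levelname,
            "timestamp": record.created,  # seconds since the epoch
            "message": record.getMessage(),
            # Arbitrary key-value context, attached via `extra={"context": {...}}`.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


def get_logger(component: str) -> logging.Logger:
    """Return a logger for one component that emits JSON lines to stderr."""
    logger = logging.getLogger(component)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


log = get_logger("world_model")
log.info("tick complete", extra={"context": {"tick": 42, "latency_ms": 3.1}})
```

Because every line is a single JSON object, filtering by component or level works with grep, and anything more involved can go through a JSON-aware tool.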
WorldState — The System’s Working Memory
Every component in the architecture produces output in a different shape. The vision encoder produces a sequence of image patch embeddings. A text encoder produces variable-length token embeddings. The world model produces a temporal belief state. Something has to be the shared workspace where all of this coexists.
That’s WorldState: a fixed-size latent array, 512 vectors of 256 dimensions each, that any component can read from or write to. Think of it as working memory — a compact, always-available representation of what the system currently knows about its environment.
The design is inspired by Perceiver IO. Rather than forcing every downstream component to handle every possible input modality directly, encoders write into WorldState via cross-attention, and consumers read from it via cross-attention. The shape contract is always 512×256, regardless of what was written in.
import torch
from torch import Tensor, nn

class WorldState(nn.Module):
    def __init__(self, num_latents: int = 512, latent_dim: int = 256):
        super().__init__()
        # Learned initial latents. The live state is a buffer, because nn.Module
        # rejects assigning a plain Tensor to an attribute registered as a Parameter.
        self.init_latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.register_buffer("latents", self.init_latents.detach().clone())
        self.cross_attn_write = CrossAttention(latent_dim)
        self.cross_attn_read = CrossAttention(latent_dim)

    def write(self, inputs: Tensor) -> None:
        """Cross-attend inputs into latent array, updating WorldState."""
        self.latents = self.cross_attn_write(self.latents, inputs)

    def read(self, queries: Tensor) -> Tensor:
        """Cross-attend queries against latent array, returning context."""
        return self.cross_attn_read(queries, self.latents)
The fixed size is the key property. Memory consumption doesn’t scale with input complexity — feeding a high-resolution video frame through the vision encoder doesn’t balloon downstream memory usage, because the bottleneck absorbs it. The system has a fixed memory budget for the world, and the latent array is how that budget is enforced. What the vision encoder writes into WorldState is available to everything that comes after it.
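The arithmetic behind that budget is short enough to write out. The sketch below assumes float32 storage and a single unbatched latent array, neither of which is stated above, and compares it with the patch-token sequence a ViT-B/16 would produce at several input sizes. The function names are illustrative, and the larger resolutions are hypothetical since the encoder described in this post takes 224×224 input.

```python
def latent_bytes(num_latents: int = 512, latent_dim: int = 256,
                 bytes_per_value: int = 4) -> int:
    """Bytes needed to store one latent array (float32 by default)."""
    return num_latents * latent_dim * bytes_per_value


def patch_token_bytes(image_size: int, patch_size: int = 16, dim: int = 768,
                      bytes_per_value: int = 4) -> int:
    """Bytes for the patch-token sequence a ViT produces for one square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 * dim * bytes_per_value


for size in (224, 448, 896):
    # The token sequence grows with resolution; the latent array does not.
    print(size, patch_token_bytes(size), latent_bytes())
```

Under these assumptions the latent array is 512 × 256 × 4 = 524,288 bytes (0.5 MiB) no matter what is written into it, while the patch-token sequence quadruples each time the image side doubles. The estimate covers only the stored tensors, not the encoder's internal activations.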
JEPAVisionEncoder — Learning What Matters
The vision encoder uses a ViT-B/16 backbone — 86M parameters, 16×16 pixel patches, 224×224 input — pretrained with a JEPA objective rather than supervised classification or pixel reconstruction.
The training objective is what makes this encoder different from a standard ViT. A model trained to reconstruct pixels learns to care about texture — it needs to get the exact color values right. A model trained to predict class labels learns to care about a single category. A JEPA model is trained to predict the abstract representations of masked patch regions: given visible patches, predict what the representation of the hidden patches should be. To do that well, the model has to learn what’s semantically meaningful in a scene, not just what the surface looks like.
The result is features that transfer better to downstream tasks — and features that write more useful information into WorldState.
The standard ViT-B pools patch embeddings via a CLS token. We replace that with a learned attention pooling head:
import torch
from torch import Tensor, nn

class AttentionPooling(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # A single learned query that attends over all patch tokens.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: Tensor) -> Tensor:
        # query: (B, 1, D), keys/values: (B, 196, D)
        query = self.query.expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(query, patch_tokens, patch_tokens)
        return out.squeeze(1)  # (B, D)
The CLS token has a fixed interaction pattern baked in during pretraining — it attends to patches in the same way it learned to during training, regardless of what downstream task you’re solving. Attention pooling is more flexible: the aggregation weights are learned independently and can adapt to what the system actually needs from the image. On transfer benchmarks, this matters.
The encoder outputs one 768-dimensional embedding per image, a (B, 768) tensor for a batch of B images. A linear projection brings it to 256 dims, then a cross-attention write puts it into WorldState. From there, it’s available to the world model and the reasoning core. VRAM: approximately 0.35GB.
DreamerV3 RSSM — A Sense of Time
A world model is a system that maintains a belief about its environment even when it isn’t receiving new observations. Instead of processing each input independently, it accumulates a latent representation of what’s happening — and keeps rolling that representation forward in time, predicting what comes next, even between ticks.
For a continuous system, this is the difference between reacting and anticipating.
The RSSM factorizes the world state into two components: a deterministic state h_t, implemented as a GRU hidden state that accumulates information across time, and a stochastic state z_t, a categorical latent sampled from a learned distribution. The deterministic component provides a stable, differentiable memory thread. The stochastic component provides representational capacity — and enables imagination, rolling the model forward without real observations.
The update equations at each timestep:
h_t = GRU(h_{t-1}, z_{t-1})
z_t ~ Categorical(MLP(h_t, x_t)) # representation model (obs available)
ẑ_t ~ Categorical(MLP(h_t)) # transition model (no obs, for imagination)
r̂_t = MLP(h_t, z_t) # reward predictor
Our implementation adopts three improvements that DreamerV3 makes over DreamerV2. First, rewards and value targets are passed through the symlog function, sign(x) * log(|x| + 1), before feeding into prediction heads, which makes the model robust to rewards of large magnitude without manual scaling. Second, scalar rewards are represented as soft two-hot vectors over a fixed set of buckets, turning reward prediction into classification over those buckets rather than direct regression, which is more stable. Third, the KL divergence uses a free-bits threshold: when the KL term falls below the threshold, it is ignored entirely and contributes no gradient. This prevents the stochastic state from collapsing to carrying no information, a failure mode that happens when the KL penalty dominates the training objective.
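The three techniques are small enough to show on plain scalars. The sketch below is illustrative rather than the project's training code: the function names, the bucket values in the usage note, and the 1.0 nat default for the free-bits threshold are assumptions, and a real implementation would operate on batched tensors.

```python
import math


def symlog(x: float) -> float:
    """sign(x) * log(|x| + 1): compresses large magnitudes, keeps the sign."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y: float) -> float:
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


def two_hot(value: float, buckets: list[float]) -> list[float]:
    """Spread `value` over its two neighbouring buckets, weighted by proximity.

    `buckets` must be sorted ascending. The expected value of the resulting
    distribution equals the (clamped) input value.
    """
    value = min(max(value, buckets[0]), buckets[-1])  # clamp into range
    probs = [0.0] * len(buckets)
    for i in range(len(buckets) - 1):
        lo, hi = buckets[i], buckets[i + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)
            probs[i] = 1.0 - w_hi
            probs[i + 1] = w_hi
            break
    return probs


def free_bits_kl(kl: float, free_nats: float = 1.0) -> float:
    """Clip the KL term from below.

    Under the threshold the loss is a constant, so it yields no gradient
    and the stochastic state is not pushed toward carrying zero information.
    """
    return max(kl, free_nats)
```

For example, two_hot(0.25, [-1.0, 0.0, 1.0]) places weight 0.75 on the bucket at 0.0 and 0.25 on the bucket at 1.0. A reward head trained with cross-entropy against such targets would predict in symlog space and decode the expected bucket value with symexp.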
After each tick, the RSSM state (h_t, z_t) is projected to 256 dims and written into WorldState via cross-attention. WorldState now holds both perception-derived content from the vision encoder and a continuously updated temporal belief from the world model — the system’s best current guess about what’s happening and what’s about to happen. The RSSM itself is deliberately lightweight at approximately 0.06GB VRAM; it runs continuously in the background, even between observations.
MambaReasoningCore — Thinking Across the World State
The reasoning layer takes the full WorldState latent sequence — 512 vectors — and processes it into a reasoning trace: a refined representation that distills what the world model knows into something that can condition generation.
The challenge is efficiency. Standard transformer attention scales quadratically with sequence length. Attending over 512 latents at every tick, in a loop running at 100 ms intervals, adds up. Mamba sidesteps this with a selective state space model: a linear-time sequence model whose input-dependent parameters still let it choose which positions in the sequence to retain and which to ignore.
Each Mamba layer maintains a state that it updates as it processes each position in the sequence. A learned gating mechanism decides, position by position, how much of the accumulated state to carry forward versus overwrite with new input. This is the architecture’s analog of attention — a soft, position-dependent memory — without the quadratic cost.
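The gating idea can be illustrated with a deliberately simplified scalar recurrence. This is not the Mamba layer itself, which uses a multi-dimensional state, input-dependent discretization, and a hardware-aware parallel scan. The function and parameter names below are hypothetical, and the sketch keeps only the two properties described above: a gate computed from the current input, and a single linear pass over the sequence.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def selective_scan(xs: list[float], w_gate: float, b_gate: float) -> list[float]:
    """Scalar gated recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    The retain gate a_t is computed from the current input, which is what
    makes the scan "selective". One pass over the sequence, O(len(xs)) time.
    """
    h = 0.0
    out = []
    for x in xs:
        a = sigmoid(w_gate * x + b_gate)  # how much of the old state to keep
        h = a * h + (1.0 - a) * x         # blend old state with new input
        out.append(h)
    return out
```

With a strongly negative gate bias the state is overwritten at every position and the output tracks the input; with a strongly positive bias the state barely moves. A learned gate sits between those extremes and varies by position.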
from torch import Tensor, nn

class MambaReasoningCore(nn.Module):
    def __init__(self, cfg: ReasoningConfig):
        super().__init__()
        # Stack of Mamba blocks, each mapping (B, L, D) -> (B, L, D).
        self.layers = nn.ModuleList([
            MambaBlock(d_model=cfg.latent_dim, d_state=cfg.state_dim)
            for _ in range(cfg.num_layers)
        ])
        self.norm = nn.LayerNorm(cfg.latent_dim)

    def forward(self, world_state: Tensor) -> Tensor:
        # world_state: (B, 512, 256), the full latent array
        x = world_state
        for layer in self.layers:
            x = layer(x)
        return self.norm(x)  # reasoning trace: (B, 512, 256)
The output — the reasoning trace — is the bridge between the world model’s latent representation and the language model’s generation. The language model doesn’t receive raw world state latents; those are uninterpretable to it. Instead, the reasoning trace is projected into token-space and prepended to the generation context. The language model generates conditioned on a compressed, reasoning-refined snapshot of what the system currently believes about the world. VRAM: approximately 0.75GB, the most memory-intensive component outside the language model itself.
The Loop — Perceive, Predict, Decide, Act
Every 100 milliseconds, the system runs through four phases.
In the Perceive phase, it collects whatever sensory inputs have arrived since the last tick — image frames, text tokens — and writes them into WorldState via the appropriate encoder. Nothing blocks here; if inputs haven’t arrived, the phase completes immediately.
In the Predict phase, the RSSM steps forward. It advances its deterministic and stochastic states based on the current WorldState contents, then writes the updated temporal belief back in. The world model is always running, always anticipating — even when no new observations have arrived, the transition model rolls forward based on what it expects.
In the Decide phase, the MambaReasoningCore processes the full latent array and produces a reasoning trace. This is the system asking itself: given everything I currently know and believe, what is the state of the world?
The Act phase only runs when the system decides it should. A learned should_respond() gate checks whether the reasoning trace has crossed an information-theoretic threshold — roughly, whether the world model believes there’s something worth communicating. When it fires, the reasoning trace is projected into token-space and the language model generates a response.
The non-blocking design means new inputs arriving mid-tick are queued and processed on the next Perceive phase. The loop never stalls waiting for input. It runs whether or not anyone is paying attention.
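The control flow of one tick can be sketched without any of the models attached. In the sketch below each phase is a plain callable and inputs arrive on a thread-safe queue that is drained without blocking. TickLoop, drain, and run are hypothetical names, and the real phases would call the encoders, the RSSM, the reasoning core, and the language model.

```python
import queue
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TickLoop:
    perceive: Callable[[list], None]        # write queued inputs into the world state
    predict: Callable[[], None]             # step the world model forward
    decide: Callable[[], Any]               # produce a reasoning trace
    should_respond: Callable[[Any], bool]   # gate on the trace
    act: Callable[[Any], None]              # generate a response
    inbox: queue.Queue = field(default_factory=queue.Queue)

    def drain(self) -> list:
        """Collect everything queued so far; never waits for new input."""
        items = []
        while True:
            try:
                items.append(self.inbox.get_nowait())
            except queue.Empty:
                return items

    def tick(self) -> bool:
        """Run one perceive, predict, decide, act cycle; True if Act ran."""
        self.perceive(self.drain())  # an empty list is fine: nothing arrived
        self.predict()               # runs even with no new observations
        trace = self.decide()
        if self.should_respond(trace):
            self.act(trace)
            return True
        return False


def run(loop: TickLoop, num_ticks: int, period_s: float = 0.1) -> int:
    """Run a fixed number of ticks at a target period; returns how many acted."""
    acted = 0
    for _ in range(num_ticks):
        start = time.monotonic()
        acted += loop.tick()
        # Sleep only for whatever is left of this tick's time budget.
        time.sleep(max(0.0, period_s - (time.monotonic() - start)))
    return acted
```

Producers call loop.inbox.put(item) from any thread. Anything that arrives while a tick is in progress simply waits in the queue until the next drain, which matches the queuing behaviour described above.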
Integration Tests — Proving the Stack Composes
The integration test suite verifies the full stack end-to-end using synthetic inputs — random tensors in the right shapes — without requiring a GPU or real sensors. Every component is instantiated in CPU mode and exercised through all four phases of the loop.
The key test cases address questions that are easy to assume but worth checking explicitly. Does the full forward pass actually complete without shape mismatches? Does WorldState correctly round-trip a write followed by a read — can what was written be approximately recovered? Do all components instantiate cleanly without CUDA, so CI can run without GPU hardware? Does the config loading actually round-trip correctly from YAML?
These aren’t performance tests. They’re correctness tests, and the distinction matters. Running the full forward pass on synthetic inputs at CPU speed doesn’t tell you anything about throughput on real hardware. It tells you whether the components actually compose: whether the shape contracts hold, whether the cross-attention write/read cycle behaves as expected, whether the world model’s output lands in WorldState in a form the reasoning core can process.
pytest tests/ -v
All tests pass on CPU, and CI runs without GPU hardware. Clearing that bar says nothing about speed, but for a stack with this many moving parts it is not a given.
What It Means That This Works
The integration test passing matters more than it might look. It means five independent components (a vision encoder, a shared latent store, a world model, a reasoning core, and an orchestration loop) actually compose correctly. The shape contracts hold. The cross-attention reads and writes behave as expected. The world model’s temporal belief ends up where the reasoning core can read it. None of that is guaranteed by building each component in isolation.
What comes next is the harder work. The training pipelines — JEPA pretraining, RSSM world model training, reasoning core fine-tuning — need to be connected to real data loaders. Real sensor input needs to replace synthetic tensors in the Perceive phase. The system needs to run on actual RTX hardware so we can measure where the memory and latency budget is actually being spent. And the multimodal output layer — concurrent speech synthesis and image generation alongside the language model — needs to be added.
The system generates reasonable outputs on synthetic inputs. That’s the sanity check: the architecture is coherent, the components connect, the loop runs. The next test is whether it generates reasonable outputs on real ones.
More updates to follow.