Modern vision-language models (VLMs) have achieved impressive success in recognizing and describing visual content, yet they continue to struggle with understanding spatial relationships. The limitation persists even with massive data and model scaling, suggesting that the root of the problem lies in the architecture and training objective rather than data alone. This post examines the underlying causes and discusses why recently proposed fixes, while promising, remain insufficient to achieve robust spatial reasoning.
To truly understand an image, we have to treat it as more than a collection of pixels. As an image is a 2D projection of a fundamentally 3D world, mentally reconstructing the scene requires at least two components: recognizing “what” is in the image and understanding “where” those things are located.
This notion of “where” comes in two forms. One is the absolute where: identifying an object’s position on the image plane, often by drawing a bounding box around it. The other is the relational where: reasoning about how objects are situated relative to one another (e.g., “the chick is behind the cup” or “the car is to the left of the tree”). Both forms are important, but in this post we focus on the latter: how models reason about spatial relationships.
Coming back to the problem of image understanding, we cannot reliably infer the scene behind an image without knowing both what is present and where things are in relation to each other. Let’s consider a simple scene with two chicks sitting near a cup.
If someone asks, “Grab me the chick behind the cup,” the instruction only makes sense if we can correctly identify the “cup” (what) and accurately interpret what “behind” means in the appropriate reference frame (relational where). For example, if “behind” is defined relative to the camera’s viewpoint, it refers to the object that is farther away from the camera than the cup. This kind of relational reasoning is fundamental to real-world systems such as autonomous vehicles and robotic arms, where understanding both the objects and their spatial relationships is critical for safe and reliable action.
Modern vision-language models (VLMs), such as Gemini and ChatGPT, have become remarkably good at the what. When asked to describe an image or generate a caption, they often produce accurate and detailed responses. As these models grow larger and are trained on ever more data, their ability to recognize objects and describe visible content continues to improve. In many evaluations
However, when it comes to reasoning about where things are relative to each other, these same models often fall short.
This “what” vs. “where” paradox in modern VLMs becomes especially clear in combined reasoning tasks like “Find the hidden object.”
In this example, the model (Gemini 3 Pro) correctly identifies the hidden object (the golden ring) but mislocates it. It claims that the ring is to the right of the red Snoopy doghouse, when in fact it is located below it. These mistakes may seem minor, but they expose a deeper and persistent limitation of modern VLMs: strong semantic recognition does not guarantee accurate relational spatial understanding.
We might expect this gap to shrink as models grow larger and are trained on more data. Yet, despite massive datasets and billions of parameters in state-of-the-art systems, this spatial weakness remains. A growing body of research suggests that the issue lies in the core architecture and training objectives of VLMs. In other words, the limitation stems not only from how much data we provide, but also from how these models are built and what they are fundamentally optimized to prioritize.
In this post, we take a closer look at the architectural roots of this spatial blindness in modern VLMs. We examine the building blocks that process visual information, the training objectives that favor certain capabilities over others, and why many recent fixes still fall short of fully solving the problem. Understanding why models struggle with this relational “where” and how we might overcome this limitation is a key step toward building vision systems that truly understand the world they see.
To understand why VLMs often fail to capture spatial relationships, we need to look at their architectural foundation: the Transformer. Since its introduction in Attention is All You Need
However, pure self-attention has a critical blind spot: it does not care about order. If we only rely on relevance between tokens, the sentences “The chick is in front of the cup” and “The cup is in front of the chick” look identical, even though they describe entirely different spatial situations. Without any notion of token position, the model collapses the sentence into something like a bag-of-words representation, where only co-occurrence matters, not who is in front of whom.
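The order-blindness described above can be checked numerically. The following is a minimal NumPy sketch, assuming a single attention head with random weights and no positional encoding (the names `self_attention`, `w_q`, `w_k`, and `w_v` are illustrative, not taken from any particular model). Shuffling the input tokens only shuffles the output rows in the same way, so any order-independent readout such as mean pooling cannot tell the two orderings apart.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

perm = rng.permutation(n_tokens)
out = self_attention(tokens, w_q, w_k, w_v)
out_shuffled = self_attention(tokens[perm], w_q, w_k, w_v)

# Shuffling the input only reorders the output rows (permutation equivariance)
equivariant = np.allclose(out[perm], out_shuffled)
# so an order-independent pooling of the outputs is unchanged
pooled_equal = np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0))
print(equivariant, pooled_equal)
```

Both checks hold for any permutation, which is why position has to be injected separately.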
When we move from text to images, the problem intensifies. A typical VLM first chops an image into a grid of patches (for example, 16x16 or 32x32), converts the patches into tokens, and then feeds the resulting sequence into the Transformer. Without positional information, the model treats the image as a bag-of-patches. It may recognize that there is a chick and a cup, but it has no built-in mechanism to tell whether the chick patch is to the left, above, or behind the cup patch.
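As a rough sketch of the patching step, the snippet below cuts an image array into non-overlapping square patches and flattens each one into a token. It assumes a channels-last `(height, width, channels)` layout and a patch size of 16 pixels; the function name `patchify` is illustrative, and a real model would follow this with a learned linear projection.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    # Row-major flattening of the grid: the only trace of layout is the index
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

After this step the 2D layout survives only as the index of each row in `tokens`, which attention on its own ignores.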
To fix this, researchers developed a series of positional encodings that tag tokens or patches with location information. Many of these methods were originally developed to handle word order in sentences, but here we focus on what happens when they are applied to images.
The original Transformer paper introduces absolute position encoding (APE), which assigns a unique fixed vector to every position; applied to images, that means one vector per patch. This works if all input images share the same size, but fails when the resolution changes. If a model is trained on 224x224 images, it learns positional vectors tied to that grid. At test time, given a larger image, the model has no vectors for the extra patches. A common workaround is to stretch or interpolate the embeddings to the new grid size, but this distorts spatial relationships. The model also tends to overfit to specific locations, which limits its ability to generalize and to be translation-invariant.
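The interpolation workaround can be sketched as follows. This is a simplified illustration, assuming a square grid of per-patch position vectors, 16-pixel patches (so 224x224 gives a 14x14 grid and 384x384 gives 24x24), and plain bilinear resampling with aligned corners. The function name `resize_pos_embed` and the random table standing in for trained embeddings are assumptions made for the example.

```python
import numpy as np

def resize_pos_embed(pos, old_grid, new_grid):
    """Bilinearly resample an (old_grid*old_grid, dim) position table."""
    dim = pos.shape[-1]
    grid = pos.reshape(old_grid, old_grid, dim)
    # Where each new grid index falls in old-grid coordinates
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    w = coords - lo
    # Interpolate along rows, then along columns
    rows = grid[lo] * (1 - w)[:, None, None] + grid[hi] * w[:, None, None]
    out = rows[:, lo] * (1 - w)[None, :, None] + rows[:, hi] * w[None, :, None]
    return out.reshape(new_grid * new_grid, dim)

rng = np.random.default_rng(0)
pos_224 = rng.normal(size=(14 * 14, 32))  # stand-in for trained vectors
pos_384 = resize_pos_embed(pos_224, old_grid=14, new_grid=24)
print(pos_224.shape, pos_384.shape)
```

Most of the 576 new vectors are blends of neighboring trained vectors. None of them was seen during training, which is one way to read the distortion described above.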
To move beyond fixed grid sizes, later work introduces relative position encoding (RPE)
Today, the de facto standard for large language models and many VLMs is rotary positional encoding (RoPE)
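As a brief illustration of the rotary idea, here is a sketch of the standard 1D formulation, not tied to any particular VLM and not the multimodal variant. Each pair of feature dimensions is rotated by an angle proportional to the token position, so the dot product between a rotated query and a rotated key depends only on their offset. The function name `rope` and the `base` value of 10000 are the usual conventions, used here as assumptions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles."""
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8))

# Same offset (4) at two different absolute positions
score_near = rope(q, 3) @ rope(k, 7)
score_far = rope(q, 10) @ rope(k, 14)
print(np.isclose(score_near, score_far))
```

The two scores agree because shifting both positions by the same amount leaves the relative rotation unchanged.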
Let’s imagine there exists a positional encoding scheme that perfectly preserves 3D structure. Even with such an ideal encoding, a growing line of work suggests spatial reasoning would still remain a persistent weakness. The problem is not only how good our positional encodings are, but also how the Transformer processes them internally. Recent work points to a deeper issue inside the Transformer itself: a phenomenon often referred to as embedding norm suppression
This effect becomes evident in a simple permutation test. Take an image, break it into patches as usual, but then randomly shuffle the order of the visual tokens so that the original spatial layout is destroyed. Many VLMs show only a negligible drop in performance on standard benchmarks. For example, consider the following captioning example:
Even after random shuffling, the generated captions remain nearly identical because the model still identifies the objects (“chick” and “blue bonnet”) and produces a reasonable description of the scene. This suggests that positional encodings, including mechanisms like M-RoPE, have little effect on the model’s output in this setting. Instead, the model appears to treat the image as a bag of semantic features.
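The permutation test itself is easy to express in code. The sketch below assumes a hypothetical `model_fn` that maps a sequence of visual tokens to an output vector. The two toy functions stand in for a real VLM, which is not reproduced here: `bag_model` ignores order entirely, while `layout_model` weights each slot differently.

```python
import numpy as np

def permutation_sensitivity(model_fn, tokens, rng, trials=5):
    """Mean relative change in the output when token order is shuffled."""
    base = model_fn(tokens)
    changes = []
    for _ in range(trials):
        shuffled = tokens[rng.permutation(len(tokens))]
        diff = np.linalg.norm(model_fn(shuffled) - base)
        changes.append(diff / (np.linalg.norm(base) + 1e-12))
    return float(np.mean(changes))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
slot_weights = np.linspace(0.0, 1.0, len(tokens))[:, None]

def bag_model(t):
    # Order-independent: behaves like a bag of features
    return t.mean(axis=0)

def layout_model(t):
    # Order-dependent: each slot contributes with a different weight
    return (t * slot_weights).sum(axis=0)

bag_score = permutation_sensitivity(bag_model, tokens, rng)
layout_score = permutation_sensitivity(layout_model, tokens, rng)
print(bag_score, layout_score)
```

A score near zero, as for `bag_model`, is the signature of the behavior described above: the output barely depends on where the tokens sit.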
To counteract this, recent work proposes embedding norm normalization
However, balancing signal magnitudes is only a partial fix. It shows that positional information is “quiet” relative to semantics, but even after that correction, deeper issues remain. Spatial reasoning goes beyond simply knowing coordinates. Encodings like APE, RPE, and RoPE can indicate where the patches are and how far apart they are, but they do not capture the structure of the scene. Many spatial questions depend on richer relations, such as containment (“Is the water in the cup?”), occlusion (“Is the chick behind the cup?”), and relative depth. These concepts require structured spatial reasoning and domain-specific visual priors, not just louder positional coordinates. In other words, making position information “loud enough to hear” is necessary, but not sufficient for VLMs to reason about space as humans do.
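To make the “quiet” positional signal concrete, here is a toy illustration. It is not the method from the cited work, and the norms (10.0 for semantic content, 0.1 for position) are arbitrary assumptions chosen to exaggerate the effect. Two tokens share the same semantic content but sit at different positions; when the positional vectors are small, the tokens are nearly indistinguishable.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
dim = 64
semantic = rng.normal(size=dim) * 10.0           # "loud" content vector
pos_a, pos_b = rng.normal(size=(2, dim)) * 0.1   # "quiet" position vectors

# Same content at two positions: position barely changes the token
quiet_similarity = cosine(semantic + pos_a, semantic + pos_b)

# Rescale position to roughly the same norm as content (illustrative only)
scale = np.linalg.norm(semantic) / np.linalg.norm(pos_a)
balanced_similarity = cosine(semantic + scale * pos_a,
                             semantic + scale * pos_b)
print(quiet_similarity, balanced_similarity)
```

Rebalancing makes the two positions distinguishable. As the paragraph above notes, though, it says nothing about containment, occlusion, or depth, which require more than separable coordinates.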
Beyond how models encode position, another line of work asks a different question: where do models actually focus their attention? These studies suggest that many spatial errors arise not because positional information is missing, but because the model looks at the wrong place. In VLMs, this happens at two levels: between modalities (text vs. image) and within the image itself (relevant vs. irrelevant regions).
VLMs typically process two inputs: the image (e.g., a chick and a cup on a table) and the text (e.g., “Describe the image” or “Where is the chick relative to the cup?”). Architecturally, the visual backbone converts image patches into patch embeddings, and the language backbone (usually a powerful LLM) encodes the text. These two pieces are then fused so the model can jointly reason over both modalities.
However, prior work
This imbalance also contributes to the failures we see in spatial reasoning. The model hallucinates what should be there instead of faithfully reporting what is there. If it knows that clouds are usually above the grass, it may claim that relation even if the image is upside down with the clouds below. The model often sees what it expects from language, not what the pixels actually show. In this view, many spatial errors stem less from model limits than from uneven attention between modalities. VLMs tend to trust text more than vision.
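One simple way to quantify this imbalance is to measure how much attention mass lands on image tokens versus text tokens. The sketch below uses synthetic attention scores, not measurements from any model. It assumes the image tokens come first in the key sequence, and the token counts and the `+6.0` bias are arbitrary choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def image_attention_share(attn, n_image):
    """Average fraction of attention that queries place on image tokens."""
    return float(attn[:, :n_image].sum(axis=-1).mean())

rng = np.random.default_rng(0)
n_image, n_text = 576, 32
# Rows: text-token queries; columns: image keys followed by text keys
logits = rng.normal(size=(n_text, n_image + n_text))
balanced_share = image_attention_share(softmax(logits), n_image)

# Synthetic text bias: raise the scores of all text keys
text_biased = logits.copy()
text_biased[:, n_image:] += 6.0
biased_share = image_attention_share(softmax(text_biased), n_image)
print(balanced_share, biased_share)
```

Even though image tokens outnumber text tokens many times over in this toy setup, a modest per-token bias toward text is enough to leave the image with a small share of the total attention.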
A natural reaction is to simply increase attention to visual tokens. But that alone is not enough. Even if the model looks more at the image, spatial reasoning will continue to fail if it focuses on the wrong regions.
Even if we manage to encourage a VLM to rely more on visual input, a second failure mode appears: the model may simply look in the wrong place.
Consider asking, “Where is the chick in relation to the cup?” Ideally, the model should focus on patches containing the chick and cup. However, empirical analyses
To address this, recent work proposes AdaptVis
AdaptVis and similar methods show that attention allocation is crucial for spatial reasoning. They show that VLMs often underuse available visual information and that steering attention can significantly improve performance.
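As a generic sketch of what steering attention at inference time can look like, the snippet below rescales the pre-softmax scores over image tokens by a coefficient. This is an assumption-laden illustration rather than the AdaptVis implementation: the function `steer_image_attention` and the coefficient `alpha` are hypothetical names, and how a real method chooses the coefficient is not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def steer_image_attention(logits, n_image, alpha):
    """Scale image-token scores: alpha > 1 sharpens, alpha < 1 smooths."""
    steered = logits.copy()
    steered[..., :n_image] *= alpha
    return softmax(steered)

def image_entropy(attn, n_image):
    """Entropy of the attention distribution restricted to image tokens."""
    p = attn[..., :n_image]
    p = p / p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
n_image, n_text = 16, 4
logits = rng.normal(size=(1, n_image + n_text))

base = image_entropy(softmax(logits), n_image)
sharp = image_entropy(steer_image_attention(logits, n_image, 2.0), n_image)
smooth = image_entropy(steer_image_attention(logits, n_image, 0.5), n_image)
print(sharp, base, smooth)
```

Lower entropy means attention is concentrated on fewer patches. Sharpening helps only if those patches are the right ones, which is why the choice of coefficient matters.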
However, AdaptVis is not a complete solution. As the method operates purely at inference time, it cannot change the model’s internal representation or how the spatial relations are learned in the first place. If the model never learned concepts like “in front of” vs. “behind,” simply pointing its attention more precisely will not fully close those gaps. In other words, attention steering helps the model make better use of what it already knows, but by itself, it cannot guarantee the richer, human-like spatial reasoning that many applications require.
The methods introduced so far (embedding norm normalization and AdaptVis) can be viewed as post-hoc fixes. They operate on top of an already trained architecture, leaving both the original model design and its pretraining objectives unchanged. While these approaches do improve performance, a growing body of work argues that true spatial awareness cannot simply be patched in with auxiliary mechanisms. Instead, the problem lies in how these models are trained from the beginning.
The dominant paradigm for large-scale VLM training is Contrastive Language-Image Pretraining (CLIP)
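For reference, the contrastive objective can be sketched in a few lines. This is a minimal NumPy version of a symmetric image-text contrastive loss, assuming one embedding vector per image and one per caption; the function name `contrastive_loss`, the temperature of 0.07, and the random embeddings are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over image-text cosine similarities."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    # Matching pairs sit on the diagonal of the similarity matrix
    image_to_text = -np.diag(log_softmax(logits, axis=1)).mean()
    text_to_image = -np.diag(log_softmax(logits, axis=0)).mean()
    return float((image_to_text + text_to_image) / 2)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 16))
text_emb = image_emb + 0.1 * rng.normal(size=(4, 16))  # roughly aligned pairs

matched = contrastive_loss(image_emb, text_emb)
mismatched = contrastive_loss(image_emb, np.roll(text_emb, 1, axis=0))
print(matched, mismatched)
```

The loss compares a single global vector per image with a single vector per caption. Nothing in it explicitly asks where objects are, only whether the image and the caption match as a whole.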
To address this limitation at its source, recent work proposes Contrastive Localized Language-Image Pretraining (CLOC)
But this kind of localized pretraining comes with a substantial cost. To use it, we would need to retrain the visual backbone from scratch, which is computationally expensive and time-consuming (especially for state-of-the-art VLMs that already represent years of training effort). This motivates a parallel line of research focusing on modular interventions: methods that aim to inject spatial awareness into existing, frozen models without full retraining. One example is Positional Insert (PIN)
These methods represent an important first step toward models that are natively capable of object localization, but they also highlight several remaining fundamental challenges. For unified models like CLOC, beyond the cost of retraining, a central question is how much spatial information can be injected without affecting global semantic understanding. It remains unclear how to balance learning what and learning where within a single system: how much representational capacity and attention should be devoted to semantics vs. spatial structure, and how that balance should adapt across different levels of visual complexity. In modular designs, maintaining strong semantic experts while adding dedicated spatial experts increases computational and memory overhead. Because these approaches rely on frozen backbones, they also limit how deeply spatial information can be integrated into the visual feature hierarchy; spatial cues are often introduced only at later stages (for example, at the decoder), constraining the model’s ability to form truly grounded representations.
Taken together, these challenges suggest that localized and modular interventions are promising but unlikely to solve spatial reasoning on their own. Overcoming the “what” vs. “where” divide will likely require coordinated advances in model architecture, training objectives, supervision signals, and other parts of the learning pipeline, rather than isolated fixes applied at a single point.
Stepping back, a consistent picture emerges: today’s VLMs are very good at answering “What’s in the image?” but far less reliable at answering “Where is it relative to that?” This weakness does not disappear with more data or larger models. Instead, it traces back to how these systems are built and trained. Transformers are naturally position-agnostic, positional encodings get drowned out by rich semantic features, and training pipelines reward capturing the overall gist over spatial precision.
Throughout this post, we have seen several promising attempts to narrow this gap: more expressive positional encodings (from APE to RPE to RoPE and M-RoPE) to better preserve the 2D layout of images, embedding norm normalization to help positional signals compete with loud semantic embeddings, attention-steering methods to guide models to look at the right regions, and training strategies to push models to encode not only what objects are present, but also where they are in the scene.
Still, none of these approaches fully solves spatial reasoning on its own. Spatial awareness does not emerge from plugging in a single clever module; it arises from the interaction among the architecture, attention, and training objectives. Moving toward spatially-aware VLMs will likely require these strands to work together, along with a deeper understanding of each component: positional schemes that respect 2D (and ultimately 3D) structure, mechanisms that reliably guide attention to the right evidence, and objectives that genuinely demand spatial precision. To develop agents that can act safely and effectively in the physical world, we must treat where not as a byproduct of what, but as a first-class goal of visual intelligence.