This blog post explores what it truly means to change knowledge inside a neural network. Unlike symbolic systems, large language models do not store facts in explicit locations; they implement them through distributed geometric transformations. Editing a model therefore reshapes regions of its activation space, alters relational structures, and sometimes shifts broader behavioral tendencies. We examine how local edits differ from global ones, why forgetting resembles suppression rather than deletion, and how repeated modifications can change a model’s identity. By framing model editing as a philosophical and structural question rather than a purely technical procedure, this piece highlights the need to evaluate edits not only for local correctness but also for their impact on coherence, ontology, and long-term behavior.
Large language models can now be adjusted. They can be corrected after deployment, fixed for safety, or guided towards new behaviors without needing complete retraining. This raises a fundamental question that spans machine learning, knowledge theory, and philosophy: What does it mean to change knowledge within a neural network? Traditional software updates a record in a database. Humans change beliefs through reasoning, feelings, and contradictions. Neural networks do neither. They do not store symbols, explicit beliefs, or lookup tables. Instead, knowledge is spread out, intertwined, and geometric.
So when we edit a model, what exactly are we changing, and what else changes with it?
This blog post offers a framework for these questions. It is grounded in the technical features of neural networks but also invites philosophical reflection.
Before exploring how knowledge changes, we need to understand what it is.
In symbolic systems, knowledge exists as explicit, addressable entries: records, rules, and lookup tables that can be read and overwritten directly.
In a neural network, there is no specific spot for “The Eiffel Tower is in Paris.” Instead, knowledge arises from a transformation represented in the weights: the network reliably maps prompts about the tower onto activations that produce “Paris.”
A useful summary:
A fact is not stored; it is enacted. A network knows something because its transformations consistently bring it about.
To illustrate this, consider a straightforward example:
When a model states, “Rome is the capital of Italy,” the fact is not retrieved; it emerges from a path through activation space that consistently lands in the “Rome” area.
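To see this concretely, here is a minimal sketch using the Hugging Face transformers library (GPT-2 is an arbitrary illustrative choice). Nothing in the code retrieves a stored fact; we simply run the transformation and observe where the next-token distribution lands.

```python
# A fact as an enacted transformation, not a stored record.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of Italy is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]           # next-token logits

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx)!r}: {p.item():.3f}")
# "Rome" is not looked up anywhere; it is simply where this trajectory lands.
```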
Thus, editing is essentially a geometric process: changing a fact means changing where these paths land.
Editing a model—whether through fine-tuning, ROME-style updates, MEMIT, soft prompts, or manual weight adjustments—changes the underlying geometry of these attractors.
Two main types of changes typically occur:
Imagine a micro-manifold: a small area of activation space where similar prompts converge. For example, questions like “Where is the Eiffel Tower?”, “Which city is the Eiffel Tower in?”, and “The Eiffel Tower is located in…” all fall into a similar neighborhood.
A local edit alters only this specific area, leaving the surrounding activation space largely intact. Methods like ROME and MEMIT aim to work here.
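To make “local edit” concrete, here is a deliberately simplified rank-one update in the spirit of ROME. It is not the actual algorithm (real ROME locates the edit layer causally and weights the update by an estimated key covariance); `W`, `k_star`, and `v_star` are hypothetical stand-ins for an MLP projection, the key activation of the target prompt, and the desired output value.

```python
# Simplified rank-one edit in the spirit of ROME (illustrative, not the real method).
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))       # hypothetical MLP projection matrix
k_star = rng.normal(size=d)       # key activation for the prompt being edited
v_star = rng.normal(size=d)       # value vector that produces the new output

# Rank-one update: forces W_new @ k_star == v_star while leaving directions
# orthogonal to k_star untouched (real ROME weights this by a key covariance).
W_new = W + np.outer(v_star - W @ k_star, k_star) / (k_star @ k_star)

assert np.allclose(W_new @ k_star, v_star)        # edited prompt now lands at v_star
k_other = rng.normal(size=d)
k_orth = k_other - (k_other @ k_star) / (k_star @ k_star) * k_star
assert np.allclose(W_new @ k_orth, W @ k_orth)    # orthogonal keys are unchanged
```

The update is the geometric picture in miniature: one direction of the weight matrix is bent toward a new landing place, and everything orthogonal to it is left alone.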
Some relationships are globally structured.
Editing a relation like (Italy, capital, X) can have broader impacts, because the same relational structure helps encode many other country→capital pairs.
This constitutes a relational edit, rather than a local one. You reshape an entire conceptual subspace.
This is similar to full relation editing or changing an embedding direction like “country→capital.”
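Here is a toy sketch of what a distributed relational edit touches, with made-up embeddings: if the country→capital relation is (approximately) a shared offset direction, then perturbing that single direction moves every pair at once.

```python
# Toy sketch of a shared relational direction (all names and vectors are made up).
import numpy as np

rng = np.random.default_rng(1)
d = 16
emb = {name: rng.normal(size=d) for name in
       ["Italy", "Rome", "France", "Paris", "Japan", "Tokyo"]}
pairs = [("Italy", "Rome"), ("France", "Paris"), ("Japan", "Tokyo")]

# Approximate the country->capital relation as the mean offset across pairs.
direction = np.mean([emb[cap] - emb[country] for country, cap in pairs], axis=0)

# A distributed edit perturbs this one shared direction...
delta = 0.1 * rng.normal(size=d)
direction_edited = direction + delta

# ...and thereby shifts the predicted capital for *every* country at once.
for country, cap in pairs:
    pred_old = emb[country] + direction
    pred_new = emb[country] + direction_edited
    print(country, "->", np.linalg.norm(pred_new - pred_old))  # same shift for all
```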
In summary:
Local rewrites fix a pocket of meaning.
Distributed edits alter the semantic landscape.
Both are important and can create contradictions.
Human beliefs can be modular and inconsistent, while neural networks are more globally connected.
This creates a challenge: an edit aimed at one belief can ripple into regions it was never meant to touch.
For example, rewriting one country’s capital may subtly shift how the model handles geography in general.
A helpful distinction is between epistemic content (what the model asserts) and ontology (how it organizes concepts).
A good edit should change epistemic content without harming ontology.
In simpler terms: edit the belief, not the overall worldview.
Current methods often struggle to maintain this balance.
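One way to check the balance empirically is a collateral-drift probe: compare the model’s next-token distributions on deliberately unrelated prompts before and after the edit. The sketch below assumes PyTorch-style causal LMs; `model_before` and `model_after` are placeholders for your pre- and post-edit checkpoints.

```python
# Sketch of a "collateral drift" check: did an edit disturb unrelated behavior?
import torch
import torch.nn.functional as F

def next_token_drift(model_before, model_after, tokenizer, prompts):
    """Mean KL divergence between pre- and post-edit next-token distributions."""
    kls = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            log_p = F.log_softmax(model_before(**inputs).logits[0, -1], dim=-1)
            log_q = F.log_softmax(model_after(**inputs).logits[0, -1], dim=-1)
        # KL(p || q): how far the edited model departs from the original here.
        kls.append(F.kl_div(log_q, log_p, log_target=True, reduction="sum").item())
    return sum(kls) / len(kls)

unrelated_prompts = [
    "Water boils at a temperature of",
    "The author of Hamlet is",
    "Two plus two equals",
]
# drift = next_token_drift(model_before, model_after, tok, unrelated_prompts)
```

Low drift on unrelated prompts is evidence that the edit stayed epistemic; high drift suggests the ontology moved too.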
For humans, forgetting can take various forms: memories fade with time, are overwritten by new experiences, or are actively suppressed.
In neural networks, forgetting is a strange idea.
There is no slot to erase or pointer to zero out. Knowledge is redundantly encoded across many weights, layers, and pathways.
To forget the fact: “The Eiffel Tower is in Paris,” you would need to disrupt all the attractor pathways leading to “Paris.”
But because the encoding is redundant and spread across layers, a single update rarely wipes out all paths.
This explains why edited facts often resurface when the model is probed with a rephrased or indirect prompt.
So forgetting in neural networks isn’t about destruction; it’s about reducing the chance that a belief comes back.
Philosophically, this is more like suppression than deletion.
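This suggests a simple suppression test: rather than asking the edited prompt once, probe many paraphrases and track the highest probability the model still assigns to the “forgotten” answer. A sketch, assuming the same transformers setup as earlier:

```python
# Sketch: check whether a "forgotten" fact resurfaces under paraphrase.
# `model` and `tok` are any causal LM and tokenizer (e.g. loaded as above).
import torch

def resurfacing_prob(model, tok, prompts, target=" Paris"):
    """Max next-token probability of the suppressed target across paraphrases."""
    target_id = tok.encode(target)[0]
    worst = 0.0
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits[0, -1], dim=-1)
        worst = max(worst, probs[target_id].item())
    return worst  # a high value on *any* phrasing means the belief persists

paraphrases = [
    "The Eiffel Tower is located in",
    "You can visit the Eiffel Tower in the city of",
    "Q: Which city is the Eiffel Tower in? A:",
]
# print(resurfacing_prob(edited_model, tok, paraphrases))
```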
Model identity isn’t defined by the dataset but by stable behavioral tendencies: tone, reasoning style, values, and characteristic ways of organizing concepts.
Editing can change these.
Small edits (like a single factual relationship) usually do not alter identity. However, large or repeated edits, such as changes to tone, worldview, or moral views, can create a distinctly different agent.
Example: edit a neutral assistant’s tone repeatedly until it is consistently sarcastic; at some point it reads as a different agent.
This relates to the Ship of Theseus problem:
After many edits, is it still the same model?
From an engineering perspective, it is the same artifact: the same architecture with versioned weights and a continuous deployment history.
From a philosophical perspective, the answer depends on whether identity lives in the parameters or in the behavior they produce.
A theory of editing matters because it relates to:
Local edits may unintentionally create global contradictions.
Changing epistemic content could distort ontological foundations, altering what the model values.
Editing enables us to examine the model’s structure: we find out which relationships are fragile or deep.
If a model is part of a workflow or agent system, shifts in identity can undermine trust.
Editing provides insight into how neural networks represent meaning and change knowledge.
Here is a taxonomy, with examples and connections to known methods:
Type-1. Local Semantic Rewrites
Small, targeted changes to micro-manifolds. This roughly corresponds to methods like ROME or MEMIT that aim for precise, fact-level edits.
Type-2. Distributed Relational Shifts
Edits that affect entire relational subspaces (for example, altering all country→capital pairs). These can be seen in behavior changes after broader fine-tuning or multi-example edits.
Type-3. Ontological Reorientations
Changes that affect how the model organizes concepts (for instance, safety-alignment fine-tunes that reshape moral or intentional concepts). These often occur as a side effect of broad, domain-specific training.
Type-4. Identity-Level Modifications
Edits that change persona, reasoning style, or core behavioral tendencies. Example: turning a neutral assistant into one that is consistently sarcastic. This is typical in instruction tuning, reinforcement learning from human feedback, or safety training.
This taxonomy is conceptual, but it connects to real methods.
We circle back to the question:
What does it mean to change knowledge in a neural network?
The refined answer: to change knowledge in a neural network is to reshape the geometry of its transformations, relocating local attractors, bending relational subspaces, and, when edits accumulate, shifting the model’s identity.
A practical implication: evaluating model edits should assess not only factual accuracy but also identity shifts and ontological changes.
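Here is a sketch of what such an evaluation harness might look like, with illustrative probe categories (efficacy, generalization, specificity, identity) loosely modeled on how methods like ROME are evaluated. `generate` is a stand-in for whatever decoding wrapper surrounds the edited model; the prompts and expected strings are made up.

```python
# Sketch of a broader edit-evaluation harness (probe categories are illustrative).
PROBES = {
    "efficacy":       [("The capital of Italy is", "Rome")],
    "generalization": [("Italy's capital city is called", "Rome")],
    "specificity":    [("The capital of France is", "Paris")],  # must NOT change
    "identity":       [("Tell me a joke about cats.", None)],   # judged for tone, not truth
}

def score(generate, probes):
    """Fraction of probes passed per category; identity probes need separate judging."""
    results = {}
    for category, items in probes.items():
        hits, total = 0, 0
        for prompt, expected in items:
            if expected is None:
                continue  # identity probes require human or model-based judgment
            hits += int(expected in generate(prompt))
            total += 1
        results[category] = hits / total if total else None
    return results

# results = score(lambda p: my_edited_model_generate(p), PROBES)
```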
Ultimately, model editing is not just a technical tool. It is a philosophical act involving decisions about how an artificial mind should evolve.