Contextual Overlap: The Transformer, Attention Mechanism And Non-Classical Spaces
I tend to take a 10,000-foot view of most things in the digital world.
Whether it’s my mathematics background or my generally abstract methodology, it’s incredibly hard for me to play the game without seeing the whole board, so to speak.
As mentioned in my first post on this site, the connection between search (and transitively AI/LLM spaces) and the non-classical/quantum world was something I’ve been tracking for some time now.
It was one of those things that’s hard to see unless you’re looking through a specific lens or from a specific height, and not something one can openly discuss without the usual eye-rolling most folks have when they see the word “quantum” in discourse.
Because of the complexity (and largely mixed interpretations) of the quantum field, it’s all too easy to fall into the “mystic” or philosophical trappings that typically accompany related conversations.
The connection between search and quantum/non-classical isn’t mystical or philosophical. There is no physical quantum phenomenon here – simply behavior so similar to it that I wasn’t alone in seeing this path develop over the years.
To me, it was like seeing the block/cube that lines up with the square opening on those old-school toys we had growing up – it still feels like a glove perfectly fitting a hand the more you make the comparisons.
But why?
Why on earth would search/retrieval (and eventually AI/LLMs), through years of design and engineering, start behaving more and more like non-classical systems over time?
The origin story here truly begins before the transformer (we’ll get to that later). I believe it started when search moved into tokenization and vector representation.
Vector Spaces In Search
The search index.
A collection of resources that creates the foundation of search. Search engines crawl, store and organize online resources from the Open Web to create an index of retrievable items to present for relevant searches.
This is a shorthand way of describing things (one that doesn’t do search engineering justice), but for the purposes of this post, it should suffice.
In order to maintain such a large collection of resources, search engines need efficient ways of storing and retrieving them – one of those ways is the vector space model.
Vector space models are just mathematical representations of documents and language through the use of vectors (I leave this link here for those unfamiliar with vectors).
Items in a vector space are represented by a numerical set of coordinates which indicate a relative position in that space.
Vectors that are close to each other can represent documents and queries that may be closely related – or relevant to each other.
A query for “apple” gets turned into a vector (a set of numerical coordinates) and compared to the other vectors in that space; the document vectors closest to it become the set of relevant documents presented in a search result.
As you can imagine, vector space models can drastically increase the speed of retrieving a set of documents – simply compare how close the query vector sits to each document vector (instead of analyzing every document word for word), and pull the closest ones.
Again, this is wildly oversimplifying this process, but the basic essence is there even in complex systems.
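To make that comparison concrete, here’s a minimal sketch in Python. The vocabulary, vectors and document names below are invented for illustration; a real engine uses learned embeddings and approximate nearest-neighbor indexes rather than a brute-force scan like this.

```python
import numpy as np

# Toy document vectors – in a real engine these would be learned
# embeddings stored in an index, not hand-written coordinates.
doc_vectors = {
    "apple_pie_recipe": np.array([0.9, 0.1, 0.0]),
    "iphone_review":    np.array([0.7, 0.6, 0.1]),
    "gardening_guide":  np.array([0.2, 0.0, 0.9]),
}

def cosine_similarity(a, b):
    # How closely two vectors point in the same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query_vector, k=2):
    # Score every document against the query and return the k closest.
    scored = [(name, cosine_similarity(query_vector, vec))
              for name, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# A query for "apple", embedded as a (made-up) vector.
query = np.array([0.8, 0.3, 0.05])
print(retrieve(query))
```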
These vector space models didn’t just make things more efficient – they became the precursor to the transformer and attention mechanism, which power the modern AI/LLM search spaces.
On The Transformer: Attention
Getting into the exact details of the transformer and attention mechanism is best left to the original paper: Attention Is All You Need, but I’ll give a brief overview below.
Let’s use a simple example sentence of “The brown fox jumped over a fence.” and watch it move through the transformer.
The transformer process begins with tokenization, turning words into numerical representations – each word (or subword) gets its own unique value, and those values form the base vocabulary.
After tokenization, these values are pushed into an embedding space, which turns them into high-dimensional vectors (this is where the high-dimensional vector space model described above starts to overlap).
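Here’s a toy sketch of those two steps – tokenization, then embedding lookup. The vocabulary is built from the example sentence alone and the embedding table is random; in a real transformer the vocabulary comes from a trained subword tokenizer and the embeddings are learned.

```python
import numpy as np

sentence = "the brown fox jumped over a fence ."  # period pre-split for simplicity

# Toy vocabulary: each word gets a unique integer ID. Real tokenizers
# use subwords and vocabularies of tens of thousands of entries.
words = sentence.split()
vocab = {word: idx for idx, word in enumerate(words)}
token_ids = [vocab[word] for word in words]

# Embedding table: one high-dimensional vector per token ID.
# Random here; in a trained model these vectors are learned.
embed_dim = 8  # real models use hundreds or thousands of dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embed_dim))

embeddings = embedding_table[token_ids]  # shape: (num_tokens, embed_dim)
print(token_ids)         # [0, 1, 2, 3, 4, 5, 6, 7]
print(embeddings.shape)  # (8, 8)
```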
These tokens then enter the heart of the transformer – the attention mechanism.
“Attention” here means that the mechanism looks at each token and compares it to every other token in the sentence, learning a contextual (there’s that word again) relationship between each pair of words.
For example, the word “brown” in the sentence above would be closely related to the word “fox” (since brown describes fox), giving it a higher relative attention score in the new contextual representation.
The true genius of the transformer is that it can apply this attention across very large sequences of text, allowing it to find relationships between words, concepts or whatever else is being represented, even across very large distances. In other words, it finds non-local (outside of immediate proximity) connections between words, concepts or anything else digested by the transformer.

These relationships are learned through training over large collections of documents and text – the larger the corpus, the more relationships it can potentially make.
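For the curious, here’s a bare-bones sketch of that token-versus-token comparison – a single head of the scaled dot-product attention from the paper above. The embeddings are random (untrained), and the learned query/key/value projections of a real transformer are omitted to keep it short.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention over token embeddings X (tokens x dims).

    Every token is scored against every other token, near or far –
    this all-pairs comparison is what makes the connections non-local.
    """
    d = X.shape[-1]
    # A real transformer projects X into learned queries, keys and values;
    # here we use X directly for all three to keep the sketch minimal.
    scores = X @ X.T / np.sqrt(d)                   # token-vs-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention
    return weights @ X, weights                     # contextual vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8))  # 8 tokens ("The brown fox ..."), 8 dimensions
contextual, attn = self_attention(X)
print(attn[1].round(2))  # how much token 1 ("brown") attends to each token
```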
The more it finds certain words or phrases “in context” of other words or phrases, the more likely it will correctly recognize and use those words when requested in similar contexts.
This contextual, non-local representation then enables it to generate text on the output side – answering queries or prompts based on similar contexts.
A Quick Example
“Where would you find jaguars?”
Since the words “Brazil”, “Peru” and “Bolivia” frequently appear alongside the word “jaguar” across many sources, documents and text, the transformer might respond with “Brazil, Peru or Bolivia”.
Over time, this contextual, non-local representation space forms something akin to the high-dimensional vector space model above – more intricate, but very similar. This space is so high-dimensional that it’s impossible for us to visualize (with our 3-dimensional imaginations), but that dimensionality allows connections to be made that would be difficult for us to find on our own.
Instead of a raw comparison of a query and a document, it uses the attention mechanism to evaluate a prompt or query and generate a response based on the geometry of the contextualized embedding space.
The output starts with the most likely word vector to open a sentence (computed through matrix mathematics and some other mathematical machinery). That vector is fed back into the mechanism to decide the next most relevant word vector (through the same process), which is appended to the first, and the cycle repeats until a coherent response is complete.
The word vectors are then converted back into tokens (numerical representations) and translated into readable words from the vocabulary, based on the original numerical mapping mentioned above.
The output is presented to the user and the thread continues.
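That loop can be sketched in a few lines. The toy_model below is a hypothetical stand-in for the entire transformer stack, and real systems typically sample from the output distribution rather than always taking the single most likely token.

```python
import numpy as np

def generate(model, prompt_ids, eos_id, max_tokens=50):
    # Greedy autoregressive decoding: pick a token, append it,
    # feed the extended sequence back in, repeat.
    token_ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = model(token_ids)         # score every candidate token
        next_id = int(np.argmax(logits))  # greedy: take the most likely
        token_ids.append(next_id)         # append and feed back in
        if next_id == eos_id:             # stop at end-of-sequence
            break
    return token_ids

# Hypothetical stand-in model: favors token 3, then the end token (9).
def toy_model(token_ids):
    logits = np.zeros(10)
    logits[3 if len(token_ids) < 5 else 9] = 1.0
    return logits

print(generate(toy_model, prompt_ids=[1, 2], eos_id=9))  # [1, 2, 3, 3, 3, 9]
```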
Context Sensitivity
In the example above, if that particular prompt “Where would you find jaguars?” were changed slightly or entered in a different context, it’s likely that the response would change (sometimes dramatically).
Say a particular user has been in a chat session about car parts, selling cars or luxury vehicles. They then enter that prompt about jaguars in this particular, context-specific subspace.
It’s likely that the shape of this subspace would invoke a response about Jaguar cars, not the animal.
The response might be a list of luxury car dealerships or a set of Jaguar cars for sale from a particular website – or perhaps even a clarification on which jaguars they want.
Whatever the new response is, it will differ across context settings – and across users – shaped by the history surrounding each particular user, their prompt selection and their input context.
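Here’s a deliberately tiny illustration of that pull. Every vector below is invented, and real context conditioning happens through attention over the full chat history rather than a simple blend – but the geometric effect is the same: context moves the prompt into a different region of the space.

```python
import numpy as np

# Invented 2-D "meaning" axes: the first leans animal, the second leans cars.
jaguar_prompt = np.array([0.6, 0.4])  # slightly animal-leaning on its own
regions = {
    "animal": np.array([1.0, 0.0]),
    "cars":   np.array([0.0, 1.0]),
}
car_context = np.array([0.0, 1.0])    # prior chat about luxury vehicles

def nearest_region(vec):
    # Pick whichever region vector the input points toward most.
    return max(regions, key=lambda name: np.dot(vec, regions[name]))

print(nearest_region(jaguar_prompt))               # -> animal
shifted = 0.4 * jaguar_prompt + 0.6 * car_context  # context pulls the prompt
print(nearest_region(shifted))                     # -> cars
```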
Non-Classical Vector Spaces & Non-Local Effects
Non-classical (quantum) spaces are often represented by high dimensional spaces (a very specific type of space I’ll discuss in a future post), similar to the spaces described above.
These spaces describe sub-atomic objects – electrons and the like, “the little things” – as the vectors in the vector space. The high-dimensional vectors in this space represent the possible states each object can be found in.
These sub-atomic objects can become “entangled” – a phenomenon only found in the non-classical world.
Entanglement essentially means that two objects are connected in a way that is inseparable — measuring one means measuring both – no matter how far away they are; even light-years away.
Measuring a single property (measurement context) of one, instantly correlates with knowing that property of the other entangled object.
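For readers who want to see those correlations in numbers, here’s a small classical simulation of measuring a Bell pair – the simplest entangled state – in Python. It only reproduces the outcome statistics on an ordinary computer; no actual quantum hardware is involved.

```python
import numpy as np

# Bell state (|00> + |11>) / sqrt(2): two entangled qubits.
# Amplitudes are indexed by the joint outcomes 00, 01, 10, 11.
amplitudes = np.array([1, 0, 0, 1]) / np.sqrt(2)
probabilities = np.abs(amplitudes) ** 2  # Born rule: |amplitude|^2

# Sample many joint measurements (a classical simulation of the statistics).
rng = np.random.default_rng(0)
outcomes = rng.choice(["00", "01", "10", "11"], size=10_000, p=probabilities)

# The two qubits always agree: measuring one tells you the other,
# no matter how far apart the two qubits are.
print(set(outcomes))  # {'00', '11'} – never '01' or '10'
```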
This is the “non-local” behavior and correlation that overlaps with the transformer and attention mechanisms above.
Attention As A Conceptual, Contextual Entanglement Engine
When word tokens are cast (embedded) into high-dimensional representations, the attention mechanism in the transformer takes advantage of that dimensionality to find connections between words that would otherwise go unfound in their original, one-dimensional token form – and it represents them in a new, contextual high-dimensional state.
One can argue this attention mechanism acts as a conceptual, contextual “entanglement” engine for meaning – entangling words across different contexts, with those correlations revealed when measured (through prompting) in similar context situations.
“Jaguars” become entangled with the words “Brazil”, “Peru” and “Bolivia” through attention and training, and measuring them in the context of “where” or “location” means knowing both (represented in the response together) – no matter where the individual words exist in the ambient vector space.
This is a rough analogy (with more notes to come), but the idea remains – both entanglement and the attention mechanism create inseparable, contextually driven correlations across distances.
Take A Breath
Admittedly, this was a much longer post than anticipated, so if you made it this far you’re doing well.
The biggest takeaway is that search, AI/LLM and non-classical spaces share many of the same behaviors because they’re largely constructed in the same way.
They’re on parallel tracks from two different, distinct fields, now correlated (entangled?) through this post.
I was not the first to make these connections (far from it), but if you work or live in the search world, hopefully you now see and understand why I believe the classical measurements we’ve come to know in that world are getting increasingly unreliable.
It’s not that the tools have changed, it’s the space that they’re measuring.
And it’s not mystical or philosophical — just non-classical behavior caused by the representation and inherent geometry of these new spaces.
More notes next week.

