Why a 1980s algorithm from Cambridge still matters
There’s a moment that anyone who’s worked with survey data at scale will recognise: when you realise that the numbers alone aren’t enough. The scores tell you something has changed, but it is the text responses that tell you what that something is.
But there is a problem with all those free-text responses: people don’t write in neat, clean, structured, conveniently identical ways. They write the way they think. One person writes “I’m frustrated by the lack of communication.” Another writes “communications are frustrating.” A third writes “we were never communicated with.” A fourth uses “communicating” as a verb and “communication” as a noun in the same sentence, all meaning slightly different things each time.
To a human reader, those are obviously all the same topic. To a computer, they are four entirely different words. And that, in a sentence, is the problem that Natural Language Processing has been trying to solve for well over half a century.
The problem of word forms
When we analyse open-text responses at SocialOptic, be it ten thousand employee engagement comments, a hundred thousand consultation responses, a dozen 360 degree feedback comments, or a single inspection report, the first challenge isn’t understanding the meaning. It is something far more basic: recognising that words which look different can actually be the same.
Think of a survey where a number of the people responding mention management. Some write “managed,” others “managing,” others “management,” “managers,” or just “manage.” If a system treats each of those as a separate item, it will scatter what should be a single, significant theme across five separate ones in the analysis. The signal gets diluted down to a tiny fraction of what it was, and the insight gets lost.
This isn’t a trivial problem. In a large-scale survey, it’s entirely normal for us to encounter hundreds of thousands of individual words across the full set of responses. Without some way of grouping related word forms together, the analysis would be overwhelmed by these surface-level variations before it ever got to meaning.
There are two main approaches to solving this, and they are worth understanding. You don’t need to become a computational linguist, but it is useful to know what you are getting, because the choice between the approaches used affects the type of insights you get from your survey data, and some platforms fall short on the basics.
Stemming: the pragmatic approach
The older and simpler approach is called stemming. A stemmer strips suffixes from words to reduce them to a common root. That is the “stem.” It does this by applying a set of rules, very mechanically, and without any understanding of what the word actually means. The logic is actually rather beautiful.
Take the word “connection.” A stemmer would strip away the suffix “-ion” and produce “connect.” Feed it “connected,” “connecting,” “connections,” and “connective,” and it’ll reduce all five to the same stem. That’s actually quite useful. It means that when a dozen respondents write about feeling “disconnected,” “disconnection from leadership,” and “there’s no connection between what we’re told and what happens,” the platform can group them together.
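To make the mechanics concrete, here is a minimal Python sketch of rule-based suffix stripping. This is not the Porter algorithm itself, just a deliberately crude illustration with a hand-picked suffix list, but it shows how a handful of mechanical rules can collapse the “connection” family onto one stem:

```python
# A crude rule-based stemmer sketch (NOT the full Porter algorithm):
# strip one of a few common suffixes, longest first, with a minimum
# stem length so short words like "red" are left alone.

SUFFIXES = ["ions", "ing", "ion", "ive", "ed", "s"]

def crude_stem(word):
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connection", "connected", "connecting", "connections", "connective"]
print({w: crude_stem(w) for w in words})
# every word in the family reduces to "connect"
print(crude_stem("red"))  # "red" survives, thanks to the length guard
```

With five rules and no dictionary, five surface forms become one countable theme, which is exactly the grouping effect described above.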
But stemming has a blunt edge. Because it works by rules rather than understanding, it sometimes produces stems that aren’t real words. The Porter stemmer, which is the most widely used stemming algorithm for English, and one we’ll come back to, reduces “argue,” “argued,” “argues,” and “arguing” to “argu.” Not a word, but that doesn’t matter for the purpose of grouping. More problematically, it can sometimes group words that shouldn’t be grouped: “universal,” “university,” and “universe” all stem to “univers,” despite having quite different meanings in most contexts. And that is a problem.
The technical term for this is over-stemming. This happens when words that are etymologically related but semantically distinct (sorry, two of my favourite words) get collapsed into the same bucket. There’s also under-stemming, where words that should be grouped together are not. English, with its habit of borrowing irregularly from Latin, French, and Germanic roots, and anything else that happened to be passing through, gives stemmers a particularly hard time. “Alumnus” and “alumni” are clearly the same word in singular and plural, for our Latin scholars, but they don’t stem to the same root in most algorithms, because Latin plurals don’t follow English suffix rules.
Moving to a Snowball
The history here is worth a brief detour, because it tells you something about how the tools we rely on today came into existence.
In 1980, a British computer scientist named Martin Porter, then at the University of Cambridge, published a paper called “An algorithm for suffix stripping” in the journal Program. It was a modest paper, describing a five-step process for removing common English suffixes. It has since been cited over eight thousand times and remains one of the foundational texts of information retrieval.
The algorithm works by computing what Porter called the “measure” of a word. This is essentially a count of the vowel-consonant sequences in whatever remains after a suffix is removed. If the remaining stem is long enough (the measure exceeds a threshold), the suffix is stripped. If the stem would be too short, the suffix stays. While that sounds a bit complex, this is actually an elegantly simple idea: it means that short words are left alone (you don’t want to strip “ed” from “red”) while longer words are safely reduced.
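The measure itself is simple enough to sketch in a few lines of Python. This is a simplified reading of Porter’s definition (the full algorithm layers suffix rules on top of it): classify each letter as a vowel or a consonant, with “y” counting as a vowel when it follows a consonant, collapse runs, and count the vowel-to-consonant transitions.

```python
# Simplified sketch of Porter's "measure" m: a word has the form
# [C](VC)^m[V], so m is the number of VC transitions after collapsing
# runs of vowels and consonants.

VOWELS = set("aeiou")

def letter_types(word):
    """Label each letter V or C; 'y' after a consonant counts as a vowel."""
    types = []
    for i, ch in enumerate(word.lower()):
        if ch in VOWELS or (ch == "y" and i > 0 and types[i - 1] == "C"):
            types.append("V")
        else:
            types.append("C")
    return types

def measure(word):
    # collapse consecutive repeats, then count V->C transitions
    collapsed = []
    for t in letter_types(word):
        if not collapsed or collapsed[-1] != t:
            collapsed.append(t)
    return sum(1 for a, b in zip(collapsed, collapsed[1:])
               if a == "V" and b == "C")

# Examples from Porter's 1980 paper:
print(measure("tree"))     # 0 - too short to strip anything
print(measure("trouble"))  # 1
print(measure("oaten"))    # 2
```

A rule like “remove ‘-ion’ only if the remaining stem has measure greater than one” is then just a comparison against this count.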
What made the Porter stemmer so influential wasn’t just its effectiveness, it was this simplicity. It could be implemented in a few pages of code and ran quickly even on the computing hardware of the early 1980s, where I started my journey. It became the default stemming algorithm for English-language search engines, academic research systems, and, eventually, the kinds of text analysis tools that underpin platforms like SurveyOptic.
But Porter’s original algorithm had a problem: it existed only as a description in a journal paper, and the various implementations that other people built from that description frequently contained errors. Some were subtle, such as off-by-one mistakes in the suffix rules that only showed up with unusual words, but the cumulative effect was that “Porter stemming” meant slightly different things depending on whose code you were running.
To solve this, Porter created Snowball. This was a small programming language designed specifically for writing stemming algorithms. Snowball allowed stemmers to be defined precisely and unambiguously, and then compiled into fast code in C, Java, Python, and a host of other languages. The improved English stemmer written in Snowball, which is sometimes called Porter2, fixed several known issues with the original algorithm, including better handling of short words ending in “e” and “y,” and improved treatment of the suffix “-ly.”
Just as importantly, Snowball made it straightforward to write stemmers for other languages. There are now Snowball stemmers for more than twenty languages, from Arabic and Armenian to Turkish. This matters in a UK context more than you might think: in any large public consultation or employee survey run by a national organisation, responses may arrive in Welsh, Scots Gaelic, Polish, Urdu, or any number of other languages spoken across these fair islands. Multi-language stemming isn’t just an academic curiosity; it is also a practical necessity.
Porter named Snowball as a tribute to SNOBOL, a string-handling programming language from the 1960s. There is a pleasing echo in the metaphor: a snowball that grows as it gathers contributions from the community. Porter himself retired from active development in 2014, and Snowball is now maintained as an open-source community project.
Lemmatisation: the linguistically precise approach
There are other approaches, and we also use an alternative: lemmatisation. Where a stemmer chops suffixes using rules, a lemmatiser consults a dictionary. It understands grammar, in the sense that it knows “better” is a form of “good,” “was” is a form of “be,” and “mice” is the plural of “mouse.” (And “sheep”? Well, sheep are just sheep.) While a stemmer would reduce “better” to “bet” or “bett,” a lemmatiser returns “good,” which is obviously more reader friendly.
Lemmatisation works by first identifying the part of speech of each word in context. Is “meeting” being used as a noun (“the meeting was productive”) or a verb (“we’re meeting on Thursday”)? The answer determines the lemma: as a noun, the lemma is “meeting”; as a verb, it’s “meet.” This kind of context-sensitivity is something stemmers can’t do.
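A toy sketch makes the difference from stemming clear: a lemmatiser is, at heart, a lookup keyed on both the word and its part of speech. Real lemmatisers use large vocabulary databases and a trained part-of-speech tagger; the tiny hand-made lexicon below is purely illustrative.

```python
# Toy lemmatiser: lookup keyed on (word, part of speech).
# The entries are illustrative, not a real lexicon.

LEXICON = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("better", "ADJ"): "good",
    ("was", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
}

def lemmatise(word, pos):
    # fall back to the lower-cased word when it is not in the lexicon
    return LEXICON.get((word.lower(), pos), word.lower())

print(lemmatise("meeting", "NOUN"))  # meeting
print(lemmatise("meeting", "VERB"))  # meet
print(lemmatise("better", "ADJ"))    # good
```

The same surface form, “meeting,” yields two different lemmas depending on the tag supplied, which is precisely the context-sensitivity a stemmer cannot offer.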
The trade-off is speed and complexity. Lemmatisation requires a vocabulary database, a part-of-speech tagger, and considerably more computational effort. For a survey with a few hundred responses, the difference is negligible. For a hundred thousand responses, each containing multiple free-text answers, the choice of approach, and indeed the choice of when to use which approach, starts to have real consequences for accuracy, for processing time, and for the power consumed.
Why this matters for your survey data
If you’ve read this far, you might be wondering: why should you care about the difference between a stemmer and a lemmatiser?
The answer is that these decisions happen inside your analysis tools, whether you know about them or not. And they directly affect what you see in the results.
If your tool uses aggressive stemming, it might conflate words that shouldn’t be conflated, for example merging “care” with “career,” or “project” (the noun) with “projected” (the verb). In an employee engagement survey where “career development” and “care about staff” are both major themes, that conflation can actively mislead you.
If your tool uses lemmatisation, it will be more accurate, but only if it has been configured for the domain you’re working in. Clinical language in an NHS patient experience survey behaves differently from the language used in a corporate engagement survey, which behaves differently again from responses to a public policy consultation.
At SocialOptic, we don’t rely on a single approach. Our text analysis pipeline uses a combination of techniques. We use stemming for speed in initial frequency analysis and search indexing, lemmatisation for accuracy in analysis and theme identification, and additional layers of processing for tasks like identifying the people and places mentioned in responses, or detecting the emotional register of a comment as distinct from its factual content.
The point isn’t which technique is “better” in the abstract. It’s that the right approach depends on what question you’re trying to answer, at what scale, and with what kind of language.
What this means in practice
Let’s ground all of this in a real scenario. Imagine an NHS trust runs its annual staff survey and receives thirty thousand free-text responses across several open questions. Within those responses, the words “support,” “supported,” “supporting,” “supportive,” and “unsupported” all appear frequently.
A naive word count would treat those as five separate words. A stemmer would reduce them all to “support” – useful for knowing that support is a major theme, but you’ve now lost the crucial distinction between “I feel supported” and “I feel unsupported.” Those two sentiments are exact opposites, and conflating them is almost worse than not analysing the text at all.
A well-designed analysis pipeline handles this differently. It uses stemming or lemmatisation to identify “support” as a core theme, but it also uses sentiment analysis to distinguish positive from negative expressions of that theme. It uses the surrounding context to determine whether “support” is being praised or criticised. And it preserves the original text, so that when you drill down from the headline finding to the individual responses, you can read what people actually wrote.
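As a very rough sketch of that idea, the snippet below pairs a crude shared-stem match with a hand-written list of negation cues to split “support” mentions into positive and negative groups. A real pipeline would use a trained sentiment model and proper parsing; the cue list here is an assumption made purely for illustration.

```python
import re

# Illustrative only: detect the "support" theme via its shared stem,
# then split mentions by simple negation cues. Real pipelines use
# trained sentiment classifiers, not a hand-written cue list.

NEGATION_CUES = re.compile(
    r"\b(unsupported|no|not|never|lack(s|ing)? of|without)\b")

def support_mentions(responses):
    positive, negative = [], []
    for text in responses:
        lowered = text.lower()
        if "support" not in lowered:   # theme detection via the shared stem
            continue
        if NEGATION_CUES.search(lowered):
            negative.append(text)
        else:
            positive.append(text)
    return positive, negative

pos, neg = support_mentions([
    "I feel supported by my manager",
    "I feel unsupported",
    "there is no support available here",
])
print(len(pos), len(neg))  # 1 positive, 2 negative
```

Even this toy version keeps “I feel supported” and “I feel unsupported” on opposite sides of the ledger, which is the distinction a bare word count destroys.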
This is why we’re cautious about the idea that large language models can simply replace purpose-built text analysis tools for survey data. LLMs are extraordinary at generating and summarising text, but analysing tens of thousands of structured responses, consistently, accurately, auditably, and at a cost and environmental impact that makes sense, requires a different kind of engineering. The foundational techniques we’ve described (stemming, lemmatisation, part-of-speech tagging, sentiment classification) are many times faster, more predictable, more auditable, and have a fraction of the environmental footprint of running every response through a large generative model.
The tools behind the tools
One of the things I find fascinating about this field is how much of what we use today rests on work done many decades ago. Martin Porter’s 1980 paper is forty-six years old. The idea of lemmatisation draws on linguistic principles that predate computing entirely. The part-of-speech taggers we use are built on probability models first developed in the 1960s.
These aren’t old technologies in the sense of being obsolete. They’re old technologies in the sense of being proven, refined, optimised, and battle-tested across billions of words in hundreds of languages. When we build text analysis capabilities into SurveyOptic, we’re standing on foundations that have been peer-reviewed, openly debated, and continuously improved by a global community of researchers and practitioners for half a century. That’s not something you get from a proprietary AI API.
But it is only half the story. Because the most interesting recent developments in text analysis don’t replace these foundational techniques, they build on them.
From words to vectors: the quiet revolution
There’s a quote I come back to often, from the English journalist and satirist Malcolm Muggeridge: “All new news is old news happening to new people.”
It’s a line that was meant for politics and current affairs, but it captures something essential about the technology landscape we’re describing. Because the most exciting development in text analysis over the last decade, the one that underpins everything from modern search engines to the large language models everyone is talking about, turns out to rest on an idea from 1957. Long before even Martin Porter’s work.
The idea belongs to a British linguist called John Rupert Firth, and it’s beautifully simple: “You shall know a word by the company it keeps.”
What Firth meant is that the meaning of a word isn’t some abstract property locked away in a dictionary. It’s defined by context. Specifically, the other words that tend to appear around it. The word “nurse” means something slightly different when it appears alongside “hospital” and “ward” than when it appears alongside “plant” and “seedling.” You don’t need a dictionary to know that. You just need enough examples of each usage.
For decades, this remained a theoretical observation. Linguists nodded, wrote papers about distributional semantics, and moved on. It wasn’t until computing caught up with the theory that Firth’s insight could be turned into something practical.
Words as numbers: word embeddings
The breakthrough came in 2013, when a Czech researcher named Tomáš Mikolov, working at Google, published a technique called Word2Vec. The idea, which seems almost too simple, was to train a neural network on a vast quantity of text, asking it to predict which words tend to appear near other words. Then, rather than using the predictions themselves, Mikolov kept the internal representations that the network had learned along the way. Those representations are what we now call word embeddings. Simple, but revolutionary.
An embedding is a way of representing a word as a list of numbers (a vector) in a mathematical space. Each word gets several hundred numbers. That doesn’t sound very illuminating, but something remarkable happens when you do this with enough data: words with similar meanings end up close together in that mathematical space, and the relationships between words get encoded in the distances and directions between their vectors.
The most famous demonstration of this is the analogy test. If you take the vector for “king,” subtract the vector for “man,” and add the vector for “woman,” the result lands very close to the vector for “queen.” Nobody taught the system about gender or royalty. The relationship emerged from patterns of usage, from the company the words keep. Firth’s 1957 insight, made into a computational reality.
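The arithmetic is easy to demonstrate with toy vectors. In the sketch below, the two dimensions are contrived by hand (roughly “royalty” and “gender”) so the sums work out exactly; real embeddings have hundreds of learned dimensions and the relationship is approximate rather than exact.

```python
# Toy illustration of embedding arithmetic with hand-built 2-d vectors.
# Dimension 0 loosely encodes "royalty", dimension 1 "gender". These
# values are contrived for the demo; real embeddings are learned.

vectors = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

# king - man + woman = ?
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

# find the nearest vocabulary word by Euclidean distance
nearest = min(vectors, key=lambda w: sum(
    (a - b) ** 2 for a, b in zip(vectors[w], result)))
print(nearest)  # queen
```

The point is not the arithmetic itself but what it implies: directions in the space carry meaning, and those directions were never programmed in.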
What makes this relevant to survey analysis, and not just a party trick with analogies, is what it does for understanding meaning at scale.
Why vectors matter for survey data
Consider the problem we described earlier: a respondent writes “I feel unsupported,” another writes “there’s no help available,” and a third writes “I’m left to figure things out alone.” A stemmer can’t connect those three statements. The words are entirely different. Even a lemmatiser won’t help, because the issue isn’t word forms; it’s that the same concept is being expressed in completely different language.
But in a vector space, all three of these statements end up in a similar region. The embeddings for “unsupported,” “no help,” and “left to figure things out” are mathematically close, because they tend to appear in similar contexts in the vast corpora of text from which the embeddings were trained. The system has learned, without being explicitly told, that these are different ways of saying the same thing.
This is transformative for large-scale text analysis. It means we can move beyond keyword matching to something much closer to genuine understanding of the meaning of the words. When SurveyOptic analyses thirty thousand open-text responses and identifies “lack of support” as a major theme, it isn’t just counting the word “support.” It’s recognising that dozens of different phrasings, expressed in the full variety of natural language, are all pointing in the same direction.
Vector-based search: finding meaning, not just matching words
The same principle powers a fundamental shift in how search works, one that most people encounter daily without even knowing it.
Traditional search, the kind that powered early search engines and still powers many internal platforms, works by matching keywords. You type “communication problems with management” and the system looks for documents containing those specific words (or, if it’s using a stemmer, their root forms). If a respondent wrote about “poor dialogue with leadership,” the traditional system won’t find it, because none of the keywords match.
Vector-based search, sometimes called semantic search, works differently. Instead of matching words, it converts both the query and the documents into vectors, and then finds the documents whose vectors are closest to the query vector. “Communication problems with management” and “poor dialogue with leadership” end up close together in vector space, because their meanings are similar, even though their words are not.
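The core of vector search is just cosine similarity: embed the query and the documents, then rank documents by the angle between their vectors and the query’s. The sketch below uses made-up three-dimensional vectors standing in for a real embedding model; in practice every vector would come from a trained model with hundreds of dimensions.

```python
import math

# Semantic search sketch: rank documents by cosine similarity to the
# query vector. The embeddings below are invented for illustration;
# in practice they come from a trained embedding model.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

documents = {
    "poor dialogue with leadership": (0.9, 0.1, 0.2),
    "great canteen food":            (0.1, 0.9, 0.1),
    "managers never talk to us":     (0.8, 0.2, 0.3),
}
# hypothetical embedding of "communication problems with management"
query = (0.95, 0.05, 0.25)

ranked = sorted(documents, key=lambda d: cosine(query, documents[d]),
                reverse=True)
print(ranked[0])  # poor dialogue with leadership
```

Notice that the top result shares not a single word with the query; only the (invented) vectors are close. That is the whole trick: similarity of meaning replaces overlap of keywords.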
This is the technology that now underpins the search and retrieval capabilities for comments in SurveyOptic. When you want to find all responses related to “feeling overwhelmed,” the platform doesn’t just look for the word “overwhelmed.” It finds responses about being “stretched too thin,” about “workload being unmanageable,” about “drowning in tasks,” and all the other ways real people express that experience.
This entire capability is built on Firth’s observation from nearly seventy years ago: You shall know a word by the company it keeps. The mathematics is new. The computing power is new. The insight is old.
The spin-out from large language models
It’s worth being precise about the relationship between word embeddings, vector search, and the large language models like ChatGPT, Claude, and their peers that currently dominate the headlines. Large language models (LLMs) are, in a very real sense, the grandchildren of Word2Vec. They use the same fundamental principle, learning from context, but at enormously greater scale. The transformer, the architecture behind almost every major LLM, was introduced in 2017, and it extended the embedding approach in a crucial way: instead of giving each word a single, fixed vector, transformers produce contextual embeddings. These are vectors that change depending on the sentence the word appears in. The word “bank” gets a different vector in “river bank” than in “bank account.” This was the missing piece that static word embeddings couldn’t provide.
The practical spin-out from this for survey analysis is significant. We can now use the embedding layers from these models, the part that converts text into vectors, without needing to use the full generative model at all. This gives us the semantic understanding, but without the cost, latency, or unpredictability of asking it to generate text. It’s like having a translator who understands every language perfectly but whom you only ever ask to read, never to write.
This is where the technology sits today, and it’s where SurveyOptic’s newer capabilities are being built. We use transformer-based embeddings for semantic search and theme clustering, classical NLP techniques for structured analysis, and purpose-built models for domain-specific tasks like detecting early signals of service issues in patient feedback.
Each layer has its strengths. The oldest techniques are the fastest and most transparent. The newest are the most semantically powerful. The skill is in knowing when to use which, and in never losing sight of the fact that every vector, every embedding, every cluster ultimately represents something a real person took the time to write.
Old theories, new capabilities
Muggeridge was right: all new news is old news happening to new people. The distributional hypothesis is seventy years old. Stemming algorithms have been in continuous use since 1968. The Porter stemmer has been running, in one form or another, for nearly half a century. And the most powerful text analysis tools available in 2026 are, at their mathematical core, implementations of ideas that were first articulated when the transistor was a new technology.
These are not untested technologies riding a hype cycle. They’re mature, well-understood techniques that have been refined across decades of practical application in information retrieval, computational linguistics, and data science. When we deploy them in SurveyOptic, we know how they’ll behave, we can explain why they produce the results they do, and we can audit every step of the analysis chain.
That matters enormously when the data you’re analysing is someone’s honest feedback about their workplace, their care, or their community.
Where we’re going
We’re currently bringing a number of new text analysis capabilities online in SurveyOptic, including document understanding, inference from text comments for early detection of service issues, and enhanced support for the increasingly large volumes of text involved in public consultations. If you’re working with large-scale survey data, employee feedback, or consultation responses and you’d like to understand what’s actually being said, we would love to have a conversation.
The tools have come a long way since 1957. But the fundamental challenge hasn’t changed: people express themselves in their own words, and those words deserve to be properly understood.
References
If you want to dig deeper, here are some references and further reading:
- Snowball homepage and introduction: snowballstem.org/texts/introduction.htm
- What company do words keep? Revisiting the distributional semantics of J.R. Firth & Zellig Harris.
- “Efficient Estimation of Word Representations in Vector Space” (the Word2Vec paper): arxiv.org/abs/1301.3781
- “What Are Word Embeddings?” – a non-technical overview: ibm.com/think/topics/word-embeddings
- Jurafsky & Martin, Speech and Language Processing, the standard university textbook for NLP: web.stanford.edu/~jurafsky/slp3