We all make fun of how large language models get things wrong—but only because we want very much for LLMs to get things right. Retrieval augmented generation (RAG) is one way to potentially provide more reliable results.
Like the internet, artificial intelligence will, eventually, affect everything. I think it’s pretty clear, for example, that there will be two kinds of organizations in our future: those that use AI and those that will die. But even that’s only true, of course, for the organizations that get the tools to work reliably. Generative AI using large language models is a good example of both the problems and solutions available with AI tools.
The most notorious problem in using generative text solutions is “hallucinations,” where a generated response includes both material that you can count on and material that is … let’s just say, less reliable.
My favorite example comes from Wikipedia, where a chatbot, given a URL containing several English words (“technology,” “chatgpt-prompts-to-avoid-content-filters”), generated a summary of the page at that URL. Except there was no page at that URL. Not that it mattered that the page didn’t exist, because the chatbot didn’t have an internet connection to look for the page anyway.
Criticize that chatbot’s response all you want, but that chatbot has a future as a consultant (or as a technical writer). In fact, to me, what’s interesting about the response is how human it is: Rather than admit ignorance, the chatbot took a guess at the right answer.
But it turns out that we care very much about the reliability of the answers we get from AI solutions. If we wanted occasionally unreliable answers, we’d skip the artificial intelligence and go with real intelligence by hiring a consultant who could tell us when their answers were guesswork. Generative language tools don’t do that (at least, not yet—see below).
AI’s problems aren’t limited to simply “making things up,” though. Responses can also be off-topic or nonsensical. Again, this isn’t unusual: I wrote a post a few months back about producing document summaries by integrating the Telerik PDF and document processing tools with Azure Language Services. Of the three examples I used in that post, one document summary I thought was fine and one I thought could use some improvement. The third document’s summary, however, revealed that Language Services couldn’t distinguish between the topic of my post (“the code to work with Azure Language Services”) and the topics of the sample documents I used. The result was sort of ugly.
Still, that’s one success, one failure and one in-between. In baseball terms, that’s an on-base percentage of .500 when, currently, a good on-base percentage is considered to be (quick check using a chatbot) about .360. To my mind, that means that, while you can use AI to create solutions that might formerly have been too expensive, those solutions are going to require some supervision.
I’d still call that progress but, if we want more reliable solutions, we need to understand why our AI solutions currently produce unreliable results.
The current hot technology for generating text using AI is large language models (LLMs). LLMs absorb a huge corpus of various kinds of documents (text, audio, video, images) as their input and are then trained to respond to new queries using patterns discovered in that corpus. But, because LLMs are pattern-based (rather than knowledge-based), they only generate responses that look like the right answer. As my previous examples demonstrate, that turns out not to be completely reliable.
Another significant issue is that LLMs are inherently out of date—a problem referred to as “latency.” Because LLMs need time to build their corpus of documents, ingest those documents and then be trained on that corpus, they are, inevitably, driven by the past (much like dictionaries, which tell you how words were used in the past but not how those words are being used right now). A service-oriented LLM in a rapidly evolving area might not be hallucinating but could easily be giving responses that are out of date.
There are at least two possible ways to improve reliability, and I like one of them very much.
One solution is “fine-tuning” LLMs with documents from a particular “problem domain.” Because your solution now draws only on content relevant to that domain, the patterns it discovers for that domain are based on real information. The incidence of hallucinations should be reduced and reliability improved.
However, while these more focused models are sometimes referred to as “small language models,” they typically still need a pretty huge collection of documents to generate coherent responses. It might be better to think of these as “topical” language models rather than “small” ones. Not surprisingly, the resources required to process all those documents remain high, as does the cost of training the resulting models. And, of course, latency remains a problem.
And the time/cost/effort of building that corpus is an issue because the existing “general purpose” LLMs are a) available right now and b) pretty darn cheap. It’s not clear that the improvement in reliability from a “fine-tuned” model will justify the additional cost.
The other, more promising solution is retrieval augmented generation (RAG). With RAG, applications built on top of an LLM automatically reach out to topical data sources to retrieve information related to the interaction. The data retrieved can be domain-specific (your company’s product list or service manuals) or just more recent (current polls in an election campaign). That additional information is then integrated into the LLM’s response, enriching it with current, domain-specific data.
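Stripped to its essentials, that flow is easy to sketch. In the Python below, retrieve_topical_snippets() and call_llm() are hypothetical placeholders for whatever retrieval layer and model endpoint you’re actually using (the device and manual text are made up, too); the only point is that the retrieved text gets folded into the prompt before the model generates its response:

```python
# A minimal sketch of the RAG flow: retrieve topical text, then fold it into
# the prompt the LLM sees. retrieve_topical_snippets() and call_llm() are
# hypothetical placeholders, not any particular product's API.

def retrieve_topical_snippets(question: str) -> list[str]:
    # Placeholder: a real implementation would query a vector database
    # holding your product list, service manuals, recent polls, etc.
    return [
        "Model X100 service manual, section 4: resetting the controller.",
        "Model X100 known issue: intermittent power fault after firmware 2.3.",
    ]


def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call whatever LLM you use.
    return f"(response generated from a prompt of {len(prompt)} characters)"


def answer_with_rag(question: str) -> str:
    snippets = retrieve_topical_snippets(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question using only the reference material below. "
        "If the material doesn't cover the question, say so.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)


print(answer_with_rag("How do I fix the power fault on my X100?"))
```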
As an example of what this might look like, about two days ago I clicked on an AI-generated link to access information about a rock band I like (The Warning, if you’re interested). The link brought me to a page with a discussion of the band’s style that, while not particularly insightful, wasn’t wrong.
But then the page went on to (purportedly) list the band’s albums, and not one of the albums listed was an album by the band. Adding insult to injury, the generated response also got the names of the band members wrong (which, considering that the members are all sisters and share a common last name, was a pretty impressive mistake).
Fortunately, I was suspicious of the reliability of the information—I knew the name of the band’s latest album, which wasn’t on the list. So I did what any rational person would do and checked a reliable source: the entry for the band on AllMusic.com, which gave me the right information.
Where had that original chatbot gotten its information? I have no way of knowing. But had my original query been processed against AllMusic, as I eventually did myself and as a RAG solution could have done automatically, the list of the band’s albums (and band member names) would have been more accurate. Using a knowledge-driven source, rather than just counting on pattern matching, can also help keep an LLM from providing a response unless there’s supporting information in the topical data source (i.e., no “hallucinations”).
But there’s another benefit: Sophisticated RAG systems can also include references to the topical data sources, effectively providing citations that users can review and verify. This lets users filter out nonsensical or off-topic responses (if they occur) by checking to see if that part of the response has a citation. If nothing else, in my example, the absence of a link to a reliable source (the band’s AllMusic entry) would have flagged the information as potentially unreliable.
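Mechanically, adding those citations just means carrying each retrieved chunk’s source along with its text and having the application append the sources to whatever the model generates. Here’s a rough sketch (the chunk text, the data structure and the generated response are my own inventions):

```python
# A rough sketch of carrying citations through a RAG response. Each retrieved
# chunk keeps a reference to its source so the application can list those
# sources with the answer; the chunk text and response here are made up.
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    source_url: str


def format_response(generated_text: str, chunks: list[RetrievedChunk]) -> str:
    citations = "\n".join(
        f"[{i}] {chunk.source_url}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"{generated_text}\n\nSources:\n{citations}"


chunks = [
    RetrievedChunk(
        text="Discography and band member listing for the band.",
        source_url="https://www.allmusic.com",
    ),
]
print(format_response("The band's albums are ...", chunks))
```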
But, of course, no solution is perfect, and RAG has its own issues. Understanding those issues requires looking at how RAG is integrated into processing a query.
All LLM solutions begin with natural language processing (tokenizing the query, stemming words to their root forms, recognizing and naming entities and so on). The output from that process is then converted into a vector: a set of numbers that reflects how words are used in a sentence and what the relationships among those words are. RAG uses that vector to fetch data from a topical vector database, based on similarities between the query vector and the vectors representing your topical data (e.g., recognizing the relationship between “Great Danes are very big dogs” and “What are some big dogs?”).
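Here’s what the query side of that process might look like, reduced to its simplest form. The fake_embed() function below is a deterministic stand-in for a real embedding model (a real model places semantically related text close together; this stub just keeps the example self-contained), and a plain Python list stands in for the vector database:

```python
# A sketch of the query side of RAG: embed the query, compare it to the
# vectors for your topical chunks and keep the closest match.
import hashlib
import math


def fake_embed(text: str, dims: int = 8) -> list[float]:
    # Stand-in for a real embedding model: deterministic, but not semantic.
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255.0 for b in digest[:dims]]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


topical_chunks = [
    "Great Danes are very big dogs.",
    "Poodles come in toy, miniature and standard sizes.",
]
chunk_vectors = [(text, fake_embed(text)) for text in topical_chunks]

query = "What are some big dogs?"
query_vector = fake_embed(query)

# Pick the chunk most similar to the query. (With a real embedding model,
# the Great Dane sentence would win for this query.)
best_text, _ = max(
    chunk_vectors, key=lambda pair: cosine_similarity(query_vector, pair[1])
)
print(best_text)
```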
Which leads to the first issue: Your topical data also has to be converted into vectors (“vector embedding”) before it can be integrated into an LLM solution.
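That conversion is typically its own ingestion pipeline: split each document into chunks, embed each chunk and store the vector alongside the text and its metadata. A minimal sketch, again with a stand-in embedding function and an in-memory list in place of a real vector database:

```python
# A sketch of the vector-embedding (ingestion) side: chunk, embed, store.
import hashlib


def fake_embed(text: str, dims: int = 8) -> list[float]:
    # Stand-in for a real embedding model.
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255.0 for b in digest[:dims]]


def chunk_document(text: str, chunk_size: int = 200) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split on structure
    # (sections, paragraphs) so each chunk stays coherent.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


vector_store: list[dict] = []   # stand-in for a real vector database


def ingest(doc_id: str, text: str) -> None:
    for n, chunk in enumerate(chunk_document(text)):
        vector_store.append({
            "doc_id": doc_id,
            "chunk_no": n,
            "text": chunk,
            "vector": fake_embed(chunk),
        })


ingest("service-manual-x100",
       "Section 4: To reset the controller, hold the power button for ten seconds ...")
print(len(vector_store), "chunk(s) embedded")
```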
And that leads to the second issue: The query against your topical database will return a lot of matching vectors. Those matches need to be ranked, with the higher-ranking results given priority in generating the response.
That ranking varies from one topical data source to another. In responding to a “How do I fix this broken device?” query, for example, data relevant to the particular device and the symptoms of the failure get priority. In an intrusion detection system, on the other hand, priority would be given to data about anomalies and activities that match those in the MITRE ATT&CK knowledge base. That diversity means getting a reliable ranking for any particular topical data source is going to involve some training.
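To make that concrete for the repair scenario, the sketch below starts from each chunk’s similarity score and boosts chunks whose metadata matches the device and symptom in the query. The field names and boost values are invented; settling on values like these is exactly what that training has to accomplish:

```python
# A made-up illustration of domain-specific ranking for the repair scenario:
# similarity score plus boosts for metadata that matches the query.
def rank_for_repair(query_device: str, query_symptom: str,
                    candidates: list[dict]) -> list[dict]:
    def score(candidate: dict) -> float:
        boost = 0.0
        if candidate.get("device") == query_device:
            boost += 0.3   # documentation for the right device matters most
        if query_symptom in candidate.get("symptoms", []):
            boost += 0.2   # then chunks tagged with the reported symptom
        return candidate["similarity"] + boost

    return sorted(candidates, key=score, reverse=True)


candidates = [
    {"text": "X100 power-fault troubleshooting", "device": "X100",
     "symptoms": ["power fault"], "similarity": 0.62},
    {"text": "X200 display troubleshooting", "device": "X200",
     "symptoms": ["blank display"], "similarity": 0.70},
]
for candidate in rank_for_repair("X100", "power fault", candidates):
    print(candidate["text"])
```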
That training might not be difficult, though: It might just consist of having human beings rank the LLM+RAG’s initial responses and directing the solution to do more of whatever generated the higher-ranked responses and less of whatever generated the lower-ranked ones. In addition, training is probably a one-time/up-front cost rather than an ongoing expense (though, again, supervision is required and any solution’s results should be reviewed at regular intervals to determine if additional training/retraining is needed).
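As a very rough sketch of that feedback loop, suppose the only tunable “knob” is the metadata-boost weight from the previous sketch: collect human ratings for responses produced with different weights and keep whichever weight rated highest on average (the rating scale and numbers here are invented):

```python
# A rough sketch of using human rankings to tune the ranking step: reviewers
# rate responses 1-5, and we keep the boost weight that averaged best.
from statistics import mean

# (boost_weight_used, human_rating) pairs collected during a review period.
feedback = [
    (0.1, 3), (0.1, 2),
    (0.3, 4), (0.3, 5),
    (0.5, 4), (0.5, 3),
]

ratings_by_weight: dict[float, list[int]] = {}
for weight, rating in feedback:
    ratings_by_weight.setdefault(weight, []).append(rating)

best_weight = max(ratings_by_weight, key=lambda w: mean(ratings_by_weight[w]))
print("metadata boost to use going forward:", best_weight)
```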
Third: Latency remains an issue. However, because the amount of domain-specific information is orders of magnitude smaller than what either large or fine-tuned models require, regenerating the vector database can happen more frequently, narrowing the latency gap. That is, however, an ongoing cost.
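That refresh can be as simple as a scheduled job that re-embeds only the documents whose content has changed since the last run. A sketch, with a hashing approach and function names of my own choosing:

```python
# A sketch of the ongoing cost: re-embed only documents whose content hash
# has changed since the last run, on whatever schedule your latency
# tolerance demands.
import hashlib

last_seen_hashes: dict[str, str] = {}   # doc_id -> content hash from last run


def refresh(documents: dict[str, str], reingest) -> None:
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if last_seen_hashes.get(doc_id) != digest:
            reingest(doc_id, text)       # e.g., the ingest() sketch above
            last_seen_hashes[doc_id] = digest


refresh({"service-manual-x100": "Section 4 (revised): ..."},
        reingest=lambda doc_id, text: print("re-embedding", doc_id))
```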
One more issue: Scalability, this time in terms of how many topical data sources you can include in a RAG solution. While, in theory, you can structure a RAG solution to include more topical data sources, as those sources become more diverse, you end up with the original problem: a general-purpose, pattern-matching system rather than one that leverages specialized knowledge. As a result, unless the solution has a way to pick the “right” topical data source for any query, the reliability of the solution’s responses will decrease (though the presence or absence of citations will help flag unreliable responses).
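One simple (if rough) way to pick the “right” source is to describe each topical data source in a sentence, embed the descriptions and route each query to the source whose description it most resembles. A sketch, reusing the same stand-in embedding function:

```python
# A sketch of routing a query to the most relevant topical data source by
# comparing the query's vector to a vector for each source's description.
import hashlib
import math


def fake_embed(text: str, dims: int = 8) -> list[float]:
    # Stand-in for a real embedding model.
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255.0 for b in digest[:dims]]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


sources = {
    "service-manuals": "Repair procedures and troubleshooting for our devices.",
    "product-catalog": "Current product list, prices and availability.",
}


def pick_source(query: str) -> str:
    query_vector = fake_embed(query)
    return max(
        sources,
        key=lambda name: cosine_similarity(query_vector, fake_embed(sources[name])),
    )


# With a real embedding model, a repair question routes to "service-manuals".
print(pick_source("How do I fix a power fault on the X100?"))
```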
We all know that reliability matters. If you’ve read this far, it’s only because you thought this was a reliable post, for example. RAG, while not free, provides a way to create more reliable solutions than pure LLM implementations by moving to a knowledge-based approach instead of pure pattern-matching—or, at the very least, giving users a quick way to assess the reliability of any response.
Still, if I were you, even in a RAG solution, I’d keep an eye on the results you get: The tools are still young and, like any new hire, will need supervision.
And, for the record, here’s some audience footage of The Warning (the Villarreal sisters—Ale, Pau and Dany), on tour to my town.
Peter Vogel is a system architect and principal in PH&V Information Services. PH&V provides full-stack consulting from UX design through object modeling to database design. Peter also writes courses and teaches for Learning Tree International.