
Rather than relying exclusively on general knowledge available to an LLM, RAG allows you to connect the model to specific reference sources, so your AI can provide more accurate, relevant information.

Imagine asking a general AI assistant (like the popular ChatGPT) a very specific question about your company. Instead of recognizing that it lacks access to your internal documentation, it confidently provides an entirely incorrect answer based on generic industry practices that don’t match your actual implementation.

Although LLMs (Large Language Models) are incredibly powerful at understanding and generating language, they can only work with information from their training data. They can’t access your company’s latest documentation, current project details or any internal knowledge sources unless you explicitly provide that information to them.

Some teams try to solve this problem with prompt engineering: carefully crafting detailed prompts that include the relevant context or instructions. While this can work for small, one-off use cases, it doesn’t scale well. Maintaining long, manual prompts becomes unmanageable as your knowledge base grows, and it’s easy for important context to be left out or become outdated.

This is where Retrieval-Augmented Generation, or RAG, comes in. Instead of relying solely on what a model already “knows,” RAG allows you to connect the model to an external knowledge base (e.g., your internal documentation, wikis or databases). When a user asks a question, the system first retrieves the most relevant documents and then augments the model’s response using that information.

The Three Steps of RAG

The name “Retrieval-Augmented Generation” summarizes how the process works: you retrieve the proper context, then augment the model’s generation with it. A typical RAG workflow involves three main steps:

Retrieve

When a user submits a query, the system searches through an external knowledge source (e.g., your company’s documentation, support tickets or technical specs) to find the most relevant pieces of information. This retrieval step usually relies on embeddings, which represent text as numerical vectors so that semantically similar content can be efficiently matched.

We’ll dive deeper into embeddings in a follow-up article, but think of them as a way to convert text into numbers that capture meaning. This allows the system to understand that “authentication” and “login security” are related concepts even though they use different words.

The retrieval system returns the top matches, perhaps the five or ten most relevant documents or passages, based on how closely they align with the user’s query.
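To make this concrete, below is a minimal Python sketch of embedding-based retrieval. The embed function is a toy bag-of-words stand-in used purely for illustration; a real system would call an embedding model or a vector database so that related phrases like “authentication” and “login security” land close together even when they share no words.

import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    # Real embeddings capture meaning, not just shared words.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    # 1.0 means the vectors point in exactly the same direction; 0.0 means unrelated.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    # Score every document against the query and return the top_k closest matches.
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine_similarity(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]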

Augment

Once the relevant documents are retrieved, they’re combined with the original user query to create an enhanced prompt. This is the “augmentation” part. We’re giving the LLM a cheat sheet of relevant information before asking it to respond.
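As a rough sketch, the augmentation step can be as simple as a prompt template. The exact wording and layout below are assumptions for illustration; in practice teams tune this template for their own models and use cases.

def build_augmented_prompt(question: str, retrieved_docs: list[str]) -> str:
    # Combine the retrieved passages with the user's question into a single prompt.
    context = "\n".join(f'- "{doc}"' for doc in retrieved_docs)
    return (
        "Context from internal documentation:\n"
        f"{context}\n\n"
        f"Question: {question}\n\n"
        "Please answer based on the provided context."
    )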

Generate

With the retrieved context now part of the prompt, the LLM generates a response that’s grounded in your actual documentation rather than generic knowledge. The model can quote specific policies, reference exact numbers from your specs or explain procedures that are unique to your organization.
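The generation step is then a single call to whichever LLM you use. The snippet below assumes the OpenAI Python SDK and a gpt-4o-mini model purely as an example; any chat-style LLM API can fill this role.

from openai import OpenAI  # example only; any LLM client works similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(augmented_prompt: str) -> str:
    # Send the context-enriched prompt to the model and return its reply.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; substitute your own
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    return response.choices[0].message.content

Chaining retrieve, build_augmented_prompt and generate_answer together is, at its simplest, the entire RAG loop.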

An Example: TechCorp’s Internal Assistant

Let’s walk through a fictional example. Imagine that TechCorp, a mid-sized software company, has implemented a RAG system for its internal engineering assistant. Its knowledge base includes API documentation, deployment guides, security policies and incident runbooks.

The Knowledge Base

TechCorp has indexed several key documents into its vector database:

Document | Content
Doc A | “Our API rate-limiting policy enforces 1000 requests per minute for standard tier clients. Premium tier clients get 5000 requests per minute. Rate limits reset every 60 seconds.”
Doc B | “JWT tokens must be rotated every 24 hours in production environments. The rotation process involves calling the /auth/refresh endpoint with the current token before expiration.”
Doc C | “Deployment to production requires approval from two senior engineers. All deployments must go through staging first, with a minimum 2-hour soak period.”
Doc D | “For authentication issues, first check the auth service logs in Datadog. Common errors include expired tokens (ERROR_401_EXPIRED) and invalid signatures (ERROR_401_INVALID_SIG).”
Doc E | “Our staging environment mirrors production but uses synthetic test data. Database refreshes happen nightly at 2 AM PST.”

Now let’s see how an AI assistant, supported by RAG, would handle a few questions. When an engineer asks, “What’s our API rate limit for premium customers?” here’s how the system responds:

Step 1. Retrieve

The system converts the question into an embedding and searches for similar content. Below is a simplified view of how the documents rank in similarity:

Document | Content Preview | Similarity Score
Doc A | “Our API rate-limiting policy…” | 0.92
Doc D | “For authentication issues…” | 0.31
Doc B | “JWT tokens must be rotated…” | 0.28
Doc C | “Deployment to production…” | 0.15
Doc E | “Our staging environment…” | 0.08

The system retrieves the top two most relevant chunks:

  • Doc A with its high similarity score.
  • Doc D as potential supporting context.
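Using the similarity scores from the table above, the selection itself is just a sort and a cutoff. The top_k value of 2 and the optional minimum-score threshold below are tuning choices, not fixed rules.

# Scores taken from the ranking table above.
scores = {"Doc A": 0.92, "Doc D": 0.31, "Doc B": 0.28, "Doc C": 0.15, "Doc E": 0.08}

def select_top_chunks(scores: dict[str, float], top_k: int = 2, min_score: float = 0.0) -> list[str]:
    # Keep the highest-scoring documents, optionally dropping anything below a threshold.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, score in ranked[:top_k] if score >= min_score]

print(select_top_chunks(scores))  # ['Doc A', 'Doc D']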

Step 2. Augment

The system builds the following augmented prompt combining the user’s original question with the most relevant retrieved documents from the external knowledge base:

Context from internal documentation:
- "Our API rate-limiting policy enforces 1000 requests per minute for standard tier clients. Premium tier clients get 5000 requests per minute. Rate limits reset every 60 seconds."
- "For authentication issues, first check the auth service logs in Datadog. Common errors include expired tokens (ERROR_401_EXPIRED) and invalid signatures (ERROR_401_INVALID_SIG)."

Question: What's our API rate limit for premium customers?

Please answer based on the provided context.

Step 3. Generate

The LLM might then respond:

“Premium tier customers at TechCorp have an API rate limit of 5000 requests per minute. These limits reset every 60 seconds. Standard tier clients, for comparison, are limited to 1000 requests per minute.”

The response is specific and accurate because it relies on verified documentation rather than general assumptions. After receiving that answer, the engineer could follow up with:

“Do these rate limits apply to both production and staging environments?”

For this question, the assistant retrieves the relevant context from Docs A and E, noting that the staging environment mirrors production but uses synthetic test data. It then responds:

“Yes — the same rate-limiting policy applies in staging and production. However, since staging uses test data and refreshes nightly at 2 AM PST, engineers typically run requests at smaller volumes during testing.”

This walkthrough provides a simplified overview of how a RAG system operates in practice. In real-world implementations, the architecture can vary, from how documents are chunked and indexed to how retrieval scoring, ranking and reranking are performed. Still, the core principle remains the same: RAG combines retrieval and generation to deliver more accurate, context-grounded responses.
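Chunking is a good example of how much these details can vary. As a simple illustration (not a recommendation for any particular chunk size), a fixed-size splitter with overlap might look like this:

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Split a long document into overlapping character windows so that
    # sentences near a boundary still appear intact in at least one chunk.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks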

Wrap-up

RAG bridges the gap between what large language models know and what your organization needs them to know. By grounding every response in your company’s real documentation, it works around the limits of static training data and avoids the maintenance burden of hand-crafted prompts.

In the examples above, you may wonder how exactly a system computes those similarity scores. For example, how did it know that the words “API rate limit” were 92% similar to our rate-limiting documentation? We’ll explore this a bit more when we dive into embeddings in the next article.

Finally, if you’re exploring how to put RAG into production, the Progress Agentic RAG platform provides a practical starting point. It delivers RAG-as-a-Service (indexing your data, managing retrieval quality and integrating seamlessly with any LLM), so teams can focus on building intelligent, context-aware AI experiences rather than the underlying RAG infrastructure itself.


About the Author

Hassan Djirdeh

Hassan is a senior frontend engineer and has helped build large production applications at scale at organizations like Doordash, Instacart and Shopify. Hassan is also a published author and course instructor who has helped thousands of students learn in-depth frontend engineering skills like React, Vue, TypeScript and GraphQL.
