RAG (Retrieval-Augmented Generation) and Embedding: Part 1


RAG (Retrieval-Augmented Generation) is a hybrid AI approach that combines:

1. Retrieval-based systems (for accuracy and up-to-date knowledge)

2. Generative models (for fluent, natural language responses)

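Why do we need RAG, and how does it work? An LLM on its own can answer only from its training data, which may be outdated or may not cover your own documents. RAG fills that gap by retrieving relevant content at query time, as in the following flow: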

A user enters a natural language query, such as “What are the latest features in Kubernetes 1.30?”

The query is converted into a vector embedding using a pre-trained encoder (such as OpenAI embeddings or Sentence Transformers).

This embedding is used to search a vector database for semantically similar documents.

The top-matched documents are retrieved to form the context.

This context is provided to a large language model (LLM) to generate an accurate and relevant response.

RAG 

The diagram above illustrates the architecture of a Retrieval-Augmented Generation (RAG) system powered by Redis Vector DB and Azure OpenAI.

The process begins with PDFs (Step 1), which are converted into vector embeddings using a pre-trained model (Step 2). These embeddings are stored in a Redis Vector Database (Step 3). When a user submits a natural language question (Step 5), it is also converted into an embedding (Step 2) and sent to the retriever (Step 4). The retriever searches the Redis Vector DB to find semantically similar content. The retrieved content and the original question are passed to the RAG system (Step 7), which uses Azure OpenAI (Step 6) to generate a context-aware response. Finally, the system produces an accurate answer (Step 8) for the user.

As the diagram above shows, embeddings are a critical component: the more meaningful the embeddings, the more accurate and relevant the search results become. In my example, I’ve used Redis as the vector database, accessed through RedisVL (the Redis Vector Library for Python).

Let’s now dive deeper into the Embedding Flow in a Retrieval-Augmented Generation (RAG) system.

The diagram illustrates the embedding process in a Redis-based RAG architecture. This process involves two key workflows: embedding documents (PDFs) during ingestion and embedding user queries at runtime. Both workflows are essential to enable accurate, vector-based semantic search.

Embedding- Ingestion Flow

Embedding PDFs (Ingestion Time)

Step 1: Load PDFs

PDF files are loaded using libraries like PyPDF or LangChain’s PyPDFLoader. This step extracts text from each page of the document.

Step 2: Split Text into Chunks

To maintain context and adhere to embedding size limits, the extracted text is split into smaller chunks (e.g., 500 tokens) using a text splitter.
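The splitting step can be sketched as follows. This is a minimal illustration, not the splitter used in the original pipeline: the function name `split_into_chunks` is hypothetical, tokens are approximated by whitespace-separated words, and the overlap between neighbouring chunks is an assumption that the article does not mention.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of roughly chunk_size tokens with some overlap.

    Tokens are approximated by whitespace-separated words here; a real
    pipeline would count tokens with the embedding model's tokenizer.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means the last few words of one chunk are repeated at the start of the next, which is one common way to soften hard cuts between chunks.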

Step 2.1: Preprocess Text

Before generating embeddings, I applied several preprocessing techniques to clean and normalize the text. These included:

  • Lowercasing
  • Removing punctuation
  • Tokenization
  • Stopword removal
  • Lemmatization or stemming

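These steps can reduce noise in the text. Aggressive normalization such as stopword removal or stemming can also strip context that transformer-based embedding models use, so it is worth comparing retrieval quality with and without them.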
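A rough sketch of the first four preprocessing steps, using only the Python standard library. The `preprocess` function and the tiny `STOPWORDS` set are illustrative assumptions; lemmatization and stemming are left out because they normally rely on a library such as NLTK or spaCy.

```python
import string

# A tiny illustrative stopword list (assumption); real pipelines
# usually take a fuller list from an NLP library.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "for", "to", "in", "on"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOPWORDS]
```

Note that stripping punctuation this way turns "in-memory" into "inmemory", a small example of how cleaning can change the text in ways that are worth checking.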

Step 3: Generate Embeddings

Each text chunk is passed to an embedding model, such as OpenAI’s text-embedding-3-small, which converts the text into a high-dimensional vector.

A high-dimensional vector is simply a list of numbers (e.g., 1536 values) that numerically represent the meaning of text. Semantically similar sentences result in vectors that are close in this vector space. For instance, ‘Redis enables fast data retrieval’ and ‘Redis supports quick access to data’ may generate closely aligned vectors, even with different wording.
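"Close in vector space" is usually measured with cosine similarity. The sketch below uses made-up four-dimensional vectors purely for illustration; they are not real model output, and real embeddings would have many more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for real embeddings.
fast_retrieval = [0.9, 0.1, 0.8, 0.2]  # "Redis enables fast data retrieval"
quick_access = [0.8, 0.2, 0.9, 0.1]    # "Redis supports quick access to data"
resort = [0.1, 0.9, 0.0, 0.7]          # an unrelated sentence

similar = cosine_similarity(fast_retrieval, quick_access)
unrelated = cosine_similarity(fast_retrieval, resort)
```

With these toy values, `similar` is much higher than `unrelated`, which is the behaviour a good embedding model aims for with real sentences.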

Though this article focuses on the RAG workflow, it’s crucial to highlight the common challenges in embedding generation. In my experience, generating good embeddings is real data science work: if the embeddings go wrong, the entire RAG pipeline can fail, regardless of the use case.

Here are a few real-world challenges I’ve encountered:

1. Semantic Drift

Vectors from unrelated content may appear similar due to shared terms or vague overlaps. This can cause irrelevant documents to be retrieved.

Example:

  • Relevant: “Redis supports high-speed caching for web applications.”
  • Irrelevant: “This resort offers a relaxing experience by the riverbank.”
  • User Query: “How does Redis support caching?”

Due to slight numerical similarity, the retriever might incorrectly choose the resort sentence as a match — a false positive.

2. Context Loss During Chunking

Splitting text mid-paragraph can weaken the contextual meaning, leading to subpar embeddings.

Bad Chunking Example:

  • Chunk A: “Redis is an in-memory data store, commonly used for caching. It supports various”
  • Chunk B: “data structures like strings, hashes, and lists. It is extremely fast and is often used…”

These fragments lose meaning when read independently, reducing embedding effectiveness.
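One way to avoid cutting sentences in half is to chunk on sentence boundaries. The sketch below is an assumption about how this could be done, not the article's implementation: `chunk_by_sentences` is a hypothetical helper, and the regular expression is a naive sentence splitter that will mis-handle abbreviations such as "e.g.".

```python
import re

def chunk_by_sentences(text: str, max_words: int = 100) -> list[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Applied to the Redis text from the bad chunking example, every chunk now ends at a sentence boundary, so "It supports various data structures like strings, hashes, and lists." stays in one piece.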

3. Polysemy and Ambiguity

Words with multiple meanings (polysemy) can confuse models if context is weak or missing.

Example:

  • “bank” could mean a financial institution or the side of a river.

Problem:

  • Sentence: “He sat by the bank and watched the water flow.”
  • Query: “How do I open an account at a bank?”

The shared word “bank” can lead to incorrect matches despite the semantic gap.

Mitigation:

  • Use larger context windows.
  • Add metadata (e.g., domain=finance or domain=nature).
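The metadata idea can be sketched as a filter applied before similarity ranking. The chunk records, the `domain` field values, and the `filter_by_domain` helper are illustrative assumptions; the finance sentence is invented for the example.

```python
def filter_by_domain(chunks: list[dict], domain: str) -> list[dict]:
    """Keep only chunks whose 'domain' metadata matches the requested domain."""
    return [chunk for chunk in chunks if chunk.get("domain") == domain]

# Hypothetical stored chunks with a metadata tag attached at ingestion time.
stored_chunks = [
    {"text": "He sat by the bank and watched the water flow.", "domain": "nature"},
    {"text": "Visit any branch of the bank to open a savings account.", "domain": "finance"},
]

# For a banking question, only finance chunks go on to vector ranking.
candidates = filter_by_domain(stored_chunks, "finance")
```

Because the river sentence never reaches the similarity step, the shared word “bank” can no longer cause a false match for the account query.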

4. Poor Preprocessing

Inconsistent tokenization or noise like headers and footers can degrade embedding quality.
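As one possible way to handle header and footer noise, the sketch below drops lines that repeat across most pages of a document. The helper name `strip_repeated_lines` and the 60% threshold are assumptions chosen for illustration, and the heuristic only catches headers and footers that are identical on each page.

```python
from collections import Counter

def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that appear on at least min_fraction of the pages."""
    if len(pages) < 2:
        return list(pages)  # nothing to compare against
    line_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        line_counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in line_counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

Running this on the per-page text before chunking keeps boilerplate such as a company banner out of every chunk's embedding.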

5. Model Limitations

Generic embedding models may not grasp specialized jargon or domain-specific terms.

Example:

  • “The patient underwent CABG following myocardial infarction.”
  • CABG = Coronary Artery Bypass Grafting
  • Myocardial infarction = Heart attack

A general-purpose model may:

  • Misinterpret acronyms
  • Miss the clinical relationships

Mitigation:

  • Use domain-specific models (e.g., BioBERT, FinBERT)
  • Fine-tune on specialized corpora
  • Use metadata and keyword filters

6. No Ground Truth Validation

Often, systems lack human-in-the-loop checks to confirm if retrieved results are truly accurate.
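A simple way to add such a check is to have a person label which chunks are relevant for a handful of test queries and then measure recall@k. The sketch below assumes hypothetical document IDs and a hypothetical `recall_at_k` helper; it is one common metric, not the only option.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the human-labeled relevant documents found in the top k results."""
    if not relevant_ids:
        return 0.0  # no labels for this query, nothing to measure
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical example: the retriever returned d3, d1, d7 (best first),
# while a human reviewer marked d1 and d2 as the relevant documents.
score = recall_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3)
```

Here `score` is 0.5, because only one of the two labeled documents appears in the top three results. Tracking this number across test queries shows whether changes to chunking or preprocessing actually help.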

Step 7: Vector Search in Redis

Redis compares the query vector with stored document vectors to find the most semantically similar results.
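Conceptually, the search ranks stored vectors by similarity to the query vector and returns the best matches. The sketch below shows that idea as a brute-force scan in plain Python with made-up three-dimensional vectors; it is not how Redis implements the search (Redis uses a vector index), and the names `normalize` and `top_k` are hypothetical.

```python
import heapq
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    query = normalize(query_vec)
    scored = (
        (doc_id, sum(q * d for q, d in zip(query, normalize(vec))))
        for doc_id, vec in doc_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical toy vectors standing in for stored chunk embeddings.
documents = {
    "caching": [0.9, 0.1, 0.8],
    "resort": [0.1, 0.9, 0.0],
    "structures": [0.7, 0.2, 0.6],
}
results = top_k([0.8, 0.1, 0.9], documents, k=2)
```

With these toy values the two Redis-related entries rank above the resort entry, and the returned list is what would be passed on as context.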

Step 8: Pass to RAG System

The retrieved chunks and the original query are fed into a Large Language Model (LLM), such as a GPT model served through Azure OpenAI, which generates a contextual response for the user.
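How the chunks and the query are combined is up to the prompt design. The sketch below shows one plausible layout; the wording of the instructions and the `build_prompt` helper are assumptions, and the resulting string would then be sent to whichever LLM client the system uses.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into a single prompt."""
    # Number the chunks so the model (and a reader) can tell them apart.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How does Redis support caching?",
    ["Redis supports high-speed caching for web applications."],
)
```

Telling the model to rely only on the supplied context is a common way to keep the answer grounded in the retrieved documents.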

Summary: Redis-based RAG Embedding Pipeline

This document outlines the process of embedding documents and user queries for a Redis-powered Retrieval-Augmented Generation (RAG) system. The workflow begins by extracting text from PDFs using libraries like PyPDF and splitting the text into manageable chunks. These chunks are preprocessed—lowercased, tokenized, cleaned—and converted into high-dimensional vectors using models such as OpenAI’s text-embedding-3-small.

The document highlights key embedding challenges, including:

  • Semantic drift
  • Context loss during chunking
  • Polysemy and ambiguity
  • Poor preprocessing
  • Model limitations, and
  • Lack of ground truth validation

Document embeddings are stored in Redis Stack with RediSearch, and query embeddings are compared against them using cosine similarity, enabling fast and accurate vector search. The final stage involves passing the top-K retrieved chunks and the query to an LLM such as GPT-4 on Azure OpenAI, which produces a meaningful answer. For more information, visit https://cloud.google.com/use-cases/retrieval-augmented-generation

The choice of Redis as the vector store is driven by its performance, ease of integration, and native vector search support, making it a suitable option for building scalable semantic search workflows. Read another article by the author at https://journals-times.com/2025/05/31/agentic-ai-how-it-can-redefine-the-software-development-lifecycle/
