Nov 21, 2024 by Sabber Ahamed
In this post, I'll guide you through implementing a Retrieval Augmented Generation (RAG) system using modern tools and techniques. We'll explore how to combine multiple retrieval methods with reranking for more accurate and relevant responses.
In the landscape of search-based Retrieval Augmented Generation (RAG), one component stands out as a game-changer: reranking. Let me give you some context. In a semantic-only retrieval RAG system, we pull data from documents that are semantically similar to the user query. The algorithm we use, Approximate Nearest Neighbors (ANN) search, is fast and efficient. However, because of its approximate nature, it can sometimes pull irrelevant data.
Fed irrelevant data, Large Language Models (LLMs) can hallucinate and produce inaccurate responses. This is where reranking comes in.
Think of reranking as a two-step interview process: the initial retrieval is like screening resumes (quick but broad), while reranking is the in-depth interview (thorough but resource-intensive). This approach has become increasingly important as organizations struggle with LLM hallucinations and accuracy issues. By implementing proper reranking, many teams have reported up to 40% improvement in response accuracy.
In my last blog post about Building a Multi-agent System, I discussed the importance of combining multiple agents to create a more robust conversational system. Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from a knowledge base before generating responses. This approach combines the power of retrieval, which finds the documents most relevant to a query, with generation, where the LLM composes an answer grounded in that retrieved context.
A robust RAG system consists of several essential components working together:
The embedding model converts text into dense vector representations for semantic search:
from langchain_community.embeddings import HuggingFaceEmbeddings

embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
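As a quick sanity check (the sample question here is just an illustration), you can embed a query and inspect the vector's dimensionality:
# Embed a single query string; bge-small-en-v1.5 produces 384-dimensional vectors
query_vector = embed_model.embed_query("What is reranking in RAG?")
print(len(query_vector))  # 384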
This component combines multiple retrieval methods for better coverage:
from langchain.retrievers import EnsembleRetriever

ensemble_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, bm25_retriever],
    weights=[0.7, 0.3],  # 70% semantic, 30% keyword importance
)
The reranker fine-tunes the retrieved documents for maximum relevance:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
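To get a feel for what the cross-encoder does, you can score a few query-passage pairs directly (the passages below are made up for illustration); higher scores mean higher relevance:
# Score (query, passage) pairs; relevant pairs receive higher scores
scores = reranker.predict([
    ("what is reranking", "Reranking reorders retrieved documents by relevance."),
    ("what is reranking", "The weather in Paris is mild in spring."),
])
print(scores)  # the first pair should score higher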
Let's break down the implementation into manageable steps:
# Imports for the components used below
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq
from langchain.retrievers import EnsembleRetriever

# Initialize embedding model and LLM
embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
rag_llm = ChatGroq(model="llama3-8b-8192")

# Load vector store (collection_name and persist_directory point at
# a Chroma collection you have already built)
vectorstore = Chroma(
    collection_name=collection_name,
    embedding_function=embed_model,
    persist_directory=persist_directory,
)

# Create BM25 retriever (all_docs is the list of raw text chunks
# that were indexed into the vector store)
bm25_retriever = BM25Retriever.from_texts(all_docs)
bm25_retriever.k = 10  # match the semantic retriever's breadth (default is 4)

# Set up semantic retriever
semantic_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 10}
)

# Combine retrievers
ensemble_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, bm25_retriever],
    weights=[0.7, 0.3],
)
The reranker serves as a crucial refinement layer in modern RAG systems, using cross-encoders to perform deep bi-directional attention between queries and documents. While powerful, reranking comes with important trade-offs to consider:
Advantages:
- Significantly higher relevance than embedding similarity alone, since the model reads the query and each document together
- Fewer irrelevant passages reach the LLM, which reduces hallucinations
Disadvantages:
- Added latency and compute cost, since every query-document pair must pass through the model
- Does not scale to large candidate sets, so it must sit behind a fast first-stage retriever
Here's how we implement reranking in our system:
from typing import List

from langchain_core.documents import Document
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def rerank_documents(query: str, docs: List[Document], k_final: int = 3):
    """
    Rerank documents using a cross-encoder, with batching for efficiency.

    Args:
        query: User question
        docs: List of retrieved documents
        k_final: Number of documents to return after reranking
    """
    # Batch process documents for efficiency
    batch_size = 32
    pairs = [(query, doc.page_content) for doc in docs]

    # Get scores from the reranker, one batch at a time
    scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        batch_scores = reranker.predict(batch)
        scores.extend(batch_scores)

    # Sort documents by score, highest relevance first
    scored_docs = list(zip(docs, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    # Return the top k documents
    return [doc for doc, _ in scored_docs[:k_final]]
A key best practice for reranking is to retrieve more documents initially (e.g., the top 10-20) and then rerank to select the best 3-5 for the final context. This balances the trade-off between accuracy and performance:
# Example optimization workflow
initial_docs = await ensemble_retriever.ainvoke(question)  # Get the broad top 10-20 docs
reranked_docs = await rerank_documents(
    query=question,
    docs=initial_docs,
    k_final=3,  # Only keep the top 3 after reranking
)
Beyond retrieval and reranking, a complete pipeline covers three more stages, tied together in the sketch below:
- Query Processing: cleaning and interpreting the user's question before retrieval
- Response Generation: prompting the LLM with the reranked context to produce the final answer
- Performance Optimization: batching, caching, and tuning k values to keep latency acceptable
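Here is a minimal end-to-end sketch tying these stages together. It assumes the objects defined earlier in this post; the function name answer_question and the prompt wording are illustrative, not a fixed API:
import asyncio

async def answer_question(question: str) -> str:
    # Query processing: a real system might rewrite or expand the question here
    question = question.strip()

    # Retrieval: pull a broad candidate set from the hybrid retriever
    initial_docs = await ensemble_retriever.ainvoke(question)

    # Reranking: keep only the most relevant few documents
    top_docs = await rerank_documents(question, initial_docs, k_final=3)

    # Response generation: prompt the LLM with the reranked context
    context = "\n\n".join(doc.page_content for doc in top_docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = await rag_llm.ainvoke(prompt)
    return response.content

# Run the pipeline
print(asyncio.run(answer_question("How does reranking reduce hallucinations?")))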
As RAG systems continue to evolve, several key areas will shape their development:
- Advanced Retrieval Methods
- Enhanced Context Processing
- Improved Response Generation
We created getassisted.ai for building seamless multi-agent systems without writing any code. The goal is to create an assistant that helps you learn any niche topic. Whether you're a researcher, developer, or student, our platform offers a powerful environment for exploring the possibilities of multi-agent systems. Here is a link to explore some of the assistants created by our users.
Building a robust RAG system requires careful consideration of various components and their interactions. By following the structured approach outlined in this article and implementing best practices, you can create effective RAG systems that provide accurate and contextually relevant responses.
Remember that the field of RAG is rapidly evolving, and staying updated with the latest developments and technologies is crucial for creating state-of-the-art systems.