
AI Expedition: RAG

03-19-26 Sparkbox

If you ask an LLM something specific about the content and data that matters to you every day, you may receive a confident answer even when the LLM doesn't actually know. RAG workflows can reduce hallucinations and guesses by providing an LLM with contextually specific data and expert know-how.

Recently, Marissa gave a talk at Sparkbox’s UnConference about how RAG pipelines can help reduce AI hallucinations and improve response accuracy. But there’s much more that we didn’t have time to explore, so let’s talk about the technology of RAG workflows and why they can be so effective at providing the specificity AI needs.

What You Should Know

The more specific a question you ask AI, the more likely it is that AI will create believable-sounding answers that are factually incorrect. Those misguided but plausible-sounding responses are called hallucinations. They happen because LLMs like Claude, Gemini, or ChatGPT are trained primarily on publicly available information and don’t have access to proprietary data or specialized knowledge. Naturally, there are gaps in their knowledge. The RAG technique is designed to bridge the gap between general LLM knowledge and the expertise and content that you use every day. But how exactly does that happen?

How RAG Solves the Specificity Problem

RAG stands for Retrieval-Augmented Generation. This technique enables humans to control what specific information the LLM has access to. A RAG pipeline leverages general LLM knowledge while supercharging it with human expertise. The AI no longer needs to hallucinate or guess, since it can lean on us for specific context and expertise.

The RAG technique is a series of phases designed to gather data and make it understandable and quickly searchable by AI. The most powerful part of the pipeline, however, is when it retrieves information that is relevant to a question or query.

Step 1: Ingestion

Gather the data. Sources can include web crawls, local files, PDFs, other databases, and even Git repos. Wherever the data is, you can “point” your RAG at it for ingestion. The ingestion phase might include converting any HTML to Markdown, particularly if you’re crawling sites like blogs.
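As a rough illustration of the HTML-to-Markdown step, here’s a minimal sketch in Python. The `html_to_markdown` helper is hypothetical and only handles headings and paragraphs; real pipelines typically use a dedicated conversion library rather than hand-rolled regexes.

```python
import re

def html_to_markdown(html: str) -> str:
    """Very rough HTML-to-Markdown conversion for ingestion.
    Handles <h1>-<h6> and <p>, then strips any remaining tags."""
    # Convert headings: <h2>Title</h2> -> "## Title"
    text = re.sub(
        r"<h([1-6])[^>]*>(.*?)</h\1>",
        lambda m: "#" * int(m.group(1)) + " " + m.group(2) + "\n\n",
        html,
        flags=re.S,
    )
    # Paragraph boundaries become blank lines
    text = re.sub(r"</?p[^>]*>", "\n\n", text)
    # Drop any other tags and collapse extra blank lines
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The blank lines this produces between blocks matter later: they give the processing phase natural boundaries to chunk on.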

Step 2: Processing

Break down the larger pieces of information. Processing includes “chunking” the data, and there are several strategies for splitting it: by sentence, by paragraph, by semantic boundaries (particularly useful when chunking code), and others. Chunking should keep related things together, such as a complete code snippet. Metadata is usually processed and stored as well.
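A paragraph-based chunking strategy can be sketched in a few lines of Python. This is a simplified illustration: the `max_chars` limit and the greedy packing logic are assumptions, not a prescribed approach, but they show how neighboring paragraphs can stay together in one chunk.

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Split text on blank lines, then greedily pack paragraphs into
    chunks no longer than max_chars, so related content stays together."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

A stricter implementation would also avoid splitting inside fenced code blocks, which is one reason semantic-boundary chunking exists.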

Step 3: Embedding and vector storage

Convert processed data to vectors. An embedding model, like Sentence-BERT or text-embedding-ada-002, vectorizes the chunked data. The created embeddings, or vectors, are numerical representations of text and text patterns or relationships. The workflow stores the embedded data in a vector database. This database is where all of your expertise, content, and data live. It’s the additional knowledge hub that your LLM will get to leverage in a controlled way.
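To make “numerical representations of text” concrete, here is a deliberately tiny sketch. It uses word counts over a toy vocabulary instead of a trained model like Sentence-BERT, so it captures none of the semantic nuance a real embedding model provides, but the shape of the result is the same: each chunk becomes a vector stored alongside its text.

```python
from collections import Counter

# Toy vocabulary -- a real model has no fixed word list like this
VOCAB = ["design", "token", "color", "button", "spacing"]

def embed(text: str) -> list[float]:
    """Toy embedding: normalized counts of vocabulary words.
    A real pipeline calls a trained embedding model instead."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in VOCAB]
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec

# A minimal stand-in for a vector database: each record pairs
# the chunk's original text with its embedding.
store = [
    {"text": chunk, "vector": embed(chunk)}
    for chunk in ["button color values", "spacing guidelines"]
]
```

Note that the store keeps the original text next to each vector; that text is what eventually gets handed to the LLM after retrieval.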

At this point, the RAG system has built an expert database. The pipeline won’t need to go through these previous phases or rebuild the database every time a query is sent to it. The database only needs to be rebuilt if the data changes, like if there’s more that should be included, or dynamic content was updated.

Step 4: Retrieval

Find data related to a search query. Retrieval kicks off when a person queries the data. The entry point to a RAG workflow might be an API endpoint, a CLI command, a chatbot, a script, or even a Slackbot. The RAG system first has to embed the query input in order to compare it to the info in the database. That comparison uses cosine similarity scoring (a mathematical measure of how similar two pieces of text are) to determine what info might be semantically relevant (i.e., similar in meaning, not just matching keywords). Once it gathers that relevant data, the RAG system sends the query and the related data to the LLM, essentially saying, “here is the related information I found that should help you answer this question.” That fills in the gaps that the LLM might have. Robust RAG pipelines also support hybrid search, where cosine similarity and keyword search are used in tandem to find relevant info.
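Cosine similarity itself is only a few lines of math: the dot product of two vectors divided by the product of their lengths. The sketch below pairs it with a hypothetical `retrieve` helper that scores every stored chunk against the embedded query and returns the best matches; it assumes the store is a list of `{"text": ..., "vector": ...}` records like the one built in the embedding phase.

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means they point
    the same direction (very similar), 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec: list[float], store: list[dict], top_k: int = 3):
    """Score every stored chunk against the embedded query and
    return the top_k (score, text) pairs, best first."""
    scored = [
        (cosine_similarity(query_vec, rec["vector"]), rec["text"])
        for rec in store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A hybrid-search variant would combine these scores with keyword-match scores before ranking, which is how robust pipelines catch exact terms that semantic similarity alone might miss.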

Step 5: Generation

Generate a response. The LLM receives the original text of the retrieved chunks, not the vectors themselves, and interprets the query alongside that retrieved context and a system prompt. The system prompt, which can be defined in the RAG pipeline, instructs the LLM to use the retrieved info to answer the question. The LLM usually goes a step further, not only providing the relevant info but also shaping it into a conversational response.
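The hand-off to the LLM might look something like this sketch, where `build_prompt` is a hypothetical helper and the message structure is generic rather than any specific provider’s API format.

```python
def build_prompt(query: str, retrieved_chunks: list[str]) -> dict:
    """Assemble the final LLM request: a system prompt instructing the
    model to rely on the retrieved context, plus the user's query."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    system = (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
    return {"system": system, "user": query}
```

Instructing the model to admit when the context lacks an answer is one of the simplest levers a pipeline has against hallucination.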

By providing an LLM with additional, highly specific context, RAG pipelines have proven to reduce hallucinations, since the AI draws on trusted, human-verified sources. All of that leads to more accurate, more useful responses.

Pros and Cons

Just because a solution is out there doesn’t always mean it works for all situations. RAGs won’t solve all hallucination problems. But they excel when you have well-documented, text-based information that people want to query conversationally.

Pros

  • RAG workflows are reusable and versatile. You can build your database from blog articles, marketing sites, design system documentation, or PDF documents, and even point it at internal systems like Jira and Slack.

  • RAGs can offer improved search functionality because the vector database makes the information extremely fast for AI to search through.

  • RAG can work with your current CMS via API connections and integrations, or periodic database exports.

  • Humans control what sources are used and what data goes into the database.

  • With your data in a database you control, RAG can be more secure than uploading your documents to a public AI service.

Cons

  • Hallucinations aren’t eliminated completely. If the info being loaded into the vector database is incorrect or has conflicting information, hallucinations can still occur.

  • Humans still need to be in the loop. Without the expertise and oversight we provide by doing regular maintenance, performing content audits, monitoring user feedback, or implementing output review systems, the RAG pipeline could become inaccurate and unreliable.

  • If the data or content you’re storing updates often, the database has to be rebuilt at certain intervals to keep everything updated and accurate.

  • Building and maintaining a tool like this requires more engineering time and expertise than out-of-the-box LLMs.

The key is treating RAG as a living system that requires ongoing care, not a “set it and forget it” solution.

Tools We’re Looking At

The RAG ecosystem offers a growing toolkit of specialized components. Here are some notable options we’ve explored, organized by their role in the pipeline:

  • Cohere Embedding Models: Embedding models optimized for semantic search and retrieval tasks.

  • Pinecone Vector Storage: Enterprise-level vector database with built-in security and reliability.

  • Chroma DB: Vector database designed for speed and flexibility with support for multiple search methods.

  • Facebook AI Similarity Search (FAISS): A library for efficient similarity search across large vector datasets.

  • Haystack: Ready-to-use RAG framework, with pre-built workflows and components for enterprise applications.

  • LangChain Tutorials: Extensive RAG tutorials and orchestration tools for building AI agents and LLM applications.

Next Steps in Our Research

Several team members on the AI Expedition team have individually experimented with building RAG pipelines. After comparing notes, we came up with several open questions that we’ll be exploring in the future:

Architecture and Modularity

How can we build modular components instead of using out-of-the-box RAG frameworks? How can we make it easier to swap tools like embedding models, LLMs, or vector databases? What would the repository architecture look like if we support configurability?

Tool Optimization

What are the best tools or approaches for each of the RAG phases? Is it better to use a vector database, or just store the vectors locally? How many dimensions do we recommend for embedding models? Does the sentence chunking strategy work effectively when processing code? Which LLM would we recommend to ourselves?

Testing and Deployment

Do we need to support some type of containerization? How do we introduce continuous integration for this tool? How do we validate RAG pipeline accuracy and performance?

Advanced RAG Features

What other features of RAG systems might we still want to incorporate into our prototypes? How can we integrate re-ranking? Do we want to support conversational memory? Would we ever consider creating a front end interface for a RAG?

These are just some of the questions we’re actively exploring. Our experiments continue to deepen our understanding of RAG systems and uncover potential use cases for RAG pipelines.

What Are You Curious About?

Ask Sparkbox about how AI might be valuable to your organization.

Watch this space for monthly updates on what we’re exploring, what we’ve learned, and what’s next. And if you have ideas or questions, we’d love to hear from you.

Want to talk about how we can work together?

Katie can help


Katie Jennings

Vice President of Business Development