Unlocking the Power of Next-Gen AI: The Ultimate Guide to RAG, Fine-Tuning, Vector Databases, and LLM Evaluation

Will Kencel
12 min read · May 21, 2024


In the past year, new technologies built on top of LLMs have been astounding the world. Let me share what I’ve learned from using these emergent technologies while pursuing my Master’s in AI and researching the advancements in LLMs, vector databases, and RAG, drawing mostly on articles from Fall 2023 onward.

This is geared toward the engineer who has gone through a tutorial or maybe a quick start and now wants to dive deeper: “How do I make my model better, specialized to my data?”, “Could I make one LLM specialized as a financial advisor and another as a personal trainer?”


Well, my friend, you are asking the right questions! Welcome to the new world of making LLMs that are specialized to your needs and data. When function calling was released by OpenAI in mid-2023, it was primarily unveiled as a way to call APIs to get current data unavailable to OpenAI’s model, but the real value of this implementation was only just being realized. What if those function calls could regularly bring in current, specific data to ground the LLM’s responses? That is where RAG comes into play.

Retrieval-Augmented Generation (RAG) uses a retrieval function to bring in specified data, usually formatted for interpretation by the LLM, with a vector database storing this information so it can be reused for later user queries. Talk about some powerful technology! Here’s what I’ll be covering in this article:

  1. Retrieval Augmented Generation: RAG vs Fine-Tuning for LLMs
  2. Vector Databases
  3. How to evaluate your LLM

1. RAG (Retrieval Augmented Generation)

For our technical folks, let’s define retrieval-augmented generation as enhancing a model’s ability to generate responses by augmenting its knowledge with information retrieved in real time from external databases or documents. This “use our own data” approach enables the model to deliver more accurate responses for queries pertinent to proprietary data, allowing our LLMs to provide higher-quality answers than they could from their preexisting knowledge base, which is typically trained on public data.

Fig. 1. RAG system, from left to right: user query, retrieval of added data, full prompt to the model, response to the user

The development of RAG can be categorized into three distinct types:

a. Naive RAG

b. Advanced RAG

c. Modular RAG

Each type builds upon the last to overcome previous limitations and improve performance.

Fig. 2. Comparison of RAG pipelines — Left: Naïve RAG pipeline; Right: Advanced RAG pipeline.

a. Naive RAG: This is the foundational form of RAG, focusing on a straightforward process of indexing, retrieving, and generating responses. Indexing involves breaking down and encoding documents into our vector database in a searchable format (vectors or embeddings). During retrieval, when a question is asked, the system seeks out the most relevant information based on this indexed data. Generation then takes this information, alongside the initial question, to craft a coherent answer. Despite its simplicity, Naive RAG struggles with retrieval accuracy, which may lead to irrelevant information being used, and with generation accuracy, as the model might “hallucinate” details not present in the retrieved data, a phenomenon anyone working with LLMs is all too familiar with.
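
To make this concrete, here is a minimal sketch of the Naive RAG loop in Python. It assumes the sentence-transformers package for embeddings; the documents, model choice, and top-k value are illustrative, and `call_llm` is a hypothetical stand-in for whatever chat-completion API you use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# --- Indexing: break documents down and encode them as vectors (embeddings) ---
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 for enterprise customers.",
    "The API rate limit is 100 requests per minute per key.",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in your chat-completion client here.
    raise NotImplementedError

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval: find the k documents most similar to the query."""
    query_vector = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity (vectors are normalized)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str) -> str:
    """Generation: combine the retrieved context and the question into one prompt."""
    context = "\n".join(retrieve(query))
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```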

b. Advanced RAG: Improving upon Naive RAG, Advanced RAG introduces strategies before and after retrieval, known as pre-retrieval and post-retrieval processes, to enhance the precision and relevance of the information retrieved. Pre-retrieval includes optimizing how data is indexed by incorporating more nuanced data segmentation and metadata. Post-retrieval involves re-ranking and context compression to prioritize the most pertinent information and streamline it into a focused prompt that is combined with the original user query. This enriched prompt serves as a comprehensive, focused input for the LLM. Advanced RAG aims to refine retrieval quality significantly, addressing the precision and coherence issues observed in Naive RAG.
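
As an example of a post-retrieval step, below is a hedged sketch of re-ranking with a cross-encoder, again using sentence-transformers. The model name and candidate passages are illustrative; in practice the candidates would come from the first-stage vector search.

```python
from sentence_transformers import CrossEncoder

query = "What is the refund policy?"
# Candidates from first-stage (vector) retrieval, before re-ranking.
candidates = [
    "Returns are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]

# A cross-encoder scores each (query, passage) pair jointly, which is slower
# than comparing independent embeddings but usually more precise.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model
scores = reranker.predict([(query, doc) for doc in candidates])

# Context compression: keep only the highest-scoring passages for the prompt.
reranked = sorted(zip(scores, candidates), reverse=True)
top_context = [doc for _, doc in reranked[:2]]
print(top_context)
```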

Fig. 3. Patterns in modular RAG

c. Modular RAG: The most sophisticated iteration, Modular RAG, offers a highly adaptable structure that introduces specific functional modules to the RAG architecture. These modules provide a more flexible and powerful system: search modules for improved data retrieval across various platforms, memory modules for leveraging the LLM’s memory in retrieval, and task adapter modules for customizing outputs to specific tasks. Modular RAG also integrates new components such as rewrite modules for refining retrieval queries, re-rank modules for prioritizing relevant information, and fusion modules for combining data from multiple sources. Additionally, routing mechanisms guide queries to the appropriate retrieval or processing paths, and predictive modules help ensure the generated content is accurate and contextually relevant. Demonstration modules further enhance the system by showcasing how retrieved information can be used effectively, providing examples or templates that guide the generation process. This advanced architecture significantly enhances the capabilities of RAG systems, making them more versatile and efficient. For those interested in the detailed mechanics and benefits of Modular RAG, the comprehensive survey by Gao et al. (2024) offers an in-depth analysis and discussion of the topic.
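
To illustrate just one of these ideas, here is a toy sketch of a routing mechanism that sends a query either down a vector-search path or a live-API path. The routing rule and both handlers are assumptions invented for illustration; real modular RAG systems often use an LLM itself as the router.

```python
from typing import Callable

# Hypothetical handlers; a real system would plug in a vector store client
# and an external API client here.
def vector_search_path(query: str) -> str:
    return f"[vector store results for: {query}]"

def live_api_path(query: str) -> str:
    return f"[fresh API data for: {query}]"

ROUTES: dict[str, Callable[[str], str]] = {
    "static_knowledge": vector_search_path,
    "current_data": live_api_path,
}

def route(query: str) -> str:
    # Naive keyword router; production routers usually classify with an LLM.
    fresh = any(w in query.lower() for w in ("today", "current", "latest"))
    return ROUTES["current_data" if fresh else "static_knowledge"](query)

print(route("What is the latest stock price of ACME?"))  # live-API path
print(route("Summarize our refund policy."))             # vector-store path
```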

Each of these three types of RAG systems represents a step forward in developing more intelligent, responsive, and context-aware LLMs, demonstrating the field’s rapid progression and innovation. These advancements not only enhance the capabilities of LLMs but also open new horizons for knowledge-intensive applications and tasks.

Fine-tuning

Fig. 4. High-level diagram of LLM fine-tuning

In addition to RAG, Fine-Tuning (FT) can also be used to customize an LLM. Fine-tuning is the process of adjusting a pre-trained model on a specific, often narrower, dataset or task to enhance its performance in that particular domain. Here, it is necessary to distinguish between different types of fine-tuning. FT techniques are commonly classified into Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

a. Supervised Learning

b. Unsupervised Learning

c. Reinforcement Learning

Fig. 5. Different types of fine-tuning

a. Supervised Fine-Tuning is when a pre-trained model receives sets of labeled input-output pairs, such as movie reviews labeled with positive or negative sentiment. The model is trained on that data so it learns to recognize the sentiment of new movie reviews. Another example is giving the model question-answer pairs to fine-tune it for answering specific kinds of questions.
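
Here is a minimal sketch of supervised fine-tuning on the sentiment example, using the Hugging Face transformers and datasets libraries. The base model, hyperparameters, and three-example dataset are placeholders; a real run needs far more data.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny labeled dataset: movie reviews paired with sentiment (1=positive, 0=negative).
data = Dataset.from_dict({
    "text": ["A stunning, heartfelt film.", "Two hours I will never get back.",
             "The cast elevates a thin script."],
    "labels": [1, 0, 1],
})

model_name = "distilbert-base-uncased"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

tokenized = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-ft", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
)
trainer.train()  # the model adjusts its weights on the labeled pairs
```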

b. Diverging from the reliance on labeled datasets, Unsupervised Fine-Tuning, often termed continual pre-training or unstructured FT, relies on the model’s ability to learn from unstructured, unlabeled data. This strategy serves as a direct extension of the pre-training phase, where the optimization emphasizes token prediction in a causal autoregressive manner, with a cautiously lowered learning rate to prevent catastrophic forgetting (Luo et al., 2023). Given LLMs’ profound capacity to assimilate knowledge during pre-training, this method seeks to perpetuate that learning curve, enriching the model’s repository of information without the guardrails of labeled examples. Unsupervised Fine-Tuning hence plays a crucial role in continuously expanding the model’s knowledge base and adaptability.
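
A hedged sketch of what continual pre-training can look like with the same transformers stack: causal (next-token) language modeling on raw domain text with a deliberately small learning rate. The corpus, model, and learning rate here are placeholders, not recommendations.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Unlabeled domain text; the model simply learns to predict the next token.
corpus = Dataset.from_dict({"text": [
    "Quarterly revenue grew 12% on strong subscription renewals.",
    "The new compliance rules take effect in the next fiscal year.",
]})

model_name = "gpt2"  # illustrative small causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenized = corpus.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continual-pt", num_train_epochs=1,
                           learning_rate=1e-5),  # low LR to limit catastrophic forgetting
    train_dataset=tokenized,
    # mlm=False means causal next-token prediction, matching pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```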

c. The Reinforcement Learning (RL) approach engages the model in a system of rewards and penalties to help it learn to achieve the desired outcome on specific tasks. A pivotal example of RL was AlphaGo, the DeepMind system trained on the game of Go that went on to defeat a world champion player. These RL methods are particularly beneficial when coupled with instruction tuning, focusing on the quality and behavior of the response rather than the extent of the model’s knowledge. They offer a sophisticated way to align the model’s outputs with human preferences or desired behaviors.

While Fine-Tuning can significantly enhance an LLM’s performance, most of the studies I researched leaned toward RAG as the better way to optimize pre-trained models with knowledge injection. Even so, FT can sometimes be the more effective strategy, particularly for tasks where the requirements are more behavioral than knowledge-based.

2. Vector Databases: The Multi-Dimensional Powerhouses

Vector databases are foundational to the advancement of RAG systems, and they are just as multi-dimensional as the data they represent. Each piece of information within these databases is encoded into a high-dimensional vector (an embedding), allowing for representation of complex data such as language.

How are Vector Databases multi-dimensional?

Visualizing more than three dimensions can be challenging, so let’s use an example to illustrate the concept. Imagine we have two tinted pieces of paper (representing two instances of a 2-dimensional plane). On tinted paper 1, we place point A and point B near it. On tinted paper 2, we place point A in the same spot and point C near it. When we layer the tinted papers on top of each other, point A on each paper is considered the same point A. However, point B and point C exist in different dimensions, making them not close or “similar.” Here is an illustration of this concept:

Fig. 6. Diagram of multidimensional nature of vector databases

Each dimension corresponds to a feature or attribute of the data, capturing the essence of its meaning. For example, when encoding words into vectors (word embeddings), synonymous or related words are positioned closer to each other in this high-dimensional space, facilitating the retrieval of linguistically and contextually relevant information.

To further explain, imagine point A represents the word “apple,” point B represents “delicious,” and point C represents “$470.” In this example, the red plane could relate to the attribute of flavor, while the green plane could relate to the number of apples sold. “Apple” has a similarity to “delicious” on the red plane based on taste, just as “apple” has a similarity to “$470” on the green plane based on sales. “Delicious” and “$470” do not share the same dimension and thus do not have a similarity, even though it may seem so when viewed from a 3D perspective.

Fig. 7. 3D model of vector database
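
We can mimic the apple example numerically. Below is a toy numpy sketch where each vector has two dimensions, one for the flavor attribute and one for the sales attribute; the numbers are invented purely to make the geometry visible. Cosine similarity is the standard measure of this kind of closeness.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-dimensional embeddings: [flavor attribute, sales attribute].
apple     = np.array([0.9, 0.8])  # scores on both dimensions
delicious = np.array([1.0, 0.0])  # purely a flavor concept
price_470 = np.array([0.0, 1.0])  # purely a sales concept

print(cosine_similarity(apple, delicious))      # high: close on the flavor dimension
print(cosine_similarity(apple, price_470))      # high: close on the sales dimension
print(cosine_similarity(delicious, price_470))  # 0.0: no shared dimension
```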

This multi-dimensional approach is not just a theoretical marvel; it’s the bridge that allows LLMs to interact with unstructured data in an almost human-like manner. With each additional dimension, the vector database becomes more capable of discerning fine-grained nuances, offering richer context and precision in retrieval tasks.

In essence, vector databases transform words into mathematical entities that can be computed, compared, and queried, which, in turn, supercharges RAG systems and other AI applications that rely on understanding and processing natural language at scale. For us software engineers, mastering vector databases means unlocking the ability to create applications with a depth of understanding that mirrors human expertise.
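
In practice you rarely roll your own similarity search; a library or database handles the indexing and querying. Here is a minimal sketch using FAISS (a vector search library rather than a full database, but the query pattern is the same); the dimensionality, random stand-in vectors, and neighbor count are placeholders.

```python
import faiss  # pip install faiss-cpu
import numpy as np

dim = 128  # embedding dimensionality (real embeddings are often 384-3072)
rng = np.random.default_rng(0)

# Stand-in document embeddings; in a real system these come from an encoder.
doc_vectors = rng.random((1000, dim), dtype=np.float32)
faiss.normalize_L2(doc_vectors)  # normalize so inner product = cosine similarity

index = faiss.IndexFlatIP(dim)   # exact inner-product index
index.add(doc_vectors)

query = rng.random((1, dim), dtype=np.float32)
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)  # 5 nearest neighbors
print(ids[0], scores[0])  # document ids and their cosine similarities
```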

3. Large Language Model (LLM) Evaluation

LLM Evaluation: Ensuring Faithfulness with RAGAS

As we immerse ourselves in the AI epoch, the evaluation of LLMs becomes as integral as their development. Recent advancements have introduced sophisticated evaluation methods that push the boundaries of how we measure a model’s performance.

The Role of RAGAS in LLM Evaluation

Let’s examine RAGAS (Retrieval Augmented Generation Assessment), an evaluation framework designed for RAG systems that uses an LLM to compute its evaluation metrics. It represents a paradigm shift in how we assess the quality of an LLM’s outputs. Traditional metrics like BLEU or ROUGE, which depend on reference texts, are limited in their ability to capture the fidelity of information that an LLM generates based on external data.

BLEU (Bilingual Evaluation Understudy) is a metric that measures the quality of text generated by a model by comparing it to one or more reference translations. It evaluates the precision of n-grams (contiguous sequences of n items) in the candidate translation against the reference translations. ROUGE (Recall-Oriented Understudy for Gisting Evaluation), on the other hand, focuses on recall by comparing the overlap of n-grams, word sequences, and word pairs between the generated text and reference summaries. While BLEU is widely used for machine translation tasks, ROUGE is commonly applied in evaluating summarization tasks.
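
For a feel of how these reference-based metrics behave, here is a small sketch computing sentence-level BLEU with NLTK and a hand-rolled ROUGE-1 recall. The example sentences are invented, and a real evaluation would use a dedicated ROUGE implementation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

# BLEU: n-gram precision of the candidate against the reference(s).
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")

# ROUGE-1 recall: fraction of reference unigrams recovered by the candidate
# (counts clipped so repeated words are not over-credited).
overlap = sum(min(reference.count(w), candidate.count(w)) for w in set(reference))
print(f"ROUGE-1 recall: {overlap / len(reference):.3f}")
```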

Instead, RAGAS operates on the premise that a well-functioning RAG system should produce responses that are both contextually relevant and faithful to the source material. To achieve this, RAGAS uses the aforementioned secondary LLM, which serves as the evaluator.

How RAGAS Works

The evaluator model in RAGAS is separate from the one being evaluated. It is prompted to judge the faithfulness and relevance of the generated responses without having been exposed to the same training materials. This model examines the generated text’s faithfulness by cross-referencing the information with the data that the primary LLM retrieved.

Fig. 8. Example of RAGAS, left to right: question to LLM, RAGAS response, answer from LLM, RAGAS context

By comparing the generated responses to the retrieved passages, RAGAS defines a set of metrics to measure how well the generated content aligns with the source material, including faithfulness to the retrieved context, answer relevancy, and the precision of the retrieval itself. This approach offers a multi-faceted view of an LLM’s performance, emphasizing not just the correctness but also the origin and reliability of the information being presented.
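
In code, running RAGAS looks roughly like the sketch below, based on the ragas Python package around version 0.1; the API has shifted between releases, so treat the imports and column names as assumptions to verify against the docs. An evaluator LLM must be configured at runtime, for example via an OpenAI API key in the environment.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One evaluation record: the user question, the RAG system's answer,
# and the passages the system retrieved to produce that answer.
records = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our refund policy allows returns within 30 days of purchase."]],
})

# The evaluator LLM scores each record on the chosen metrics.
result = evaluate(records, metrics=[faithfulness, answer_relevancy])
print(result)  # e.g. {'faithfulness': 0.95, 'answer_relevancy': 0.91}
```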

Implications for Practice

This is more than a set of metrics about the LLM; it is actionable insight. The ability of RAGAS to dissect the various aspects of the LLM’s performance allows us to fine-tune not just for coherence or fluency, but for factual accuracy and reliability. By utilizing the insights provided by RAGAS, we can iteratively refine our models, ensuring they serve as trustworthy partners in tasks that require a high degree of information fidelity.

In essence, RAGAS exemplifies the comprehensive, rigorous, and nuanced approach toward LLM evaluation that’s required in our current technological landscape. It’s a testament to the maturation of the field, reflecting a commitment not only to achieving remarkable feats of language generation but also to finding clever ways to improve them.

Together, with vector databases powering multi-dimensional search and evaluation frameworks like RAGAS measuring our models, we stand at the cusp of an era where AI’s promise is matched only by its performance. As software engineers, it’s our responsibility to harness these tools vigilantly, crafting LLMs that don’t merely mimic understanding but embody the depth and faithfulness essential for true AI advancement.

Sources

Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems.

Jing, Z., Su, Y., Han, Y., Yuan, B., Xu, H., Liu, C., Chen, K., & Zhang, M. (2024). When large language models meet vector databases: A survey. Carnegie Mellon University, Purdue University, University of Michigan, Harbin Institute of Technology (Shenzhen), National Science Library (Chengdu), Chinese Academy of Sciences, Shandong University of Technology.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2024). Retrieval-augmented generation for large language models: A survey. Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University; Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; College of Design and Innovation, Tongji University.

Kan, D. (2022). How to choose a vector database. Medium. https://medium.com/how-to-choose-a-vector-database-8c6e6f0f8f8b

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. UC Berkeley, UC San Diego, Carnegie Mellon University, Stanford, MBZUAI.

Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., & Zhang, Y. (2023). An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv, abs/2308.08747. https://doi.org/10.48550/arXiv.2308.08747

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W.-t. (2020). Dense passage retrieval for open-domain question answering. Facebook AI, University of Washington, Princeton University.

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated evaluation of retrieval augmented generation. Exploding Gradients; CardiffNLP, Cardiff University; AMPLYFI, United Kingdom.
