LLM - A RAG impl based on modelscope and faiss
Before the blog
I have used some useful LLM-with-RAG inference frameworks and tools in the past, and they were quite convenient and effective. Recently I have been working on a project that is closely related to both of these things, and I need to dive into the underlying implementation of LLM with RAG. That's why I am writing this blog.
Retrieval Augmented Generation (RAG) is a technology that assists a large language model (LLM) in generating answers by retrieving information from a data source.
modelscope and faiss are two Python packages. modelscope provides a set of APIs for downloading and controlling pretrained LLMs, and faiss provides an efficient vector index (a vector database), which means we can implement a RAG system entirely in Python code.
Create python venv
There are many ways to prepare a Python venv; here is a conda example:
```bash
conda create -n RAG python=3.12
```
The list of external Python packages is as follows:
```text
torch
```
Use pip install -r requirements.txt
to install them all.
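For reference, a requirements.txt covering everything used in this post would presumably look something like this (the exact package names, e.g. faiss-cpu versus a GPU build, are assumptions):

```text
torch
modelscope
transformers
faiss-cpu   # CPU build of FAISS; GPU builds are typically installed via conda
numpy
```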
A RAG impl
Some hyperparameter values are defined below:
```python
model_id = "qwen/Qwen2-0.5B"
```
- model_id: selects the embedding and generation LLM.
- device: selects the compute device.
- dim: the vector dimension of the RAG vector database.
However, the embedding and generation models can be different; for example, you could use BERT to embed the sentences and Llama 3 to generate. In this RAG impl, we only use Qwen2-0.5B-Base.
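Plausible values for device and dim, assuming a CUDA machine (896 is the hidden size of Qwen2-0.5B, and therefore the dimension of the mean-pooled embeddings used later):

```python
device = "cuda"  # assumed; use "cpu" if no GPU is available
dim = 896        # assumed; hidden size of Qwen2-0.5B, i.e. the embedding dimension
```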
Get the LLM
We define a function get_model_tokenizer()
that downloads a specified language model and its tokenizer using the ModelScope library.
The function uses ModelScope's snapshot_download() to retrieve the model files into a local cache directory. Then it initializes a causal language model (AutoModelForCausalLM) and its corresponding tokenizer (AutoTokenizer) from the downloaded path. The model is configured to automatically use the appropriate data type and device, and the function returns both the model and tokenizer for further use.
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download
```
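Sketching out the rest of such a function, assuming the default ModelScope cache location and automatic dtype/device placement:

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download

def get_model_tokenizer(model_id):
    # Download the model files (or reuse the local ModelScope cache).
    model_dir = snapshot_download(model_id)
    # Load the causal LM with automatic dtype and device placement.
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype="auto", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer
```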
Embed the knowledge
Then we define a function text_to_vector() that converts a piece of text into a vector using a pre-trained model and tokenizer.
It first tokenizes the input text, preparing it as a tensor suitable for the model. Then, within a torch.no_grad()
context (to prevent gradient calculation), it passes the tokenized input to the model, obtaining the outputs and the hidden states.
The function extracts the last layer of hidden states, averages them across the sequence, converts them to a float format, and finally returns the vector as a NumPy array.
```python
def text_to_vector(model, tokenizer, text):
```
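A minimal sketch of the full function following those steps; requesting output_hidden_states=True at call time is one way to obtain the hidden states:

```python
import torch

def text_to_vector(model, tokenizer, text):
    # Tokenize the text and move the tensors to the model's device.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # Run the model without gradient tracking and request the hidden states.
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Take the last layer's hidden states and average over the sequence.
    last_hidden = outputs.hidden_states[-1]      # shape: (1, seq_len, dim)
    vector = last_hidden.mean(dim=1).squeeze(0)  # shape: (dim,)
    return vector.float().cpu().numpy()
```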
Manage the vector database
We use the get_model_tokenizer() and text_to_vector() functions defined above: first we get the model and tokenizer, then we define a list of texts to process. A FAISS index (IndexFlatIP) is created for storing L2-normalized vectors and comparing them by inner product (i.e., cosine similarity). For each text in the list, we convert it to a vector, normalize it using L2 normalization, and add it to the FAISS index, enabling efficient similarity search later.
```python
model, tokenizer = get_model_tokenizer(model_id)
```
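Continuing from the functions above, the setup might look like the following sketch; the texts list is only an illustrative placeholder for your own knowledge base:

```python
import faiss
import numpy as np

model, tokenizer = get_model_tokenizer(model_id)

# Illustrative knowledge base (replace with your own documents).
texts = [
    "Haystack is an open-source framework for building search and RAG pipelines.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Qwen2-0.5B is a small causal language model available on ModelScope.",
]

# Inner-product index; with L2-normalized vectors this equals cosine similarity.
index = faiss.IndexFlatIP(dim)

for text in texts:
    vec = text_to_vector(model, tokenizer, text).reshape(1, -1)
    vec = np.ascontiguousarray(vec, dtype=np.float32)
    faiss.normalize_L2(vec)  # in-place L2 normalization
    index.add(vec)
```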
Query and get knowledge
We convert the query text into a vector and normalize it with L2 normalization, then search the FAISS index for the top k (in this case, 3) most similar vectors to the query vector. The indices of these similar texts are stored in I, and the corresponding texts are printed out. This effectively retrieves and displays the texts that are closest to the query based on their vector representations.
```python
query_text = "What is Haystack."
```
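A sketch of the retrieval step, reusing the index and texts built above (D and I are the score and index arrays returned by FAISS):

```python
query_text = "What is Haystack."

# Embed and L2-normalize the query exactly like the indexed texts.
query_vec = text_to_vector(model, tokenizer, query_text).reshape(1, -1)
query_vec = np.ascontiguousarray(query_vec, dtype=np.float32)
faiss.normalize_L2(query_vec)

# Retrieve the top-k most similar texts; D holds scores, I holds indices.
k = 3
D, I = index.search(query_vec, k)
for idx in I[0]:
    print(texts[idx])
```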
Generate response
This code constructs a chat prompt using the most similar text from the FAISS index and the user’s query. It first creates a messages
list containing a system message with the most relevant text and the user’s prompt. The apply_chat_template
method formats these messages into a suitable input string for the model.
```python
prompt = query_text
```
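A sketch of the prompt construction; the exact wording of the system message is an assumption:

```python
prompt = query_text
# Use the most similar retrieved text as grounding context in the system message.
retrieved = texts[I[0][0]]
messages = [
    {"role": "system", "content": f"You are a helpful assistant. Answer using this context: {retrieved}"},
    {"role": "user", "content": prompt},
]
```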
Next, it tokenizes the formatted string and moves it to the appropriate device (e.g., GPU). The model then generates a response by predicting up to 512 new tokens based on the input. After generating the output, it extracts only the new tokens (excluding the original input tokens) and decodes them back into a readable string format. Finally, it prints the response from the model.
```python
text = tokenizer.apply_chat_template(
```
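The remaining generation steps, sketched in the usual Qwen2 chat style with max_new_tokens=512 as described above:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return a formatted prompt string
    add_generation_prompt=True,  # append the assistant turn for generation
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
# Keep only the newly generated tokens, dropping the prompt tokens.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```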