Before the blog

I have used several LLM-with-RAG inference frameworks in the past, and they were quite convenient and effective. Recently I have been working on a project closely related to both of these things, and I need to dive into the underlying implementation of LLM with RAG. That is why I am writing this blog.

Retrieval-Augmented Generation (RAG) is a technique that helps a large language model (LLM) generate answers by retrieving relevant information from a data source.

modelscope and faiss are two Python packages. modelscope provides a set of APIs for downloading and running pretrained LLMs, and faiss provides a vector index, which means we can implement a RAG system entirely in Python code.

Create a Python venv

There are many ways to prepare a Python venv; here is a conda example:

conda create -n RAG python=3.12
conda activate RAG
python --version # 3.12

The required external Python packages (requirements.txt) are as follows:

torch
modelscope
packaging
transformers
accelerate
faiss-cpu

Use pip install -r requirements.txt to install them all.

A RAG implementation

Some hyperparameters are defined below:

model_id = "qwen/Qwen2-0.5B"
device = "cuda:1"
dim = 896
  • model_id: the model used for both embedding and generation.
  • device: the compute device to run on.
  • dim: the dimension of the vectors stored in the vector database (896 is the hidden size of Qwen2-0.5B).

The embedding model and the generation model can also be different, for example using BERT to embed the sentences and Llama 3 to generate the answers.

In this RAG implementation, we use only Qwen2-0.5B (the base model) for both roles.

Get the LLM

We define a function get_model_tokenizer() that downloads the specified language model and its tokenizer using the ModelScope library.

ModelScope uses snapshot_download() to retrieve the model files to a local cache directory. Then, it initializes a causal language model (AutoModelForCausalLM) and its corresponding tokenizer (AutoTokenizer) from the downloaded path.

The model is configured to automatically use the appropriate data type and device, and the function returns both the model and the tokenizer for further use.

from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download

def get_model_tokenizer(model_id):
    model_path = snapshot_download(
        model_id=model_id,
        cache_dir="./.modelscope/",
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        return_dict_in_generate=True,
        output_hidden_states=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer
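
For example, the model can be loaded and inspected to confirm that its hidden size matches the dim defined earlier (a quick sanity check; the same loading call appears again in the vector database section below):

model, tokenizer = get_model_tokenizer(model_id)

# Qwen2-0.5B has a hidden size of 896, which is why dim = 896 above.
print(model.config.hidden_size)  # 896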

Embed the knowledge

Next we define a function text_to_vector() that converts a piece of text into a vector using the pre-trained model and tokenizer.

It first tokenizes the input text, preparing it as a tensor suitable for the model. Then, within a torch.no_grad() context (to prevent gradient calculation), it passes the tokenized input to the model, obtaining the outputs and the hidden states.

The function extracts the last layer of hidden states, averages them across the sequence, converts them to float, moves them to the CPU, and finally returns the vector as a NumPy array.

import torch

def text_to_vector(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    inputs = inputs.to(model.device)  # keep the inputs on the same device as the model
    with torch.no_grad():
        outputs = model(**inputs)
    hidden_states = outputs.hidden_states[-1]  # last layer: [batch_size, seq_len, hidden_size]
    hidden_states = hidden_states.mean(dim=1)  # mean pooling over the sequence
    hidden_states = hidden_states.float()
    return hidden_states.cpu().numpy()  # move to CPU so FAISS can consume the array
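
For example (a quick check, assuming the model and tokenizer have already been loaded with get_model_tokenizer()), each call returns one vector per input text:

vec = text_to_vector(model, tokenizer, "Hello, how are you?")
print(vec.shape)  # (1, 896) -- one 896-dimensional vector, matching dim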

Manage the vector database

Use the get_model_tokenizer() function defined above to load the model and tokenizer; text_to_vector() will embed each knowledge text.

Then we define a list of texts that serve as the knowledge base. A FAISS index (IndexFlatIP) is created for inner-product search. For each text in the list, we convert the text to a vector, normalize it with L2 normalization (so that the inner product becomes cosine similarity), and add the vector to the FAISS index, enabling efficient similarity search later.

import faiss

model, tokenizer = get_model_tokenizer(model_id)

texts = [
    "Hello, how are you?",
    "I am learning about transformers and FAISS.",
    "This is an example of using FAISS with transformers.",
    "Haystack is an open-source framework for building search systems.",
]

index = faiss.IndexFlatIP(dim)  # inner-product index

for text in texts:
    text_vector = text_to_vector(model, tokenizer, text)
    # print(text_vector.shape) # [batch_size, embed_dim]
    faiss.normalize_L2(text_vector)  # L2-normalize so inner product == cosine similarity
    index.add(text_vector)
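
Re-embedding the whole knowledge base on every run can be avoided by persisting the index; FAISS provides write_index() and read_index() for this (a small sketch, the file name is arbitrary):

faiss.write_index(index, "knowledge.index")  # save the vectors to disk
index = faiss.read_index("knowledge.index")  # reload them in a later run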

Query and get knowledge

Convert the query text into a vector and normalize it with L2 normalization, just like the knowledge texts.

Then search the FAISS index for the top k (here, k = 1) most similar vectors to the query vector. The indices of the matched texts are stored in I (and their similarity scores in D), and the corresponding texts are printed out. This retrieves and displays the texts that are closest to the query based on their vector representations.

query_text = "What is Haystack?"
query_vector = text_to_vector(model, tokenizer, query_text)
faiss.normalize_L2(query_vector)

k = 1
D, I = index.search(query_vector, k)  # D: similarity scores, I: indices of the matched texts

for idx in I[0]:
    print("Similar text:", texts[idx])

Generate response

This code constructs a chat prompt from the most similar text retrieved from the FAISS index and the user's query. It first creates a messages list containing a system message with the most relevant text and a user message with the prompt. The apply_chat_template() method then formats these messages into an input string suitable for the model.

prompt = query_text
messages = [
    {"role": "system", "content": texts[I[0][0]]},
    {"role": "user", "content": prompt},
]

Next, it tokenizes the formatted string and moves it to the appropriate device (e.g., GPU). The model then generates a response by predicting up to 512 new tokens based on the input. After generating the output, it extracts only the new tokens (excluding the original input tokens) and decodes them back into a readable string format. Finally, it prints the response from the model.

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"{model_id}: ", response)