
Build a QA chatbot

In this tutorial, we will walk you through building a question-answering chatbot that combines a large language model (LLM) with custom data to create an interactive app capable of answering specialized questions. We'll guide you through uploading documents, indexing them for rapid search, and integrating a language model to generate responses based on user queries. Finally, we will deploy the chatbot as a microservice for easy access and use.

Step 1: Upload Data

Prepare Data

Select a set of relevant documents and store them in a one-column CSV file. Each row should contain either a public URL (e.g., .pdf, .txt) or the document's text. The CSV file must contain only one data type (URLs or text).

To learn more about using datasets, please refer to this guide.
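As an illustration, a one-column CSV of document URLs can be assembled with Python's built-in csv module (the column name and URLs below are placeholders; replace them with your own sources):

```python
import csv

# Hypothetical document URLs -- substitute your own.
rows = [
    ["https://example.com/handbook.pdf"],
    ["https://example.com/faq.txt"],
]

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["source"])  # single column, one data type (URLs only)
    writer.writerows(rows)
```

Keeping the file to exactly one column and one data type is what matters; the header name is up to you.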

Upload the Dataset

Here is an example dataset (change it to your dataset if needed):

import pandas as pd

data_url = "https://aixplain-platform-assets.s3.amazonaws.com/samples/tests/EnglishPortugueseExample.csv"
data = pd.read_csv(data_url)
data

Next, save the file locally so it can be uploaded.

upload_file = "data.csv"

data.to_csv(upload_file, index=False)  # index=False keeps the file to a single column

Define Metadata for the Dataset

Before uploading, we need to define metadata such as data type, storage type, and languages.

# Import paths as in the aiXplain SDK; adjust if your version differs.
from aixplain.modules import MetaData
from aixplain.enums import DataType, StorageType, Language

source_meta = MetaData(
    name="source",
    dtype=DataType.TEXT,
    storage_type=StorageType.TEXT,
    languages=[Language.English_UNITED_STATES]
)
input_schema = [source_meta]

Upload the Dataset to aiXplain

from datetime import datetime
from aixplain.factories import DatasetFactory
from aixplain.enums import Function, License

time_str = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
dataset_name = f"test_{time_str}"
payload = DatasetFactory.create(
    name=dataset_name,
    description="Search example dataset",
    license=License.MIT,
    function=Function.SEARCH,
    content_path=upload_file,
    input_schema=input_schema,
)
payload

Check the dataset's status. Once onboard_status is onboarded, you are ready to use it.

dataset_id = payload["asset_id"]
selected_dataset = DatasetFactory.get(dataset_id)
selected_dataset.__dict__
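Onboarding can take a while, so it is convenient to poll until the status flips. A minimal, SDK-agnostic polling helper might look like the following (the `check` callable and the "onboarded" target string are assumptions based on the onboard_status field described above):

```python
import time

def wait_for_status(check, target="onboarded", interval=5.0, timeout=600.0):
    """Call check() repeatedly until it returns target or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    status = check()
    while status != target:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout}s")
        time.sleep(interval)
        status = check()
    return status

# Usage sketch (assumes the dataset object exposes an onboard_status field):
# wait_for_status(lambda: DatasetFactory.get(dataset_id).onboard_status)
```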

Step 2: Index Data

To enable rapid searching within the dataset, we need to index it using Information Retrieval (IR) models or embeddings.

Select and Index the Dataset

List available search models and select one to index the dataset.

from aixplain.factories import ModelFactory

model_list = ModelFactory.list(function=Function.SEARCH, is_finetunable=True)["results"]
for model in model_list:
    print(model.__dict__)

selected_model = ModelFactory.get("6499cc946eb5633de15d82a1")
selected_model.__dict__
dataset_id = "YOUR_DATASET_ID"
selected_dataset = DatasetFactory.get(dataset_id)
selected_dataset.__dict__

Finetune the Model

Now, we finetune the model. You can learn more about how to finetune models in this guide.

from aixplain.factories import FinetuneFactory

time_str = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
finetune_name = f"test_{time_str}"
finetune = FinetuneFactory.create(finetune_name, [selected_dataset], selected_model)
finetune.__dict__
finetune.cost.to_dict()
finetune_model = finetune.start()
status = finetune_model.check_finetune_status()
status
import time

while status != "onboarded":
    status = finetune_model.check_finetune_status()
    print(f"Current status: {status}")
    time.sleep(10)
finetune_model = ModelFactory.get("6514426e1cfdf13eab753867")
finetune_model.__dict__
finetune_model.run("Dog videos")

Step 3: Configure the LLM as a QnA Model

Now, let's set up the language model that will generate answers using the indexed data.

Set Up the Model for Text Generation

List available text generation models and select one.

model_list = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
for model in model_list:
    print(model.__dict__)

gpt_model = ModelFactory.get("64d21cbb6eb563074a698ef1")
gpt_model.__dict__

Define the Prompt Template

Create a prompt template to be sent to the model, including placeholders for the question and documents.

prompt_template = """Generate an answer to the following question based on the documents:

Question:
@@question@@

Documents:
@@documents@@
"""
prompt_template
question = "List important topics in the marketing domain"
documents = ["Mkt doc 1 content", "Mkt doc 2 content", "Mkt doc 3 content"]
prompt = prompt_template.replace("@@question@@", question)
prompt = prompt.replace("@@documents@@", "\n".join(documents))
print(prompt)
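Since the same placeholder substitution recurs throughout the tutorial, it can be wrapped in a small helper (the function name is ours, not part of the SDK):

```python
def build_prompt(template: str, question: str, documents: list) -> str:
    """Fill the @@question@@ and @@documents@@ placeholders in a template."""
    prompt = template.replace("@@question@@", question)
    return prompt.replace("@@documents@@", "\n".join(documents))

template = """Generate an answer to the following question based on the documents:

Question:
@@question@@

Documents:
@@documents@@
"""

print(build_prompt(template, "What is churn?", ["Doc 1 text", "Doc 2 text"]))
```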

Step 4: Run the Solution

Now, we’ll create a function that retrieves relevant documents using the search model and then sends a prompt to the language model to generate a response.

Define the Functions

import requests

def retrieve_documents(search_model: Model, question: str):
    response = search_model.run(question)
    # Print the response to understand its structure
    print("Search Model Response:", response)

    try:
        url_list = [elem["uri"] for elem in response]
    except TypeError:
        print("Unexpected response format. Expected a list of dictionaries.")
        return []

    documents = []
    for url in url_list:
        response = requests.get(url)
        if response.status_code == 200:
            documents.append(response.text)
        else:
            print(f"Could not retrieve document from {url}")
    return documents

def qa_system(text_generation_model: Model, search_model: Model, question: str):
    prompt = """Generate an answer to the following question based on the documents:
Question:
@@question@@

Documents:
@@documents@@"""
    documents = retrieve_documents(search_model, question)
    prompt = prompt.replace("@@question@@", question)
    prompt = prompt.replace("@@documents@@", "\n".join(documents))
    print("Prompt:")
    print(prompt)

    response_gpt = text_generation_model.run([
        {"role": "user", "content": prompt},
    ])

    return response_gpt

Run the QnA System

finetune_model = ModelFactory.get("6514426e1cfdf13eab753867")
gpt_model = ModelFactory.get("64d21cbb6eb563074a698ef1")
question = "Which principle ancient scripts use?"
response = qa_system(gpt_model, finetune_model, question)
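In Step 5 the generated answer is read from response["data"]. If you are unsure of the response shape, a small defensive helper (ours, not part of the SDK) can extract the text without raising on unexpected formats:

```python
def extract_answer(response) -> str:
    """Return the generated text from a model response.

    Assumes the dict-with-"data" shape used elsewhere in this tutorial;
    falls back to the string form of the response otherwise.
    """
    if isinstance(response, dict) and "data" in response:
        return response["data"]
    return str(response)
```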

Step 5: Convert the QnA System into a Chatbot

Define the Chatbot Functions

Modify the qa_system function to keep track of conversation history.

import requests

def retrieve_documents(search_model: Model, question: str):
    response = search_model.run(question)
    # Print the response to understand its structure
    print("Search Model Response:", response)

    # Check if response is in the expected format
    try:
        url_list = [elem["uri"] for elem in response]
    except TypeError:
        print("Unexpected response format. Expected a list of dictionaries.")
        return []

    documents = []
    for url in url_list:
        response = requests.get(url)
        if response.status_code == 200:
            documents.append(response.text)
        else:
            print(f"Could not retrieve document from {url}")
    return documents

def qa_system(text_generation_model: Model, search_model: Model, user_query: str, messages: list):
    messages.append({"role": "user", "content": user_query})
    prompt = """Generate an answer to the following question based on the documents:
Question:
@@question@@

Documents:
@@documents@@"""

    documents = retrieve_documents(search_model, user_query)
    prompt = prompt.replace("@@question@@", user_query)
    prompt = prompt.replace("@@documents@@", "\n".join(documents))

    print("Prompt:")
    print(prompt)
    messages.append({"role": "system", "content": prompt})
    response_gpt = text_generation_model.run(messages)

    return response_gpt
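Because every turn appends both the user query and a system prompt containing the retrieved documents, the messages list grows quickly. A simple sketch for capping its length before each model call (the limit is arbitrary; a production system would count tokens rather than messages):

```python
def trim_history(messages: list, max_messages: int = 20) -> list:
    """Keep only the most recent messages so the context stays bounded."""
    if len(messages) <= max_messages:
        return messages
    return messages[-max_messages:]
```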

Interactive Chat

finetune_model = ModelFactory.get("6514426e1cfdf13eab753867")
gpt_model = ModelFactory.get("64d21cbb6eb563074a698ef1")

messages = []
print("Assistant: \n\t How can I help you?")
query = input()
while query != "quit":  # type "quit" to end the conversation
    print(f"User: \n\t {query}")
    response = qa_system(gpt_model, finetune_model, query, messages)
    print(response)
    answer = response["data"]
    print(f"Assistant: \n\t {answer}")
    query = input()

You can also convert the chatbot into an agent; refer to this guide for information on how to build an agent.

You've successfully built a question-answering chatbot using aiXplain's SDK. This chatbot can search through your custom dataset, use a language model to generate context-based answers, and even handle interactive conversations. Feel free to explore more functionalities and customize this solution to suit your needs!