Build a QA chatbot
In this tutorial, we will walk you through building a question-answering chatbot that combines a large language model (LLM) with custom data to create an interactive app capable of answering specialized questions. We'll guide you through uploading documents, indexing them for rapid search, and integrating a language model to generate responses based on user queries. Finally, we will deploy the chatbot as a microservice for easy access and use.
Step 1: Upload Data
Prepare Data
Select a set of relevant documents and store them in a one-column CSV file. Each row should contain either a public URL (e.g., .pdf, .txt) or the document's text. The CSV file must contain only one data type (URLs or text).
To learn more about using datasets, please refer to this guide.
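For example, a minimal one-column CSV of raw text passages could be built like this (the column name and contents are illustrative, not required by the SDK):

import pandas as pd

# Illustrative: one column, one data type (raw text) per file.
docs = pd.DataFrame({
    "source": [
        "Marketing covers segmentation, targeting, and positioning.",
        "Content marketing builds trust through useful material.",
    ]
})
docs.to_csv("my_documents.csv", index=False)  # index=False keeps the CSV one-column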
Upload the Dataset
Here is an example dataset (change it to your dataset if needed):
import pandas as pd

data_url = "https://aixplain-platform-assets.s3.amazonaws.com/samples/tests/EnglishPortugueseExample.csv"
data = pd.read_csv(data_url)
data
Next, we save the file locally so that it can be uploaded.
upload_file = "data.csv"
data.to_csv(upload_file, index=False)  # index=False keeps the file one-column
Define Metadata for the Dataset
Before uploading, we need to define metadata such as data type, storage type, and languages.
from aixplain.modules import MetaData
from aixplain.enums import DataType, Language, StorageType

source_meta = MetaData(
    name="source",
    dtype=DataType.TEXT,
    storage_type=StorageType.TEXT,
    languages=[Language.English_UNITED_STATES]
)
input_schema = [source_meta]
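If your CSV column holds public URLs rather than raw text, the storage type changes accordingly. A sketch, assuming the SDK's StorageType.URL value:

# Sketch: metadata for a URL-based dataset (assumes StorageType.URL).
url_meta = MetaData(
    name="source",
    dtype=DataType.TEXT,
    storage_type=StorageType.URL,
    languages=[Language.English_UNITED_STATES]
)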
Upload the Dataset to aiXplain
from datetime import datetime
from aixplain.enums import Function, License
from aixplain.factories import DatasetFactory

time_str = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
dataset_name = f"test_{time_str}"

payload = DatasetFactory.create(
    name=dataset_name,
    description="Search example dataset",
    license=License.MIT,
    function=Function.SEARCH,
    content_path=upload_file,
    input_schema=input_schema,
)
payload
Check the dataset's status. Once onboard_status is onboarded, you are ready to use it.
dataset_id = payload["asset_id"]
selected_dataset = DatasetFactory.get(dataset_id)
selected_dataset.__dict__
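Onboarding may take a few minutes. A simple polling loop can wait for it; this is a sketch assuming the status is exposed as the onboard_status attribute shown in the dict above:

import time

# Poll until the dataset finishes onboarding (assumes an onboard_status attribute).
selected_dataset = DatasetFactory.get(dataset_id)
while str(selected_dataset.onboard_status).lower() != "onboarded":
    print(f"Current status: {selected_dataset.onboard_status}")
    time.sleep(10)
    selected_dataset = DatasetFactory.get(dataset_id)
print("Dataset onboarded.")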
Step 2: Index Data
To enable rapid searching within the dataset, we need to index it using Information Retrieval (IR) models or embeddings.
Select and Index the Dataset
List available search models and select one to index the dataset.
from aixplain.factories import ModelFactory

model_list = ModelFactory.list(function=Function.SEARCH, is_finetunable=True)["results"]
for model in model_list:
    print(model.__dict__)
selected_model = ModelFactory.get("6499cc946eb5633de15d82a1")
selected_model.__dict__
dataset_id = "YOUR_DATASET_ID"  # replace with the asset_id returned in Step 1
selected_dataset = DatasetFactory.get(dataset_id)
selected_dataset.__dict__
Finetune the Model
Now, we finetune the model. You can learn more about how to finetune models in this guide.
from aixplain.factories import FinetuneFactory

time_str = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
finetune_name = f"test_{time_str}"

finetune = FinetuneFactory.create(finetune_name, [selected_dataset], selected_model)
finetune.__dict__
finetune.cost.to_dict()
finetune_model = finetune.start()
status = finetune_model.check_finetune_status()
status
import time

while status != "onboarded":
    status = finetune_model.check_finetune_status()
    print(f"Current status: {status}")
    time.sleep(10)
finetune_model = ModelFactory.get("6514426e1cfdf13eab753867")
finetune_model.__dict__
finetune_model.run("Dog videos")
Step 3: Configure the LLM as a QnA Model
Now, let's set up the language model that will generate answers using the indexed data.
Set Up the Model for Text Generation
List available text generation models and select one.
model_list = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
for model in model_list:
    print(model.__dict__)
gpt_model = ModelFactory.get("64d21cbb6eb563074a698ef1")
gpt_model.__dict__
Define the Prompt Template
Create a prompt template to be sent to the model, including placeholders for the question and documents.
prompt_template = """Generate an answer to the following question based on the documents:
Question:
@@question@@
Documents:
@@documents@@
"""
prompt_template
question = "List important topics in the marketing domain"
documents = ["Mkt doc 1 content", "Mkt doc 2 content", "Mkt doc 3 content"]
prompt = prompt_template.replace("@@question@@", question)
prompt = prompt.replace("@@documents@@", "\n".join(documents))
print(prompt)
Step 4: Run the Solution
Now, we’ll create a function that retrieves relevant documents using the search model and then sends a prompt to the language model to generate a response.
Define the Functions
import requests
from aixplain.modules import Model

def retrieve_documents(search_model: Model, question: str):
    # Query the search model for documents relevant to the question.
    response = search_model.run(question)
    # Print the response to understand its structure.
    print("Search Model Response:", response)
    try:
        url_list = [elem["uri"] for elem in response]
    except TypeError:
        print("Unexpected response format. Expected a list of dictionaries.")
        return []
    # Download the text of each retrieved document.
    documents = []
    for url in url_list:
        response = requests.get(url)
        if response.status_code == 200:
            documents.append(response.text)
        else:
            print(f"Could not retrieve document from {url}")
    return documents
def qa_system(text_generation_model: Model, search_model: Model, question: str):
    prompt = """Generate an answer to the following question based on the documents:
Question:
@@question@@
Documents:
@@documents@@"""
    # Retrieve relevant documents and fill in the prompt template.
    documents = retrieve_documents(search_model, question)
    prompt = prompt.replace("@@question@@", question)
    prompt = prompt.replace("@@documents@@", "\n".join(documents))
    print("Prompt:")
    print(prompt)
    response_gpt = text_generation_model.run([
        {"role": "user", "content": prompt},
    ])
    return response_gpt
Run the QnA System
finetune_model = ModelFactory.get("6514426e1cfdf13eab753867")
gpt_model = ModelFactory.get("64d21cbb6eb563074a698ef1")
question = "Which principles do ancient scripts use?"
response = qa_system(gpt_model, finetune_model, question)
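The model returns a response payload; assuming it exposes the generated text under a data field, as in the chat loop in Step 5 below, you can read the answer directly:

# Assumes the payload exposes the generated text under "data",
# matching the chat loop in Step 5.
print(response["data"])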
Step 5: Convert the QnA System into a Chatbot
Define the Chatbot Functions
Modify the qa_system function to keep track of the conversation history.
import requests

def retrieve_documents(search_model: Model, question: str):
    response = search_model.run(question)
    # Print the response to understand its structure.
    print("Search Model Response:", response)
    # Check whether the response is in the expected format.
    try:
        url_list = [elem["uri"] for elem in response]
    except TypeError:
        print("Unexpected response format. Expected a list of dictionaries.")
        return []
    documents = []
    for url in url_list:
        response = requests.get(url)
        if response.status_code == 200:
            documents.append(response.text)
        else:
            print(f"Could not retrieve document from {url}")
    return documents
def qa_system(text_generation_model: Model, search_model: Model, user_query: str, messages: list):
    # Record the user's turn in the conversation history.
    messages.append({"role": "user", "content": user_query})
    prompt = """Generate an answer to the following question based on the documents:
Question:
@@question@@
Documents:
@@documents@@"""
    documents = retrieve_documents(search_model, user_query)
    prompt = prompt.replace("@@question@@", user_query)
    prompt = prompt.replace("@@documents@@", "\n".join(documents))
    print("Prompt:")
    print(prompt)
    # Attach the grounded prompt and send the full history to the model.
    messages.append({"role": "system", "content": prompt})
    response_gpt = text_generation_model.run(messages)
    return response_gpt
Interactive Chat
finetune_model = ModelFactory.get("6514426e1cfdf13eab753867")
gpt_model = ModelFactory.get("64d21cbb6eb563074a698ef1")
messages = []
print("Assistant: \n\t How can I help you?")
query = input()
while query != "quit":  # type "quit" to end the conversation
    print(f"User: \n\t {query}")
    response = qa_system(gpt_model, finetune_model, query, messages)
    print(response)
    answer = response["data"]
    print(f"Assistant: \n\t {answer}")
    query = input()
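To make the chatbot available as the microservice mentioned at the start of this tutorial, you can wrap qa_system in a small web API. Here is a minimal sketch using FastAPI; FastAPI is our choice here, not an aiXplain requirement, and the shared history and model IDs are simplifications taken from the cells above:

from fastapi import FastAPI
from pydantic import BaseModel
from aixplain.factories import ModelFactory

app = FastAPI()

# Simplified: one shared conversation history for all callers.
# A production service would keep per-session histories.
finetune_model = ModelFactory.get("6514426e1cfdf13eab753867")
gpt_model = ModelFactory.get("64d21cbb6eb563074a698ef1")
messages = []

class ChatRequest(BaseModel):
    question: str

@app.post("/chat")
def chat(request: ChatRequest):
    # qa_system is the function defined above (assumed to be in scope).
    response = qa_system(gpt_model, finetune_model, request.question, messages)
    return {"answer": response["data"]}

# Run with, e.g.: uvicorn app:app --reload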
You can also convert the chatbot into an agent; refer to this guide for information on how to build an agent.
You've successfully built a question-answering chatbot using aiXplain's SDK. This chatbot can search through your custom dataset, use a language model to generate context-based answers, and even handle interactive conversations. Feel free to explore more functionalities and customize this solution to suit your needs!