How to benchmark a model
This guide explains how to benchmark models using the aiXplain SDK. You'll learn to select datasets, models, and metrics, and create benchmarks for tasks. For more information, refer to this guide.
Generic Example (Template)
from aixplain.factories import (
    BenchmarkFactory,
    DatasetFactory,
    MetricFactory,
    ModelFactory,
)
datasets = [DatasetFactory.get("...")]  # specify Dataset ID
metrics = [MetricFactory.get("...")]  # specify Metric ID
models = [ModelFactory.get("...")]  # specify Model ID
benchmark = BenchmarkFactory.create(
    "benchmark_name", dataset_list=datasets, model_list=models, metric_list=metrics
)
benchmark_job = benchmark.start()
status = benchmark_job.check_status()
results_path = benchmark_job.download_results_as_csv()
Benchmark Examples
The following examples apply benchmarking to Text Generation, Translation, and Speech Recognition, each using a different approach to selecting models, datasets, and metrics.
Imports
from aixplain.factories import BenchmarkFactory, DatasetFactory, MetricFactory, ModelFactory
from aixplain.enums import Function, Language, Supplier  # for search
Select Models, Datasets & Metrics
Datasets are currently private, so you must first onboard the datasets in the examples below (or similar) to follow along.
See our guide on How to upload a dataset.
- Text generation
- Translation
- Speech recognition
Models (Text generation)
# Choose 'one or more' models
model_list = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
for model in model_list:
    print(model.__dict__)
model_gpt4 = ModelFactory.get("6414bd3cd09663e9225130e8")
model_llama3_70b = ModelFactory.get("6626a3a8c8f1d089790cf5a2")
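Printing __dict__ for every search result can be noisy. As a lighter alternative, the sketch below collects only a few attributes per asset. summarize_assets is a hypothetical helper, not part of the aiXplain SDK; it assumes assets expose id and name attributes (both are used elsewhere in this guide) and falls back to None when an attribute is missing.

```python
def summarize_assets(assets, fields=("id", "name")):
    """Return one dict per asset containing only the requested attributes.

    Hypothetical helper: missing attributes are reported as None rather
    than raising, since the exact asset schema is an assumption here.
    """
    rows = []
    for asset in assets:
        rows.append({field: getattr(asset, field, None) for field in fields})
    return rows
```

For example, summarize_assets(model_list) could be printed in place of the loop over model.__dict__.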
Datasets (Text generation)
# Choose 'exactly one' dataset
dataset_list = DatasetFactory.list(function=Function.TEXT_GENERATION)["results"]
for dataset in dataset_list:
    print(dataset.__dict__)
dataset_pubmed_test = DatasetFactory.get("65e92213763f9f09ec1cf529")
dataset_pubmed_test.__dict__
Metrics (Text generation)
# Choose 'one or more' metrics
metric_list = MetricFactory.list()["results"]
for metric in metric_list:
    print(metric.__dict__)
metric_wer = MetricFactory.get("646d371caec2a04700e61945")
metric_bleu = MetricFactory.get("639874ab506c987b1ae1acc6")
Models (Translation)
# Choose 'one or more' models
model_list = ModelFactory.list(
    function=Function.TRANSLATION,
    source_languages=Language.English,
    target_languages=Language.Spanish,
    suppliers=[Supplier.AWS, Supplier.GOOGLE, Supplier.MICROSOFT],
)['results']
for model in model_list:
    print(model.__dict__)
models = [
    ModelFactory.get(model_id)
    for model_id in [
        "60ddefd98d38c51c58860ad6",
        "617048f83a3ab842ec0804c2",
        "61b097551efecf30109d3316",
    ]
]
Datasets (Translation)
# Choose 'exactly one' dataset
dataset_list = DatasetFactory.list("opus")["results"]
for dataset in dataset_list:
    print(dataset.__dict__)
dataset_opus100_en_es_200 = DatasetFactory.get("651886916fd4ecd622045542")
dataset_opus100_en_es_200.__dict__
Metrics (Translation)
# Choose 'one or more' metrics that are supported
metrics_list = MetricFactory.list()['results']
metrics = [metric for metric in metrics_list if "COMET" in metric.name]
metrics
Models (Speech recognition)
# Choose 'one or more' models
models = ModelFactory.list(
    function=Function.SPEECH_RECOGNITION,
    source_languages=Language.English_UNITED_STATES,
    page_size=2
)['results']
models
Datasets (Speech recognition)
# Choose 'exactly one' dataset
datasets = DatasetFactory.list(
    function=Function.SPEECH_RECOGNITION,
    source_languages=Language.English_UNITED_STATES,
    page_size=1
)['results']
datasets
Metrics (Speech recognition)
# Choose 'one or more' metrics that are supported
metrics = MetricFactory.list(
    model_id=models[0].id,  # filter for metrics compatible with the first model
    page_size=2
)['results']
metrics
Creating a Benchmark
- Text generation
- Translation
- Speech recognition
# Text generation
benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME",
    dataset_list=[dataset_pubmed_test],
    model_list=[model_gpt4, model_llama3_70b],
    metric_list=[metric_wer, metric_bleu]
)
benchmark.__dict__
# Translation
benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME",
    dataset_list=[dataset_opus100_en_es_200],
    model_list=models,
    metric_list=metrics
)
benchmark.__dict__
# Speech recognition
benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME",
    dataset_list=datasets,
    model_list=models,
    metric_list=metrics
)
benchmark.__dict__
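The UNIQUE_BENCHMARK_NAME placeholder in the calls above suggests that each benchmark needs a distinct name. One way to generate such names is sketched below. unique_benchmark_name is a hypothetical helper, and the format (prefix, UTC timestamp, short random suffix, joined by hyphens) is an assumption; check the platform's naming rules before relying on it.

```python
import uuid
from datetime import datetime, timezone


def unique_benchmark_name(prefix):
    """Build a name of the form '<prefix>-<UTC timestamp>-<8 hex chars>'.

    Hypothetical naming scheme: the random suffix keeps names distinct
    even when two benchmarks are created within the same second.
    """
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{prefix}-{timestamp}-{uuid.uuid4().hex[:8]}"
```

The returned string can then be passed as the first argument to BenchmarkFactory.create.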
Starting a Benchmark
Call the start method to begin benchmarking. You can check the status by calling the check_benchmark_status method, or download the current results as a CSV (even for an in-progress benchmarking job).
benchmark_model = benchmark.start()
status = benchmark_model.check_benchmark_status()
You can start multiple jobs on a single Benchmark.
You can use a loop to check the status.
import time
while status != "completed":
    status = benchmark_model.check_benchmark_status()
    print(f"Current status: {status}")
    time.sleep(10)  # wait 10 seconds between checks
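The loop above runs forever if the job never reaches "completed". A more defensive variant is sketched below. wait_for_benchmark is a hypothetical helper, not part of the SDK; it takes any zero-argument callable that returns the status string. The "failed" status value and the default timeout are assumptions, so adjust them to whatever the platform actually reports.

```python
import time


def wait_for_benchmark(check_status, interval=10.0, timeout=3600.0,
                       failure_statuses=("failed",), sleep=time.sleep):
    """Poll check_status() until it returns "completed".

    check_status: zero-argument callable returning the current status string.
    failure_statuses: assumed terminal error states (hypothetical values).
    sleep: injectable so the helper can be exercised without real waiting.
    """
    waited = 0.0
    while True:
        status = check_status()
        print(f"Current status: {status}")
        if status == "completed":
            return status
        if status in failure_statuses:
            raise RuntimeError(f"Benchmark job ended with status: {status}")
        if waited >= timeout:
            raise TimeoutError(f"Benchmark job not completed after {timeout} seconds")
        sleep(interval)
        waited += interval
```

With the job object from this section, the call would look like wait_for_benchmark(benchmark_model.check_benchmark_status).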
You can view your Benchmark once it's ready by
- visiting the Platform and locating it in your Benchmark assets, or
- calling the download_results_as_csv method.
results_path = benchmark_model.download_results_as_csv()
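Once the CSV is downloaded, it can be inspected with the standard library. The sketch below assumes that results_path is a path to a local CSV file with a header row; the column names are not documented here, so the code reads whatever columns are present instead of assuming any. load_results and print_results are hypothetical helper names.

```python
import csv


def load_results(results_path):
    """Read the downloaded results CSV into a list of dicts, one per row."""
    with open(results_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def print_results(rows):
    """Print each row as 'column=value' pairs without assuming column names."""
    for row in rows:
        print(", ".join(f"{key}={value}" for key, value in row.items()))
```

Calling print_results(load_results(results_path)) gives a quick overview before loading the file into a spreadsheet or data frame.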
Normalization
We have methods that specialize in handling text data from various languages, providing both general and tailored preprocessing techniques for each language's unique characteristics. These are called normalization options. The normalization process transforms raw text data into a standardized format, enabling a fair and exact performance evaluation across diverse models. A few examples are 'removing numbers' and 'lowercase text'. To get the list of supported normalization options, we need the metric and model we will use in benchmarking.
supported_options = BenchmarkFactory.list_normalization_options(metric, model)
Note: These options can differ for each metric in the same benchmark.
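To illustrate why normalization matters, the sketch below applies two of the options mentioned above ('lowercase text' and 'removing numbers') before computing a simple word error rate. This is a standalone illustration, not the aiXplain implementation: the function names, the regular expression, and the WER routine are all assumptions chosen for clarity.

```python
import re


def lowercase_text(text):
    """Illustrative 'lowercase text' option."""
    return text.lower()


def remove_numbers(text):
    """Illustrative 'removing numbers' option: drop runs of digits."""
    return re.sub(r"\d+", "", text)


def normalize(text, options):
    """Apply each normalization function in order."""
    for option in options:
        text = option(text)
    return text


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


reference = "Room 101 is open"
hypothesis = "room is open"
for options in ([], [lowercase_text], [lowercase_text, remove_numbers]):
    score = word_error_rate(normalize(reference, options), normalize(hypothesis, options))
    print([option.__name__ for option in options], score)
```

The same pair of sentences scores differently under each configuration, which is why comparing one metric under several normalization settings can be informative.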
You have the flexibility to choose multiple normalization options for each performance metric. You can also opt for the same metric with varying normalization options. This adaptability provides a thorough way to compare model performance.
selected_options = [<option 1>, ..., <option N>]
metric.add_normalization_options(selected_options)
You can even select multiple configurations for the same metric:
selected_options_config_1 = [<option 1>, <option 2>, <option 3>]
selected_options_config_2 = [<option 3>, <option 4>]
metric.add_normalization_options(selected_options_config_1)
metric.add_normalization_options(selected_options_config_2)
After this, you can create the benchmark as usual:
benchmark = BenchmarkFactory.create(
    "UNIQUE_NAME_OF_BENCHMARK",
    dataset_list=datasets,
    model_list=models,
    metric_list=metrics_with_normalization
)
By following this guide, you can benchmark models to assess their performance across various tasks. Use the benchmark results to optimize your agents based on reliable evaluation metrics.