
Prompt Benchmarking

Availability: Not OnPrem

Prompt benchmarking lets you evaluate how different prompt formulations affect a model’s quality, speed, and cost—while holding the dataset, metric, and underlying model constant.

Example - Poetry Sentiment Classification

The example below uses Google's poem sentiment dataset, which classifies the sentiment of each line as positive, negative, or no_impact (neutral).

Select Dataset, Metric, and Base Model

Each variant uses the same underlying model with a different prompt configuration.

from aixplain.factories import DatasetFactory, MetricFactory, ModelFactory, BenchmarkFactory

# Dataset used for every prompt variant
datasets = [DatasetFactory.get("67eebc80ff3b3998834d0023")]

# Retrieve the same base model four times, one instance per prompt variant
model1 = ModelFactory.get("669a63646eb56306647e1091")
model2 = ModelFactory.get("669a63646eb56306647e1091")
model3 = ModelFactory.get("669a63646eb56306647e1091")
model4 = ModelFactory.get("669a63646eb56306647e1091")

# Attach a display name and prompt configuration to each variant
model1.add_additional_info_for_benchmark(display_name="No Prompt", configuration={"prompt": ""})
model2.add_additional_info_for_benchmark(display_name="Simple Prompt", configuration={"prompt": "Analyze the sentiment of the following text:"})
model3.add_additional_info_for_benchmark(display_name="Specific Prompt", configuration={"prompt": "Classify the text into 'no_impact', 'negative', or 'positive':"})
model4.add_additional_info_for_benchmark(display_name="Specific Prompt with Output Format", configuration={"prompt": "Classify the text into 'no_impact', 'negative', or 'positive'. Only output the answer and nothing else:"})
models = [model1, model2, model3, model4]

# Evaluation metric applied to every variant
metrics = [MetricFactory.get("65e1d5ac95487dea3023a0b8")]

Create and Run Benchmark

# Create the benchmark from the chosen dataset, prompt variants, and metric
benchmark = BenchmarkFactory.create(name="Poem Sentiment Prompt Run", dataset_list=datasets, model_list=models, metric_list=metrics)

# Start the benchmark job
benchmark_job = benchmark.start()
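
Benchmark jobs run asynchronously, so results may not be available right away. The sketch below polls the job until it finishes; it assumes the job object exposes a check_status method and that the terminal status strings are "completed" and "failed", so check the aiXplain SDK reference for the exact names in your version.

import time

# Poll the benchmark job until it reaches a terminal state
# (check_status and the status strings below are assumptions, not confirmed API)
status = benchmark_job.check_status()
while status not in ("completed", "failed"):
    time.sleep(60)  # full-dataset benchmarks can take several minutes
    status = benchmark_job.check_status()

print(f"Benchmark job finished with status: {status}")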

Analyze Results

# Simplified score output
scores = benchmark_job.get_scores()
print(scores)

# Detailed results
results_df = benchmark_job.download_results_as_csv(return_dataframe=True)

# View metric comparison by prompt
results_df.groupby("DisplayName")["ROUGE by HuggingFace"].describe()

# Preview input and reference pairs
results_df[["Input", "Reference 0"]].head()

Note: Clear and well-structured prompts can significantly impact model accuracy and consistency across tasks.