Data Onboarding
At aiXplain, data assets are the foundation for training, benchmarking, and evaluating AI systems. Currently, we categorize data assets into two types: Corpora and Datasets.
Overview
- Corpus – A flexible, context-rich data collection for exploratory analysis and AI system evaluation.
- Dataset – A task-specific data compilation with defined inputs/outputs for ML tasks such as Speech Recognition, Machine Translation, or Sentiment Analysis.
[Corpora] → General, exploratory → Used to build datasets
[Datasets] → Task-specific → Used to train/benchmark models
Dataset Onboarding
Datasets are structured for specific ML tasks and can be uploaded to aiXplain for use in fine-tuning, training, or benchmarking.
Step-by-Step Guide
We provide step-by-step examples in Colab notebooks for different dataset types:
- Machine Translation Dataset
- Speech Recognition Dataset
- Machine Translation Dataset (from S3)
Workflow
- Prepare Your Data – Structure the dataset according to the task (e.g., text pairs for MT, audio + transcripts for ASR).
- Upload Dataset – Use the SDK or UI to onboard the data into aiXplain.
- Validate – Ensure the uploaded dataset matches the task requirements and metadata.
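The "prepare" and "validate" steps above can be sketched locally before anything is uploaded. The following is a minimal illustration for a Machine Translation text-pair file, using only the Python standard library. The column names `source` and `target` and the helper functions are assumptions made for this example; they are not part of the aiXplain SDK, and the platform may apply its own checks.

```python
import csv
import io


def write_pairs(pairs, out):
    """Write (source, target) text pairs as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["source", "target"])  # hypothetical column names
    writer.writerows(pairs)


def validate_pairs(csv_text, required=("source", "target")):
    """Return a list of problems; an empty list means the file looks usable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in required if c not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in required:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
    return problems


# Build a tiny in-memory file with one incomplete pair, then check it.
buffer = io.StringIO()
write_pairs([("Hello", "Hola"), ("Thank you", "")], buffer)
issues = validate_pairs(buffer.getvalue())
print(issues)
```

Running a check like this before upload catches missing columns and empty cells early, when they are cheapest to fix.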
Once onboarded, datasets can be attached to agents for evaluation or fine-tuning tasks.
Corpus Onboarding
Corpora are general-purpose collections (text, audio, images, video) that can be structured and later transformed into task-specific datasets. They are useful for training models, exploratory research, and benchmarking AI systems.
Step-by-Step Example
A step-by-step example is provided below:
Speech Recognition Corpus
Prepare the Data
A corpus is typically represented in a CSV file. Example: English ASR corpus with audio file URLs and transcripts.
import pandas as pd

# Sample English ASR corpus: audio file URLs with their transcripts
data_url = "https://aixplain-platform-assets.s3.amazonaws.com/samples/tests/EnglishASR.csv"
data = pd.read_csv(data_url)  # Download and parse the CSV
data.to_csv("data.csv")  # Save locally
Define Metadata
Metadata describes the structure of the corpus: for each column, its data type, how it is stored, and its language.
Audio Metadata
from aixplain.enums import DataType, Language, StorageType
from aixplain.modules import MetaData

audio_meta = MetaData(
    name="audio",
    dtype=DataType.AUDIO,
    storage_type=StorageType.URL,
    start_column="audio_start_time",
    end_column="audio_end_time",
    languages=[Language.English_UNITED_STATES]
)
Text Metadata
text_meta = MetaData(
    name="text",
    dtype=DataType.TEXT,
    storage_type=StorageType.TEXT,
    languages=[Language.English_UNITED_STATES]
)
Combine metadata into a schema:
schema = [audio_meta, text_meta]
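Before onboarding, it can help to confirm that the CSV actually contains the columns the metadata refers to. The sketch below assumes that each metadata `name`, `start_column`, and `end_column` corresponds to a CSV header. This is inferred from the example above, not confirmed SDK behavior. The helper `missing_columns` and the in-memory sample are hypothetical and use only the standard library.

```python
import csv
import io

# Column names referenced by the audio and text metadata above.
REQUIRED_COLUMNS = ["audio", "audio_start_time", "audio_end_time", "text"]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# Illustrative in-memory sample, not the real EnglishASR.csv contents.
sample = io.StringIO(
    "audio,text,audio_start_time\n"
    "https://example.com/a.wav,hello world,0.0\n"
)
print(missing_columns(sample))  # ['audio_end_time']
```

For the file saved earlier, the same function could be called as `missing_columns(open("data.csv", newline=""))`; an empty list suggests the header and the schema agree.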
Onboard the Corpus
from aixplain.factories import CorpusFactory
from aixplain.enums import License

payload = CorpusFactory.create(
    name="corpus_onboarding_demo",
    description="This corpus contains 20 English audios with transcriptions.",
    license=License.MIT,
    content_path="data.csv",
    schema=schema
)
print(payload) # Displays corpus ID and onboarding status
Retrieve the Corpus
corpus = CorpusFactory.get(payload["asset_id"])
print(corpus.to_dict())
Best Practices
- Start with a Corpus for Exploration – Use corpora to collect and structure raw data.
- Convert to Datasets for Tasks – When targeting ML tasks (e.g., ASR, MT), convert subsets of corpora into datasets.
- Keep Metadata Clear – Accurate metadata ensures smooth onboarding and benchmarking.
- Validate Before Use – Always confirm schema and sample rows before running training/evaluation.
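As one way to picture the "convert subsets of corpora into datasets" practice, the sketch below filters corpus rows down to complete audio/text pairs and writes them out as a task-specific CSV. It is local preprocessing only and does not use any aiXplain conversion API. The `speaker` column, the sample rows, and the helper `select_complete_rows` are hypothetical.

```python
import csv
import io


def select_complete_rows(rows, columns):
    """Keep only rows where every listed column has a non-blank value."""
    return [
        {col: row[col] for col in columns}
        for row in rows
        if all((row.get(col) or "").strip() for col in columns)
    ]


# Illustrative corpus with an extra column and one row missing its transcript.
corpus_csv = io.StringIO(
    "audio,text,speaker\n"
    "https://example.com/a.wav,hello world,s1\n"
    "https://example.com/b.wav,,s2\n"
)
rows = list(csv.DictReader(corpus_csv))
subset = select_complete_rows(rows, ["audio", "text"])

# Write the ASR-ready subset with only the task-relevant columns.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["audio", "text"])
writer.writeheader()
writer.writerows(subset)
print(out.getvalue())
```

Printing the result also serves as a quick look at sample rows before any training or evaluation run.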