
Data Onboarding

Not OnPrem

At aiXplain, data assets are the foundation for training, benchmarking, and evaluating AI systems. Currently, we categorize data assets into two types: Corpora and Datasets.

Overview

  • Corpus – A flexible, context-rich data collection for exploratory analysis and AI system evaluation.

  • Dataset – A task-specific data compilation with defined inputs/outputs for ML tasks such as Speech Recognition, Machine Translation, or Sentiment Analysis.

[Corpora] → General, exploratory → Used to build datasets
[Datasets] → Task-specific → Used to train/benchmark models

Dataset Onboarding

Datasets are structured for specific ML tasks and can be uploaded to aiXplain for use in fine-tuning, training, or benchmarking.



Step-by-Step Guide

We provide step-by-step examples in Colab notebooks for different dataset types:

Machine Translation Dataset

Open In Colab

Speech Recognition Dataset

Open In Colab

Machine Translation Dataset (from S3)

Open In Colab

Workflow

  1. Prepare Your Data – Structure the dataset according to the task (e.g., text pairs for MT, audio + transcripts for ASR).

  2. Upload Dataset – Use the SDK or UI to onboard the data into aiXplain (a hedged SDK sketch follows below).

  3. Validate – Ensure the uploaded dataset matches the task requirements and metadata.

Once onboarded, datasets can be attached to agents for evaluation or fine-tuning tasks.
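The workflow above maps to a few SDK calls. Below is a minimal sketch for a Machine Translation dataset, assuming a DatasetFactory.create call that mirrors the CorpusFactory.create example later on this page; the column names ("en", "fr"), the Language and Function enum members, and the exact parameter names are illustrative assumptions, so refer to the Colab notebooks above for the authoritative usage.

from aixplain.enums import DataType, Function, Language, License, StorageType
from aixplain.factories import DatasetFactory
from aixplain.modules import MetaData

# 1. Prepare Your Data: a CSV with one column per language (assumed column names: "en", "fr")
source_meta = MetaData(
    name="en",
    dtype=DataType.TEXT,
    storage_type=StorageType.TEXT,
    languages=[Language.English_UNITED_STATES]
)
target_meta = MetaData(
    name="fr",
    dtype=DataType.TEXT,
    storage_type=StorageType.TEXT,
    languages=[Language.French]          # assumed enum member; check aixplain.enums.Language
)

# 2. Upload Dataset: onboard the CSV as a translation dataset (parameters assumed)
payload = DatasetFactory.create(
    name="mt_dataset_demo",
    description="English-French parallel text",
    license=License.MIT,
    function=Function.TRANSLATION,       # assumed enum member for Machine Translation
    content_path="mt_data.csv",          # hypothetical local file
    input_schema=[source_meta],
    output_schema=[target_meta]
)

# 3. Validate: inspect the returned onboarding status and asset ID
print(payload)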

Corpus Onboarding

Corpora are general-purpose collections (text, audio, images, video) that can be structured and later transformed into task-specific datasets. They are useful for training models, exploratory research, and benchmarking AI systems.



Step-by-Step Example

We provide a step-by-step example below:

Speech Recognition Corpus

Open In Colab

Prepare the Data

A corpus is typically represented in a CSV file. Example: English ASR corpus with audio file URLs and transcripts.

import pandas as pd

# Download the sample corpus and save a local copy for onboarding
data_url = "https://aixplain-platform-assets.s3.amazonaws.com/samples/tests/EnglishASR.csv"
data = pd.read_csv(data_url)
data.to_csv("data.csv", index=False)  # Save locally without the pandas index column
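Before defining metadata, it is worth confirming which columns the CSV actually contains; the metadata below assumes columns named audio, text, audio_start_time, and audio_end_time.

print(data.columns.tolist())  # expected (assumed): audio, text, audio_start_time, audio_end_time
print(data.head())            # eyeball a few sample rows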

Define Metadata

Metadata describes the structure of the corpus.

Audio Metadata
from aixplain.enums import DataType, Language, StorageType
from aixplain.modules import MetaData

audio_meta = MetaData(
    name="audio",                               # CSV column with the audio file URLs
    dtype=DataType.AUDIO,
    storage_type=StorageType.URL,               # audio is referenced by URL, not embedded
    start_column="audio_start_time",            # CSV column with segment start times
    end_column="audio_end_time",                # CSV column with segment end times
    languages=[Language.English_UNITED_STATES]
)
Text Metadata
text_meta = MetaData(
    name="text",                                # CSV column with the transcriptions
    dtype=DataType.TEXT,
    storage_type=StorageType.TEXT,              # text is stored inline in the CSV
    languages=[Language.English_UNITED_STATES]
)

Combine metadata into a schema:

schema = [audio_meta, text_meta]

Onboard the Corpus

from aixplain.factories import CorpusFactory
from aixplain.enums import License

payload = CorpusFactory.create(
    name="corpus_onboarding_demo",
    description="This corpus contains 20 English audios with transcriptions.",
    license=License.MIT,
    content_path="data.csv",
    schema=schema
)
print(payload) # Displays corpus ID and onboarding status

Retrieve the Corpus

corpus = CorpusFactory.get(payload["asset_id"])
print(corpus.to_dict())

Best Practices

  • Start with a Corpus for Exploration – Use corpora to collect and structure raw data.

  • Convert to Datasets for Tasks – When targeting ML tasks (e.g., ASR, MT), convert subsets of corpora into datasets.

  • Keep Metadata Clear – Accurate metadata ensures smooth onboarding and benchmarking.

  • Validate Before Use – Always confirm the schema and sample rows before running training or evaluation (a quick check is sketched below).
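Validation can be as light as re-reading the prepared CSV and confirming that every column referenced in the schema exists and has no empty values. A minimal sketch, assuming the data.csv and column names from the corpus example above:

import pandas as pd

# Re-read the file that will be onboarded and check it against the schema
df = pd.read_csv("data.csv")
required_columns = ["audio", "text", "audio_start_time", "audio_end_time"]  # assumed to match the schema

missing = [c for c in required_columns if c not in df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")

empty_rows = int(df[required_columns].isna().any(axis=1).sum())
print(f"{len(df)} rows, {empty_rows} rows with empty required fields")
print(df.head())  # sample rows to confirm before onboarding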