Corpora

Corpora are structured datasets that contain one or more types of data, such as text, audio, images, or video. In aiXplain, corpora can be onboarded and then used to train models and evaluate AI systems. This guide explains how to onboard a corpus with the aiXplain SDK, from data preparation to retrieving the onboarded asset.

Corpus Onboarding

A step-by-step example is available in the following Colab notebook:

Speech recognition corpus

Prepare the Data

A corpus is typically represented as a CSV file. The example dataset references 20 English audio files, each with its transcription.

import pandas as pd

data_url = "https://aixplain-platform-assets.s3.amazonaws.com/samples/tests/EnglishASR.csv"
data = pd.read_csv(data_url)
data.head()  # preview the first rows

To onboard the dataset, save it locally.

upload_file = "data.csv"
data.to_csv(upload_file, index=False)  # omit the pandas index so no extra unnamed column is written
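
Before onboarding, it can help to confirm that the saved file contains the columns the metadata will refer to. The following is a minimal sketch using only the Python standard library. The helper name `missing_columns` and the inline sample row are hypothetical, and the column names are assumed from the metadata definitions used later in this guide (`audio`, `text`, `audio_start_time`, `audio_end_time`); the actual CSV may name its columns differently.

```python
import csv
import io

# Column names assumed from the metadata definitions in this guide.
REQUIRED_COLUMNS = ["audio", "text", "audio_start_time", "audio_end_time"]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical one-row sample standing in for the saved file.
sample = io.StringIO(
    "audio,text,audio_start_time,audio_end_time\n"
    "https://example.com/a.wav,hello world,0.0,1.5\n"
)
print(missing_columns(sample))  # []
```

To check the file saved above, pass an open handle instead of the in-memory sample, for example `open(upload_file, newline="")`.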

Define Metadata

Metadata describes each column of the dataset: its data type (audio, text, etc.), how the data is stored, and its language.

Audio Metadata

Define the audio data by specifying:

Name: "audio"
Data type: Audio
Storage: URL (links to external audio files)
Start/End columns: "audio_start_time" and "audio_end_time", which mark each segment's boundaries within its audio file
Language: English (US)

from aixplain.enums import DataType, Language, License, StorageType
from aixplain.modules import MetaData

audio_meta = MetaData(
    name="audio",
    dtype=DataType.AUDIO,
    storage_type=StorageType.URL,
    start_column="audio_start_time",
    end_column="audio_end_time",
    languages=[Language.English_UNITED_STATES]
)
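
Because the audio metadata points at start and end columns, each row is expected to describe a valid time segment. The sketch below shows one way to flag rows whose bounds are missing, negative, or out of order before onboarding. The helper `invalid_segments` and the sample rows are hypothetical, and the times are assumed to be numeric values in seconds, which this guide does not state.

```python
def invalid_segments(rows, start_key="audio_start_time", end_key="audio_end_time"):
    """Return indices of rows whose segment bounds are missing, negative, or not increasing."""
    bad = []
    for index, row in enumerate(rows):
        try:
            start = float(row[start_key])
            end = float(row[end_key])
        except (KeyError, TypeError, ValueError):
            # Absent column, None, or a non-numeric value such as "".
            bad.append(index)
            continue
        if start < 0 or end <= start:
            bad.append(index)
    return bad


# Hypothetical rows, e.g. as produced by csv.DictReader.
rows = [
    {"audio_start_time": "0.0", "audio_end_time": "2.5"},
    {"audio_start_time": "3.0", "audio_end_time": "1.0"},  # end before start
    {"audio_start_time": "", "audio_end_time": "4.0"},     # missing start
]
print(invalid_segments(rows))  # [1, 2]
```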

Text Metadata

Define the text data by specifying:

Name: "text"
Data type: Text
Storage: Text (stored directly in the CSV)
Language: English (US)

text_meta = MetaData(
    name="text",
    dtype=DataType.TEXT,
    storage_type=StorageType.TEXT,
    languages=[Language.English_UNITED_STATES]
)

Combine the metadata definitions into a schema.

schema = [audio_meta, text_meta]

Onboard the Corpus

To upload the dataset, call CorpusFactory.create with:

Name: A unique identifier for the corpus.
Description: Brief details of the dataset.
License: Defines usage permissions.
File path: Location of the dataset.
Schema: The metadata structure defined earlier.

from aixplain.factories import CorpusFactory

payload = CorpusFactory.create(
    name="corpus_onboarding_demo",
    description="This corpus contains 20 English audio files with their corresponding transcriptions.",
    license=License.MIT,
    content_path=upload_file,
    schema=schema
)
print(payload)  # Displays Corpus ID and onboarding status

Retrieve the Corpus

Once onboarded, the corpus can be retrieved using its asset ID.

corpus = CorpusFactory.get(payload["asset_id"])
print(corpus.to_dict())  # Displays corpus details
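
Retrieval depends on the payload containing an "asset_id" entry. As a defensive sketch, the hypothetical helper below extracts that ID and raises a descriptive error if it is absent, rather than failing with a bare KeyError. The example payload is invented for illustration: only the "asset_id" key comes from this guide, and the "status" key name is an assumption.

```python
def get_asset_id(payload):
    """Return the asset ID from an onboarding payload, or raise a descriptive error."""
    if not isinstance(payload, dict):
        raise TypeError(f"expected a dict payload, got {type(payload).__name__}")
    asset_id = payload.get("asset_id")
    if not asset_id:
        raise ValueError(f"payload has no asset_id: {payload!r}")
    return asset_id


# Hypothetical payload; a real one would come from CorpusFactory.create(...).
example_payload = {"asset_id": "1234abcd", "status": "onboarding"}
print(get_asset_id(example_payload))  # 1234abcd
```

The returned value could then be passed to CorpusFactory.get in place of payload["asset_id"].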

Once onboarded, a corpus can be used in aiXplain for model training and evaluation. The steps in this guide cover the full path: preparing the CSV, defining metadata for each column, onboarding the corpus, and retrieving it by its asset ID.