Corpora
Corpora are structured datasets that contain various types of data such as text, audio, images, or video. In aiXplain, corpora can be onboarded and used for training models, evaluating AI systems, and other analytical tasks. This guide explains how to onboard a corpus using the aiXplain SDK, from data preparation to successful integration.
Prepare the Data
A corpus is typically represented as a CSV file. The example dataset used here contains 20 English audio files with their transcriptions.
import pandas as pd
data_url = "https://aixplain-platform-assets.s3.amazonaws.com/samples/tests/EnglishASR.csv"
data = pd.read_csv(data_url)
data
To onboard the dataset, save it locally.
upload_file = "data.csv"
data.to_csv(upload_file, index=False)  # index=False keeps the DataFrame index out of the saved file
Define Metadata
Metadata provides important information about the dataset, such as its type (audio, text, etc.), storage method, and language.
Audio Metadata
Define the audio data by specifying:
- Name: "audio"
- Data type: Audio
- Storage: URL (links to external audio files)
- Start/End columns: Time-based segmentation
- Language: English (US)
from aixplain.enums import DataType, Language, License, StorageType
from aixplain.modules import MetaData
audio_meta = MetaData(
    name="audio",
    dtype=DataType.AUDIO,
    storage_type=StorageType.URL,
    start_column="audio_start_time",
    end_column="audio_end_time",
    languages=[Language.English_UNITED_STATES]
)
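Because start_column and end_column drive the time-based segmentation, it can help to sanity-check those values before onboarding. The following is a minimal sketch using only the Python standard library. The helper name find_bad_segments, the sample rows, and the rule that each start must be a number strictly less than its end are assumptions for illustration, not documented aiXplain requirements.

```python
import csv
import io


def find_bad_segments(csv_text, start_col="audio_start_time", end_col="audio_end_time"):
    """Return 0-based indices of data rows whose segment times look invalid.

    Hypothetical helper: a row is flagged when either value is not a number
    or when the start is not strictly less than the end.
    """
    bad = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        try:
            start = float(row[start_col])
            end = float(row[end_col])
        except (KeyError, TypeError, ValueError):
            # Missing column, short row, or non-numeric value.
            bad.append(index)
            continue
        if not start < end:
            bad.append(index)
    return bad


# Made-up sample rows; the values in the real file will differ.
sample = (
    "audio,text,audio_start_time,audio_end_time\n"
    "https://example.com/a.wav,hello world,0.0,2.5\n"
    "https://example.com/b.wav,bad segment,3.0,1.0\n"
    "https://example.com/c.wav,missing end,0.0,\n"
)
bad_rows = find_bad_segments(sample)
print(bad_rows)
```

Rows reported by a check like this could be corrected or dropped from the CSV before it is uploaded.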
Text Metadata
Define the text data by specifying:
- Name: "text"
- Data type: Text
- Storage: Directly stored in the CSV
- Language: English (US)
text_meta = MetaData(
    name="text",
    dtype=DataType.TEXT,
    storage_type=StorageType.TEXT,
    languages=[Language.English_UNITED_STATES]
)
Combine the metadata definitions into a schema.
schema = [audio_meta, text_meta]
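Each metadata name, along with the start and end column names, is assumed here to correspond to a column in the CSV file. A minimal sketch of verifying that before onboarding is shown below, with plain strings standing in for the metadata objects so the example runs without the SDK. The helper missing_columns and the sample header are hypothetical.

```python
import csv
import io


def missing_columns(csv_text, expected):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = set(header)
    return [name for name in expected if name not in present]


# Column names taken from the metadata definitions above.
expected_columns = ["audio", "audio_start_time", "audio_end_time", "text"]

# Made-up CSV whose header lacks one of the expected columns.
sample_csv = "audio,text,audio_start_time\nignored,ignored,ignored\n"

missing = missing_columns(sample_csv, expected_columns)
print(missing)
```

An empty result suggests the schema and the file agree on column names; anything else points to a column to rename or add.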
Onboard the Corpus
To upload the dataset, call the CorpusFactory.create method with:
- Name: A unique identifier for the corpus.
- Description: Brief details of the dataset.
- License: Defines usage permissions.
- File path: Location of the dataset.
- Schema: The metadata structure defined earlier.
from aixplain.factories import CorpusFactory
payload = CorpusFactory.create(
    name="corpus_onboarding_demo",
    description="This corpus contains 20 English audios with their corresponding transcriptions.",
    license=License.MIT,
    content_path=upload_file,
    schema=schema
)
print(payload) # Displays Corpus ID and onboarding status
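Before passing the payload on, it may be worth confirming that it actually carries an asset ID. The sketch below assumes the payload behaves like a dictionary with an "asset_id" key and, hypothetically, a "status" key. The helper name and the sample values are made up for illustration.

```python
def asset_id_from_payload(payload):
    """Return the asset ID from an onboarding payload or raise a clear error."""
    asset_id = payload.get("asset_id")
    if not asset_id:
        # "status" is an assumed key, used only to make the error message more helpful.
        status = payload.get("status", "unknown")
        raise ValueError(f"Onboarding payload has no asset_id (status: {status})")
    return asset_id


# Made-up payload for illustration; real IDs and status values come from the platform.
example_payload = {"asset_id": "hypothetical-corpus-id", "status": "onboarding"}
corpus_id = asset_id_from_payload(example_payload)
print(corpus_id)
```

Failing early with a descriptive error keeps a missing ID from surfacing later as a less obvious lookup failure.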
Retrieve the Corpus
Once onboarded, the corpus can be retrieved using its asset ID.
corpus = CorpusFactory.get(payload["asset_id"])
print(corpus.to_dict()) # Displays corpus details
Onboarding a corpus makes a structured dataset available in aiXplain for model training and evaluation. The steps above cover preparing the data, defining its metadata, uploading it, and retrieving it by asset ID.