aixplain.factories.dataset_factory

__author__

Copyright 2022 The aiXplain SDK authors

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Author: Duraikrishna Selvaraju, Thiago Castro Ferreira, Shreyas Sharma and Lucas Pavanelli
Date: December 1st 2022
Description: Dataset Factory Class

DatasetFactory Objects

class DatasetFactory(AssetFactory)


Factory class for creating and managing datasets in the aiXplain platform.

This class provides functionality for creating, retrieving, and managing datasets, which are structured collections of data assets used for training, evaluating, and benchmarking AI models. Datasets can include input data, target data, hypotheses, and metadata.

Attributes:

  • backend_url str - Base URL for the aiXplain backend API.

get

@classmethod
def get(cls, dataset_id: Text) -> Dataset


Retrieve a dataset by its ID.

This method fetches a dataset and all its associated data assets from the platform.

Arguments:

  • dataset_id Text - Unique identifier of the dataset to retrieve.

Returns:

  • Dataset - Retrieved dataset object with all components loaded.

Raises:

  • Exception - If:
    • Dataset ID is invalid
    • Authentication fails
    • Service is unavailable

list

@classmethod
def list(cls,
         query: Optional[Text] = None,
         function: Optional[Function] = None,
         source_languages: Optional[Union[Language, List[Language]]] = None,
         target_languages: Optional[Union[Language, List[Language]]] = None,
         data_type: Optional[DataType] = None,
         license: Optional[License] = None,
         is_referenceless: Optional[bool] = None,
         page_number: int = 0,
         page_size: int = 20) -> Dict


List and filter datasets with pagination support.

This method provides comprehensive filtering and pagination capabilities for retrieving datasets from the aiXplain platform.

Arguments:

  • query Optional[Text], optional - Search query to filter datasets by name or description. Defaults to None.
  • function Optional[Function], optional - Filter by AI function type. Defaults to None.
  • source_languages Optional[Union[Language, List[Language]]], optional - Filter by input data language(s). Can be a single language or a list. Defaults to None.
  • target_languages Optional[Union[Language, List[Language]]], optional - Filter by output data language(s). Can be a single language or a list. Defaults to None.
  • data_type Optional[DataType], optional - Filter by data type. Defaults to None.
  • license Optional[License], optional - Filter by license type. Defaults to None.
  • is_referenceless Optional[bool], optional - Filter by whether the dataset is referenceless, i.e., contains no reference/target data. Defaults to None.
  • page_number int, optional - Zero-based page number. Defaults to 0.
  • page_size int, optional - Number of items per page (1-100). Defaults to 20.

Returns:

  • Dict - Response containing:
    • results (List[Dataset]): List of dataset objects
    • page_total (int): Total items in current page
    • page_number (int): Current page number
    • total (int): Total number of items across all pages

Raises:

  • Exception - If:
    • page_size is not between 1 and 100
    • Request fails
    • Service is unavailable
  • AssertionError - If page_size is invalid.
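
Example:

A hedged listing sketch. The Function and Language enum members shown are assumed to exist in aixplain.enums; the response keys follow the Returns section above:

from aixplain.factories import DatasetFactory
from aixplain.enums import Function, Language

# List translation datasets with English source text, first page of 20.
response = DatasetFactory.list(
    function=Function.TRANSLATION,
    source_languages=Language.English,
    page_number=0,
    page_size=20,
)
for dataset in response["results"]:
    print(dataset.id, dataset.name)
print(f'Showing {response["page_total"]} of {response["total"]} datasets')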

create

@classmethod
def create(cls,
           name: Text,
           description: Text,
           license: License,
           function: Function,
           input_schema: List[Union[Dict, MetaData]],
           output_schema: List[Union[Dict, MetaData]] = [],
           hypotheses_schema: List[Union[Dict, MetaData]] = [],
           metadata_schema: List[Union[Dict, MetaData]] = [],
           content_path: Union[Union[Text, Path], List[Union[Text, Path]]] = [],
           input_ref_data: Dict[Text, Any] = {},
           output_ref_data: Dict[Text, List[Any]] = {},
           hypotheses_ref_data: Dict[Text, Any] = {},
           meta_ref_data: Dict[Text, Any] = {},
           tags: List[Text] = [],
           privacy: Privacy = Privacy.PRIVATE,
           split_labels: Optional[List[Text]] = None,
           split_rate: Optional[List[float]] = None,
           error_handler: ErrorHandler = ErrorHandler.SKIP,
           s3_link: Optional[Text] = None,
           aws_credentials: Optional[Dict[Text, Text]] = {
               "AWS_ACCESS_KEY_ID": None,
               "AWS_SECRET_ACCESS_KEY": None
           },
           api_key: Optional[Text] = None) -> Dict


Create a new dataset from data files and references.

This method processes data files and existing data assets to create a new dataset in the platform. It supports various data types, multiple input and output configurations, and optional data splitting.

Arguments:

  • name Text - Name for the new dataset.
  • description Text - Description of the dataset's contents and purpose.
  • license License - License type for the dataset.
  • function Function - AI function this dataset is suitable for.
  • input_schema List[Union[Dict, MetaData]] - Metadata configurations for input data processing.
  • output_schema List[Union[Dict, MetaData]], optional - Metadata configs for output/target data. Defaults to [].
  • hypotheses_schema List[Union[Dict, MetaData]], optional - Metadata configs for hypothesis data. Defaults to [].
  • metadata_schema List[Union[Dict, MetaData]], optional - Additional metadata configurations. Defaults to [].
  • content_path Union[Union[Text, Path], List[Union[Text, Path]]], optional - Path(s) to data files. Can be a single path or a list. Defaults to [].
  • input_ref_data Dict[Text, Any], optional - References to existing input data assets. Defaults to {}.
  • output_ref_data Dict[Text, List[Any]], optional - References to existing output data assets. Defaults to {}.
  • hypotheses_ref_data Dict[Text, Any], optional - References to existing hypothesis data. Defaults to {}.
  • meta_ref_data Dict[Text, Any], optional - References to existing metadata assets. Defaults to {}.
  • tags List[Text], optional - Tags describing the dataset. Defaults to [].
  • privacy Privacy, optional - Visibility setting. Defaults to Privacy.PRIVATE.
  • split_labels Optional[List[Text]], optional - Labels for dataset splits (e.g., ["train", "test"]). Defaults to None.
  • split_rate Optional[List[float]], optional - Ratios for dataset splits (must sum to 1). Defaults to None.
  • error_handler ErrorHandler, optional - Strategy for handling data processing errors. Defaults to ErrorHandler.SKIP.
  • s3_link Optional[Text], optional - S3 URL for data files. Defaults to None.
  • aws_credentials Optional[Dict[Text, Text]], optional - AWS credentials with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Defaults to None values.
  • api_key Optional[Text], optional - API key for authentication. Defaults to None, using the configured TEAM_API_KEY.

Returns:

  • Dict - Response containing:
    • status: Current processing status
    • asset_id: ID of the created dataset

Raises:

  • Exception - If:
    • No input data is provided
    • Referenced data asset doesn't exist
    • Reserved column names are used
    • Data rows are misaligned
    • Split configuration is invalid
    • Processing or upload fails
  • AssertionError - If split configuration is invalid.
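
Example:

A hedged end-to-end sketch of creating a translation dataset. The CSV path, column names, and enum members used here (License.MIT, Function.TRANSLATION, DataType.TEXT, StorageType.TEXT, and the Language members) are illustrative assumptions, not values confirmed by this page:

from aixplain.enums import DataType, Function, Language, License, StorageType
from aixplain.factories import DatasetFactory
from aixplain.modules import MetaData

# Hypothetical CSV with an "en" source column and a "pt" target column.
source_meta = MetaData(name="en", dtype=DataType.TEXT,
                       storage_type=StorageType.TEXT,
                       languages=[Language.English])
target_meta = MetaData(name="pt", dtype=DataType.TEXT,
                       storage_type=StorageType.TEXT,
                       languages=[Language.Portuguese])

payload = DatasetFactory.create(
    name="en-pt parallel corpus (sample)",
    description="Hypothetical English-Portuguese corpus for machine translation.",
    license=License.MIT,
    function=Function.TRANSLATION,
    input_schema=[source_meta],
    output_schema=[target_meta],
    content_path="data/en_pt.csv",
    split_labels=["train", "test"],  # optional split; rates must sum to 1
    split_rate=[0.8, 0.2],
)
print(payload["status"], payload["asset_id"])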