aixplain.factories.dataset_factory
Copyright 2022 The aiXplain SDK authors
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Author: Duraikrishna Selvaraju, Thiago Castro Ferreira, Shreyas Sharma and Lucas Pavanelli Date: December 1st 2022 Description: Dataset Factory Class
DatasetFactory Objects
class DatasetFactory(AssetFactory)
Factory class for creating and managing datasets in the aiXplain platform.
This class provides functionality for creating, retrieving, and managing datasets, which are structured collections of data assets used for training, evaluating, and benchmarking AI models. Datasets can include input data, target data, hypotheses, and metadata.
Attributes:
backend_url (str) - Base URL for the aiXplain backend API.
get
@classmethod
def get(cls, dataset_id: Text) -> Dataset
Retrieve a dataset by its ID.
This method fetches a dataset and all its associated data assets from the platform.
Arguments:
dataset_id (Text) - Unique identifier of the dataset to retrieve.
Returns:
Dataset - Retrieved dataset object with all components loaded.
Raises:
Exception - If:
- Dataset ID is invalid
- Authentication fails
- Service is unavailable
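A minimal usage sketch for `get`. The wrapper function name is an assumption (not part of the SDK), and a real call additionally requires the aixplain package, a configured TEAM_API_KEY, and network access:

```python
def fetch_dataset(dataset_id):
    """Retrieve a dataset by ID, turning a missing SDK into a clear error.

    Assumes the documented import path `aixplain.factories.DatasetFactory`;
    `DatasetFactory.get` itself raises Exception on an invalid ID,
    authentication failure, or service unavailability.
    """
    try:
        from aixplain.factories import DatasetFactory
    except ImportError as exc:
        raise RuntimeError("the aixplain SDK must be installed") from exc
    return DatasetFactory.get(dataset_id)
```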
list
@classmethod
def list(cls,
query: Optional[Text] = None,
function: Optional[Function] = None,
source_languages: Optional[Union[Language, List[Language]]] = None,
target_languages: Optional[Union[Language, List[Language]]] = None,
data_type: Optional[DataType] = None,
license: Optional[License] = None,
is_referenceless: Optional[bool] = None,
page_number: int = 0,
page_size: int = 20) -> Dict
List and filter datasets with pagination support.
This method provides comprehensive filtering and pagination capabilities for retrieving datasets from the aiXplain platform.
Arguments:
query (Optional[Text], optional) - Search query to filter datasets by name or description. Defaults to None.
function (Optional[Function], optional) - Filter by AI function type. Defaults to None.
source_languages (Optional[Union[Language, List[Language]]], optional) - Filter by input data language(s). Can be a single language or a list. Defaults to None.
target_languages (Optional[Union[Language, List[Language]]], optional) - Filter by output data language(s). Can be a single language or a list. Defaults to None.
data_type (Optional[DataType], optional) - Filter by data type. Defaults to None.
license (Optional[License], optional) - Filter by license type. Defaults to None.
is_referenceless (Optional[bool], optional) - Filter by whether the dataset has references. Defaults to None.
page_number (int, optional) - Zero-based page number. Defaults to 0.
page_size (int, optional) - Number of items per page (1-100). Defaults to 20.
Returns:
Dict - Response containing:
- results (List[Dataset]): List of dataset objects
- page_total (int): Total items in current page
- page_number (int): Current page number
- total (int): Total number of items across all pages
Raises:
Exception - If:
- page_size is not between 1 and 100
- Request fails
- Service is unavailable
AssertionError - If page_size is invalid.
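The pagination contract above can be sketched as a loop that walks every page. `fetch_page` is a stand-in (an assumption) for `DatasetFactory.list`; only the documented response keys (`results`, `total`) and the documented 1-100 `page_size` bound are relied on:

```python
def iterate_all(fetch_page, page_size=20):
    """Yield every dataset across pages, following the documented schema.

    fetch_page(page_number=..., page_size=...) must return a dict with
    "results" (items on this page) and "total" (items across all pages).
    """
    assert 1 <= page_size <= 100, "page_size must be between 1 and 100"
    page_number = 0  # pages are zero-based, per the docs above
    while True:
        resp = fetch_page(page_number=page_number, page_size=page_size)
        yield from resp["results"]
        # Stop once all `total` items have been covered.
        if (page_number + 1) * page_size >= resp["total"]:
            break
        page_number += 1
```

Calling code would pass `DatasetFactory.list` (or a thin wrapper over it) as `fetch_page`.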
create
@classmethod
def create(cls,
name: Text,
description: Text,
license: License,
function: Function,
input_schema: List[Union[Dict, MetaData]],
output_schema: List[Union[Dict, MetaData]] = [],
hypotheses_schema: List[Union[Dict, MetaData]] = [],
metadata_schema: List[Union[Dict, MetaData]] = [],
content_path: Union[Union[Text, Path], List[Union[Text,
Path]]] = [],
input_ref_data: Dict[Text, Any] = {},
output_ref_data: Dict[Text, List[Any]] = {},
hypotheses_ref_data: Dict[Text, Any] = {},
meta_ref_data: Dict[Text, Any] = {},
tags: List[Text] = [],
privacy: Privacy = Privacy.PRIVATE,
split_labels: Optional[List[Text]] = None,
split_rate: Optional[List[float]] = None,
error_handler: ErrorHandler = ErrorHandler.SKIP,
s3_link: Optional[Text] = None,
aws_credentials: Optional[Dict[Text, Text]] = {
"AWS_ACCESS_KEY_ID": None,
"AWS_SECRET_ACCESS_KEY": None
},
api_key: Optional[Text] = None) -> Dict
Create a new dataset from data files and references.
This method processes data files and existing data assets to create a new dataset in the platform. It supports various data types, multiple input and output configurations, and optional data splitting.
Arguments:
name (Text) - Name for the new dataset.
description (Text) - Description of the dataset's contents and purpose.
license (License) - License type for the dataset.
function (Function) - AI function this dataset is suitable for.
input_schema (List[Union[Dict, MetaData]]) - Metadata configurations for input data processing.
output_schema (List[Union[Dict, MetaData]], optional) - Metadata configurations for output/target data. Defaults to [].
hypotheses_schema (List[Union[Dict, MetaData]], optional) - Metadata configurations for hypothesis data. Defaults to [].
metadata_schema (List[Union[Dict, MetaData]], optional) - Additional metadata configurations. Defaults to [].
content_path (Union[Union[Text, Path], List[Union[Text, Path]]], optional) - Path(s) to data files. Can be a single path or a list. Defaults to [].
input_ref_data (Dict[Text, Any], optional) - References to existing input data assets. Defaults to {}.
output_ref_data (Dict[Text, List[Any]], optional) - References to existing output data assets. Defaults to {}.
hypotheses_ref_data (Dict[Text, Any], optional) - References to existing hypothesis data. Defaults to {}.
meta_ref_data (Dict[Text, Any], optional) - References to existing metadata assets. Defaults to {}.
tags (List[Text], optional) - Tags describing the dataset. Defaults to [].
privacy (Privacy, optional) - Visibility setting. Defaults to Privacy.PRIVATE.
split_labels (Optional[List[Text]], optional) - Labels for dataset splits (e.g., ["train", "test"]). Defaults to None.
split_rate (Optional[List[float]], optional) - Ratios for dataset splits (must sum to 1). Defaults to None.
error_handler (ErrorHandler, optional) - Strategy for handling data processing errors. Defaults to ErrorHandler.SKIP.
s3_link (Optional[Text], optional) - S3 URL for data files. Defaults to None.
aws_credentials (Optional[Dict[Text, Text]], optional) - AWS credentials with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Defaults to None values.
api_key (Optional[Text], optional) - API key for authentication. Defaults to None, using the configured TEAM_API_KEY.
Returns:
Dict - Response containing:
- status: Current processing status
- asset_id: ID of the created dataset
Raises:
Exception - If:
- No input data is provided
- Referenced data asset doesn't exist
- Reserved column names are used
- Data rows are misaligned
- Split configuration is invalid
- Processing or upload fails
AssertionError - If split configuration is invalid.
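The split constraints documented for `split_labels` and `split_rate` can be checked up front before calling `create`. The helper name is an assumption, not part of the SDK; it only encodes the documented rules (labels and rates go together, one rate per label, rates sum to 1):

```python
import math

def validate_splits(split_labels, split_rate):
    """Pre-check the split arguments against the documented constraints."""
    if split_labels is None and split_rate is None:
        return  # no splitting requested, nothing to check
    assert split_labels is not None and split_rate is not None, \
        "split_labels and split_rate must be provided together"
    assert len(split_labels) == len(split_rate), \
        "one rate is required per split label"
    # Use a tolerance so float sums like 0.7 + 0.2 + 0.1 still pass.
    assert math.isclose(sum(split_rate), 1.0), "split rates must sum to 1"
```

For example, `validate_splits(["train", "test"], [0.8, 0.2])` passes, while mismatched lengths or rates that do not sum to 1 raise AssertionError, mirroring the behavior documented above.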