module aixplain.processes.data_onboarding.process_text_files


function process_text

process_text(content: str, storage_type: StorageType) → str

Process text files

Args:

  • content (str): URL with text, local path with text or textual content
  • storage_type (StorageType): type of storage: URL, local path or textual content

Returns:

  • str: textual content
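
The dispatch on storage type described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `StorageKind` and `read_text` are hypothetical stand-ins for `StorageType` and `process_text`, and the enum members and UTF-8 decoding are assumptions.

```python
from enum import Enum
from pathlib import Path
from urllib.request import urlopen


class StorageKind(Enum):
    """Hypothetical stand-in for the library's StorageType."""
    URL = "url"
    FILE = "file"
    TEXT = "text"


def read_text(content: str, storage_kind: StorageKind) -> str:
    """Return the textual content that `content` refers to."""
    if storage_kind is StorageKind.URL:
        # Download the text from a public link (assumes UTF-8).
        with urlopen(content) as response:
            return response.read().decode("utf-8")
    if storage_kind is StorageKind.FILE:
        # Read the text from a local path (assumes UTF-8).
        return Path(content).read_text(encoding="utf-8")
    # Otherwise `content` is already the text itself.
    return content
```

With this sketch, `read_text("hello", StorageKind.TEXT)` returns the string unchanged, while the other two kinds resolve the reference first.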

function run

run(
metadata: MetaData,
paths: List,
folder: Path,
batch_size: int = 1000
) → Tuple[List[File], int, int]

Process a list of local textual files, compress and upload them to pre-signed URLs in S3

Explanation: Each text in "paths" is processed. If the text is at a public link or in a local file, it is downloaded and added to an index CSV file. Texts are processed in batches: after every "batch_size" texts, the index CSV file is uploaded to a pre-signed URL in S3 and reset.

Args:

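  • metadata (MetaData): metadata of the asset
  • paths (List): list of paths to local files
  • folder (Path): local folder in which compressed files are saved before they are uploaded to S3
  • batch_size (int, optional): number of texts added to the index CSV file before it is uploaded and reset. Defaults to 1000.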

Returns:

  • Tuple[List[File], int, int]: list of S3 links, data column index and number of rows
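
The batching described in the explanation can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `index_batches` and the `upload` callback are hypothetical names, the CSV column name `text` and the gzip compression are invented for the example, and the actual upload to a pre-signed URL is left to the caller.

```python
import csv
import gzip
import shutil
from pathlib import Path
from typing import Callable, List, Tuple


def index_batches(
    texts: List[str],
    folder: Path,
    upload: Callable[[Path], str],
    batch_size: int = 1000,
) -> Tuple[List[str], int, int]:
    """Write texts to index CSVs in batches, compress each, and pass it to `upload`."""
    links: List[str] = []
    data_column_idx = 0  # assumption: the text is the first (and only) column
    n_rows = 0
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Start a fresh index file for every batch (the "reset" step).
        index_path = folder / f"index-{len(links)}.csv"
        with index_path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text"])
            writer.writerows([t] for t in batch)
        # Compress the index file before handing it to the uploader.
        gz_path = index_path.with_name(index_path.name + ".gz")
        with index_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        links.append(upload(gz_path))
        n_rows += len(batch)
    return links, data_column_idx, n_rows
```

Passing the uploader as a callback keeps the sketch testable without network access; a real caller would supply a function that sends the file to the pre-signed URL and returns its link.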