Version: v2.0

Transcribe audio with Whisper Large

In this tutorial you'll turn an audio file into text using Whisper Large (model ID 66311fda6eb563279c574b71, served via Groq) on aiXplain. You'll do it two ways: with the Python SDK when you have a local file, and with a direct REST call when the audio is already at a public URL. Along the way you'll pull out segment-level timestamps and confidence scores.

Whisper auto-detects the spoken language, so the same call works for English, Arabic, or any of the ~100 languages it supports.

Prerequisites:

  • An aiXplain account and API key
  • Credits in your wallet (or a voucher)
  • pip install aixplain

Cost: $0.0018 per minute of audio, plus the standard 20% service fee.
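As a worked example of that pricing, here is a minimal sketch that estimates the charge for a file of a given length. It assumes the 20% fee is applied on top of the per-minute model cost and ignores any rounding or minimum billable duration the platform may apply; the constant and function names are illustrative.

```python
PRICE_PER_MINUTE_USD = 0.0018  # Whisper Large price quoted above
SERVICE_FEE_RATE = 0.20        # standard service fee, assumed to be charged on top


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimate the total charge for transcribing `duration_seconds` of audio."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    base = (duration_seconds / 60) * PRICE_PER_MINUTE_USD
    return base * (1 + SERVICE_FEE_RATE)


# A 10-minute file: 10 * 0.0018 = 0.018, plus 20% = 0.0216
print(f"${estimate_cost_usd(600):.4f}")  # prints $0.0216
```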


Method 1: Python SDK (local file)

Use the SDK when the audio lives on disk: upload it to temporary aiXplain storage, then run the model on the returned URL.

Step 1: Upload the file

from aixplain.v2.file import FileUploader

uploader = FileUploader(api_key="YOUR_API_KEY")
audio_url = uploader.upload(
    file_path="/path/to/audio.mp3",
    is_temp=True,
    return_download_link=True,
)
print("Uploaded:", audio_url)

is_temp=True stores the file as a short-lived temp object; return_download_link=True gives you a signed URL the model can fetch.

Step 2: Run the model

from aixplain import Aixplain

aix = Aixplain(api_key="YOUR_API_KEY")
model = aix.Model.get("66311fda6eb563279c574b71")

result = model.run(
    source_audio=audio_url,
    sourcelanguage="en",  # required field; Whisper still auto-detects the real language
    options={"includeRawData": True},  # return the provider's full response
)

print("Status: ", result.status)
print("Transcript:", result.data)
print("Run time: ", result.run_time, "s")
print("Credits: ", result.used_credits)

The transcript is in result.data. The audio in this run was Arabic, and Whisper transcribed it correctly even though we passed sourcelanguage="en": the field is required but does not override language detection.

Step 3: Read timestamps and confidence

Because we passed options={"includeRawData": True}, the provider's raw Whisper response is available under result._raw_data["rawData"]:

raw = result._raw_data["rawData"]
print("Detected language:", raw["language"])
print("Duration (s): ", raw["duration"])

for seg in raw["segments"][:3]:
    print(f"[{seg['start']:.2f}s → {seg['end']:.2f}s] {seg['text'].strip()}")
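Segment start and end times are plain seconds, so turning them into subtitles is mostly formatting. Below is a minimal sketch that builds SubRip (SRT) text from a list of segment dicts shaped like raw["segments"]; it assumes only the start, end, and text keys, and the helper names and sample segment are hypothetical, for illustration only.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from Whisper-style segments with start/end/text keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # SRT separates cues with a blank line
    return "\n".join(blocks)


# Hypothetical segment, for illustration only
sample = [{"start": 0.0, "end": 3.52, "text": " Hello and welcome."}]
print(segments_to_srt(sample))
```

With the result from Step 3, segments_to_srt(raw["segments"]) would return the whole transcript as subtitle cues.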

Each segment also carries tokens, avg_logprob (the mean log-probability of its tokens; use it as a confidence proxy), compression_ratio, and no_speech_prob (the probability that the segment contains no speech).
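These fields are enough to flag segments worth a second look. Here is a minimal sketch, assuming the segment dicts carry the keys listed above. The thresholds (-1.0 for avg_logprob, 0.6 for no_speech_prob, 2.4 for compression_ratio) borrow the defaults of the open-source openai-whisper transcribe() function; they are a starting point, not values aiXplain or Groq prescribe. math.exp(avg_logprob) converts the mean log-probability into the geometric-mean token probability, which is easier to read. The helper names and sample segments are hypothetical.

```python
import math

# Heuristic thresholds borrowed from openai-whisper's transcribe() defaults;
# tune them for your own audio.
LOGPROB_THRESHOLD = -1.0
NO_SPEECH_THRESHOLD = 0.6
COMPRESSION_RATIO_THRESHOLD = 2.4


def segment_confidence(seg: dict) -> float:
    """Geometric-mean token probability, derived from avg_logprob."""
    return math.exp(seg["avg_logprob"])


def review_flags(seg: dict) -> list[str]:
    """Return reasons a Whisper segment may be unreliable (empty if none).

    Each check is applied independently here; openai-whisper itself combines
    the no-speech and log-prob checks when deciding to skip a segment.
    """
    flags = []
    if seg["avg_logprob"] < LOGPROB_THRESHOLD:
        flags.append("low confidence")
    if seg["no_speech_prob"] > NO_SPEECH_THRESHOLD:
        flags.append("likely silence")
    if seg["compression_ratio"] > COMPRESSION_RATIO_THRESHOLD:
        flags.append("repetitive text")
    return flags


# Hypothetical segments, for illustration only
segments = [
    {"text": "clear speech", "avg_logprob": -0.2, "no_speech_prob": 0.01, "compression_ratio": 1.3},
    {"text": "mumbled", "avg_logprob": -1.4, "no_speech_prob": 0.7, "compression_ratio": 1.1},
]
for seg in segments:
    print(f"{segment_confidence(seg):.2f}", seg["text"], review_flags(seg))
```

On real output, loop over raw["segments"] instead of the sample list.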


Method 2: REST API (public URL)

When the audio is already hosted somewhere publicly reachable, skip the upload and call the execute endpoint directly.

import requests

MODEL_ID = "66311fda6eb563279c574b71"

response = requests.post(
    f"https://models.aixplain.com/api/v2/execute/{MODEL_ID}",
    headers={"Content-Type": "application/json", "x-api-key": "YOUR_API_KEY"},
    json={
        "source_audio": "https://dare.wisc.edu/wp-content/uploads/sites/1051/2008/04/Arthur.mp3",
        "options": {"includeRawData": True},
    },
)
data = response.json()

print("Status: ", data["status"])
print("Language:", data["rawData"]["language"])
print("Duration:", data["rawData"]["duration"], "s")
print("Segments:", len(data["rawData"]["segments"]))
print("Transcript:", data["data"][:120], "...")
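When you call the endpoint from your own code, it helps to fail loudly if the run did not succeed or if rawData is missing (for example, because includeRawData was left off). Here is a minimal sketch operating on the parsed JSON. It assumes a finished run reports status "SUCCESS", matching the status values the aiXplain SDK uses, so confirm this against the responses you actually receive. The helper name and sample response are hypothetical.

```python
def summarize_transcription(data: dict) -> dict:
    """Extract the fields printed above from an execute response, or raise."""
    status = data.get("status")
    # Assumption: "SUCCESS" marks a finished, successful run
    if status != "SUCCESS":
        raise RuntimeError(f"Run not successful (status={status!r}): {data}")
    raw = data.get("rawData")
    if raw is None:
        raise ValueError("rawData missing; was options.includeRawData set to true?")
    return {
        "language": raw["language"],
        "duration": raw["duration"],
        "segments": len(raw["segments"]),
        "transcript": data["data"],
    }


# Hypothetical response, for illustration only
sample = {
    "status": "SUCCESS",
    "data": "Hello and welcome.",
    "rawData": {
        "language": "english",
        "duration": 3.5,
        "segments": [{"start": 0.0, "end": 3.5, "text": "Hello and welcome."}],
    },
}
print(summarize_transcription(sample))
```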

Equivalent curl:

curl --location 'https://models.aixplain.com/api/v2/execute/66311fda6eb563279c574b71' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: YOUR_API_KEY' \
  --data '{
    "source_audio": "https://dare.wisc.edu/wp-content/uploads/sites/1051/2008/04/Arthur.mp3",
    "options": {"includeRawData": true}
  }'

The audio URL must be publicly reachable (a signed URL also works), because the provider downloads it server-side. If the host is unreachable, the request fails with err.invalid_input_data_or_input_url.
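Because the provider fetches the audio itself, a local path or malformed link only fails once it reaches the server. Below is a cheap local check, sketched with the standard library's urllib.parse, that catches those mistakes before you spend a request. It does not contact the host, so it cannot tell whether the host is reachable or whether a signed URL has expired. A URL on localhost or a private network passes this check but still fails server-side. The helper name is hypothetical.

```python
from urllib.parse import urlparse


def looks_like_public_audio_url(url: str) -> bool:
    """Cheap local check: an http(s) URL with a host. Does not contact the server."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_public_audio_url(
    "https://dare.wisc.edu/wp-content/uploads/sites/1051/2008/04/Arthur.mp3"
))  # True
print(looks_like_public_audio_url("/path/to/audio.mp3"))  # False
```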


Parameters

| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| source_audio | URL | Yes | Public or signed URL to the audio, or the URL returned by FileUploader |
| sourcelanguage | string | Yes | Required field; Whisper auto-detects the actual language regardless |
| options.includeRawData | bool | No | Returns the provider's full Whisper response (rawData) with segments, tokens, and log-probs |
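The table maps onto the JSON body of the REST request. Here is a minimal sketch of a payload builder. The helper is hypothetical, and sending sourcelanguage as a top-level key is an assumption based on the SDK keyword in Step 2 (the REST example above omits it).

```python
import json


def build_execute_payload(
    source_audio: str,
    sourcelanguage: str = "en",
    include_raw_data: bool = True,
) -> dict:
    """Build a JSON body for the execute endpoint from the parameters above."""
    # Assumption: sourcelanguage is accepted as a top-level key, like the SDK keyword
    payload = {"source_audio": source_audio, "sourcelanguage": sourcelanguage}
    if include_raw_data:
        payload["options"] = {"includeRawData": True}
    return payload


# Placeholder URL, for illustration only
print(json.dumps(build_execute_payload("https://example.com/audio.mp3"), indent=2))
```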

Next steps

  • See the full Speech Recognition reference, including output fields, in Models.
  • Learn the sync/async execution model and how to pass files and URLs to any model in API Requests.
  • Browse other speech and audio models in the aiXplain Marketplace.