Server path: /audio-processing | Type: Embedded | PCID required: No

Convert text to speech with configurable voices and emotions. Transcribe audio and video files with speaker diarization, summaries, and sentiment analysis.

Tools

| Tool | Description |
| --- | --- |
| audio-processing_text_to_speech | Convert text to speech |
| audio-processing_transcribe_audio_or_video | Transcribe audio or video to text |

audio-processing_text_to_speech

Convert text to speech with configurable voice, pitch, speed, volume, and emotion settings.

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | Yes | | Text to convert to speech (max 5000 characters) |
| voice_id | enum | No | "Wise_Woman" | Voice to use: "Wise_Woman", "Friendly_Person", "Inspirational_girl", "Deep_Voice_Man", "Calm_Woman", "Casual_Guy", "Lively_Girl", "Patient_Man", "Young_Knight", "Determined_Man", "Lovely_Girl", "Decent_Boy", "Imposing_Manner", "Elegant_Man", "Abbess", "Sweet_Girl_2", "Exuberant_Girl" |
| pitch | number | No | 0 | Voice pitch adjustment (-12 to 12) |
| speed | number | No | 1 | Speech speed multiplier (0.5 to 2) |
| volume | number | No | 1 | Volume level (0 to 10) |
| emotion | enum | No | "auto" | Emotional tone: "auto", "neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised" |
| sample_rate | number | No | 32000 | Audio sample rate in Hz |
| language_boost | enum | No | "None" | Language to boost for pronunciation accuracy. Supports many languages, including English, Spanish, French, German, Chinese, Japanese, and Korean. |
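
The documented limits can be checked on the client before the tool is called. The following is a minimal sketch, assuming the arguments are sent as a JSON-style object keyed by the parameter names above. `build_tts_arguments` is a hypothetical helper, not part of the server, and treating the range bounds as inclusive is also an assumption.

```python
from typing import Any

# Limits copied from the parameter table; inclusive bounds are an assumption.
TTS_RANGES = {
    "pitch": (-12, 12),
    "speed": (0.5, 2),
    "volume": (0, 10),
}
MAX_TEXT_LENGTH = 5000
EMOTIONS = {
    "auto", "neutral", "happy", "sad",
    "angry", "fearful", "disgusted", "surprised",
}


def build_tts_arguments(text: str, **options: Any) -> dict[str, Any]:
    """Hypothetical helper: validate options and return the tool arguments."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    for name, (low, high) in TTS_RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if "emotion" in options and options["emotion"] not in EMOTIONS:
        raise ValueError(f"unknown emotion: {options['emotion']}")
    # Options that are left out fall back to the defaults in the table.
    return {"text": text, **options}


arguments = build_tts_arguments(
    "Hello, world.", voice_id="Calm_Woman", speed=1.25, emotion="happy"
)
```

Only `text` is required, so the helper passes through whatever optional settings the caller supplies and leaves the rest to the server defaults.
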
Response fields:

| Field | Type | Description |
| --- | --- | --- |
| output | object[] | Array of audio output objects |
| output[].url | string | URL of the generated audio file |
| output[].mimeType | string | MIME type of the generated audio |
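
Because `output` is an array, a caller usually iterates over it rather than reading a single URL. The sketch below assumes the response has already been parsed into a Python dict with the fields listed above. `audio_urls` is a hypothetical helper, and the sample URL and MIME type are placeholders rather than real server output.

```python
from typing import Any


def audio_urls(response: dict[str, Any], mime_prefix: str = "audio/") -> list[str]:
    """Return the URLs of output entries whose mimeType starts with mime_prefix."""
    return [
        item["url"]
        for item in response.get("output", [])
        if item.get("mimeType", "").startswith(mime_prefix)
    ]


# Illustrative response shape only; values are placeholders.
sample_response = {
    "output": [
        {"url": "https://example.com/speech.mp3", "mimeType": "audio/mpeg"},
    ]
}
urls = audio_urls(sample_response)
```
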

audio-processing_transcribe_audio_or_video

Transcribe audio or video files to text. Supports speaker diarization, paragraph formatting, summaries, topic detection, sentiment analysis, and content redaction.

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| fileUrl | string | Yes | | URL of the audio or video file to transcribe |
| model | enum | No | "nova-3" | Transcription model: "nova-3", "nova-2", "enhanced", "base" |
| languageCode | string | No | | Language code (auto-detected if not specified) |
| enableDiarization | boolean | No | false | Enable speaker diarization to identify different speakers |
| diarizationSpeakerCount | number | No | | Expected number of speakers (1–10, improves diarization accuracy) |
| enableParagraphs | boolean | No | false | Format transcription into paragraphs |
| enableSummary | boolean | No | false | Generate a summary of the transcription |
| enableTopics | boolean | No | false | Detect topics discussed in the audio |
| enableSentiment | boolean | No | false | Analyze sentiment of the transcription |
| redact | enum[] | No | | Content types to redact: "pci", "pii", "phi", "numbers", "ssn", and others |
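
The optional analysis features are independent boolean flags, so a request enables only what it needs. The following sketch assembles an arguments object under the assumption that it is sent as a JSON-style object keyed by the parameter names above. `build_transcribe_arguments` is a hypothetical helper, and switching on `enableDiarization` whenever a speaker count is given is a convenience assumed here, not documented server behavior.

```python
from typing import Any, Optional

MODELS = {"nova-3", "nova-2", "enhanced", "base"}


def build_transcribe_arguments(
    file_url: str,
    model: str = "nova-3",
    speaker_count: Optional[int] = None,
    redact: Optional[list[str]] = None,
    **flags: bool,
) -> dict[str, Any]:
    """Hypothetical helper: validate inputs and return the tool arguments."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    arguments: dict[str, Any] = {"fileUrl": file_url, "model": model, **flags}
    if speaker_count is not None:
        if not 1 <= speaker_count <= 10:
            raise ValueError("speaker_count must be between 1 and 10")
        # Assumption: a speaker count is only useful with diarization enabled.
        arguments["enableDiarization"] = True
        arguments["diarizationSpeakerCount"] = speaker_count
    if redact:
        arguments["redact"] = list(redact)
    return arguments


arguments = build_transcribe_arguments(
    "https://example.com/meeting.mp4",  # placeholder URL
    speaker_count=2,
    redact=["pii"],
    enableSummary=True,
)
```
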
Response fields:

| Field | Type | Description |
| --- | --- | --- |
| transcription | string | The transcribed text |
| metadata | object | Transcription metadata (duration, model, language, etc.) |
| summary | string | Summary of the transcription (when enableSummary is true) |
| topics | object | Detected topics (when enableTopics is true) |
| sentiment | object | Sentiment analysis results (when enableSentiment is true) |
| utterances | object[] | Speaker-attributed segments (when enableDiarization is true) |
| utterances[].speaker | string | Speaker identifier |
| utterances[].start | number | Start time in seconds |
| utterances[].end | number | End time in seconds |
| utterances[].text | string | Text spoken by the speaker |
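
When diarization is enabled, the `utterances` array can be turned into a readable, speaker-labeled transcript. This is a minimal sketch that assumes the response has been parsed into a Python dict with the fields listed above. `format_utterances` is a hypothetical helper, and the speaker identifiers and text in the sample are placeholders, since the actual identifier format is not specified here.

```python
from typing import Any


def format_utterances(response: dict[str, Any]) -> str:
    """Render one line per utterance: [start-end] speaker: text."""
    lines = []
    # utterances is only present when enableDiarization is true.
    for utterance in response.get("utterances", []):
        lines.append(
            f"[{utterance['start']:.1f}s-{utterance['end']:.1f}s] "
            f"{utterance['speaker']}: {utterance['text']}"
        )
    return "\n".join(lines)


# Illustrative response shape only; values are placeholders.
sample_response = {
    "transcription": "Hello there. Hi.",
    "utterances": [
        {"speaker": "0", "start": 0.0, "end": 2.5, "text": "Hello there."},
        {"speaker": "1", "start": 2.5, "end": 4.0, "text": "Hi."},
    ],
}
transcript = format_utterances(sample_response)
```

A response without `utterances` yields an empty string, so the same helper is safe to call when diarization was not requested.
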