Server path: /audio-processing | Type: Embedded | PCID required: No

Tools

| Tool | Description |
| --- | --- |
| audio-processing_text_to_speech | Convert text to speech using the minimax/speech-02-turbo model. Generates high-quality audio from text input with configurable voice settings. Returns a URL to the generated audio file. |
| audio-processing_transcribe_audio_or_video | Transcribe audio or video files to text using Deepgram AI. Returns transcription text plus optional word-level / utterance / paragraph / speaker timestamps via the includeDetailedResults flag. Supports diarization, keyword boosting (keyterm for Nova-3), search, replace, multichannel, filler words, and more. |

audio-processing_text_to_speech

Convert text to speech using the minimax/speech-02-turbo model. Generates high-quality audio from text input with configurable voice settings. Returns a URL to the generated audio file.

Parameters:
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | Yes | | Text to convert to speech. Maximum 5000 characters. Use <#x#> between words to control pause duration (0.01-99.99s) |
| voice_id | string | No | "Wise_Woman" | Voice ID for text-to-speech generation |
| pitch | number | No | 0 | Speech pitch (-12 to 12) |
| speed | number | No | 1 | Speech speed multiplier (0.5 to 2) |
| volume | number | No | 1 | Speech volume level (0 to 10) |
| emotion | string | No | auto | Emotion to apply to speech |
| sample_rate | number | No | 32000 | Audio sample rate |
| language_boost | string | No | | Language enhancement for better pronunciation |
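
The limits in the table above can be checked before the tool is called. The following is a minimal sketch in Python, not part of the tool itself: `pause` and `build_tts_args` are hypothetical helpers, and it assumes that `x` in `<#x#>` stands for the pause duration in seconds and that the 5000-character limit counts the pause markers as well. How the arguments are passed to the tool depends on your client and is not shown.

```python
import json

# Limits copied from the parameter table above.
MAX_TEXT_CHARS = 5000
RANGES = {"pitch": (-12, 12), "speed": (0.5, 2), "volume": (0, 10)}


def pause(seconds: float) -> str:
    """Return a pause marker, assuming x in <#x#> is the duration in seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    return f"<#{seconds:.2f}#>"


def build_tts_args(text: str, **options) -> dict:
    """Hypothetical helper: validate documented limits and return tool arguments."""
    # Assumption: pause markers count toward the character limit.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    return {"text": text, **options}


args = build_tts_args(
    f"Welcome back. {pause(0.5)} Here is your daily summary.",
    voice_id="Wise_Woman",
    speed=1.1,
)
print(json.dumps(args, indent=2))
```

Omitted options are left out of the arguments, so the documented defaults apply. An out-of-range value such as `pitch=13` raises a `ValueError` before any request is made.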

audio-processing_transcribe_audio_or_video

Transcribe audio or video files to text using Deepgram AI. Returns transcription text plus optional word-level / utterance / paragraph / speaker timestamps via the includeDetailedResults flag. Supports diarization, keyword boosting (keyterm for Nova-3), search, replace, multichannel, filler words, and more.

Parameters:
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| fileUrl | string | Yes | | URL of the audio or video file to transcribe |
| model | string | No | | Deepgram model. nova-3 is the latest (required for keyterm). Whisper tiers available. |
| languageCode | string | No | | Language code (e.g., “en”, “es”, “fr”). Omit for auto-detect. |
| detectLanguage | boolean | No | | Explicitly request language auto-detection (most models detect by default when languageCode is omitted) |
| smartFormat | boolean | No | true | Smart formatting of numbers, dates, times, etc. |
| numerals | boolean | No | | Convert number words to numerals |
| profanityFilter | boolean | No | | Filter profanity |
| dictation | boolean | No | | Interpret spoken punctuation commands |
| measurements | boolean | No | | Normalize measurement abbreviations |
| fillerWords | boolean | No | | Include “um” / “uh” / etc. in transcript |
| enableDiarization | boolean | No | | Identify different speakers. Auto-enables utterances with speaker-attributed timestamps. |
| diarizationSpeakerCount | number | No | | Hint for expected number of speakers |
| diarizeVersion | string | No | | Pin diarization algorithm version |
| enableUtterances | boolean | No | | Group output into utterances with start/end timestamps per phrase/speaker. Auto-enabled when diarization is on. Compact: does not include per-word timestamps. |
| uttSplit | number | No | 0.8 | Silence threshold in seconds for splitting utterances |
| enableParagraphs | boolean | No | | Format output into paragraphs with timestamps |
| multichannel | boolean | No | | Transcribe each audio channel separately. Critical for stereo call recordings. |
| alternatives | number | No | | Return N transcription candidates per channel |
| keywords | string[] | No | | Keyword boosting. Each entry may be “word” or “word:intensifier”. Pre-Nova-3 models. |
| keyterm | string[] | No | | Nova-3 keyterm boosting (superior to keywords; supports multi-word phrases). English-only. |
| search | string[] | No | | Search for phrases; hits returned with timestamps |
| replace | string[] | No | | Find-and-replace in transcript. Format: “find:replace”. |
| enableSummary | boolean | No | | Generate a short text summary of the audio. Requires 50+ words of audio; shorter inputs return the original text. Returns a summary.short field. |
| enableTopics | boolean | No | | Detect topics discussed in the audio. Returns topic segments with labels and confidence scores (0-1). Example topics: “healthcare”, “data collection”, “budget”. English only. |
| customTopics | string[] | No | | Provide up to 100 custom topics to detect. By default uses extended mode, which returns your topics plus auto-detected ones. Requires enableTopics. Example: [“sales”, “support”, “billing”, “technical issue”] |
| enableIntents | boolean | No | | Detect speaker intents, that is, what speakers are trying to do. Returns intent segments with verb-form labels and confidence scores (0-1). Example intents: “schedule a meeting”, “request a refund”. English only. |
| customIntents | string[] | No | | Provide up to 100 custom intents to detect. By default uses extended mode, which returns your intents plus auto-detected ones. Requires enableIntents. Example: [“purchase”, “cancel subscription”, “get status update”, “file complaint”] |
| enableSentiment | boolean | No | | Analyze sentiment per segment: tags each part as positive, negative, or neutral with a confidence score (0-1). Also provides an average sentiment for the entire transcript. Useful for call center analysis. |
| redact | string[] | No | | Redaction options. Common: pci, pii, phi, numbers, ssn. Also supports specific entity types. |
| mipOptOut | boolean | No | | Opt out of Deepgram model improvement program |
| tag | string[] | No | | Labels for Deepgram console analytics filtering |
| extra | string[] | No | | Arbitrary “key:value” metadata passthrough |
| outputMode | string | No | “transcription” | Output mode. “transcription”: text only (~5KB). “transcription_with_timestamps”: includes utterance-level timestamps with start/end per phrase/speaker (~15KB). Use timestamps mode when you need to know when things were said. |
| includeDetailedResults | boolean | No | | Include the full raw Deepgram response with word-level timestamps. Warning: very large output (200KB+ per minute of audio). Only use when you need per-word timing data. |
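
Several parameters above depend on one another: keyterm needs nova-3, customTopics needs enableTopics, and customIntents needs enableIntents. The sketch below shows one way to check those dependencies in Python before calling the tool. `build_transcribe_args` is a hypothetical helper, the file URL is a placeholder, and requiring an explicit `model="nova-3"` for keyterm is a conservative assumption, since the default model is not stated here. The call mechanism itself depends on your client and is not shown.

```python
import json


def build_transcribe_args(file_url: str, **options) -> dict:
    """Hypothetical helper: check documented parameter dependencies."""
    args = {"fileUrl": file_url, **options}

    # keyterm is documented as Nova-3 only. Assumption: the model must be
    # set explicitly, because the default model is not documented here.
    if "keyterm" in args and args.get("model") != "nova-3":
        raise ValueError("keyterm requires model='nova-3'")

    # Custom topics and intents need their parent flag and allow up to 100 entries.
    dependencies = (("customTopics", "enableTopics"), ("customIntents", "enableIntents"))
    for custom, flag in dependencies:
        if custom in args:
            if not args.get(flag):
                raise ValueError(f"{custom} requires {flag}=True")
            if len(args[custom]) > 100:
                raise ValueError(f"{custom} accepts at most 100 entries")

    # replace entries use the documented "find:replace" format.
    for entry in args.get("replace", []):
        if ":" not in entry:
            raise ValueError(f"replace entry {entry!r} is not in 'find:replace' format")

    return args


call_args = build_transcribe_args(
    "https://example.com/support-call.mp3",  # placeholder URL
    model="nova-3",
    enableDiarization=True,
    keyterm=["Pinkfish", "order number"],
    replace=["pink fish:Pinkfish"],
    outputMode="transcription_with_timestamps",
)
print(json.dumps(call_args, indent=2))
```

This example requests "transcription_with_timestamps" rather than includeDetailedResults because utterance-level timing is usually enough to attribute phrases to speakers, and the documented output size is far smaller (~15KB versus 200KB+ per minute of audio).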