What can you do with it?
The /audio-generate command converts text to speech using the Minimax Speech-02-Turbo model. You can use it to create professional voiceovers, multilingual audio content, podcasts, audiobooks, and voice-assistant speech, with control over voice characteristics, emotion, and audio quality.
How to use it?
Basic Command Structure
Parameters
Required:
- text - Text to convert to speech (max 5000 characters). Use <#x#> for pause control (0.01-99.99s)
Optional:
- pitch - Speech pitch: -12 to 12 (defaults to 0)
- speed - Speech speed: 0.5 to 2 (defaults to 1)
- volume - Speech volume: 0 to 10 (defaults to 1)
- bitrate - Audio bitrate: 32000, 64000, 128000, 256000 (defaults to 128000)
- channel - Audio channels: “mono”, “stereo” (defaults to “mono”)
- emotion - Speech emotion: “auto”, “neutral”, “happy”, “sad”, “angry”, “fearful”, “disgusted”, “surprised” (defaults to “auto”)
- voice_id - Voice selection (defaults to “Wise_Woman”). See available voices below
- sample_rate - Sample rate: 8000, 16000, 22050, 24000, 32000, 44100 (defaults to 32000)
- language_boost - Language enhancement (defaults to “None”). See language options below
- english_normalization - Enable English text normalization for better number reading (boolean, defaults to false)
- fileLinksExpireInDays - How long generated files remain accessible: 1-7 days (defaults to 7)
- fileLinksExpireInMinutes - How long generated files remain accessible in minutes (takes precedence over days)
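The ranges and allowed values listed above can be checked before a request is sent. The following is a minimal sketch in Python, not part of the command itself: `validate_params` and the constant names are hypothetical, and the numeric bounds are assumed to be inclusive, which the parameter list does not state explicitly.

```python
# Ranges and allowed values copied from the parameter list above.
# Assumption: the numeric bounds are inclusive.
NUMERIC_RANGES = {
    "pitch": (-12, 12),
    "speed": (0.5, 2),
    "volume": (0, 10),
    "fileLinksExpireInDays": (1, 7),
}
ALLOWED_VALUES = {
    "bitrate": {32000, 64000, 128000, 256000},
    "channel": {"mono", "stereo"},
    "emotion": {"auto", "neutral", "happy", "sad", "angry",
                "fearful", "disgusted", "surprised"},
    "sample_rate": {8000, 16000, 22050, 24000, 32000, 44100},
}
MAX_TEXT_CHARS = 5000


def validate_params(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    text = params.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text has {len(text)} characters (max {MAX_TEXT_CHARS})")
    for name, (low, high) in NUMERIC_RANGES.items():
        if name in params:
            value = params[name]
            if not isinstance(value, (int, float)) or not low <= value <= high:
                problems.append(f"{name} must be between {low} and {high}")
    for name, allowed in ALLOWED_VALUES.items():
        if name in params and params[name] not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    return problems
```

Parameters that are left out are simply skipped, on the assumption that the command applies its documented defaults.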
Response Format
The command returns links to the generated audio file. How long the links remain accessible is controlled by the fileLinksExpireInDays parameter.
Examples
Basic Text-to-Speech
Professional Voiceover
Interactive Voice Response (IVR)
Multilingual Content
Podcast Introduction
Audiobook Narration
Children’s Content
Corporate Training
Emergency Announcement
Marketing Advertisement
Meditation and Wellness
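As a rough illustration, a parameter set for a short guided meditation might look like the sketch below. The specific values are assumptions chosen for this example rather than recommended settings; only the parameter names, the allowed values, the Calm_Woman voice ID, and the pause syntax come from this page. How the parameters are passed to /audio-generate is not shown here, so the sketch stops at building the values.

```python
# Illustrative values only; parameter names and allowed values follow this page.
meditation_params = {
    "text": (
        "Close your eyes and take a slow, deep breath in. <#2.0#> "
        "Now let it go. <#3.0#> "
        "Feel your shoulders relax."
    ),
    "voice_id": "Calm_Woman",  # listed on this page as a soothing, peaceful voice
    "speed": 0.8,              # slower than the default of 1
    "emotion": "neutral",
    "sample_rate": 44100,      # highest documented sample rate
    "bitrate": 256000,         # highest documented bitrate
}

for name, value in meditation_params.items():
    print(f"{name}: {value}")
```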
Notes
Model Capabilities:
- High-quality neural text-to-speech synthesis
- 17 different voice personalities
- Multilingual support with language-specific enhancements
- Precise emotion control for natural-sounding speech
- Advanced pause control with <#x#> notation
- Professional audio quality up to 44.1kHz/256kbps
Available Voices:
- Wise_Woman - Mature, knowledgeable female voice
- Friendly_Person - Warm, approachable neutral voice
- Inspirational_girl - Uplifting, motivational young female
- Deep_Voice_Man - Rich, authoritative male voice
- Calm_Woman - Soothing, peaceful female voice
- Casual_Guy - Relaxed, conversational male voice
- Lively_Girl - Energetic, animated young female
- Patient_Man - Steady, educational male voice
- Young_Knight - Noble, heroic male voice
- Determined_Man - Confident, resolute male voice
- Lovely_Girl - Sweet, gentle female voice
- Decent_Boy - Polite, well-mannered male voice
- Imposing_Manner - Authoritative, commanding voice
- Elegant_Man - Refined, sophisticated male voice
- Abbess - Dignified, spiritual female voice
- Sweet_Girl_2 - Charming, endearing female voice
- Exuberant_Girl - Enthusiastic, spirited female voice
Language Options:
- None - No language-specific processing
- Automatic - Auto-detect and enhance
- Chinese - Mandarin Chinese enhancement
- Chinese,Yue - Cantonese Chinese enhancement
- English - English language enhancement
- Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi
- Use <#x#> for precise pause control (e.g., <#1.5#> for a 1.5-second pause)
- Enable english_normalization for better number and abbreviation reading
- Use higher bitrates (256000) for professional applications
- Choose appropriate voice_id based on content type and target audience
- Adjust speed based on content complexity (slower for educational, faster for energetic content)
- Maximum 5000 characters per request
- Processing time increases with text length and quality settings
- Some voices may be more suitable for specific languages
- Pause control syntax must be exact: <#number#>
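Because the pause syntax must be exact, it can help to generate the markers rather than type them by hand. The sketch below is one possible approach in Python; `pause` and `join_with_pauses` are hypothetical helper names, and it assumes that whole-number durations such as <#2#> are accepted and that no spaces are needed around a marker, neither of which is confirmed above.

```python
def pause(seconds: float) -> str:
    """Return a pause marker such as <#1.5#> for the given duration in seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    # Round to two decimals, then drop trailing zeros: 1.50 -> "1.5", 2.00 -> "2".
    value = f"{seconds:.2f}".rstrip("0").rstrip(".")
    return f"<#{value}#>"


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join text segments, placing the same pause marker between each pair."""
    return pause(seconds).join(segments)


print(join_with_pauses(["Welcome to the show.", "Let's begin."], 1.5))
```

The pause markers count toward the length of the text, so they are worth including when checking a script against the 5000-character limit.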
Model Parameters (minimax/speech-02-turbo)
Text-to-Speech Parameters
- text (required): Text to convert to speech (max 5000 chars). Use <#x#> for pause control (0.01-99.99s)
- pitch: Speech pitch (-12 to 12, default: 0)
- speed: Speech speed (0.5 to 2, default: 1)
- volume: Speech volume (0 to 10, default: 1)
- bitrate: Bitrate (32000, 64000, 128000, 256000, default: 128000)
- channel: Audio channels (“mono”, “stereo”, default: “mono”)
- emotion: Speech emotion (“auto”, “neutral”, “happy”, “sad”, “angry”, “fearful”, “disgusted”, “surprised”, default: “auto”)
- voice_id: Voice ID (default: “Wise_Woman”). See available voices above
- sample_rate: Sample rate (8000, 16000, 22050, 24000, 32000, 44100, default: 32000)
- language_boost: Language enhancement (default: “None”). See language options above
- english_normalization: Enable English text normalization for better number reading (boolean, default: false, slightly increases latency)
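Since text is capped at 5000 characters, longer scripts such as audiobook chapters need to be split into several requests. The following Python sketch shows one way to do that, preferring sentence boundaries; `split_text` is a hypothetical helper, and treating ".", "!" or "?" followed by whitespace as a sentence end is a simplifying assumption.

```python
import re

MAX_CHARS = 5000  # per-request limit stated in this document


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    # Split after ., ! or ? when followed by whitespace; that whitespace is dropped.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at hard character offsets.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", limit=10))
```

A marker such as <#1.5#> is not treated as a sentence end because its period is not followed by whitespace. The hard cut for over-long sentences could still land inside a marker, so very long unpunctuated passages are worth checking by hand.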