I've been building a tool that generates short audio summaries of articles — think "your morning briefing" style content. Single-voice narration works fine for that, but when I added an interview format, I ran into the obvious problem: how do you make two characters actually sound like two different people?
The naive approach is to call the TTS API twice, stitch the audio together yourself with some FFmpeg command you copy-pasted from Stack Overflow, add silence between lines, and pray the timing feels natural. It works. It's also a massive pain to maintain.
LeanVox has a /dialogue endpoint that handles all of this in one call. You send a list of lines, each with its own speaker and voice, and it comes back as a single MP3 with the conversation assembled — silence gaps included. Here's how it works.
The dialogue endpoint
The API is POST /v1/tts/dialogue. The body looks like this:
{
  "model": "pro",
  "lines": [
    {
      "text": "Welcome back to the show. Today we're talking about API pricing.",
      "voice": "emma",
      "language": "en"
    },
    {
      "text": "Thanks for having me. I have a lot of opinions about this.",
      "voice": "james",
      "language": "en",
      "exaggeration": 0.6
    },
    {
      "text": "I figured you would. Let's start with the obvious one — ElevenLabs.",
      "voice": "emma",
      "language": "en"
    }
  ],
  "gap_ms": 600
}
Each line gets its own voice. gap_ms is how many milliseconds of silence to insert between speakers — 400–700ms feels natural for conversation, longer for dramatic pauses. The whole thing comes back as one MP3.
The exaggeration field on individual lines controls how expressive that specific line is — useful when one speaker is more animated than the other, or when you want a particular line to land harder.
A minimal working example
Here's a Python script that generates a two-person podcast intro and saves it to disk:
import requests

API_KEY = "lv_live_your_key_here"

dialogue = {
    "model": "pro",
    "gap_ms": 500,
    "lines": [
        {
            "text": "Hey everyone, welcome to Developer Office Hours. I'm your host.",
            "voice": "emma",
            "language": "en"
        },
        {
            "text": "And I'm the guest who actually knows what they're talking about.",
            "voice": "james",
            "language": "en",
            "exaggeration": 0.65
        },
        {
            "text": "Bold claim. We'll see about that. [laugh]",
            "voice": "emma",
            "language": "en",
            "exaggeration": 0.5
        },
        {
            "text": "Today we're covering rate limiting — why it matters and why most devs get it wrong.",
            "voice": "james",
            "language": "en"
        }
    ]
}

resp = requests.post(
    "https://api.leanvox.com/v1/tts/dialogue",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=dialogue,
    timeout=60
)
resp.raise_for_status()

# API returns JSON with an audio_url — fetch the MP3 from there
data = resp.json()
audio = requests.get(data["audio_url"], timeout=60).content

with open("podcast_intro.mp3", "wb") as f:
    f.write(audio)

print(f"Saved podcast_intro.mp3 ({len(audio) // 1024} KB)")
That's it. One HTTP call, one MP3.
Using voice cloning for consistent characters
If you want your podcast hosts to have consistent, recognizable voices across episodes, you can use voice cloning with the Pro model. Upload a short voice sample (10–30 seconds of clean audio is plenty) and use the returned voice ID in your dialogue lines.
import requests

API_KEY = "lv_live_your_key_here"

# Upload a voice sample once, store the returned ID
def upload_voice(file_path: str, voice_id: str) -> str:
    with open(file_path, "rb") as f:
        resp = requests.post(
            "https://api.leanvox.com/v1/voices/clone",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"voice_id": voice_id},
            timeout=60
        )
    resp.raise_for_status()
    return resp.json()["voice_id"]

# Generate dialogue using cloned voice IDs
host_voice = "my_host_alice"    # uploaded once, reused forever
guest_voice = "guest_voice_bob"

resp = requests.post(
    "https://api.leanvox.com/v1/tts/dialogue",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "pro",
        "gap_ms": 550,
        "lines": [
            {"text": "So what actually broke in production?", "voice": host_voice, "language": "en"},
            {"text": "Honestly? A missing semicolon in a config file.", "voice": guest_voice, "language": "en", "exaggeration": 0.7},
            {"text": "No.", "voice": host_voice, "language": "en", "exaggeration": 0.8},
            {"text": "I wish I was joking.", "voice": guest_voice, "language": "en"}
        ]
    },
    timeout=60
)
resp.raise_for_status()

data = resp.json()
audio = requests.get(data["audio_url"], timeout=60).content

with open("episode_clip.mp3", "wb") as f:
    f.write(audio)
Once the voices are uploaded, generating a new episode is just the API call — no re-uploading, no re-training. The same voice ID works indefinitely.
Multilingual dialogue
Each line can have a different language. This is useful if you're building localised content or have bilingual speakers. The dialogue endpoint handles the mixing — you don't need to do anything special:
{
  "model": "standard",
  "gap_ms": 500,
  "lines": [
    {"text": "What do you think about the new feature?", "voice": "af_heart", "language": "en"},
    {"text": "Me parece muy bien. Es mucho más rápido.", "voice": "ef_dora", "language": "es"},
    {"text": "Glad to hear it. We worked hard on the performance improvements.", "voice": "af_heart", "language": "en"}
  ]
}
The Standard model covers 20+ languages, so you're not limited to English-only content.
What this actually costs
The dialogue endpoint bills the same way as the single-voice generate endpoint — by character count across all lines combined. If your podcast intro is 800 characters total:
- Standard model: $0.005/1K = ~$0.004 per episode intro
- Pro model (with cloned voices): $0.01/1K = ~$0.008 per episode intro
If you're generating 1,000 episodes a month at 5,000 characters each (roughly a 3-minute conversation), that's:
- Standard: $25/month
- Pro: $50/month
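If you want to sanity-check your own volumes, the billing model is simple enough to script. A back-of-envelope estimator, using the per-1K rates quoted above (check current pricing before relying on these constants):

```python
# Per-1,000-character rates from the pricing above (USD)
RATES = {"standard": 0.005, "pro": 0.01}

def estimate_cost(lines, model="standard"):
    """Estimated cost of a dialogue payload: total characters
    across all lines, billed at the model's per-1K rate."""
    chars = sum(len(line["text"]) for line in lines)
    return chars / 1000 * RATES[model]
```

An 800-character intro comes out to roughly $0.004 on Standard and $0.008 on Pro, matching the numbers above.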
Compare that to ElevenLabs, where the same volume would run $825–$1,100/month depending on your plan. The math is pretty stark.
Handling longer conversations
For conversations beyond roughly 10,000 characters total, the API automatically processes the job asynchronously and returns a job ID instead of streaming audio. You poll for completion:
import requests
import time

API_KEY = "lv_live_your_key_here"

# Submit a long dialogue (very_long_lines is your full list of line dicts)
resp = requests.post(
    "https://api.leanvox.com/v1/tts/dialogue",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "pro", "gap_ms": 500, "lines": very_long_lines},
    timeout=60
)
resp.raise_for_status()
data = resp.json()

# All responses return an audio_url — fetch the MP3 from it.
# For very long scripts, the job may still be processing; poll until complete.
if data.get("status") in ("pending", "processing"):
    job_id = data["id"]
    while True:
        status = requests.get(
            f"https://api.leanvox.com/v1/jobs/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=60
        ).json()
        if status["status"] == "completed":
            data = status
            break
        elif status["status"] == "failed":
            raise Exception(f"Job failed: {status.get('error')}")
        time.sleep(2)

audio = requests.get(data["audio_url"], timeout=60).content
with open("full_episode.mp3", "wb") as f:
    f.write(audio)
In practice, a 30-minute podcast episode is around 40,000–50,000 characters. With async processing you can generate full episode audio overnight as a batch job, or trigger it from a webhook when your content is ready.
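Before submitting, it's handy to know whether you're about to cross the async threshold. A rough size check based on the figures above — the characters-per-minute rate is my own estimate derived from "5,000 characters ≈ 3 minutes", not an official number:

```python
ASYNC_THRESHOLD_CHARS = 10_000  # above this, expect a job ID rather than immediate audio
CHARS_PER_MINUTE = 1_600        # rough spoken rate implied by 5,000 chars ≈ 3 minutes

def script_stats(lines):
    """Pre-flight check on a dialogue payload: total characters,
    estimated runtime in minutes, and whether to expect async processing."""
    chars = sum(len(line["text"]) for line in lines)
    return {
        "chars": chars,
        "est_minutes": round(chars / CHARS_PER_MINUTE, 1),
        "expect_async": chars > ASYNC_THRESHOLD_CHARS,
    }
```

If `expect_async` is true, wire up the polling loop above rather than waiting on the initial response.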
Where this actually fits in a real workflow
The use cases I've seen this work well for:
- AI-generated podcasts: Use an LLM to write the script, pipe it into the dialogue endpoint, publish the MP3. The whole pipeline can run unattended.
- Audio documentation: Some teams are generating audio versions of their changelogs and release notes in a "host explains the change, engineer explains the reasoning" format. Weird niche but apparently useful.
- Language learning apps: Multi-language dialogue with natural back-and-forth between two speakers is exactly what you need for conversation practice content.
- Interactive audiobooks: Characters with distinct voices, generated from existing text without re-recording anything.
The common thread is: you have a script, you want audio, and you don't want to record it yourself or pay someone to record it. The dialogue endpoint is the fastest path from text to something that sounds like a real conversation.
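For the LLM-scripted pipeline, the glue code is mostly a parser. A minimal sketch, assuming the model emits plain "Speaker: line" text and you keep your own map from speaker names to voice IDs (both the script format and the helper are illustrative, not part of the API):

```python
def script_to_lines(script, voice_map, language="en"):
    """Convert a plain 'Speaker: text' script into dialogue-endpoint lines.
    voice_map maps speaker names as written in the script to voice IDs."""
    lines = []
    for raw in script.strip().splitlines():
        if not raw.strip():
            continue  # skip blank lines between turns
        speaker, sep, text = raw.partition(":")
        if not sep:
            raise ValueError(f"Line has no 'Speaker:' prefix: {raw!r}")
        lines.append({
            "text": text.strip(),
            "voice": voice_map[speaker.strip()],
            "language": language,
        })
    return lines
```

Feed the result straight into the `lines` field of the dialogue request and the rest of the pipeline is unchanged.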
Getting started
Sign up at app.leanvox.com — you get $0.50 in free credits on signup, which is enough to generate a few minutes of dialogue and see if it fits your use case. The API reference is at leanvox.com/docs.
If you're building something with this and hit a wall, drop a note. Always curious to hear what people are actually making.