Audio / Voice Agent API Specification

Purpose: Define the API contract we need from the Voice Agent team so that pipeline.py can replace edge-tts with a real voice clone of Prof. Hahne.

Current placeholder: edge-tts (Microsoft Edge neural voices via Microsoft's online TTS service; no API key required, but it needs network access)

What We Need

A REST API endpoint that accepts text + voice parameters and returns:
- Synthesized audio (MP3/WAV)
- Optional: word-level timestamps for advanced lip-sync

Required Endpoint

`POST /speak` (or `/tts` / `/synthesize`)

Request:

{
  "text": "Neural Radiance Fields enable photorealistic 3D scene reconstruction from sparse 2D images.",
  "voice_id": "prof_hahne_en",           // or "prof_hahne_de"
  "language": "en-US",                    // or "de-DE"
  "speed": 1.0,                          // 0.5 = slow, 2.0 = fast
  "format": "mp3",                       // mp3, wav, ogg
  "word_timestamps": true                // optional — for lip-sync
}

Response (success — 200 OK):

{
  "audio_url": "https://voice-agent.hfu.edu/api/audio/abc123.mp3",
  "duration_seconds": 8.42,
  "format": "mp3",
  "word_timestamps": [
    {"word": "Neural", "start": 0.0, "end": 0.62},
    {"word": "Radiance", "start": 0.62, "end": 1.15},
    {"word": "Fields", "start": 1.15, "end": 1.58},
    ...
  ]
}

Response (error — 4xx/5xx):

{
  "error": "voice_not_found",
  "message": "Voice ID 'prof_hahne_fr' does not exist. Available: ['prof_hahne_en', 'prof_hahne_de']"
}
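
On the client side, the error body can be folded into a single log line. The following is a minimal sketch, assuming the error payload keeps the "error" and "message" fields shown above; the helper name describe_api_error is hypothetical, and the fallback for non-JSON bodies (for example an HTML page from a proxy) is our own assumption, not part of the contract.

```python
import json


def describe_api_error(status_code: int, body_text: str) -> str:
    """Turn a 4xx/5xx response body into a one-line log message."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        payload = None  # body was not JSON (e.g. an HTML error page)

    # Structured error as described in this spec: {"error": ..., "message": ...}
    if isinstance(payload, dict) and "error" in payload:
        detail = payload.get("message") or "(no message)"
        return f"HTTP {status_code} [{payload['error']}] {detail}"

    # Fallback (assumption): log a short snippet of whatever came back
    snippet = body_text.strip()[:200] or "(empty body)"
    return f"HTTP {status_code} [unparsed] {snippet}"
```

With requests, this could be called as describe_api_error(r.status_code, r.text) before raising.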

Alternative: Direct Audio Response

If a full API is not available, a simpler contract also works:

`POST /speak` — Returns audio directly

Request headers:

Content-Type: application/json
Authorization: Bearer <API_KEY>   // if auth required

Response:
- Content-Type: audio/mpeg (or audio/wav)
- Body: raw audio bytes
- Header: X-Duration: 8.42 (optional)

Our pipeline would then:
1. POST text to /speak
2. Save response body to .mp3
3. Use ffprobe to get duration locally
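
Steps 2 and 3 could look roughly like the sketch below. It assumes ffprobe is on the PATH and that the optional X-Duration header, if present, holds seconds as a decimal string. The helper names (ffprobe_command, probe_duration, save_audio) are hypothetical and do not exist in pipeline.py today.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def ffprobe_command(path: str) -> List[str]:
    """Build the ffprobe call that prints only the container duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def probe_duration(path: str) -> float:
    """Run ffprobe and parse its single-line output (requires ffprobe installed)."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return float(result.stdout.strip())


def save_audio(body: bytes, out_path: str, x_duration: Optional[str] = None) -> float:
    """Write raw audio bytes to disk and return the duration in seconds.

    Trusts the X-Duration header when it parses as a number (assumption);
    otherwise falls back to measuring the file with ffprobe.
    """
    Path(out_path).write_bytes(body)
    if x_duration is not None:
        try:
            return float(x_duration)
        except ValueError:
            pass  # malformed header: measure locally instead
    return probe_duration(out_path)
```

After a successful POST with requests, the call would be save_audio(r.content, out_mp3, r.headers.get("X-Duration")).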

Integration Points in Our Code

Current placeholder (to be replaced):

File: scripts/pipeline.py
Function: generate_tts(text, out_mp3, voice="en-US-AriaNeural")

def generate_tts(text, out_mp3, voice="en-US-AriaNeural"):
    """Placeholder TTS via edge-tts."""
    run(["edge-tts", "--voice", voice, "--text", text, "--write-media", out_mp3])

Replacement implementation:

import os

import requests

VOICE_API_URL = "https://voice-agent.hfu.edu/api/speak"
VOICE_API_KEY = os.environ.get("VOICE_API_KEY", "")

def generate_tts(text, out_mp3, voice="prof_hahne_en", language="en-US"):
    """Real voice clone via Voice Agent API."""
    response = requests.post(
        VOICE_API_URL,
        json={
            "text": text,
            "voice_id": voice,
            "language": language,  # pass "de-DE" together with "prof_hahne_de"
            "format": "mp3",
            "word_timestamps": True
        },
        headers={"Authorization": f"Bearer {VOICE_API_KEY}"},
        timeout=60
    )
    response.raise_for_status()

    data = response.json()

    # Download audio
    audio_response = requests.get(data["audio_url"], timeout=30)
    audio_response.raise_for_status()
    with open(out_mp3, "wb") as f:
        f.write(audio_response.content)

    # Return timestamps for optional advanced lip-sync
    return data.get("word_timestamps", [])

What the Voice Agent Team Needs to Provide

1. Base URL

https://voice-agent.hfu.edu/api/

(or whichever domain the service ends up on)

2. Authentication

Option A: API key in Authorization: Bearer <token> header
Option B: IP allowlist (no key needed from our VPS: 141.28.79.251)
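
Our client can support both options with one small helper. This is a sketch only; build_auth_headers is a hypothetical name, and it assumes that under Option B no Authorization header should be sent at all.

```python
import os
from typing import Dict, Optional


def build_auth_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Option A when a key is configured, Option B (no header) otherwise."""
    if api_key:
        return {"Authorization": f"Bearer {api_key}"}
    # IP allowlist case: send no Authorization header rather than an empty token
    return {}


# VOICE_API_KEY is the environment variable used in the replacement implementation
headers = build_auth_headers(os.environ.get("VOICE_API_KEY"))
```

This avoids sending a bare "Bearer " header when the environment variable is unset.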

3. Voice Model

We need a voice clone of Prof. Hahne in:
- English (prof_hahne_en): primary
- German (prof_hahne_de): optional, for German-language lectures

Training data needed: ~10–30 minutes of clean Prof. Hahne speech recordings

4. Response Format Preference

Priority	Format	Why
1	JSON + audio_url	Best — we can cache, retry, inspect
2	Raw audio bytes	Simple — direct file save
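
Whichever format is chosen, the client can branch on the response Content-Type. The sketch below assumes the JSON variant is served as application/json and the raw variant as audio/*. extract_audio and its fetch parameter are hypothetical names; fetch stands in for whatever downloads audio_url (for example a wrapper around requests.get with raise_for_status).

```python
import json
from typing import Callable, List, Tuple


def extract_audio(
    content_type: str,
    body: bytes,
    fetch: Callable[[str], bytes],
) -> Tuple[bytes, List[dict]]:
    """Return (audio_bytes, word_timestamps) for either response format."""
    # Drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";")[0].strip().lower()

    if media_type == "application/json":
        # Priority 1: JSON + audio_url, audio is downloaded in a second step
        data = json.loads(body)
        return fetch(data["audio_url"]), data.get("word_timestamps", [])

    if media_type.startswith("audio/"):
        # Priority 2: raw audio bytes, no timestamps available
        return body, []

    raise ValueError(f"Unexpected Content-Type: {content_type!r}")
```

Keeping the download behind fetch makes it easy to add caching or retries later without touching the branching logic.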

5. Optional: Word-Level Timestamps

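If timestamps are provided, ideally with per-phoneme timings nested inside each word entry, we can upgrade from Wav2Lip (frame-level lip-sync) to phoneme-level lip-sync (one mouth shape per phoneme), which should give more accurate lip movement.

Format (each entry extends the word_timestamps items from the response example with an optional phonemes array):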

[
  {"word": "hello", "start": 0.0, "end": 0.45, "phonemes": [
    {"phoneme": "h", "start": 0.0, "end": 0.08},
    {"phoneme": "eh", "start": 0.08, "end": 0.25},
    {"phoneme": "l", "start": 0.25, "end": 0.35},
    {"phoneme": "ow", "start": 0.35, "end": 0.45}
  ]}
]
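
To show how we would consume this structure, here is a rough sketch that maps each video frame to the phoneme active at that frame's start time. The function name phoneme_per_frame and the 25 fps default are assumptions for illustration; the real lip-sync stage may sample differently.

```python
import math
from typing import List, Optional


def phoneme_per_frame(words: List[dict], fps: float = 25.0) -> List[Optional[str]]:
    """Return one phoneme label (or None for silence) per video frame."""
    # Flatten the nested word -> phonemes structure into one list of segments
    segments = [p for w in words for p in w.get("phonemes", [])]
    if not segments:
        return []

    total = max(p["end"] for p in segments)
    n_frames = math.ceil(total * fps)

    frames: List[Optional[str]] = []
    for i in range(n_frames):
        t = i / fps  # start time of frame i in seconds
        label = None
        for p in segments:
            # Half-open interval: a phoneme owns [start, end)
            if p["start"] <= t < p["end"]:
                label = p["phoneme"]
                break
        frames.append(label)
    return frames
```

Frames that fall into gaps between phonemes come back as None, which the renderer could treat as a closed mouth.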

Fallback Options (If Custom API Not Ready)

If a custom voice API is months away, these are acceptable interim solutions:

Option	Pros	Cons
ElevenLabs API	High quality, fast setup, good German support	External dependency, paid
Microsoft Azure TTS	Neural voices, SSML support, custom voice training	Azure account needed
Google Cloud TTS	WaveNet voices, 40+ languages	GCP account needed
Coqui TTS	Open source, self-hosted	Requires training data + GPU for cloning

Recommended interim: ElevenLabs with voice cloning. Upload about 10 minutes of Prof. Hahne audio and use the cloned voice through their API with little setup.

Questions for Voice Agent Team

What is the base URL / endpoint path?
What authentication method? (Bearer token, API key, IP allowlist?)
What voice IDs will be available?
What audio formats are supported? (MP3, WAV, OGG?)
Can we get word-level timestamps?
What is the rate limit? (requests/minute)
What is the max text length per request?
Do you need us to chunk long texts, or does the API handle it?
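
If chunking ends up on our side, a sentence-based splitter along these lines would probably be enough. This is a sketch under assumptions: chunk_text is a hypothetical helper, the 500-character default is a placeholder until the real limit is known, and a single sentence longer than the limit is passed through unsplit.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    # Split after ., ! or ? followed by whitespace; punctuation stays attached
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # Fits, or it is the first sentence of a chunk (kept even if too long)
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent as its own request and the resulting audio files concatenated in order.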

Test Script (for Voice Agent team)

import requests

url = "YOUR_ENDPOINT_HERE"
payload = {
    "text": "This is a test of the voice agent API.",
    "voice_id": "prof_hahne_en",
    "format": "mp3"
}
headers = {"Authorization": "Bearer YOUR_TOKEN"}

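r = requests.post(url, json=payload, headers=headers, timeout=60)
print(f"Status: {r.status_code}")
content_type = r.headers.get("content-type", "")
if "json" in content_type:
    print(f"Response: {r.json()}")
else:
    # Raw audio variant: report size and type instead of dumping bytes
    print(f"Response: binary audio ({len(r.content)} bytes, {content_type})")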

Spec maintained by Project 3 team | HFU — Prof. Dr. Uwe Hahne