Audio / Voice Agent API Specification
Purpose: Define the API contract needed from the Voice Agent team so that
pipeline.pycan swap outedge-ttsfor a real voice clone of Prof. Hahne. Current placeholder:edge-tts(Microsoft Edge neural voice, offline, no API key)
What We Need
A REST API endpoint that accepts text + voice parameters and returns: - Synthesized audio (MP3/WAV) - Optional: word-level timestamps for advanced lip-sync
Required Endpoint
POST /speak (or /tts / /synthesize)
Request:
{
"text": "Neural Radiance Fields enable photorealistic 3D scene reconstruction from sparse 2D images.",
"voice_id": "prof_hahne_en", // or "prof_hahne_de"
"language": "en-US", // or "de-DE"
"speed": 1.0, // 0.5 = slow, 2.0 = fast
"format": "mp3", // mp3, wav, ogg
"word_timestamps": true // optional — for lip-sync
}
Response (success — 200 OK):
{
"audio_url": "https://voice-agent.hfu.edu/api/audio/abc123.mp3",
"duration_seconds": 8.42,
"format": "mp3",
"word_timestamps": [
{"word": "Neural", "start": 0.0, "end": 0.62},
{"word": "Radiance", "start": 0.62, "end": 1.15},
{"word": "Fields", "start": 1.15, "end": 1.58},
...
]
}
Response (error — 4xx/5xx):
{
"error": "voice_not_found",
"message": "Voice ID 'prof_hahne_en' does not exist. Available: ['prof_hahne_en', 'prof_hahne_de']"
}
Alternative: Simple File Upload
If a full API is not available, a simpler contract also works:
POST /speak — Returns audio directly
Request headers:
Response:
- Content-Type: audio/mpeg (or audio/wav)
- Body: raw audio bytes
- Header: X-Duration: 8.42 (optional)
Our pipeline would then:
1. POST text to /speak
2. Save response body to .mp3
3. Use ffprobe to get duration locally
Integration Points in Our Code
Current placeholder (to be replaced):
File: scripts/pipeline.py
Function: generate_tts(text, out_mp3, voice="en-US-AriaNeural")
def generate_tts(text, out_mp3, voice="en-US-AriaNeural"):
"""Placeholder TTS via edge-tts."""
run(["edge-tts", "--voice", voice, "--text", text, "--write-media", out_mp3])
Replacement implementation:
import requests
VOICE_API_URL = "https://voice-agent.hfu.edu/api/speak"
VOICE_API_KEY = os.environ.get("VOICE_API_KEY", "")
def generate_tts(text, out_mp3, voice="prof_hahne_en"):
"""Real voice clone via Voice Agent API."""
response = requests.post(
VOICE_API_URL,
json={
"text": text,
"voice_id": voice,
"language": "en-US",
"format": "mp3",
"word_timestamps": True
},
headers={"Authorization": f"Bearer {VOICE_API_KEY}"},
timeout=60
)
response.raise_for_status()
data = response.json()
# Download audio
audio_response = requests.get(data["audio_url"], timeout=30)
audio_response.raise_for_status()
with open(out_mp3, "wb") as f:
f.write(audio_response.content)
# Return timestamps for optional advanced lip-sync
return data.get("word_timestamps", [])
What the Voice Agent Team Needs to Provide
1. Base URL
(or whatever domain)2. Authentication
- Option A: API key in
Authorization: Bearer <token>header - Option B: IP allowlist (no key needed from our VPS:
141.28.79.251)
3. Voice Model
We need a voice clone of Prof. Hahne in:
- English (prof_hahne_en) — primary
- German (prof_hahne_de) — optional, for German-language lectures
Training data needed: ~10–30 minutes of clean Prof. Hahne speech recordings
4. Response Format Preference
| Priority | Format | Why |
|---|---|---|
| 1 | JSON + audio_url | Best — we can cache, retry, inspect |
| 2 | Raw audio bytes | Simple — direct file save |
5. Optional: Word-Level Timestamps
If provided, we can upgrade from Wav2Lip (frame-level lip-sync) to phoneme-level lip-sync (mouth shape per phoneme). This gives more accurate lip movement.
Format:
[
{"word": "hello", "start": 0.0, "end": 0.45, "phonemes": [
{"phoneme": "h", "start": 0.0, "end": 0.08},
{"phoneme": "eh", "start": 0.08, "end": 0.25},
{"phoneme": "l", "start": 0.25, "end": 0.35},
{"phoneme": "ow", "start": 0.35, "end": 0.45}
]}
]
Fallback Options (If Custom API Not Ready)
If a custom voice API is months away, these are acceptable interim solutions:
| Option | Pros | Cons |
|---|---|---|
| ElevenLabs API | High quality, fast setup, good German support | External dependency, paid |
| Microsoft Azure TTS | Neural voices, SSML support, custom voice training | Azure account needed |
| Google Cloud TTS | WaveNet voices, 40+ languages | GCP account needed |
| Coqui TTS | Open source, self-hosted | Requires training data + GPU for cloning |
Recommended interim: ElevenLabs with voice cloning — upload 10 min of Prof. Hahne audio, get instant API endpoint.
Questions for Voice Agent Team
- What is the base URL / endpoint path?
- What authentication method? (Bearer token, API key, IP allowlist?)
- What voice IDs will be available?
- What audio formats are supported? (MP3, WAV, OGG?)
- Can we get word-level timestamps?
- What is the rate limit? (requests/minute)
- What is the max text length per request?
- Do you need us to chunk long texts, or does the API handle it?
Test Script (for Voice Agent team)
import requests
url = "YOUR_ENDPOINT_HERE"
payload = {
"text": "This is a test of the voice agent API.",
"voice_id": "prof_hahne_en",
"format": "mp3"
}
headers = {"Authorization": "Bearer YOUR_TOKEN"}
r = requests.post(url, json=payload, headers=headers)
print(f"Status: {r.status_code}")
print(f"Response: {r.json() if 'json' in r.headers.get('content-type', '') else 'binary audio'}")
Spec maintained by Project 3 team | HFU — Prof. Dr. Uwe Hahne