Project 3 — Build Log
Chronological record of everything implemented to reach the current pipeline state.
Phase 1: Foundation (Initial Setup)
Environment
- Python 3.10.14 installed side-by-side via
uv(system is 3.13.5, LivePortrait requires ~3.10) - cuDNN compatibility fix: Extracted cuDNN 8.9 side-by-side with cuDNN 9.1
- PyTorch uses cuDNN 9.x
- ONNX Runtime 1.18 requires cuDNN 8.x
pipeline.pyauto-setsLD_LIBRARY_PATHto both paths before inference- GPU verified: 1184 MiB usage during inference, zero CPU-fallback warnings
Tools Installed
| Tool | Version | Purpose |
|---|---|---|
| PyTorch | 2.7.1+cu118 | LivePortrait inference |
| ONNX Runtime | 1.18.0 | Face detection backend |
| edge-tts | latest | Placeholder TTS (no API key) |
| FFmpeg | 7.1.3 | Video composition |
| Chromium | 1217 (headless shell) | Slide rendering |
Repositories Cloned
LivePortrait/— Deepfake engine + pretrained weights from HuggingFacewav2lip/— Lip-sync engine + 2 checkpoints (GAN + no-GAN)
Phase 2: Basic Pipeline (Single Slide)
Goal: Generate a single educational video: slide + TTS + avatar overlay
Components Built
pipeline.py— 4-step CLI pipeline:--topic,--text,--outputargs- Slide rendering via Chromium headless (1920×1080 PNG)
- TTS via
edge-tts(Microsoft Edge neural voice) - Avatar animation via LivePortrait
-
FFmpeg composition: slide background + avatar overlay + audio
-
Slide HTML template — Dark-themed, HFU branded, responsive typography
-
generate_index.py— Auto-generatesindex.htmlgallery
First Output
assets/output/final_demo.mp4assets/output/pipeline_test.mp4
Phase 3: Multi-Slide Presentations
Goal: Support JSON slide decks for longer lectures
Changes
--presentationflag added topipeline.py- Accepts JSON file with
slides[]array - Each slide gets independent TTS + rendered PNG
-
FFmpeg concatenates segments with per-slide timing
-
Example presentation created:
presentations/nerf_presentation.json- 5 slides about Neural Radiance Fields
-
Topics: What is NeRF, How it works, Volume rendering, Positional encoding, AR/MR applications
-
Test result: 74.8s final video, all 5 slides composited correctly
Output
assets/output/nerf_presentation.mp4assets/output/nerf_presentation.json(manifest)
Phase 4: Gallery Enhancement (Detail Pages)
Goal: Per-video detail view with metadata, slide list, raw JSON
Changes
generate_index.pyextended:- Creates
detail/{video}.htmlfor every video with a manifest - Shows: full video player, slide-by-slide breakdown, voice used, raw metadata JSON
-
Download links + back navigation
-
Manifest system added:
- Every generated video gets a
.jsonmanifest -
Contains: topic, subtitle, voice, slide list, timestamps, duration
-
Gallery filtering:
- Skip
*_looped.mp4intermediates - Skip videos without
.jsonmanifests (prevents broken placeholders)
Output
assets/output/index.html(gallery)assets/output/detail/nerf_presentation.html
Phase 5: Wav2Lip Lip-Sync Integration
Goal: Sync avatar mouth movements to actual TTS audio (not just pre-canned driving motion)
Wav2Lip Setup
- Repository cloned:
github.com/Rudrabha/Wav2Lip - Checkpoints downloaded:
Wav2Lip-SD-GAN.pt(146 MB)Wav2Lip-SD-NOGAN.pt(146 MB)- Compatibility patches applied:
audio.py:librosa.filters.mel()kwargs fix for librosa 0.11audio.py:librosa.load()instead of deprecatedlibrosa.core.load()audio.py:soundfile.write()instead of removedlibrosa.output.write_wav()inference.py:torch.load(weights_only=False)for PyTorch 2.7inference.py: TorchScript archive fallback handling
Pipeline Changes
- New stage added:
lip_sync_avatar() - Takes LivePortrait-animated video + audio file
- Runs Wav2Lip inference
-
Outputs lip-synced avatar video
-
Audio concatenation helper:
concat_audios() -
Merges multiple slide MP3s into one WAV for Wav2Lip (multi-slide presentations)
-
Pipeline flow updated:
-
Manifest updated:
- Added
"lip_sync": true - Added
"lip_sync_engine": "Wav2Lip"
Test Results
| Test | Duration | Result |
|---|---|---|
| Wav2Lip standalone | 8.6s | ✅ Mouth synced to demo narration |
| Full pipeline (single slide) | 9.4s | ✅ End-to-end with lip-sync |
Output
assets/output/lipsync_test.mp4assets/output/lipsync_test.json
Phase 6: Documentation Reorganization (Current)
Goal: Clean GitHub page with subpages for different audiences
Structure Created
project03/
├── README.md ← Landing page (status + quick start + links)
├── SETUP.md ← Full reproduction guide
└── docs/
├── LOG.md ← This file — build history
├── AUDIO_API.md ← Voice Agent API contract (what we need)
└── IMAGES.md ← Image/avatar API contract (what we need)
README.md rewritten as:
- Small status overview
- Architecture diagram
- Quick start commands
- Links to subpages
- Current limitations table
- Tech stack summary
Current State Summary
| Feature | Status |
|---|---|
| Single-slide videos | ✅ Working |
| Multi-slide presentations | ✅ Working |
| TTS (placeholder) | ✅ Working (edge-tts) |
| Avatar animation | ✅ Working (LivePortrait) |
| Lip-sync | ✅ Working (Wav2Lip) |
| Gallery generation | ✅ Working |
| Detail pages | ✅ Working |
| GPU acceleration | ✅ Working (cuDNN 8/9 fix) |
| Real avatar (Prof. Hahne) | ⏳ Awaiting Image Agent |
| Real voice clone | ⏳ Awaiting Voice Agent API |
Generated Videos
| Video | Slides | Duration | Lip-sync |
|---|---|---|---|
nerf_presentation.mp4 |
5 | 74.8s | ✅ |
lipsync_test.mp4 |
1 | 9.4s | ✅ |
Known Issues / TODO
- Cloudflare tunnel URL is ephemeral — rotates on restart. For stable preview, need custom domain or VPS port exposure.
- Gallery cache — Browser may show stale entries;
Ctrl+Shift+Rhard refresh needed. - Wav2Lip resize_factor — Currently set to 1 (full resolution). May need tuning for 4K output.
- Audio-driven LivePortrait — Alternative to Wav2Lip (better quality but slower). Deferred until needed.
Log maintained by Hermes Agent | Last updated: June 4, 2026
Phase 7: Avatar Quality Iteration — Root-Cause Diagnosis (v4–v8)
Problem identified: LivePortrait copies mouth motion directly from the driving video, not from the TTS audio. This causes:
- v4–v6: Exaggerated mouth flapping, completely unsynced to spoken words
- v7: --animation-region pose fixed the mouth but eliminated all facial expression (too static)
- Root cause: The driving video (d6.mp4) contains its own mouth movements. LivePortrait transfers these 1:1.
Solution for v8:
1. Use d0.mp4 (natural idle with subtle sway + blinks) as driving video
2. Apply Wav2Lip-SD-NOGAN.pt for sharper mouth quality (NOGAN variant avoids GAN artifacts)
3. Resize avatar to 350px (previously 480px was too dominant)
4. Use CRF 20 for sharp encoding
Result: videoretalking_presentation_v8.mp4 — significantly improved with natural idle motion + sharp lipsync. Still not perfect but usable.
Phase 8: Hallo2 Diffusion Integration (v9)
Goal: Replace LivePortrait+Wav2Lip with a single diffusion model that natively handles audio-driven lip-sync + natural head motion.
Why Hallo2: - Audio-conditioned diffusion — lips move based on TTS audio waveform, not a separate driving video - Natural head sway + blinks generated natively by the model - Higher visual quality than LivePortrait + Wav2Lip stacked - One inference pass instead of two separate models
Setup:
1. Cloned github.com/fudan-generative-vision/Hallo2.git
2. Installed dedicated venv (/opt/data/Hallo2/venv) with torch 2.2.2+cu121, xformers, diffusers
3. Downloaded pretrained models (~40 files, ~2GB) from HuggingFace fudan-generative-ai/hallo2
4. Configured hallo_full.yaml with:
- source_image: prof_hahne_v4_512.jpg (real photo)
- driving_audio: concatenated 5-slide TTS (97.2s, 16kHz mono WAV)
- inference_steps: 20 (reduced from 40 for speed — ~2× faster)
- 512×512 output, 25fps
Speed optimization: - Original 40 steps: ~2.5–3 hours estimated - Reduced to 20 steps: ~50 minutes actual (per batch: 21s vs 42s) - Quality tradeoff: slightly softer but acceptable — v-prediction scheduler handles low steps well
ONNXRuntime CUDA mismatch (non-blocking):
- Face analysis falls back to CPU: libcublasLt.so.11 missing
- System has CUDA 12.1; ONNXRuntime built against 11.8
- Diffusion runs on GPU fine at 100% utilization
Composition (build_hallo2_presentation.py):
- Hallo2 output: /tmp/hallo2_full_output/.../merge_video.mp4 (512×512, 97.2s)
- Loops video to total slide duration
- Scales to 350px via lanczos
- Overlays on each slide at bottom-right (W-w-30, H-h-30)
- Burns HFU logo at top-right (W-160:30)
- CRF 20, slow preset
Output:
- assets/output/hallo2_presentation.mp4 — 6.1MB, 1920×1080, 97.2s
- assets/output/hallo2_presentation.json — manifest with 5 slides
- Detail page: detail/hallo2_presentation.html
Gallery update:
- Added cache-busting query params (?t=...) to all video URLs
- Added Cache-Control: no-cache meta tags to index.html
- Filters: skips _looped.mp4, skips videos without JSON manifests
Version Comparison
| Version | Engine | Lip-sync Source | Motion | Quality | Verdict |
|---|---|---|---|---|---|
| v4–v6 | LivePortrait + Wav2Lip | Driving video (d6) | Extreme, unsynced | ❌ Poor | Mouth flapping |
| v7 | LivePortrait (pose) + Wav2Lip |
Audio (via Wav2Lip) | Static, no eyes | ⚠️ Acceptable | Too lifeless |
| v8 | LivePortrait (d0) + Wav2Lip-NOGAN |
Audio (via Wav2Lip) | Natural idle | ✅ Good | Best legacy |
| v9 | Hallo2 diffusion | Audio (native) | Natural + blink | ✅ Best | Preferred |
Phase 9: Documentation Refresh (Current)
Updates:
- README.md — updated status, added Hallo2 architecture, version table, current state table
- SETUP.md — added Hallo2 installation section
- LOG.md — appended Phases 7–8 with root-cause analysis and speed optimization details
- docs/projects/project03/index.md — synced with README for MkDocs site
Current State Summary
| Feature | Status |
|---|---|
| Single-slide videos | ✅ Working |
| Multi-slide presentations | ✅ Working |
| TTS (placeholder) | ✅ Working (edge-tts) |
| LivePortrait avatar | ✅ Working (legacy) |
| Wav2Lip lip-sync | ✅ Working (legacy) |
| Hallo2 diffusion avatar | ✅ Preferred |
| Real avatar (Prof. Hahne) | ✅ Integrated (prof_hahne_v4_512.jpg) |
| Gallery generation | ✅ Working (cache-busting) |
| Detail pages | ✅ Working |
| GPU acceleration | ✅ Working |
Generated Videos
| Video | Slides | Duration | Engine | Quality |
|---|---|---|---|---|
hallo2_presentation.mp4 |
5 | 97.2s | Hallo2 (20 steps) | ✅ Best |
videoretalking_presentation_v8.mp4 |
5 | 97.2s | LivePortrait + Wav2Lip | ✅ Good |
hallo2_test.mp4 |
1 | 17.4s | Hallo2 (40 steps) | ✅ Test |
Log maintained by Hermes Agent | Last updated: June 9, 2026