Pixify feature

AI Avatar / Talking Head

One photo + one audio = a talking digital human

Just a photo and audio — that's it
OmniHuman v1.5 / Hedra and other top models
Auto lip-sync, expressions, head motion
9:16 vertical and 16:9 landscape


What is it

AI Avatar combines a static face photo with audio to generate a video of a digital human speaking. The system automatically aligns the lips with the audio and adds natural blinks, nods, and expression changes. Common uses include e-commerce voiceover, educational videos, and virtual hosts. A 30-second video takes 2-5 minutes to generate.

How to use it

Get started in 5 steps

1. Upload a face photo
A clear, front-facing photo works best. PNG, JPG, or WEBP, 5-10MB. The face is detected automatically.
2. Upload audio
The line you want spoken. MP3, WAV, or M4A, up to 20MB. English, Mandarin, and other languages are supported. Alternatively, generate the audio first with the Text to Audio node.
3. Optional: scene prompt
Describe shot framing, action, and expression hints, for example "medium shot, natural smile, occasional nods".
4. Pick model + aspect ratio
OmniHuman v1.5 is recommended. Use 9:16 for short-form platforms and 16:9 for long-form platforms.
5. Generate + download
Hit Generate and wait 2-5 minutes. Then download the video, save it, or send it to the Workflow Editor for post-processing.
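
If you prepare files in bulk, a quick local check before uploading can catch format and size problems early. The sketch below is only an illustration of the limits listed in the steps above: the function name `check_inputs` is hypothetical and not part of any Pixify API, accepting `.jpeg` alongside `.jpg` is an assumption, and 1 MB is assumed to mean 1024 * 1024 bytes. The photo size range is not enforced here because the steps do not say whether it is a hard limit.

```python
from pathlib import Path

# Limits taken from the steps above. Names are illustrative only and do not
# correspond to any Pixify API.
PHOTO_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}  # ".jpeg" is an assumption
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}
MAX_AUDIO_BYTES = 20 * 1024 * 1024  # "up to 20MB", assuming 1 MB = 1024 * 1024 bytes


def check_inputs(photo_name: str, audio_name: str, audio_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the inputs look acceptable."""
    problems = []
    if Path(photo_name).suffix.lower() not in PHOTO_EXTENSIONS:
        problems.append(f"photo must be PNG, JPG, or WEBP: {photo_name}")
    if Path(audio_name).suffix.lower() not in AUDIO_EXTENSIONS:
        problems.append(f"audio must be MP3, WAV, or M4A: {audio_name}")
    if audio_bytes > MAX_AUDIO_BYTES:
        problems.append(f"audio is over 20MB: {audio_bytes} bytes")
    return problems
```

For example, `check_inputs("host.png", "line.mp3", 1_000_000)` returns an empty list, while a GIF photo paired with a 30MB OGG file would report three problems.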

Use cases

What other users build with it

E-commerce voiceover

Host photo + product script audio → short video. Saves 90% of the time compared with live recording.

Educational content

Historical figure photo + lecture audio = "the ancients" teaching history.

Virtual hosts

Same character across episodes for consistent brand persona.

Multilingual marketing

One photo plus audio tracks in multiple languages → one shoot, every language.

Why Pixify

Two inputs, dead simple

Upload a photo and audio. It takes 30 seconds to submit.

Frame-accurate lip-sync

OmniHuman v1.5 is the current industry state of the art for lip alignment.

Workflow chainable

Chain with Text to Audio (to synthesize spoken lines) or Audio Video Merge (to add background music).

Frequently asked questions

What kind of photo works best?

A clear, front-facing photo with even lighting. Profile shots, sunglasses, and extreme angles hurt lip-sync accuracy. A minimum of 1024x1024 pixels is recommended.
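
To check the recommended minimum resolution before uploading, the image dimensions can be read from the file header. The sketch below handles PNG only, using the standard PNG layout (8-byte signature, then the IHDR chunk with width and height as big-endian 32-bit integers). The function names and the `MIN_SIDE` constant are illustrative, and JPG or WEBP files would need a different parser or an imaging library.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_SIDE = 1024  # recommended minimum from the answer above


def png_dimensions(header: bytes) -> tuple[int, int]:
    """Read (width, height) from the first 24 bytes of a PNG file."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # Width and height are stored as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def meets_minimum(width: int, height: int, min_side: int = MIN_SIDE) -> bool:
    """True when both sides reach the recommended minimum."""
    return width >= min_side and height >= min_side
```

In practice you would pass in the first 24 bytes of the file, for example `open(path, "rb").read(24)`.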

How long can the audio be?

Currently ~60 seconds per generation. For longer content, split audio into segments and use Video Merge to concatenate.
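
Planning the split is simple arithmetic. The sketch below divides a recording into equal windows that each stay within the limit. It assumes a 60-second ceiling (the limit above is approximate, so a lower value may be safer), and `plan_segments` is a hypothetical helper, not a Pixify function.

```python
import math

MAX_SEGMENT_SECONDS = 60.0  # approximate per-generation limit from the answer above


def plan_segments(
    total_seconds: float, max_seconds: float = MAX_SEGMENT_SECONDS
) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into equal (start, end) windows no longer than max_seconds."""
    if total_seconds <= 0:
        return []
    count = math.ceil(total_seconds / max_seconds)  # fewest segments that fit the limit
    length = total_seconds / count                  # equal length avoids a tiny last segment
    return [(i * length, min((i + 1) * length, total_seconds)) for i in range(count)]
```

A 150-second recording yields three 50-second windows. Equal windows can cut through a word, so in practice you may want to nudge each boundary to the nearest pause before exporting the segments.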

Can I do two-person dialogue?

One avatar per generation. For dialogue, generate speaker A and speaker B separately, then combine the clips with Video Merge and Audio Video Merge.
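
For a longer exchange, it helps to work out up front which clips to generate for each speaker and in what order to merge them. The following is a minimal sketch of that bookkeeping. It assumes each turn is a `(speaker, audio_clip)` pair, and `plan_dialogue` is a hypothetical helper, not part of Pixify.

```python
from collections import defaultdict


def plan_dialogue(
    turns: list[tuple[str, str]],
) -> tuple[dict[str, list[str]], list[tuple[str, int]]]:
    """Group audio clips by speaker and record the order in which to merge the results.

    Returns (jobs, merge_order): jobs maps each speaker to the clips to generate
    with that speaker's photo; merge_order lists (speaker, index into jobs[speaker])
    in the original conversation order.
    """
    jobs: dict[str, list[str]] = defaultdict(list)
    merge_order: list[tuple[str, int]] = []
    for speaker, clip in turns:
        merge_order.append((speaker, len(jobs[speaker])))
        jobs[speaker].append(clip)
    return dict(jobs), merge_order
```

Each entry in `jobs` corresponds to one generation with that speaker's photo, and `merge_order` gives the sequence to follow when merging the finished videos.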

Who owns the output?

You own the generated video. However, you must have the legal right to use the input face photo: your own face, a licensed image, or an AI-generated face. Celebrity photos and photos of real people used without their consent are not allowed.
