Pixify feature

AI Avatar / Talking Head

One photo + one audio = a talking digital human

  • Just a photo and audio — that's it
  • OmniHuman v1.5 / Hedra and other top models
  • Auto lip-sync, expressions, head motion
  • 9:16 vertical and 16:9 landscape

What is it

AI Avatar combines a static face photo with an audio track to generate a video of a digital human speaking. The system automatically aligns the lips with the audio and adds natural blinks, nods, and expression changes. Common uses: e-commerce voiceover, educational videos, virtual hosts. A 30-second video takes 2-5 minutes to generate.

How to use it

Get started in 5 steps

  1. Upload a face photo

    A clear, front-facing photo works best. 5-10MB PNG/JPG/WEBP. The face is auto-detected.

  2. Upload audio

    The line you want spoken. MP3/WAV/M4A up to 20MB. English, Mandarin, and other languages are supported. Alternatively, generate the audio with the Text to Audio node first.

  3. Optional: scene prompt

    Describe shot framing, action, and expression hints ("medium shot, natural smile, occasional nods").

  4. Pick model + aspect ratio

    OmniHuman v1.5 recommended. 9:16 for short-form, 16:9 for long-form platforms.

  5. Generate + download

    Hit Generate, wait 2-5 minutes. Download, save, or send to Workflow Editor for postprocessing.
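
The input limits in the steps above can be checked locally before uploading. The following is a minimal sketch, not Pixify's own validation: the function name `check_inputs` is hypothetical, the 10 MB photo ceiling is an assumption (the step says "5-10MB" without stating a hard limit), and accepting `.jpeg` alongside `.jpg` is also an assumption.

```python
from pathlib import Path

# Formats and the 20 MB audio limit come from the steps above.
# Assumptions: the photo ceiling is 10 MB, and ".jpeg" counts as JPG.
PHOTO_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
AUDIO_EXTS = {".mp3", ".wav", ".m4a"}
PHOTO_MAX_BYTES = 10 * 1024 * 1024
AUDIO_MAX_BYTES = 20 * 1024 * 1024
ASPECT_RATIOS = {"9:16", "16:9"}


def check_inputs(photo_name, photo_bytes, audio_name, audio_bytes, aspect_ratio):
    """Return a list of problems found; an empty list means the inputs look fine."""
    problems = []
    if Path(photo_name).suffix.lower() not in PHOTO_EXTS:
        problems.append("photo must be PNG, JPG, or WEBP")
    if photo_bytes > PHOTO_MAX_BYTES:
        problems.append("photo is larger than 10 MB")
    if Path(audio_name).suffix.lower() not in AUDIO_EXTS:
        problems.append("audio must be MP3, WAV, or M4A")
    if audio_bytes > AUDIO_MAX_BYTES:
        problems.append("audio is larger than 20 MB")
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append("aspect ratio must be 9:16 or 16:9")
    return problems
```

Calling `check_inputs("host.png", 2_000_000, "line.mp3", 1_000_000, "9:16")` returns an empty list. Each failed check adds one message, so several problems can be reported in a single pass.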

Use cases

What other users build with it

E-commerce voiceover

Host photo + product script audio → short video. 90% time savings vs live recording.

Educational content

Historical figure photo + lecture audio = "the ancients" teaching history.

Virtual hosts

Same character across episodes for consistent brand persona.

Multilingual marketing

One photo, multiple language audios → one shoot, all languages.

Why Pixify

Dead simple: two inputs

Upload photo + audio. 30 seconds to submit.

Frame-accurate lip-sync

OmniHuman v1.5 is the current industry state of the art for lip alignment.

Workflow chainable

Chain with Text to Audio (synth lines) or Audio Video Merge (add BGM).

Frequently asked questions

What kind of photo works best?

A clear, front-facing photo with even lighting. Profile shots, sunglasses, and extreme angles hurt lip-sync accuracy. A minimum resolution of 1024x1024 is recommended.

How long can the audio be?

Currently ~60 seconds per generation. For longer content, split audio into segments and use Video Merge to concatenate.
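
The split points for longer audio can be worked out ahead of time. This is a rough sketch under stated assumptions: `segment_bounds` is a hypothetical helper, the 60-second cap is the approximate figure quoted above, and the cuts are evenly spaced rather than placed at pauses in speech, which would usually sound better.

```python
import math

# Approximate per-generation limit quoted above; treat it as an assumption
# and lower it if you want a safety margin.
MAX_SEGMENT_SECONDS = 60.0


def segment_bounds(total_seconds, max_len=MAX_SEGMENT_SECONDS):
    """Split [0, total_seconds] into equal segments no longer than max_len.

    Returns a list of (start, end) pairs in seconds.
    """
    if total_seconds <= 0:
        return []
    count = math.ceil(total_seconds / max_len)  # fewest segments that fit
    length = total_seconds / count              # equal lengths avoid a tiny last clip
    return [(i * length, min((i + 1) * length, total_seconds)) for i in range(count)]
```

For a 150-second narration this yields three 50-second segments, each of which can be generated separately and then joined with Video Merge.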

Can I do two-person dialogue?

One avatar per generation. For dialogue, generate speakers A and B separately, then use Video Merge and Audio Video Merge to combine them.

Who owns the output?

You own the generated video. However, you must have the legal right to use the input face photo: your own face, a licensed image, or an AI-generated one. Celebrity photos and photos of real people who have not consented are not allowed.

Ready to start?

Signing up gets you starter credits. No card required.

Generate your first avatar