Please open the demo webpage in Chrome for an enhanced experience. Headphones are recommended for better audio quality.
⭐ Avatar Animation Videos
⭐ New Update ⭐: With HarmoniVox, we generated speech that naturally complements the visual characteristics of the avatar image. Then, using the speech-driven talking video generator Hallo3, we produced an avatar animation that achieves audiovisual harmony.
These demos are powered by Hallo3, a speech-driven talking video generator. We gratefully acknowledge its authors' contribution.
HarAvaSpeech Dataset
Dataset Visualization
Here we visualize the distribution of our dataset HarAvaSpeech over body posture (Fig. 5(a)), scene (Fig. 5(b)), language, gender, age, and emotion.
The following are random samples from our dataset HarAvaSpeech.
[Image grid: 20 random samples from HarAvaSpeech]
Effectiveness of Momentary Visual Contexts
To verify the effectiveness of non-facial regions, we conduct comparative experiments on three versions of the input image: face-only, non-face, and original. All three are run with the HarmoniVox framework but without loss weight annealing.
# | Text | Non-Face | Face-Only | Original |
---|---|---|---|---|
1 | The cubs go sixty and thirty four from that point forward , | ![]() | ![]() | ![]() |
2 | I mean you cannot , in the same moment of thought , | ![]() | ![]() | ![]() |
3 | 二奶奶在哭泪珠打她眼角里簌簌往下落*(trans: The second lady was crying, and tears fell from the corners of her eyes.)* | ![]() | ![]() | ![]() |
4 | Amazing , amazing poems . | ![]() | ![]() | ![]() |
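The three input variants compared above (original, face-only, non-face) can be illustrated with a small masking sketch. This page does not describe how the regions were actually extracted, so everything here is an assumption for illustration: the function name `split_regions`, the rectangular `face_box` (presumed to come from some external face detector), and the choice to fill masked pixels with a constant value.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def split_regions(image: Image, face_box: Box, fill: int = 0) -> Dict[str, Image]:
    """Build the original, face-only, and non-face variants of an image.

    Hypothetical helper: the face box and the constant-fill masking are
    illustrative assumptions, not the preprocessing used for the demos.
    """
    top, left, bottom, right = face_box
    # Face-only starts fully masked; non-face starts as a copy of the input.
    face_only = [[fill] * len(row) for row in image]
    non_face = [list(row) for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            face_only[y][x] = image[y][x]  # keep pixels inside the face box
            non_face[y][x] = fill          # blank out pixels inside the face box
    return {
        "original": [list(row) for row in image],
        "face_only": face_only,
        "non_face": non_face,
    }


if __name__ == "__main__":
    toy = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12]]
    for name, variant in split_regions(toy, face_box=(0, 1, 2, 3)).items():
        print(name, variant)
```

Each variant would then serve as the conditional image for the same text, so that differences in the synthesized speech can be attributed to the visible region.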
HarmoniVox Framework
In this section, we provide samples that illustrate the intricate modeling of the HarmoniVox framework. First, we show the model's ability to use out-of-domain portraits and emojis to refine the style of speech. Then we demonstrate how the model responds to different aspects of the conditional image, such as gender, hues, and emotion. Finally, we present the results of the ablation study.
Generalizability Case Study
Out-of-Domain Portraits
In this section, we present the model's ability to synthesize speech refined by out-of-domain portraits.
# | Text | Condition 1 | Condition 2 |
---|---|---|---|
1 | I just got a call from the hospital about the test results. | ![]() | ![]() |
2 | Listen, I never expected it to turn out this way. | ![]() | ![]() |
3 | No matter the obstacles, we will always find a way to overcome them and reach our goals. | ![]() | ![]() |
4 | This is exactly what I was talking about. | ![]() | ![]() |
5 | 这些日子做了多少违背我良心的事*(trans: These days, I've done so many things that go against my conscience.)* | ![]() | ![]() |
Gender
# | Text | Conditional Image A | Conditional Image B |
---|---|---|---|
1 | Hello, welcome to Mickey Mouse Clubhouse, this is super cool! | ![]() | ![]() |
2 | 新思想引领新征程,我国综合立体交通网加速形成*(trans: New ideas lead a new journey, and our country's comprehensive three-dimensional transportation network is taking shape at an accelerating pace.)* | ![]() | ![]() |
3 | 所以你看,张三构不构成犯罪,是由什么决定的*(trans: So you see, what determines whether or not Zhang San has committed a crime?)* | ![]() | ![]() |
4 | Hahaha, are you kidding me? | ![]() | ![]() |
Hues
# | Text | Conditional Image A | Conditional Image B |
---|---|---|---|
1 | I would like to say, that your journey has come to an end. | ![]() | ![]() |
2 | 你可真是个小坏蛋*(trans: You are such a little rascal)* | ![]() | ![]() |
Emotion
# | Text | Conditional Image A | Conditional Image B | Conditional Image C |
---|---|---|---|---|
1 | 你太让我失望了*(trans: You have disappointed me so much.)* | ![]() | ![]() | ![]() |
2 | 我的天哪你别靠近我*(trans: Oh my god, don't come near me.)* | ![]() | ![]() | ![]() |

# | Text | Conditional Image A | Conditional Image B |
---|---|---|---|
3 | 他们还厚颜无耻的说出这样的话,恶心*(trans: They have the audacity to say such things, it's disgusting.)* | ![]() | ![]() |
4 | We were going to take the shortcut onto brightside and as we were turning | ![]() | ![]() |
5 | My best friend and my sister! I can not believe that! | ![]() | ![]() |
6 | What are you talking about? | ![]() | ![]() |
Ablation Study
To verify the effectiveness of multi-modal contrastive learning, we conduct ablation experiments. A subset of the synthesized audio samples is presented below.
# | Image & Text | Proposed | Proposed w/o Text | Baseline |
---|---|---|---|---|
1 | ![]() 你这个问题问的的确好笑*(trans: Your question really is funny.)* | | | |
2 | ![]() 弄的我名声不佳我简直要气死了呀*(trans: It's given me a bad reputation, and I'm really pissed off)* | | | |
3 | ![]() 有的,最后一朵花的幽灵依然存在着!*(trans: Yes, the ghost of the last flower still exists!)* | | | |
4 | ![]() A man stood behind blaine , | | | |
5 | ![]() From what you understand , | | | |
Compared to Vanilla TTS
In this section, we compare the proposed HarmoniVox framework with a vanilla TTS model, taking CosyVoice as the representative vanilla TTS model.
From the comparison, we observe that HarmoniVox generates speech that is more closely aligned with the visual context. However, due to the limitations of its VITS backbone, HarmoniVox may not generate inter-word pauses as effectively as the vanilla TTS model, which can reduce the naturalness of the speech.
The following are samples from this comparison.
# | We were going to take the shortcut onto the brightside and as we were turning. |
---|---|
Vanilla TTS (CosyVoice) | |
HarmoniVox | ![]() |
HarmoniVox | ![]() |
Appendix Content
Appendix A: Intra-domain Testset
Appendix B: Out-of-domain Testset
The out-of-domain testset is collected from the Internet, and its emotion and age distributions are more balanced than those of the intra-domain testset.