This is joint work by Hai Xuan Pham, Yuting Wang, and Vladimir Pavlovic. The paper was accepted at the 20th ACM International Conference on Multimodal Interaction (ICMI 2018).

Abstract
We present a deep learning framework for real-time speech-driven 3D facial animation. Our deep neural network directly maps an input sequence of speech spectrograms to a series of micro facial action unit intensities that drive a 3D blendshape face model. In particular, our deep model learns latent representations of the time-varying contextual information and affective states within the speech. Hence, at inference the model not only activates appropriate facial action units to depict different utterance-generating actions, in the form of lip movements, but also, without any additional assumptions, automatically estimates the speaker's emotional intensity and reproduces her ever-changing affective states by adjusting the strength of the facial unit activations. For example, in a happy speech the mouth opens wider than normal while other facial units are relaxed, and both eyebrows rise higher in a surprised state. Experiments on diverse audiovisual corpora of different actors, covering a wide range of facial actions and emotional states, show promising results for our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.
Full paper at https://dl.acm.org/citation.cfm?id=3243017.
Published code at https://github.com/haixpham/end2end_AU_speech.
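To make the mapping described in the abstract (spectrogram frames to facial action unit intensities) concrete, below is a minimal, illustrative sketch in PyTorch. The layer choices, sizes, use of a GRU, and the number of action units (46 here) are assumptions made for illustration only and are not the authors' published architecture; see the repository above for the actual implementation.

# Illustrative sketch only: a sequence model mapping speech spectrogram
# frames to facial action unit (AU) intensities. Layer choices, sizes,
# and the number of AUs (46, a common blendshape count) are assumptions,
# not the authors' published architecture.
import torch
import torch.nn as nn

class SpeechToAU(nn.Module):
    def __init__(self, n_freq_bins=128, hidden_size=256, n_aus=46):
        super().__init__()
        # Per-frame encoder over the frequency bins of one spectrogram column.
        self.frame_encoder = nn.Sequential(
            nn.Linear(n_freq_bins, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
        )
        # Recurrent layer to capture time-varying context across frames.
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        # Regress per-frame AU intensities into [0, 1].
        self.head = nn.Sequential(nn.Linear(hidden_size, n_aus), nn.Sigmoid())

    def forward(self, spectrogram):
        # spectrogram: (batch, time, n_freq_bins)
        x = self.frame_encoder(spectrogram)
        x, _ = self.rnn(x)
        return self.head(x)  # (batch, time, n_aus)

# Usage on dummy data: 1 clip, 100 spectrogram frames, 128 frequency bins.
model = SpeechToAU()
au_intensities = model(torch.randn(1, 100, 128))
print(au_intensities.shape)  # torch.Size([1, 100, 46])

The predicted per-frame AU intensities would then be used as blendshape weights to deform the 3D face model at render time.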
Reference
2018
- H. X. Pham, Y. Wang, and V. Pavlovic, “End-to-end Learning for 3D Facial Animation from Speech,” in ICMI, 2018, pp. 361–365. doi:10.1145/3242969.3243017
[BibTeX]
@InProceedings{hai18icmi,
  author    = {Hai Xuan Pham and Yuting Wang and Vladimir Pavlovic},
  booktitle = {{ICMI}},
  title     = {End-to-end Learning for 3D Facial Animation from Speech},
  year      = {2018},
  pages     = {361--365},
  publisher = {{ACM}},
  doi       = {10.1145/3242969.3243017},
}