|
Visual Speech Recognition (VSR, or lip reading) was an effort I led while at Sony Electronics to develop their in-car application for lip reading technology. In the presence of ambient noise, audio-based automatic speech recognition (ASR) loses accuracy, meaning its word error rate (WER) rises. To address this, a pipeline was developed to augment and replace ASR with videos of users speaking without audio. Large-scale data curation was done to collect an internal dataset of phrases associated with Entertainment, Environmental, and Controls signals for the vehicle. All major published works in the application were reproduced, tested with our internal dataset, and expanded on. This work led to several patents for myself and our team. |
Temporal Receptive Fields
One of the major novel research contributions of this project was the definition of targeted temporal receptive fields in spatio-temporal data and the study of their effect on VSR applications.
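
To make the term concrete, here is a minimal sketch of how the temporal receptive field of a stack of temporal convolutions can be computed with the standard receptive-field recurrence. The function name `temporal_receptive_field`, the layer stack, and the 25 fps frame rate are hypothetical values chosen for illustration; they are not the configuration used in the project.

```python
from typing import Sequence, Tuple

# Each layer is described along the time axis only:
# (kernel_size, stride, dilation), listed from input to output.
Layer = Tuple[int, int, int]


def temporal_receptive_field(layers: Sequence[Layer]) -> int:
    """Return how many input frames influence one output time step."""
    rf = 1    # a single output step starts by seeing one frame
    jump = 1  # spacing, in input frames, between adjacent steps at this depth
    for kernel, stride, dilation in layers:
        # A kernel spans (kernel - 1) * dilation steps of its own input,
        # and each of those steps is `jump` input frames apart.
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical stack: one wide temporal conv, then two narrower ones,
# the last of which is dilated.
example_layers = [(5, 1, 1), (3, 1, 1), (3, 1, 2)]
frames = temporal_receptive_field(example_layers)

# Assumed frame rate, used only to express the field in seconds.
FPS = 25
seconds = frames / FPS
print(f"{frames} frames, about {seconds:.2f} s of video")
```

Under these assumptions, changing kernel size, stride, or dilation is one way to target a receptive field at a chosen time span, such as the approximate duration of a short phrase.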
|
Spatio-Temporal
|
Online Refinement with MFCC Features
One interesting side effect of VSR is that audio is still available. While audio was not used for token sequence generation, it could be used to validate results after inference by adding a fully connected head that generates mel-frequency cepstral coefficient (MFCC) features and is trained with a reconstruction loss.
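
The idea can be sketched roughly as follows, using NumPy only. Everything here is an assumption made for illustration: the feature sizes, the one-MFCC-vector-per-video-frame alignment, the choice of mean squared error as the reconstruction loss, and the names `MFCCHead` and `reconstruction_loss`. Random arrays stand in for real visual features and real MFCCs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T video frames, D-dim visual features, 13 MFCCs per frame.
T, D, N_MFCC = 29, 256, 13


class MFCCHead:
    """A single fully connected layer mapping visual features to MFCCs."""

    def __init__(self, in_dim: int, n_mfcc: int, rng: np.random.Generator):
        self.W = rng.normal(0.0, 0.01, size=(in_dim, n_mfcc))
        self.b = np.zeros(n_mfcc)

    def __call__(self, feats: np.ndarray) -> np.ndarray:
        return feats @ self.W + self.b

    def sgd_step(self, feats: np.ndarray, target: np.ndarray, lr: float) -> None:
        """One gradient-descent step on the mean squared reconstruction error."""
        err = self(feats) - target
        scale = 2.0 / err.size
        self.W -= lr * scale * (feats.T @ err)
        self.b -= lr * scale * err.sum(axis=0)


def reconstruction_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between predicted and reference MFCCs."""
    return float(np.mean((pred - target) ** 2))


# Stand-ins for per-frame visual features and MFCCs computed from the audio.
visual_feats = rng.normal(size=(T, D))
reference_mfcc = rng.normal(size=(T, N_MFCC))

head = MFCCHead(D, N_MFCC, rng)
loss_before = reconstruction_loss(head(visual_feats), reference_mfcc)
head.sgd_step(visual_feats, reference_mfcc, lr=0.1)
loss_after = reconstruction_loss(head(visual_feats), reference_mfcc)
```

After inference, the same loss between the predicted MFCCs and those of the recorded audio could serve as a consistency score: a large error would flag an utterance where the visual prediction and the audio disagree.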
|