Matthew Wnuk

Visual Speech Recognition (Lip Reading) was an effort I led while at Sony Electronics to develop their in-car application of lip reading technology. In the presence of ambient noise, audio-based automatic speech recognition (ASR) degrades in accuracy, as measured by word error rate (WER). To address this, a pipeline was developed to augment and replace ASR with video of users speaking, without audio.
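For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition only; the function name and the example phrases are illustrative and are not taken from the project's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical in-car command with one dropped word: 1 error / 4 words.
print(word_error_rate("turn on the radio", "turn on radio"))
```

A lower value is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.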

Large-scale data curation was done to collect an internal dataset of phrases associated with Entertainment, Environmental, and Controls signals for the vehicle. All major published works in the application area were reproduced, tested on our internal dataset, and expanded on. This work led to several patents for myself and our team.


Temporal Receptive Fields

One of the major novel research contributions of this project was the definition of targeted temporal receptive fields in spatio-temporal data and the study of their effect on VSR applications.
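As background, the temporal receptive field of a stack of convolutions is the number of input frames that can influence a single output step. The sketch below uses the standard receptive field arithmetic for kernel size, stride, and dilation along the time axis. It is not the project's definition of a "targeted" receptive field, and the layer configuration in the example is hypothetical.

```python
from typing import Iterable, Tuple


def temporal_receptive_field(layers: Iterable[Tuple[int, int, int]]) -> int:
    """Frames seen by one output step of a stack of temporal convolutions.

    Each layer is (kernel, stride, dilation), measured along the time axis.
    """
    receptive_field = 1
    jump = 1  # spacing, in input frames, between neighbouring outputs so far
    for kernel, stride, dilation in layers:
        receptive_field += (kernel - 1) * dilation * jump
        jump *= stride
    return receptive_field


# Hypothetical stack: a 5-frame kernel, a strided 3-frame kernel, and a
# dilated 3-frame kernel.
stack = [(5, 1, 1), (3, 2, 1), (3, 1, 2)]
print(temporal_receptive_field(stack))
```

Changing kernel size, stride, or dilation in this way is one means of matching the receptive field to the duration of the speech units a model is expected to resolve.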

Spatio-Temporal Feature Extraction

Most of the major publications in the application space use spatio-temporal feature extractors, which are 3D convolutional neural networks (CNNs). Our best-performing models were custom 3D ResNets designed in-house.
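A 3D convolution slides its kernel over time as well as height and width, so each layer changes all three dimensions of a video clip. The helper below applies the usual convolution output-size formula per axis. The clip size and the kernel, stride, and padding values in the example are assumptions chosen for illustration, not the configuration of the in-house ResNets.

```python
from typing import Tuple

Shape3D = Tuple[int, int, int]  # (frames, height, width)


def conv3d_output_shape(
    input_shape: Shape3D,
    kernel: Shape3D,
    stride: Shape3D = (1, 1, 1),
    padding: Shape3D = (0, 0, 0),
    dilation: Shape3D = (1, 1, 1),
) -> Shape3D:
    """Output (frames, height, width) of a single 3D convolution layer."""
    return tuple(
        (size + 2 * p - d * (k - 1) - 1) // s + 1
        for size, k, s, p, d in zip(input_shape, kernel, stride, padding, dilation)
    )


# Hypothetical mouth-region clip: 29 frames of 88x88 pixels, passed through
# a layer that keeps the frame count but halves the spatial resolution.
print(conv3d_output_shape((29, 88, 88), kernel=(5, 7, 7),
                          stride=(1, 2, 2), padding=(2, 3, 3)))
```

With these assumed values the layer preserves temporal resolution while downsampling space, a common choice when the later stages need frame-level detail.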

Online Refinement with MFCC Features

One interesting side effect of VSR is that the audio is still available. While audio was not used for token sequence generation, it could be used to validate results after inference by adding a fully connected head that generates MFCC features and is trained with a reconstruction loss.
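The idea can be sketched without a deep learning framework: a fully connected head maps a visual feature vector to a predicted MFCC vector, and a reconstruction loss compares that prediction with MFCCs computed from the recorded audio. Everything below is an assumption made for illustration: the function names, the use of mean squared error as the reconstruction loss, and the threshold-based agreement check are not taken from the project.

```python
from typing import List, Sequence


def linear_head(features: Sequence[float],
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one predicted MFCC coefficient per weight row."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def reconstruction_loss(predicted: Sequence[float],
                        target: Sequence[float]) -> float:
    """Mean squared error between predicted and audio-derived MFCCs."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def agrees_with_audio(predicted: Sequence[float],
                      target: Sequence[float],
                      threshold: float) -> bool:
    """Hypothetical post-inference check: low error suggests agreement."""
    return reconstruction_loss(predicted, target) <= threshold


# Toy values only: a 2-dimensional visual feature and 2 MFCC coefficients.
predicted = linear_head([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]], [0.0, 1.0])
print(predicted, reconstruction_loss(predicted, [1.0, 0.5]))
```

In a real system the head would be trained jointly with, or on top of, the visual feature extractor, and the threshold would have to be calibrated on held-out data.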