What is a "viseme" and how does it relate to the concept of a phoneme?
A phoneme is the basic acoustic unit of speech (42 in American English), while a viseme is the basic unit of visual speech: a class of phonemes that are visually indistinguishable on the lips.
Compare "Appearance Based Features" and "Model-based features" for visual speech extraction.
Appearance-based features use the pixel (gray-value) information of the mouth region; they are robust to extract but have high dimensionality, which is reduced with PCA/LDA. Model-based features describe the mouth with a small set of model parameters (e.g., lip-contour shape parameters), so their dimensionality is low, but the parameters are harder to extract accurately.
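As an illustration of the dimensionality problem, the following sketch compresses flattened gray-value mouth regions with scikit-learn's PCA. It is not the system described here; the ROI size and the number of components are assumptions.

```python
# Minimal sketch: reducing appearance-based (pixel) lip features with PCA.
# ROI size (32x32) and 30 components are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

n_frames = 500
roi_pixels = 32 * 32                      # each frame: flattened gray-value ROI around the mouth

# Placeholder for extracted mouth-region gray values, one row per video frame
X = np.random.rand(n_frames, roi_pixels)

pca = PCA(n_components=30)                # compress ~1024 pixel values to 30 appearance features
features = pca.fit_transform(X)
print(features.shape)                     # (500, 30)
```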
What are the four core challenges in implementing a computer vision system for speech recognition?
The challenges include feature selection (invariants for lighting/motion), motion estimation (finding correspondence in images), active vision (controlling camera zoom/tilt), and ensuring real-time implementation.
What is the mathematical purpose of "Histogram Normalization" in processing lip images?
It applies a transfer function T to each pixel's original gray value, f′(p) = T(f(p)), mapping the image onto a standardized gray-value distribution and making the system robust against different illumination conditions.
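A minimal sketch of such a transfer function, assuming histogram equalization as the choice of T (the text does not specify the exact mapping):

```python
# Gray-value transfer function f'(p) = T(f(p)); here T is histogram equalization,
# one common choice for normalizing illumination (an assumption, not the exact method used).
import numpy as np

def equalize(image: np.ndarray) -> np.ndarray:
    """Map 8-bit gray values onto a roughly uniform histogram."""
    hist, _ = np.histogram(image.flatten(), bins=256, range=(0, 256))
    cdf = hist.cumsum()
    cdf = cdf / cdf[-1]                       # normalized cumulative distribution
    T = np.round(cdf * 255).astype(np.uint8)  # lookup table: old gray value -> new gray value
    return T[image]                           # apply T(f(p)) to every pixel p

lip_roi = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
normalized = equalize(lip_roi)
```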
What are the three levels where information fusion can take place in an AV-ASR system?
Fusion can occur at the Feature level (often too simple), the Phoneme/Viseme Level (typically successful), or the Word Level (known as the decision level).
In the context of fusion, what is the primary disadvantage of "Early Fusion" (Feature Level)?
Early fusion requires a large amount of training data and often fails to generalize well if the different modes (like speech and lip movement) differ substantially in their information content or time scales.
How does the "Multistream Architecture" differ from a standard "Concatenation" approach in fusion?
In Multistream, audio and video are processed as separate, parallel streams through their own feature extraction and LDA before being fused during recognition, whereas Concatenation blends them into a single feature vector earlier in the process.
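A minimal sketch of the structural difference; the feature dimensions and the scoring functions are illustrative stand-ins, not the models of the system in the text:

```python
# Concatenation vs. multistream fusion, structurally.
import numpy as np

audio_feat = np.random.rand(39)   # one frame of audio features (dimension assumed)
video_feat = np.random.rand(30)   # one frame of visual lip features (dimension assumed)

# Concatenation: both modalities are joined into one feature vector that a
# single recognizer processes.
joint_feat = np.concatenate([audio_feat, video_feat])     # shape (69,)

# Multistream: each stream keeps its own feature pipeline and model; their
# (log-)scores are only combined during recognition.
def audio_log_score(x): return float(-np.sum(x ** 2))     # stand-in for an acoustic model score
def video_log_score(x): return float(-np.sum(np.abs(x)))  # stand-in for a visual model score

combined = audio_log_score(audio_feat) + video_log_score(video_feat)
```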
What is the "Reliability based weights" method in multimodal hypothesis calculation (hypc)?
This method sets the stream weights (λ) from the Signal-to-Noise Ratio (SNR) of the audio: the noisier the audio, the more weight the system automatically shifts to the visual data stream.
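A minimal sketch of SNR-based weighting; the linear mapping from SNR to λ and the SNR limits are illustrative assumptions, not the exact formula used by the system:

```python
# SNR-based reliability weighting of the audio and video stream scores.
import numpy as np

def audio_weight(snr_db: float, snr_low: float = 0.0, snr_high: float = 30.0) -> float:
    """Map the estimated audio SNR (dB) to a weight in [0, 1]: clean audio -> ~1, noisy -> ~0."""
    return float(np.clip((snr_db - snr_low) / (snr_high - snr_low), 0.0, 1.0))

def combined_log_score(audio_score: float, video_score: float, snr_db: float) -> float:
    lam = audio_weight(snr_db)
    # Noisy audio (low SNR) lowers lam, shifting weight onto the visual stream.
    return lam * audio_score + (1.0 - lam) * video_score

print(combined_log_score(audio_score=-12.0, video_score=-15.0, snr_db=25.0))  # audio dominates
print(combined_log_score(audio_score=-12.0, video_score=-15.0, snr_db=3.0))   # video dominates
```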
Based on the LVCSR database results, how does increasing the training set from 14 to 30 speakers affect accuracy?
Increasing the training set significantly improves word accuracy across all models, with the Multi-Stream architecture reaching the highest accuracy (61.76%) compared to pure audio (60.71%).
What are the three "best" system settings identified by the Nlips lipreading system research?
The research concluded that the best features are Gray Values/LDA, the best fusion level is Phonetic, and the best fusion weights are SNR-based or Entropy-based.
What specific problem regarding user privacy does EMG-based ASR aim to solve?
EMG-ASR allows for speech recognition without acoustic information (measuring muscle potential instead), which enables "silent" communication in public spaces or meetings where privacy is a concern.
What is the "Session-independent recognition" challenge in EMG-based ASR?
It refers to the difficulty of training a system to recognize a word when the electric signal for that word changes significantly across different recording sessions or users.
How is Multimodal Fusion functionally described in a system architecture?
It is the component that combines and integrates all incoming uni-modal events into a single representation of the intention most likely expressed by the user.