A dual-pipeline system combining hand gesture recognition and facial emotion recognition in a single live webcam application running at 20-30 FPS.
A 63-dimensional normalized landmark vector (21 hand landmarks × 3 coordinates) feeds a three-layer MLP that classifies 13 gesture classes. Custom multi-hand filtering logic improved recognition reliability from 86% to about 95%.
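The gesture head can be sketched as follows. This is a minimal NumPy illustration, not the project's code: the wrist-relative normalization scheme and the hidden sizes (128, 64) are assumptions; only the 63-d input and 13 output classes come from the text.

```python
import numpy as np

def normalize_hand(pts: np.ndarray) -> np.ndarray:
    """Flatten 21 (x, y, z) landmarks into a 63-d vector, translated to
    the wrist and scaled to unit size. One plausible normalization; the
    project's exact scheme is not specified."""
    centered = pts - pts[0]                      # wrist-relative coordinates
    scale = np.linalg.norm(centered, axis=1).max() or 1.0
    return (centered / scale).ravel()            # shape (63,)

def mlp_forward(x: np.ndarray, weights) -> np.ndarray:
    """Three-layer MLP forward pass: ReLU on hidden layers, raw logits out."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)               # ReLU
    return x

# Hypothetical layer sizes 63 -> 128 -> 64 -> 13.
rng = np.random.default_rng(0)
dims = [63, 128, 64, 13]
weights = [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
           for a, b in zip(dims, dims[1:])]

logits = mlp_forward(normalize_hand(rng.random((21, 3))), weights)
print(logits.shape)  # (13,)
```

In practice the weights would be trained (e.g., with PyTorch) and the argmax over the 13 logits would select the gesture label per frame.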
The emotion pipeline uses a Graph Convolutional Network with three GCNConv layers followed by global pooling. Handcrafted expression features such as mouth openness, brow height, and eye openness further improve classification quality.
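The core propagation step and one handcrafted feature can be sketched in plain NumPy so the math is visible; the real model presumably uses a GCN library such as PyTorch Geometric. The toy graph, feature dimensions, and landmark indices below are assumptions for illustration only.

```python
import numpy as np

def gcn_layer(H: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCNConv-style propagation step (Kipf-Welling normalization):
    H' = ReLU(D^-1/2 (A + I) D^-1/2 H W). Stacking three of these plus
    global pooling mirrors the architecture described above."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

def mouth_openness(lm: np.ndarray, upper=13, lower=14, top=10, chin=152) -> float:
    """Handcrafted feature sketch: vertical lip gap normalized by face
    height. Indices follow MediaPipe FaceMesh conventions but are
    assumptions here, not confirmed by the project."""
    return float(np.linalg.norm(lm[upper] - lm[lower])
                 / (np.linalg.norm(lm[top] - lm[chin]) + 1e-6))

rng = np.random.default_rng(1)
# Toy chain graph over 5 landmark nodes, 3-d inputs, 8-d outputs.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
out = gcn_layer(rng.random((5, 3)), A, rng.standard_normal((3, 8)))
emb = out.mean(axis=0)                           # global mean pooling
face = rng.random((468, 3))                      # dummy FaceMesh-sized input
print(emb.shape, mouth_openness(face))
```

A natural way to combine the two signal types is concatenating the pooled graph embedding with the handcrafted scalars before the final classifier, though the project's exact fusion strategy is not stated.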
Temporal wrist trajectory tracking adds dynamic gesture detection on top of static pose recognition, enabling more natural real-time interaction patterns.
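A minimal version of the temporal layer: buffer recent wrist positions and flag a swipe when net horizontal displacement over the window dominates vertical motion. The window size, threshold, and event names are illustrative assumptions, not the project's values.

```python
from collections import deque
from typing import Optional

class WristTracker:
    """Tracks recent wrist (x, y) positions in normalized image
    coordinates and detects horizontal swipes over the window."""
    def __init__(self, window: int = 15, threshold: float = 0.25):
        self.history = deque(maxlen=window)      # rolling trajectory buffer
        self.threshold = threshold               # min net x-displacement

    def update(self, x: float, y: float) -> Optional[str]:
        self.history.append((x, y))
        if len(self.history) < self.history.maxlen:
            return None                          # not enough history yet
        dx = self.history[-1][0] - self.history[0][0]
        dy = self.history[-1][1] - self.history[0][1]
        # Require mostly-horizontal motion exceeding the threshold.
        if abs(dx) > self.threshold and abs(dx) > 2 * abs(dy):
            return "swipe_right" if dx > 0 else "swipe_left"
        return None

# Simulate a wrist moving steadily right across the frame.
tracker = WristTracker()
events = [tracker.update(0.1 + 0.03 * i, 0.5) for i in range(20)]
print([e for e in events if e])
```

A production version would debounce the event (here the same swipe fires on every frame once the window fills) and reset the buffer after emitting, so each physical gesture produces one event.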
Modeling facial landmarks as a graph preserves structural relationships between features that raw image pipelines can miss, improving emotion recognition while keeping computation low.
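One plausible way to turn a landmark set into the graph the GCN consumes is k-nearest-neighbor edges, sketched below; a fixed mesh topology (e.g., the MediaPipe FaceMesh tesselation) would be an equally valid choice, and the project's actual edge construction is not specified.

```python
import numpy as np

def knn_edges(pts: np.ndarray, k: int = 4) -> np.ndarray:
    """Build a symmetric adjacency matrix connecting each landmark to its
    k nearest neighbors in 3-d, preserving local facial structure."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-edges
    A = np.zeros_like(d)
    for i, nbrs in enumerate(np.argsort(d, axis=1)[:, :k]):
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)                    # symmetrize

pts = np.random.default_rng(2).random((68, 3))  # e.g., a 68-point landmark set
A = knn_edges(pts)
print(A.shape, int(A.sum()))
```

Because edges follow geometric proximity, neighboring regions such as lips or brows stay connected in the graph, which is exactly the structural information a flat pixel pipeline discards.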
The system runs entirely on landmark geometry rather than heavy raw-image CNN processing, reducing overhead while keeping the application responsive on standard webcam setups.