Project 001

Real-Time Multimodal Computer Vision System

Dual-pipeline system combining hand gesture recognition and facial emotion recognition in a unified live webcam application running at 20-30 FPS.

~95% Gesture Accuracy
20-30 FPS Live
19K MLP Params

System Architecture

Hand Gesture Pipeline

MediaPipe -> 21 Landmarks -> MLP

Each detected hand yields 21 landmarks whose (x, y, z) coordinates form a 63-dimensional normalized vector, which feeds a three-layer MLP classifying 13 gesture classes. Custom multi-hand filtering logic improved reliability from 86% to roughly 95%.
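The classifier head can be sketched as follows. The hidden widths (128 and 64) are assumptions for illustration; the source only states a three-layer MLP totaling roughly 19K parameters.

```python
import torch
import torch.nn as nn

class GestureMLP(nn.Module):
    """Sketch of the gesture classifier. Hidden widths are assumed;
    the real model reportedly totals ~19K parameters."""
    def __init__(self, in_dim: int = 63, n_classes: int = 13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),  # logits for 13 static gestures
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = GestureMLP()
landmarks = torch.randn(1, 63)  # 21 landmarks x (x, y, z), normalized
logits = model(landmarks)
print(logits.shape)  # torch.Size([1, 13])
```

A model this small runs in well under a millisecond per frame on a CPU, which is what keeps the live loop in the 20-30 FPS range.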

Facial Emotion Pipeline

MediaPipe -> 468 Landmarks -> GCN

A Graph Convolutional Network uses three GCNConv layers plus global pooling. Handcrafted expression features such as mouth openness, brow height, and eye openness improve classification quality.
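The project uses PyTorch Geometric's GCNConv; the same operation can be sketched densely in plain PyTorch to show what each layer computes. Node counts, feature widths, and the toy edge list below are placeholders (the real graph has 468 landmark nodes).

```python
import torch
import torch.nn as nn

class DenseGCNLayer(nn.Module):
    """One graph convolution, H' = ReLU(A_norm @ H @ W): a dense
    stand-in for PyTorch Geometric's GCNConv."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor, a_norm: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.lin(a_norm @ h))

def normalize_adjacency(a: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = a + torch.eye(a.size(0))
    d = a.sum(dim=1).rsqrt()
    return d.unsqueeze(1) * a * d.unsqueeze(0)

# Toy graph: 5 landmark nodes with 3-d coordinates (real mesh: 468 nodes).
a = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    a[i, j] = a[j, i] = 1.0
a_norm = normalize_adjacency(a)

h = torch.randn(5, 3)
layers = nn.ModuleList([DenseGCNLayer(3, 16), DenseGCNLayer(16, 16), DenseGCNLayer(16, 16)])
for layer in layers:
    h = layer(h, a_norm)          # three stacked graph convolutions
graph_emb = h.mean(dim=0)         # global mean pooling over nodes
logits = nn.Linear(16, 7)(graph_emb)  # 7 emotion classes
print(logits.shape)
```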

Wave Detection Innovation

Temporal wrist trajectory tracking adds dynamic gesture detection on top of static pose recognition, enabling more natural real-time interaction patterns.
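One way to sketch this idea: keep a short window of wrist x-positions and flag a wave when the horizontal motion reverses direction several times with enough amplitude. The window size and thresholds below are assumptions, not the project's actual values.

```python
from collections import deque

class WaveDetector:
    """Hypothetical sketch of temporal wave detection over wrist
    x-positions; all thresholds are illustrative assumptions."""
    def __init__(self, window: int = 20, min_reversals: int = 3, min_amplitude: float = 0.03):
        self.xs = deque(maxlen=window)
        self.min_reversals = min_reversals
        self.min_amplitude = min_amplitude

    def update(self, wrist_x: float) -> bool:
        self.xs.append(wrist_x)
        if len(self.xs) < self.xs.maxlen:
            return False
        xs = list(self.xs)
        # Signed frame-to-frame deltas; a sign flip is a direction reversal.
        deltas = [b - a for a, b in zip(xs, xs[1:])]
        reversals = sum(1 for d0, d1 in zip(deltas, deltas[1:]) if d0 * d1 < 0)
        amplitude = max(xs) - min(xs)
        return reversals >= self.min_reversals and amplitude >= self.min_amplitude

import math
detector = WaveDetector()
waving = [detector.update(0.5 + 0.1 * math.sin(t * 0.8)) for t in range(40)]
print(any(waving))  # oscillating wrist x triggers detection
```

Because this layer only consumes the wrist coordinate already produced by the hand pipeline, it adds dynamic gestures at essentially no extra compute cost.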

Performance Breakdown

+12%
GCN improvement over MLP baseline (44% to 56% accuracy)
19K
Total MLP parameters with a lightweight deployment profile
7
Emotion classes plus 13 gesture classes and wave detection in one live system

Why GCN Over MLP?

Modeling facial landmarks as a graph preserves structural relationships between features that raw image pipelines can miss, improving emotion recognition while keeping computation low.
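Concretely, "modeling landmarks as a graph" means pairing each node's coordinates with an explicit edge list. The indices below are placeholders, not the project's actual mesh topology (a real face mesh such as MediaPipe's tessellation defines hundreds of these connections).

```python
def build_edge_index(pairs):
    """Expand undirected landmark connections into the bidirectional
    (2, E) edge-list format graph convolution libraries expect."""
    src, dst = [], []
    for i, j in pairs:
        src += [i, j]  # add both directions: i -> j and j -> i
        dst += [j, i]
    return [src, dst]

# Placeholder connections between facial landmark indices.
mesh_pairs = [(0, 1), (1, 2), (2, 0), (2, 3)]
edge_index = build_edge_index(mesh_pairs)
print(len(edge_index[0]))  # 8 directed edges from 4 undirected ones
```

An MLP sees the same coordinates only as a flat vector and must rediscover this connectivity from data; the graph hands it to the model for free.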

Technical Stack

PyTorch · MediaPipe · GCN · MLP · OpenCV · PyTorch Geometric · Python · Computer Vision

Key Innovation

The system runs entirely on landmark geometry rather than heavy raw-image CNN processing, reducing overhead while keeping the application responsive on standard webcam setups.
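The geometry-only approach hinges on a normalization step like the sketch below: translate landmarks relative to a reference point and rescale, so the classifier sees a pose, not a position. The exact scheme (wrist origin, max-coordinate scaling) is an assumption for illustration.

```python
def normalize_landmarks(landmarks):
    """Sketch of landmark-geometry preprocessing (details assumed):
    make the 21 hand landmarks translation- and scale-invariant,
    then flatten to the 63-d vector the classifier consumes."""
    ox, oy, oz = landmarks[0]  # use the wrist landmark as the origin
    rel = [(x - ox, y - oy, z - oz) for x, y, z in landmarks]
    scale = max(max(abs(c) for c in p) for p in rel) or 1.0
    return [c / scale for p in rel for c in p]  # 63 values

# 21 dummy (x, y, z) landmark coordinates in image-normalized space.
hand = [(0.5 + 0.01 * i, 0.5 - 0.01 * i, 0.0) for i in range(21)]
vec = normalize_landmarks(hand)
print(len(vec))  # 63
```

Skipping per-frame CNN inference on raw pixels is what lets both pipelines share one webcam stream and still hit interactive frame rates.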

View Code on GitHub ->