Hand Gesture Interface

Motivation

Computers traditionally rely on precise input devices such as keyboards and mice. While effective, these interfaces are not ideal or accessible for everyone. Users with arthritis, carpal tunnel syndrome, or limited hand mobility may find conventional inputs uncomfortable or difficult to use.

Newer platforms such as VR/AR systems and smart TVs also lack intuitive ways of interacting without specialized controllers. A gesture-based system provides a natural, hands-free, hardware-free method of interacting with computers.

Our goal is to design a flexible, real-time gesture recognition system using only a standard webcam. This project allowed us to explore dataset creation, model fine-tuning, and real-time computer vision while building a practical interface for hands-free computer control.

Approach Overview

Our system processes live webcam video and recognizes static hand gestures using a fine-tuned YOLOv12n model. Each detected gesture is then mapped to a computer action such as adjusting volume or controlling media playback.

Webcam Capture
Frames are captured from the user's webcam in real time.
Gesture Recognition
The YOLOv12n model detects the hand and classifies it as one of:
- closed-hand
- closed-hand-thumb-out
- open-hand
- palm-open
- thumbs-up
- thumbs-down
Action Trigger
Using pynput, each gesture maps to a desktop action:
- thumbs-up → volume up
- thumbs-down → volume down
- palm-open → play / pause
- open-hand → skip backward
- closed-hand → move mouse
- closed-hand-thumb-out → mouse click

YOLOv12n was selected because of its speed, lightweight architecture, and strong performance when fine-tuned on small custom datasets.

YOLOv12n fine-tuned for six custom hand gestures.

Dataset

Since no existing dataset matched our specific gesture definitions, we created our own dataset from scratch. Images were collected from all team members under varying lighting, backgrounds, and hand orientations.

445 original images 1,071 images after augmentation 6 gesture classes

Augmentation included:

random rotation and flipping
brightness, exposure, and hue changes
cropping and scaling
background variation

These augmentations increased dataset diversity and improved robustness to lighting, perspective, and user-specific gesture differences.

Animated preview of our custom gesture dataset.

Manual labeling of each gesture using bounding boxes.

Implementation Details

The project consists of three main components: model training, live inference, and gesture-to-action mapping.

Model Training
We fine-tuned YOLOv12n on our custom dataset. Training was completed on an NVIDIA T4 GPU in under 30 minutes due to the model’s efficient architecture.

Real-Time Inference
Using OpenCV, each webcam frame is forwarded through the YOLO model. The detected gesture and bounding box are rendered in real time on the screen so users can see exactly what the model is predicting.

Gesture → Action Mapping
Using pynput, gestures trigger actions such as media control, volume changes, and mouse operations. Confidence thresholds and prediction smoothing help eliminate flicker during continuous use.

YOLOv12 inference powering real-time interaction.

Results

The fine-tuned YOLO model performed well across all gesture classes. It operated in real time with low latency, even on CPU-only systems. Some confusion occurred between visually similar gestures (e.g., open-hand and palm-open), but overall accuracy was strong for our intended use case.

In real-world testing, the system was able to control media playback, volume, and mouse actions reliably while remaining responsive and stable.

Below is a real-time demonstration of gesture recognition:

Confusion matrix showing model performance

Confusion matrix for the six gesture classes. Most errors occur between similar open-hand and palm-open gestures.

Problems Encountered & Lessons Learned

Data Imbalance

Early versions of the dataset had uneven gesture representation, causing biased predictions toward over-represented classes. Additional data collection and targeted augmentation helped resolve this.

Lack of Variety

Limited backgrounds, lighting, and subjects reduced model robustness. Expanding the dataset with more locations and users significantly improved performance.

Gesture Similarity

Certain gestures, especially open-hand vs. palm-open, required substantially more training samples due to their similarity. We learned that some classes inherently need more data than others.

Testing Methodology

Initially we tested on a single split, which gave misleading results. We improved evaluation consistency using a more systematic test set and by examining confusion patterns, not just overall accuracy.

Overall, the project highlighted the importance of dataset quality, variance, and the power of transfer learning for small, domain-specific tasks.

Future Work

Dynamic gesture sequences AR/VR integration Multi-hand support Hybrid YOLO + landmarks

Support for dynamic or sequential gestures instead of only static poses.
Integration with AR/VR headsets for natural, controller-free interaction.
Recognition of multi-hand and multi-user gestures.
Hybrid architectures combining YOLO detection with hand landmarks (e.g., MediaPipe).
Packaging as a standalone accessibility or smart TV control application.
Scaling to a much larger, multi-user dataset for improved generalization.

With more data and advanced modeling, this system could become a robust, low-cost, hands-free interface for accessibility, entertainment, and immersive computing.

Code & Downloads

All project materials are available below: