CS 566 · Computer Vision

Hand Gesture Interface

A real-time, webcam-based hand gesture system for controlling your computer without a mouse, powered by a fine-tuned YOLOv12 model.

Matej Popovski · Matthew Kasper · Aniket Patel

YOLOv12n · Custom Dataset · Real-Time Inference
YOLO detection overview diagram
Images (augmented): 1,071
Gesture Classes: 6
Train Time: ~30 min
Hardware: NVIDIA T4 + CPU

Motivation

Computers traditionally rely on precise input devices such as keyboards and mice. While effective, these interfaces are not ideal or accessible for everyone. Users with arthritis, carpal tunnel syndrome, or limited hand mobility may find conventional inputs uncomfortable or difficult to use.

Newer platforms such as VR/AR systems and smart TVs also lack intuitive ways of interacting without specialized controllers. A gesture-based system provides a natural, touch-free way of interacting with computers that needs no specialized hardware.

Our goal was to design a flexible, real-time gesture recognition system that needs only a standard webcam. The project let us explore dataset creation, model fine-tuning, and real-time computer vision while building a practical interface for touch-free computer control.

Approach Overview

Our system processes live webcam video and recognizes static hand gestures using a fine-tuned YOLOv12n model. Each detected gesture is then mapped to a computer action such as adjusting volume or controlling media playback.

  1. Webcam Capture
    Frames are captured from the user's webcam in real time.
  2. Gesture Recognition
    The YOLOv12n model detects the hand and classifies it as one of:
    • closed-hand
    • closed-hand-thumb-out
    • open-hand
    • palm-open
    • thumbs-up
    • thumbs-down
  3. Action Trigger
    Using pynput, each gesture maps to a desktop action:
    • thumbs-up → volume up
    • thumbs-down → volume down
    • palm-open → play / pause
    • open-hand → skip backward
    • closed-hand → move mouse
    • closed-hand-thumb-out → mouse click
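
The gesture-to-action table above can be sketched as a plain dictionary plus a small dispatcher. This is a minimal sketch and not the project's actual code: the action names, the `dispatch` and `make_pynput_handlers` helpers, the use of pynput's `Key.media_previous` for "skip backward", and the fixed 10-pixel mouse nudge are all assumptions made for illustration. pynput's media keys may also be undefined on some platforms.

```python
from typing import Callable, Dict

# Gesture label -> abstract action name, following the table above.
GESTURE_ACTIONS: Dict[str, str] = {
    "thumbs-up": "volume_up",
    "thumbs-down": "volume_down",
    "palm-open": "play_pause",
    "open-hand": "skip_backward",
    "closed-hand": "move_mouse",
    "closed-hand-thumb-out": "mouse_click",
}


def make_pynput_handlers() -> Dict[str, Callable[[], None]]:
    """Build action handlers backed by pynput.

    pynput is imported lazily so the mapping itself can be used and
    tested on machines without a display or without pynput installed.
    """
    from pynput.keyboard import Controller as Keyboard, Key
    from pynput.mouse import Button, Controller as Mouse

    keyboard, mouse = Keyboard(), Mouse()

    def tap(key) -> Callable[[], None]:
        # Press and release a single key.
        def _tap() -> None:
            keyboard.press(key)
            keyboard.release(key)
        return _tap

    return {
        "volume_up": tap(Key.media_volume_up),
        "volume_down": tap(Key.media_volume_down),
        "play_pause": tap(Key.media_play_pause),
        "skip_backward": tap(Key.media_previous),    # assumed key choice
        "move_mouse": lambda: mouse.move(10, 0),     # placeholder: nudge right
        "mouse_click": lambda: mouse.click(Button.left, 1),
    }


def dispatch(gesture: str, handlers: Dict[str, Callable[[], None]]) -> bool:
    """Run the handler for a gesture; return False if nothing is mapped."""
    action = GESTURE_ACTIONS.get(gesture)
    handler = handlers.get(action) if action is not None else None
    if handler is None:
        return False
    handler()
    return True
```

In use, something like `dispatch("thumbs-up", make_pynput_handlers())` would raise the system volume by one step. Keeping the table separate from the handlers makes it easy to remap gestures without touching the input code.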

YOLOv12n was selected because of its speed, lightweight architecture, and strong performance when fine-tuned on small custom datasets.

YOLOv12 architecture visualization

YOLOv12n fine-tuned for six custom hand gestures.

Dataset

Since no existing dataset matched our specific gesture definitions, we created our own dataset from scratch. Images were collected from all team members under varying lighting, backgrounds, and hand orientations.

445 original images · 1,071 images after augmentation · 6 gesture classes

Augmentation increased dataset diversity and improved robustness to lighting, perspective, and user-specific gesture differences.

Animated dataset preview

Animated preview of our custom gesture dataset.

Labeled training examples

Manual labeling of each gesture using bounding boxes.

Implementation Details

The project consists of three main components: model training, live inference, and gesture-to-action mapping.

Model Training
We fine-tuned YOLOv12n on our custom dataset. Training finished in about 30 minutes on an NVIDIA T4 GPU, thanks to the model’s lightweight architecture.
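
A fine-tuning run of this kind might look like the sketch below if the Ultralytics Python package is used. The source does not name the training framework, so the package, the `yolo12n.pt` weights filename, the directory layout, the class-index order, and the hyperparameters are all assumptions, and `dataset_yaml` and `fine_tune` are hypothetical helper names.

```python
from pathlib import Path

# The six gesture classes; the index order here is an assumption.
CLASS_NAMES = [
    "closed-hand",
    "closed-hand-thumb-out",
    "open-hand",
    "palm-open",
    "thumbs-up",
    "thumbs-down",
]


def dataset_yaml(root: str) -> str:
    """Return the text of a YOLO-style dataset config for the six classes."""
    lines = [f"path: {root}", "train: images/train", "val: images/val", "names:"]
    lines += [f"  {i}: {name}" for i, name in enumerate(CLASS_NAMES)]
    return "\n".join(lines) + "\n"


def fine_tune(root: str = "datasets/gestures", weights: str = "yolo12n.pt",
              epochs: int = 50, imgsz: int = 640):
    """Fine-tune pretrained weights on the custom dataset.

    Requires the `ultralytics` package, imported lazily because it is a
    heavy optional dependency. Epoch count and image size are placeholders.
    """
    from ultralytics import YOLO

    root_path = Path(root).resolve()
    config = root_path / "data.yaml"
    config.write_text(dataset_yaml(str(root_path)))

    model = YOLO(weights)  # start from pretrained weights (transfer learning)
    return model.train(data=str(config), epochs=epochs, imgsz=imgsz)
```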

Real-Time Inference
Using OpenCV, each webcam frame is forwarded through the YOLO model. The detected gesture and bounding box are rendered in real time on the screen so users can see exactly what the model is predicting.
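
The capture-predict-render loop described above could be sketched as follows. This assumes the Ultralytics package for inference and a weights file named `best.pt`, neither of which is confirmed by the source; `top_detection` and `run_webcam` are hypothetical helper names, and the 0.5 confidence cutoff is a placeholder.

```python
from typing import Optional, Sequence, Tuple


def top_detection(labels: Sequence[str], confidences: Sequence[float],
                  min_conf: float = 0.5) -> Optional[Tuple[str, float]]:
    """Return the most confident (label, confidence) at or above min_conf."""
    best: Optional[Tuple[str, float]] = None
    for label, conf in zip(labels, confidences):
        if conf >= min_conf and (best is None or conf > best[1]):
            best = (label, conf)
    return best


def run_webcam(weights: str = "best.pt", camera_index: int = 0) -> None:
    """Show annotated webcam frames until 'q' is pressed.

    Requires `opencv-python` and `ultralytics`, imported lazily.
    """
    import cv2
    from ultralytics import YOLO

    model = YOLO(weights)
    capture = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # camera unavailable or stream ended
                break
            result = model(frame, verbose=False)[0]
            labels = [result.names[int(c)] for c in result.boxes.cls]
            confs = [float(c) for c in result.boxes.conf]
            detection = top_detection(labels, confs)
            if detection is not None:
                print(f"{detection[0]} ({detection[1]:.2f})")
            # result.plot() returns the frame with boxes and labels drawn.
            cv2.imshow("Gesture preview", result.plot())
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        capture.release()
        cv2.destroyAllWindows()
```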

Gesture → Action Mapping
Using pynput, gestures trigger actions such as media control, volume changes, and mouse operations. Confidence thresholds and prediction smoothing help eliminate flicker during continuous use.
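
One common way to combine a confidence threshold with prediction smoothing is a majority vote over the last few frames. The sketch below illustrates that idea; the source does not describe the exact smoothing method, so the `GestureSmoother` class, the window size, the vote count, and the threshold are assumptions.

```python
from collections import Counter, deque
from typing import Deque, Optional


class GestureSmoother:
    """Report a gesture only when it dominates the last `window` frames."""

    def __init__(self, window: int = 5, min_votes: int = 4,
                 min_conf: float = 0.6) -> None:
        self.history: Deque[Optional[str]] = deque(maxlen=window)
        self.min_votes = min_votes
        self.min_conf = min_conf

    def update(self, label: Optional[str],
               confidence: float = 0.0) -> Optional[str]:
        """Add one frame's prediction and return the stable gesture, if any."""
        # Missing or low-confidence detections count as "no gesture".
        confident = label is not None and confidence >= self.min_conf
        self.history.append(label if confident else None)

        votes = Counter(g for g in self.history if g is not None)
        if not votes:
            return None
        gesture, count = votes.most_common(1)[0]
        return gesture if count >= self.min_votes else None
```

With these placeholder settings a gesture must appear confidently in four of the last five frames before it is reported, so a single misclassified frame neither triggers a new action nor interrupts the current one. A larger window gives steadier output at the cost of a slower response.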

YOLOv12 illustration

YOLOv12 inference powering real-time interaction.

Results

The fine-tuned YOLO model performed well across all gesture classes. It operated in real time with low latency, even on CPU-only systems. Some confusion occurred between visually similar gestures (e.g., open-hand and palm-open), but overall accuracy was strong for our intended use case.

In real-world testing, the system was able to control media playback, volume, and mouse actions reliably while remaining responsive and stable.

Below is a real-time demonstration of gesture recognition:

Confusion matrix showing model performance

Confusion matrix for the six gesture classes. Most errors occur between the visually similar open-hand and palm-open gestures.

Problems Encountered & Lessons Learned

Data Imbalance

Early versions of the dataset had uneven gesture representation, causing biased predictions toward over-represented classes. Additional data collection and targeted augmentation helped resolve this.
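
Imbalance of this kind can be caught early by counting labeled boxes per class. The sketch below assumes YOLO-format label files, with one `class_id x_center y_center width height` row per box; the source does not state the label format, and `count_instances` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, List


def count_instances(label_dir: str, class_names: List[str]) -> Dict[str, int]:
    """Count labeled boxes per class across YOLO-format .txt label files."""
    counts: Counter = Counter()
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            fields = line.split()
            if fields:  # skip blank lines
                counts[class_names[int(fields[0])]] += 1
    # Include classes with zero instances so gaps are visible.
    return {name: counts.get(name, 0) for name in class_names}
```

Classes whose counts fall well below the others are candidates for more data collection or targeted augmentation.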

Lack of Variety

Limited backgrounds, lighting, and subjects reduced model robustness. Expanding the dataset with more locations and users significantly improved performance.

Gesture Similarity

Certain gestures, especially open-hand vs. palm-open, required substantially more training samples due to their similarity. We learned that some classes inherently need more data than others.

Testing Methodology

Initially we tested on a single split, which gave misleading results. We improved evaluation consistency using a more systematic test set and by examining confusion patterns, not just overall accuracy.
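
Looking at confusion patterns rather than a single accuracy number can be sketched as below. This is a simplified, classification-style view that assumes one true label and one predicted label per test image; a full detection evaluation would also account for missed and spurious boxes. The function names are hypothetical, and the labels are toy data for illustration only, not the project's results.

```python
from typing import Dict, List, Sequence


def confusion_matrix(true_labels: Sequence[str], predicted_labels: Sequence[str],
                     class_names: Sequence[str]) -> List[List[int]]:
    """Build a matrix with true classes as rows and predictions as columns."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = [[0] * len(class_names) for _ in class_names]
    for truth, pred in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_recall(matrix: List[List[int]],
                     class_names: Sequence[str]) -> Dict[str, float]:
    """Fraction of each true class that was predicted correctly."""
    recall = {}
    for i, name in enumerate(class_names):
        total = sum(matrix[i])
        recall[name] = matrix[i][i] / total if total else 0.0
    return recall


# Toy labels for illustration only; these are not the project's results.
classes = ["open-hand", "palm-open", "thumbs-up"]
truth = ["open-hand", "open-hand", "palm-open", "palm-open", "thumbs-up", "thumbs-up"]
preds = ["open-hand", "palm-open", "palm-open", "palm-open", "thumbs-up", "thumbs-up"]

matrix = confusion_matrix(truth, preds, classes)
recall = per_class_recall(matrix, classes)
```

In this toy case five of the six predictions are correct, but the per-class view shows that every error is an open-hand sample predicted as palm-open, which a single accuracy figure would hide.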

Overall, the project highlighted the importance of dataset quality and variety, and the power of transfer learning for small, domain-specific tasks.

Future Work

Dynamic gesture sequences · AR/VR integration · Multi-hand support · Hybrid YOLO + landmarks

With more data and more advanced modeling, this system could become a robust, low-cost, touch-free interface for accessibility, entertainment, and immersive computing.

Code & Downloads

All project materials are available below: