
Image Recognition (OCR)

Self-hosted solution for optical character recognition using the lightweight open-source Qwen3-VL-2B-Instruct vision-language model

Self-Hosted Solution

This is a self-hosted solution for optical character recognition (OCR) built on the lightweight open-source Qwen3-VL-2B-Instruct vision-language model from Hugging Face. The model interprets images and extracts any text they contain.

In a production environment, a self-hosted solution keeps data private, and a model with a larger parameter count can be hosted on a cloud provider such as Google Cloud, AWS, or Azure.


Extract text from images using GPU-accelerated OCR with automatic image preprocessing. Optimized for receipt text extraction and document processing.

View on GitHub

Example: Receipt Text Extraction

A receipt with faded text and a sub-optimal angle was intentionally chosen to test the model's ability to handle low-quality images.

The Qwen3-VL-2B-Instruct model successfully extracts all text from restaurant receipts, including:

  • Restaurant name and location
  • Order details and itemized pricing
  • Subtotal, tax, and total amounts
  • Suggested tip percentages
  • Payment information
RUBY SOFT
Located in the Garment District
587 King St W
Toronto, ON M5V 1V5
...
Total: $55.37
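As a rough sketch, the extraction above might be wired up with the Hugging Face transformers library along these lines. The repo's actual model.py may differ; the `AutoModelForImageTextToText` class, the chat-template message format, and the prompt string are assumptions based on the standard transformers vision-language API, not this project's exact code.

```python
# Sketch of the OCR pipeline; the real model.py may differ.
# Heavy torch/transformers imports are deferred into run_ocr so the
# message-building helper stays usable without a GPU environment.

RECEIPT_PROMPT = "Extract all text from this receipt, preserving the layout."

def build_messages(image_path: str, prompt: str = RECEIPT_PROMPT) -> list:
    """Build the chat-style message list expected by VLM chat templates."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

def run_ocr(image_path: str, prompt: str = RECEIPT_PROMPT) -> str:
    """Run the vision-language model on one image and return the text."""
    import torch
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "Qwen/Qwen3-VL-2B-Instruct"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype="auto"
    ).to(device)

    messages = build_messages(image_path, prompt)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(device)

    output_ids = model.generate(**inputs, max_new_tokens=1024)
    # Drop the prompt tokens from the sequence before decoding.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

Calling `run_ocr("receipt.jpg")` would then return the extracted text, which the script writes to example.txt.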

Features

GPU-Accelerated OCR

Leverages CUDA for fast inference using the lightweight Qwen3-VL-2B-Instruct model

🧾

Receipt Text Extraction

Optimized prompts for extracting text from receipts and documents

🖼️

Automatic Preprocessing

Resizes images to reduce VRAM usage while maintaining quality

🔧

Customizable

Easy to customize prompts for different use cases and information extraction
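The automatic preprocessing step can be sketched as a simple longest-side cap. The 1728 px default comes from this project's VRAM note; the function names are illustrative, not the repo's actual API.

```python
def fit_within(width: int, height: int, max_side: int = 1728) -> tuple:
    """Scale (width, height) down so the longest side is at most max_side,
    preserving aspect ratio. Images already within the limit are unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

def preprocess(image_path: str, max_side: int = 1728):
    """Open an image and resize it to fit within max_side (requires Pillow)."""
    from PIL import Image  # deferred so fit_within works without Pillow
    img = Image.open(image_path).convert("RGB")
    return img.resize(fit_within(*img.size, max_side), Image.LANCZOS)
```

For example, a 3456×2592 photo would be halved to 1728×1296 before inference, roughly quartering the pixel count the model must attend to.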

Quick Start

1. Install PyTorch with CUDA support:

(Adjust CUDA version as needed for your system)

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

2. Install project dependencies:

pip install -r requirements.txt

3. Run the model:

python model.py

4. View results:

The extracted text will be saved to example.txt and printed to the console.

Requirements

Software:

  • Python 3.11+
  • PyTorch with CUDA support
  • CUDA-compatible GPU (required for model inference)

Customization Options:

  • Change the image path in model.py
  • Modify the prompt for different use cases
  • Adjust the image resolution for VRAM management
  • Use a general-purpose prompt such as "Describe the image in detail"
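One way to organize those customization points is a small set of named prompt presets. These exact strings are illustrative; only the "Describe the image in detail" general-purpose prompt comes from this page.

```python
# Illustrative prompt presets; swap IMAGE_PATH and the chosen preset
# in model.py to repurpose the pipeline for a new use case.
IMAGE_PATH = "example.jpg"  # change to your image

PROMPTS = {
    "receipt": "Extract all text from this receipt, including totals and taxes.",
    "document": "Transcribe all text in this document, preserving line breaks.",
    "general": "Describe the image in detail",
}

def get_prompt(use_case: str) -> str:
    """Look up a prompt preset, falling back to the general-purpose one."""
    return PROMPTS.get(use_case, PROMPTS["general"])
```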

Google Vision Utility

The google_vision_util/ directory contains a small utility script for extracting text from Google Vision API JSON responses. This can be useful if you're working with Google Cloud Vision API outputs and need to parse the JSON format.

See google_vision_util/extract_text.py for usage details.
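The utility's job can be sketched roughly like this. The actual extract_text.py may differ; the JSON shape follows the documented Vision API response, where `fullTextAnnotation.text` (or equivalently `textAnnotations[0].description`) holds the full detected text.

```python
def extract_text(response: dict) -> str:
    """Pull the full detected text out of a Google Cloud Vision API
    TEXT_DETECTION response dict. Returns "" if no text was found."""
    results = []
    for item in response.get("responses", []):
        # fullTextAnnotation.text holds the whole page of text;
        # textAnnotations[0].description is an equivalent fallback.
        full = item.get("fullTextAnnotation", {}).get("text")
        if not full:
            annotations = item.get("textAnnotations", [])
            full = annotations[0]["description"] if annotations else ""
        if full:
            results.append(full)
    return "\n".join(results)
```

Usage would be reading the saved JSON with json.load and passing the resulting dict to extract_text.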

Important Notes
  • GPU Recommended: This project defaults to using GPU for inference but falls back to CPU if no GPU is available.
  • VRAM Usage: The default image resolution is limited to 1728px to reduce VRAM usage. If you have more VRAM available, you can increase or remove this limit.
  • Model Size: The Qwen3-VL-2B-Instruct model is relatively lightweight (~2B parameters) but still requires significant GPU memory.