Image Recognition (OCR)
Self-hosted solution for optical character recognition using the lightweight open-source Qwen3-VL-2B-Instruct vision-language model
Self-Hosted Solution
This is a self-hosted solution for optical character recognition (OCR) using the lightweight open-source Qwen3-VL-2B-Instruct vision-language model from Hugging Face. The model interprets images and extracts the text they contain.
In a production environment, a self-hosted solution ensures data privacy, and a model with a larger parameter count can be hosted on a cloud service provider such as Google Cloud, AWS, or Azure.
Extract text from images using GPU-accelerated OCR with automatic image preprocessing. Optimized for receipt text extraction and document processing.
Example: Receipt Text Extraction
This example intentionally uses a receipt with faded text photographed at a sub-optimal angle to test the model's ability to handle low-quality images.
The Qwen3-VL-2B-Instruct model successfully extracts all text from restaurant receipts, including:
- Restaurant name and location
- Order details and itemized pricing
- Subtotal, tax, and total amounts
- Suggested tip percentages
- Payment information
RUBY SOFT Located in the Garment District 587 King St W Toronto, ON M5V 1V5 ... Total: $55.37
Features
GPU-Accelerated OCR
Leverages CUDA for fast inference using the lightweight Qwen3-VL-2B-Instruct model
Receipt Text Extraction
Optimized prompts for extracting text from receipts and documents
Automatic Preprocessing
Resizes images to reduce VRAM usage while maintaining quality
Customizable
Easy to customize prompts for different use cases and information extraction
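The automatic preprocessing caps the longest image side (1728px by default, per the notes below) to keep VRAM usage down. A minimal sketch of that resize calculation, where the function name is an assumption rather than the project's actual code:

```python
def capped_size(width: int, height: int, max_side: int = 1728) -> tuple[int, int]:
    """Compute target dimensions that cap the longest side at max_side
    while preserving aspect ratio; smaller images are left unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

# Example: a 4032x3024 phone photo scales down to 1728x1296.
print(capped_size(4032, 3024))  # → (1728, 1296)
```

Scaling both sides by the same factor keeps the aspect ratio, so text geometry is preserved for the model.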
Quick Start
1. Install PyTorch with CUDA support:
(Adjust CUDA version as needed for your system)
2. Install project dependencies:
3. Run the model:
4. View results:
The extracted text will be saved to example.txt and printed to the console.
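The steps above might look like the following in a terminal. This is a hedged sketch: the CUDA index URL suffix (cu121) and the requirements.txt file name are assumptions to adapt to your setup, while model.py and example.txt come from this document.

```shell
# 1. Install PyTorch with CUDA support (swap cu121 for your CUDA version):
pip install torch --index-url https://download.pytorch.org/whl/cu121

# 2. Install project dependencies (file name assumed):
pip install -r requirements.txt

# 3. Run the model:
python model.py

# 4. View results saved alongside the console output:
cat example.txt
```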
Requirements
Software:
- Python 3.11+
- PyTorch with CUDA support
- CUDA-compatible GPU (recommended for model inference)
Customization Options:
- Change the image path in model.py
- Modify the prompt for different use cases (general use case: "Describe the image in detail")
- Adjust the image resolution to manage VRAM usage
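Swapping prompts is a one-line change. As a sketch (the variable and function names here are assumptions, not the project's actual code in model.py), Qwen-VL models take a chat-style message list pairing the image with the prompt text:

```python
# Hypothetical sketch; names are assumptions, but the message shape
# matches the chat format Qwen-VL processors expect.
RECEIPT_PROMPT = "Extract all text from this receipt."
GENERAL_PROMPT = "Describe the image in detail."  # general use case

def build_messages(image_path: str, prompt: str) -> list:
    """Build the chat-format message list passed to the model's processor."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

msgs = build_messages("receipt.jpg", GENERAL_PROMPT)
```

Changing the use case then only means passing a different prompt string.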
Google Vision Utility
The google_vision_util/ directory contains a small utility script for extracting text from Google Vision API JSON responses. This can be useful if you're working with Google Cloud Vision API outputs and need to parse the JSON format.
See google_vision_util/extract_text.py for usage details.
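As a rough sketch of what such a parser does (assuming the standard Vision API annotate-image response shape, where fullTextAnnotation.text holds the whole document and textAnnotations[0].description serves as a fallback; this is not the actual contents of extract_text.py):

```python
import json

def extract_text(raw: str) -> str:
    """Pull the detected text out of a Google Vision API JSON response."""
    data = json.loads(raw)
    # Batch responses wrap each result in a "responses" list.
    for resp in data.get("responses", [data]):
        full = resp.get("fullTextAnnotation", {}).get("text")
        if full:
            return full
        annotations = resp.get("textAnnotations", [])
        if annotations:
            return annotations[0].get("description", "")
    return ""

sample = '{"responses": [{"fullTextAnnotation": {"text": "RUBY SOFT"}}]}'
print(extract_text(sample))  # → RUBY SOFT
```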
- GPU Recommended: This project defaults to using GPU for inference but falls back to CPU if no GPU is available.
- VRAM Usage: The default image resolution is limited to 1728px to reduce VRAM usage. If you have more VRAM available, you can increase or remove this limit.
- Model Size: The Qwen3-VL-2B-Instruct model is relatively lightweight (~2B parameters) but still requires significant GPU memory.