Welcome to the first article in our Inside ScannerPro series, where we explore the advanced technology that powers our mobile scanning app. Throughout this series, we’ll take you on a technical journey, discussing how ScannerPro automatically detects documents, aligns them, removes shadows, and enhances contrast.
Today, we’re focusing on one of its core features: real-time document detection and the valuable lessons we’ve learned along the way.
The challenge: seamless scanning experience
The task seems simple on paper: find the corners and edges of a document in a frame, then crop it from the background. The challenge is making this happen in real time, so users can adjust their camera view on the fly for the perfect scan. To achieve this, we needed our detector to run fast, ideally within the roughly 33-millisecond budget of a 30-frames-per-second camera feed, on a mobile CPU/GPU and within a limited memory footprint.
First iteration: Classic Computer Vision
Several years ago, our team began addressing this problem using classic computer vision algorithms, built around time-tested components. Why this approach? It's transparent, easy to debug, doesn't need much labeled training data, runs efficiently on CPUs, and is light on memory usage.
Edge Detection
First up: detecting edges. This is a foundational task in image processing, and there are plenty of established algorithms (Canny, LSD, Laplace, etc.). But in real-world conditions, the results were underwhelming. So, we built a custom edge detection algorithm that offered a good balance of accuracy and performance.
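For reference, here is what an off-the-shelf baseline for this step looks like. This is a minimal Canny sketch using OpenCV, not our custom detector, and the thresholds are illustrative:

```python
import cv2
import numpy as np

def edge_map(frame_bgr: np.ndarray) -> np.ndarray:
    """Produce a binary edge map for a camera frame (Canny baseline)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Blur first so sensor noise doesn't turn into spurious edges.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # Hysteresis thresholds are scene-dependent; these are common defaults.
    return cv2.Canny(blurred, threshold1=50, threshold2=150)
```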
Line Detection
With a reliable edge map in hand, we moved to detecting straight lines. For this, we used variations of the Hough Transform—a classic image processing tool designed for this purpose.
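To give a flavor of this step, here is a sketch using OpenCV's probabilistic Hough variant. The parameters are purely illustrative; our production code uses its own tuned implementation:

```python
# 'edges' is the binary edge map from the previous step.
lines = cv2.HoughLinesP(
    edges,
    rho=1,               # distance resolution of the accumulator, in pixels
    theta=np.pi / 180,   # angular resolution of the accumulator, in radians
    threshold=80,        # minimum accumulator votes to accept a line
    minLineLength=60,    # discard segments shorter than this
    maxLineGap=10,       # bridge small gaps between collinear segments
)
# Each entry is [[x1, y1, x2, y2]]: the endpoints of one detected segment.
```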
Document Boundary Detection
Given that most documents are rectangular in the physical world, they appear as convex quadrilaterals when projected onto a 2D image. The task was to find the “best” quadrilateral to serve as a proxy for the document's boundary.
We computed line intersections to pinpoint potential document corners, applying simple geometric constraints to guide the process. By evaluating all possible quadrilaterals and scoring them based on the probabilities from the edge detector, we identified the highest-scoring quadrilateral as the document’s boundary.
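The sketch below illustrates the idea under simplifying assumptions: it scores each candidate quadrilateral by how much of its perimeter lies on edge pixels, whereas our production scorer uses the edge detector's probability estimates and prunes candidates rather than enumerating them exhaustively:

```python
from itertools import combinations

import cv2
import numpy as np

def intersect(l1, l2):
    """Intersection of two infinite lines, each given as (x1, y1, x2, y2);
    returns None if the lines are (nearly) parallel."""
    x1, y1, x2, y2 = l1
    x3, y3, x4, y4 = l2
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(d) < 1e-9:
        return None
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    return ((a * (x3 - x4) - (x1 - x2) * b) / d,
            (a * (y3 - y4) - (y1 - y2) * b) / d)

def quad_score(corners, edges):
    """Fraction of the quad's rasterized perimeter lying on edge pixels."""
    mask = np.zeros_like(edges)
    cv2.polylines(mask, [np.int32(corners)], isClosed=True, color=255, thickness=3)
    perimeter = cv2.countNonZero(mask)
    support = cv2.countNonZero(cv2.bitwise_and(mask, edges))
    return support / max(perimeter, 1)

def best_quad(corner_candidates, edges):
    """Exhaustively score convex quadrilaterals over candidate corners."""
    best, best_score = None, 0.0
    for combo in combinations(corner_candidates, 4):
        hull = cv2.convexHull(np.float32(combo))
        if len(hull) != 4:   # geometric constraint: must form a convex quad
            continue
        corners = hull.reshape(4, 2)
        score = quad_score(corners, edges)
        if score > best_score:
            best, best_score = corners, score
    return best, best_score
```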
This method worked well and ran in real time on most iPhones, thanks to vectorization techniques (leveraging specialized hardware instructions like Arm Neon and Apple's Accelerate framework). The pipeline could handle around 20-25 frames per second.
Second iteration: Enter Deep Learning
While our first iteration performed solidly, it struggled in certain environments. Low or uneven lighting, low-contrast backgrounds, or complex cases (e.g., a white document on a white table) could trip up the edge map, leading to poor results.
Around the same time, deep neural networks (DNNs) were gaining traction, showing tremendous success in various computer vision tasks like classification and segmentation. But DNNs are heavy—both computationally and in terms of memory—making them difficult to deploy on mobile devices for real-time processing. Still, we were intrigued and decided to give it a shot.
Key Challenges
Several obstacles stood in our way:
- Data scarcity: We lacked the vast amounts of labeled data needed to train a robust network.
- Diversity: Our dataset needed to capture different types of documents in various lighting and environmental conditions.
- Performance: DNNs are demanding, and we needed a lightweight solution to work within mobile constraints.
To address the data issue, we built an in-house labeled dataset. Although it was smaller than modern public datasets, transfer learning enabled us to train a compact semantic segmentation network. We also used aggressive data augmentation techniques such as rotation, scaling, and color jittering to simulate a wide range of document appearances.
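As an illustration, an augmentation pipeline of this kind might look like the following (torchvision-based and purely indicative; our actual transform set and parameters differ):

```python
import torchvision.transforms as T

# Illustrative augmentations for training images. Note: for segmentation,
# geometric transforms (rotation, cropping) must be applied to the label
# masks as well, which this simple Compose does not handle.
augment = T.Compose([
    T.RandomRotation(degrees=15),
    T.RandomResizedCrop(size=256, scale=(0.7, 1.0)),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3, hue=0.05),
])
```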
This network takes an image as input and outputs a binary mask, marking document areas as white (foreground) and everything else as black (background). We then run a post-processing algorithm (similar to the Hough transform) to extract the quad's corner positions.
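Our exact post-processing is Hough-like, as noted. A common, simpler alternative that conveys the idea is to take the largest contour of the mask and approximate it with a four-vertex polygon:

```python
import cv2
import numpy as np

def mask_to_quad(mask: np.ndarray):
    """Extract four document corners from a binary segmentation mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    # Approximate the contour with progressively coarser polygons
    # until we get exactly four vertices.
    peri = cv2.arcLength(largest, closed=True)
    for eps in np.linspace(0.01, 0.05, 5):
        approx = cv2.approxPolyDP(largest, eps * peri, closed=True)
        if len(approx) == 4:
            return approx.reshape(4, 2)
    return None
```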
However, the initial pipeline was too slow, achieving only 5 frames per second on an iPhone X.
Optimizations
We made several key optimizations to improve performance:
- Neural Network Architecture: We re-architected the network to run on Apple's Neural Engine (ANE); see the conversion sketch after this list.
- Post-Processing: We ported our post-processing algorithm to Metal, Apple’s GPU framework.
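As an illustration of the first point, converting a model with coremltools lets Core ML schedule supported layers on the ANE. The model name, input shape, and tracing step below are hypothetical, and real ANE residency also depends on using ANE-friendly ops and layouts:

```python
import coremltools as ct

# 'traced_model' is assumed to be a torch.jit.trace of the network.
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="frame", shape=(1, 3, 256, 256))],
    compute_units=ct.ComputeUnit.ALL,  # allow CPU, GPU, and ANE scheduling
    convert_to="mlprogram",
)
mlmodel.save("DocumentSegmenter.mlpackage")
```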
These optimizations brought the pipeline’s speed up to acceptable levels on devices with ANE chips, and quality tests showed the new algorithm significantly outperformed the classical approach in challenging scenarios.
This led to the development of a hybrid border detection system: on devices with ANE chips, we ran the DNN-based algorithm, while older devices continued using the classical computer vision approach.
Third iteration: Real-Time Refinement
While the second iteration was a leap forward, we saw room for improvement.
Performance Bottlenecks
One issue was the mismatch between camera capture and processing speed. The camera captures 30 frames per second, but our algorithm could process only 10-15 fps. As a result, frames were displayed for which no quad (the detected document boundary) had been computed, so the overlay lagged behind the live video and the experience felt choppy.
Additionally, we wanted to standardize the user experience across all devices, reducing reliance on the hybrid system.
Keypoint Detection
To solve these issues, we redesigned our document detection pipeline:
- Keypoint Detection: We built a lightweight neural network (based on the MobileNet architecture) that predicts keypoints for the document corners directly, significantly speeding up detection on both new and older devices. Switching from segmentation to keypoint detection simplified the network architecture and streamlined the pipeline: keypoint outputs need minimal post-processing, which cuts computational overhead and makes real-time mobile inference practical (see the sketch after this list).
- Improved Quality: Despite its smaller size, the new network delivered even better results than the segmentation network.
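Here is a sketch of what such a corner regressor could look like in PyTorch. The backbone choice, head sizes, and output parameterization are assumptions, not our production architecture:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CornerRegressor(nn.Module):
    """MobileNet backbone with a small head that regresses the 4 document
    corners as 8 normalized coordinates. A sketch only."""

    def __init__(self):
        super().__init__()
        backbone = models.mobilenet_v3_small(weights="DEFAULT")
        self.features = backbone.features   # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(576, 128),            # 576 = v3-small feature channels
            nn.Hardswish(),
            nn.Linear(128, 8),
            nn.Sigmoid(),                   # coordinates normalized to [0, 1]
        )

    def forward(self, x):
        # x: (N, 3, H, W) image batch -> (N, 4, 2) corner coordinates
        return self.head(self.pool(self.features(x))).view(-1, 4, 2)
```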
Two-Stage Detection
We also implemented a two-stage detection process, since balancing real-time performance with energy efficiency was a key concern. During video streaming, we run a small, energy-efficient keypoint detector. After the user captures an image, we refine the detection using the more computationally expensive segmentation network to remove any background noise. This approach keeps energy consumption low while maintaining high detection accuracy.
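In pseudocode, the dispatch looks roughly like this; `keypoint_detector`, `segmentation_net`, `overlay_quad`, and `crop_and_deskew` are hypothetical placeholders for the components described above:

```python
def on_preview_frame(frame):
    """Streaming path: the cheap keypoint detector keeps the preview responsive."""
    corners = keypoint_detector(frame)   # small, energy-efficient network
    overlay_quad(frame, corners)         # draw the live boundary for the user

def on_capture(photo):
    """Capture path: one-off refinement with the heavier segmentation network."""
    mask = segmentation_net(photo)       # slower, more accurate
    corners = mask_to_quad(mask)         # post-process the mask into a quad
    return crop_and_deskew(photo, corners)
```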
Enhanced Stability
To improve stability, we implemented a Kalman filter-based tracker that fuses IMU data from the device's accelerometer and gyroscope with the neural network's keypoint predictions. This let us run the detector only when significant movement was detected, reducing the computational load. The Kalman filter also smooths out noisy predictions caused by motion blur, delivering more reliable document boundaries even under challenging conditions.
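A minimal per-corner filter of this kind might look like the following. This constant-velocity sketch smooths detector measurements only, whereas our production tracker also fuses the IMU signal into the motion model:

```python
import numpy as np

class CornerKalman:
    """Constant-velocity Kalman filter for one corner; state = [x, y, vx, vy]."""

    def __init__(self, dt=1 / 30, q=1e-2, r=4.0):
        self.F = np.eye(4)                 # state transition (constant velocity)
        self.F[0, 2] = self.F[1, 3] = dt   # position advances by velocity * dt
        self.H = np.eye(2, 4)              # we observe position only
        self.Q = q * np.eye(4)             # process noise covariance
        self.R = r * np.eye(2)             # measurement noise covariance (px^2)
        self.x = np.zeros(4)               # state estimate
        self.P = np.eye(4) * 1e3           # state covariance (uncertain at start)

    def update(self, z):
        # Predict the next state from the motion model.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct using the detector's measured corner z = [x, y].
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]   # smoothed corner position
```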
Conclusion
Our latest pipeline runs at over 30 fps on all supported devices, delivering both real-time performance and high-quality document detection. This system is currently in experimental testing with a small cohort of users, and we’re excited to continue improving ScannerPro’s capabilities.
Future directions
This deep dive into document detection is just the beginning. In future articles in the Inside ScannerPro series, we'll explore other cutting-edge technologies that power ScannerPro, from advanced image processing techniques to machine learning models that make our app smarter and faster.
Andrii Denysov, ML Lead Engineer