Real-time OCR of Unstructured Text
Tips and tricks on how to do fast OCR with Python and Tesseract
Unstructured text is the opposite of what you find in a printed textbook: text in random places, with different fonts, sizes, and colors. Traditionally, OCR (optical character recognition / text extraction) of such unstructured text has been a difficult problem. When I attempted this a decade ago, it required a lot of preprocessing and was not even close to being real time.
The open source solutions have come a long way since then, and I am happy to report that it’s now viable to run them at real-time speeds. This blog post outlines some tips and tricks that I picked up while playing around with the currently available cutting edge.
Tip #1: Skip the detection step
Some casual internet research would lead you to believe that the way to approach unstructured text detection is to first detect the areas containing text and then feed them individually to an OCR engine like Tesseract. This effectively reduces your unstructured text problem into multiple structured ones. Ref:
- EAST: An Efficient and Accurate Scene Text Detector
- End to End Text Recognition Model Deployment on CPU, GPU, and VPU With OpenVINO
So, I tried an EAST detection implementation and the OpenVINO text detection and recognition implementations. I was unimpressed. OpenVINO detection was in the 200ms range using CPU (AVX-512) acceleration; changing precision didn’t help. It also returned bounding boxes for each character, which meant that I had to write additional logic to merge adjacent bounding boxes into text blobs. It missed a lot of characters, and recognition accuracy was pretty low too. I am sure there is room for optimization and improvement here, but out of the box it is nowhere near what I was looking for.
Tesseract, on the other hand, just worked! In fact, since v4, it has gotten so good at unstructured text detection that you really shouldn’t bother with anything else. You can skip the text detection step entirely and just pass the full image. It is miles ahead in terms of both speed and accuracy.
At least this was the case for my requirement of processing 480p-720p images. I saw processing latencies between 60ms and 500ms with Tesseract, proportional to the amount of text in the image.
Tip: Before you try to do text detection, just feed your input to Tesseract. You might find that it’s good enough for your application.
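As a concrete starting point, here is a minimal sketch of that approach, assuming the pytesseract wrapper and OpenCV for loading a frame (the file name is just a placeholder):

```python
import cv2
import pytesseract

# No detection step: hand the whole frame to Tesseract and
# let it find the text regions on its own.
frame = cv2.imread("frame.png")  # placeholder input image
text = pytesseract.image_to_string(frame)
print(text)
```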
Tip #2: Less is more for preprocessing
It used to be the case, 10 years ago, that you needed to really massage and clean up your images before feeding them to Tesseract. Not anymore. I played around with the old norms of noise filtering, binarizing, perspective correction, etc. Only converting to grayscale made any improvement. Everything else either just added latency or made the accuracy worse. This is probably attributable to the fact that Tesseract 4 moved to a deep-learning-based LSTM engine.
Tip: Convert your images from RGB to grayscale to improve performance; nothing more.
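For reference, that single preprocessing step looks like this with OpenCV, assuming frames arrive as BGR arrays (OpenCV’s default channel order; the file name is a placeholder):

```python
import cv2
import pytesseract

frame = cv2.imread("frame.png")  # placeholder input image

# The one preprocessing step that helped: drop the color information.
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
text = pytesseract.image_to_string(gray)
```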
Tip #3: Initialize Tesseract only once
You will notice that if you use the Tesseract command line tool, or a wrapper to it like pytesseract, processing latency is in the hundreds of milliseconds. That’s not bad, but it can be better. It turns out that initializing Tesseract takes a lot of cycles. So what you want to do, if you are going to process a video stream, is access the development libraries directly and initialize Tesseract once for the lifetime of your app. Python users can use the tesserocr module, which neatly wraps the C++ API calls. This change alone can bring latency down by up to an order of magnitude.
Tip: Ignore Google’s top suggestions and use the tesserocr Python module instead of pytesseract
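A minimal sketch of the initialize-once pattern with tesserocr, assuming PIL images as input (the function name and input file are placeholders):

```python
import tesserocr
from PIL import Image

# Pay the initialization cost (loading language data, etc.) exactly once.
api = tesserocr.PyTessBaseAPI()

def ocr_frame(image: Image.Image) -> str:
    # Reuse the same API instance for every frame.
    api.SetImage(image)
    return api.GetUTF8Text()

print(ocr_frame(Image.open("frame.png")))  # placeholder input image
```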
Tip #4: Leverage multiple threads
Another tip is to leverage multiple threads to process the video stream. Use something like the thread-pool based OpenCV sample linked here: https://github.com/opencv/opencv/blob/master/samples/python/video_threaded.py
However, there is a catch with using tesserocr and multiple threads: the C++ APIs are not thread safe, meaning you can’t share a tesserocr instance among threads. It will crash in many different and exciting ways. So each thread in your thread pool needs its own instance of tesserocr. In CPython this can be easily achieved by subclassing threading.local. Reference: https://github.com/python/cpython/blob/master/Lib/_threading_local.py
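Here is a sketch of that per-thread pattern, assuming worker threads fed by a thread pool (the class and function names are placeholders):

```python
import threading
import tesserocr

class TessLocal(threading.local):
    """threading.local subclass: __init__ runs once per thread on first
    access, so each worker thread gets its own Tesseract instance."""
    def __init__(self):
        self.api = tesserocr.PyTessBaseAPI()

tess = TessLocal()

def ocr_frame(image):
    # Safe to call from any worker thread: tess.api is thread-local.
    tess.api.SetImage(image)
    return tess.api.GetUTF8Text()
```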
Note that Tesseract does leverage multiple CPU cores for its OCR operations. So depending on your scene complexity, target frame rate, and PC specs, using a thread pool may not net a big improvement. But it would definitely help improve the app’s responsiveness by freeing up the main thread. So, I’d recommend incurring the complexity of implementing a thread pool for the benefit of better overall responsiveness.
Tip: Leverage the power of threading and thread-local objects to OCR in parallel
Tip #5: Moar GHz
My last tip is about choosing a CPU for this kind of workload. I took an i7-10700 CPU and tested throughput under different permutations of disabled cores and frequency limits. What I can conclude is that Tesseract loves CPU frequency: throughput scales linearly with clock speed, while scaling with the number of cores was not very pronounced. E.g., 4 cores @ 4.5GHz would outperform 8 cores @ 3GHz. So, if you need to buy dedicated hardware for this kind of application, avoid laptop-class CPUs and go for high-frequency desktop-class CPUs. The new AMD Ryzen 5000 CPUs would likely be the best choice.
Tip: Prefer higher frequency desktop CPUs