I periodically read about Optical Character Recognition (OCR) software, and keep thinking how cool it would be if someone in the open source community came up with a good OCR package. While prices are far better than they were about 10 years ago when I first looked into OCR software, it can still be expensive to get going with OCR. Now, Google has announced work on an open source OCR system. There is a technology preview available, with an alpha release targeted for the third quarter. The code page points out that no real training of the character recognition engine has taken place yet. I wouldn’t be surprised at all to see a sister project spring up that uses distributed tools for training, letting thousands of open source fans get involved in the project even if they aren’t able to contribute on the code side.
The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90’s and deployed by the US Census Bureau, and novel high-performance layout analysis methods.
OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.
I’ll be watching this project. It could become another highly significant open source tool in the near future.
[tags]Google working on open source OCR software, Open source Optical Character Recognition package from Google[/tags]