Open source OCR software by Google

In April 2007, Google sponsored the development of open source OCR software called OCRopus at the IUPR Research Group. It was a document analysis and optical character recognition (OCR) system. Some of its features included:

  • Pluggable layout analysis
  • Pluggable character recognition
  • Statistical natural language modelling
  • Multi-lingual capabilities
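
To make "pluggable" concrete, here is a minimal sketch of how interchangeable layout, recognition and language-model stages could be wired together. This is not OCRopus's actual API: every class and method name below is hypothetical, and a plain string stands in for a scanned page image so that the example stays runnable.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


# Hypothetical stage interfaces; any object with the right method can be plugged in.
class LayoutAnalyzer(Protocol):
    def segment(self, page: str) -> List[str]: ...


class Recognizer(Protocol):
    def recognize(self, line: str) -> str: ...


class LanguageModel(Protocol):
    def correct(self, text: str) -> str: ...


class NewlineLayout:
    """Toy layout analysis: treat each non-blank line of the input as a text line."""

    def segment(self, page: str) -> List[str]:
        return [line for line in page.split("\n") if line.strip()]


class EchoRecognizer:
    """Toy recognizer: the 'image' is already text, so only trim whitespace."""

    def recognize(self, line: str) -> str:
        return line.strip()


class LookupModel:
    """Toy language model: replace known misreadings word by word."""

    def __init__(self, fixes: Dict[str, str]) -> None:
        self.fixes = fixes

    def correct(self, text: str) -> str:
        return " ".join(self.fixes.get(word, word) for word in text.split())


@dataclass
class Pipeline:
    layout: LayoutAnalyzer
    recognizer: Recognizer
    model: LanguageModel

    def run(self, page: str) -> List[str]:
        # Segment the page, recognize each line, then apply language-model correction.
        lines = self.layout.segment(page)
        return [self.model.correct(self.recognizer.recognize(line)) for line in lines]


pipeline = Pipeline(NewlineLayout(), EchoRecognizer(), LookupModel({"0CR": "OCR"}))
result = pipeline.run("open source 0CR\n\n  by Google ")
print(result)
```

Because the pipeline depends only on the three small interfaces, any one stage can be replaced (for example, a different recognizer for another language) without touching the others.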

The end goal of the project was to advance the state of OCR and related technologies, and to provide the best optical character recognition system for document conversion, electronic libraries, vision-impaired users, historical document analysis and general desktop use.

Part of the software is based on Tesseract, one of the best open source OCR engines available today. The project is expected to be released at the end of this year and will be utilised for Google's book scanning project. The team has a few fascinating applications in mind for the software:

  • Web service interface
  • Integration with desktop search tools (e.g., Beagle, Spotlight, etc.)
  • PDF, camera and screen OCR

More information here:

http://code.google.com/p/tesseract-ocr/