Friday, November 9, 2012

Why FOSS Indic OCR is now feasible

I was supposed to attend mediawiki.org/wiki/Pune_LanguageSummit_November_2012, but ended up not going. I did not have much to share anyway, since I have not worked on OCR for almost two years now, so I decided to write up the few things I did have to say in this post.

Tesseract-OCR today has several new features that make it more suitable for Indic OCR.

1) They have added a new classifier called "cube", which can handle many more character classes than the older classifier. This is important because Indic scripts have hundreds of distinct glyphs once you count conjuncts and overlapping vowel signs. (An invocation sketch follows the list below.)

2) Significant amounts of code for "shirorekha clipping" have now been added to the code base, which will help Hindi and Bengali OCR: http://code.google.com/p/tesseract-ocr/source/search?q=shirorekha&origq=shirorekha&btnG=Search+Trunk. The approach is similar to http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf, and is sketched after the list below.

3) Generous amounts of Indic script training data are now hosted on the official Tesseract-OCR website, as well as on other community-supported projects like http://code.google.com/p/parichit/.
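To make point 1 concrete, here is a minimal Python sketch of driving Tesseract with cube enabled. The file names are placeholders, and it assumes a Tesseract 3.02+ install with hin.traineddata (and its hin.cube.* companions) in your tessdata directory; as far as I know, tessedit_ocr_engine_mode is the config variable that selects the classic engine (0), cube alone (1), or the combined mode (2).

    import subprocess

    # One-line Tesseract config file asking for the combined
    # classic + cube engine. Config files are plain "name value"
    # pairs passed as trailing arguments on the command line.
    with open("cube_combined", "w") as f:
        f.write("tessedit_ocr_engine_mode 2\n")

    # Usage: tesseract <image> <output base> -l <lang> [configfile ...]
    # "page.tif" and "page" are placeholder names.
    subprocess.check_call(
        ["tesseract", "page.tif", "page", "-l", "hin", "cube_combined"])

    # Tesseract writes UTF-8 text to <output base>.txt
    print(open("page.txt", encoding="utf-8").read())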
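And here is a rough numpy sketch of the shirorekha clipping idea from point 2. This is not the Tesseract implementation, just the projection-profile approach that the clipmatra pseudocode describes: locate the densest band of rows near the top of a binarized word image (the headline) and blank it, so that the glyphs hanging from it fall apart into separate connected components. The band and thresh parameters are invented for illustration.

    import numpy as np

    def clip_shirorekha(binary, band=0.4, thresh=0.8):
        """Blank out the shirorekha (headline) of a binarized word image.

        binary: 2-D array with 1 = ink, 0 = background. Any row in the
        top `band` fraction of the image whose ink count is within
        `thresh` of the peak row there is treated as headline.
        """
        profile = binary.sum(axis=1)               # horizontal projection
        top = max(1, int(binary.shape[0] * band))  # search only the upper part
        peak = profile[:top].max()
        if peak == 0:                              # blank image: nothing to clip
            return binary
        out = binary.copy()
        for row in range(top):
            if profile[row] >= thresh * peak:
                out[row, :] = 0                    # erase this headline row
        return out

Once the headline is gone, an ordinary connected-component pass can pick the characters apart.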

For someone who wants to start working on Indic OCR now, the barriers have been lowered somewhat. Much work still needs to be done, because in OCR the last 2% of accuracy is what matters most, and getting it requires a lot of fine tuning and testing. Getting your hands on ground-truth test data is therefore of vital importance.
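On that note, the usual way to score OCR output against ground truth is character accuracy derived from edit distance. Here is a small self-contained Python version; the file names in the comment at the bottom are placeholders. Note that it counts Unicode codepoints rather than aksharas, so it weights matra errors differently than a human grader would.

    def edit_distance(a, b):
        """Levenshtein distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def char_accuracy(ocr_text, truth_text):
        """1.0 is a perfect match; 0.98 is 'the last 2%' territory."""
        errors = edit_distance(ocr_text, truth_text)
        return 1.0 - float(errors) / max(1, len(truth_text))

    # e.g. char_accuracy(open("page.txt", encoding="utf-8").read(),
    #                    open("page.gt.txt", encoding="utf-8").read())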