Details of work done till now in the Tesseract-Indic project
1) Maatraa Clipping
Maatraa here refers to the shironaam, or the headline, in the Devanagari and Bengali scripts.
The first step in adapting Tesseract-OCR to recognise Indic scripts like Devanagari and Bengali was to clip (remove) the shironaam at points between successive characters, so that Tesseract's connected component analysis does not mistake the entire word for a single character.
The algorithm and the code, in the form of a patch, are in the May 27th, 2008 entry on http://sites.google.com/site/debayanin/hackingtesseract .
Ray Smith, the project owner of Tesseract-OCR, commented on the code, and Thomas Breuel mentions "Matraa Clipping" in the morphological operations wiki of the OCRopus project.
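As a rough illustration of the clipping idea only (this is not the patch referred to above), the sketch below works on a binarised word image stored as a list of rows of 0/1 values. It assumes the headline is the set of rows with the most ink, and that a column with no ink below the headline is a gap between characters. The function name clip_headline and the 80% threshold are hypothetical choices made for this example.

```python
def clip_headline(word):
    """Return a copy of `word` (rows of 0/1, 1 = ink) with the headline
    erased in every column that has no ink below it."""
    # Horizontal projection: the headline rows carry the most ink.
    row_ink = [sum(row) for row in word]
    peak = max(row_ink)
    # Assumed threshold: rows with at least 80% of the peak are headline rows.
    headline_rows = [y for y, ink in enumerate(row_ink) if ink >= 0.8 * peak]
    bottom = max(headline_rows)

    clipped = [row[:] for row in word]
    for x in range(len(word[0])):
        # No ink under the headline in this column: treat it as a gap
        # between two characters and cut the headline there.
        if not any(word[y][x] for y in range(bottom + 1, len(word))):
            for y in headline_rows:
                clipped[y][x] = 0
    return clipped
```

After clipping, the strokes on either side of a gap are no longer joined through the headline, so they can be picked up as separate connected components.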
2) De-skewing
For the above clipping algorithm to work, the page should be perfectly aligned. There should be no skew/tilt during the OCR process. For this purpose a de-skewing algorithm was required. I wrote an ad-hoc algorithm for that purpose, which has been disabled by default in recent releases of tesseract-indic. Better de-skewing methods are available elsewhere. The code can be found in the October 28 entry on http://sites.google.com/site/debayanin/hackingtesseract .
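The ad-hoc algorithm itself is not reproduced here. As an example of one commonly used alternative, the sketch below estimates skew with a projection profile: it shears the ink pixels by a range of candidate angles and keeps the angle at which the text lines collapse into the sharpest rows. The function name estimate_skew, the angle range and the step size are assumptions made for this example.

```python
import math

def estimate_skew(pixels, max_angle=5.0, step=0.5):
    """Estimate page skew in degrees from a list of (x, y) ink coordinates.
    A positive angle means y grows as x grows."""
    def sharpness(angle_deg):
        t = math.tan(math.radians(angle_deg))
        counts = {}
        for x, y in pixels:
            # Shear the pixel back by the candidate angle.
            row = round(y - x * t)
            counts[row] = counts.get(row, 0) + 1
        # Well-aligned text concentrates ink in few rows, which
        # maximises the sum of squared row counts.
        return sum(c * c for c in counts.values())

    steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(steps + 1)]
    return max(candidates, key=sharpness)
```

The page would then be rotated by the negative of the estimated angle before clipping. The resolution of the estimate is limited by the step size.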
3) Training Data Auto Generation
I was initially working alone. One of the biggest problems of working alone on an OCR project is generating training data for different scripts. I tried to solve the problem by rendering all possible glyphs for a script onto an image, recording the corresponding bounding boxes in a text file, and then feeding the pair to the Tesseract-OCR training mechanism.
Instructions on how to use it are available online, and you may download the latest version at http://code.google.com/p/tesseractindic/downloads/list . The latest version at the time of writing is TesseractIndic-Trainer-GUI-0.1.3 .
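The bounding-box half of this idea can be sketched as follows. This is not the trainer's actual code: the function name layout_boxes, the grid layout and the cell sizes are hypothetical, and each box is simply the grid cell where a glyph would be drawn, whereas a real generator would render the glyph with a font and record its tight bounds. The output lines use the "glyph left bottom right top" form of a Tesseract box file, where y is measured from the bottom of the image.

```python
def layout_boxes(glyphs, cell=(40, 60), columns=10, margin=10):
    """Place glyphs on a grid and return ((page_w, page_h), box_lines)."""
    cell_w, cell_h = cell
    rows = (len(glyphs) + columns - 1) // columns
    page_w = 2 * margin + columns * cell_w
    page_h = 2 * margin + rows * cell_h

    lines = []
    for i, glyph in enumerate(glyphs):
        col, row = i % columns, i // columns
        left = margin + col * cell_w
        right = left + cell_w
        # Convert from top-down image rows to bottom-up box coordinates.
        top = page_h - (margin + row * cell_h)
        bottom = top - cell_h
        lines.append(f"{glyph} {left} {bottom} {right} {top}")
    return (page_w, page_h), lines
```

The returned lines would be written to the box file that accompanies the rendered image, so the two stay in step by construction.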
4) Getting the dictionary to work
One of the big blockers for this project was a non-working dictionary for Indic scripts. It turned out that one missing line of code meant the dictionary subroutine was never called.
Here is how the problem was located.
I was working on creating a desktop GUI from scratch in PyGTK. Sayamindu suggested that I look at OCRFeeder instead. The code is very nice, and the author has even taken care to wrap all printable strings in suitable markers so that gettext can process them for i18n requirements. I am modifying the GUI to support other scripts suitably. I have yet to upload it to a public space, but will do so soon. Sayamindu and I fixed a few problems with it during FOSS.IN 2009.
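For readers unfamiliar with the gettext convention mentioned above, a minimal sketch in Python follows. The label strings and the build_labels helper are hypothetical and are not taken from OCRFeeder. The point is only that every user-visible string passes through _() so that it can be extracted and translated later.

```python
import gettext

# With no compiled catalog loaded, NullTranslations hands back each
# message unchanged; a real application would load a .mo catalog instead.
translation = gettext.NullTranslations()
_ = translation.gettext

def build_labels():
    """Return the user-visible strings of a hypothetical OCR window."""
    return {
        "open": _("Open image"),
        "recognize": _("Recognize text"),
    }
```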
5) Tilt method
6) Community Building
At FOSS.IN I saw a strong urge in people to work on OCR-related problems. I felt responsible for creating a community and a framework for the OCR project that make community contribution an easy process.
For a technology-intensive project, the traditional FOSS model does not work in the same way. You generally won't expect people to tweak the core algorithms in pattern matching or machine learning components. This is something that Prof. C.V. Jawahar said, and I find it true for Tesseract-OCR too. In the case of Tesseract, a lot of people work on training data, fixing bugs, tweaking parameters and creating UIs, but very rarely does someone decide to touch the core algorithms.
The fact is (as said by Prof. Anoop) that the core algorithms and the training data/UI are of equal, 50/50, importance in OCR development.
It is my intention to create a feedback-based learning system for the OCR, which makes it trivially easy for the user to send erroneous recognitions back to a maintainer, and trivially easy for the maintainer to incorporate that data into a newer, better training set.
1) Documentation on how different language teams can help
2) Integrating OCRFeeder with Training and Testing frameworks. Create feedback module.
3) Web based OCR. Feedback based learning mechanism
4) Can the dictionary be improved?
5) OCRFeeder page layout analysis is a little off