Monday, May 11, 2009

Issues for Indic Meet

1) Find out what others know
2) Discuss the problems with OCR
3) Discuss work done
4) Discuss plans
5) Discuss available tools
6) Discuss tools to be developed
7) Discuss application

Objectives/Deliverables for Indic Meet

I shall first demonstrate the OCR on some sample images, then explain at a high level how the OCR system works. This will be followed by a demonstration of the problems that exist in the present system and the potential solutions I have in mind. I shall also demonstrate how to train this OCR for a particular language. This should be over within 75 minutes.
Then we move on to the problems I am facing and discuss possible solutions. Here are a few problems to tackle:

1) Learning about the various efforts made in the past (BOCRA, Aksharbodh, etc.)
2) Dealing with the post-OCR spell-checker problem
3) A better segmentation algorithm; merits/demerits of the OCRopus curved cut segmenter
4) Reducing the number of character classes to be trained, as explained at http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html
5) Talking to Santhosh Thottingal about integrating the service into Silpa
6) How to build a web interface that can train the OCR engine from user input
