Sunday, December 6, 2009

The tilt method for character segmentation in Indic Scripts

One of the problems in Indic script character classification is the huge number of glyphs. This is mostly due to conjuncts. But a major component of the huge number of glyphs are also symbols formed by consonant+vowel signs. Once a combination of consonant and vowel sign overlaps on a vertical axis, Tesseract has to be trained with that entire symbol. This is because Tesseract does a left-to-right scan of the image and can only box a wholly connected component. Then it proceeds to sub-divide the box, again on a vertical axis, in case it fails to recognise the entire word. For example:

For the image below, the OCR may first box the 2 characters together.

At the next iteration, it will split the box into 2 so that it has a better chance of identifying the characters.


Hence, for a symbol like কু the OCR can not segment ক and ু separately.
There is a hack for this though. What if we rotate the image by 90 degress counter-clockwise?



As you can see, rotating the symbol allows Tesseract to box the vowel separately. We can train the rotated symbols to stand for a particular character.
This will significantly reduce the number of character classes to be trained for Tesseract OCR. I am working on the Python script that does this transformation of the image.

No comments:

Post a Comment