Tesseract-Indic-OCR: May 2009

Total character classes required to be trained:

ক
খ
গ
ঘ
ঙ
চ
ছ
জ
ঝ
ঞ
ট
ঠ
ড
ঢ
ণ
ত
থ
দ
ধ
ন
প
ফ
ব
ভ
ম
য
র
ল
শ
ষ
স
হ
য
য়
ৰ
ৱ

অ
আ
ই
ঈ
উ
ঊ
ঋ
এ
ঐ
ও
ঔ

০
১
২
৩
৪
৫
৬
৭
৮
৯

া
ে
ৈ
ৌ (can't get the last symbol to render independently :()
ং
ঃ

৷
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<>
?
@
=
[
\
]
^
_
`

{
|
}
~
ৢ
ৣ
‘
’
“

Here are semivowels that need to be trained combined with consonants/conjuncts:

ি
ী
ু
ূ
ৃ
ৄ
্
়
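
To make the idea of "combined" classes concrete, here is a minimal Python sketch that attaches each sign to each base character. The tiny `bases` sample and the choice of the first six signs from the list above are assumptions for illustration only; the real training set would use all consonants and conjuncts.

```python
# Sample bases: two consonants and two conjuncts (a small illustrative subset).
bases = ["ক", "ত", "ক্ষ", "ন্ত"]

# The first six signs from the list above, written as escapes so they
# survive copy and paste: i, ii, u, uu, vocalic r, vocalic rr.
signs = ["\u09bf", "\u09c0", "\u09c1", "\u09c2", "\u09c3", "\u09c4"]

def combined_classes(bases, signs):
    """Return every base + sign pair as one trainable class."""
    return [base + sign for base in bases for sign in signs]

combined = combined_classes(bases, signs)
print(len(combined))  # 4 bases x 6 signs = 24
print(" ".join(combined[:6]))
```

Every extra base adds one class per sign, which is why the combined classes dominate the total.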

Here are the conjuncts:

ক্ক
ক্ট
ক্ত
ক্ন
ক্ম
ক্র
ক্ল
ক্ব
ক্ষ
ক্স
ক্ষ্ণ
ক্ষ্ম
ক্ট্র
খ্র
গ্গ
গ্ধ
গ্ন
গ্ম
গ্ল
গ্ব
গ্র
ঘ্ন
ঘ্র
ঙ্ক
ঙ্খ
ঙ্গ
ঙ্ঘ
ঙ্ম
ঙ্ক্ষ
চ্চ
চ্ছ
চ্ঞ
চ্ছ্র
চ্ছ্ব
ছ্ব
ছ্র
জ্জ
জ্ঝ
জ্ঞ
জ্র
জ্ব
জ্জ্ব
ঞ্চ
ঞ্ছ
ঞ্জ
ঞ্ঝ
ট্ট
ট্র
ঠ্র
ড্ড
ড্র
ড়্গ
ণ্ট
ণ্ঠ
ণ্ড
ণ্ঢ
ণ্ণ
ণ্ম
ণ্ব
ণ্র
ণ্ড্র
ত্ত
ত্থ
ত্ন
ত্ম
ত্র
ত্ব
ত্ত্ব
থ্র
থ্ব
দ্গ
দ্ঘ
দ্দ
দ্ধ
দ্ভ
দ্ম
দ্র
দ্ব
দ্দ্ব
দ্ধ্ব
ধ্ন
ধ্র
ধ্ব
ন্ত
ন্থ
ন্দ
ন্ধ
ন্ন
ন্য
ন্ব
ন্ম
ন্স
ন্ত্ব
ন্ত্র
ন্দ্ব
ন্দ্র
ন্ধ্র
প্ট
প্প
প্ন
প্ত
প্ল
প্স
প্র
ফ্র
ফ্ল
ব্জ
ব্দ
ব্ধ
ব্ব
ব্ল
ব্র
ব্দ্র
ভ্র
ম্ন
ম্প
ম্ফ
ম্ব
ম্ভ
ম্ম
ম্র
ম্ল
ম্ভ্র
ম্প্র
ল্ক
ল্গ
ল্ট
ল্ড
ল্প
ল্ফ
ল্ব
ল্ম
ল্ল
শ্চ
শ্ছ
শ্ন
শ্ম
শ্ব
শ্র
শ্ল
শ্য
ষ্ক
ষ্ট
ষ্ঠ
ষ্ণ
ষ্প
ষ্ফ
ষ্ম
ষ্ক্র
ষ্ট্র
ষ্য
স্ক
স্খ
স্ট
স্ত
স্থ
স্ন
স্প
স্ফ
স্ম
স্র
স্ল
স্ব
স্ত্র
স্ক্র
স্ট্র
স্য
হ্ণ
হ্ন
হ্ম
হ্র
হ্ল
হ্ব
হ্য
গু
ন্তু
নু
সু
রু
রূ
দু
শু
হৃ
হু
গ্রু
গ্রূ
ব্রু
ভ্রু
ভ্রূ
শ্রু
শ্রূ
স্তু
ন্দু
ত্রু
থ্রু
থ্রূ
দ্রু
দ্রূ
ধ্রু
ধ্রূ
ল্গু
ন্ড
ন্ট
ন্ঠ
চ্ন
ট্ম
ট্ব
ড্ম
ভ্ল
ম্ত
ম্থ
ম্দ
ল্ত
ল্ধ
শ্ত

Total number of character classes to be trained:

36 (consonants) + 11 (vowels) + 10 (digits) +
6 (vowel signs that can be rendered separately) + 49 (punctuation marks and symbols) + 215 (conjuncts)
+ (215 + 36) x 6 (for semi-vowels that cannot be trained individually) = 1833

Hence the character classifier for an Indic OCR has to choose among 1833 character classes
to identify a single character. For an English OCR, on the other hand, this number is on the order of 100
(52 letters, 10 digits and a few dozen punctuation marks).
Hence the difficulties in Indic OCR.
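
The arithmetic can be checked with a few lines of Python. The counts are the ones stated in this post; the dictionary keys and variable names are just labels chosen for this sketch.

```python
# Class counts as stated in this post.
standalone = {
    "consonants": 36,
    "vowels": 11,
    "digits": 10,
    "vowel signs rendered separately": 6,
    "punctuation marks and symbols": 49,
    "conjuncts": 215,
}

# Each consonant or conjunct is trained once per attached sign.
bases = standalone["consonants"] + standalone["conjuncts"]  # 251
attached_signs = 6
combined = bases * attached_signs                           # 1506

total = sum(standalone.values()) + combined
print(total)  # 1833
```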

How to reduce the number of character classes to be trained?

In a conversation with Prof. B.B. Chaudhuri I learnt techniques to reduce the number of character classes.
First we need to separate a word image into three parts: top, middle and bottom. The top part will have
the rising parts of vowel signs like ি and ী; the middle part will have consonants, conjuncts, vowels, digits etc.
The bottom part will have the descending parts of vowel signs like ু.
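
As a rough illustration of the three-way split described above, the following Python sketch uses a horizontal projection profile. It assumes that the headline (matra) is the row with the most ink, and it takes the baseline row as a parameter instead of estimating it. The function name `split_zones`, the 0/1 list-of-rows image format and the toy image are assumptions made for this example; this is not code from Ocropus or Tesseract, and not necessarily the exact method Prof. Chaudhuri described.

```python
def row_profile(image):
    """Count the ink pixels (1s) in every row of a binary image."""
    return [sum(row) for row in image]

def split_zones(image, baseline):
    """Split a word image into (top, middle, bottom) zones.

    The headline is taken to be the first row with the maximum ink count.
    `baseline` is the index of the last row of the middle zone and is
    supplied by the caller in this sketch.
    """
    profile = row_profile(image)
    headline = max(range(len(profile)), key=profile.__getitem__)
    top = image[:headline]                  # rising parts of vowel signs
    middle = image[headline:baseline + 1]   # consonants, conjuncts, vowels, digits
    bottom = image[baseline + 1:]           # descending parts of vowel signs
    return top, middle, bottom

# Toy 7x6 word image: 1 = ink, 0 = background.
word = [
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],  # headline
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],  # baseline
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
]

top, middle, bottom = split_zones(word, baseline=4)
print(len(top), len(middle), len(bottom))  # 2 3 2
```

A real page would also need skew correction and a way to estimate the baseline, both of which are left out here.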

Ocropus already seems capable of achieving this. The image below shows the rising parts
of a few vowel signs segmented separately:

[Image: rising parts of a few vowel signs, segmented separately]

If we can successfully adopt this segmentation approach, we can reduce the number of trainable character
classes to around 350.

Once we have segmented the image, how would the present Tesseract-OCR engine classify the new
character classes? For example, how do we train the engine so that it understands that the rising part of
ি is only a part of a vowel sign? Tesseract only understands characters with Unicode
values during training, and a detached rising part has no Unicode value of its own.
Hence I don't think Tesseract-OCR will understand this segmentation.
So what do we do? There are 2 possibilities:

1) We use a different OCR engine. I will have to dig deeper into Ocropus.
2) We use the Tesseract-OCR classifier with the 1800-odd character classes, augmented with a strong
spell-checker-based correction mechanism.

I am working on the second method right now.
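
To give a feel for what spell-checker-based correction could look like, here is a minimal Python sketch that replaces an OCR output word with its closest dictionary entry. It uses `difflib.get_close_matches` from the Python standard library; the three-word `LEXICON`, the `correct` helper and the 0.6 cutoff are assumptions for illustration and not the mechanism actually being built.

```python
from difflib import get_close_matches

# Hypothetical mini-dictionary; a real one would hold many thousands of words.
LEXICON = ["বাংলা", "ভারত", "কলকাতা"]

def correct(word, lexicon, cutoff=0.6):
    """Return the closest dictionary word, or the input if nothing is close."""
    if word in lexicon:
        return word
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# The classifier confused ল with ন in the fourth position.
print(correct("বাংনা", LEXICON))
```

Comparing code points is a simplification: one misread conjunct can change several code points at once, so a real corrector would probably weight confusions by how similar the glyphs look.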

Friday, May 8, 2009

Bengali Stats