Wednesday, December 16, 2009

Review of current status and vision document

Details of work done till now in the Tesseract-Indic project
-------------------------------------------------------------------------

1) Maatraa Clipping

Maatraa here refers to the shironaam, or headline, in the Devanagari and Bengali scripts.

The first step in adapting Tesseract-OCR to recognise Indic scripts like Devanagari and Bengali was to clip (remove) the shironaam at points between successive characters, so that Tesseract's connected component analysis does not mistake the entire word for a single character.
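The actual implementation lives in Tesseract's C++, but the gist fits in a few lines of PIL. Here is a minimal sketch, assuming a binarised image containing a single line of text, where the headline shows up as the row with the most black pixels (the real patch is more careful and clips only between successive characters; file names are made up):

#!/usr/bin/python
import Image

im = Image.open('line.tif').convert('1')   # 0 = black ink, 255 = white
wt, ht = im.size
pix = im.load()

# The shironaam is the densest horizontal band of ink.
row_counts = [sum(1 for x in range(wt) if pix[x, y] == 0) for y in range(ht)]
matra = row_counts.index(max(row_counts))

# Whiten the headline row and its immediate neighbours so the
# characters below it become separate connected components.
for y in range(max(0, matra - 1), min(ht, matra + 2)):
    for x in range(wt):
        pix[x, y] = 255

im.save('clipped.tif', "TIFF")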

The algorithm and the code, in the form of a patch, are in the May 27th, 2008 entry on http://sites.google.com/site/debayanin/hackingtesseract .

Ray Smith, the project owner of Tesseract-OCR, commented on the code here, and Thomas Breuel mentions "Matraa Clipping" in the morphological operations wiki of the OCRopus project.


2) De-skewing

For the above clipping algorithm to work, the page should be perfectly aligned; there should be no skew/tilt during the OCR process. For this purpose a de-skewing algorithm was required. I wrote an ad-hoc algorithm for that purpose, which is disabled by default in recent releases of tesseract-indic, since better deskewing methods are available elsewhere. Code can be found at the October 28 entry in http://sites.google.com/site/debayanin/hackingtesseract .
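For illustration only (this is not the ad-hoc algorithm from the patch, just one simple alternative), a brute-force projection-profile deskewer can be sketched in PIL: rotate the page over a range of candidate angles and keep the angle that makes the horizontal ink histogram the peakiest. The file name and angle range are assumptions:

#!/usr/bin/python
import Image, ImageChops

im = Image.open('page.tif').convert('L')
inv = ImageChops.invert(im)   # make text bright; rotate() then pads with black

def score(img):
    # Variance of per-row ink counts: highest when text lines are horizontal.
    wt, ht = img.size
    pix = img.load()
    rows = [sum(1 for x in range(wt) if pix[x, y] > 128) for y in range(ht)]
    mean = float(sum(rows)) / ht
    return sum((r - mean) ** 2 for r in rows)

angles = [a / 10.0 for a in range(-50, 51)]   # -5 to +5 degrees in 0.1 steps
best = max(angles, key=lambda a: score(inv.rotate(a)))
print "estimated skew:", best
ImageChops.invert(inv.rotate(best)).save('deskewed.tif', "TIFF")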

3) Training Data Auto Generation

I was initially working alone, and one of the biggest problems of working alone on an OCR project is generating training data for different scripts. I tried to solve the problem by rendering all possible glyphs for a script onto an image, recording the corresponding bounding boxes to a text file, and then feeding the pair to the Tesseract-OCR training mechanism.
Instructions on how to use it can be found here, and you may download the latest version at http://code.google.com/p/tesseractindic/downloads/list . The latest version at the time of writing is TesseractIndic-Trainer-GUI-0.1.3 .
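The core trick the trainer automates can be sketched in a few lines of PIL: render a glyph, find the ink's bounding box, and write out a Tesseract box-file line (box files use a bottom-left origin). The font path is the Ubuntu one mentioned in the how-to further down; the glyph and file names are made up:

#!/usr/bin/python
#-*- coding:utf8 -*-
import Image, ImageFont, ImageDraw, ImageChops

glyph = "ক"
im = Image.new("L", (100, 100), 255)
draw = ImageDraw.Draw(im)
font = ImageFont.truetype("/usr/share/fonts/truetype/ttf-bengali-fonts/lohit_bn.ttf", 40)
draw.text((25, 25), unicode(glyph, 'UTF-8'), font=font, fill=0)

# getbbox() finds non-zero pixels, so invert first to box the ink.
x1, y1, x2, y2 = ImageChops.invert(im).getbbox()
ht = im.size[1]
im.save('ka.tif', "TIFF")
# Box file format: <glyph> <x1> <y1> <x2> <y2>, y measured from the bottom.
open('ka.box', 'w').write("%s %d %d %d %d\n" % (glyph, x1, ht - y2, x2, ht - y1))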


4) Getting the dictionary to work

One of the big blockers for this project was a non-working dictionary for Indic scripts. It turned out that a single missing line of code meant the dictionary subroutine was never called.
Here is how the problem was located.

5) OCRFeeder

I was working on creating a desktop GUI from scratch in PyGTK. Sayamindu suggested that I look at OCRFeeder instead. The code is very nice, and the author has even taken care of surrounding all printable strings with suitable modifiers so gettext can process them for i18n requirements. I am modifying the GUI to support other scripts suitably. I am yet to upload it to a public space, but will do so soon. Sayamindu and I fixed a few problems with it during FOSS.IN 2009.

6) Tilt method

http://hacking-tesseract.blogspot.com/2009/12/tilt-method-for-character-segmentation.html

http://hacking-tesseract.blogspot.com/2009/12/preliminary-results-for-tilt-method.html

7) Community Building

At FOSS.IN I saw a strong urge in people to work on OCR-related problems. I felt responsible for creating a community, and a framework for the OCR project that makes community contribution an easy process.
For a technology-intensive project, the traditional FOSS model does not work in the same way. You generally won't expect people to tweak the core algorithms in pattern matching or machine learning components. This is something that Prof. C.V. Jawahar said, and I find it true for Tesseract-OCR too. In the case of Tesseract, a lot of people work on training data, fix bugs, tweak parameters and create UIs, but very rarely does someone decide to touch the core algorithms.
The fact is (as said by Prof. Anoop), core algorithms and the training data/UI share a 50/50 ratio in importance in OCR development.
It is my intention to create a feedback-based learning system for the OCR, which makes it trivially easy for the user to send erroneous recognitions back to a maintainer, and trivially easy for the maintainer to incorporate that data into a newer, better training set.

http://hacking-tesseract.blogspot.com/2009/11/crowd-sourcing-ocr-development.html



ToDo
------

1) Documentation on how different language teams can help

2) Integrating OCRFeeder with Training and Testing frameworks. Create feedback module.

3) Web based OCR. Feedback based learning mechanism

4) Can the dictionary be improved?

5) OCRFeeder page layout analysis is a little off

Monday, December 7, 2009

Preliminary results for tilt method


I wrote this Python code, which reads in a box file and performs the rotation operation on the corresponding image:


#!/usr/bin/python
#-*- coding:utf8 -*-

import Image
import sys

box_file = open(sys.argv[1], 'r')
lines = box_file.readlines()
image_name = sys.argv[1].split('.')[0] + '.tif'

input_image = Image.open(image_name)

wt = input_image.size[0]
ht = input_image.size[1]
# The rotated characters need extra horizontal room, so the canvas is
# twice as wide as the original image.
new_image = Image.new("L", (wt*2, ht), 255)

offset = 0
prevtlx = 0
for line in lines:
    # Box file line: <char> <x1> <y1> <x2> <y2>, origin at the bottom-left.
    fields = line.split(' ')
    # Convert to PIL coordinates, whose origin is at the top-left.
    top_left_x = int(fields[1])
    top_left_y = ht - int(fields[4].strip())
    bot_right_x = int(fields[3])
    bot_right_y = ht - int(fields[2])
    char = input_image.crop((top_left_x, top_left_y, bot_right_x, bot_right_y))
    char = char.rotate(90, expand=1)  # expand=1 so the rotated crop is not clipped
    if top_left_x < prevtlx:  # x jumped backwards: a new text line has started
        offset = 0

    newwt = char.size[0]
    newht = char.size[1]

    newbox = (top_left_x+offset, top_left_y, top_left_x+offset+newwt, top_left_y+newht)
    print newbox
    offset = offset + (newwt - newht + 2)
    prevtlx = top_left_x

    new_image.paste(char, newbox)

new_image.save('mod.tif', "TIFF")


Then I take an image and run the following command on it to generate the box file:
tesseract bengali2.tif bengali2 -l ban batch.nochop makebox


On running the script one finds the images below:




Transformed Image

The experiment has been somewhat disappointing. The quality of the character images degrades after rotation. Also, since the boxing is not perfect, the wrong groups have been rotated. That is not to say this technique cannot be used. I need to make the same modifications in the Tesseract C++ code. The idea is to rotate the character images and compare the classifier confidence between the original and the modified character image; whichever scores higher will be chosen.
Also, I need a version of the Pango renderer that can render the vowel signs without the dotted circles. I probably need to make a few lines of changes and rebuild Pango, as Sayamindu said.
So here I dive into the code base again.

Sunday, December 6, 2009

The tilt method for character segmentation in Indic Scripts

One of the problems in Indic script character classification is the huge number of glyphs. This is mostly due to conjuncts, but a major component of the huge number of glyphs is symbols formed by consonant + vowel sign combinations. Once a combination of consonant and vowel sign overlaps on a vertical axis, Tesseract has to be trained with that entire symbol. This is because Tesseract does a left-to-right scan of the image and can only box a wholly connected component. It then proceeds to sub-divide the box, again on a vertical axis, in case it fails to recognise the entire word. For example:

For the image below, the OCR may first box the 2 characters together.

At the next iteration, it will split the box into 2 so that it has a better chance of identifying the characters.


Hence, for a symbol like কু the OCR cannot segment ক and ু separately.
There is a hack for this though: what if we rotate the image by 90 degrees counter-clockwise?



As you can see, rotating the symbol allows Tesseract to box the vowel separately. We can train the rotated symbols to stand for a particular character.
This will significantly reduce the number of character classes to be trained for Tesseract OCR. I am working on the Python script that does this transformation of the image.

Saturday, November 28, 2009

Initial tests

Initial test results are pretty good.

Test Condition:

Image: A deskewed image with Bengali text.




Training data:
Word List: Superset of all the words contained in the image
Shape Information: Using CRBLP's data

Output Text:

বৈঠকী মেজাজের এক উপেক্ষিত সা হিতি্যক
পত্র।স্তরে অনুজপ্রতিম লেখক ভ্রী শৈবাল মিত্র কিছুদিন অাগে বা-ঙা লি
লেখবঢ়ুঘুদ্ধিজীবীদের খুব একচেটি বকূনি দিয়েছো। তার অভিযোড়ৈগ এই যে, -যখনই
এই সব ব্যক্তিদের প্রশ্ল কর। হয়, অাপনার। গত এক বছরে উল্লেখযোগ্য কীড়ী কীা বই
পড়েছো, তখন অবধা রিতভাবে প্রায় সকলেই কিছুইংরিজি বই বা বিদেশি সাহিভ্যের
সূখ্য।তি ভরঢ়ু করেন। তার। কি ব।ংল। বই পড়্গে না ? নিজেরা বাংলা ভাষার লেখক
হয়েও অপর কে।নও বাঙালি লেখকের রচন।কে ণ্ডরঢ়ুত্বপুর্ণ মনে করেন নাড়ৈ? না কি
ব।ংল। ভ।ষায় উল্লেখযে।গ্য কিছু লেখাই হয় না।
এই অভিযে।গে সত্যতা অাছে। প।ণ্ডিত্য প্রমাণ করার জন্য অনেকেই বিদেশি
স।হিত্য সম্পর্কে জান জ। হির কর।র জন্য ব্যস্ত হয়ে পড়্গে; পা ণ্ডিত্য কিংবানূরবারি
য।ই হে।ক, ব।ংল। বই-টই এর মধ্যে অ।সে না। কফি হাউসের বুদ্ধিজীবীদের হাভে
বাংলা বই র।খ।র রেওয়।জ নেই। ঢেউয়ের মতন কখনও ম।র্কেজ, কখনও-
দেরিদ।-গ্র।মসি, কখনও টনি মরিসন ব।ঙ।লি বুদ্ধিজীবীদের ওণ্ঠের ওপর খেলাকরে
যান।
বিদেশি স।হিত্য ও তত্ত্বগ্রছ প।ঠ করা অবশ্যই জকরি, কিত বাংলা ভাষায়
অালোচন।যে।গ্য কে।নও গ্রছ লেখ। হয় ন।, এমন য দি মনে কর। হয় তা হলে বাংলা
-ভাষানিয়ে এত গর্ব কর।রইবা কী অ।ছে ? বিদেশি স।হিত্য প।ঠ করলেই বরং বে।ঝা
-যায়-, সম্প্রতি অন্যান্য ভ।ষ।য় রচিত গল্প-উপন্য।সবঢ়ুবিত। বাংলার তূলনায় এমন কিছু
-অাহাম রি উচচ।ঙ্গের নয়। সৈয়দ মুস্ত।ফ। সির।জের ড়ৃঅলীক মানুষ,এর মতন উপন্যাস
বিঢ়ুংব। -জয় গে।স্ব।মীর ড়ৈপ।গলী, ভে।মার সঙ্গে-ব তুল্য কাব্যগ্রছইদানীং কোন ভ।ষ।য়
প্রকাশিত হয়েছে?
যাই হ্রোক, অামার পক্ষে এরকম পণ্ডি তিপনা কিংবা স্নবারি দেখ।বার কোনও
সুযে।গই নেই; কারণ গত এব৪ বৎসরে অামি বিদেশি সাহিত্য কিছুই প ড়িনি ! এমনকী
ইংরিজি অক্ষরে লেখ। নিতাস্ত কয়েবঢ়ুখান। পুরনো ইতিহাসখজীবনীগ্রস্হু ছাড়া কোনও
গল্পঞ্ঝউপন্যাস চোখেও দেখি নি! বস্কু-ব।ন্ধবরা কেউ যখন সা।ভ.ঘতিক কোনও
সাড়।-জাগ।নো বইয়ের প্রসঙ্গ তুলে জিজ্ঞেস করে, তূ মি পড়েছ নিশ্চয়ই? অামারে৪
সসংকে।চে স্বীক।র করতেই হয়ড়ৈ, না ভ।ই পড়িনি! কিংব। বিদেশ খেকে ফিরে এলে
যখন কেউ জিজেস করে, ও দেশের হালফিল সাহিত্যের ধারা কী দেখলে, অামি
মাথা চুলকোই। জা নি না, খবর নেব।র সময় পাইনি! লভনে গিয়ে ইন্ডিয়া অ ফিস
লাইরেরিভে অামি পুরনো গুথিপত্র ঘেঁটেছি, এবচটাও নজৃব ইংরিজি কবিতার বই
কিনি নি, এটা স্বীকার করভে অামার লজ্জ। হয়। ত্রটা অামার এবচঁটা অধঃপতনের চিহ্ন


Accuracy: ~93%

One major source of errors is the । vs া ambiguity. That can be fixed.

This is pretty good news. The OCR is working well.

Conversation with Sayamindu regarding ambiguities

Here is a mail I sent to Sayamindu:
"By the way, one difficult problem I am facing is that all the া are
being mistakenly recognised as । . The dictionary should help in
resolving this, and also there is a file where we can specify
ambiguities like these. But nothing seems to work.
One way to solve the problem is to add the following rule in the
reorder script: We make a pass and replace all instances of । with া .
Then we make another pass and see whether there are any leftover া
with the dotted circle. These should be replaced by । .
Is the logic ok? How to find out if an া has a dotted circle?"

He has not replied yet.

Here is what I think. The change cannot be made simply in the reorder script, which gets executed only in the post-OCR stage. The problem is that the OCR engine itself recognises this wrongly, and that throws off the rest of the recognition.
One solution is to not train । (the equivalent of a full stop in Bengali) at all. We can always add the । back in the post-OCR script, using the method in the mail addressed to Sayamindu.
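A sketch of that post-OCR pass, on the assumption that a genuine aa-kar always follows a consonant. The consonant range used below is an approximation (it ignores ড়, ঢ় and য়):

#!/usr/bin/python
#-*- coding:utf8 -*-

def fix_danda(text):
    text = text.replace(u'।', u'া')   # step 1: treat every danda as aa-kar
    out = []
    for i, ch in enumerate(text):
        # step 2: an aa-kar that does not follow a consonant is a danda
        if ch == u'া' and (i == 0 or not (u'ক' <= text[i-1] <= u'হ')):
            out.append(u'।')
        else:
            out.append(ch)
    return u''.join(out)

print fix_danda(u'ভারত মাতা।').encode('utf8')   # -> ভারত মাতা।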

TesseractIndic Trainer GUI

I just uploaded the TesseractIndic Trainer GUI version 0.1 to http://tesseractindic.googlecode.com/files/TesseracIindic-Trainer-GUI-0.1.tar.gz . This application allows a person to generate custom/application-specific training data quickly.
To see how to use it, read http://code.google.com/p/tesseractindic/wiki/TrainerGUI or watch http://www.youtube.com/watch?v=xuBlfN6Va4k .

Wednesday, November 25, 2009

How the dictionary was fixed

Well, it was a single line!

I added the following line at line number 1077 in dict/permute.cpp:

any_alpha=1;

Here is the diff against 2.04 release:

--- tesseract-2.04/dict/permute.cpp 2008-11-14 23:07:17.000000000 +0530
+++ tessmod/dict/permute.cpp 2009-11-26 00:34:50.660737699 +0530
@@ -1077,6 +1077,7 @@
return (NULL);
if (permute_only_top)
return result_1;
+ any_alpha=1;
if (any_alpha && array_count (char_choices) <= MAX_WERD_LENGTH) {
result_2 = permute_words (char_choices, rating_limit);
if (class_probability (result_1) < class_probability (result_2)


For non-English scripts the if condition was never getting satisfied, and hence the DAWG files were not being scanned properly. Adding any_alpha=1 explicitly above the check solves this problem for the time being. There is probably a more elegant solution though.
By the way, I do not see this particular if condition anywhere in the file in the trunk. Perhaps the developers have fixed it there already.

Tesseract Dictionary (finally) works for Indic

This is going to be one long post.

For the past few weeks I have been experimenting with the Indic script support in the Tesseract dictionary. I will first record the observations/results of my experiments and then elaborate on the logic involved.


We start with a pristine copy of tesseract-2.04 downloaded from here. Then we add some code to enable the maatraa clipping support (March 27 entry here).
Our aim is to see whether the dictionary works for Indic. Here is the methodology:

1) Take an image with a single word.
2) Create empty DAWG files.
3) OCR and see the result.
4) Now create DAWG files with a single word. The word is the same as the one in the image.
5) Now OCR again and see if the result improves.

I chose this image:

In text form it reads: পূনরায় (punoraaye, which means 'again' in Bengali)

On OCRing this image with empty dawg files I received this result: পূনরুায়


The result is wrong. The third character is রু instead of র . Also the vowel sign া is not joined to the previous consonant.


Now I generate the 2 DAWG files, freq-dawg and word-dawg, with a word list containing just this word: পূনরায় . Here is the process:

debayan@deep-blur:/tmp/orig/tesseract-2.04$ cat list
পূনরায়
debayan@deep-blur:/tmp/orig/tesseract-2.04$ wordlist2dawg list dawg
Building DAWG from word list in file, 'list'
Compacting the DAWG
Compacting node from 9570029 to 1000034 (2)
Writing squished DAWG file, 'dawg'
18 nodes in DAWG
18 edges in DAWG
Each symbol takes three bytes (UTF-8 encodes codepoints in this range in three bytes). There are 6 symbols in all: প ূ ন র া য় ; hence 6x3 = 18 nodes in the DAWG. Makes sense!

Now I copy these DAWG files to the appropriate location (/usr/local/share/tessdata/) and OCR again, and get the same result. This shows that the DAWG files are currently ineffective.

Now let's look at how to solve the problem. Of course, the first step is to find out what is going on in the DAWG creation/reading process. This involves inserting several cprintf statements all throughout the code. This gives us an insight (600 KB download) into how the DAWG file is being used. I intend to analyse the output and pinpoint the problem in the next post. In this post, let's concentrate on the results.

After I made the changes, I followed the same 5 steps followed above. Here is the output:

debayan@deep-blur:~/ocr/branches/tesseract-2.04$ vim space
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ wordlist2dawg space dawg
Building DAWG from word list in file, 'space'
Compacting the DAWG
Compacting node from 0 to 1000000 (2)
Writing squished DAWG file, 'dawg'
1 nodes in DAWG
1 edges in DAWG
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.
ban.DangAmbigs ban.freq-dawg ban.inttemp ban.pffmtable ban.user-words ban.word-dawg
ban.DangAmbigs~ ban.freq-dawg.old ban.normproto ban.unicharset ban.user-words.old ban.word-dawg.old
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.freq-dawg
[sudo] password for debayan:
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.word-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ tesseract wed.tif wed -l ban 2>temp
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat wed.txt
পূনরুায়

========================================================================

debayan@deep-blur:~/ocr/branches/tesseract-2.04$ echo 'পূনরায়'>list
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat list
পূনরায়
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ wordlist2dawg list dawg
Building DAWG from word list in file, 'list'
Compacting the DAWG
Compacting node from 9570029 to 1000034 (2)
Writing squished DAWG file, 'dawg'
18 nodes in DAWG
18 edges in DAWG
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.freq-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.word-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ tesseract wed.tif wed -l ban 2>temp
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat wed.txt
পূনরায়

If you follow the output above closely, you will find that adding the word to the DAWG affected the output constructively. If you have the patience, follow the 600 KB text file to see how it did that. Wait for my next post for a detailed analysis of the process.

For now the conclusion is: the dictionary works for Indic. I need to send a patch to Ray and team.

Thursday, November 19, 2009

utf-8 = ok?

I have been trying to add wide character support to the Tesseract code base by converting most char* to wchar_t* data types. However, I read in depth about UTF-8 encoding today here. It says UTF-8 handles Unicode well. Tesseract already supports UTF-8, or so it says.
However, when I print out the DAWG file contents I see garbage for Indic scripts, but proper characters for English. Why is this happening?
This makes me think that maybe I am on the wrong track. I did ask the Tesseract list whether I am on the right track or not, but found no useful replies.
In fact, now that I think about it: while creating the dictionaries out of word lists, we are forgetting that we need to introduce the vowel 'de-reordering' rules. Only then will the OCR be able to match words at run time.
Refer to my earlier post where I have mentioned that we need to do vowel reordering post-OCR. If you reverse the analogy, we need to intentionally include anomalies in the dictionary so the OCR can work against the dictionary. Hence the OCR may match েক by looking at the dictionary, and we can use the vowel reordering code to correct this.
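So before running a word list through wordlist2dawg, each word could be rewritten into the visual order the OCR actually sees. A minimal sketch, handling only the simple single-consonant case (the pre-base sign set is an assumption):

#!/usr/bin/python
#-*- coding:utf8 -*-
# Move pre-base vowel signs in front of the consonant they follow, so
# dictionary entries match the OCR's left-to-right output order.
PRE_BASE = (u'ে', u'ি', u'ৈ')

def to_visual_order(word):
    chars = list(word)
    for i in range(1, len(chars)):
        if chars[i] in PRE_BASE:
            chars[i-1], chars[i] = chars[i], chars[i-1]  # কে -> েক
    return u''.join(chars)

print to_visual_order(u'দেশ').encode('utf8')   # -> েদশ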

What a realisation!!

char* to wchar_t* conversion

Say you have a const char* string and you need to convert it to the wchar_t type so that it can be stored in wide character format. Here is a piece of code that takes the const char* and returns the wchar string for you.
Note that it does not work without the setlocale function.

You need to include the locale.h, wchar.h, string.h and stdlib.h header files for this to work.


wchar_t* utf2wchar(const char *str) {
    setlocale(LC_ALL, "en_US.UTF-8");
    size_t size = strlen(str) + 1;  /* +1 for the terminating null */
    /* Allocate on the heap: returning a stack array would hand the
       caller a dangling pointer. The caller must free() the result. */
    wchar_t *uni = (wchar_t*) malloc(size * sizeof(wchar_t));
    size_t ret = mbstowcs(uni, str, size);
    if (ret == (size_t) -1) { cprintf("mbstowcs failed"); free(uni); return NULL; }
    return uni;
}

Monday, November 16, 2009

No unicode support in Tesseract-OCR?

If I were to point out one single issue on which this project's success depends, it would be the dictionary. The dictionary for this OCR system is not just a text file full of words, but a data structure called a directed acyclic word graph.
I decided to finally solve this blocker of a problem and delved into the mailing lists once again. I did not find any new information there and hence decided to look at the source code itself.
I soon noticed that while building the dictionary, the code treats the words as a stream of bytes and stores one byte per node. This means that the code does not support wide characters. Wide character support requires the wchar_t type instead of char.
This is a major problem. One could try to make the code wide character compatible, but it might require considerable labour. Reading contents from the dictionary would also need to be done with wide character support.
The alternative is shifting to a new OCR engine like OCRopus, which the CRBLP folks seem to have done already.

Friday, November 6, 2009

Is a document suitable for OCR?

This is an important question for certain contexts.
1) There may be an online web service that allows people to upload images to be OCRed. Some pranksters or bots may start uploading images with little or no text. The OCR engine tries to make sense of the image and wastes immense amounts of CPU cycles.

2) The visually challenged may want to use the computer in this manner: whenever they have an image with text in front of them, the software automatically recognises areas of text and OCRs it. Post-OCR, a TTS system then reads out the text for them.

Now how do we achieve this?

There is a good method: the Run Length Smearing Algorithm (RLSA). It smears lines of text into solid black lines, and then looks for parallel black lines as a sign of lines of text in the image.
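A minimal sketch of the horizontal pass of RLSA, assuming a binarised PIL image; the 30-pixel threshold is an arbitrary guess, and a full implementation also smears vertically and ANDs the two results:

#!/usr/bin/python
import Image

def rlsa_horizontal(im, threshold):
    # Any run of white pixels shorter than `threshold` that sits
    # between two black pixels is turned black, smearing the text.
    wt, ht = im.size
    pix = im.load()
    for y in range(ht):
        last_black = None
        for x in range(wt):
            if pix[x, y] == 0:
                if last_black is not None and x - last_black <= threshold:
                    for xf in range(last_black + 1, x):
                        pix[xf, y] = 0
                last_black = x
    return im

im = Image.open('page.tif').convert('1')
rlsa_horizontal(im, 30).save('smeared.tif', "TIFF")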

The Problem of Dotted Circles

Certain vowel signs in Indic scripts have a dotted circle in them, for example: ৈ, ে , া . When these are used in conjunction with consonants, however, the dotted circles vanish. For example: কৈ, কে, কা .
This is a problem for automated training. The Python script draws ে and trains the engine to recognise the shape, along with the dotted circle. However, when we OCR a document, the dotted circle is no longer there.
Hence we somehow need a method of automatically eliminating the dotted circles from vowel signs while generating training images. Any ideas?

Why is Vowel Reordering required?

Indic scripts have the concept of vowel signs. The peculiarity of these vowel signs with respect to OCR is that sometimes consonant + vowel sign = a glyph in which the consonant comes later and the vowel sign first.
Here I present just one simple example.
That is (in Bengali): ক + ে = কে

Now when we OCR কে , the OCR engine first encounters the vowel sign (ে without the dotted circle) and then the consonant ক. It then does a string concatenation of the two characters seen in order, and ends up producing this as the output: েক .
Since the OCR engine makes the same mistake all the time, it's easy to write scripts which move every such vowel sign to the appropriate place. This improves the OCR accuracy drastically.
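Such a script can be as small as this sketch, which handles only the plain consonant case (conjuncts would need a longer look-ahead, and the pre-base sign set is an assumption):

#!/usr/bin/python
#-*- coding:utf8 -*-
# Whenever a pre-base vowel sign comes out *before* the consonant it
# belongs to, swap the pair back into logical order.
PRE_BASE = (u'ে', u'ি', u'ৈ')

def reorder(text):
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PRE_BASE:
            chars[i], chars[i+1] = chars[i+1], chars[i]  # েক -> কে
            i += 2
        else:
            i += 1
    return u''.join(chars)

print reorder(u'েদশ').encode('utf8')   # -> দেশ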

OCRFeeder

I have been working on creating a complete OCR solution suite for GNOME. It turns out that OCRFeeder is already a pretty good solution.
It's a good thing that this exists, because I can now shift focus to adding Indic-related code to OCRFeeder itself.
When I say Indic-related code, I mean the modified tesseract shironaam clipper, vowel reordering, automated training and the crowd-sourcing data feedback learning mechanism.

Crowd Sourcing OCR development

One of the biggest challenges in OCR development is gathering training data and then feeding it to the OCR engine. The data is generally carefully chosen and some emphasis is laid on the quality of scans too. This often requires a team of people working in close proximity, and hence has traditionally been a blocker for the distributed development model.
However, with proper planning in software development, such frameworks can be set up which allow end users to contribute to OCR training data.
The interface to the OCR system may be either command line based, GUI based or web based. Say a user OCRs a particular document. Post-OCR the interface presents to him an opportunity to correct any errors and send it back to a centralised server where certain volunteers/contributors shall verify the data. Once the data has been verified, it is fed to the engine for incremental training.
To check whether the data being added is improving the performance of the OCR or not, we may run an automated nightly OCR over a set of test images with reference texts, and post the accuracy percentage daily.
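A sketch of what such a nightly check could look like, using plain character edit distance; the directory layout, file naming and the 'ban' language code are assumptions:

#!/usr/bin/python
import os, glob

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = range(len(b) + 1)
    for i, ca in enumerate(a):
        cur = [i + 1]
        for j, cb in enumerate(b):
            cur.append(min(prev[j + 1] + 1, cur[j] + 1, prev[j] + (ca != cb)))
        prev = cur
    return prev[-1]

scores = []
for tif in glob.glob('tests/*.tif'):
    base = tif[:-4]
    os.system("tesseract %s %s -l ban" % (tif, base))
    got = open(base + '.txt').read().decode('utf8')
    want = open(base + '.gt.txt').read().decode('utf8')   # ground truth
    scores.append(1.0 - float(edit_distance(got, want)) / max(len(want), 1))

print "mean accuracy: %.1f%%" % (100 * sum(scores) / len(scores))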
The challenge is that most OCR training systems are not incrementally trainable out of the box. Tesseract-OCR is one example. However, one may write some code and implement it.
Crowd Sourcing training data is critical to align OCR development to a FOSS based model and hence free it from the clutches of research teams at big institutes.

Friday, June 5, 2009

How to train Tesseract-OCR

Check out the source code of tesseractindic from http://code.google.com/p/tesseractindic. Then cd to tesseract_trainer and follow the directions below:

Here is a demonstration of how you can create training data files for an arbitrary language for Tesseract-OCR and subsequently use it to perform OCR.


To create data files for, say, Bengali:

1) Create a directory in tesseract_trainer/ and name it arbitrarily; it contains the symbols of the alphabet. I name it 'beng.alphabet'. In the directory you may create a maximum of 4 files:
a) consonants - Put all the consonants in your script/language in this file, e.g. ক , খ (ka, kha) etc.
b) pre_semivowels - Put all the semivowels (if any in your script) that come before a consonant, e.g. ি, ে, ৈ (i kaar, e kaar, oi kaar).
c) post_semivowels - Put all the semivowels (if any in your script) that come after a consonant, e.g. া, ী (aa kaar, ee kaar).
d) rest - Put everything else here: digits, punctuation, conjuncts, special characters, vowels. You could also choose not to create the 3 files above and put all the symbols in this file.

You need to have some fonts particular to your script installed on your system. On an Ubuntu system you will find them in /usr/share/fonts/truetype/ttf-bengali-fonts/. You will require the name(s) of the fonts later.

Now change directory to tesseract_trainer/ and execute the following on the shell (for Bengali, for example): python generate.py -font Mitra -l Bengali -s 15 -a beng.alphabet/

-font takes the name of the ttf font you are trying to train
-l takes the name of the script to be trained
-s is the size of the characters generated in the images in Bengali.images/

This command will generate many images and corresponding box files in Bengali.images/. In the end it generates 5 files in Bengali.training_data/.
1)Bengali.unicharset
2)Bengali.Microfeat
3)Bengali.normproto
4)Bengali.pffmtable
5)Bengali.inttemp


These 5 files are needed by the Tesseract-OCR engine to add support for a new script. In addition, there are 3 more files required that are to be created by you separately. These are:

* Bengali.freq-dawg
* Bengali.word-dawg
* Bengali.user-words
You do not require the tesseract_trainer tool to create the files above. They can be created by following appropriate instructions at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.

Copy these files to /usr/local/share/tessdata and to the tessdata/ folder of your tesseract-ocr source code.

Now let's OCR an image. Let's say the image is image.tif. Here is what you must execute at the terminal:

tesseract image.tif ocr -l Bengali

You get ocr.txt as the output.

Contact debayanin@gmail.com for clarifications.

And yeah, this is work in progress.

Monday, May 11, 2009

Issues for Indic Meet

1) Find out what others know
2) Discuss the problems with OCR
3) Discuss work done
4) Discuss plans
5) Discuss available tools
6) Discuss tools to be developed
7) Discuss application

Objectives/Deliverables for Indic Meet

I shall first demonstrate the working of the OCR on some sample images. Then I plan to explain the working of the OCR system on a higher level. It shall be followed by a demonstration of the problems that exist in the present system and potential solutions that I have in mind. I shall demonstrate how to train this OCR for a particular language. This should be over in 75 minutes.
Then we move on to the problems I am facing. We have a discussion on possible solutions. Here are a few problems to tackle:

1) Learning about the various efforts made in the past: BOCRA, Aksharbodh etc.
2) Dealing with the post-OCR spell-checker problem
3) A better segmentation algorithm: the Ocropus curved-cut segmenter, its merits/demerits
4) Reducing the number of character classes to be trained, as explained at http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html
5) Talk to Santhosh Thottingal about integrating the service with Silpa
6) How to build a web interface that can train the OCR engine from user input

Friday, May 8, 2009

Bengali Stats

Total character classes required to be trained:

য়

ৌ (can't get the last symbol to render independently :( )

!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
>
?
@
=
[
\
]
^
_
`
{
|
}
~

Here are semivowels that need to be trained combined with consonants/conjuncts:

ি

Here are the conjuncts:

ক্ক
ক্ট
ক্ত
ক্ন
ক্ম
ক্র
ক্ল
ক্ব
ক্ষ
ক্স
ক্ষ্ণ
ক্ষ্ম
ক্ট্র
খ্র
গ্গ
গ্ধ
গ্ন
গ্ম
গ্ল
গ্ব
গ্র
ঘ্ন
ঘ্র
ঙ্ক
ঙ্খ
ঙ্গ
ঙ্ঘ
ঙ্ম
ঙ্ক্ষ
চ্চ
চ্ছ
চ্ঞ
চ্ছ্র
চ্ছ্ব
ছ্ব
ছ্র
জ্জ
জ্ঝ
জ্ঞ
জ্র
জ্ব
জ্জ্ব
ঞ্চ
ঞ্ছ
ঞ্জ
ঞ্ঝ
ট্ট
ট্র
ঠ্র
ড্ড
ড্র
ড়্গ
ণ্ট
ণ্ঠ
ণ্ড
ণ্ঢ
ণ্ণ
ণ্ম
ণ্ব
ণ্র
ণ্ড্র
ত্ত
ত্থ
ত্ন
ত্ম
ত্র
ত্ব
ত্ত্ব
থ্র
থ্ব
দ্গ
দ্ঘ
দ্দ
দ্ধ
দ্ভ
দ্ম
দ্র
দ্ব
দ্দ্ব
দ্ধ্ব
ধ্ন
ধ্র
ধ্ব
ন্ত
ন্থ
ন্দ
ন্ধ
ন্ন
ন্য
ন্ব
ন্ম
ন্স
ন্ত্ব
ন্ত্র
ন্দ্ব
ন্দ্র
ন্ধ্র
প্ট
প্প
প্ন
প্ত
প্ল
প্স
প্র
ফ্র
ফ্ল
ব্জ
ব্দ
ব্ধ
ব্ব
ব্ল
ব্র
ব্দ্র
ভ্র
ম্ন
ম্প
ম্ফ
ম্ব
ম্ভ
ম্ম
ম্র
ম্ল
ম্ভ্র
ম্প্র
ল্ক
ল্গ
ল্ট
ল্ড
ল্প
ল্ফ
ল্ব
ল্ম
ল্ল
শ্চ
শ্ছ
শ্ন
শ্ম
শ্ব
শ্র
শ্ল
শ্য
ষ্ক
ষ্ট
ষ্ঠ
ষ্ণ
ষ্প
ষ্ফ
ষ্ম
ষ্ক্র
ষ্ট্র
ষ্য
স্ক
স্খ
স্ট
স্ত
স্থ
স্ন
স্প
স্ফ
স্ম
স্র
স্ল
স্ব
স্ত্র
স্ক্র
স্ট্র
স্য
হ্ণ
হ্ন
হ্ম
হ্র
হ্ল
হ্ব
হ্য
গু
ন্তু
নু
সু
রু
রূ
দু
শু
হৃ
হু
গ্রু
গ্রূ
ব্রু
ভ্রু
ভ্রূ
শ্রু
শ্রূ
স্তু
ন্দু
ত্রু
থ্রু
থ্রূ
দ্রু
দ্রূ
ধ্রু
ধ্রূ
ল্গু
ন্ড
ন্ট
ন্ঠ
চ্ন
ট্ম
ট্ব
ড্ম
ভ্ল
ম্ত
ম্থ
ম্দ
ল্ত
ল্ধ
শ্ত

Total number of character classes to be trained:

36 (consonants) + 11 (vowels) + 10 (digits) + 6 (vowel signs that can be rendered separately) + 49 (punctuation marks and symbols) + 215 (conjuncts) + (215+36)x6 (for semi-vowels that cannot be trained individually) = 1833

Hence the character classifier for an Indic OCR needs to comb through 1833 character classes to find a character. For an English OCR, on the other hand, this number is below 50. Hence the difficulties in Indic OCR.

How to reduce number of character classes to be trained?

In my conversation with Prof. B.B. Chaudhuri I learnt techniques to reduce the number of character classes. First we need to separate a word image into three parts: top, middle and bottom. The top part will have the rising parts of vowel signs like ি ী , the middle part will have consonants, conjuncts, vowels, digits etc., and the bottom part will have the descending parts of vowel signs like ু.
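In PIL terms the split could look like this rough sketch, assuming a binarised single-word image; treating the densest row as the matraa and using a 20% ink threshold for the baseline are both guesses:

#!/usr/bin/python
import Image

im = Image.open('word.tif').convert('1')
wt, ht = im.size
pix = im.load()
rows = [sum(1 for x in range(wt) if pix[x, y] == 0) for y in range(ht)]

matra = rows.index(max(rows))   # densest row = the headline
# Baseline guess: the last reasonably inked row below the matraa.
base = max(y for y in range(matra + 1, ht) if rows[y] > 0.2 * max(rows))

top    = im.crop((0, 0, wt, matra))     # rising parts of signs like ি ী
middle = im.crop((0, matra, wt, base))  # consonants, conjuncts, digits
bottom = im.crop((0, base, wt, ht))     # descending parts like ু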

Ocropus already seems capable of achieving this. See this. The image below has segmented the rising parts of a few vowel signs separately:

If we can successfully adopt this segmentation approach, we can reduce the number of trainable character classes to around 350.
Now once we have segmented the image, how does the present Tesseract-OCR engine classify the new character classes? For example, how do we train the engine so it understands that the rising part of ি is part of another vowel sign? In any case, Tesseract only understands characters with Unicode values during training. Hence I don't think Tesseract-OCR will understand this segmentation.
So what do we do? There are 2 possibilities:

1) We use a different OCR engine. Will have to dig deeper into ocropus.
2) We use the Tesseract-OCR classifier and the 1800-odd character classes, augmented with a strong spell-checker-based correction mechanism.

I am working on the second method right now.


Sunday, April 19, 2009

What is a DAWG file?

DAWG = Directed Acyclic WordGraph

First, we'll define DAWG (skip if you know already) and cover the specifics of tesseract below.

=========== this definition by Mark & Ceal Wutka, link below ============

A Directed Acyclic Word Graph, or DAWG, is a data structure that permits extremely fast word searches. The entry point into the graph represents the starting letter in the search. Each node represents a letter, and you can travel from the node to two other nodes, depending on whether the letter matches the one you are searching for.

It's a Directed graph because you can only move in a specific direction between two nodes. In other words, you can move from A to B, but you can't move from B to A. It's Acyclic because there are no cycles. You cannot have a path from A to B to C and then back to A. The link back to A would create a cycle, and probably an endless loop in your search program.

The description is a little confusing without an example, so imagine we have a DAWG containing the words CAT, CAN, DO, and DOG. The graph would look like this:

C --Child--> A --Child--> N (EOW)
|            |
|           Next
Next         |
|            v
|            T (EOW)
v
D --Child--> O (EOW) --Child--> G (EOW)

Now, imagine that we want to see if CAT is in the DAWG. We start at the entry point (the C) in this case. Since C is also the letter we are looking for, we go to the child node of C. Now we are looking for the next letter in CAT, which is A. Again, the node we are on (A) has the letter we are looking for, so we again pick the child node which is now N. Since we are looking for T and the current node is not T, we take the Next node instead of the child. The Next node of N is T. T has the letter we want. Now, since we have processed all the letters in the word we are searching for, we need to make sure that the current node has an End-of-word flag (EOW) which it does, so CAT is stored in the graph.

One of the tricks with making a DAWG is trimming it down so that words with common endings all end at the same node. For example, suppose we want to store DOG and LOG in a DAWG. The ideal would be something like this:

D --Child--> O --Child--> G (EOW)
|            ^
Next         |
|            |
v            |
L --Child----

In other words, the OG in DOG and LOG is defined by the same pair of nodes.

=========== Creating a DAWG ============

[...] The idea is to first create a tree, where a leaf would represent the end of a word and there can be multiple leaves that are identical. For example, DOG and LOG would be stored like this:

D --Child--> O --Child--> G (EOW)
|
Next
|
v
L --Child--> O --Child--> G (EOW)

Now, suppose you want to add DOGMA to the tree. You'd proceed as if you were doing a search. Once you get to G, you find it has no children, so you add a child M, and then add a child A to the M, making the graph look like:

D --Child--> O --Child--> G (EOW) --Child--> M --Child--> A (EOW)
|
Next
|
v
L --Child--> O --Child--> G (EOW)

As you can see, by adding nodes to the tree this way, you share common beginnings, but the endings are still separated. To shrink the size of the DAWG, you need to find common endings and combine them. To do this, you start at the leaf nodes (the nodes that have no children). If two leaf nodes are identical, you combine them, moving all the references from one node to the other. For two nodes to be identical, they not only must have the same letter, but if they have Next nodes, the Next nodes must also be identical (if they have child nodes, the child nodes must also be identical).

Take the following tree of CITIES, CITY, PITIES and PITY:

C --Child--> I --Child--> T --Child--> I --Child--> E --Child--> S (EOW)
|                         |
|                        Next
Next                      |
|                         v
|                         Y (EOW)
v
P --Child--> I --Child--> T --Child--> I --Child--> E --Child--> S (EOW)
                          |
                         Next
                          |
                          v
                          Y (EOW)


What does Tesseract use DAWG for?

Tesseract uses the Directed Acyclic Word Graphs to very compactly store and efficiently search several list(s) of words. There are four DAWGs in tesseract: (right?)

  • 1. word_dawg (pre-set/fixed list read in from "tessdata/word-dawg")
    (this one is read in raw/directly for speed, user can't change this right now)
  • 2. document_words (document-words that have already been recognized)
    (built during execution; FIX: is/isn't cleared per-document/baseapi call)
  • 3. pending_words (words tess is working on, at the moment, before they are added to document_word)
  • 4. user_words (user-adjustable list read in from "tessdata/user-words")
    (add here custom words that tesseract tends to corrupt)
Disclosure: I don't know the order of preference - which DAWG does tesseract check first AND which DAWG over-rides the others. ex. "thls" is not in #1 but, say, is in #4 - will tesseract NOT jiggle the 'l' into an 'i' (which then matches in #1) or will it go with #4? Ray?

Let's say that tesseract thinks it found a word with four letters, "thls". Before this word is output, tesseract will:

  • look-up "thls" in DAWG #1 (see above)
  • (when does it check user-words?)
  • By looking through the sorted list for each of the classes, tesseract will note that the third character had a second-best choice to be an 'i' so it changes that letter and
  • look-up "this" in DAWG #1 and this time it DOES match.
  • (fmg has seen tess KEEP ON permuting even after a match in both #1 and #4 so is not sure what the ending conditions are - maybe someone who knows better can explain) which can only mean that:
  • until the certainty of the word isn't moved beyond some threshold, permuting of other letters continues...
So, the answer to "Why does tesseract bother with DAWGs" is that when a typical English word has one or two letters that have permutations possible, WITHOUT using the compact and fast DAWG's this lookup task would quickly become a huge bottle-neck.

=========== DAWG-related ToDo's ============

Todo:
Need to add info here on:
  • how to view/list words ALREADY IN "tessdata/word-dawg"
  • how to CREATE A NEW "tessdata/word-dawg"
  • which constants need to be tweaked when adding words to "tessdata/word-dawg"
  • which constants need to be tweaked when adding words to "tessdata/user-words" (because a poster on the forums said that after about 5000 words are added guano happens)
  • why/what for is rand() used in add_word_to_dawg()
  • what to do when the dreaded "DAWG Table is too full" error occurs AFTER Ray Smith's patch is already applied...


COPIED VERBATIM FROM http://tesseract-ocr.repairfaq.org/

Friday, April 17, 2009

Clipping accuracy

I had tried some time last year to push my matra clipping code to Tesseract-OCR upstream, but Ray Smith, the lead developer of the project, asked about the accuracy of the code and I never got around to calculating it. Actually, I still haven't calculated it, but I did something new.
Check the set of pictures I uploaded at . The first picture is the normal picture to be OCRed. The second picture is the clipped+thresholded image. The third image is the difference of the clipped+thresholded and thresholded images.

Here is the Python code that creates a new image out of two input images:

#!/usr/local/bin/python

import ImageChops, Image

th = Image.open("benth.tif")      # thresholded image
clip = Image.open("bentest.tif")  # clipped+thresholded image

# The difference shows exactly which pixels the clipper removed;
# invert it so those pixels show up dark on white.
new = ImageChops.difference(th, clip)
new = ImageChops.invert(new)

new.save("diff.tif", "TIFF")


I will now show this to Ray Smith. Let's see if he likes it.

My old training methodology

The principle on which this works is this: Tesseract needs two things to train itself, 1) An image of the character 2) The name of the character. This information is provided with the help of "box files". A box file contains the co-ordinates of the bounding boxes around characters, with labels as to what those characters are. The traditional method of training the engine is to take a scanned image, meticulously create a box file using some tool such as tesseractrainer.py, edit the box file, and keep doing the same for several other images and fonts. This process was tedious enough to force me to seek new methods.



Now let's do a little reverse engineering. What if we could take a list of characters in a text file, "generate" an image out of those characters, store the co-ordinates of the bounding boxes of those generated images in a file and then feed these to the OCR engine? It would work, right?



Links:

  1. http://tesseractindic.googlecode.com/files/tesseract_trainer.beta.tar.gz - The tar ball itself
  2. http://code.google.com/p/tesseractindic/source/browse/trunk/tesseract_trainer/readme - The readme file
  3. http://www.youtube.com/watch?v=vuuVwm5ZjkI - YouTube video of the tool working for Bengali


But there are problems. Tesseract-OCR has its quirks.



Tesseract wants one bounding box to enclose a single "blob" only. A blob is a wholly connected component. So ক is a blob, and ক খ are two blobs. There are cases where a consonant+vowel sign generates two blobs, for example the 3 images below have multiple blobs:









And hence Tesseract throws a "FATALITY" error during training.



So I had to change my approach a little bit. Obviously there has to be some feedback mechanism where I parse the output of Tesseract during training to see if a particular set of characters threw errors. Once I know what they are, I can separate them and train them later. To accomplish this, I changed my approach of generating a strip of character images to generating just one image per character, so I can pinpoint the problems better.



The downside: too many images get generated. To train a simple font it generates 405 images + 405 box files + 405 tr files. And all this when I have not included conjuncts yet. It is not much of a problem though, since the images generated are not required once the training files have been generated.



Well, it leads me to new challenges. I remember Prof. B.B. Choudhury saying that training all the conjuncts will kill any recogniser, i.e., it will work very slowly while recognising. He also told me some cool ways to get past that. May have to implement that. Let's see.

My training methodology does not work :(

As much as I hate to admit it, my training methodology of generating one image per akshar does not work. I hate to say it, since I put some effort into writing the Python code that does this.
Well, the reason is probably that the Tesseract OCR training code looks for characters on a single line during training, as it also extracts baseline metrics for rare/strange characters like numerals. As such, it may not be able to extract all the information it needs for its training.
Or maybe the Tesseract OCR training code accepts only a small number of .tr files, and since my code generates thousands of them, it becomes useless.
Let me show you an example of how miserably it failed.
I decided to test the training on the string " ভারত মাতা " (Bharat Mata, which means Mother India). I generated the tiff image using Pango rendering.
Then I generated 7 images per sample of ভ র ত ম and used the subsequently generated training files for OCR.
The result was this: " মভতভ ভভভভ "
Yes, I know. The result is absolutely outrageous.
However, what if I still autogenerate images of characters but this time in single lines adjacently? Will it work?

Tuesday, April 14, 2009

Image degradation

I added some code. The pango rendering works perfectly now. Also 1 pure image and 4 partially erased images are created per character.
The degradation has been chosen to suit the code that clips matraas. The only degradation seen is a vertical white strip overlapping certain characters.
Hence the same is done while generating training images.
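For reference, the erasure itself is tiny; a sketch of producing one degraded copy (file names and strip width are made up):

#!/usr/bin/python
import Image, ImageDraw
import random

im = Image.open('ka.tif')
draw = ImageDraw.Draw(im)
# Erase one narrow vertical strip, mimicking what the matraa clipper
# does to real input.
x = random.randint(0, im.size[0] - 3)
draw.rectangle((x, 0, x + 2, im.size[1]), fill=255)
im.save('ka_degraded.tif', "TIFF")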

[1] http://code.google.com/p/tesseractindic/source/detail?r=41

Saturday, March 21, 2009

Pango Magic

I was getting crappy rendering through normal PIL for Bengali conjuncts.


I then remembered that Sayamindu had sent me some Pango/Cairo code to help me out with rendering. I modified it somewhat and now I am getting this:


It's 6:21 AM now. Need to go to sleep. Shall work later today.


Bengali Conjuncts

I suddenly started looking for all the Bengali conjuncts in one single place on the web. After a lot of searching I found http://www.stat.wisc.edu/~deepayan/Bengali/FreeBangTemplate/juktolist.txt.
I downloaded the file, removed all the comments and used this Python code to get a list of all conjuncts:

f = open("juktolistocr.txt", 'r')
fout = open("conjuncts.txt", 'w')

for line in f.readlines():
    # Each line looks like "<name> = <conjunct>"; keep the right-hand side.
    conjunct = line.split('=')[1]
    conjunct = conjunct.strip()
    fout.write(conjunct + "\n")
    print conjunct

In case you want a list of Bengali conjuncts too, download it from http://debayanin.googlepages.com/bengali_conjuncts.txt

Sunday, March 15, 2009

Recent Commits

Check out http://code.google.com/p/tesseractindic/source/

Observations

1) Looks like creating per-font training file sets makes more sense.
2) Omni-font training file sets make no sense. The same symbols may look remarkably different in different fonts. Also, the size of such a training set would be huge. The recogniser would "go to sleep", in the words of Prof. B.B. Choudhuri.
3) Must reduce training fatalities to zero. Strange errors are creeping in. Yet to figure out why.
4) How to generate test images? Create a few text files? Use PIL draw() to generate images? Use a text editor manually?

What after all this?

1) A nice how-to. Points to cover in the how-to:

* How to run using existing training files.
* How to create new training files.

Transferred Entries

March 5, 4:58 AM



I am dealing with the problems mentioned in the last post. There are several fatalities in training, but I have been successful in weeding them out using the following piece of code:




qpipe = os.popen4(exec_string1)
o = qpipe[1].readlines()  # the trainer's console output

pos = str(o).find('FAILURE')
if pos > 0:
    # Record the character whose box file triggered the failure.
    fileout = open("error", 'a')
    filein = open("beng.images/" + box, 'r')
    linein = filein.readline()
    fileout.write(linein + "\n")
    filein.close()
    fileout.close()

It parses the output of the string executed by popen4() and looks for characters that failed while training. It writes those characters to a separate file.

I just need to work on generating one set of really good and flawless training data.






March 4, 5:07 AM



It's been more than a month since I recorded any of my work here, but I *have* been working and there are lots of updates.



We now have 3 project members. Jinesh from IIIT Hyderabad wishes to add Malayalam support to TesseractIndic. Baali (Shantanu) from Sarai, Delhi wishes to add Devanagari support. So finally I am not working alone.



Also, around a month back I had gone to ISI Kolkata to consult Prof. B.B. Chaudhuri. I had mailed Dr. B.B. Chaudhuri of ISI Kolkata to help me out regarding training data and testing ground truth data, and thanks to Mr. Gora's recommendation, he allowed me to meet him in Kolkata. I met him at ISI and spoke to him for about 40 minutes regarding different issues in Indic OCR. He discussed some really good ways to significantly reduce recognition time etc., and rued the lack of good research assistants.

He could not share data with me at that moment because of lack of manpower and copyright issues. I was returning dejected, but I met Dr. Mandar Mitra at the gates. Mr. Sankarshan, my mentor who originally started me down this path, had introduced me to him in Kolkata about a month back. It really helped, and he took me to his lab. He mined his last 7-8 years of work and gave me everything useful he had, including a lot of ground truth data and some images with bounding box information.

But that is not the most important thing I acquired there. While talking to Mandar Mitra, it dawned on us that the entire training process can be automated using Python scripts, and there is no need of manually feeding data using scanners and all.






And yeah, Sarai cheque arrived. :)






January 29, 3:16 AM




for akshar in alphabets:
    draw.text((x, y), unicode(akshar, 'UTF-8'), font=font)
    leftx = x - 20  # the left end of the small bounding box
    # The box in the big image within which the small image of interest lies.
    box = (leftx, y, x + 100, y + 60)
    sub_im = im.crop(box)
    # Bounding box of the glyph's pixels within the sub image, not the big image.
    bbox_sub = sub_im.getbbox()
    # Translate back to coordinates relative to the big image.
    bbox_im = (leftx + bbox_sub[0], y + bbox_sub[1], leftx + bbox_sub[2], y + bbox_sub[3])
    draw.rectangle(bbox_im)
    print bbox_im



This block of code was instrumental in giving this:

January 25, 4:54 PM



I am so excited.




#!/usr/local/bin/python
#-*- coding:utf8 -*-

import ImageFont, ImageDraw
from PIL import Image

im = Image.new("RGB", (400, 400))

draw = ImageDraw.Draw(im)

# use a truetype font
font = ImageFont.truetype("/usr/share/fonts/truetype/ttf-bengali-fonts/lohit_bn.ttf", 50)

txt1 = "ক"
txt2 = " ি"
txt = txt2 + txt1

draw.text((10, 10), unicode(txt, 'UTF-8'), font=font)
im.show()

generated:

That means the *entire* training+testing process can be automated :) :) :)



Damn, I didn't even have to go to ISI. Of course, going to ISI was quite an experience in itself.









January 10, 2009



08:57 PM




Forwarded conversation
Subject: Regarding Bangla training data
------------------------

From: Debayan Banerjee<debayanin@gmail.com>
Date: 2009/1/9
To: mhasnat@gmail.com


Hi,
I was going through your work on ocropus and your training data. I have a few questions:

  1. Can you share with me your training data?
  2. Have you trained only for Solaiman lipi font?
  3. The maatraa-clipping code in lua, what is the logic/pseudocode?
  4. What is the performance of the lua scripts on the 18 test images?

Eagerly waiting for your reply.

~Debayan

--
BE INTELLIGENT, USE GNU/LINUX
http://lug.nitdgp.ac.in
http://mukti09.in
http://planet-india.randomink.org



----------
From: Hasnat<mhasnat@gmail.com>
Date: 2009/1/10
To: Debayan Banerjee <debayanin@gmail.com>


Dear Debayan,

sorry for my late reply because of my traveling from Bangladesh to
outside. By training data do you mean it for OCROpus or tesseract? For
OCROpus we have created training data which was a complex process. I
worked on that few months ago to test the training procedure working
for Bangla script. I observed that this is quite complex process to
prepare training data. Few more work need to be done to complete this.
However, we (me and shouro) have tested basic training procedure and
observed the performance which seems satisfactory to me but very
sensitive. To make that training data very effective we had to collect
a large amount of data and train. As the segmentation algorithm was not
completed at that time as well as OCROpus was changing its procedures
continuously, so I left that task until the next stable version of
OCROpus (0.3). Now we again start looking at that and just finished the
basic compilation and checking other procedure to integration. So,
honestly there is no training data for version 0.3.

We are considering SuttunyMJ font for training which is the most widely used font in the Bangla documents.

I
didn't integrate any Matraa clipping code in Lua script. Rather I was
focusing on embedding our own procedures with C++ files which is not
completed yet. Matraa clipping is a big problem what I observed if you
follow the general procedures. I have tested three different methods
and observed that nothing is giving 100% accuracy for all types of
documents. I think it's a big deal to solve yet.

I have tested the images for tesseract. The test images is not
following the training document font size and type. Hence for different
images we are getting different results. From the feedback of different
people at the end user level I have the realization that we have to
work more for a market place standard OCR.

I will return back to my country at the end of this month and start
working on OCR. Then can concentrate on these issues and hopeful to
find out solutions. As we have implemented the complete framework so it
will be easier for us to solve the particular problems. Please do share
your work with us and you can find our work on the web link.

Regards,
--
Hasnat
Center for Research on Bangla Language Processing (CRBLP)
http://mhasnat.googlepages.com/







December 22



05:07 hrs.






This is the Lua script in the ocropus 0.3 release that deskews a page image. It did not work for me; it kept giving this error:



ocroscript: ocroscript/scripts/deskew.lua:9: attempt to call global 'make_DeskewPageByRAST' (a nil value)
stack traceback:
ocroscript/scripts/deskew.lua:9: in main chunk
[C]: ?



I used Google and found this. It worked well. The change is:



-proc = make_DeskewPageByRAST()
+proc = ocr.make_DeskewPageByRAST()

input = bytearray:new()
output = bytearray:new()

-read_image_gray(input,arg[1])
+iulib.read_image_gray(input,arg[1])

proc:cleanup(output,input)

-write_png(arg[2],output)



Result is:

tilt

tilt1


October 28



My work till date:



Author: debayanin
Date: Mon Oct 27 16:41:10 2008
New Revision: 8

Modified:
trunk/ccmain/baseapi.cpp

Log:
auto-indented baseapi.cpp

The two hunks (@@ -409,161 +409,161 @@ and @@ -573,101 +573,101 @@) re-indent the three functions I had added earlier. As they now stand in trunk/ccmain/baseapi.cpp:

////////////DEBAYAN//Deskew begins//////////////////////
void deskew(float angle, int srcheight, int srcwidth)
{
  //angle=4; //45° for example
  IMAGE tempimage;
  IMAGELINE line;

  //Convert degrees to radians
  float radians = (2 * 3.1416 * angle) / 360;

  float cosine = (float)cos(radians);
  float sine = (float)sin(radians);

  //Corners of the source rectangle after rotation.
  float Point1x = (srcheight * sine);
  float Point1y = (srcheight * cosine);
  float Point2x = (srcwidth * cosine - srcheight * sine);
  float Point2y = (srcheight * cosine + srcwidth * sine);
  float Point3x = (srcwidth * cosine);
  float Point3y = (srcwidth * sine);

  float minx = min(0, min(Point1x, min(Point2x, Point3x)));
  float miny = min(0, min(Point1y, min(Point2y, Point3y)));
  float maxx = max(Point1x, max(Point2x, Point3x));
  float maxy = max(Point1y, max(Point2y, Point3y));

  int DestWidth = (int)ceil(fabs(maxx) - minx);
  int DestHeight = (int)ceil(fabs(maxy) - miny);

  tempimage.create(DestWidth, DestHeight, 1);
  line.init(DestWidth);

  for (int i = 0; i < DestWidth; i++) {  //A white line of length = DestWidth
    line.pixels[i] = 1;
  }
  for (int y = 0; y < DestHeight; y++) {  //Fill the destination image with white, else clipmatra won't work
    tempimage.put_line(0, y, DestWidth, &line, 0);
  }
  line.init(DestWidth);

  //Fill the destination image pixels from the corresponding source pixels (inverse mapping).
  for (int y = 0; y < DestHeight; y++) {
    for (int x = 0; x < DestWidth; x++) {
      int Srcx = (int)((x + minx) * cosine + (y + miny) * sine);
      int Srcy = (int)((y + miny) * cosine - (x + minx) * sine);
      if (Srcx >= 0 && Srcx < srcwidth && Srcy >= 0 && Srcy < srcheight) {
        line.pixels[x] = page_image.pixel(Srcx, Srcy);
      }
    }
    tempimage.put_line(0, y, DestWidth, &line, 0);
  }

  //tempimage.write("tempimage.tif");
  page_image = tempimage;  //Copy the deskewed image to the global page image, so it can be worked on further
  tempimage.destroy();
  //page_image.write("page_image.tif");
}
/////////////DEBAYAN//Deskew ends/////////////////////

////////////DEBAYAN//Find skew begins/////////////////
float findskew(int height, int width)
{
  int topx = 0, topy = 0, sign = 0, count = 0, offset = 1, ifcounter = 0;  //sign must start at 0
  float slope = -999, avg = 0;
  IMAGELINE line;
  line.init(1);
  line.pixels[0] = 0;

  ///////Find the top-most point of the page: begins///////////
  for (int y = height - 1; y > 0; y--) {
    for (int x = width - 1; x > 0; x--) {
      if (page_image.pixel(x, y) == 0) {
        topx = x; topy = y;
        break;
      }
    }
    if (topx > 0) { break; }
  }
  ///////Find the top-most point of the page: ends///////////

  ///////To find pages with no skew: begins//////////////
  int c1 = 0, c2 = 0;  //both run-length counters must start at 0
  for (int x = 1; x < .25 * width; x++) {
    while (page_image.pixel((width / 2) + x, c1++) == 1) { }
    while (page_image.pixel((width / 2) - x, c2++) == 1) { }
    if (c1 == c2) { cout << "0 ANGLE\n"; return (0); }
    c1 = c2 = 0;
  }
  ///////To find pages with no skew: ends//////////////

  cout << "width=" << width;
  if (topx > 0 && topx < .5 * width) { sign = 1; }
  if (topx > 0 && topx > .5 * width) { sign = -1; }

  if (sign == -1) {
    while ((topx - offset) > width / 2) {
      while (page_image.pixel(topx - offset, topy - count) == 1) {
        //page_image.put_line(topx-offset,topy-count,1,&line,0);
        count++;
      }
      if ((180 / 3.142) * atan((float)count / offset) < 10) {
        slope = (float)count / offset;
        ifcounter++;
        avg = (avg + slope);
      }
      count = 0;
      offset++;
    }
    avg = (float)avg / ifcounter;
    //cout<<"avg="<<avg<<"\n";
    page_image.write("findskew.tif");
    //cout<<"(180/3.142)*atan((float)(count/offset)="<<(180/3.142)*atan(avg)<<"\n";
    return (sign * (180 / 3.142) * atan(avg));
  }
  if (sign == 1) {
    while ((topx + offset) < width / 2) {
      while (page_image.pixel(topx + offset, topy - count) == 1) {
        //page_image.put_line(topx+offset,topy-count,1,&line,0);
        count++;
      }
      if ((180 / 3.142) * atan((float)count / offset) < 10) {
        slope = (float)count / offset;
        ifcounter++;
        avg = (avg + slope);
      }
      count = 0;
      offset++;
    }
    avg = (float)avg / ifcounter;
    //cout<<"avg="<<avg<<"\n";
    page_image.write("findskew.tif");
    //cout<<"(180/3.142)*atan((float)(count/offset)="<<(180/3.142)*atan(avg)<<"\n";
    return (sign * (180 / 3.142) * atan(avg));
  }

  return (0);  //sign == 0: no ink found, treat as no skew
}
////////////DEBAYAN//Find skew ends///////////////////

//used only if the language belongs to devnagri, eg, ben, hin etc.
void TessBaseAPI::ClipMaatraa(int height, int width)
{
  IMAGELINE line;
  line.init(width);
  int count, count1 = 0, blackpixels[height - 1][2], arr_row = 0, maxbp = 0, maxy = 0, matras[100][3], char_height;
  //cout<<"Connected Script="<<connected_script<<"\n";

  //Record, for every sufficiently inky row, its black pixel count.
  for (int y = 0; y < height - 1; y++) {
    count = 0;
    for (int x = 0; x < width - 1; x++) {
      if (page_image.pixel(x, y) == 0) { count++; }
    }
    if (count >= .05 * width) {
      blackpixels[arr_row][0] = y;
      blackpixels[arr_row][1] = count;
      arr_row++;
    }
  }
  blackpixels[arr_row][0] = blackpixels[arr_row][1] = '\0';

  for (int x = 0; x < width - 1; x++) {  //Black line
    line.pixels[x] = 0;
  }

  ////////////line_through_matra() begins//////////////////////
  count = 1;
  //cout<<"\nHeight="<<height<<" arr_row="<<arr_row<<"\n";
  char_height = blackpixels[0][0];  //max character height per sentence
  //cout<<"Char Height Init="<<char_height;
  while (count <= arr_row) {
    //if(count==0){max=blackpixels[count][0];}
    if ((blackpixels[count][0] - blackpixels[count - 1][0] == 1) && (blackpixels[count][1] >= maxbp)) {
      maxbp = blackpixels[count][1];
      maxy = blackpixels[count][0];
      //cout<<"\nMax="<<maxy<<" bpc="<<maxbp;
    }
    if ((blackpixels[count][0] - blackpixels[count - 1][0]) != 1) {
      /////////////drawline(max)//////////////////////
      //cout<<"\nmax="<<maxy<<" bpc="<<maxbp;
      //page_image.put_line(0,maxy,width,&line,0);
      char_height = blackpixels[count - 1][0] - char_height;
      matras[count1][0] = maxy; matras[count1][1] = maxbp; matras[count1][2] = char_height; count1++;
      char_height = blackpixels[count][0];
      ////////////drawline(max)/////////////////////
      maxbp = blackpixels[count][1];
    }
    count++;
  }
  matras[count1][0] = matras[count1][1] = matras[count1][2] = '\0';

  //delete blackpixels;
  ////////////line_through_matra() ends//////////////////////

  ////////////clip_matras() begins///////////////////////////
  for (int i = 0; i < 100; i++) {  //where 100 = max number of sentences per page
    if (matras[i][0] == '\0') { break; }
    //cout<<"\nY="<<matras[i][0]<<" bpc="<<matras[i][1]<<" chheight="<<matras[i][2];
    count = i;
  }

  //For every column that has ink on a matra, measure the run of white pixels
  //below it; a long white run means the column lies between two characters.
  for (int i = 0; i <= count; i++) {
    for (int x = 0; x < width - 1; x++) {
      if (page_image.pixel(x, matras[i][0]) == 0) {
        count1 = 0;
        for (int y = 0; y < matras[i][2] && count1 == 0; y++) {
          if (page_image.pixel(x, matras[i][0] - y) == 1) {
            count1++;
            for (int z = y + 1; z < matras[i][2]; z++) {
              if (page_image.pixel(x, matras[i][0] - z) == 1) { count1++; }
              else { break; }  //black pixel encountered: stop counting
            }
          }
        }
        //cout<<"\nWPR @ "<<x<<","<<matras[i][0]<<"="<<count1;
        if (count1 > .8 * matras[i][2]) {
          line.init(matras[i][2] + 5);
          for (int j = 0; j < matras[i][2] + 5; j++) { line.pixels[j] = 1; }
          page_image.put_column(x, matras[i][0] - matras[i][2], matras[i][2] + 5, &line, 0);
        }
      }
    }
  }

  page_image.write("bentest.tif");
  ////////////clip_matras() ends/////////////////////////////
  /////////DEBAYAN/////////////////
}







October 22



It's 4:45 AM.



Task number one for Indic OCR workout participants at foss.in 2008



Implement deskewing code (basically, straightening a tilted image) in any language of your choice. The algorithm may be any good standard one. The image to be tested on is this.







Then mail it to me at debayanin AT gmail DOT com, or post it on any mailing list.



I think Hough transforms would be the best way. I have been facing some difficulty implementing this in Python for the test images, but the theory is sound and should ultimately give good results.
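For anyone attempting the task, here is a minimal sketch of the Hough idea in Python (numpy + PIL; the function name and thresholds are mine, not code from any release). Every ink pixel votes, per candidate angle, into a one-dimensional rho accumulator; the angle whose accumulator peaks most sharply is the orientation of the dominant straight structures on the page, i.e. the text lines and maatraas.

import numpy as np
from PIL import Image

def estimate_skew(path, max_angle=10.0, step=0.1):
    # Binarise: dark pixels become True (ink).
    ink = np.array(Image.open(path).convert("L")) < 128
    ys, xs = np.nonzero(ink)
    best_angle, best_peak = 0.0, -1
    for deg in np.arange(-max_angle, max_angle + step, step):
        a = np.deg2rad(deg)
        # Perpendicular distance of every ink pixel from the origin, for a
        # family of lines at this angle; collinear pixels share one rho bin.
        rho = (xs * np.sin(a) + ys * np.cos(a)).astype(int)
        peak = np.bincount(rho - rho.min()).max()
        if peak > best_peak:  # sharpest peak = longest straight line
            best_peak, best_angle = peak, deg
    return best_angle

Rotating the page by the returned angle (or its negative, depending on your rotation convention; verify on a test image) straightens it, e.g. Image.open(path).rotate(angle, expand=True, fillcolor=255) in recent Pillow. The white fill matters, because the maatraa clipping stage assumes a white background.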






October 14 2008



It's 5:05 AM. Feels so nice to revisit this page after 4 long months. Have seen a lot in these 4 months. Hmm...



So all stuff added to this page will act as a reference for the workout proposed at foss.in 2008, and also for the Sarai FLOSS fellowship (which I am not sure about yet).



Note: TesseractIndic is Tesseract-OCR with Indic script support. This will remain a separate project until Tesseract-OCR actually decides to accept patches and merge Indic script support. TesseractIndic can be found here.



So let's see where we stand. We have Tesseract-OCR, which works great for English. I managed to apply "maatraa clipping" (which is a new term/approach in the world of OCR, I think!) successfully, as a proof of concept, to the image being fed to the Tesseract OCR engine. Accuracy obtained by this method, along with some really crappy training, stands at about 85%.



A standard OCR process contains the following steps:



(1) Pre-processing, involving skew removal, etc. Pretty much language-independent, though features like the shirorekha might help here.

(2) Character extraction: again largely language-independent, though language dependency might come in because of features like the shirorekha.

(3) Character identification: language-independent, maybe with specialised plugins to take advantage of language features, or items like known fonts.

(4) Post-processing, which involves things like spell-checking to improve accuracy.



The currently available version of Tesseract-OCR does steps 3 and 4 above for any language. But it can only do that if it can do step 2 properly, which it can't for connected scripts like Hindi, Bengali etc. So the approach is to take the scanned image, apply some pre-processing to it, do the "maatraa clipping" operation on it, and then feed the result to the Tesseract-OCR engine (a rough end-to-end sketch follows the to-do list below).



In detail, the things to do are:



(1) Pre-processing: Skew removal, Noise removal. Skew removal in particular is key for the "maatraa clipping" code to work.



(2) "maatraa clipping" : This enables the Tesseract-OCR engine to treat Devnagri connected script like any other script.



(3) Training: Very important for getting good results, but well documented. Good tools exist for training Tesseract-OCR.



(4) Web Interface: We need to create a web interface so people can freely OCR their documents online. No big deal.
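As a rough end-to-end sketch of how these pieces will chain together (Python; the temporary file paths are placeholders, and I am assuming the usual Tesseract 2.x invocation, tesseract <image> <outputbase> -l <lang>):

import subprocess
from PIL import Image

def ocr_page(path, lang="ben"):
    img = Image.open(path).convert("L")
    # Steps (1) and (2) go here: deskew the page, then clip the maatraas,
    # before the image ever reaches Tesseract's blob finder.
    img.save("/tmp/page.tif")
    subprocess.run(["tesseract", "/tmp/page.tif", "/tmp/page", "-l", lang],
                   check=True)
    # Tesseract writes the recognised text to <outputbase>.txt.
    with open("/tmp/page.txt", encoding="utf-8") as f:
        return f.read()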



Now my intention is to implement skew removal using Hough transforms. Hough transforms are really good at finding straight lines (among other shapes) in images. So all we need to do is find the "maatraas" and calculate their slope. That gives us the skew angle, and we just rotate the page to correct the skew (see the Hough sketch under October 22 above).



I had implemented "maatraa clipping" using projection based methods. It seems there is a better digital image processing method called "Morphological Operations" that is a better way of doing it. Well, actually i am not that sure about it yet. Still researching and trying out stuff.



Now, I had done all this work in C++, as the Tesseract-OCR code is also in C++. But, of late, I have been mesmerised by the simplicity and power of Python, and the Python Imaging Library. All the work I am doing now, including Hough transforms, is in Python. So now we have 2 options:



(1) Do the pre-processing and "maatraa clipping" in Python and feed the page to Tesseract-OCR (easier and quicker to implement)



(2) Do the entire thing in C++ (will execute much faster)



Again, we will probably end up doing both. At foss.in, I will probably bring along Python code that already works, and ask people to port it to C++ and merge it upstream into TesseractIndic. Or we could ask people to implement algorithms of their choice, in the language of their choice, on a common set of test images, and then we shall convert that stuff to C++ and add it.



Will go to sleep now. This page will keep growing on a daily basis now, so keep checking it.



PS: Special thanks to Mr. Sankarshan and Mr. Gora Mohanty for supporting me throughout.






June 24 2008



It's 01:45 AM. Here is a mail I received from Mr. Gora Mohanty, in reply to my last post below (23 June) and to a mail I sent to the Aksharbodh mailing list:





1. What were the issues with displaying Bengali fonts on Linux? Sample
images would help, as you do not give enough details to go on. People
on the indlinux-group mailing list (go to http://indlinux.org and the
Mailing Lists link towards the bottom to subscribe), and the Ankur
Bangla folk ought to be able to help you out here.
What GUI toolkit is tesseractTrainer.py using? Both GTK and Qt should
work fine for Bengali text, at least in any UTF-8 locale.

2. Could you give figures for % accuracy? Not all of us can read Bangla.

3. Is there any documentation on what training involves, and what kind
of training text you need? Could you use the copious amount of Bangla
text in the GNOME/KDE .po files for Bengali?


Regards,

Gora



Ans 1) The issue with displaying Bengali fonts in tesseractTrainer.py has mysteriously solved itself!! It does display Bengali text now. Here is a screenshot:





It shows that I can type and read Bengali anywhere on my Linux installation now, including gedit, Mozilla, the terminal and tesseractTrainer.py.






Ans 2) The two test images below gave an accuracy of 89% and 85% respectively. These figures are not exact; I just did a quick one-time count of the errors. Some of the errors occurred because I haven't trained the engine with the particular character yet, and some because I fed it the wrong character.






Ans 3) Well, the entire training process is mentioned at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract, but what I really need is an image, either scanned or electronically rendered from some text editor, that contains as many samples of characters as possible, including conjuncts for every character, and then its box file. So the steps are:



1) Type all the possible characters + conjuncts in an editor (for example, type ক কি কী কা কু ক, then খ খা খী খি etc.). Increase the font size a little bit.



2) Use screen capture software, and from the generated image crop out the part that contains the characters and nothing else.



3) Convert the image to tif format. Then run it through tesseract downloaded from here using this command:



tesseract fontfile.tif fontfile batch.nochop makebox







where fontfile is the name of the image. This will create a new file named fontfile.txt; change its extension to .box.



4) Now download tesseractTrainer.py or bbtesseract, open the image in it, and edit the box file.



5) Mail the image and box file to me.
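If you would rather skip the screen-capture step, you can render the characters straight to a TIFF with the Python Imaging Library (a sketch; the font path is a placeholder for whatever Bengali TrueType font you have, and note that PIL may not shape conjuncts correctly unless it was built with complex-text support, so inspect the output before making box files):

from PIL import Image, ImageDraw, ImageFont

def render_training_image(text, font_path, out="fontfile.tif", size=48):
    font = ImageFont.truetype(font_path, size)
    img = Image.new("L", (2000, 200), color=255)  # white canvas
    ImageDraw.Draw(img).text((20, 50), text, font=font, fill=0)
    img.save(out)

render_training_image("ক কা কি কী কু খ খা খি খী", "/path/to/bengali.ttf")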



June 23 2008



It's 11:54 AM. Have been training TesseractIndic with images of Bengali text. A lot of time was spent fixing segfaults, memory allocation errors etc., which did slow me down. I also have my college placements to prepare for, so I could not devote a lot of time.



Training was made difficult by the fact that tesseractTrainer.py did not display Bengali fonts properly on Linux, although I have the locales set properly and can type in Bengali anywhere on my Linux installation. Initially I had to frequently swap between Windows and Linux, since I was using bbtesseract, a utility that edits box files for Tesseract training images. Both utilities are very useful, and I can imagine how hard it would have been without them.



The test results are poor as of now. I haven't trained it properly, and the maatraa clipping code has to be improved for the results to be of acceptable quality.



Here are the three images I trained it with:












The corresponding box files are at http://tesseractindic.googlecode.com/files/imges_boxes.tar.gz .



If you want to help out by training it, download this, and then follow this.



The patches here helped a lot; I do not know why these haven't been integrated into Tesseract. Also, I managed to get a 152,000-word wordlist from the Ankur Bangla project. It will improve the accuracy a lot. Initially it had some strange characters, but I cleaned it with sed and merged all the words into one big file.



I need people to start submitting training data, either to me or to the Tesseract group. I will make a few changes to the maatraa clipping code and mail the patch to Ray Smith. Let's see what happens.



Initial results are here:








OCRed text:



খযাব শনবীর খরওজায়
মাওলানা েমাহাম্‌মদ অাব্‌দুল েমাতালিব খশহীদ
েক অাছ ভাইখ নবীর েপ্রমিক চল নিডয় অামাডক
সাত সমুদ্র েপড়িেয় যাব নূর নবীজির রওজােত া ঐ
পাপি অামি কি েয করি রওজায় য৮ব েনই েকা কড়ি (২ব৮রহ
পাখা যদি থাকত অামার ভর করিতাম পাখােত া ঐ
দয়াল নবী দয়ায় ভরা েপালােমের দাওেপা ধরা (২বারহ
মাওক নিনা অাডশক হডয় েকমডন থযাব জারIডত া ঐ
ওেগা নবী গেলর মালা দ্রহর কের খদাও দ্রমদ্রেনদ্রর খজাত াা (২বা রদ্রহ
ঠভাইদ্র দিডয় চরন তডল বনর কর অামাডক া ঐ
পবীর রাডত কনদি বডস নবীজিডক পাইব৮র ত াাডশ (২বারহঠ
নবীজির দানি রিডন চাইখনা েযডত জায়াডত া ঐ
অবীন শহীেদ ভােক েক যাও েতারা রওজা পােক (২বারহ
অামায় সােথ নিেয় চল যাব নবীর খজার া েত I ঐ



Accuracy: 89% approx








OCRed text:



জন গণ মন অধি নায়ক জয় েহ
ভ\রত ভIগ্য বিধাত\I
পঞ্জ\ব গুজর\ত মর\ঠ\
দ্রারিড় উহকল বতমা া
বিংধ্য হিম\চল য়মুIা গংা
উচ্ছল জদ্রনরি তরহপা া
তব মুভ নাহম জােগ মু
তব গুভ অ\াদ্রাধ ম\াগ ৮
গাাহ তব ড়াব গ\থাI
জন গণ মংলদ\য়ক জয় েহ
ভ\রত ভদ্রগ্য বিধাতাI
ভহয় ৮ঠহ া জয় চহ ই ভহয় াঠহা
জয় জহা ড়ায় জয় েহ ভ



Accuracy: 85% approx



Ya ya, I know. It's not that great. But it's only going to get better, and I didn't train it properly, so be cheerful!!



Go to http://code.google.com/p/tesseractindic for all the downloadable stuff.



(PS: Some of my friends seem to think I made this software. Well, for them, I would like to say this: I am trying to add a pinch of salt to an already cooked and fine meal. Nothing more and nothing less :) )



June 12 2008



It's 11:50 AM. The latest work is here. Download is here. Patch here.



Next is training, the most important part; it will finally make this usable.
Tom from OCRopus/Tesseract has been kind enough to help me out with maatraa clipping after going through some of my work.



June 8, 2008



It's 11:40 PM. This is my first release of the Indic-script-supported Tesseract OCR engine. Download the tar.gz file, or the patch if you already have Tesseract 2.03.



The release is very much in its alpha stage right now. In fact, after downloading the archive, you will have to train the engine with your language of choice. Here is how to train it. I will soon add the complete archive, with training data for Bengali, and later for Hindi.



You *must* download this English training data for the engine to recognise English text. My advice is: wait a few more days until I release a fully working version. I have also applied for SourceForge hosting space.



I will mail the patches to the Tesseract maintainers only after I have the training data ready.



And I have decided I will go to sleep by 4:30 AM every day.



June 7, 2008



It's 7:54 AM. From now on I shall document everything in a fairly formal manner. Here is the algorithm for the maatraa clipping code.



In a few days I shall provide the diff file, i.e. the code itself.





June 3, 2008



Spent the entire night experimenting with the code. It's 05:28 hrs now. No big deal though; I usually sleep at 6:30 in the morning.



Made my first box file for the "national anthem" image. I first made it with the normal Tesseract engine; as expected, it classified whole words as blobs. Then I made box files after adding my code, and I was delighted by the results. Here are the related screenshots.



The first two images show boxes made by bbtesseract, which uses box files generated by the generic Tesseract engine.
















For all the boxes generated below, bbtesseract used the box files generated by the modified Tesseract engine, which includes the maatraa clipping code. The results are quite good; character segmentation is accurate.




















Below are the incorrect segmentations, which show flaws in the algorithm I came up with (which I am sure any other guy would also come up with, hence I am not boasting).
















The image below suffers the same boxing problem. The solution to this problem is still not apparent, because I do not know how I can box the hossoi, since it overlaps the ordinate of the second letter. I will figure something out though :).









Here is another problem, the opposite of the previous one: here the loop below the "ga" overlaps the ordinate of the 2nd letter, but at the bottom.







Same problem.





What really remains is training the engine, which is simple, and all of a sudden we have a working, free, Indic OCR engine!!





Will sleep now, and update this page later.










May 27, 2008



I have had this urge to work on digital image processing/OCR-related projects since my 2nd year of college. That was when some professors from ISI Kolkata came down to our college for a workshop on DIP. Now, after 18 months or so, I have finally done something in that direction.



Tesseract was developed by HP Labs and then transferred to Google, which open-sourced it. Some parts of it are still proprietary, like the feature recognition algorithms, and it is covered by the Apache license, but it was great for me to work on. The two main developers of Tesseract are Ray Smith and Luc Vincent, both legends, and engineers at Google.



Well, let's get down to the problem. Tesseract currently does not support connected scripts or handwritten text. Indic scripts such as Bengali and Devanagari (used for Hindi) have a matra (মাত্রা), which is like an underline, but on top of the word rather than under it. The Tesseract engine recognises machine-printed English, but it relies on the gaps between successive characters in English to classify each character into a blob. Hence, theoretically, every isolated character is a blob.



The problem with classifying these scripts is that the matra connects the entire word, and hence the character recognition system fails. A solution could be found if the matras were clipped between two successive characters. That way, the same engine could treat each character as an isolated blob.



The steps involved were:

  1. Go through the Tesseract source code and identify the place where this could be added.

  2. Once the function(s) had been identified, work out an algorithm that would allow us to clip the matras.

The algorithm itself is as follows (a Python sketch of these steps appears right after the list):

    • Threshold the image. Tesseract already takes care of this.

    • Read each row of the image, starting from the bottom (y=0), and note the black pixel count of each row.

    • Find the row with the maximum black pixel count between two successive zones of zero black pixel count. This row is the matra.

    • Do the same for the entire page and store the Y co-ordinate of each matra found.

    • Now take such a matra Y co-ordinate. Iterate over each X co-ordinate and note the length of the continuous run of white pixels under the matra. If this run is longer than about 90% of the character height, the column is a region between two characters. Clip this column and the matra above it.

[Image: black lines between successive characters signify that these spaces have been marked for clipping.]

    • Proceed in the same manner for every matra on the page.
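Here is the same algorithm as a compact Python sketch (numpy; it uses top-origin coordinates, unlike Tesseract's bottom-origin page_image, and the 0.05 / 0.8 thresholds mirror the constants the later ClipMaatraa revision of the patch uses, see the October 28 entry above):

import numpy as np

def clip_maatraas(ink, min_ink=0.05, white_frac=0.8):
    # ink: 2-D boolean page, True = black pixel, row 0 = top of the page.
    h, w = ink.shape
    rows = ink.sum(axis=1)
    text = rows >= min_ink * w              # rows that belong to a text line
    out = ink.copy()
    bands, start = [], None
    for y in range(h + 1):                  # contiguous bands of text rows
        if y < h and text[y]:
            if start is None:
                start = y
        elif start is not None:
            bands.append((start, y))
            start = None
    for y0, y1 in bands:
        matra = y0 + int(np.argmax(rows[y0:y1]))  # densest row = the matra
        char_h = y1 - y0
        for x in range(w):
            below = out[matra + 1 : min(matra + char_h, h), x]
            # Ink on the matra with almost nothing underneath means this
            # column is a gap between two characters: clip the matra here.
            if out[matra, x] and below.size and (~below).mean() > white_frac:
                out[max(matra - 2, 0) : matra + 3, x] = False
    return out

The sketch clears only a few pixels around the matra; the C++ version whites out a taller column (character height + 5 pixels), which comes to the same thing, since such a column is already blank apart from the matra.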





Here is a tif image. It contains India's national anthem in Bengali.





Here is the image after clipping the matras:





If you look carefully, you will see that the matras have been clipped between successive characters. Now, it is more or less ready to be fed to the Tesseract engine.



Now for the all-important code. This is the only function I altered; it is in tesseract-2.03/ccmain/baseapi.cpp. I did not provide the diff file because I made some more changes in other parts that I am not sure of.





// Threshold the given grey or color image into the tesseract global
// image ready for recognition. Requires thresholds and hi_value
// produced by OtsuThreshold above.
void TessBaseAPI::ThresholdRect(const unsigned char* imagedata,
                                int bytes_per_pixel,
                                int bytes_per_line,
                                int left, int top,
                                int width, int height,
                                const int* thresholds,
                                const int* hi_values) {
  IMAGELINE line;
  page_image.create(width, height, 1);
  line.init(width);
  int count, count1 = 0, blackpixels[height - 1][2], arr_row = 0, maxbp = 0, maxy = 0, matras[100][3], char_height;
  // For each line in the image, fill the IMAGELINE class and put it into the
  // Tesseract global page_image. Note that Tesseract stores images with the
  // bottom at y=0 and 0 is black, so we need 2 kinds of inversion.
  const unsigned char* data = imagedata + top * bytes_per_line +
      left * bytes_per_pixel;
  for (int y = height - 1; y >= 0; --y) {
    const unsigned char* pix = data;
    for (int x = 0; x < width; ++x, pix += bytes_per_pixel) {
      line.pixels[x] = 1;
      for (int ch = 0; ch < bytes_per_pixel; ++ch) {
        if (hi_values[ch] >= 0 &&
            (pix[ch] > thresholds[ch]) == (hi_values[ch] == 0)) {
          line.pixels[x] = 0;
          break;
        }
      }
    }
    page_image.put_line(0, y, width, &line, 0);
    data += bytes_per_line;
  }
  /////////DEBAYAN//////////////////

  // Record, for every row that contains ink, its black pixel count.
  for (int y = 0; y < height - 1; y++) {
    count = 0;
    for (int x = 0; x < width - 1; x++) {
      if (page_image.pixel(x, y) == 0) { count++; }
    }
    if (count > 0) {
      blackpixels[arr_row][0] = y;
      blackpixels[arr_row][1] = count;
      arr_row++;
    }
  }
  blackpixels[arr_row][0] = blackpixels[arr_row][1] = '\0';

  for (int x = 0; x < width - 1; x++) {  // Black line
    line.pixels[x] = 0;
  }

  ////////////line_through_matra() begins//////////////////////
  // Within each band of consecutive inky rows, the row with the most
  // black pixels is taken to be the matra of that text line.
  count = 1; cout << "\nHeight=" << height << " arr_row=" << arr_row << "\n";
  char_height = blackpixels[0][0];  // max character height per sentence
  while (count <= arr_row) {
    //if(count==0){max=blackpixels[count][0];}
    if ((blackpixels[count][0] - blackpixels[count - 1][0] == 1) && (blackpixels[count][1] >= maxbp)) {
      maxbp = blackpixels[count][1];
      maxy = blackpixels[count][0];
      cout << "\nMax=" << maxy << " bpc=" << maxbp;
    }
    if ((blackpixels[count][0] - blackpixels[count - 1][0]) != 1) {
      /////////////drawline(max)//////////////////////
      cout << "\nmax=" << maxy << " bpc=" << maxbp;
      // page_image.put_line(0,maxy,width,&line,0);
      char_height = blackpixels[count - 1][0] - char_height;
      matras[count1][0] = maxy; matras[count1][1] = maxbp; matras[count1][2] = char_height; count1++;
      char_height = blackpixels[count][0];
      ////////////drawline(max)/////////////////////
      maxbp = blackpixels[count][1];
    }
    count++;
  }
  matras[count1][0] = matras[count1][1] = matras[count1][2] = '\0';

  //delete blackpixels;
  ////////////line_through_matra() ends//////////////////////

  ////////////clip_matras() begins///////////////////////////
  for (int i = 0; i < 100; i++) {
    if (matras[i][0] == '\0') { break; }
    cout << "\nY=" << matras[i][0] << " bpc=" << matras[i][1] << " chheight=" << matras[i][2];
    count = i;
  }

  // For every column crossing a matra, measure the run of white pixels
  // below it; a long white run means the column lies between two characters.
  for (int i = 0; i <= count; i++) {
    for (int x = 0; x < width - 1; x++) {
      count1 = 0;
      for (int y = 0; y < matras[i][2]; y++) {
        if (page_image.pixel(x, matras[i][0] - y) == 1) {
          count1++;
          for (int k = y + 1; k < matras[i][2]; k++) {
            if (page_image.pixel(x, matras[i][0] - k) == 1) { count1++; }
            else { break; }
          }
          break;
        }
      }
      cout << "\nWPR @ " << x << "," << matras[i][0] << "=" << count1;
      if (count1 > .8 * matras[i][2] && count1 < matras[i][2]) {
        line.init(matras[i][2] + 5);
        for (int j = 0; j < matras[i][2] + 5; j++) { line.pixels[j] = 1; }
        cout << "GA";
        page_image.put_column(x, matras[i][0] - matras[i][2], matras[i][2] + 5, &line, 0);
      }
    }
  }

  page_image.write("bentest.tif");
  ////////////clip_matras() ends/////////////////////////////
  /////////DEBAYAN/////////////////
}





Problems with the code:




  • It assumes the image is perfectly straight. This assumption is obviously wrong, but Tesseract already has an inbuilt function to correct this. In any case, because these scripts have a matra, finding the angle of tilt is pretty simple (see the sketch after this list).


  • It runs this "matra clipping" code for all languages, which is totally wrong. One just has to add an "if-then-else" statement to make it run only for scripts such as hin_in, ben_in etc.


  • I implemented the entire thing in one block of code. Will break it into a few functions.


  • The algorithm has gaping flaws. Will plug them up.


  • Many more, some I know of, most I don't :)
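Since the matra hands us a long straight line for free, the tilt angle can also be read off directly (a minimal numpy sketch; the helper name is mine, and mask would come from something like the morphological opening sketched further up the page):

import numpy as np

def tilt_from_maatraa(mask):
    # mask: boolean array that is True only on maatraa pixels.
    ys, xs = np.nonzero(mask)
    slope = np.polyfit(xs, ys, 1)[0]     # least-squares line through the ink
    return np.degrees(np.arctan(slope))  # tilt angle in degrees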



Long-term goal:




  • Adding Devnagri script support to Tesseract, which includes making the above code error-free, and then training Tesseract. As far as I understand, the Tesseract developers are not concentrating on connected-script support yet, hence I hope my work does not overlap with Google's.





And I solemnly swear that I did not plagiarise/steal the code from anywhere; I came up with the algorithm myself (which is why it is full of bugs).


Honestly, doing this work was an absolute blast!!



(PS: What happened to Aksharbodh? If anybody knows, please tell me.)