Saturday, November 28, 2009

Initial tests

Initial test results are pretty good.

Test Condition:

Image: A deskewed image with Bengali text.

Training data:
Word List: Superset of all the words contained in the image
Shape Information: Using CRBLP's data

Output Text:

বৈঠকী মেজাজের এক উপেক্ষিত সা হিতি্যক
পত্র।স্তরে অনুজপ্রতিম লেখক ভ্রী শৈবাল মিত্র কিছুদিন অাগে বা-ঙা লি
লেখবঢ়ুঘুদ্ধিজীবীদের খুব একচেটি বকূনি দিয়েছো। তার অভিযোড়ৈগ এই যে, -যখনই
এই সব ব্যক্তিদের প্রশ্ল কর। হয়, অাপনার। গত এক বছরে উল্লেখযোগ্য কীড়ী কীা বই
পড়েছো, তখন অবধা রিতভাবে প্রায় সকলেই কিছুইংরিজি বই বা বিদেশি সাহিভ্যের
সূখ্য।তি ভরঢ়ু করেন। তার। কি ব।ংল। বই পড়্গে না ? নিজেরা বাংলা ভাষার লেখক
হয়েও অপর কে।নও বাঙালি লেখকের রচন।কে ণ্ডরঢ়ুত্বপুর্ণ মনে করেন নাড়ৈ? না কি
ব।ংল। ভ।ষায় উল্লেখযে।গ্য কিছু লেখাই হয় না।
এই অভিযে।গে সত্যতা অাছে। প।ণ্ডিত্য প্রমাণ করার জন্য অনেকেই বিদেশি
স।হিত্য সম্পর্কে জান জ। হির কর।র জন্য ব্যস্ত হয়ে পড়্গে; পা ণ্ডিত্য কিংবানূরবারি
য।ই হে।ক, ব।ংল। বই-টই এর মধ্যে অ।সে না। কফি হাউসের বুদ্ধিজীবীদের হাভে
বাংলা বই র।খ।র রেওয়।জ নেই। ঢেউয়ের মতন কখনও ম।র্কেজ, কখনও-
দেরিদ।-গ্র।মসি, কখনও টনি মরিসন ব।ঙ।লি বুদ্ধিজীবীদের ওণ্ঠের ওপর খেলাকরে
যান।
বিদেশি স।হিত্য ও তত্ত্বগ্রছ প।ঠ করা অবশ্যই জকরি, কিত বাংলা ভাষায়
অালোচন।যে।গ্য কে।নও গ্রছ লেখ। হয় ন।, এমন য দি মনে কর। হয় তা হলে বাংলা
-ভাষানিয়ে এত গর্ব কর।রইবা কী অ।ছে ? বিদেশি স।হিত্য প।ঠ করলেই বরং বে।ঝা
-যায়-, সম্প্রতি অন্যান্য ভ।ষ।য় রচিত গল্প-উপন্য।সবঢ়ুবিত। বাংলার তূলনায় এমন কিছু
-অাহাম রি উচচ।ঙ্গের নয়। সৈয়দ মুস্ত।ফ। সির।জের ড়ৃঅলীক মানুষ,এর মতন উপন্যাস
বিঢ়ুংব। -জয় গে।স্ব।মীর ড়ৈপ।গলী, ভে।মার সঙ্গে-ব তুল্য কাব্যগ্রছইদানীং কোন ভ।ষ।য়
প্রকাশিত হয়েছে?
যাই হ্রোক, অামার পক্ষে এরকম পণ্ডি তিপনা কিংবা স্নবারি দেখ।বার কোনও
সুযে।গই নেই; কারণ গত এব৪ বৎসরে অামি বিদেশি সাহিত্য কিছুই প ড়িনি ! এমনকী
ইংরিজি অক্ষরে লেখ। নিতাস্ত কয়েবঢ়ুখান। পুরনো ইতিহাসখজীবনীগ্রস্হু ছাড়া কোনও
গল্পঞ্ঝউপন্যাস চোখেও দেখি নি! বস্কু-ব।ন্ধবরা কেউ যখন সা।ভ.ঘতিক কোনও
সাড়।-জাগ।নো বইয়ের প্রসঙ্গ তুলে জিজ্ঞেস করে, তূ মি পড়েছ নিশ্চয়ই? অামারে৪
সসংকে।চে স্বীক।র করতেই হয়ড়ৈ, না ভ।ই পড়িনি! কিংব। বিদেশ খেকে ফিরে এলে
যখন কেউ জিজেস করে, ও দেশের হালফিল সাহিত্যের ধারা কী দেখলে, অামি
মাথা চুলকোই। জা নি না, খবর নেব।র সময় পাইনি! লভনে গিয়ে ইন্ডিয়া অ ফিস
লাইরেরিভে অামি পুরনো গুথিপত্র ঘেঁটেছি, এবচটাও নজৃব ইংরিজি কবিতার বই
কিনি নি, এটা স্বীকার করভে অামার লজ্জ। হয়। ত্রটা অামার এবচঁটা অধঃপতনের চিহ্ন


Accuracy: ~93%

One major source of errors is the । vs. া ambiguity. That can be fixed.

This is pretty good news. The OCR is working well.

Conversation with Sayamindu regarding ambiguities

Here is a mail I sent to Sayamindu:
"By the way, one difficult problem I am facing is that all the া are
being mistakenly recognised as । . The dictionary should help in
resolving this, and also there is a file where we can specify
ambiguities like these. But nothing seems to work.
One way to solve the problem is to add the following rule in the
reorder script: We make a pass and replace all instances of । with া .
Then we make another pass and see whether there are any leftover া
with the dotted circle. These should be replaced by । .
Is the logic ok? How to find out if an া has a dotted circle?"

He has not replied yet.

Here is what I think. The change cannot be made simply in the reorder script, which is executed only in the post-OCR stage. The problem is that the OCR engine itself recognises the character wrongly, and that throws off the rest of the recognition.
One solution is to not train । (the equivalent of the full stop in Bengali) at all. We can always add the । back in a post-OCR script using the method from the mail addressed to Sayamindu.
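To make the idea concrete, here is a minimal sketch in Python of the two-pass replacement described in the mail. This is my own guess at an implementation, not the actual reorder script: fix_danda_aa is a made-up name, and the consonant range is only an approximation of the Bengali block.

# Pass 1: treat every recognised danda (।) as a vowel sign (া).
# Pass 2: a া that does not follow a consonant is "dangling" (it is the
# one that would render with a dotted circle), so restore the danda there.

BENGALI_CONSONANTS = set(chr(c) for c in range(0x0995, 0x09BA))  # roughly ক..হ

def fix_danda_aa(text):
    text = text.replace('\u0964', '\u09BE')  # । -> া
    out = []
    for i, ch in enumerate(text):
        if ch == '\u09BE' and (i == 0 or text[i - 1] not in BENGALI_CONSONANTS):
            out.append('\u0964')  # dangling া -> ।
        else:
            out.append(ch)
    return ''.join(out)

print(fix_danda_aa('তার। কি ব।ংল। বই'))  # -> 'তারা কি বাংলা বই'
print(fix_danda_aa('হয় না।'))  # a genuine sentence-final danda survives both passes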

TesseractIndic Trainer GUI

I just uploaded the TesseractIndic Trainer GUI version 0.1 to
http://tesseractindic.googlecode.com/files/TesseracIindic-Trainer-GUI-0.1.tar.gz .
This application allows a person to generate custom/application-specific
training data quickly.
To see how to use it, read
http://code.google.com/p/tesseractindic/wiki/TrainerGUI or watch
http://www.youtube.com/watch?v=xuBlfN6Va4k .

Wednesday, November 25, 2009

How was the dictionary fixed?

Well, it was a single line!

Added the following line at line 1077 of dict/permute.cpp:

any_alpha=1;

Here is the diff against 2.04 release:

--- tesseract-2.04/dict/permute.cpp 2008-11-14 23:07:17.000000000 +0530
+++ tessmod/dict/permute.cpp 2009-11-26 00:34:50.660737699 +0530
@@ -1077,6 +1077,7 @@
return (NULL);
if (permute_only_top)
return result_1;
+ any_alpha=1;
if (any_alpha && array_count (char_choices) <= MAX_WERD_LENGTH) {
result_2 = permute_words (char_choices, rating_limit);
if (class_probability (result_1) < class_probability (result_2)


For non-English scripts the if condition was never getting satisfied, and hence the DAWG files were not being scanned properly. Explicitly setting any_alpha=1 just before the check solves the problem for the time being. There is probably a more elegant solution, though.
By the way, I do not see this particular if condition anywhere in the file in the trunk. Perhaps the developers have already fixed it there.

Tesseract Dictionary (finally) works for Indic

This is going to be one long post.

For the past few weeks I have been experimenting with the Indic script support in the Tesseract dictionary. I will first record the observations/results of my experiments and then elaborate on the logic involved.


We start with a pristine copy of tesseract-2.04 downloaded from here. Then we add some code to enable the maatraa clipping support (March 27 entry here).
Our aim is to see whether the dictionary works for Indic. Here is the methodology:

1) Take an image with a single word.
2) Create empty DAWG files.
3) OCR and see the result.
4) Now create DAWG files with a single word. The word is the same as the one in the image.
5) Now OCR again and see if the result improves.

I chose this image:

In text form it reads: পূনরায় (punoraaye, which means 'again' in Bengali).

On OCRing this image with empty dawg files I received this result: পূনরুায়


The result is wrong: the third character is রু instead of র, and the vowel sign া is not joined to the preceding consonant.


Now I generate the two DAWG files, freq-dawg and word-dawg, from a word list containing just this word: পূনরায়. Here is the process:

debayan@deep-blur:/tmp/orig/tesseract-2.04$ cat list
পূনরায়
debayan@deep-blur:/tmp/orig/tesseract-2.04$ wordlist2dawg list dawg
Building DAWG from word list in file, 'list'
Compacting the DAWG
Compacting node from 9570029 to 1000034 (2)
Writing squished DAWG file, 'dawg'
18 nodes in DAWG
18 edges in DAWG

Each symbol occupies three bytes (that is the UTF-8 encoding length for this range of Unicode). There are 6 symbols in all: প ূ ন র া য়; hence 6 × 3 = 18 nodes in the DAWG. Makes sense!
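As a quick sanity check of that arithmetic, here is a throwaway Python snippet (not part of the Tesseract code) that counts the UTF-8 bytes of the word, assuming the precomposed য় (U+09DF):

word = '\u09AA\u09C2\u09A8\u09B0\u09BE\u09DF'  # প ূ ন র া য়
print(len(word), len(word.encode('utf-8')))  # 6 code points, 18 bytes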

Now I copy these DAWG files to the appropriate location (/usr/local/share/tessdata/) and OCR again, and I get the same result. This shows that the DAWG files are currently ineffective.

Now let's look at how to solve the problem. Of course, the first step is to find out what is going on in the DAWG creation/reading process. This involves inserting several cprintf statements throughout the code, which gives us an insight (600 KB download) into how the DAWG file is being used. I intend to analyse the output and pinpoint the problem in the next post. In this post, let's concentrate on the results.

After making the changes, I followed the same five steps as above. Here is the output:

debayan@deep-blur:~/ocr/branches/tesseract-2.04$ vim space
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ wordlist2dawg space dawg
Building DAWG from word list in file, 'space'
Compacting the DAWG
Compacting node from 0 to 1000000 (2)
Writing squished DAWG file, 'dawg'
1 nodes in DAWG
1 edges in DAWG
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.
ban.DangAmbigs ban.freq-dawg ban.inttemp ban.pffmtable ban.user-words ban.word-dawg
ban.DangAmbigs~ ban.freq-dawg.old ban.normproto ban.unicharset ban.user-words.old ban.word-dawg.old
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.freq-dawg
[sudo] password for debayan:
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.word-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ tesseract wed.tif wed -l ban 2>temp
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat wed.txt
পূনরুায়

========================================================================

debayan@deep-blur:~/ocr/branches/tesseract-2.04$ echo 'পূনরায়'>list
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat list
পূনরায়
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ wordlist2dawg list dawg
Building DAWG from word list in file, 'list'
Compacting the DAWG
Compacting node from 9570029 to 1000034 (2)
Writing squished DAWG file, 'dawg'
18 nodes in DAWG
18 edges in DAWG
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.freq-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.word-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ tesseract wed.tif wed -l ban 2>temp
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat wed.txt
পূনরায়

If you follow the output above closely, you will see that adding the word to the DAWG changed the output constructively. If you have the patience, follow the 600 KB text file to see how it did that. Wait for my next post for a detailed analysis of the process.

For now the conclusion is: the dictionary works for Indic. I need to send a patch to Ray and team.

Thursday, November 19, 2009

utf-8 = ok?

I have been trying to add wide-character support to the Tesseract code base by converting most char* to wchar_t* data types. However, I read in depth about UTF-8 encoding today here. It says UTF-8 handles Unicode well, and Tesseract already supports UTF-8, or so it claims.
However, when I print out the DAWG file contents, I see garbage for Indic scripts but proper characters for English. Why is this happening?
This makes me think that maybe I am on the wrong track. I did ask the Tesseract list whether I am on the right track or not, but found no useful replies.
In fact, now that I think about it: we are creating the dictionaries out of plain word lists, and we are forgetting that we need to apply the vowel 'de-reordering' rules to them first. Only then will the OCR be able to match words at run time.
Refer to my earlier post, where I mentioned that we need to do vowel reordering post-OCR. If you reverse the analogy, we need to intentionally include the same anomalies in the dictionary so that the OCR can match against it. Hence the dictionary entry would be েক rather than কে; the OCR would match that, and the vowel reordering code would correct the final output.
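Here is a minimal sketch of what such a de-reordering pass over the word list could look like. This is my assumption, not existing code: dereorder is a made-up name, and only the simple one-part pre-base signs ি ে ৈ are handled.

PRE_BASE_SIGNS = {'\u09BF', '\u09C7', '\u09C8'}  # ি ে ৈ

def dereorder(word):
    # Move each pre-base vowel sign in front of the consonant it follows,
    # so the dictionary entry matches the left-to-right order the engine sees.
    chars = list(word)
    for i in range(1, len(chars)):
        if chars[i] in PRE_BASE_SIGNS:
            chars[i - 1], chars[i] = chars[i], chars[i - 1]
    return ''.join(chars)

print(dereorder('কে'))  # -> 'েক', the visual order the engine encounters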

What a realisation!!

char* to wchar_t* conversion

Say you have a const char* string and you need to convert it to the wchar_t type so that it can be stored in wide-character format. Here is a piece of code that takes the const char* and returns the wide string for you.
Note that it does not work without the setlocale function.

You need to include the locale.h, wchar.h, stdlib.h, and string.h headers for this to work. The result is heap-allocated, so the caller is responsible for freeing it.


wchar_t* utf2wchar(const char *str) {
    setlocale(LC_ALL, "en_US.UTF-8");
    size_t size = strlen(str) + 1; /* worst case: one wide char per input byte, plus the NUL */
    wchar_t *uni = malloc(size * sizeof(wchar_t)); /* heap, not a stack array, so the pointer stays valid after return */
    if (uni == NULL) return NULL;
    size_t ret = mbstowcs(uni, str, size);
    if (ret == (size_t)-1) { /* (size_t)-1 signals an invalid multibyte sequence */
        cprintf("mbstowcs failed");
        free(uni);
        return NULL;
    }
    return uni; /* the caller must free() the result */
}

Monday, November 16, 2009

No Unicode support in Tesseract-OCR?

If I were to point out the single issue on which this project's success depends, it would be the dictionary. The dictionary for this OCR system is not just a text file full of words, but a data structure called a directed acyclic word graph (DAWG).
I decided to finally solve this blocker of a problem and delved into the mailing lists once again. I did not find any new information there and hence decided to look at the source code itself.
I soon noticed that while building the dictionary, the code treats each word as a stream of bytes and stores one byte per node. This means that the code does not support wide characters; wide-character support requires the wchar_t type instead of char.
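A quick illustration of what byte-per-node storage means for Bengali, as a throwaway Python snippet: a two-character word like কে occupies six nodes instead of two.

word = '\u0995\u09C7'  # কে : ক + ে, two characters
print(len(word))  # 2 code points
print(list(word.encode('utf-8')))  # 6 bytes -> 6 DAWG nodes, 3 per character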
This is a major problem. One could try to make the code wide-character compatible, but it might require considerable labour. Reading contents from the dictionary would also need wide-character support.
The alternative is shifting to a new OCR engine like OCRopus, which the CRBLP folks seem to have done already.

Friday, November 6, 2009

Is a document suitable for OCR?

This is an important question for certain contexts.
1) There may be an online web service that allows people to upload images to be OCRed. Pranksters or bots may start uploading images with little or no text, and the OCR engine then wastes immense amounts of CPU cycles trying to make sense of them.

2) The visually challenged may want to use the computer in this manner: whenever an image with text is in front of them, the software automatically recognises the areas of text and OCRs them. Post-OCR, a TTS system then reads the text out to them.

Now how do we achieve this?

There is a good method: the Run Length Smearing Algorithm (RLSA). It smears lines of text into solid black lines, and then looks for parallel black lines as a sign of lines of text in the image. A sketch of the horizontal smearing step follows.
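Here is a minimal sketch of the horizontal smearing pass, based on my reading of RLSA; the function name and the threshold value are my own choices. Running the same pass down the columns and combining the two results gives the classic RLSA segmentation.

def rlsa_horizontal(img, threshold=20):
    # img is a 2D list of 0 (black) / 1 (white) pixels.
    # Any short run of white pixels between two black pixels is filled
    # with black, smearing words into solid lines.
    out = [row[:] for row in img]
    for row in out:
        last_black = -1
        for x, px in enumerate(row):
            if px == 0:
                if last_black >= 0 and x - last_black <= threshold:
                    for i in range(last_black + 1, x):
                        row[i] = 0
                last_black = x
    return out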

The Problem of Dotted Circles

Certain vowel signs in Indic scripts have a dotted circle in them, for example ৈ, ে, া. When these are used in conjunction with consonants, however, the dotted circles vanish. For example: কৈ, কে, কা.
This is a problem for automated training. The Python script draws ে and trains the engine to recognise the shape along with the dotted circle. However, when we OCR a document, the dotted circle is no longer there.
Hence we need a method of automatically eliminating the dotted circles from vowel signs while generating training images. Any ideas?
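One idea, offered purely as a sketch: render the bare consonant and the consonant-plus-sign pair with the same font, then take the pixel difference, so that what survives is the sign's glyph with no dotted circle. The snippet below uses PIL; the font path is a placeholder, and for pre-base signs like ে the two renderings would first have to be aligned on the consonant, since the sign shifts it to the right.

from PIL import Image, ImageChops, ImageDraw, ImageFont

def isolate_vowel_sign(consonant, syllable, font_path, size=64):
    # Render white-on-black so difference() leaves only the extra glyph.
    font = ImageFont.truetype(font_path, size)
    base = Image.new('L', (size * 3, size * 2), 0)
    pair = Image.new('L', (size * 3, size * 2), 0)
    ImageDraw.Draw(base).text((size, 0), consonant, font=font, fill=255)
    ImageDraw.Draw(pair).text((size, 0), syllable, font=font, fill=255)
    return ImageChops.difference(pair, base)

# e.g. isolate_vowel_sign('ক', 'কা', 'MuktiNarrow.ttf')  # া attaches after ক, so no shift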

Why is Vowel Reordering required?

Indic scripts have the concept of vowel signs. The peculiarity of these vowel signs with respect to OCR is that sometimes consonant + vowel sign = a glyph sequence in which the vowel sign is drawn first and the consonant after it.
Here I present just one simple example.
That is (in Bengali): ক + ে = কে

Now when we OCR কে, the OCR engine first encounters the vowel sign (ে, without the dotted circle) and then the consonant ক. It then concatenates the two characters in the order seen, and ends up producing this as the output: েক.
Since the OCR engine makes the same mistake every time, it is easy to write scripts that move every such vowel sign to its appropriate place, as sketched below. This improves the OCR accuracy drastically.
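A minimal sketch of such a reordering script; reorder is a made-up name, and only the simple one-part pre-base signs are handled.

PRE_BASE_SIGNS = {'\u09BF', '\u09C7', '\u09C8'}  # ি ে ৈ

def reorder(text):
    # Whenever a pre-base vowel sign precedes a character in the raw
    # OCR output, swap the pair so েক becomes কে.
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i] in PRE_BASE_SIGNS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

print(reorder('েক'))  # -> 'কে'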

OCRFeeder

I have been working on creating a complete OCR solution suite for GNOME. It turns out that OCRFeeder is already a pretty good solution.
It's a good thing that this exists, because I can now shift my focus to adding the Indic-related code to OCRFeeder itself.
By Indic-related code I mean the modified Tesseract shironaam clipper, vowel reordering, automated training, and the crowd-sourced data feedback learning mechanism.

Crowd Sourcing OCR development

One of the biggest challenges in OCR development is gathering training data and feeding it to the OCR engine. The data is generally carefully chosen, and some emphasis is laid on the quality of the scans too. This often requires a team of people working in close proximity, and it has hence traditionally been a blocker for the distributed development model.
However, with proper planning in software development, such frameworks can be set up which allow end users to contribute to OCR training data.
The interface to the OCR system may be command-line based, GUI based, or web based. Say a user OCRs a particular document. Post-OCR, the interface presents him with an opportunity to correct any errors and send the corrections back to a centralised server, where volunteers/contributors verify the data. Once the data has been verified, it is fed to the engine for incremental training.
To check whether the data being added is actually improving the OCR's performance, we may run an automated nightly OCR over a fixed set of test images and their reference texts and post the accuracy percentage daily, as sketched below.
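Here is a sketch of what that nightly check could look like. The test-set layout, the ground-truth naming convention, and the language code are placeholders; character accuracy is computed with a simple matching-blocks comparison.

import difflib, glob, subprocess

def char_accuracy(truth, ocr):
    # Fraction of ground-truth characters that survive into the OCR output.
    matcher = difflib.SequenceMatcher(None, truth, ocr)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(truth), 1)

total = 0.0
images = glob.glob('testset/*.tif')
for img in images:
    base = img[:-4]
    subprocess.run(['tesseract', img, base, '-l', 'ban'], check=True)
    truth = open(base + '.gt.txt', encoding='utf-8').read()
    ocr = open(base + '.txt', encoding='utf-8').read()
    total += char_accuracy(truth, ocr)
print('Average accuracy: %.1f%%' % (total / max(len(images), 1)))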
The challenge is that most OCR engines are not incrementally trainable out of the box; Tesseract-OCR is one example. However, one may write some code to implement this.
Crowd-sourcing training data is critical to aligning OCR development with a FOSS-based model, and hence to freeing it from the clutches of research teams at big institutes.