Wednesday, November 25, 2009

Tesseract Dictionary (finally) works for Indic

This is going to be one long post.

For the past few weeks I have been experimenting with the Indic script support in the Tesseract dictionary. I will first record the observations/results of my experiments and then elaborate on the logic involved.


We start with a pristine copy of tesseract-2.04 downloaded from here. Then we add some code to enable the maatraa clipping support (March 27 entry here).
Our aim is to see whether the dictionary works for Indic. Here is the methodology:

1) Take an image with a single word.
2) Create empty DAWG files.
3) OCR and see the result.
4) Now create DAWG files with a single word. The word is the same as the one in the image.
5) Now OCR again and see if the result improves.

I chose this image:

In text form it reads: পূনরায় (punoraaye, which means 'again' in Bengali).

OCRing this image with empty DAWG files gave this result: পূনরুায়


The result is wrong. The third character is রু instead of র. Also, the vowel sign া is not joined to the previous consonant.
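To make the difference between the two strings explicit, here is a small Python sketch that compares them code point by code point. It is not part of Tesseract; the function name is made up for illustration, and the code points are my assumed spelling of the two strings shown above.

```python
import difflib
import unicodedata

# Code points written as escapes so the comparison does not depend on how
# this page is encoded or normalized (assumed spelling of both strings).
expected = "\u09aa\u09c2\u09a8\u09b0\u09be\u09df"          # the word in the image
ocr_result = "\u09aa\u09c2\u09a8\u09b0\u09c1\u09be\u09df"  # what the OCR returned


def extra_code_points(expected, actual):
    """Return Unicode names of code points inserted or substituted in actual."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    extras = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extras.extend(unicodedata.name(ch) for ch in actual[j1:j2])
    return extras


if __name__ == "__main__":
    for name in extra_code_points(expected, ocr_result):
        print(name)
```

Under these assumptions the only difference is one extra vowel sign after র, which is also what pushes া away from its consonant.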


Now I generate the two DAWG files, freq-dawg and word-dawg, from a word list containing just this word: পূনরায়. Here is the process:

debayan@deep-blur:/tmp/orig/tesseract-2.04$ cat list
পূনরায়
debayan@deep-blur:/tmp/orig/tesseract-2.04$ wordlist2dawg list dawg
Building DAWG from word list in file, 'list'
Compacting the DAWG
Compacting node from 9570029 to 1000034 (2)
Writing squished DAWG file, 'dawg'
18 nodes in DAWG
18 edges in DAWG
Each symbol takes three bytes when encoded as UTF-8. There are 6 symbols in all: প ূ ন র া য় ; hence 6 x 3 = 18 nodes in the DAWG. Makes sense!
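The byte arithmetic can be checked with a short Python sketch. It assumes the final letter is stored as the single precomposed code point U+09DF; if the word list had it in decomposed form (U+09AF followed by U+09BC), there would be 7 code points and 21 bytes instead. The helper name is only illustrative.

```python
# Code points of the word, written as escapes (assumed: the final letter is
# the single precomposed code point U+09DF, which gives 6 symbols).
word = "\u09aa\u09c2\u09a8\u09b0\u09be\u09df"


def utf8_lengths(text):
    """Return (code point label, number of UTF-8 bytes) for each character."""
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


if __name__ == "__main__":
    for code_point, size in utf8_lengths(word):
        print(code_point, size)
    print("total bytes:", len(word.encode("utf-8")))
```

Every code point in the Bengali block lies between U+0800 and U+FFFF, which is the range UTF-8 encodes in three bytes, so the total matches the 18 reported by wordlist2dawg.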

Now I copy these DAWG files to the appropriate location (/usr/local/share/tessdata/), OCR again, and get the same wrong result. This shows that the DAWG files currently have no effect.

Now let's look at how to solve the problem. Of course, the first step is to find out what is going on in the DAWG creation/reading process. This involves inserting cprintf statements throughout the code. The resulting trace (600 KB download) gives some insight into how the DAWG file is being used. I intend to analyse the output and pinpoint the problem in the next post. In this post, let's concentrate on the results.

After I made the changes, I repeated the same 5 steps as above. Here is the output:

debayan@deep-blur:~/ocr/branches/tesseract-2.04$ vim space
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ wordlist2dawg space dawg
Building DAWG from word list in file, 'space'
Compacting the DAWG
Compacting node from 0 to 1000000 (2)
Writing squished DAWG file, 'dawg'
1 nodes in DAWG
1 edges in DAWG
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.
ban.DangAmbigs ban.freq-dawg ban.inttemp ban.pffmtable ban.user-words ban.word-dawg
ban.DangAmbigs~ ban.freq-dawg.old ban.normproto ban.unicharset ban.user-words.old ban.word-dawg.old
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.freq-dawg
[sudo] password for debayan:
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.word-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ tesseract wed.tif wed -l ban 2>temp
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat wed.txt
পূনরুায়

========================================================================

debayan@deep-blur:~/ocr/branches/tesseract-2.04$ echo 'পূনরায়'>list
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat list
পূনরায়
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ wordlist2dawg list dawg
Building DAWG from word list in file, 'list'
Compacting the DAWG
Compacting node from 9570029 to 1000034 (2)
Writing squished DAWG file, 'dawg'
18 nodes in DAWG
18 edges in DAWG
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.freq-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.word-dawg
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ tesseract wed.tif wed -l ban 2>temp
debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat wed.txt
পূনরায়

If you follow the output above closely, you will find that adding the word to the DAWG improved the result: the OCR output now matches the word in the image. If you have the patience, follow the 600 KB text file to see how it did that. Wait for my next post for a detailed analysis of the process.

For now the conclusion is: the dictionary works for Indic. I need to send a patch to Ray and team.
