Sunday, April 19, 2009

What is a DAWG file?

DAWG = Directed Acyclic WordGraph

First, we'll define DAWG (skip if you know already) and cover the specifics of tesseract below.

=========== this definition by Mark & Ceal Wutka, link below ============

A Directed Acyclic Word Graph, or DAWG, is a data structure that permits extremely fast word searches. The entry point into the graph represents the starting letter in the search. Each node represents a letter, and you can travel from the node to two other nodes, depending on whether you the letter matches the one you are searching for.

It's a Directed graph because you can only move in a specific direction between two nodes. In other words, you can move from A to B, but you can't move from B to A. It's Acyclic because there are no cycles. You cannot have a path from A to B to C and then back to A. The link back to A would create a cycle, and probably an endless loop in your search program.

The description is a little confusing without an example, so imagine we have a DAWG containing the words CAT, CAN, DO, and DOG. The graph woud look like this:

     C --Child--> A --Child--> N (EOW)
  |                         |
  |                       Next
Next                        |
  |                         v
  |                         T (EOW)
  v
  D--Child--> O (EOW) --Child --> G (EOW)

Now, imagine that we want to see if CAT is in the DAWG. We start at the entry point (the C) in this case. Since C is also the letter we are looking for, we go to the child node of C. Now we are looking for the next letter in CAT, which is A. Again, the node we are on (A) has the letter we are looking for, so we again pick the child node which is now N. Since we are looking for T and the current node is not T, we take the Next node instead of the child. The Next node of N is T. T has the letter we want. Now, since we have processed all the letters in the word we are searching for, we need to make sure that the current node has an End-of-word flag (EOW) which it does, so CAT is stored in the graph.

One of the tricks with making a DAWG is trimming it down so that words with common endings all end at the same node. For example, suppose we want to store DOG and LOG in a DAWG. The ideal would be something like this:

   D --Child--> O --Child--> G(EOW)
|            ^
Next          |
|            |
v            |
L --Child----

In other words, the OG in DOG and LOG is defined by the same pair of nodes.

=========== Creating a DAWG ============

[...] The idea is to first create a tree, where a leaf would represent the end of a word and there can be multiple leaves that are identical. For example, DOG and LOG would be stored like this:

  D --Child--> O --Child--> G (EOW)
|
Next
|
v
L --Child-> O --Child--> G (EOW)

Now, suppose you want to add DOGMA to the tree. You'd proceed as if you were doing a search. Once you get to G, you find it has no children, so you add a child M, and then add a child A to the M, making the graph look like:

  D --Child--> O --Child--> G (EOW) --Child--> M --Child--> A (EOW)
|
Next
|
v
L --Child-> O --Child--> G (EOW)

As you can see, by adding nodes to the tree this way, you share common beginnings, but the endings are still separated. To shrink the size of the DAWG, you need to find common endings and combine them. To do this, you start at the leaf nodes (the nodes that have no children). If two leaf nodes are identical, you combine them, moving all the references from one node to the other. For two nodes to be identical, they not only must have the same letter, but if they have Next nodes, the Next nodes must also be identical (if they have child nodes, the child nodes must also be identical).

Take the following tree of CITIES, CITY, PITIES and PITY:

 C --Child--> I --Child--> T --Child--> I --Child--> E --Child--> S (EOW)
|                                      |
|                                     Next
Next                                    |
|                                      v
|                                      Y (EOW)
P --Child--> I --Child--> T --Child--> I --Child--> E --Child--> S (EOW)
                                     |
                                    Next
                                     |
                                     v
                                     Y (EOW)

Continue reading this explanation at:

http://www.wutka.com/dawg.html

What does Tesseract use DAWG for?

Tesseract uses the Directed Acyclic Word Graphs to very compactly store and efficiently search several list(s) of words. There are four DAWGs in tesseract: (right?)

1. word_dawg (pre-set/fixed list read in from "tessdata/word-dawg")
(this one is read in raw/directly for speed, user can't change this right now)
2. document_words (document-words that have already been recognized)
(built during execution; FIX: is/isn't cleared per-document/baseapi call)
3. pending_words (words tess is working on, at the moment, before they are added to document_word)
4. user_words (user-adjustable list read in from "tessdata/user-words")
(add here custom words that tesseract tends to corrupt)

Disclosure: I don't know the order of preference - which DAWG does tesseract check first AND which DAWG over-rides the others. ex. "thls" is not in #1 but, say, is in #4 - will tesseract NOT jiggle the 'l' into an 'i' (which then matches in #1) or will it go with #4? Ray?

Let's say that tesseract thinks it found a word with four letters, "thls". Before this word is output, tesseract will:

look-up "thls" in DAWG #1 (see above)
(when does it check user-words?)
By looking through the sorted list for each of the classes, tesseract will note that the third character had a second-best choice to be an 'i' so it changes that letter and
look-up "this" in DAWG #1 and this time it DOES match.
(fmg has seen tess KEEP ON permuting even after a match in both #1 and #4 so is not sure what the ending conditions are - maybe someone who knows better can explain) which can only mean that:
until the certainty of the word isn't moved beyond some threshold, permuting of other letters continues...

So, the answer to "Why does tesseract bother with DAWGs" is that when a typical English word has one or two letters that have permutations possible, WITHOUT using the compact and fast DAWG's this lookup task would quickly become a huge bottle-neck.

=========== DAWG-related ToDo's ============

Todo:

Need to add info here on:

how to view/list words ALREADY IN "tessdata/word-dawg"
how to CREATE A NEW "tessdata/word-dawg"
which constants need to be tweaked when adding words to "tessdata/word-dawg"
which constants need to be tweaked when adding words to "tessdata/user-words" (because a poster on the forums said that after about 5000 words are added guano happens)
why/what for is rand() used in add_word_to_dawg()
what to do when the dreaded "DAWG Table is too full" error occurs AFTER Ray Smith's patch is already applied...

Generated on Wed Feb 28 19:49:30 2007 for Tesseract by

1.5.1

COPIED VERBATIM FROM http://tesseract-ocr.repairfaq.org/

Friday, April 17, 2009

Clipping accuracy

I had tried some time last year to push my matra clipping code to Tesseract-OCR upstream, but Ray Smith the lead developer of the project asked about the accuracy of the code and I never got around to calculating it. Well actually I still havent calculated it, but I did something new.
Check the set of pictures I uploaded at . The first picture is the normal picture to be OCRed. The second picture is the clipped+thresholded image. The third image is the difference of the clipped+thresholded and thresholded images.

Here is the Python code that creates a new image out of two input images:

#!/usr/local/bin/python

import ImageChops, Image

th=Image.open("benth.tif")
clip=Image.open("bentest.tif")

new=ImageChops.difference(th,clip)
new=ImageChops.invert(new)

new.save("diff.tif","TIFF")

I will now show this to Ray Smith. Lets see if he likes it.

My old training methodology

The principle on which this works is this: Tesseract needs two things to train itself, 1) An image of the character 2) The name of the character. This information is provided with the help of "box files". A box file contains the co-ordinates of the bounding boxes around characters with labels as to what those characters are. The traditional method of training the engine is to take a scanned image, meticulously create a box file using some tool such as tesseractrainer.py , edit the box file, and keep doing the same for several other images and fonts. This process was tedious enough to force me to seek new methods.

Now lets do a little reverse engineering. What if we could take a list of characters in a text file, "generate" an image out of those characters, store the co-ordinates of the bounding boxes of those generated images in a file and then feed these to the OCR engine? It would work, right?

Links:

http://tesseractindic.googlecode.com/files/tesseract_trainer.beta.tar.gz - The tar ball itself

http://code.google.com/p/tesseractindic/source/browse/trunk/tesseract_trainer/readme - The readme file

http://www.youtube.com/watch?v=vuuVwm5ZjkI - YouTube video of the tool working for Bengali

But there are problems. Tesseract-OCR has its quirks.

Tesseract wants one bounding box to enclose a single "blob" only. A blob is a wholly connected component. So ক is a blob, and ক খ are two blobs. There are cases where a consonant+vowel sign generates two blobs, for example the 3 images below have multiple blobs:

And hence Tesseract throws a "FATALITY" error during training.

So i had to change my approach a little bit. Obviously there has to be some feedback mechanism where i parse the output of Tesseract during training to see if a particular set of characters threw errors. Once I know what they are, I can separate them and train them later. To accomplish this, I changed my approach of generating a strip of character images to generating just one image per character, so I can pin point the problems better.

The downside, too many images getting generated. To train a simple font it generates 405 images+405 box files+405 tr files. And all this when I have not included conjuncts yet. It is not much of a problem though, since the images generated are not required once the training files have been generated.

Well it leads me to new challenges. I remember Prof B.B. Choudhury saying that training all the conjuncts will kill any recogniser, ie, it will work very slowly while recognising. He also told me some cool ways to get past that. May have to implement that. Lets see.

My training methodology does not work :(

As much as I hate to admit it my training methodology of generating one image per akshar does not work. I hate to say it since I put some effort into writing the Python code that does this .
Well the reason is probably that Tesseract OCR training code looks for characters on a single line during training as it also extracts base line metrics for rare/strange characters like numerals. As such it may not be able to extract all the information it needs for its training.
Or may be Tesseract OCR training code accepts a very little number of .tr files and since my code generates thousands of tr files, it becomes useless.
Let me show you an example of how miserably it failed.
I decided to test the training on the string " ভারত মাতা " (Bharat Mata which means Mother India). I generated the tiff image using Pango rendering.
Then I generated 7 images per sample of ভ র ত ম and used the subsequently generated training fils for OCR.
The result was this: " মভতভ ভভভভ "
Yes, I know. The result is absolutely outrageous.
However, what if I still autogenerate images of characters but this time in single lines adjacently? Will it work?

Tuesday, April 14, 2009

Image degradation

I added some code. The pango rendering works perfectly now. Also 1 pure image and 4 partially erased images are created per character.
The degradation has been chosen to suit the code that clips matraas. The only degradation seen is a vertical white strip overlapping certain characters.
Hence the same is done while generating training images.

[1] http://code.google.com/p/tesseractindic/source/detail?r=41

Tesseract-Indic-OCR