Monday, November 16, 2009

No Unicode support in Tesseract-OCR?

If I were to point out one single issue on which this project's success depends, it would be the dictionary. The dictionary for this OCR system is not just a text file full of words, but a data structure called a directed acyclic word graph (DAWG).
I decided to finally solve this blocker of a problem and delved into the mailing lists once again. I did not find any new information there and hence decided to look at the source code itself.
I soon noticed that while building the dictionary, the code treats each word as a stream of bytes and stores one byte per node. This means that the code does not support wide characters: a character that takes several bytes in the input encoding gets split across several nodes. As far as I can tell, wide character support would require the wchar_t type instead of char.
This is a major problem. One could try to make the code wide-character compatible, but that might require considerable labour. Reading the contents back from the dictionary would also need wide character support.
The alternative is shifting to a new OCR engine like OCRopus, which the CRBLP folks seem to have done already.
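To make the byte-per-node problem concrete, here is a minimal Python sketch. It is not Tesseract's code and it builds a plain trie rather than a real DAWG (no suffix merging); the names `build_trie` and `depth` and the sample Bengali word are my own assumptions, chosen only for illustration. It shows how one word produces a different number of nodes depending on whether it is split into bytes or into characters.

```python
def build_trie(words, as_bytes):
    """Build a nested-dict trie; each dict key is one edge label.

    Hypothetical helper for illustration only, not Tesseract's API.
    as_bytes=True  -> one node per UTF-8 byte (the behaviour described above)
    as_bytes=False -> one node per Unicode code point
    """
    root = {}
    for word in words:
        units = word.encode("utf-8") if as_bytes else word
        node = root
        for unit in units:
            node = node.setdefault(unit, {})
        node[None] = True  # end-of-word marker
    return root


def depth(trie):
    """Length of the longest chain of edge labels, ignoring end markers."""
    children = [child for key, child in trie.items() if key is not None]
    return 1 + max(map(depth, children)) if children else 0


word = "কলম"  # sample word: three Bengali letters, three UTF-8 bytes each

print(depth(build_trie([word], as_bytes=True)))   # 9 nodes deep, one per byte
print(depth(build_trie([word], as_bytes=False)))  # 3 nodes deep, one per letter
```

With byte nodes, no single node corresponds to a letter of the script, so any code that walks the dictionary letter by letter would have to be reworked to reassemble characters first.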
