Add Sámi tesseract model #11


Draft: wants to merge 2 commits into main

Conversation

titaenstad

I would like to share the best Tesseract model from our work on Sámi OCR for the texts in the collection of the National Library of Norway. Unfortunately, the training, validation and test data are copyright protected and cannot be shared.

We describe the work in the paper "T Enstad, T Trosterud, MI Røsok, Y Beyer, M Roald. Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway. Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)"

Paper link: https://hdl.handle.net/10062/107202

@stweil
Member

stweil commented Feb 28, 2025

Nice. It's a pity that parts of your ground truth data are not publicly available. Only freely available open data allows reproducible or new training runs, not only for Tesseract, but also for other free OCR software.

It would be interesting to repeat the training process based on Latin.traineddata. It might work better than starting with the Norwegian model. And I assume that you did not add a dictionary to your trained model? That might increase the recognition rate further.
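For illustration, a retraining run starting from Latin could look roughly like this with tesstrain (a sketch only: the model name smi, the ground truth directory and the iteration count are placeholders, and TESSDATA must point to a directory containing Latin.traineddata from tessdata_best):

# Fine-tune from the Latin model instead of the Norwegian one.
make training MODEL_NAME=smi START_MODEL=Latin TESSDATA=../tessdata_best GROUND_TRUTH_DIR=data/smi-ground-truth MAX_ITERATIONS=100000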

@titaenstad
Author

titaenstad commented Mar 5, 2025

It would be interesting to repeat the training process based on Latin.traineddata. It might work better than starting with the Norwegian model.

Why do you think that?

And I assume that you did not add a dictionary to your trained model? That might increase the recognition rate further.

Is this the WORDLIST_FILE parameter to tesstrain? Can you tell me more/link to documentation about this?

@stweil
Member

stweil commented Mar 5, 2025

It would be interesting to repeat the training process based on Latin.traineddata. It might work better than starting with the Norwegian model.

Why do you think that?

Latin.traineddata received more training and has a larger character set than nor.traineddata, so I expect it covers more of the additional characters required for Sámi.
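If you want to check this, you can extract both unicharsets and compare their sizes (assuming both .traineddata files are in the current directory and tmp is a scratch directory):

# Extract all components from both models.
combine_tessdata -u Latin.traineddata tmp/Latin.
combine_tessdata -u nor.traineddata tmp/nor.

# Compare the number of character entries in each unicharset.
wc -l tmp/Latin.lstm-unicharset tmp/nor.lstm-unicharset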

And I assume that you did not add a dictionary to your trained model? That might increase the recognition rate further.

Is this the WORDLIST_FILE parameter to tesstrain? Can you tell me more/link to documentation about this?

You need combine_tessdata, dawg2wordlist and wordlist2dawg.

This process should do the job:

# Create intermediate directory.
mkdir -p tmp

# Extract all components from smi.traineddata.
combine_tessdata -u smi.traineddata tmp/smi.

# Extract all components from nor.traineddata.
combine_tessdata -u nor.traineddata tmp/nor.

# Convert wordlist from DAWG to text.
dawg2wordlist tmp/nor.lstm-unicharset tmp/nor.lstm-word-dawg tmp/nor.lstm-word.txt

Now edit or replace the dictionary tmp/nor.lstm-word.txt to get an optimized dictionary tmp/smi.lstm-word.txt for Sámi.
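For example, if you already have a plain-text Sámi word list with one word per line (words_smi.txt is a hypothetical file name), you could merge it with the extracted Norwegian list:

# Merge the Norwegian wordlist with a Sámi wordlist and remove duplicates.
cat tmp/nor.lstm-word.txt words_smi.txt | sort -u > tmp/smi.lstm-word.txt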

# Convert wordlist from text to DAWG.
wordlist2dawg tmp/smi.lstm-word.txt tmp/smi.lstm-word-dawg tmp/smi.lstm-unicharset

# Add wordlist DAWG to smi.traineddata.
combine_tessdata -o smi.traineddata tmp/smi.lstm-word-dawg

The files tmp/nor.lstm-number-dawg and tmp/nor.lstm-punc-dawg can also be converted to text, edited if desired, converted back to DAWG for smi, and added to smi.traineddata. They might help with recognizing numbers and punctuation.
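The commands follow the same pattern as for the word DAWG; for the number patterns, for instance (the punctuation DAWG works the same way, but check the wordlist2dawg usage message for the optional reverse policy flag):

# Convert number patterns from DAWG to text.
dawg2wordlist tmp/nor.lstm-unicharset tmp/nor.lstm-number-dawg tmp/nor.lstm-number.txt

# Edit if needed, save as tmp/smi.lstm-number.txt, then convert back to DAWG.
wordlist2dawg tmp/smi.lstm-number.txt tmp/smi.lstm-number-dawg tmp/smi.lstm-unicharset

# Add the number DAWG to smi.traineddata.
combine_tessdata -o smi.traineddata tmp/smi.lstm-number-dawg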

Hint: you can start with a small dictionary which only contains the correct spellings of some words that typically cause OCR problems, then test whether it has a positive effect. It's even simpler to test a small dictionary directly with tesseract --user-words WORDLIST.
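For example (the image and word list file names are just placeholders):

# Run OCR with a small custom word list, without modifying the model.
tesseract page.png page -l smi --user-words mywords.txt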
