eScriptorium with Tesseract extension (a step-by-step guide)

version 1.0 | January 2024

“Tesseract is an open source text recognition (OCR) Engine, available under the Apache 2.0 license. [..] Tesseract can be used directly via command line, or (for programmers) by using an API to extract printed text from images. It supports a wide variety of languages. Tesseract doesn’t have a built-in GUI, but there are several available from the 3rdParty page. External tools, wrappers and training projects for Tesseract are listed under AddOns.”[Tesseract-OCR]

Tesseract serves as an important OCR engine within the OCR-D project. Although it is supported by various third-party GUI extensions, there is currently no transcription platform whose training options support it.

The following step-by-step guide provides an introduction to the installation and use of eScriptorium with Tesseract.

Contents

  0. Who is this guide for?
  1. How to install and set up eScriptorium and the Tesseract extension
    1.1 Installation of Tesseract
    1.2 Installation of eScriptorium with Tesseract extensions
  2. How to use eScriptorium with Tesseract extension
    2.1 What to consider when applying Tesseract models to Kraken segmented data?
  3. How to fetch and upload a Tesseract transcription model
    3.1 Where to find Tesseract models
    3.2 How to choose a model for a specific use case
    3.3 How to upload Tesseract models to eScriptorium
  4. How to apply a Tesseract transcription model to your dataset
  5. How to train a Tesseract transcription model
    5.1 Provide or create training data (ground truth)
    - Addendum 1: How much training data (ground truth) do I need?
    - Addendum 2: Always follow transcription guidelines!
    5.2 Fine-tune a text recognition model
  6. License

0. Who is this guide for?

This guide is for eScriptorium users who want to set up, use and train Tesseract models in eScriptorium.

Note: This guide does not provide a basic understanding of the graphical interface and functionality of the platform. Here are some resources to get you started:

Although an attempt was made to keep the guide as accessible as possible, certain technical terms could not be avoided. Where these are to be found in the guide, we try to explain them as clearly as possible.

This extension and the guide were created during the 3rd OCR-D funding phase in the module project Workflow for work-specific training based on generic models with OCR-D as well as ground truth enhancement (2021–2023) at Mannheim University Library. The module project was funded by the German Research Foundation (DFG).

Feedback is always welcome!

1. How to install and set up eScriptorium and the Tesseract extension

The installation differs only slightly from that of plain eScriptorium. You first need to install the Tesseract software and can then start the eScriptorium installation.

Note: For now, only the full installation path is available. A Docker image is planned if the extension is not merged.

1.1. Installation of Tesseract

Until the changes to Tesseract are merged into the official version, you need to compile a modified Tesseract version.

Preparation

sudo apt-get install g++ # or clang++ (presumably)
sudo apt-get install autoconf automake libtool
sudo apt-get install pkg-config
sudo apt-get install libpng-dev
sudo apt-get install libjpeg8-dev
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libwebpdemux2 libwebp-dev
sudo apt-get install libopenjp2-7-dev
sudo apt-get install libgif-dev
sudo apt-get install libarchive-dev libcurl4-openssl-dev
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
sudo apt-get install libleptonica-dev
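
Before compiling, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch: the function name `missing_tools` is hypothetical, and the mapping from packages to commands is an assumption (in particular, the `libtool` package is assumed to provide `libtoolize`). `make` is not part of the package list above but is needed by the later build steps, so it is checked as well.

```shell
# Print the name of every command from the argument list that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed mapping from the packages above to the commands they provide.
missing=$(missing_tools g++ autoconf automake libtoolize pkg-config make)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names are joined on one line.
  echo "Missing build tools:" $missing
else
  echo "All build tools found."
fi
```

If a tool is reported as missing, installing the corresponding package with apt-get and re-running the check should be enough.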

Clone the modified Tesseract version

git clone https://github.com/JKamlah/tesseract/ -b lstmf-writer --single-branch 

Installation (the make commands accept -j to run multiple jobs in parallel)

cd tesseract
./autogen.sh
LIBLEPT_HEADERSDIR=$HOME/local/include ./configure \
  --prefix=$HOME/local/ --with-extra-libraries=$HOME/local/lib
make
make install
sudo ldconfig
make training
sudo make training-install
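
Because the configure call above uses `--prefix=$HOME/local/`, the binaries should end up in `$HOME/local/bin`, which is usually not on the PATH. A minimal sketch for making them visible in the current shell and checking the result, assuming the build finished without errors:

```shell
# Make the freshly installed binaries visible in this shell session.
export PATH="$HOME/local/bin:$PATH"

# Report which tesseract binary is picked up, if any.
if command -v tesseract >/dev/null 2>&1; then
  command -v tesseract    # path of the binary that will be used
  tesseract --version     # version string of the build
  # List the models found in the tessdata directory (may be empty at first).
  tesseract --list-langs || echo "no models found yet"
else
  echo "tesseract not found on PATH; check the build output"
fi
```

To keep the setting across sessions, the `export` line can be added to a shell startup file such as `~/.profile`.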

If you run into problems during the installation, you can find help in the official Tesseract compiling documentation.

1.2. Installation of eScriptorium with Tesseract extensions

At this point, a modified version of eScriptorium must be installed.

There are several guides on how to install eScriptorium:

Since these installation instructions are still being improved, it is recommended to follow one of the guides above and change only one step: instead of cloning the current eScriptorium repository, run the following command:

git clone https://github.com/JKamlah/eScriptorium/ -b extension-tesseract --single-branch 
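
Since the rest of the installation follows the regular guides, it is easy to end up with the wrong checkout. A small sketch for confirming that the extension branch is the one in use; the function name `on_branch` is hypothetical, and the directory name `eScriptorium` assumes the default chosen by `git clone`:

```shell
# Succeed only if the repository in directory $1 has branch $2 checked out.
on_branch() {
  [ "$(git -C "$1" symbolic-ref --short HEAD 2>/dev/null)" = "$2" ]
}

# Assumed directory name: the default created by the clone command above.
if on_branch eScriptorium extension-tesseract; then
  echo "extension-tesseract branch is checked out"
else
  echo "not on extension-tesseract (or eScriptorium/ not found)"
fi
```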

2. How to use eScriptorium with Tesseract extension

Working with eScriptorium does not change fundamentally. All functions are still available, but the user is now able to upload, apply, train, and export Tesseract models. The uploading of images, the initiation of processes, and the transcription functionality remain the same. The Tesseract extension does not add functionality for layout recognition.

2.1 What to consider when applying Tesseract models to Kraken segmented data?

Since Tesseract’s text recognition models heavily rely on the appropriate text line masks, it is sometimes necessary to modify the corresponding text lines. This can be done manually, by training a new layout recognition model or by applying an external program.

One possible solution is PagePlus, which assists users in optimizing their documents for training, recognition and information extraction. Such optimizations are advantageous or even necessary not only for recognition with Tesseract, but often for Kraken as well.

3. How to fetch and upload a Tesseract transcription model

3.1. Where to find Tesseract models

Fine-tuning in particular requires an existing text recognition model.

Note: Tesseract provides two kinds of text recognition models: best and fast. While only the best models can be fine-tuned, the fast models are smaller and faster.

TL;DR: Only best models can be fine-tuned!
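
As an illustration, one source of best models is the tessdata_best repository of the tesseract-ocr project on GitHub. The sketch below builds a download URL for a given language code. The function names are hypothetical, and the URL layout (raw files on the `main` branch) and the example code `deu` are assumptions; the download itself is only defined, not executed.

```shell
# Build the download URL for a "best" model of the given language code.
# Assumption: the tessdata_best repository keeps its models on the "main" branch.
tessdata_best_url() {
  printf 'https://github.com/tesseract-ocr/tessdata_best/raw/main/%s.traineddata\n' "$1"
}

# Download a "best" model into the current directory (needs network access).
fetch_best_model() {
  curl -fL -o "$1.traineddata" "$(tessdata_best_url "$1")"
}

# Show the URL for the German model; run `fetch_best_model deu` to download it.
tessdata_best_url deu
```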

Here is a list of places where Tesseract models can be found:

3.2. How to choose a model for a specific use case

The performance of a model depends on various factors and must be tested for each use case. For example, a model trained only on traditional Chinese characters might not perform well on German documents printed in the typeface Fraktur.

Here are some points for orientation that can help with the assessment of a model:

Note: As a rule of thumb, test generic models first. Generic or base models are usually trained on a wide variety of data (different documents, typefaces etc.) from a specific domain (e.g. printed French documents of the 18th century), so they are a good starting point if the model name and description roughly fit the use case at hand.

3.3. How to upload Tesseract models to eScriptorium

All downloaded models can be uploaded to eScriptorium by clicking on “My Models” in the upper right corner of the screen. Click on “Upload a model” on the next screen and choose the model you want to upload.

4. How to apply a Tesseract transcription model to your dataset

To apply a Tesseract transcription model:

  1. Click on the “Images” tab.
  2. Click on the “Select all” button.
  3. Click on the blue “Transcribe” button.
  4. Select a “Model” with the first dropdown menu.

A pop-up should open that looks like this:

5. How to train a Tesseract transcription model

Tesseract only supports training models for the OCR (transcription) task. With eScriptorium, Tesseract OCR models can be trained in just a few clicks.

New models can be trained (training from scratch) and existing models can be fine-tuned (fine-tuning) for specific use cases or domains. The training of OCR models is often carried out via the command line and requires appropriate knowledge. Since eScriptorium provides a graphical user interface, users without command line knowledge can also carry out trainings. It is necessary to understand the area of application of the two training variants mentioned:

In many cases, fine-tuning can be a time- and resource-efficient method for improving an existing text recognition model for a new use case. In order to carry out fine-tuning, an existing model is required, which is adapted to the new use case during the fine-tuning training process.
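
For comparison, the command-line route that eScriptorium spares its users typically looks like the sketch below. It only prints the commands (a dry run) and is not what the extension runs internally. The function name and all file names are hypothetical, and the iteration count of 400 is an arbitrary illustrative value; `combine_tessdata` and `lstmtraining` are the Tesseract training tools built with `make training`.

```shell
# Print (not run) the usual command-line steps for fine-tuning a "best" model.
# $1: base model file, $2: file listing .lstmf training files, $3: output name
print_finetune_steps() {
  base=$1 list=$2 out=$3
  cat <<EOF
combine_tessdata -e $base $out.base.lstm
lstmtraining --continue_from $out.base.lstm --traineddata $base --train_listfile $list --model_output $out --max_iterations 400
lstmtraining --stop_training --continue_from ${out}_checkpoint --traineddata $base --model_output $out.traineddata
EOF
}

# Hypothetical file names, for illustration only.
print_finetune_steps deu.traineddata train.list mymodel
```

The first command extracts the LSTM network from the base model, the second continues training from it, and the third converts the resulting checkpoint into a usable model file.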

The first steps in the training process are the same as for Kraken models.

5.1 Provide or create training data (ground truth)

These steps are the same as for Kraken models:

  1. Create a new project and document
  2. Import your images
  3. Run layout segmentation on your data
  4. Check the layout segmentation
  5. Correct the text regions
  6. Correct the baselines and line masks
  7. Correct the layout segmentation for all pages
  8. Run text recognition on your data
  9. Check the transcriptions
  10. Improve the transcriptions and create ground truths

Note: If no base model is available for steps 3 and 8, these steps can also be carried out manually. However, experience shows that working with a model is usually much quicker.

Addendum 1: How much training data (ground truth) do I need?

Experience has shown that even a small amount of training data is enough to start fine-tuning an existing text recognition model that already works somewhat well on your data. With regard to fine-tuning, an iterative approach should be followed:

  1. Create 2 to 3 pages of training data by correcting the automatically generated transcriptions as shown in step 10.
  2. Fine-tune the text recognition model you have used in step 8 with the corrected ground truth.
  3. Test and evaluate if the fine-tuned model yields better transcriptions on your data than before.
  4. If not, repeat 1 to 3 to create more training data. Fine-tune new models and evaluate them on your data until the results are satisfactory.

An example workflow with iterations can look like this:

  1. Create 2 pages of training data by correcting transcriptions generated by a text recognition model
  2. Fine-tune the text recognition model you have already used with those 2 pages of training data
  3. Evaluate if the fine-tuned model produces better transcriptions
  4. If the results are better but still need improvement, create additional training data, e.g. another 4 pages
  5. Fine-tune the first model again with your 6 pages of ground truth
  6. Evaluate if the second fine-tuned model produces better transcriptions
  7. Continue by iterating …
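
The evaluation steps above can be approximated outside the interface with a very crude measure: the share of text lines that a model reproduces exactly. The sketch below assumes that ground truth and model output have been exported as plain text files with one text line per row and in the same order. The function name `line_accuracy` and the file names are hypothetical, and this is not a character error rate.

```shell
# Print the fraction of lines in the model output ($2) that are identical
# to the corresponding line of the ground truth ($1).
line_accuracy() {
  awk '
    NR == FNR { gt[FNR] = $0; n = FNR; next }    # first file: store ground truth
    FNR <= n && $0 == gt[FNR] { same++ }         # second file: count exact matches
    END { if (n > 0) printf "%.2f\n", same / n; else print "0.00" }
  ' "$1" "$2"
}

# Tiny demonstration with made-up data: 3 of 4 lines match.
demo=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$demo/ground_truth.txt"
printf 'a\nx\nc\nd\n' > "$demo/model_output.txt"
line_accuracy "$demo/ground_truth.txt" "$demo/model_output.txt"   # prints 0.75
```

Comparing this value for the base model and each fine-tuned model on the same held-out pages gives a rough indication of whether another iteration helped.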

Note: At first glance this process looks time-consuming, as you have to repeat certain steps (creating training data, training, evaluation). Although this is true, iteration can ultimately lead to a rapid improvement of the generated transcriptions, as the fine-tuned models get better with each training run and thus produce fewer errors that need to be corrected.

Addendum 2: Always follow transcription guidelines!

Transcription guidelines are a set of rules and instructions provided to individuals who are manually transcribing or annotating text from various sources. These guidelines serve to standardize the transcription process, ensuring consistency, accuracy, and clarity in the resulting transcribed data. Guidelines should be used for both the creation and correction of transcriptions. Refer to the chapter Ground truth guidelines for transcriptions to learn more.

5.2 Fine-tune a text recognition model

If you have created a sufficient amount of training data (refer to Addendum 1: How much training data (ground truth) do I need?), the fine-tuning process itself is simple.

  1. Click on the “Images” tab.
  2. Click on the “Select all” button.
  3. Click on the blue “Train” button.
  4. Click on the “Recognizer” button.

A pop-up should open that looks like this:


The model selection can be filtered by:

  1. OCR engine
  2. Model name

Lastly, click on the blue “Train” button to start the fine-tuning.

A running training is shown as below:


If you want to view the training progress, click on “My models”:


The model you are currently training will appear in this overview. By clicking on the button “Toggle versions” you can view all currently finished training epochs as well. You will be notified as soon as the training has finished.

6. License

This guide is licensed under CC0-1.0.