Skip to content

NeuralNetsInTesseract4.00

Amit D edited this page Jan 5, 2017 · 17 revisions

Overview of the new neural network system in Tesseract 4.00

Introduction

Tesseract 4.00 includes a new neural network subsystem configured as a textline recognizer. It has its origins in OCRopus' Python-based LSTM implementation, but has been totally redesigned for Tesseract in C++. The neural network system in Tesseract pre-dates Tensor Flow, but is compatible with it, as there is a network description language called Variable Graph Specification Language (VGSL), that is also available for Tensor Flow. See https://github.com/tensorflow/models/tree/master/street

The idea of VGSL is that it is possible to build a neural network and train it without having to learn a lot of anything. There is no need to learn Python, Tensor Flow, or even write any C++ code. It is merely required to understand the VGSL specification language well enough to build syntactically correct network descriptions. Some basic knowledge of what the various neural network layer types are and how they are combined will go a very long way.

Integration with Tesseract

The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.

The neural network engine is selected by specifying OEM_LSTM_ONLY as the OcrEngineMode to TessBaseAPI::Init. (It will probably be made the default by release of 4.00.) From an API perspective, that is all that is required to use it. To recognize text from an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can be used from the command-line with -psm 13

The neural network engine has not yet been integrated to enable the multi- language mode that worked with Tesseract 3.04, but this will be improved in a future release. Vertical text is also not yet supported, so support for Japanese and Traditional Chinese for example are limited to horizontally rendered text.

Hardware and CPU Requirements

The Tesseract 4.00 neural network subsystem is heavily compute-intensive, using the order of ten times the CPU resources of the base Tesseract, but the impact is mitigated, if your platform supports it as follows:

  • Open MP allows use of four cores in parallel if your machine has them.
  • Intel/AMD processors that support SSE and/or AVX benefit from SIMD parallelization of the core matrix multiplications.

On a machine with multiple cores, and AVX, an easy English image may take twice as much real time, and use 7 times the CPU as base Tesseract, whereas Hindi takes more CPU than base Tesseract, but actually runs faster in terms of real time.

If the above components are missing, there is a slower plain C++ implementation that enables the code to still work, however, not on big-endian hardware! (Most mobile devices).

For Open Source Contributors

The initial implementation lacks the following:

  • Big-endian support. The swaps are missing from the (De)Serialize methods of the new neural network code.
  • There is a C++ implementation if the hardware does not have SSE and/or AVX, but the code could benefit from SIMD implementations for other hardware, such as ARM. See the new arch directory for where to insert the code.
  • Windows support of SSE and AVX is implemented but untested.

Basics of the Implementation

All network layer types are derived from the Network base-class. The Plumbing sub-class is a base-class for layers that manipulate other layers in some way, e.g. by reshaping their input/output or organising a group of layers.

The input/output data "Tensor" is NetworkIO and the weights are stored in a WeightMatrix, both of which contain a Tesseract GENERIC_2D_ARRAY to hold the data. LSTMRecognizer provides the higher-level abstraction of converting an image of a textline to a sequence of tesseract WERD_RES classes. LSTMTrainer likewise handles the abstraction of training a network on an image of a textline that has a UTF-8 string 'truth'. NetworkBuilder takes responsibility for converting the VGSL specification language to a graph of network elements.

Adding a new Layer Type

A new layer class must derive from Network or Plumbing and implement at least the following virtual methods:

  • spec, which returns a string corresponding to the string that generated the layer.
  • Serialize/DeSerialize to save/restore the layer to/from a TFile.
  • Forward to run the layer in the forwards direction.
  • Backward to run the layer in the backwards direction during training.

Layers that have weights must also implement Update to update the weights with a set of gradients. There are quite a few other methods that may need to be implemented, depending on the specific requirements of the new layer. See network.h for more information on the methods that may require implementing.

  • NetworkBuilder must be modified to parse the specification of the new type.
  • The NetworkType enum must be extended to include the new type.
  • A corresponding entry in Network::kTypeNames must be added for the new type.
  • Network::CreateFromFile must be modified to construct the appropriate type on deserialization.
  • As with any new code, lstm/Makefile.am needs to be updated with the new filenames.

As of 02/02/2020


These wiki pages are no longer maintained.

All pages were moved to tesseract-ocr/tessdoc.

The latest documentation is available at https://tesseract-ocr.github.io/.


Clone this wiki locally