Tesseract-UI-Tools

  1. About
  2. Installation
  3. Usage
  4. Mail Notification
  5. Troubleshooting & details

About

Tesseract UI Tools is an application that applies the Tesseract OCR engine to multiple files in bulk, creating PDFs with a searchable text layer. It is designed to make batch processing of a large number of files easy: you can queue multiple jobs, and they are processed in the background.

A job consists of multiple parameters: input folder, output folder, languages to recognize, preprocessing strategy, DPI, quality and minimum confidence.
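As a rough illustration, the parameters of a job can be pictured as a simple record. The sketch below uses Python for readability; the class name `Job`, the field names and every default value are hypothetical and are not taken from the application's source.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Job:
    """Hypothetical record mirroring the job parameters listed above."""
    input_folder: Path
    output_folder: Path
    languages: list[str] = field(default_factory=lambda: ["eng"])
    strategy: str = "Fast"         # name of a preprocessing strategy
    dpi: int = 300                 # assumed example value
    quality: int = 80              # assumed example value
    min_confidence: float = 25.0   # assumed example value


# Example: a job that recognizes English and Portuguese text.
job = Job(Path("Input"), Path("Output"), languages=["eng", "por"])
print(job)
```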

Currently, the application can handle the following formats: TIFF, TIF, JPEG, JPG and PDF.
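A minimal sketch of how an input folder could be filtered down to these formats. The function names are hypothetical, and matching extensions case-insensitively is an assumption rather than documented behaviour.

```python
from pathlib import Path

# Extensions taken from the list of supported formats above.
SUPPORTED_EXTENSIONS = {".tiff", ".tif", ".jpeg", ".jpg", ".pdf"}


def is_supported(path: Path) -> bool:
    """Return True when the file extension is one of the supported formats."""
    # Assumption: the comparison ignores case, so "scan.TIF" is accepted.
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


def list_supported_files(folder: Path) -> list[Path]:
    """List the supported files directly inside `folder`, sorted by name."""
    return sorted(p for p in folder.iterdir() if p.is_file() and is_supported(p))
```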

Installation

  1. Download the latest release.
  2. Unzip the file.
  3. Run the Tesseract UI Tools.exe file.
    • Note: During the first execution:
      1. .NET Desktop Runtime 6 might be automatically installed via a prompt popup.
      2. The language models will be downloaded, which might take some time.

Usage

Main Screen Form

The main screen shows a table with the queue of jobs for this session. It also lets you set an email address to receive notifications by mail. See Mail Notification to set it up.

Each row of the table represents a job and shows its start time, its input folder and its status (either “Created”, “Running” or “Finished”).

Each job populates the output folder with new PDF files generated by OCR’ing each file of the input folder. If any file cannot be processed due to an error, an errors.txt file is created listing the names of those files.
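The behaviour described above can be sketched as a simple loop. This is an illustration only: `run_job` and the `ocr_file` callback are hypothetical names, and naming each output after the input file's stem and writing errors.txt into the output folder are both assumptions.

```python
from pathlib import Path
from typing import Callable


def run_job(input_folder: Path, output_folder: Path,
            ocr_file: Callable[[Path, Path], None]) -> list[str]:
    """Process every input file; record failures in errors.txt and return them."""
    output_folder.mkdir(parents=True, exist_ok=True)
    failed: list[str] = []
    for source in sorted(p for p in input_folder.iterdir() if p.is_file()):
        target = output_folder / (source.stem + ".pdf")  # assumed naming scheme
        try:
            ocr_file(source, target)  # hypothetical OCR step
        except Exception:
            # Keep going: one bad file should not stop the whole job.
            failed.append(source.name)
    if failed:
        # Written only when at least one file could not be processed.
        errors_path = output_folder / "errors.txt"
        errors_path.write_text("\n".join(failed) + "\n", encoding="utf-8")
    return failed
```

Catching the error per file, rather than around the whole loop, is what allows the remaining files to be processed after a failure.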

Adding a job

Add Job Form

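After clicking on “Add Job” a new form will pop up. In this form you can set the parameters of the job: input folder, output folder, languages to recognize, preprocessing strategy, DPI, quality and minimum confidence.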

When you click “Add Job” on this form, the form closes and the job is added to the queue. You can then add another job. The last parameters used are remembered for the current session only.

Preprocessing strategies

The following table represents the steps of each strategy for preprocessing each image:

Fast               Otsu                     Gaussian                    Fast & Otsu
Reduce Image Size  Reduce Image Size        Reduce Image Size           Reduce Image Size
Tesseract          Median Blur              Median Blur                 Median Blur & Tesseract (1)
-                  Otsu Threshold (Global)  Gaussian Threshold (Local)  Otsu Threshold (Global)
-                  Dilate                   Dilate                      Dilate
-                  Tesseract                Erode                       Tesseract (2)
-                  -                        Tesseract                   Merge Best Tesseract
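One way to read the table above is as an ordered list of steps per strategy. The sketch below only encodes the step names as they appear in the table; the names `STRATEGIES` and `describe` are hypothetical, and the code does not perform any image processing.

```python
# Step names per strategy, top to bottom, as read from the table above.
STRATEGIES = {
    "Fast": [
        "Reduce Image Size",
        "Tesseract",
    ],
    "Otsu": [
        "Reduce Image Size",
        "Median Blur",
        "Otsu Threshold (Global)",
        "Dilate",
        "Tesseract",
    ],
    "Gaussian": [
        "Reduce Image Size",
        "Median Blur",
        "Gaussian Threshold (Local)",
        "Dilate",
        "Erode",
        "Tesseract",
    ],
    "Fast & Otsu": [
        "Reduce Image Size",
        "Median Blur & Tesseract (1)",
        "Otsu Threshold (Global)",
        "Dilate",
        "Tesseract (2)",
        "Merge Best Tesseract",
    ],
}


def describe(strategy: str) -> str:
    """Return the steps of a strategy as a single readable pipeline string."""
    return " -> ".join(STRATEGIES[strategy])


for name in STRATEGIES:
    print(f"{name}: {describe(name)}")
```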

Each strategy was tested against the same input images. The resulting plots follow:

In terms of time, Otsu was the fastest strategy, with a 4-second average, followed by Gaussian with an 8-second average and Fast with a 10-second average. These times are mainly explained by the image being reduced to half its size before any other step. Without reducing the image, the Plain strategy takes an average of 55 seconds, yet neither the number of words nor the confidence improves.

Mail Notification

Mail Settings Form

Tesseract UI Tools can send an email notifying you that a job has been completed.

The email consists of a report with two tables: the first lists the start time and the parameters of the job, and the second lists the Tesseract confidence and processing time for each file.

The report is also saved under the reports folder with the name report-{DateTime.Now}.html. See Troubleshooting & details to find this file.

Sample Mail

Parameter     Value
Start Time    01/01/1970 00:00:00
InputFolder   C:\…\Test\Input
OutputFolder  C:\…\Test\Output
Language      eng

Start Time           Filename  Pages  Time Elapsed  Words Threshold / Words Total  Confidence Mean Threshold / Confidence Mean Total
01/01/1970 00:00:00  File 1    5      36s           957 / 1115                     77.96429 / 68.227715
01/01/1970 00:00:36  File 2
                     File 3

Notes

The column “Words Threshold / Words Total” contains the number of words whose confidence is higher than the requested minimum confidence, followed by the total number of words. Similarly, “Confidence Mean Threshold / Confidence Mean Total” contains the mean confidence of the words above the minimum confidence, followed by the mean confidence of all words.
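The two columns can be reproduced from a list of per-word confidences, as in the sketch below. The function name and the dictionary keys are hypothetical, and returning 0.0 when there are no words to average is an assumption.

```python
from statistics import mean


def confidence_summary(confidences: list[float], min_confidence: float) -> dict:
    """Summarize word confidences the way the two report columns describe."""
    # "Higher than the minimum confidence" is read as a strict comparison.
    above = [c for c in confidences if c > min_confidence]
    return {
        "words_threshold": len(above),
        "words_total": len(confidences),
        "mean_threshold": mean(above) if above else 0.0,           # assumed fallback
        "mean_total": mean(confidences) if confidences else 0.0,   # assumed fallback
    }


summary = confidence_summary([90.0, 80.0, 40.0, 10.0], min_confidence=50.0)
print(f'{summary["words_threshold"]} / {summary["words_total"]}')  # 2 / 4
print(f'{summary["mean_threshold"]} / {summary["mean_total"]}')    # 85.0 / 55.0
```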

Configure Mail Server Settings

To receive this report by mail, you need access to a mail relay server or a Google Account.

Using a server:

  1. On the main screen, click “Mail Settings”
  2. Fill in the Host, Port and From inputs.
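To show how the Host, Port and From values fit together, here is a sketch of sending an HTML report through a relay. It assumes the relay speaks SMTP and accepts unauthenticated mail, which this document does not state; the function names, the subject line and the example addresses are hypothetical.

```python
import smtplib
from email.message import EmailMessage


def build_report_mail(sender: str, recipient: str, html_report: str) -> EmailMessage:
    """Build a message carrying the report as HTML with a plain-text fallback."""
    message = EmailMessage()
    message["From"] = sender                             # the "From" setting
    message["To"] = recipient                            # address set on the main screen
    message["Subject"] = "Tesseract UI Tools: job finished"  # assumed subject
    message.set_content("Your OCR job has finished. Open the HTML version for the report.")
    message.add_alternative(html_report, subtype="html")
    return message


def send_via_relay(host: str, port: int, message: EmailMessage) -> None:
    """Hand the message to the relay (assumes plain SMTP without authentication)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)


mail = build_report_mail("ocr@example.org", "user@example.org", "<p>Report</p>")
print(mail["From"], "->", mail["To"])
```

Building the message separately from sending it makes it possible to inspect the mail without contacting any server.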

Using a Google Account*:

  1. On the main screen, click “Mail Settings”
  2. Click on “Google”
  3. Log in with the Google account that will send the email.
  4. Allow Tesseract UI Tools to send mail with your account.

*Note: steps 1 and 2 must be done every time you open the application; steps 3 and 4 might not be required every time.

Troubleshooting & details

Files\
Reports\
exceptions.log
    {filename}\
        <page>.tiff
        <page.dpi.quality>.jpeg
        <page.strategy.lang>.tsv
    <filename.lang.strategy>.html # file report
    report-{timestamp}.html # job report 

Uninstalling

To completely uninstall the application, delete the folder containing the executable that was created during installation, and also delete the folders listed above.