Tesseract UI Tools is an application that applies the Tesseract OCR engine to multiple files in bulk, creating PDFs with a searchable text layer. It is designed to make batch processing of a large number of files easy: you can queue multiple jobs, and the application processes them in the background.
A job consists of multiple parameters: input folder, output folder, languages to recognize, preprocessing strategy, DPI, quality and minimum confidence.
Currently, the application can handle the following formats: TIFF, TIF, JPEG, JPG and PDF.
To start the application, run the `Tesseract UI Tools.exe` file.
The main screen shows a table with the queue of jobs for this session. It also lets you set an email address to receive notifications by mail. See Mail Notifications to set it up.
Each line of the table represents a job. It has a start time, the input folder and the status (which is either “Created”, “Running” or “Finished”).
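As an illustration only, a job and its status could be modeled as below. This is a minimal sketch, not the application's actual data model: the class names, field names and field types (for example, DPI and quality as integers) are assumptions; only the parameter list and the three status labels come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    # The three statuses shown in the jobs table.
    CREATED = "Created"
    RUNNING = "Running"
    FINISHED = "Finished"


@dataclass
class Job:
    # Hypothetical field names and types for the documented job parameters.
    input_folder: str
    output_folder: str
    languages: list[str]
    strategy: str
    dpi: int
    quality: int
    minimum_confidence: float
    status: JobStatus = JobStatus.CREATED
    start_time: Optional[datetime] = None

    def start(self) -> None:
        """Mark the job as running and record when it started."""
        self.status = JobStatus.RUNNING
        self.start_time = datetime.now()

    def finish(self) -> None:
        """Mark the job as finished."""
        self.status = JobStatus.FINISHED
```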
Each job populates the output folder with new PDF files generated by running OCR on each file in the input folder. If any file cannot be processed due to an error, an errors.txt file is created listing the names of those files.
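The behavior described above can be sketched roughly as follows. This is not the application's code: `run_job` and the `ocr_to_pdf` callback are hypothetical names, and writing errors.txt into the output folder and naming each PDF after its source file are assumptions made for the example.

```python
from pathlib import Path
from typing import Callable

# Extensions matching the supported formats (TIFF, TIF, JPEG, JPG, PDF).
SUPPORTED_EXTENSIONS = {".tiff", ".tif", ".jpeg", ".jpg", ".pdf"}


def run_job(input_folder: Path, output_folder: Path,
            ocr_to_pdf: Callable[[Path, Path], None]) -> list[str]:
    """Process every supported file and return the names of those that failed.

    `ocr_to_pdf(source, target)` is a hypothetical stand-in for the OCR step.
    """
    output_folder.mkdir(parents=True, exist_ok=True)
    failed: list[str] = []
    for source in sorted(input_folder.iterdir()):
        if not source.is_file() or source.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue  # skip subfolders and unsupported formats
        target = output_folder / (source.stem + ".pdf")  # assumed naming
        try:
            ocr_to_pdf(source, target)
        except Exception:
            # One failing file does not stop the job; it is only recorded.
            failed.append(source.name)
    if failed:
        # errors.txt is written only when at least one file failed
        # (its location in the output folder is an assumption).
        (output_folder / "errors.txt").write_text(
            "\n".join(failed) + "\n", encoding="utf-8")
    return failed
```

Catching the error per file, rather than around the whole loop, is what lets the remaining files still be processed.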
After clicking on “Add Job” a new form will pop up. In this form you can set the parameters of the job:
After you click “Add Job” on this form, the form closes and the job is added to the queue. You can then add another job. The last parameters used are remembered only for the current session.
The following table represents the steps of each strategy for preprocessing each image:
| Fast | Otsu | Gaussian | Fast & Otsu |
|---|---|---|---|
| Reduce Image Size | Reduce Image Size | Reduce Image Size | Reduce Image Size |
| Tesseract | Median Blur | Median Blur | Median Blur & Tesseract (1) |
| | Otsu Threshold (Global) | Gaussian Threshold (Local) | Otsu Threshold (Global) |
| | Dilate | Dilate | Dilate |
| | Tesseract | Erode | Tesseract (2) |
| | | Tesseract | Merge Best Tesseract |
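To make the “Otsu Threshold (Global)” step more concrete, here is a minimal pure-Python sketch of Otsu's method on a flat list of 8-bit grayscale pixel values. It illustrates the general technique only; the function names are hypothetical, and the application itself presumably relies on an image-processing library rather than code like this.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, the remaining pixels the other.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    background_count = 0
    background_sum = 0
    for level in range(256):
        background_count += histogram[level]
        background_sum += level * histogram[level]
        foreground_count = total - background_count
        if background_count == 0:
            continue  # no pixels at or below this level yet
        if foreground_count == 0:
            break  # every pixel is already in the background class
        background_mean = background_sum / background_count
        foreground_mean = (total_sum - background_sum) / foreground_count
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = (background_count * foreground_count
                    * (background_mean - foreground_mean) ** 2)
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]
```

The threshold is called global because a single value is chosen for the whole image, in contrast to the local threshold used by the Gaussian strategy.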
Each strategy was tested against the same input images. The results are plotted below:
In terms of processing time, Otsu was the fastest strategy, with an average of 4 seconds, followed by Gaussian with an average of 8 seconds and Fast with an average of 10 seconds. These times are mainly the result of reducing the image to half its size before any other step. Without this reduction, the Plain strategy takes 55 seconds on average, and neither the number of words nor the confidence improves.
Tesseract UI Tools can send an email notifying you that a job has been completed.
The email contains a report with two tables: the first lists the start time and the parameters of the job, and the second gives the Tesseract confidence and processing time for each file processed.
The report is also saved in the reports folder under the name `report-{DateTime.Now}.html`. See Troubleshooting & details to find this file.
Parameter | Value |
---|---|
Start Time | 01/01/1970 00:00:00 |
InputFolder | C:\…\Test\Input |
OutputFolder | C:\…\Test\Output |
Language | eng |
… | … |
Start Time | Filename | Pages | Time Elapsed | Words Threshold / Words Total | Confidence Mean Threshold / Confidence Mean Total |
---|---|---|---|---|---|
01/01/1970 00:00:00 | File 1 | 5 | 36s | 957 / 1115 | 77.96429 / 68.227715 |
01/01/1970 00:00:36 | File 2 | … | | | |
… | File 3 | | | | |
The column “Words Threshold / Words Total” contains the number of words whose confidence is higher than the requested minimum confidence, followed by the total number of words. Similarly, “Confidence Mean Threshold / Confidence Mean Total” contains the mean confidence of the words above the minimum confidence, followed by the mean confidence of all words.
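The two columns can be illustrated with a short sketch that computes them from a list of per-word confidences. The function name and the returned keys are hypothetical, and returning 0.0 when there are no words is an assumption made for the example.

```python
from statistics import mean


def summarize_confidences(confidences: list[float],
                          minimum_confidence: float) -> dict:
    """Compute the counts and means shown in the per-file report table."""
    # "Higher than" is read as a strict comparison.
    above = [c for c in confidences if c > minimum_confidence]
    return {
        "words_threshold": len(above),
        "words_total": len(confidences),
        # Assumption: report 0.0 when there is nothing to average.
        "confidence_mean_threshold": mean(above) if above else 0.0,
        "confidence_mean_total": mean(confidences) if confidences else 0.0,
    }


summary = summarize_confidences([90.0, 80.0, 40.0, 30.0], minimum_confidence=50.0)
print(f"{summary['words_threshold']} / {summary['words_total']}")
```

Because the first mean only covers words above the threshold, it is always at least as high as the mean over all words, as in the sample row above (77.96429 versus 68.227715).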
To receive this report by mail, you need access to a mail relay server or a Google Account.
Using a server:
Using a Google Account*:
`Tesseract UI Tools` to send a mail with your account.
*Note: steps 1 and 2 must be done every time you open the application; steps 3 and 4 might not be required every time.
To extend the application's functionality, see here.
The application saves information in two folders: `%AllUsersProfile%\Tesseract UI Tools\Tesseract UI Tools\1.0.0\` and `%APPDATA%\Tesseract UI Tools\Tesseract UI Tools\1.0.0\`.
The first folder contains the model information for the Tesseract OCR engine to run in any language.
The second folder contains two subfolders and the file `exceptions.log`:
Files\
Reports\
exceptions.log
The `exceptions.log` file contains a log of the application's operations and records any errors that occur.
Inside `Files\`, the application creates a folder for each processed file, named after that file. Inside that folder, for each page and for each job with different parameters, it creates the following files:
{filename}\
<page>.tiff
<page.dpi.quality>.jpeg
<page.strategy.lang>.tsv
Inside `Reports\`, the application creates the following files for each file and job:
<filename.lang.strategy>.html # file report
report-{timestamp}.html # job report
To completely uninstall the application, delete the folder containing the executable that was created during installation, and then delete the two folders described above.
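As a rough aid for locating these folders, the layout described above can be sketched with a few path helpers. The function names are hypothetical, and the exact formatting of each file name component (page number, DPI, quality, strategy, language) is an assumption based on the patterns listed above. The sketch only builds paths; it deletes nothing.

```python
import ntpath  # Windows path rules, usable on any platform

APP_SUBPATH = ntpath.join("Tesseract UI Tools", "Tesseract UI Tools", "1.0.0")


def data_folders(env: dict[str, str]) -> tuple[str, str]:
    """Return (shared folder, per-user folder) built from environment variables.

    `env` is a mapping such as `os.environ`; on Windows its keys are uppercase.
    """
    shared = ntpath.join(env["ALLUSERSPROFILE"], APP_SUBPATH)
    per_user = ntpath.join(env["APPDATA"], APP_SUBPATH)
    return shared, per_user


def page_files(page: int, dpi: int, quality: int,
               strategy: str, lang: str) -> list[str]:
    """Return the per-page file names following the documented patterns."""
    return [
        f"{page}.tiff",                   # <page>.tiff
        f"{page}.{dpi}.{quality}.jpeg",   # <page.dpi.quality>.jpeg
        f"{page}.{strategy}.{lang}.tsv",  # <page.strategy.lang>.tsv
    ]
```

On a Windows machine, calling `data_folders(os.environ)` after `import os` would give the two folders to remove when uninstalling.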