The tesseractwrapper Module

This module contains all of the tesseract wrapping and image-to-text code.

tesseractwrapper.TESSERACT_DIR

str

The default path to the tesseract install directory.

tesseractwrapper.DEFAULT_FORMAT

str

The default image format to send to tesseract if an image does not have a declared format. If the image does declare a format, that format is used when possible.
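The fallback rule can be sketched as follows. This is only an illustration, not the module's code: `choose_format` and `FakeImage` are hypothetical names, and "PNG" is an assumed placeholder because the actual value of DEFAULT_FORMAT is not stated here.

```python
DEFAULT_FORMAT = "PNG"  # assumed placeholder; the module's real value may differ


def choose_format(image, default=DEFAULT_FORMAT):
    """Return the image's declared format, or the default if it has none."""
    declared = getattr(image, "format", None)
    return declared if declared else default


class FakeImage(object):
    """Stand-in for a PIL image; only the ``format`` attribute matters here."""

    def __init__(self, format=None):
        self.format = format


print(choose_format(FakeImage("JPEG")))  # the declared format is kept
print(choose_format(FakeImage()))        # falls back to the default
```

PIL images created in memory typically have no declared format, which is the case the default is meant to cover.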

Main Function

tesseractwrapper.get_text_from_image(image, tesseract_dir_path=u'', stderr=None, psm=3, lang=u'eng', tessdata_dir_path=None, user_words_path=None, user_patterns_path=None, config_name=None, **config_variables)

Uses tesseract to get text from an image.

Outside of image, tesseract_dir_path, and stderr, the arguments mirror the official command line’s usage. A list of the command line options can be found here: https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

Parameters:
  • image (Image.Image or str) – The image to find text from or a path to that image.
  • tesseract_dir_path (Optional[str]) – The path to the directory with the tesseract binary. Defaults to “”, which works if the binary is on the PATH environment variable.
  • stderr (Optional[file]) – The file-like object (implements write) that the tesseract stderr stream will write to. Defaults to None. You can set it to sys.stderr to see all output easily.
  • psm (Optional[int]) – Page Segmentation Mode. Limits Tesseract’s layout analysis (see the Tesseract docs). Default is 3, full analysis.
  • lang (Optional[str]) – The language to use. Default is ‘eng’ for English.
  • tessdata_dir_path (Optional[str]) – The path to the tessdata directory.
  • user_words_path (Optional[str]) – The path to user words file.
  • user_patterns_path (Optional[str]) – The path to the user patterns file.
  • config_name (Optional[str]) – The name of a config file.
  • **config_variables – The config variables for tesseract. A list of config variables can be found here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
Returns:

The parsed text.

Return type:

str

Raises:

subprocess.CalledProcessError – If the tesseract exit status is not a success.
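A caller may want to treat a failed tesseract run as "no text" rather than letting the exception propagate. The sketch below shows one way to do that. The helper `ocr_or_none` and the two stand-in functions are hypothetical; in real use `ocr_func` would be get_text_from_image.

```python
import subprocess


def ocr_or_none(ocr_func, image, **kwargs):
    """Call an OCR function, returning None if tesseract exits with an error."""
    try:
        return ocr_func(image, **kwargs)
    except subprocess.CalledProcessError as error:
        print("tesseract failed with exit status %d" % error.returncode)
        return None


def failing_ocr(image, **kwargs):
    """Stand-in that fails the way a non-zero tesseract exit would."""
    raise subprocess.CalledProcessError(1, ["tesseract"])


def working_ocr(image, **kwargs):
    """Stand-in that always recognizes the same text."""
    return "ABC123"


print(ocr_or_none(working_ocr, None))
print(ocr_or_none(failing_ocr, None))
```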

Examples

Examples assume “image” is a picture of the text “ABC123”. See piltesseract tests for working code.

>>> get_text_from_image(image)
'ABC123'
>>> get_text_from_image(image, psm=10)  #single character psm
'A'

You can use tesseract’s default configs or your own:

>>> get_text_from_image(image, config_name='digits')
'13123'

Without a config file, you can set config variables using optional keywords:

>>> get_text_from_image(
...     image,
...     tessedit_char_whitelist='1',
...     tessedit_ocr_engine_mode=1,  # cube mode enum found in Tesseract-OCR docs
...     )
'1  11 '
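The tesseract command line accepts config variables as `-c name=value` options. The sketch below shows how keyword arguments such as the ones above could be turned into that form. Whether the wrapper builds its command this way is an assumption, and `build_config_args` is a hypothetical helper, not part of the module.

```python
def build_config_args(**config_variables):
    """Turn keyword arguments into a flat list of ``-c name=value`` options."""
    args = []
    for name in sorted(config_variables):  # sorted only to get a stable order
        args.extend(["-c", "%s=%s" % (name, config_variables[name])])
    return args


print(build_config_args(tessedit_char_whitelist='1',
                        tessedit_ocr_engine_mode=1))
```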

Advanced Functions

tesseractwrapper.get_tesseract_process(commands, tesseract_dir_path=u'', stdin=-1, stdout=-1, stderr=-1)

Opens and returns a process running the tesseract command line utility.

Uses subprocess.Popen to start the process. The stdin, stdout, and stderr defaults of -1 are the value of subprocess.PIPE, so each stream is piped unless overridden.

Parameters:
  • commands (List[str]) – The command line strings passed into the tesseract binary. Do not include the binary name or path in this variable.
  • tesseract_dir_path (Optional[str]) – The path to the directory with the tesseract binary. Defaults to “”, which works if the binary is on the PATH environment variable.
Returns:

The open subprocess pipe.

Return type:

subprocess.Popen
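The general pattern of opening a piped process and talking to it can be sketched as below. This is not the module's implementation: `open_process` is a hypothetical helper, and the Python interpreter stands in for the tesseract binary so that the sketch runs without tesseract installed.

```python
import os
import subprocess
import sys


def open_process(binary_name, commands, dir_path="",
                 stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                 stderr=subprocess.PIPE):
    """Start ``binary_name`` from ``dir_path`` with the given arguments."""
    # os.path.join("", name) returns name unchanged, so an empty directory
    # leaves the lookup to the PATH environment variable.
    executable = os.path.join(dir_path, binary_name)
    return subprocess.Popen([executable] + list(commands),
                            stdin=stdin, stdout=stdout, stderr=stderr)


# Stand-in child process: reads stdin and writes it back upper-cased.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
process = open_process(sys.executable, ["-c", child_code])
out, err = process.communicate(b"abc123")
print(out)
```

Because all three streams are pipes by default, `communicate` can both feed input to the process and collect its output in one call.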

tesseractwrapper.test_tesseract_path_version(tesseract_dir_path=u'')

Tests that the correct version of tesseract is installed and on the path.

Use this function to ensure that tesseract is on either the default path or a specified path. The function raises ImportError if tesseract does not work or is not the right version. The function is silent if everything passes.

Parameters:
  • tesseract_dir_path (Optional[str]) – The path to the directory with the tesseract binary. Defaults to “”, which works if the binary is on the PATH environment variable.
Raises:

ImportError – If the tesseract requirement is not met.
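A version check of this kind can be sketched by parsing the text that `tesseract --version` prints. The helper `check_version_text` is hypothetical, and the minimum of 3.02 is an assumption made for illustration; the version the module actually requires is not stated here.

```python
import re


def check_version_text(version_text, minimum=(3, 2)):
    """Raise ImportError unless ``version_text`` reports at least ``minimum``."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", version_text)
    if match is None:
        raise ImportError("Could not find a tesseract version in the output.")
    found = (int(match.group(1)), int(match.group(2)))
    if found < minimum:
        raise ImportError("tesseract %d.%d is older than the required %d.%d."
                          % (found + minimum))


# Silent when the requirement is met, as with the function described above.
check_version_text("tesseract 3.04.01\n leptonica-1.73")
```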