The tesseractwrapper Module
This module contains all of the tesseract wrapping and image-to-text code.
tesseractwrapper.TESSERACT_DIR (str)
The default path to the tesseract install directory.
tesseractwrapper.DEFAULT_FORMAT (str)
The default image format to send to tesseract if an image does not have a declared format. If the image does declare a format, that format is used when possible.
Main Function
tesseractwrapper.get_text_from_image(image, tesseract_dir_path=u'', stderr=None, psm=3, lang=u'eng', tessdata_dir_path=None, user_words_path=None, user_patterns_path=None, config_name=None, **config_variables)
Uses tesseract to get text from an image.
Outside of image, tesseract_dir_path, and stderr, the arguments mirror the official command line’s usage. A list of the command line options can be found here: https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
Parameters: - image (Image.Image or str) – The image to extract text from, or a path to that image.
- tesseract_dir_path (Optional[str]) – The path to the directory with the tesseract binary. Defaults to “”, which works if the binary is on the PATH environment variable.
- stderr (Optional[file]) – The file-like object (implements write) that the tesseract stderr stream is written to. Defaults to None. You can set it to sys.stderr (or sys.stdout) to see all output easily.
- psm (Optional[int]) – Page Segmentation Mode. Limits Tesseract’s layout analysis (see the Tesseract docs). Default is 3, full analysis.
- lang (Optional[str]) – The language to use. Default is ‘eng’ for English.
- tessdata_dir_path (Optional[str]) – The path to the tessdata directory.
- user_words_path (Optional[str]) – The path to user words file.
- user_patterns_path (Optional[str]) – The path to the user patterns file.
- config_name (Optional[str]) – The name of a config file.
- **config_variables – The config variables for tesseract. A list of config variables can be found here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
Returns: The parsed text.
Return type: str
Raises: subprocess.CalledProcessError – If the tesseract exit status is not a success.
Examples
Examples assume “image” is a picture of the text “ABC123”. See piltesseract tests for working code.
>>> get_text_from_image(image)
'ABC123'
>>> get_text_from_image(image, psm=10)  # single character psm
'A'
You can use tesseract’s default configs or your own:
>>> get_text_from_image(image, config_name='digits')
'13123'
Without a config file, you can set config variables using optional keywords:
>>> get_text_from_image(
...     image,
...     tessedit_char_whitelist='1',
...     tessedit_ocr_engine_mode=1,  # cube mode enum found in Tesseract-OCR docs
... )
'1 11 '
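Because the keyword arguments mirror the command line, it can help to see how they might translate into command line arguments. The following is only a sketch, not this module's actual implementation: build_tesseract_args is a hypothetical helper, and it assumes the -l, -psm, and -c name=value flags of the tesseract 3.x command line (newer tesseract releases spell the second flag --psm).

```python
def build_tesseract_args(image_path, output_base, psm=3, lang='eng',
                         config_name=None, **config_variables):
    """Hypothetical helper: build tesseract CLI arguments (binary name excluded)."""
    args = [image_path, output_base, '-l', lang, '-psm', str(psm)]
    # Each config variable becomes its own "-c name=value" pair.
    # Sorting keeps the argument order deterministic.
    for name, value in sorted(config_variables.items()):
        args += ['-c', '{}={}'.format(name, value)]
    # A config file name, if given, goes after the options.
    if config_name is not None:
        args.append(config_name)
    return args


args = build_tesseract_args('in.png', 'out', psm=10,
                            tessedit_char_whitelist='1')
print(args)
```

A list like this, without the binary name, has the shape that the commands parameter of get_tesseract_process expects.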
Advanced Functions
tesseractwrapper.get_tesseract_process(commands, tesseract_dir_path=u'', stdin=-1, stdout=-1, stderr=-1)
Opens the tesseract command line utility with Popen and returns the process.
By default, stdin, stdout, and stderr are all piped (-1 is the value of subprocess.PIPE).
Parameters: - commands (List[str]) – The command line strings passed into the tesseract binary. Do not include the binary name or path in this variable.
- tesseract_dir_path (Optional[str]) – The path to the directory with the tesseract binary. Defaults to “”, which works if the binary is on the PATH environment variable.
Returns: The open subprocess pipe.
Return type: subprocess.Popen
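As a rough illustration of the behaviour described above, the sketch below shows one way the binary path and the commands list could be combined before calling Popen. The names build_command and open_process are hypothetical and are not part of this module; the binary name "tesseract" is also an assumption.

```python
import os
import subprocess


def build_command(commands, tesseract_dir_path=''):
    """Hypothetical helper: prepend the tesseract binary path to the arguments."""
    # os.path.join('', 'tesseract') is just 'tesseract', so an empty
    # directory falls back to a lookup on PATH.
    binary = os.path.join(tesseract_dir_path, 'tesseract')
    return [binary] + list(commands)


def open_process(commands, tesseract_dir_path=''):
    """Open the process with all three streams piped (subprocess.PIPE == -1)."""
    return subprocess.Popen(
        build_command(commands, tesseract_dir_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


print(build_command(['--version']))
```

Only build_command is called here, so the example runs without tesseract installed; open_process shows where the piped streams would come in.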
tesseractwrapper.test_tesseract_path_version(tesseract_dir_path=u'')
Tests that the correct version of tesseract is installed and on the path.
Use this function to ensure that tesseract is on either the default path or a specified path. The function raises ImportError if tesseract does not work or is not the right version. The function is silent if everything passes.
Parameters: tesseract_dir_path (Optional[str]) – The path to the directory with the tesseract binary. Defaults to “”, which works if the binary is on the PATH environment variable.
Raises: ImportError – If the tesseract requirement is not met.
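A version check of this kind could be sketched as follows. This is an illustration only, not the module's code: parse_tesseract_version and check_version are hypothetical names, the minimum version (3, 2) is an assumed requirement, and the sample string stands in for the text that tesseract --version prints.

```python
import re


def parse_tesseract_version(version_output):
    """Hypothetical helper: extract (major, minor) from `tesseract --version` text."""
    match = re.search(r'tesseract\s+v?(\d+)\.(\d+)', version_output)
    if match is None:
        raise ImportError('Could not read a tesseract version.')
    return int(match.group(1)), int(match.group(2))


def check_version(version_output, minimum=(3, 2)):
    """Raise ImportError when the reported version is below `minimum`."""
    version = parse_tesseract_version(version_output)
    # Tuples compare element by element, so (3, 4) >= (3, 2).
    if version < minimum:
        raise ImportError('tesseract {}.{} is too old.'.format(*version))
    return version


print(check_version('tesseract 3.04.01\n leptonica-1.73'))
```

Like the documented function, check_version raises nothing when the requirement is met, so a caller only needs to handle ImportError.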