Advanced Example

In this example we will parse a PyPI version number from a badge image.
We will do some manual image preprocessing using PIL.

You will need tesseract on your PATH, and both piltesseract and requests must already be installed with pip.

# This cell is for imports and helper functions.

import copy

# For Python 2/3 compatibility.
try:
    from StringIO import StringIO as BytesIO
except ImportError:
    from io import BytesIO

from PIL import Image, ImageFilter
from piltesseract import get_text_from_image
import requests


def get_image_from_url(url):
    """Gets an image from a url string.

    Args:
        url (str): The url to the image.

    Returns:
        Image: The image downloaded from the url.

    """

    response = requests.get(url)
    image = Image.open(BytesIO(response.content))
    return image


def scale_image_from_width(image, new_width):
    """Rescales an image based on a new width.

    Args:
        image (PIL.Image): The image to scale.
        new_width (int): The new width to scale to.

    Returns:
        PIL.Image: The new scaled image.

    """
    if new_width == image.width:
        return copy.copy(image)
    width_percent = new_width / float(image.width)
    new_height = int(image.height * width_percent)
    new_size = (new_width, new_height)
    # LANCZOS is the filter formerly exposed as Image.ANTIALIAS,
    # which was removed in Pillow 10.
    image = image.resize(new_size, Image.LANCZOS)
    return image
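
The arithmetic inside scale_image_from_width can be checked without loading an image. The sketch below is a pure-Python stand-in and not part of PIL or piltesseract; the helper name scaled_size and the sample sizes are made up for illustration.

```python
def scaled_size(size, new_width):
    """Return the (width, height) that keeps the aspect ratio of `size`.

    Hypothetical helper mirroring the arithmetic of scale_image_from_width;
    `size` is a (width, height) tuple such as PIL's Image.size.
    """
    width, height = size
    if width <= 0:
        raise ValueError("width must be positive")
    # Same ratio-based computation as in scale_image_from_width.
    width_percent = new_width / float(width)
    # int() truncates toward zero, exactly as the helper does.
    new_height = int(height * width_percent)
    return (new_width, new_height)


# Illustrative sizes only; the real badge dimensions may differ.
print(scaled_size((50, 20), 100))   # (100, 40)
print(scaled_size((200, 50), 100))  # (100, 25)
```

Because the height is truncated rather than rounded, the result can be one pixel shorter than the exact aspect ratio would give.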

First, we download the PyPI badge image that contains the version.

url = u'https://img.shields.io/pypi/v/piltesseract.png?branch=master'
pypi_image = get_image_from_url(url)
pypi_image

Now we crop out the information we do not care about.

margin_crop = 1
left = 33
upper = margin_crop
right = pypi_image.width - margin_crop
lower = pypi_image.height - margin_crop
crop_box = (left, upper, right, lower)

version_image = pypi_image.crop(box=crop_box)
version_image
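
PIL's crop box is a (left, upper, right, lower) tuple in pixel coordinates, and the cropped image is right - left pixels wide and lower - upper pixels tall. The sketch below repeats the box arithmetic above in plain Python so it can be checked without an image. The helper names and the 80x20 badge size are assumptions for illustration; the real badge may have different dimensions.

```python
def version_crop_box(size, left=33, margin=1):
    """Build a (left, upper, right, lower) crop box for the version part.

    Hypothetical helper: `left` and `margin` default to the values used
    in this example, and `size` is a (width, height) tuple.
    """
    width, height = size
    box = (left, margin, width - margin, height - margin)
    # Reject boxes that would select no pixels at all.
    if box[0] >= box[2] or box[1] >= box[3]:
        raise ValueError("crop box is empty for size %r" % (size,))
    return box


def cropped_size(box):
    """Return the (width, height) of the region selected by `box`."""
    left, upper, right, lower = box
    return (right - left, lower - upper)


box = version_crop_box((80, 20))  # assumed badge size, for illustration
print(box)                # (33, 1, 79, 19)
print(cropped_size(box))  # (46, 18)
```

The hard-coded left offset of 33 only fits a badge whose label part has that width, so it may need adjusting for other badges.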

If we simply read the text at this point, the result will not be very accurate.
The image is smaller than desired for OCR, and the white-on-orange text does not help.

text = get_text_from_image(version_image)
text
'van:'

Because we know a version consists only of digits, periods, and a “v”, we can pass tesseract a character white list, which makes the result more accurate.

white_list = 'v0123456789.'
text = get_text_from_image(version_image,
                          tessedit_char_whitelist=white_list)
text
'v002'
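
A white list limits which characters tesseract may return, but it cannot guarantee a well-formed version: 'v002' uses only allowed characters yet has lost its periods. One way to catch this is a format check on the OCR output. The sketch below uses only the standard library; the function name and the assumed format (a "v" followed by two or more dot-separated numbers) are illustrative assumptions, not part of piltesseract.

```python
import re

# Assumed format: "v" followed by two or more dot-separated integers.
# \Z anchors the match at the very end of the string.
VERSION_PATTERN = re.compile(r'v\d+(?:\.\d+)+\Z')


def looks_like_version(text):
    """Return True if `text` is shaped like 'v0.0.2' (hypothetical check)."""
    return VERSION_PATTERN.match(text.strip()) is not None


for candidate in ('van:', 'v002', 'v0.0.2'):
    print(candidate, looks_like_version(candidate))
```

Only 'v0.0.2' passes this check, so a failed check can be used as a signal that the image needs more preprocessing.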

We can do better still by preprocessing the image manually: we scale it up and smooth it.

width = 100
preprocessed_image = scale_image_from_width(version_image, width)
preprocessed_image = preprocessed_image.filter(ImageFilter.SMOOTH_MORE)
preprocessed_image
text = get_text_from_image(preprocessed_image)
text
'v0.0.2'

The new result is accurate! We can also add the white list for good measure and reliability.

white_list = 'v0123456789.'
text = get_text_from_image(preprocessed_image,
                          tessedit_char_whitelist=white_list)
text
'v0.0.2'
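
Once the text is read reliably, the string can be turned into numbers for comparison. This is a minimal sketch, assuming the "v" prefix and dot-separated integer parts seen above; parse_version is a hypothetical helper and not part of piltesseract.

```python
def parse_version(text):
    """Convert OCR output such as 'v0.0.2' into a tuple of ints.

    Hypothetical helper; raises ValueError when the text does not have
    the assumed 'v<number>.<number>...' shape.
    """
    text = text.strip()
    if not text.startswith('v'):
        raise ValueError("expected a leading 'v': %r" % text)
    parts = text[1:].split('.')
    # An empty part (as in 'v' or 'v1..2') fails isdigit() as well.
    if not all(part.isdigit() for part in parts):
        raise ValueError("non-numeric version part in %r" % text)
    return tuple(int(part) for part in parts)


print(parse_version('v0.0.2'))  # (0, 0, 2)
# Tuples compare element by element, so versions order naturally.
print(parse_version('v0.0.2') < parse_version('v0.1.0'))  # True
```

Note that a misread such as 'v002' still parses, as (2,), so this helper does not replace checking that the OCR output has the expected number of parts.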