Automatic plaque transcription


The problem at hand is to automatically create text transcriptions from images of English Heritage plaques (and later other types of plaque) so that the admins of OpenPlaques.org don't have to transcribe thousands of plaque images by hand.

Ian is proposing to run a competition (probably every month) to see if we can continually improve an open source transcription algorithm that runs on the OpenPlaques data.

By way of background - there are approximately 1200 transcribed blue plaques, with a further 500 or so still to transcribe. Rather than transcribe these by hand, we'll use algorithms to transcribe the remaining blue plaques and then move on to the other plaques (including green plaques - another 500-1000).

The first 1200 provide great validation data; transcribing the remaining plaques will greatly help the OpenPlaques team and will let us learn new techniques for solving an interesting vision problem. These skills can then be used in other domains - such as reading posters, street signs and the everyday text that we encounter in our environment. In total there may be 10,000 plaques to transcribe if we choose to tackle the hardest problems.


First approach

These are Ian's notes (currently very incomplete!) on how we could automatically transcribe a plaque image into text.

This work-in-progress blog entry shows that the open source Tesseract 2.04 OCR system is highly accurate at transcribing a plaque image if we clean the image beforehand. It does a rather bad job if we provide a noisy coloured image for recognition.

Tesseract 3 is being tried; alternative systems are listed at the end.

Error measure

I think that the Levenshtein distance metric will be a good first error measure. It calculates how many edits (deletions, insertions and substitutions) are required to convert one string into another. We can use it to compare a recognised string with a manually transcribed target.

For Python there is a pure Python implementation (levenshtein.py) and a faster C module (pylevenshtein).
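
For example, with pylevenshtein installed (it provides a module named Levenshtein) the comparison is a one-liner; the strings here are invented:

import Levenshtein # pylevenshtein provides the 'Levenshtein' module

recognised = "erected in memory ot samuel lake"
target = "erected in memory of samuel lake"
# one substitution ('t' -> 'f') is needed, so the distance is 1
print Levenshtein.distance(recognised, target) # prints: 1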

Image cleaning

In this blog post I outlined a manual approach to cleaning a plaque JPG so that tesseract can cleanly extract the text into a file.

Circle detection

The HoughCircles function in OpenCV reliably detects the white circular outline of a plaque; I envisage using this to extract the plaque so that the rest of the image can be discarded.
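
A minimal sketch of how this could look with OpenCV's Python bindings (the parameter values are guesses that will need tuning per image):

import cv2
import cv2.cv as cv # OpenCV 2.x; later versions use cv2.HOUGH_GRADIENT

img = cv2.imread('plaque.tif')
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
grey = cv2.GaussianBlur(grey, (9, 9), 2) # smooth away noise and false edges
circles = cv2.HoughCircles(grey, cv.CV_HOUGH_GRADIENT, dp=2,
                           minDist=grey.shape[0] / 2,
                           param1=100, param2=100,
                           minRadius=50, maxRadius=0)
if circles is not None:
    x, y, radius = [int(v) for v in circles[0][0]] # strongest circle found
    # crop a bounding box around the circle and discard the rest
    plaque = img[y - radius:y + radius, x - radius:x + radius]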

Thresholding

Thresholding lets us convert a colour image into a black and white (or greyscale) image. Jonathan Street has a nice method to convert the blue and white plaques into black and white text; it also converts all other background colours to black.
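
This isn't Jonathan's routine; it's a minimal sketch of the same idea using PIL, with a cutoff value that would need tuning:

from PIL import Image

def threshold_plaque(im, cutoff=100):
    """White text is bright in all channels; the blue background is
    strong in blue only. Keep pixels that are bright in red and green
    as white text and turn everything else black."""
    im = im.convert('RGB')
    result = Image.new('L', im.size) # 8-bit image, black by default
    source = im.load()
    out = result.load()
    for y in range(im.size[1]):
        for x in range(im.size[0]):
            r, g, b = source[x, y]
            if r > cutoff and g > cutoff:
                out[x, y] = 255
    return result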

Converting to black and white

Jonathan's routine (above) converts the image to black and white. It should be noted that a greyscale conversion might help OCR packages to recognise the text in areas where blobs should be joined (e.g. thin connections between parts of a letter that otherwise look like discrete objects rather than one character).
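
With PIL the greyscale conversion is a one-liner (the filenames are placeholders):

from PIL import Image

grey = Image.open('plaque.tif').convert('L') # 8-bit greyscale
grey.save('plaque_grey.tif', 'TIFF') # thin joins survive as mid-grey pixels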

Dictionary building

Geo-tags in Wikipedia

Jimmy O'Regan on the tesseract-ocr list notes that geolocating Wikipedia entries from a plaque's geo-tag would allow a dictionary to be built for tesseract (and other OCR tools).

The plaque for Samuel Lake on flickr has a geo-tag at (50.3518, -3.578); a search at wikilocation using:

http://api.wikilocation.org/articles?lat=50.3518&lng=-3.578&limit=10&radius=500

reveals a set of nearby wikipedia entries:

{"articles":[
{"id":"8419","lat":"50.351","lng":"-3.579","title":"Dartmouth, Devon",
"url":"http://en.wikipedia.org/w/index.php?curid=8419","distance":"114m"},
{"id":"16640701","lat":"50.3496","lng":"-3.57519","title":"Dartmouth Passenger Ferry",
"url":"http://en.wikipedia.org/w/index.php?curid=16640701","distance":"315m"},
{"id":"16640722","lat":"50.3486","lng":"-3.57519","title":"Dartmouth Lower Ferry",
"url":"http://en.wikipedia.org/w/index.php?curid=16640722","distance":"408m"},
{"id":"6182271","lat":"50.3481","lng":"-3.57722","title":"Bayard's Cove Fort",
"url":"http://en.wikipedia.org/w/index.php?curid=6182271","distance":"415m"},
{"id":"14496836","lat":"50.3489","lng":"-3.57273","title":"Dartmouth railway station",
"url":"http://en.wikipedia.org/w/index.php?curid=14496836","distance":"494m"}
]}

The only relevant keyword from this set is 'Dartmouth' (sadly Samuel Lake doesn't appear in Wikipedia), but that's one more keyword that could be added to a dictionary. It is possible that these entries, or the entries linked from these pages, will contain useful named entities and dates.
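
A sketch of how those title words could be gathered into a candidate dictionary, using the wikilocation query above (the json module needs Python 2.6+; the helper name is my own):

import json
import urllib

def nearby_keywords(lat, lng, limit=10, radius=500):
    """Return words from the titles of nearby Wikipedia articles."""
    url = ('http://api.wikilocation.org/articles'
           '?lat=%s&lng=%s&limit=%d&radius=%d') % (lat, lng, limit, radius)
    articles = json.load(urllib.urlopen(url))['articles']
    words = set()
    for article in articles:
        for word in article['title'].split():
            words.add(word.strip(',')) # drop trailing punctuation
    return words

print nearby_keywords(50.3518, -3.578)
# e.g. set(['Dartmouth', 'Devon', 'Passenger', 'Ferry', ...])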

Python has a nice flickr library; it looks like it can extract geo-tags.

Adding to tesseract's dictionary

These two links have notes on updating tesseract's dictionary.

Jimmy (of tesseract) notes that concatenating several images together might help the overall accuracy as the classifier is adaptive - what it learns from one image could help it recognise more in the next (as long as they're all in the same big image).
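
A sketch of that concatenation with PIL (the function and filenames are my own invention):

from PIL import Image

def stack_images(filenames, out_filename):
    """Paste several plaque images into one tall image so that the
    adaptive classifier sees them all in a single tesseract run."""
    images = [Image.open(f) for f in filenames]
    width = max(im.size[0] for im in images)
    height = sum(im.size[1] for im in images)
    combined = Image.new('RGB', (width, height), 'white')
    y = 0
    for im in images:
        combined.paste(im, (0, y))
        y += im.size[1]
    combined.save(out_filename, 'TIFF')

stack_images(['plaque1.tif', 'plaque2.tif'], 'combined.tif')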

Whitelist OCR characters

We can limit the characters used by tesseract by adding a 'goodchars' file. I've added:

tessedit_char_whitelist 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.,()-"

to 'goodchars' in my local directory. See Jonathan's write-up for notes on calling tesseract with the goodchars file.
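
For illustration, the call then looks something like this (I believe the config file name is passed after the output base name, but check Jonathan's write-up for the exact form):

cmdline> tesseract plaque.tif plaque nobatch goodchars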

Text clean-up

Removing illegal characters

Dictionary clean-up

A common error with OCR is the misrecognition of similar characters, e.g. '1' for 'i' or 'l' (one, india, lima) and '0' for 'o' (zero, oscar). A simple process to fix these errors would be to cycle through a set of substitutions on each word and check a dictionary to see if the new word makes sense.

An alternative approach would be to query a dictionary to see what it suggests as the most likely legal word for the word we pass in.

Both approaches will lead to a set of suggestions; we might need to run a parser that generates a parse tree to discover which words produce legal sentences.
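
A sketch of the substitution approach (the substitution table and dictionary are tiny invented examples):

SUBSTITUTIONS = {'1': ['i', 'l'], '0': ['o'], '5': ['s']}

def fix_word(word, dictionary):
    """Return the word unchanged if it is legal, otherwise try
    single-character substitutions until one yields a legal word."""
    if word in dictionary:
        return word
    for i, ch in enumerate(word):
        for replacement in SUBSTITUTIONS.get(ch, []):
            candidate = word[:i] + replacement + word[i + 1:]
            if candidate in dictionary:
                return candidate
    return word # no fix found; a parser might arbitrate later

dictionary = set(['lived', 'in', 'this', 'house'])
print fix_word('l1ved', dictionary) # prints: lived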

Download latest version from github

You can get the current code from github (https://github.com/ianozsvald/plaquereader). You can either download a zip of the source or use:

git clone git@github.com:ianozsvald/plaquereader

which will create a 'plaquereader' directory and download the source.

The github code is newer than the original demo code shown below.

Demo using older (original) code

This demo system assumes that tesseract is pre-installed along with the Python Imaging Library.

Download and prepare the plaques

The first program downloads 30 plaques and converts the images into TIFF files for tesseract. You'll need to download easy_blue_plaques.csv and put it into the same directory as the Python source.

get_plaques.py

import os
import sys
import csv
import urllib
from PIL import Image # http://www.pythonware.com/products/pil/

# get_plaques.py - downloads plaque images and converts to TIFF files
# cmdline> python get_plaques.py easy_blue_plaques.csv
# it will download images and convert them to TIFF files for tesseract

# For more details see:
# http://aicookbook.com/wiki/Automatic_plaque_transcription

def get_plaques(plaques):
    """download plaque images if we don't already have them"""
    for root_url, filename, text in plaques:
        filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
        filename_tif = filename_base + '.tif'
        if not os.path.exists(filename_tif):
            print "Downloading", filename
            urllib.urlretrieve(root_url+filename, filename)
            im = Image.open(filename)
            im.save(filename_tif, 'TIFF')
            if filename.rfind('.tif') == -1:
                os.remove(filename) # delete the original file

def load_csv(filename):
    """build plaques structure from CSV file"""
    plaques = []
    plqs = csv.reader(open(filename, 'rb'))
    for row in plqs:
        image_url = row[1]
        text = row[2]
        # ignore id (0) and plaque url (3) for now
        last_slash = image_url.rfind('/')
        filename = image_url[last_slash+1:]
        root_url = image_url[:last_slash+1]
        plaque = [root_url, filename, text]
        plaques.append(plaque)
    return plaques

if __name__ == '__main__':
    argc = len(sys.argv)
    if argc != 2:
        print "Usage: python get_plaques.py plaques.csv (e.g. \
easy_blue_plaques.csv)"
    else:
        plaques = load_csv(sys.argv[1])
        get_plaques(plaques)

Transcribe the plaques

Next we transcribe the images (poorly!) and use the Levenshtein error metric to see how many changes we need to make to the transcribed text to make it equal to the human-supplied transcription. A result of 0 for each plaque is the best case. Each plaque's error is written out to results.csv.

plaque_transcribe_demo.py

import os
import sys
import csv
import urllib
from PIL import Image # http://www.pythonware.com/products/pil/

# This recognition system depends on:
# http://code.google.com/p/tesseract-ocr/
# version 2.04, it must be installed and compiled already

# plaque_transcribe_demo.py
# run it with 'cmdline> python plaque_transcribe_demo.py easy_blue_plaques.csv'
# and it'll:
# 1) send images to tesseract
# 2) read in the transcribed text file
# 3) convert the text to lowercase
# 4) use a Levenshtein error metric to compare the recognised text with the
# human supplied transcription (in the plaques list below)
# 5) write error to file

# For more details see:
# http://aicookbook.com/wiki/Automatic_plaque_transcription

def load_csv(filename):
    """build plaques structure from CSV file"""
    plaques = []
    plqs = csv.reader(open(filename, 'rb'))
    for row in plqs:
        image_url = row[1]
        text = row[2]
        # ignore id (0) and plaque url (3) for now
        last_slash = image_url.rfind('/')
        filename = image_url[last_slash+1:]
        filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
        filename = filename_base + '.tif'        
        root_url = image_url[:last_slash+1]
        plaque = [root_url, filename, text]
        plaques.append(plaque)
    return plaques

def levenshtein(a,b):
    """Calculates the Levenshtein distance between a and b
       Taken from: http://hetland.org/coding/python/levenshtein.py"""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
        
    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
            
    return current[n]

def transcribe_simple(filename):
    """Convert image to TIF, send to tesseract, read the file back, clean and
    return"""
    # read in original image, save as .tif for tesseract
    im = Image.open(filename)
    filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
    filename_tif = filename_base + '.tif'
    im.save(filename_tif, 'TIFF')

    # call tesseract, read the resulting .txt file back in
    cmd = 'tesseract %s %s -l eng' % (filename_tif, filename_base)
    print "Executing:", cmd
    os.system(cmd)
    input_filename = filename_base + '.txt'
    input_file = open(input_filename)
    lines = input_file.readlines()
    line = " ".join([x.strip() for x in lines])
    input_file.close()
    # delete the output from tesseract
    os.remove(input_filename)

    # convert line to lowercase
    transcription = line.lower()

    return transcription
    

if __name__ == '__main__':
    argc = len(sys.argv)
    if argc != 2:
        print "Usage: python plaque_transcribe_demo.py plaques.csv (e.g. \
easy_blue_plaques.csv)"
    else:
        plaques = load_csv(sys.argv[1])

        results = open('results.csv', 'w')

        for root_url, filename, text in plaques:
            print "----"
            print "Working on:", filename
            transcription = transcribe_simple(filename)
            print "Transcription:", transcription
            error = levenshtein(text, transcription)
            assert isinstance(error, int)
            print "Error metric:", error
            results.write('%s,%d\n' % (filename, error))
            results.flush()
        results.close()

Summarise results

Finally we run summarise_results.py which loads results.csv and shows the average error across the plaques we've transcribed. The average error using this demo system with easy_blue_plaques.csv is 709.3 - this is an awful result that you'll easily beat!

summarise_results.py

import csv
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Usage: python summarise_results.py results.csv"
        sys.exit(0)

    filename = sys.argv[1]
    errors = []
    errors_file = csv.reader(open(filename, 'rb'))
    for (plaque_file, error) in errors_file:
        errors.append(int(error))

    chart_data = ",".join([str(x) for x in errors]) # note: not used yet

    average_error = float(sum(errors)) / len(errors)

    print "Average error", average_error

Reference material

Software

WebApps

tesseract specific

Training

Books
