Automatic plaque transcription
From The Artificial Intelligence Cookbook - having some fun with A.I.
The problem at hand is to automatically create a text transcription from images of English Heritage plaques (and later other types of plaque) so that the admins of OpenPlaques.org don't have to transcribe thousands of plaque images by hand.
Ian is proposing to run a competition (probably every month) to see if we can continue to improve an open source transcription algorithm that runs on the OpenPlaques data.
By way of background - there are approximately 1200 transcribed blue plaques with a further 500 or so to transcribe. Rather than use a human we'll use algorithms to transcribe the remaining blue plaques and then move on to the remaining plaques (including green plaques - another 500-1000).
The first 1200 provide great validation data. Transcribing the remaining plaques will greatly help the OpenPlaques team and will let us learn new techniques to solve an interesting vision problem. These skills can then be used in other domains - such as reading posters, street signs and the everyday text that we encounter in our environment. In total there may be 10,000 plaques to transcribe if we choose to tackle the hardest problems.
First approach
These are Ian's notes (currently very incomplete!) on how we could automatically transcribe a plaque image into text.
This work in progress blog entry shows that the open source Tesseract 2.04 OCR system is highly accurate at transcribing a plaque image if we clean the image beforehand. It does a rather bad job if we provide a noisy coloured image for recognition.
Tesseract 3 is being tried; alternative systems are listed at the end.
Error measure
I think that the Levenshtein distance metric will be a good first error measure. It calculates how many edits (removals, insertions and substitutions) are required to convert one string into another. We can use this to compare a recognised string with a manually transcribed target.
For Python there is a pure Python implementation (levenshtein.py) and a faster C module (pylevenshtein).
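As a rough illustration of the error measure, the sketch below computes the edit distance between a recognised string and a target string. It is a minimal Python 3 version written for this page, not one of the two libraries mentioned above, and the two example strings are made up.

```python
def levenshtein(a, b):
    """Return the number of single-character edits that turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (before the current character) and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into '' needs i removals
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


target = "samuel lake lived here"      # made-up human transcription
recognised = "samue1 lake 1ived here"  # made-up OCR output
print(levenshtein(recognised, target))  # prints 2 (two substitutions)
```

A distance of 0 means the recognised text matches the target exactly, so lower is better.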
Image cleaning
In this blog post I outlined a manual approach to cleaning a plaque JPG such that tesseract can clearly extract the text into a file.
Circle detection
The HoughCircles function in OpenCV reliably detects the white circle outline for a plaque. I envisage using this to extract the plaque so the rest of the image can be discarded.
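HoughCircles reports each circle as a centre (x, y) and a radius. The sketch below shows one possible next step, under the assumption that a single plaque circle has already been found: turning that circle into a crop box that stays inside the image. The function name and the margin parameter are hypothetical, and the detection call itself is deliberately left out.

```python
import math


def circle_crop_box(x, y, radius, image_width, image_height, margin=0):
    """Return (left, top, right, bottom) enclosing the circle.

    The box is clamped to the image so it can be passed to a crop call,
    e.g. PIL's Image.crop, which takes a (left, upper, right, lower) box.
    """
    left = max(0, math.floor(x - radius - margin))
    top = max(0, math.floor(y - radius - margin))
    right = min(image_width, math.ceil(x + radius + margin))
    bottom = min(image_height, math.ceil(y + radius + margin))
    return left, top, right, bottom


# A made-up circle in the middle of a 640x480 image
print(circle_crop_box(320, 240, 200, 640, 480))  # prints (120, 40, 520, 440)
# A made-up circle that spills over the top-left corner
print(circle_crop_box(50, 60, 80, 640, 480))     # prints (0, 0, 130, 140)
```

A small margin keeps the white outline in the crop, which may or may not help recognition; that would need testing.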
Thresholding
Converting to black and white
Dictionary building
Geo tags in Wikipedia
Jimmy O'Regan on the tesseract-ocr list notes that looking up the Wikipedia entries near a plaque's geo-tag would allow a dictionary to be built for tesseract (and other OCR tools).
The plaque for Samuel Lake on Flickr has a geo-tag at (50.3518, -3.578). A search at wikilocation using:
http://api.wikilocation.org/articles?lat=50.3518&lng=-3.578&limit=10&radius=500
reveals a set of nearby Wikipedia entries:
{"articles":[
{"id":"8419","lat":"50.351","lng":"-3.579","title":"Dartmouth, Devon",
"url":"http://en.wikipedia.org/w/index.php?curid=8419","distance":"114m"},
{"id":"16640701","lat":"50.3496","lng":"-3.57519","title":"Dartmouth Passenger Ferry",
"url":"http://en.wikipedia.org/w/index.php?curid=16640701","distance":"315m"},
{"id":"16640722","lat":"50.3486","lng":"-3.57519","title":"Dartmouth Lower Ferry",
"url":"http://en.wikipedia.org/w/index.php?curid=16640722","distance":"408m"},
{"id":"6182271","lat":"50.3481","lng":"-3.57722","title":"Bayard's Cove Fort",
"url":"http://en.wikipedia.org/w/index.php?curid=6182271","distance":"415m"},
{"id":"14496836","lat":"50.3489","lng":"-3.57273","title":"Dartmouth railway station",
"url":"http://en.wikipedia.org/w/index.php?curid=14496836","distance":"494m"}
]}
The only relevant keyword from this set is 'Dartmouth' (sadly Samuel Lake doesn't appear in Wikipedia), but that's one more keyword that could be added to a dictionary. It is possible that these entries or the entries linked from these pages will contain useful named entities and dates.
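The wikilocation lookup above could be sketched in Python 3 as follows. The code builds the same query URL from a geo-tag and pulls candidate dictionary words out of the article titles in a response. The function names are hypothetical, the sample response is a shortened copy of the one shown above, and no network request is made.

```python
import json
import re
from urllib.parse import urlencode

# Endpoint taken from the example query above
WIKILOCATION_URL = "http://api.wikilocation.org/articles"


def build_query(lat, lng, limit=10, radius=500):
    """Build the wikilocation query URL for a geo-tag."""
    params = [("lat", lat), ("lng", lng), ("limit", limit), ("radius", radius)]
    return WIKILOCATION_URL + "?" + urlencode(params)


def keywords_from_response(response_text):
    """Collect the distinct words used in the returned article titles."""
    articles = json.loads(response_text)["articles"]
    words = set()
    for article in articles:
        # keep runs of letters (and apostrophes), drop punctuation
        words.update(re.findall(r"[A-Za-z']+", article["title"]))
    return sorted(words)


sample = """{"articles":[
{"id":"8419","lat":"50.351","lng":"-3.579","title":"Dartmouth, Devon",
 "url":"http://en.wikipedia.org/w/index.php?curid=8419","distance":"114m"},
{"id":"6182271","lat":"50.3481","lng":"-3.57722","title":"Bayard's Cove Fort",
 "url":"http://en.wikipedia.org/w/index.php?curid=6182271","distance":"415m"}
]}"""

print(build_query(50.3518, -3.578))
print(keywords_from_response(sample))
```

Most of the words found this way will be irrelevant to a given plaque, so they are better treated as extra dictionary candidates than as expected text.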
Python has a nice Flickr library; it looks like it can extract geo tags.
Adding to tesseract's dictionary
These two links have notes on updating tesseract's dictionary.
Jimmy (of tesseract) notes that concatenating several images together might help the overall accuracy as the classifier is adaptive - what it learns from one image could help it recognise more in the next (as long as they're all in the same big image).
Whitelist OCR characters
We can limit the characters used by tesseract by adding a 'goodchars' file. I've added:
tessedit_char_whitelist 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.,()-"
to 'goodchars' in my local directory. See Jonathan's write-up for notes on calling tesseract with the goodchars file.
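The same character set can also be applied to text that has already been recognised. The Python 3 sketch below rebuilds the whitelist line and drops every character outside it. This post-filter is an illustration only, not something tesseract does itself, and the function names and example string are made up.

```python
import string

# Same characters as the tessedit_char_whitelist line above
WHITELIST = string.digits + string.ascii_lowercase + string.ascii_uppercase + '.,()-"'


def config_line(whitelist=WHITELIST):
    """Return the line to put in the 'goodchars' file."""
    return "tessedit_char_whitelist " + whitelist


def strip_illegal(text, whitelist=WHITELIST):
    """Drop characters that are neither whitelisted nor whitespace."""
    allowed = set(whitelist)
    return "".join(c for c in text if c in allowed or c.isspace())


print(config_line())
print(strip_illegal("Samuel Lake~ |1775-1842|"))  # prints Samuel Lake 1775-1842
```

Filtering after recognition only hides stray characters. Restricting the characters inside tesseract is likely to be the better option because it changes what the classifier can choose.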
Text clean-up
Removing illegal characters
Dictionary clean-up
A common error with OCR is the misrecognition of similar characters, e.g. '1' for 'i' or 'l' (one, india, lima) and '0' for 'o' (zero, oscar). A simple process to fix these errors would be to cycle through a set of substitutions on each word and check a dictionary to see if the new word makes sense.
An alternative approach would be to query a dictionary to see what it suggests as the most likely legal word for the word we pass in.
Both approaches will lead to a set of suggestions. We might need to run a parser which generates a parse tree to discover which words generate legal sentences.
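The substitution idea can be sketched in Python 3 as below. Every confusable character in a word is swapped for each of its look-alikes and only the spellings found in a dictionary are kept. The confusion table covers just the pairs mentioned above and the word list is a hypothetical stand-in for a real dictionary.

```python
from itertools import product

# Characters that OCR commonly confuses, taken from the examples above
CONFUSABLE = {
    "1": "1il",
    "i": "i1l",
    "l": "l1i",
    "0": "0o",
    "o": "o0",
}


def candidates(word):
    """Yield every spelling obtained by swapping confusable characters."""
    options = [CONFUSABLE.get(char, char) for char in word]
    for combo in product(*options):
        yield "".join(combo)


def corrections(word, dictionary):
    """Return the candidate spellings of word that appear in dictionary."""
    return sorted({c for c in candidates(word) if c in dictionary})


dictionary = {"lived", "london", "born", "lion"}  # hypothetical word list
print(corrections("1ived", dictionary))   # prints ['lived']
print(corrections("l0nd0n", dictionary))  # prints ['london']
```

The number of candidates grows quickly with the number of confusable characters in a word, so a real system would probably want to cap the word length or prune the search.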
Download latest version from GitHub
You can get the current code from GitHub. You can either download a zip of the source or use:
git clone git@github.com:ianozsvald/plaquereader
which will create a 'plaquereader' directory and download the source.
The GitHub code is newer than the original demo code shown below.
Demo using older (original) code
This demo system assumes that tesseract is pre-installed along with the Python Imaging Library.
Download and prepare the plaques
The first program downloads 30 plaques and converts the images into TIFF files for tesseract. You'll need to download easy_blue_plaques.csv and put it into the same directory as the Python source.
get_plaques.py
import os
import sys
import csv
import urllib
from PIL import Image # http://www.pythonware.com/products/pil/

# get_plaques.py - downloads plaque images and converts to TIFF files
# cmdline> python get_plaques.py easy_blue_plaques.csv
# it will download images and convert them to TIFF files for tesseract
# For more details see:
# http://aicookbook.com/wiki/Automatic_plaque_transcription

def get_plaques(plaques):
    """download plaque images if we don't already have them"""
    for root_url, filename, text in plaques:
        filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
        filename_tif = filename_base + '.tif'
        if not os.path.exists(filename_tif):
            print "Downloading", filename
            urllib.urlretrieve(root_url+filename, filename)
            im = Image.open(filename)
            im.save(filename_tif, 'TIFF')
            if filename.rfind('.tif') == -1:
                os.remove(filename) # delete the original file

def load_csv(filename):
    """build plaques structure from CSV file"""
    plaques = []
    plqs = csv.reader(open(filename, 'rb'))#, delimiter=',')
    for row in plqs:
        image_url = row[1]
        text = row[2]
        # ignore id (0) and plaque url (3) for now
        last_slash = image_url.rfind('/')
        filename = image_url[last_slash+1:]
        root_url = image_url[:last_slash+1]
        plaque = [root_url, filename, text]
        plaques.append(plaque)
    return plaques

if __name__ == '__main__':
    argc = len(sys.argv)
    if argc != 2:
        print "Usage: python get_plaques.py plaques.csv (e.g. \
easy_blue_plaques.csv)"
    else:
        plaques = load_csv(sys.argv[1])
        get_plaques(plaques)
Transcribe the plaques
Next we transcribe the images (poorly!) and use the Levenshtein error metric to see how many changes we need to make to the transcribed text to make it equal to the human-supplied transcription. A result of 0 for each plaque is the best case. The results are written out to results.csv.
plaque_transcribe_demo.py
import os
import sys
import csv
import urllib
from PIL import Image # http://www.pythonware.com/products/pil/

# This recognition system depends on:
# http://code.google.com/p/tesseract-ocr/
# version 2.04, it must be installed and compiled already

# plaque_transcribe_demo.py
# run it with 'cmdline> python plaque_transcribe_demo.py easy_blue_plaques.csv'
# and it'll:
# 1) send images to tesseract
# 2) read in the transcribed text file
# 3) convert the text to lowercase
# 4) use a Levenshtein error metric to compare the recognised text with the
#    human supplied transcription (in the plaques list below)
# 5) write error to file

# For more details see:
# http://aicookbook.com/wiki/Automatic_plaque_transcription

def load_csv(filename):
    """build plaques structure from CSV file"""
    plaques = []
    plqs = csv.reader(open(filename, 'rb'))#, delimiter=',')
    for row in plqs:
        image_url = row[1]
        text = row[2]
        # ignore id (0) and plaque url (3) for now
        last_slash = image_url.rfind('/')
        filename = image_url[last_slash+1:]
        filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
        filename = filename_base + '.tif'
        root_url = image_url[:last_slash+1]
        plaque = [root_url, filename, text]
        plaques.append(plaque)
    return plaques

def levenshtein(a,b):
    """Calculates the Levenshtein distance between a and b
    Taken from: http://hetland.org/coding/python/levenshtein.py"""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

def transcribe_simple(filename):
    """Convert image to TIF, send to tesseract, read the file back, clean and
    return"""
    # read in original image, save as .tif for tesseract
    im = Image.open(filename)
    filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
    filename_tif = filename_base + '.tif'
    im.save(filename_tif, 'TIFF')
    # call tesseract, read the resulting .txt file back in
    cmd = 'tesseract %s %s -l eng' % (filename_tif, filename_base)
    print "Executing:", cmd
    os.system(cmd)
    input_filename = filename_base + '.txt'
    input_file = open(input_filename)
    lines = input_file.readlines()
    line = " ".join([x.strip() for x in lines])
    input_file.close()
    # delete the output from tesseract
    os.remove(input_filename)
    # convert line to lowercase
    transcription = line.lower()
    return transcription

if __name__ == '__main__':
    argc = len(sys.argv)
    if argc != 2:
        print "Usage: python plaque_transcribe_demo.py plaques.csv (e.g. \
easy_blue_plaques.csv)"
    else:
        plaques = load_csv(sys.argv[1])
        results = open('results.csv', 'w')
        for root_url, filename, text in plaques:
            print "----"
            print "Working on:", filename
            transcription = transcribe_simple(filename)
            print "Transcription:", transcription
            error = levenshtein(text, transcription)
            assert isinstance(error, int)
            print "Error metric:", error
            results.write('%s,%d\n' % (filename, error))
            results.flush()
        results.close()
Summarise results
Finally we run summarise_results.py which loads results.csv and shows the average error across the plaques we've transcribed. The average error using this demo system with easy_blue_plaques.csv is 709.3 - this is an awful result that you'll easily beat!
summarise_results.py
import csv
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Usage: python summarise_results.py results.csv"
        sys.exit(0)
    filename = sys.argv[1]
    errors = []
    errors_file = csv.reader(open(filename, 'rb'))
    for (plaque_file, error) in errors_file:
        errors.append(int(error))
    chart_data = ",".join([str(x) for x in errors])
    average_error = float(sum(errors)) / len(errors)
    print "Average error", average_error
Reference material
Software
- Tesseract 2.04 as pre-compiled downloads
- Tesseract 3 via svn
- GOCR
- List of OCR software
WebApps
- weOCR tesseract web service
- free-ocr probably uses tesseract 3
- ocr.aicookbook.com uses tesseract 2
tesseract specific
- tesseract's FAQ
- statistical methods for building support files

