OCR Dhiraagu E-Directory Captcha with ImageMagick and tesseract-ocr

Dhiraagu E-Directory requires you to enter the Captcha text from the randomly generated image before it searches the directory.

This is a simple control to ensure that you are a human being and not a nasty little program that automatically queries the directory.

I am aware that a few others have come up with small hacks to bypass this or to search through the directory by other means, so this is not what this post is about.

I simply wanted to check how well this captcha does its job. The point of the challenge is that only a human should be able to read and enter the text.

However, using two very simple tools, it is possible to automate reading the text with no human involved.

This can be done in two simple steps: 1) apply a simple threshold to get rid of the noise, and 2) use an OCR engine to read the text.

Thresholding compares each pixel's value (e.g. grey level) against a specified cutoff and pushes it to pure black or pure white, which wipes out the pixels on the unwanted side of the cutoff. You can do this using any image processing application such as Adobe Photoshop or GIMP. For simplicity I used ImageMagick, which has a nice and powerful set of command-line tools that can be used on all platforms (Linux, Windows, and Mac). Before thresholding, I converted the image to greyscale, because the original images are generated in a blue color.

convert input.jpg -colorspace Gray -threshold 30% processed.jpg

The above command converts the original image (input.jpg) into the pre-processed image (processed.jpg).
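
To make the thresholding rule concrete, here is a minimal sketch in plain shell and awk that applies it to a handful of grey values instead of a real image. It assumes 8-bit values (0 to 255) and the usual convention that values above the cutoff become white and everything else becomes black. The function name threshold_values is made up for this illustration and is not part of ImageMagick.

```shell
#!/bin/sh
# threshold_values PERCENT VALUE...
# Illustration only (hypothetical helper, not an ImageMagick command).
# Prints 255 (white) for every 8-bit grey value above the cutoff
# and 0 (black) for every other value, space-separated on one line.
threshold_values() {
    percent=$1
    shift
    printf '%s\n' "$@" | awk -v p="$percent" '
        BEGIN { cutoff = 255 * p / 100 }   # 30% of 255 is 76.5
        { printf "%s%d", (NR > 1 ? " " : ""), ($1 > cutoff ? 255 : 0) }
        END { print "" }
    '
}
```

For example, threshold_values 30 10 76 77 200 prints 0 0 255 255: the two values at or below the 76.5 cutoff go black and the two above it go white.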

After thresholding, I had a clean image with very little noise. This pre-processed image was then passed to an OCR engine to convert the text in the image into machine-readable text. For this I chose tesseract-ocr because it's simple, easy to use, and again available on the command line on multiple platforms. The default language is English.

tesseract processed.jpg output

The above command will OCR the input image (processed.jpg) and write the result to the file output.txt; tesseract adds the .txt extension to the output name itself.
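
For anyone wondering how to do this from a script, the two steps above can be wrapped in one small shell function. This is a minimal sketch rather than a polished tool: the function name read_captcha and the temporary-directory layout are my own choices for illustration, and it assumes convert (ImageMagick) and tesseract are on the PATH. On ImageMagick 7 the command may be called magick instead.

```shell
#!/bin/sh
# read_captcha IMAGE
# Hypothetical wrapper around the two steps: threshold IMAGE with
# ImageMagick, OCR the result with tesseract, print the recognized text.
# Returns non-zero if the input is missing or either tool fails.
read_captcha() {
    input=$1
    if [ ! -f "$input" ]; then
        echo "read_captcha: no such file: $input" >&2
        return 1
    fi
    workdir=$(mktemp -d) || return 1
    # Step 1: greyscale + 30% threshold (input file first, then operations).
    # Step 2: OCR; tesseract writes its text to "$workdir/output.txt".
    convert "$input" -colorspace Gray -threshold 30% "$workdir/processed.jpg" &&
        tesseract "$workdir/processed.jpg" "$workdir/output" >/dev/null 2>&1 &&
        cat "$workdir/output.txt"
    status=$?
    rm -rf "$workdir"
    return $status
}
```

Calling read_captcha captcha1.jpg (a made-up file name) prints whatever tesseract recognized. A loop such as for f in captchas/*.jpg; do read_captcha "$f"; done would run it over a folder of saved images, where captchas/ is again just an example path.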

For my small experiment, I generated a few captchas from the Dhiraagu E-Directory and saved the images.

The captcha image consists of very clearly written text with several random lines added as noise. A simple threshold was therefore sufficient, and the uniform text made it very easy for the OCR engine to recognize.

The following are the results. They aren’t 100% accurate, but with a few adjustments it would be very easy to improve the accuracy. With some work put into it, I bet someone could make a Firefox plugin so that whenever we visit the Dhiraagu E-Directory, it enters the text for us 🙂

Dhiraagu E-Directory Captcha OCR
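
As one example of the kind of adjustment that might help accuracy, tesseract can be told that the image holds a single line of text and that only certain characters are allowed. The sketch below is untested against the real captcha: the function name ocr_single_line is made up, and the default character set (upper-case letters and digits) is purely an assumption, since I have not checked which characters the captcha actually uses. The --psm flag is spelled -psm on older tesseract 3.x releases, and some 4.0 builds ignore the whitelist setting.

```shell
#!/bin/sh
# ocr_single_line IMAGE [CHARSET]
# Hypothetical helper: OCR IMAGE as one line of text, restricted to CHARSET.
# ASSUMPTION: the default CHARSET below is a guess, not the captcha's
# confirmed alphabet. Pass your own as the second argument.
ocr_single_line() {
    image=$1
    charset=${2:-ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789}
    if [ ! -f "$image" ]; then
        echo "ocr_single_line: no such file: $image" >&2
        return 1
    fi
    # "stdout" as the output name makes tesseract print the text directly.
    # --psm 7 treats the image as a single text line.
    tesseract "$image" stdout --psm 7 \
        -c tessedit_char_whitelist="$charset" 2>/dev/null
}
```

Running ocr_single_line processed.jpg on the thresholded image prints the recognized text straight to the terminal, which also makes it easy to compare different threshold values side by side.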

6 thoughts on “OCR Dhiraagu E-Directory Captcha with ImageMagick and tesseract-ocr”

  1. good job!

  2. any how to implement this by code… just curious..
