Crack captcha with Machine Learning

Oct 24, 2019

Are you a bot?
That’s the question so many websites ask. In this world of bots and web crawlers, it is a good idea to put a filter on your site to limit the service to human users.
But now that the field of Artificial Intelligence has grown exponentially, exceeding all expectations, does it really make sense to use CAPTCHA anymore? (Spoiler: it doesn’t.)

What is a CAPTCHA?

Unless you live under a rock, you probably already know what a CAPTCHA is. If not, hail Google, the god of information. Let me save you a bit of effort.

A CAPTCHA is a type of challenge–response test used in computing to determine whether or not the user is human. The term was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. The most common type of CAPTCHA was first invented in 1997 by two groups working in parallel. -Wikipedia

https://www.kaggle.com/fournierp/captcha-version-2-images

In simple words, a CAPTCHA is a picture of several alphanumeric characters in which the characters are distorted, noise is added to the picture, or both. It is considered to be uncrackable by computers, and it is used to filter out bots and web crawlers.

The Model

Image from https://github.com/DarvinX/captchaCracker

If you are familiar with Machine Learning and Neural Networks, you should be able to understand the picture. If you aren’t, maybe you should just skip this part (though I strongly encourage you to study machine learning, it’s really cool stuff).

So here is what’s basically happening: the model takes the input (the CAPTCHA) at the input_1 layer. Then the picture (or should I say the information) goes through some convolutional layers and max-pooling layers. After the flatten_1 layer there are five branches. Each branch outputs a single character of the CAPTCHA (i.e. the first branch gives the first character and the last branch gives the last character).
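To make the "one branch per character" idea concrete, here is a small sketch of how a five-character label could be turned into five one-hot targets, and how five branch outputs could be turned back into text. It is an illustration, not code from the repository: the 36-symbol alphabet (lowercase letters plus digits) and the names ALPHABET, N_CHARS, encode_label and decode_branches are my assumptions.

```python
import string

# Assumption: the CAPTCHAs use lowercase letters and digits (36 symbols)
# and always contain exactly five characters.
ALPHABET = string.ascii_lowercase + string.digits
N_CHARS = 5


def encode_label(text):
    """Turn a label such as "2b8nf" into one one-hot list per position."""
    if len(text) != N_CHARS:
        raise ValueError(f"expected {N_CHARS} characters, got {len(text)}")
    targets = []
    for ch in text:
        one_hot = [0.0] * len(ALPHABET)
        one_hot[ALPHABET.index(ch)] = 1.0  # raises ValueError on unknown symbols
        targets.append(one_hot)
    return targets


def decode_branches(branch_outputs):
    """Pick the most likely symbol from each branch and join them."""
    chars = []
    for probs in branch_outputs:
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        chars.append(ALPHABET[best])
    return "".join(chars)
```

Each inner list plays the role of one branch: during training it is the target for that branch, and at prediction time the branch's softmax output is decoded the same way with an argmax.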

This model is written using Keras, and the source code is available on GitHub.
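If you just want a feel for the shape of such a network, here is a minimal sketch of a shared convolutional trunk followed by five softmax branches, using the Keras functional API. This is not the code from the repository: the input size (50x200 grayscale), the number of filters, the size of the dense layers, the 36-symbol alphabet and the name build_model are all assumptions I picked for illustration.

```python
def build_model(height=50, width=200, n_chars=5, n_classes=36):
    """Sketch of a multi-output CAPTCHA model; all sizes are assumptions."""
    # Imported inside the function so this file loads without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(height, width, 1))

    # Shared trunk: convolution + max-pooling, then flatten.
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)

    # One branch per character position, each ending in a softmax
    # over the possible symbols.
    outputs = []
    for i in range(n_chars):
        branch = layers.Dense(64, activation="relu")(x)
        outputs.append(
            layers.Dense(n_classes, activation="softmax", name=f"char_{i}")(branch)
        )

    model = keras.Model(inputs=inputs, outputs=outputs)
    # One cross-entropy loss per branch; Keras sums them during training.
    model.compile(
        optimizer="adam",
        loss=["categorical_crossentropy"] * n_chars,
    )
    return model
```

Calling build_model().summary() prints the layer list, which should show the same overall pattern as the picture: one input, a few convolution and pooling layers, one flatten layer, then five parallel outputs.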

Results

Before I share the results, I want you to keep in mind that this is a task that is considered (even our government thinks so) to be undoable for computers.

I was able to get an accuracy of 28%. I know it doesn’t sound like a lot, but as I said, a bot is not supposed to be able to crack even a single one, and here it is with 28% accuracy. There were some limitations: the dataset I used had only about 1k CAPTCHAs (I used 920 for training and 150 for validation).
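To put a number like 28% in context, it helps to compare it with blind guessing, and to be clear about what is being counted. The sketch below shows two common ways to score a CAPTCHA model (per character and per whole CAPTCHA) plus the chance level for random guessing. It does not say which of the two the 28% refers to, and the 36-symbol alphabet and the function names are assumptions for illustration.

```python
def char_accuracy(preds, truths):
    """Fraction of individual characters predicted correctly."""
    total = sum(len(truth) for truth in truths)
    hits = sum(
        p == t
        for pred, truth in zip(preds, truths)
        for p, t in zip(pred, truth)
    )
    return hits / total


def captcha_accuracy(preds, truths):
    """Fraction of CAPTCHAs where every character is correct."""
    hits = sum(pred == truth for pred, truth in zip(preds, truths))
    return hits / len(truths)


def chance_level(n_classes=36, n_chars=5):
    """Probability of guessing a whole CAPTCHA uniformly at random."""
    return (1 / n_classes) ** n_chars
```

With 36 possible symbols and five positions, chance_level() is 1 in 36^5, or 1 in 60,466,176, so even a modest accuracy is far above what random guessing would achieve.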

I am sure someone can get better accuracy, but I am not interested in building the perfect CAPTCHA cracker; I just wanted a proof of concept.

Conclusion

I wrote this article because I was astonished that one of the most widely used security features can be broken that easily. My concern is this: if I, a typical engineering student, have access to this kind of technology, then what is stopping a person with malicious intentions from bombarding a government site with a large number of fake requests, making the site nearly or completely inaccessible?

I’m well aware that the people working for the government are smarter and more educated than me, and there might be a second layer of security against this kind of attack that I am not aware of, but it still concerns me. You can read more about CAPTCHAs on Wikipedia.

DarvinX/captchaCracker (github.com): https://github.com/DarvinX/captchaCracker

CAPTCHA Images (www.kaggle.com): https://www.kaggle.com/fournierp/captcha-version-2-images

CAPTCHA (en.wikipedia.org)