Codementor Events

Captcha Breaker

Published Oct 17, 2020

Introduction
A captcha is a computer-generated text image used to distinguish human interactions from machine ones. Normally, a captcha image consists of a fixed number of characters (e.g. digits, letters). These characters are not only distorted and scaled to multiple different sizes but can also be overlapped and crossed by multiple random lines. The two types of captcha illustrated in Fig. 1 are specified by the number of character categories (0..9, A..Z) and the text length (e.g. 5 in the green and 6 in the black images).
fig_1.PNG

In this blog, I present and compare two deep learning models solving this captcha recognition challenge. Each model consists of convolutional layers to learn visual features. These features are then fed to fully connected layers to compute the final captcha prediction. The same architectures apply to all kinds of captcha; only the number of character categories C and the label length L vary with the corresponding captcha properties. The details of these models are described in section 2.

Two captcha datasets called GRN and BLK, consisting of 507 and 610 images respectively, have been collected to evaluate the performance of these models. The details of these datasets are given in section 3.

In general, a banking system allows users to type a captcha wrongly up to 5 times in a row. In this project, I aim to solve the two mentioned captchas so as to pass such a system in over 99.9% of cases. Assume every captcha has the same probability p of being recognized correctly. Then the probability that the model gives wrong answers 5 times in a row is (1-p)⁵. Accordingly, my goal is equivalent to (1-p)⁵ < 0.001, leading to p > 0.75. In other words, a good outcome for the project is defined as a global accuracy over 75%.
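The threshold above can be checked with a few lines of Python (a quick sketch; the attempt limit of 5 and the 0.1% failure budget are the values stated above):

```python
def min_accuracy(max_attempts=5, fail_budget=0.001):
    # The system is only failed when all max_attempts guesses are wrong,
    # which happens with probability (1 - p) ** max_attempts.
    # Solving (1 - p) ** max_attempts < fail_budget for p gives:
    return 1 - fail_budget ** (1 / max_attempts)

print(round(min_accuracy(), 3))  # -> 0.749
```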

Methodology
A typical solution for the general captcha recognition problem is to (1) find an image processing method to split a captcha image into several small images containing individual characters, then (2) employ a convolutional neural network to recognize each of these images, and finally (3) concatenate the results to produce a recognition for the whole captcha. While step (2), individual character recognition, has been popular and solved efficiently since the MNIST challenge, the solution for step (1) depends on how difficult it is to separate the characters inside one image. For example, it is challenging to find a general method to separate characters in either of the mentioned GRN and BLK datasets, which obviously affects the prediction in step (3). Therefore, to solve these kinds of captcha, it is necessary to find a methodology that directly recognizes the whole captcha. In other words, the three steps above are combined into one single step.

The general idea of an end-to-end approach is to take as input one captcha image of size W×H and produce as output a matrix M of size C×L, where C is the number of label categories and L is the label length. The value M_i,j represents a confidence score, which is then fed to a softmax layer to produce the probability that the character at position j belongs to category i. In the following section, I present two deep learning architectures, namely Arch-A and Arch-B, that produce such a matrix.
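As a shape-level sketch in NumPy (the values C = 36 for 0..9 plus A..Z and L = 5 are hypothetical examples), the output matrix and its per-position softmax look like:

```python
import numpy as np

C, L = 36, 5                           # assumed: 36 categories, 5 label positions
rng = np.random.default_rng(0)
scores = rng.normal(size=(C, L))       # the confidence matrix M

# Softmax over the category axis: column j becomes a probability
# distribution over the C categories for the character at position j.
probs = np.exp(scores) / np.exp(scores).sum(axis=0)
prediction = probs.argmax(axis=0)      # one predicted category index per position
```

Decoding the whole captcha then amounts to taking the argmax of each column and mapping the indices back to characters.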

Architecture Design
fig_2.PNG

In general, Arch-A and Arch-B, described in Fig. 2, consist of two main modules: a convolutional (conv) module and a fully connected (fc) module. Their conv modules share the same design, with four conv blocks responsible for learning visual features. Each of the first three conv blocks contains two conv layers with kernel size 3x3, followed by a max-pooling layer to filter high-response features. The last conv block contains only one conv layer with kernel size 1x1 to transform the layer depth to C, the number of character categories. Additionally, batch normalization and ReLU functions are placed after each conv layer.
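A sketch of this conv module in PyTorch; the channel widths (32/64/128) and the input image size are assumptions for illustration, while the 3x3/1x1 kernel pattern, pooling, batch norm and the final depth C follow the description above:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 conv layers, each followed by batch norm and ReLU,
    # then max pooling to keep only high-response features.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.MaxPool2d(2),
    )

C = 10  # e.g. a digit-only captcha; channel widths below are assumed
conv_module = nn.Sequential(
    conv_block(1, 32),
    conv_block(32, 64),
    conv_block(64, 128),
    # Last block: a single 1x1 conv transforming the depth to C.
    nn.Conv2d(128, C, kernel_size=1),
    nn.BatchNorm2d(C), nn.ReLU(),
)

x = torch.randn(1, 1, 64, 160)   # a grayscale captcha, assumed H=64, W=160
features = conv_module(x)        # three 2x poolings: (1, C, 64/8, 160/8)
```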

The fully connected modules of Arch-A and Arch-B are different. Similar to common recognition models such as VGG, Arch-A (1) flattens the output of the conv module, then (2) feeds it to a multi-layer perceptron, consisting of 2 layers in this case, to produce an output matching the label length L. Meanwhile, the fully connected module in Arch-B is decomposed into two steps: a vertical fc and a horizontal fc. This separation forces Arch-B to maintain the spatial information of the convolutional output while learning its fully connected parameters. Note that in the fully connected modules of both Arch-A and Arch-B, a dropout of 50% is employed to create an ensembling effect and avoid overfitting.
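The two fc modules can be sketched as follows (PyTorch; the hidden size 512 and the feature-map size H=8, W=20 are assumptions, and the exact vertical/horizontal decomposition in Arch-B is my reading of the description, not the author's published code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, L = 10, 5
B, H, W = 1, 8, 20                  # conv output shape (B, C, H, W); sizes assumed
features = torch.randn(B, C, H, W)

# Arch-A: flatten the conv output, then a 2-layer MLP with 50% dropout.
arch_a_fc = nn.Sequential(
    nn.Flatten(),
    nn.Linear(C * H * W, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, C * L),
)
out_a = arch_a_fc(features).view(B, C, L)

# Arch-B: a vertical fc collapses the height axis, then a horizontal fc maps
# the width axis to the L label positions; the category axis C stays intact,
# so the spatial layout of the conv output is preserved.
vertical_fc = nn.Linear(H, 1)
horizontal_fc = nn.Linear(W, L)
cols = vertical_fc(features.transpose(2, 3)).squeeze(-1)      # (B, C, W)
out_b = horizontal_fc(F.dropout(cols, p=0.5, training=True))  # (B, C, L)
```

Both paths end at the same (B, C, L) shape, matching the confidence matrix M described earlier.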

Implementation
For these classification problems, I adopt the cross entropy loss function for both models. Let N be the number of training images and L the label length. Denote t̂_{n,l,c} as the probability score of the character at position l of an image x_n for its ground truth label c; then the loss function is defined as follows:
fig_3.PNG
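With the output laid out as an (N, C, L) tensor, this is exactly PyTorch's multi-dimensional cross entropy, averaged over all N·L character positions (a sketch with random scores and labels; batch sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

N, C, L = 4, 10, 5
scores = torch.randn(N, C, L)          # model output M for a batch
labels = torch.randint(0, C, (N, L))   # ground-truth category per position

# Averages -log t_hat[n, l, labels[n, l]] over all n and l, applying
# the softmax over the category axis internally.
loss = F.cross_entropy(scores, labels)
```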

Experimental Setup
Dataset
fig_4.PNG

The statistics of the two captcha datasets GRN and BLK are described in Table 1. They are shuffled and split with a proportion of 70% : 15% : 15% into training, testing and validation sets.
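The shuffle-and-split step can be sketched as follows (the file names and the random seed are hypothetical; the 70/15/15 proportions are from the text):

```python
import random

def split_dataset(paths, seed=42):
    # Shuffle deterministically, then split 70% / 15% / 15%.
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.7 * n), int(0.15 * n)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# e.g. the 507 GRN images -> 354 / 76 / 77
train, val, test = split_dataset(f"grn_{i}.png" for i in range(507))
```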

Performance Evaluation
I employ two measures, local accuracy and global accuracy, for evaluation. The local accuracy compares each predicted character with the ground truth character at the same position in the captcha string, counts the number of correct pairs and divides it by the total number of characters. The global accuracy is based on comparing the whole predicted captcha string with the ground truth one.
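The two measures can be written in a few lines of Python (the example strings are made up for illustration):

```python
def local_accuracy(preds, truths):
    # Fraction of individual characters predicted correctly, position-wise.
    pairs = [(p, t) for pred, truth in zip(preds, truths)
             for p, t in zip(pred, truth)]
    return sum(p == t for p, t in pairs) / len(pairs)

def global_accuracy(preds, truths):
    # Fraction of captchas whose whole predicted string matches exactly.
    return sum(p == t for p, t in zip(preds, truths)) / len(preds)

preds = ["3F8K2", "A01Z9"]
truths = ["3F8K2", "A61Z9"]   # one wrong character in the second captcha
print(local_accuracy(preds, truths), global_accuracy(preds, truths))  # -> 0.9 0.5
```

A single wrong character thus costs only 1/(N·L) of local accuracy but a full 1/N of global accuracy, which is why the two measures can diverge sharply.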

Experimental Results
fig_5.PNG

Table 2 contrasts the performance of Arch-A and Arch-B on the two datasets GRN and BLK. In terms of local accuracy, there is no significant difference between Arch-A and Arch-B. In other words, with over 90% accuracy, both are good at individual character recognition, which reflects the efficiency of the conv module in learning visual features. Meanwhile, in terms of global accuracy, Arch-B performs much better, with a gap of 10% compared with Arch-A. This demonstrates the efficiency of separating the vertical fc and horizontal fc for recognizing a captcha, which is a sequence of characters with a given fixed length.

fig_6.PNG

fig_7.PNG

Fig. 4 describes the per-character accuracy of Arch-B on the two datasets. Most characters are predicted almost perfectly, at about 100%. However, some characters, such as 4 and 6 in GRN and E and T in BLK, are predicted less accurately, at about 90%. In GRN, where characters are randomly rendered in bold and italic shapes, digit 4 is misclassified as 1 (Fig. 4a) and digit 6 as 0 (Fig. 4b). Meanwhile, the BLK dataset contains several lines crossing the characters. Letter T is misclassified as D, H, P or Y (Fig. 4c). Letter E, with one horizontal segment in the middle, is treated as noise and the model answers with a nearby character (Fig. 4d).

fig_8.PNG

Fig. 6 shows the accuracy distribution of Arch-B over the label positions in the captcha sequence. While there are no significant differences in GRN, on the more challenging BLK dataset the characters on the left and right sides perform better than the middle ones. The intuition is that the middle characters have more neighbors and suffer more noise from the crossing lines, leaving them less boundary space, in other words, fewer clues for recognition than the characters on the left and right sides.

Conclusion
In this blog, I have presented a captcha recognition project. The project works on two datasets, GRN and BLK, which contain noisy captcha images. Two deep learning models are proposed, sharing the same convolutional module but differing in their fully connected modules. Arch-B, which preserves spatial information in the fully connected module, achieves better performance. In terms of global accuracy, it reaches about 90% on GRN and 80% on BLK, both better than the prior expectation of 75%. However, because of challenging noise, some characters are still misclassified. In a captcha sequence, the middle characters tend to be harder to recognize than the characters on the left and right sides. Because a captcha is a sequence of given characters (digits, letters) represented in different ways inside a noisy background, transfer learning is a highly promising approach to further improve recognition performance.

Discover and read more posts from Alan Nguyen