Baidu CAPTCHA Recognition

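I recently wanted to build an automatic login system but ran into the CAPTCHA problem, so I decided to try cracking Baidu's CAPTCHA. Here are the steps: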

Tools: Tesseract 3.2 and the Python Imaging Library (PIL); I will add installation instructions later. Also download jTessBoxEditor: http://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

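Tesseract requires its input files to be TIFF (version 3.2 can also recognize GIF and JPG, but TIFF is still the better choice for training), so I first preprocess the image to be recognized with Python: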

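from PIL import Image  # with the legacy PIL package this was: import Image

def ImageBinaryzation(threshold, im):
    # Convert to grayscale, then map each pixel to black (0) or white (1)
    im = im.convert('L')
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    out = im.point(table, '1')
    # out.show()
    return out

infile = '14.gif'
pathList = infile.split('.')
print(pathList[0])
print(pathList[1])
outfile = pathList[0] + '.tif'
im = Image.open(infile)
# Save the binarized result, not the original image
out = ImageBinaryzation(160, im)
out.save(outfile)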

This code converts the GIF image to TIFF and binarizes it. That is enough preprocessing for now.
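To make the thresholding step concrete, here is a minimal pure-Python sketch of the lookup table that the code above passes to `im.point`. The helper names `build_threshold_table` and `binarize_pixels` are hypothetical, and the sample pixel values are made up for illustration; the real script works on whole images through PIL rather than on plain lists.

```python
def build_threshold_table(threshold):
    """Return a 256-entry table: 0 (black) below the threshold, 1 (white) otherwise."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize_pixels(pixels, threshold):
    """Map each 8-bit grayscale value through the threshold table."""
    table = build_threshold_table(threshold)
    return [table[p] for p in pixels]


# Made-up grayscale samples: darker values stand for text, lighter for background.
sample = [12, 159, 160, 200, 255]
print(binarize_pixels(sample, 160))  # prints [0, 0, 1, 1, 1]
```

A threshold of 160 is the value used in the script above; a different CAPTCHA style would likely need a different cutoff, which is easiest to find by trying a few values and viewing the result.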

OK, now that we have the processed TIFF image, we run Tesseract on it directly:

tesseract 14.tif out

Then open out.txt to check the result. Almost nothing was recognized correctly, which was discouraging.

So I looked online for ways to train Tesseract's language data:

1 First open jTessBoxEditor and choose Tools -> Merge TIFF. Merge all the TIFF files processed above and save the result with a name in the format [lang].[fontname].exp[num].tif

2 Run Tesseract to generate the box file (the character coordinate file)

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

3 Then in jTessBoxEditor, choose Box Editor -> Open and open the merged TIFF file. The coordinate information is already loaded.

4 Adjust the coordinates and save

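5 Create a font_properties file in the same directory with the following content (font name followed by the italic, bold, fixed, serif, and fraktur flags)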

baidu 0 0 0 0 0

6 Create the following script and save it as train.bat

rem Create the font_properties file in this directory before running this batch file

echo Run Tesseract for Training..
tesseract.exe eng.baidu.exp0.tif eng.baidu.exp0 nobatch box.train

echo Compute the Character Set..
unicharset_extractor.exe eng.baidu.exp0.box
mftraining -F font_properties -U unicharset -O baidu.unicharset eng.baidu.exp0.tr

echo Clustering..
cntraining.exe eng.baidu.exp0.tr

echo Rename Files..
rename normproto baidu.normproto
rename inttemp baidu.inttemp
rename pffmtable baidu.pffmtable
rename shapetable baidu.shapetable 

echo Create Tessdata..
combine_tessdata.exe baidu.

It prints output like the following:

Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1660
Offset for type 4 is 327545
Offset for type 5 is 327781
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1

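Make sure that the values on the 2nd, 4th, 5th, and 6th "Offset" lines (types 1, 3, 4, and 5) are not -1. If they are not, the new trained data file has been generated successfully.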
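The check described above can be automated. The following is a minimal sketch that parses text in the format of the sample output and reports which of the required offsets are missing. The function names `parse_offsets` and `missing_components` are hypothetical, and the tuple of required types simply encodes the 2nd, 4th, 5th, and 6th "Offset" lines (types 1, 3, 4, and 5) from the sample; it assumes the output has been captured as a string.

```python
import re

# Offset types that must not be -1 (the 2nd, 4th, 5th, and 6th Offset lines).
REQUIRED_TYPES = (1, 3, 4, 5)


def parse_offsets(output):
    """Return {type_number: offset} parsed from combine_tessdata output text."""
    pattern = re.compile(r"Offset for type\s+(\d+) is (-?\d+)")
    return {int(t): int(off) for t, off in pattern.findall(output)}


def missing_components(output, required=REQUIRED_TYPES):
    """Return the required type numbers whose offset is -1 or absent."""
    offsets = parse_offsets(output)
    return [t for t in required if offsets.get(t, -1) == -1]


sample_output = """\
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1660
Offset for type 4 is 327545
Offset for type 5 is 327781
"""
print(missing_components(sample_output))  # prints []
```

An empty list means every required component was packed into the traineddata file; any number in the list points at a training step whose output file was missing or not renamed.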

Now copy the baidu.traineddata file from this directory into the tessdata directory under the Tesseract installation directory.

It is then ready to use:

tesseract.exe test.jpg result -l baidu

The results are much better than before.
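Since the goal is an automatic login system, the recognition command will probably be called from Python. Below is a minimal sketch of such a wrapper, assuming the `tesseract` executable is on the PATH and the baidu.traineddata file has been installed as described. The function names `build_tesseract_command` and `recognize` are hypothetical; Tesseract writes its result to the output base name plus `.txt`.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="baidu"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def recognize(image_path, output_base="result", lang="baidu"):
    """Run tesseract and return the recognized text from <output_base>.txt."""
    # Assumes tesseract is installed and the language data is in tessdata.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read().strip()


# Show the command that recognize() would run for the example above.
print(build_tesseract_command("test.jpg", "result"))
```

Calling `recognize("test.jpg")` would then return the CAPTCHA text as a string; passing `check=True` makes a failed Tesseract run raise an exception instead of silently reading a stale result file.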