Python 3 Crawler: Finding Which GitHub Usernames Have Not Been Registered

License: CC 4.0 BY-SA

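This article describes a crawler, written in Python 3, that takes a file of more than six thousand words and checks one by one whether each word is already registered as a GitHub username. Because the script does not use multithreading, it pauses for a few seconds after every request to avoid being refused by GitHub. The usernames were never actually used in the end, but the exercise serves as a worked example for learning web crawling.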

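I wanted to switch to an ID that was short, meaningful, and used by hardly anyone, but every one I came up with was already registered. So I found a file of more than six thousand words on Baidu Wenku and used a crawler to try them on GitHub one by one.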

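When I wrote this I did not yet know how to use multithreading, so the single thread pauses for a few seconds after each request; otherwise GitHub quickly starts refusing access. Fortunately it does not ban the IP.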
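
For readers curious about the multithreaded variant, here is a minimal sketch of how the checks could be spread over a few worker threads while still spacing requests out. It is not part of the original script: `Pacer`, `check_all`, and the `is_taken` callback are hypothetical names, and the lookup is replaced by a stand-in so the example runs without touching the network.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pacer:
    """Lets callers through at most once every `interval` seconds, across threads."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def check_all(words, is_taken, interval=3.0, workers=4):
    """Return the words for which is_taken(word) is False, keeping input order."""
    pacer = Pacer(interval)

    def task(word):
        pacer.wait()
        return word, is_taken(word)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(task, words))
    return [word for word, taken in results if not taken]


if __name__ == "__main__":
    # Stand-in for a real lookup (assumption): pretend only these names are registered.
    registered = {"abandon", "abide"}
    free = check_all(["abacus", "abandon", "abase", "abide"],
                     lambda w: w in registered, interval=0.01)
    print(free)
```

Because all threads share one pacer, the request rate is still capped by `interval`; the threads mainly let slow responses overlap instead of blocking the next request.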

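Once the crawl was done, I decided that picking a name this way was not much fun after all. Consider it a crawling exercise.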

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author: LostInNight
# @Date:   2015-10-27 13:26:45
# @Last Modified by:   LostInNight
# @Last Modified time: 2015-10-28 08:33:26
# Check on GitHub whether the given usernames already exist

import requests
import sys
import os
import time

# Make the script's directory the working directory so file reads and writes are simple
# os.chdir(sys.path[0])
os.chdir(r'F:\PythonWorkspace\Github-Rename')

def trans_time(sec):
    # Convert a duration in seconds into an "h / min / s" string
    hour = int(sec / 3600)
    sec = sec % 3600
    minute = int(sec / 60)
    sec = sec % 60
    return "%s h %s min %.2f s" % (hour, minute, sec)

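def get_html(url):
    # Retry in a loop (rather than by recursion) until the request succeeds
    while True:
        try:
            # Pause before every request so GitHub does not refuse access
            time.sleep(3)
            print('Requesting URL... ', url)
            html = requests.get(url, headers=headers, timeout=10).text
        except Exception as e:
            print('An exception occurred; sleeping 10 seconds before retrying')
            print(e)
            time.sleep(10)
            continue
        print('Page fetched successfully!')
        return html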


start = time.time()
url = r'https://github.com/search?utf8=%E2%9C%93&q={0}&type=Users&ref=searchresults'

headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, sdch',
'Accept-Language':'zh-CN,zh;q=0.8',
'Connection':'keep-alive',
'Host':'github.com',
'Referer':'https://github.com',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36'
}
count = 0
found = 0

# Few words are unregistered, so the output file is opened only when one is found
with open('words.txt', 'r', encoding='utf-8') as infile:
    while True:
        print('-' * 40)
        print('Reading word file...')
        line = infile.readline()
        if not line:
            break
        # The word is the first space-separated field of the line
        word = line.split(' ', 1)[0]
        print('Read word:', word)
        used_time = trans_time(time.time() - start)
        count += 1

        print('Checking %s; %s words checked, %s results found, script running for %s' % (word, count, found, used_time))

        html = get_html(url.format(word))
        # If nobody uses the username, the page shows 'We couldn’t find any users matching xxxxx'
        if 'We couldn’t find any users matching' in html:
            found += 1
            print('Writing to file...')
            with open('uniquea.txt', 'a', encoding='utf-8') as output:
                output.write(line)
            print('Write succeeded!')

used_time = trans_time(time.time() - start)
print('Crawl finished!\nElapsed: %s\nChecked %s words in total, of which %s are not registered!' % (used_time, count, found))
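
The `get_html` function above keeps retrying indefinitely with a fixed 10-second pause. As a hedged alternative, the sketch below bounds the number of attempts and doubles the wait after each failure. `fetch_with_retry`, its parameters, and the `flaky` stand-in fetcher are hypothetical names introduced for illustration, not part of the original script.

```python
import time


def fetch_with_retry(fetch, url, retries=5, base_delay=10.0, sleep=time.sleep):
    """Call fetch(url); on failure wait and retry, doubling the wait each time.

    Re-raises the last exception once `retries` attempts have failed.
    """
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == retries:
                raise
            print('attempt %d failed (%s), retrying in %.0f s' % (attempt, exc, delay))
            sleep(delay)
            delay *= 2


if __name__ == "__main__":
    calls = []

    def flaky(url):
        # Hypothetical fetcher that fails twice before succeeding.
        calls.append(url)
        if len(calls) < 3:
            raise IOError('temporary failure')
        return '<html>ok</html>'

    waits = []
    # Record the waits instead of actually sleeping.
    print(fetch_with_retry(flaky, 'https://github.com', sleep=waits.append))
    print(waits)
```

With the script's own settings, the fetcher could be passed as `lambda u: requests.get(u, headers=headers, timeout=10).text`, so that a persistent block from GitHub ends the run instead of looping forever.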

words.txt is a free document I found on Baidu Wenku. Its format is as follows:

abacus   n．算盘
abandon   v．n．放弃，放纵
abase   v．贬抑，使卑下
abate   v．减轻，降低
abbreviation   n．缩短，缩写
abdicate   v．让位，辞职，放弃
abdomen   n．腹，下腹（胸部到腿部的部份）
abduct   v．绑架，拐走
aberrant   adj．越轨的，异常的
abet   v．教唆，协助（罪犯）
abeyance   n．中止，暂搁
abhor   v．憎恨，嫌恶
abhorrent   adj．可恨的，可厌的
abide   v．容忍，忍受
............

The results file uses the same format.
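
Each line of words.txt starts with the English word, followed by whitespace and the Chinese gloss, and the script takes the first space-separated field as the candidate username. The sketch below shows a slightly more defensive way to do that extraction, assuming the separator may be any whitespace (spaces, tabs, or full-width spaces) and that the file may contain blank lines. `extract_word` is a hypothetical helper, not part of the original script.

```python
def extract_word(line):
    """Return the leading English word of a words.txt line, or None for a blank line."""
    parts = line.split(None, 1)  # split on any run of whitespace
    return parts[0] if parts else None


if __name__ == "__main__":
    sample = [
        'abacus   n．算盘\n',
        'abandon   v．n．放弃，放纵\n',
        '\n',
    ]
    for line in sample:
        print(extract_word(line))
```

Returning `None` for blank lines lets the caller skip them instead of sending an empty search query.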