identYwaf Source Code Walkthrough

Disclaimer

I'm a beginner and wrote this post to finish a course assignment. If you spot mistakes, please point them out politely in the comments; if the article helps you, a like or a bookmark would be appreciated ^_^

Preface

What is a WAF?

For background on what a WAF is, see this article: WAF是什么?一篇文章带你全面了解WAF
In short, a WAF is a firewall that blocks the vast majority of malicious requests. Attacks such as XSS and SQL injection work by sending crafted, illegitimate requests, and a WAF tries to intercept exactly those.

What is identYwaf?

identYwaf is a WAF-identification tool written in Python; comparable tools include wafw00f, sqlmap and nmap.
identYwaf on GitHub: https://github.com/stamparm/identYwaf.git

identYwaf Source Code Analysis

identYwaf Workflow

[Figure: identYwaf flow chart (original, simplified version)]


Update 2024-11-22: the original flow chart above was too crude, so here is a new, more detailed one (the old one stays for the record).
[Figure: identYwaf flow chart (updated, more detailed version)]


identYwaf Directory Layout

│  
└─identYwaf-master
    │  .travis.yml
    │  data.json★
    │  identYwaf.py★★
    │  LICENSE
    │  README.md
    └─screenshots

identYwaf Core Flow Analysis

data.json holds the fingerprint records for every supported WAF, plus the list of test payloads:

{
    "__copyright__": "Copyright (c) 2019-2021 Miroslav Stampar (@stamparm), MIT. See the file 'LICENSE' for copying permission",
    "__notice__": "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software",

    "payloads": [
        "HTML::<img>",
        "SQLi::1 AND 1",
        ...
        "XSS::<img src=x onerror=alert('XSS')>",
        "XSS::<img onfoo=f()>",
        "XSS::<script>",
        "XSS::<script>alert('XSS')</script>",
        ...
        "LDAPi::admin*)((|userpassword=*)",
        "LDAPi::user=*)(uid=*))(|(uid=*",
        ...
        "NOSQLi::true, $where: '1 == 1'",
        ...
    ],
    "wafs": {
        "360": {
            "company": "360",
            "name": "360",
            "regex": "<title>493</title>|/wzws-waf-cgi/",
            "signatures": [
                "9778:RVZXum61OEhCWapBYKcPk4JzWOpohM4JiUcMr2RXg1uQJbX3uhdOnthtOj+hX7AB16FcPxJPdLsXo2tKaK99n+i7c4VmkwI3FZjxtDtAeq+c36A5chW1XaTC",
                "9ccc:RVZXum61OEhCWapBYKcPk4JzWOpohM4JiUcMr2RXg1uQJbX3uhdOnthtOj+hX7AB16FcPxJPdLsXo2tKaK99n+i7c4VmkwI3FZjxtDtAeq+c36A4chW1XaTC"
            ]
        },
        ...
        ....
    }
}
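
Just to get a feel for the file, here is a small standalone sketch (not part of identYwaf) that loads data.json and summarizes it; it only assumes the structure shown above:

import json
from collections import Counter

# assumes data.json from the identYwaf repository is in the current directory
with open("data.json", encoding="utf8") as f:
    data = json.load(f)

# every payload is stored as "<category>::<payload string>"
categories = Counter(item.split("::", 1)[0] for item in data["payloads"])

print("payload categories:", dict(categories))
print("number of fingerprinted WAFs:", len(data["wafs"]))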

identYwaf.py is the core of the tool.

The main function
def main():
    if "--version" not in sys.argv:
        print(BANNER)

    parse_args()
    init()
    run()

load_data()

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        exit(colorize("\r[x] Ctrl-C pressed"))

So the execution order is: load_data() first (it is called at module level), then main() runs parse_args(), init() and run() in that order.

load_data()

load_data() populates the following global variables:

SIGNATURES = {}
DATA_JSON = {}
WAF_RECOGNITION_REGEX

Here is the function:

def load_data():
    global WAF_RECOGNITION_REGEX

    if os.path.isfile(DATA_JSON_FILE):
        with codecs.open(DATA_JSON_FILE, "rb", encoding="utf8") as f:
            DATA_JSON.update(json.load(f))

        WAF_RECOGNITION_REGEX = ""
        for waf in DATA_JSON["wafs"]:
            if DATA_JSON["wafs"][waf]["regex"]:
                WAF_RECOGNITION_REGEX += "%s|" % ("(?P<waf_%s>%s)" % (waf, DATA_JSON["wafs"][waf]["regex"]))
            for signature in DATA_JSON["wafs"][waf]["signatures"]:
                SIGNATURES[signature] = waf
        WAF_RECOGNITION_REGEX = WAF_RECOGNITION_REGEX.strip('|')

        flags = "".join(set(_ for _ in "".join(re.findall(r"\(\?(\w+)\)", WAF_RECOGNITION_REGEX))))
        WAF_RECOGNITION_REGEX = "(?%s)%s" % (flags, re.sub(r"\(\?\w+\)", "", WAF_RECOGNITION_REGEX))  # patch for "DeprecationWarning: Flags not at the start of the expression" in Python3.7
    else:
        exit(colorize("[x] file '%s' is missing" % DATA_JSON_FILE))

SIGNATURES maps every signature hash recorded in data.json to its WAF (how this hash is computed is explained later; a calc_hash() function does it). DATA_JSON simply holds the entire parsed data.json. WAF_RECOGNITION_REGEX concatenates every WAF's regex into one big alternation, wrapping each one in a named group of the form waf_<name>. WAF_RECOGNITION_REGEX and SIGNATURES are what the recognition steps later rely on.
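
To illustrate how the combined regex with its named groups is used later (check_payload() and non_blind_check() search it against the raw response), here is a toy sketch with a simplified two-WAF regex and a made-up blocked response; the '/wzws-waf-cgi/' fragment is taken from the 360 entry shown above, the rest is purely illustrative:

import re

# toy stand-in for WAF_RECOGNITION_REGEX: one named group "waf_<id>" per WAF
WAF_RECOGNITION_REGEX = "(?P<waf_360>/wzws-waf-cgi/)|(?P<waf_other>some-other-marker)"

# hypothetical raw response of a blocked request
raw = "HTTP/1.1 493 Blocked\nServer: nginx\n\n<script src='/wzws-waf-cgi/challenge.js'></script>"

match = re.search(WAF_RECOGNITION_REGEX, raw, re.I)
if match:
    for group, value in match.groupdict().items():
        if value:
            print("matched WAF:", re.sub(r"\Awaf_", "", group))  # matched WAF: 360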

parse_args()
def parse_args():
    global options

    parser = optparse.OptionParser(version=VERSION)
    parser.add_option("--delay", dest="delay", type=int, help="Delay (sec) between tests (default: 0)")
    parser.add_option("--timeout", dest="timeout", type=int, help="Response timeout (sec) (default: 10)")
    parser.add_option("--proxy", dest="proxy", help="HTTP proxy address (e.g. \"http://127.0.0.1:8080\")")
    parser.add_option("--proxy-file", dest="proxy_file", help="Load (rotating) HTTP(s) proxy list from a file")
    parser.add_option("--random-agent", dest="random_agent", action="store_true", help="Use random HTTP User-Agent header value")
    parser.add_option("--code", dest="code", type=int, help="Expected HTTP code in rejected responses")
    parser.add_option("--string", dest="string", help="Expected string in rejected responses")
    parser.add_option("--post", dest="post", action="store_true", help="Use POST body for sending payloads")
    parser.add_option("--debug", dest="debug", action="store_true", help=optparse.SUPPRESS_HELP)
    parser.add_option("--fast", dest="fast", action="store_true", help=optparse.SUPPRESS_HELP)
    parser.add_option("--lock", dest="lock", action="store_true", help=optparse.SUPPRESS_HELP)

    # Dirty hack(s) for help message
    def _(self, *args):
        retval = parser.formatter._format_option_strings(*args)
        if len(retval) > MAX_HELP_OPTION_LENGTH:
            retval = ("%%.%ds.." % (MAX_HELP_OPTION_LENGTH - parser.formatter.indent_increment)) % retval
        return retval

    parser.usage = "python %s <host|url>" % parser.usage
    parser.formatter._format_option_strings = parser.formatter.format_option_strings
    parser.formatter.format_option_strings = type(parser.formatter.format_option_strings)(_, parser)

    for _ in ("-h", "--version"):
        option = parser.get_option(_)
        option.help = option.help.capitalize()

    try:
        #  获取所有options 以及 参数列表
        # 为了方便我们调试,我们把这个先写死
        # Input = ['baidu.com']
        options, _ = parser.parse_args()
        # {'delay': None, 'timeout': None, 'proxy': None, 'proxy_file': None, 'random_agent': None, 'code': None, 'string': None, 'post': None, 'debug': None, 'fast': None, 'lock': None}
        # ['baidu.com']
        print(options, _)
    except SystemExit:
        raise

    # if the target URL is not passed on the command line, it has to be appended to sys.argv manually
    # sys.argv += ['baidu.com']
    if len(sys.argv) > 1:
        url = sys.argv[-1]
        if not url.startswith("http"):
            url = "http://%s" % url
        options.url = url
    else:
        parser.print_help()
        raise SystemExit

    for key in DEFAULTS:
        if getattr(options, key, None) is None:
            setattr(options, key, DEFAULTS[key])
    # print(type(options), options)

This function reads the arguments we pass on the command line. One annoying detail: below is its -h output.
[Figure: identYwaf -h output]
Some options that exist in the code never show up in -h, namely --fast and --lock (they are registered with help=optparse.SUPPRESS_HELP), so I honestly don't know what they are for and won't analyze them here, haha.

parse_args() mainly does the following:

  1. Read the command-line options.
  2. Normalize the URL: if the input does not start with http, it prepends http:// (not https://).
  3. Store the parsed options in the global variable options.
init()
def init():
   os.chdir(os.path.abspath(os.path.dirname(__file__)))

   # Reference: http://blog.mathieu-leplatre.info/python-utf-8-print-fails-when-redirecting-stdout.html
   if not PY3 and not IS_TTY:
       sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

   print(colorize("[o] initializing handlers..."))

   # Reference: https://stackoverflow.com/a/28052583
   if hasattr(ssl, "_create_unverified_context"):
       ssl._create_default_https_context = ssl._create_unverified_context

   if options.proxy_file:
       if os.path.isfile(options.proxy_file):
           print(colorize("[o] loading proxy list..."))

           with codecs.open(options.proxy_file, "rb", encoding="utf8") as f:
               proxies.extend(re.sub(r"\s.*", "", _.strip()) for _ in f.read().strip().split('\n') if _.startswith("http"))
               random.shuffle(proxies)
       else:
           exit(colorize("[x] file '%s' does not exist" % options.proxy_file))


   cookie_jar = CookieJar()
   opener = build_opener(HTTPCookieProcessor(cookie_jar))
   install_opener(opener)

   if options.proxy:
       opener = build_opener(ProxyHandler({"http": options.proxy, "https": options.proxy}))
       install_opener(opener)

   if options.random_agent:
       revision = random.randint(20, 64)
       platform = random.sample(("X11; %s %s" % (random.sample(("Linux", "Ubuntu; Linux", "U; Linux", "U; OpenBSD", "U; FreeBSD"), 1)[0], random.sample(("amd64", "i586", "i686", "amd64"), 1)[0]), "Windows NT %s%s" % (random.sample(("5.0", "5.1", "5.2", "6.0", "6.1", "6.2", "6.3", "10.0"), 1)[0], random.sample(("", "; Win64", "; WOW64"), 1)[0]), "Macintosh; Intel Mac OS X 10.%s" % random.randint(1, 11)), 1)[0]
       user_agent = "Mozilla/5.0 (%s; rv:%d.0) Gecko/20100101 Firefox/%d.0" % (platform, revision, revision)
       HEADERS["User-Agent"] = user_agent

colorize() just adds color to terminal output; it's unimportant and I won't analyze it later either.
init() mainly does three things:

  1. If a proxy file is supplied, it is read and a (rotating) proxy pool is built; a single proxy given via --proxy is installed directly.
  2. cookie_jar = CookieJar() and the two lines below it install an opener with automatic cookie handling, so responses that Set-Cookie (e.g. to keep a session alive) are handled for free. However, sites that manage cookies via JavaScript are out of reach, and that is identYwaf's first big weakness!
  3. Finally the "random" User-Agent: with --random-agent a single User-Agent is generated right here in init() and then reused for every request. So the outgoing requests don't use the default UA, but they all share the same one; I had assumed each request would get a fresh UA. Since the 45 payload tests that follow all use this one UA, the chance of being flagged and then getting no response (or a non-HTML response) is rather high. Second weakness! (A hypothetical per-request variant is sketched below.)
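
As a hedged illustration of point 3, here is a hypothetical helper (not part of identYwaf) that regenerates a Firefox-style User-Agent on every call; calling it at the start of each request, instead of fixing HEADERS["User-Agent"] once in init(), would give every probe a fresh value:

import random

def random_firefox_user_agent():
    # simplified, per-request variant of the User-Agent logic used in init()
    revision = random.randint(20, 64)
    platform = random.choice((
        "X11; Linux x86_64",
        "Windows NT 10.0; Win64; x64",
        "Macintosh; Intel Mac OS X 10.%d" % random.randint(1, 11),
    ))
    return "Mozilla/5.0 (%s; rv:%d.0) Gecko/20100101 Firefox/%d.0" % (platform, revision, revision)

# e.g. at the top of retrieve(): HEADERS["User-Agent"] = random_firefox_user_agent()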

run() is the most important part

Within run() I'll focus on two functions, retrieve() and check_payload(), briefly introduce calc_hash(), and finally walk through the flow of run() itself.

retrieve()
def retrieve(url, data=None):
    global proxies_index

    retval = {}

    if proxies:
        while True:
            try:
                opener = build_opener(ProxyHandler({"http": proxies[proxies_index], "https": proxies[proxies_index]}))
                install_opener(opener)
                proxies_index = (proxies_index + 1) % len(proxies)
                urlopen(PROXY_TESTING_PAGE).read()
            except KeyboardInterrupt:
                raise
            except:
                pass
            else:
                break

    try:
        req = Request("".join(url[_].replace(' ', "%20") if _ > url.find('?') else url[_] for _ in xrange(len(url))), data, HEADERS)
        # if nothing can be fetched, execution jumps to the except branch below
        resp = urlopen(req, timeout=options.timeout)
        # on success, carry on here
        # retval stores: final URL, HTML body, HTTP status code, and a raw-response-style dump
        retval[URL] = resp.url
        retval[HTML] = resp.read()
        retval[HTTPCODE] = resp.code
        retval[RAW] = "%s %d %s\n%s\n%s" % (httplib.HTTPConnection._http_vsn_str, retval[HTTPCODE], resp.msg, str(resp.headers), retval[HTML])
    except Exception as ex:
        # the request timed out or failed outright
        # try to fill retval from the exception: the URL (taken from the exception if present), the HTTP status code (e.g. 200 for success, 404 for not found), and the HTML body (read() from the exception if possible, otherwise its msg; if all of that fails, store an empty string)
        retval[URL] = getattr(ex, "url", url)
        retval[HTTPCODE] = getattr(ex, "code", None)
        try:
            retval[HTML] = ex.read() if hasattr(ex, "read") else getattr(ex, "msg", str(ex))
        except:
            retval[HTML] = ""
        retval[RAW] = "%s %s %s\n%s\n%s" % (httplib.HTTPConnection._http_vsn_str, retval[HTTPCODE] or "", getattr(ex, "msg", ""), str(ex.headers) if hasattr(ex, "headers") else "", retval[HTML])

    for encoding in re.findall(r"charset=[\s\"']?([\w-]+)", retval[RAW])[::-1] + ["utf8"]:
        encoding = ENCODING_TRANSLATIONS.get(encoding, encoding)
        try:
            retval[HTML] = retval[HTML].decode(encoding, errors="replace")
            break
        except:
            pass

    match = re.search(r"<title>\s*(?P<result>[^<]+?)\s*</title>", retval[HTML], re.I)
    retval[TITLE] = match.group("result") if match and "result" in match.groupdict() else None
    retval[TEXT] = re.sub(r"(?si)<script.+?</script>|<!--.+?-->|<style.+?</style>|<[^>]+>|\s+", " ", retval[HTML])
    match = re.search(r"(?im)^Server: (.+)", retval[RAW])
    retval[SERVER] = match.group(1).strip() if match else ""
    return retval

retval stores the information extracted from the response:

  1. URL: the final requested URL, possibly after redirects.
  2. HTML: the HTML body returned by the server. In the exception path it may instead be the exception message or whatever could be read from the exception object. The code also tries to decode it using the charset declared in the response.
  3. HTTPCODE: the HTTP status code of the response (e.g. 200 for success, 404 for not found).
  4. RAW: a string combining the HTTP version, status code, response message, headers and HTML body, roughly in raw-HTTP-response format; useful for debugging, logging and signature matching.
  5. TITLE: the page title extracted from the HTML (the text inside <title>), or None if there is none.
  6. TEXT: the HTML with scripts, styles, comments and tags stripped out; useful for text analysis.
  7. SERVER: the value of the Server: response header, or an empty string if absent.

(URL, HTML, HTTPCODE, RAW, TITLE, TEXT and SERVER are module-level constants in identYwaf.py used as the dictionary keys, not string literals.)
The first part of retrieve() implements the (rotating) proxy support; it then performs the request, processes the response into retval, and also handles the case where the request fails.
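
To make the post-processing at the bottom of retrieve() concrete, here is a standalone sketch with a made-up response that runs the same TITLE / TEXT / SERVER extraction regexes:

import re

# made-up header block and body, only used to exercise retrieve()'s regexes
raw = "HTTP/1.1 403 Forbidden\nServer: cloudflare\nContent-Type: text/html; charset=utf-8\n\n"
html = "<html><title> Access denied </title><body>Request <b>blocked</b></body></html>"

title = re.search(r"<title>\s*(?P<result>[^<]+?)\s*</title>", html, re.I)
text = re.sub(r"(?si)<script.+?</script>|<!--.+?-->|<style.+?</style>|<[^>]+>|\s+", " ", html)
server = re.search(r"(?im)^Server: (.+)", raw)

print(title.group("result"))    # Access denied
print(text.strip())             # roughly "Access denied Request blocked"
print(server.group(1).strip())  # cloudflare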

There are a lot of pitfalls here. This whole function is built around processing HTML, and it does that well, but nowadays many sites, when visited without the right cookies, return a JavaScript challenge that has to be solved before the real HTML can be obtained. identYwaf performs one benign request early on, and when that fails run() very often just exit()s, which is hard not to complain about (then again, it's an old tool, so struggling with today's landscape is understandable).

check_payload()
def check_payload(payload, protection_regex=GENERIC_PROTECTION_REGEX % '|'.join(GENERIC_PROTECTION_KEYWORDS)):
    global chained
    global heuristic
    global intrusive
    global locked_code
    global locked_regex

    time.sleep(options.delay or 0)
    if options.post:
        _ = "%s=%s" % ("".join(random.sample(string.ascii_letters, 3)), quote(payload))
        intrusive = retrieve(options.url, _)
    else:
        _ = "%s%s%s=%s" % (options.url, '?' if '?' not in options.url else '&', "".join(random.sample(string.ascii_letters, 3)), quote(payload))
        intrusive = retrieve(_)

    if options.lock and not payload.isdigit():
        if payload == HEURISTIC_PAYLOAD:
            match = re.search(re.sub(r"Server:|Protected by", "".join(random.sample(string.ascii_letters, 6)), WAF_RECOGNITION_REGEX, flags=re.I), intrusive[RAW] or "")
            if match:
                result = True

                for _ in match.groupdict():
                    if match.group(_):
                        waf = re.sub(r"\Awaf_", "", _)
                        locked_regex = DATA_JSON["wafs"][waf]["regex"]
                        locked_code = intrusive[HTTPCODE]
                        break
            else:
                result = False

            if not result:
                exit(colorize("[x] can't lock results to a non-blind match"))
        else:
            result = re.search(locked_regex, intrusive[RAW]) is not None and locked_code == intrusive[HTTPCODE]
    elif options.string:
        result = options.string in (intrusive[RAW] or "")
    elif options.code:
        result = options.code == intrusive[HTTPCODE]
    else:
        result = intrusive[HTTPCODE] != original[HTTPCODE] or (intrusive[HTTPCODE] != 200 and intrusive[TITLE] != original[TITLE]) or (re.search(protection_regex, intrusive[HTML]) is not None and re.search(protection_regex, original[HTML]) is None) or (difflib.SequenceMatcher(a=original[HTML] or "", b=intrusive[HTML] or "").quick_ratio() < QUICK_RATIO_THRESHOLD)

    if not payload.isdigit():
        if result:
            if options.debug:
                print("\r---%s" % (40 * ' '))
                print(payload)
                print(intrusive[HTTPCODE], intrusive[RAW])
                print("---")

            if intrusive[SERVER]:
                servers.add(re.sub(r"\s*\(.+\)\Z", "", intrusive[SERVER]))
                if len(servers) > 1:
                    chained = True
                    single_print(colorize("[!] multiple (reactive) rejection HTTP 'Server' headers detected (%s)" % ', '.join("'%s'" % _ for _ in sorted(servers))))

            if intrusive[HTTPCODE]:
                codes.add(intrusive[HTTPCODE])
                if len(codes) > 1:
                    chained = True
                    single_print(colorize("[!] multiple (reactive) rejection HTTP codes detected (%s)" % ', '.join("%s" % _ for _ in sorted(codes))))

            if heuristic and heuristic[HTML] and intrusive[HTML] and difflib.SequenceMatcher(a=heuristic[HTML] or "", b=intrusive[HTML] or "").quick_ratio() < QUICK_RATIO_THRESHOLD:
                chained = True
                single_print(colorize("[!] multiple (reactive) rejection HTML responses detected"))

    if payload == HEURISTIC_PAYLOAD:
        heuristic = intrusive

    return result

This function is the workhorse for sending the malicious test requests and is extremely important!
intrusive holds the result of the malicious request, fetched via retrieve(), so it too is HTML-based.
The block below is what decides whether a WAF is present:

    if options.lock and not payload.isdigit():
        if payload == HEURISTIC_PAYLOAD:
            match = re.search(re.sub(r"Server:|Protected by", "".join(random.sample(string.ascii_letters, 6)), WAF_RECOGNITION_REGEX, flags=re.I), intrusive[RAW] or "")
            if match:
                result = True

                for _ in match.groupdict():
                    if match.group(_):
                        waf = re.sub(r"\Awaf_", "", _)
                        locked_regex = DATA_JSON["wafs"][waf]["regex"]
                        locked_code = intrusive[HTTPCODE]
                        break
            else:
                result = False

            if not result:
                exit(colorize("[x] can't lock results to a non-blind match"))
        else:
            result = re.search(locked_regex, intrusive[RAW]) is not None and locked_code == intrusive[HTTPCODE]
    elif options.string:
        result = options.string in (intrusive[RAW] or "")
    elif options.code:
        result = options.code == intrusive[HTTPCODE]
    else:
        result = intrusive[HTTPCODE] != original[HTTPCODE] or (intrusive[HTTPCODE] != 200 and intrusive[TITLE] != original[TITLE]) or (re.search(protection_regex, intrusive[HTML]) is not None and re.search(protection_regex, original[HTML]) is None) or (difflib.SequenceMatcher(a=original[HTML] or "", b=intrusive[HTML] or "").quick_ratio() < QUICK_RATIO_THRESHOLD)

The key part is the final else branch (in a default run the earlier branches are never entered, since --lock, --string and --code are all off by default):

result = intrusive[HTTPCODE] != original[HTTPCODE] or \
        (intrusive[HTTPCODE] != 200 and intrusive[TITLE] != original[TITLE]) or \
        (re.search(protection_regex, intrusive[HTML]) is not None and re.search(protection_regex, original[HTML]) is None) or \
        (difflib.SequenceMatcher(a=original[HTML] or "", b=intrusive[HTML] or "").quick_ratio() < QUICK_RATIO_THRESHOLD)

Let's look at this condition in four parts. original is the result of the benign request that run() performs right at the start.

Part 1:
If the status code of the malicious request differs from that of the benign request, the result is immediately True, i.e. the site is considered to be behind a WAF.
One pitfall: these days a plain benign request often gets exactly the same blocked response as a malicious one. Take Cloudflare: unless you reverse its cookie/JS challenge you never get a normal 200 HTML page in the first place, so this comparison loses its meaning.

Part 2:
If the malicious request's status code is not 200, it was most likely blocked by a WAF.
To make the verdict more reliable, identYwaf adds a safeguard: the TITLE of the malicious response must also differ from the TITLE of the benign response before it concludes the request was really blocked.

Part 3:

GENERIC_PROTECTION_KEYWORDS = ("rejected", "forbidden", "suspicious", 
                               "malicious", "captcha", "invalid", 
                               "your ip", "please contact", "terminated", 
                               "protected", "unauthorized", "blocked", 
                               "protection", "incident", "denied", "detected", 
                               "dangerous", "firewall", "fw_block", 
                               "unusual activity", "bad request", "request id", 
                               "injection", "permission", "not acceptable", 
                               "security policy", "security reasons")

In short, part 3 matches the response against the keyword regex built from the list above. If any keyword is found, re.search() returns a match object (not None); otherwise it returns None. The malicious response (intrusive[HTML]) is likely to contain wording like "denied" or "rejected", whereas the benign response (original[HTML], a plain visit to the hostname) usually is not. So this condition asks: does the site answer malicious requests with protection wording that it never uses for benign requests? If it does, the site has a WAF. (A standalone sketch of this keyword check follows.)
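
A standalone sketch of this keyword check, using a few of the keywords above and a simplified stand-in for the GENERIC_PROTECTION_REGEX template:

import re

keywords = ("rejected", "forbidden", "blocked", "firewall", "denied")
# simplified stand-in for identYwaf's GENERIC_PROTECTION_REGEX % '|'.join(...)
protection_regex = r"(?i)\b(%s)\b" % '|'.join(keywords)

original_html = "<html><body>Welcome to our shop</body></html>"            # benign response
intrusive_html = "<html><body>Request blocked by firewall</body></html>"   # rejected response

hit_intrusive = re.search(protection_regex, intrusive_html) is not None    # True
hit_original = re.search(protection_regex, original_html) is not None      # False

print(hit_intrusive and not hit_original)  # True -> part 3 of the condition fires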

Part 4:
This is similar in spirit to part 3: if the benign and malicious responses are very similar, there is probably no WAF in front of the site.

Meaning of the value: quick_ratio() returns a number close to 1 when the two sequences are very similar, and close to 0 when they are very different.

The default QUICK_RATIO_THRESHOLD is 0.2, so a ratio below 0.2 means the two responses are very dissimilar, and the site is most likely protected by a WAF.
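
A quick sketch of part 4 with difflib (the response bodies are made up; the point is just how quick_ratio() relates to the threshold):

import difflib

QUICK_RATIO_THRESHOLD = 0.2  # identYwaf's default, as noted above

original_html = "<html><body><h1>Product catalogue</h1><p>Lots of normal page content</p></body></html>"
blocked_html = "403 Forbidden"

ratio = difflib.SequenceMatcher(a=original_html, b=blocked_html).quick_ratio()
print(round(ratio, 3))                # close to 0: the two responses have little in common
print(ratio < QUICK_RATIO_THRESHOLD)  # a True here is what triggers part 4 of the condition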

Right after the decision comes a block of debug output and bookkeeping (servers, status codes); it's not important, so I'll skip it.

if payload == HEURISTIC_PAYLOAD:
        heuristic = intrusive

Finally, if the payload is the heuristic one, the global variable heuristic stores this intrusive result; run() needs heuristic later on.

At the end the function returns result, i.e. whether the site appears to be WAF-protected. The decision logic admittedly feels dated: nowadays, if a benign request doesn't even come back as HTML, you can basically conclude the site sits behind a WAF or anti-bot layer without all this back-and-forth comparing.

The run() function itself
def run():
    global original

    hostname = options.url.split("//")[-1].split('/')[0].split(':')[0]

    if not hostname.replace('.', "").isdigit():
        print(colorize("[i] checking hostname '%s'..." % hostname))
        try:
            socket.getaddrinfo(hostname, None)
        except socket.gaierror:
            exit(colorize("[x] host '%s' does not exist" % hostname))

    results = ""
    signature = b""
    counter = 0
    original = retrieve(options.url)

    if 300 <= (original[HTTPCODE] or 0) < 400 and original[URL]:
        original = retrieve(original[URL])

    options.url = original[URL]

    if original[HTTPCODE] is None:
        exit(colorize("[x] missing valid response"))

    if not any((options.string, options.code)) and original[HTTPCODE] >= 400:
        non_blind_check(original[RAW])
        if options.debug:
            print("\r---%s" % (40 * ' '))
            print(original[HTTPCODE], original[RAW])
            print("---")
        exit(colorize("[x] access to host '%s' seems to be restricted%s" % (hostname, (" (%d: '<title>%s</title>')" % (original[HTTPCODE], original[TITLE].strip())) if original[TITLE] else "")))

    challenge = None
    if all(_ in original[HTML].lower() for _ in ("eval", "<script")):
        match = re.search(r"(?is)<body[^>]*>(.*)</body>", re.sub(r"(?is)<script.+?</script>", "", original[HTML]))
        if re.search(r"(?i)<(body|div)", original[HTML]) is None or (match and len(match.group(1)) == 0):
            challenge = re.search(r"(?is)<script.+</script>", original[HTML]).group(0).replace("\n", "\\n")
            print(colorize("[x] anti-robot JS challenge detected ('%s%s')" % (challenge[:MAX_JS_CHALLENGE_SNAPLEN], "..." if len(challenge) > MAX_JS_CHALLENGE_SNAPLEN else "")))

    protection_keywords = GENERIC_PROTECTION_KEYWORDS
    protection_regex = GENERIC_PROTECTION_REGEX % '|'.join(keyword for keyword in protection_keywords if keyword not in original[HTML].lower())

    print(colorize("[i] running basic heuristic test..."))
    if not check_payload(HEURISTIC_PAYLOAD):
        check = False
        if options.url.startswith("https://"):
            options.url = options.url.replace("https://", "http://")
            check = check_payload(HEURISTIC_PAYLOAD)
        if not check:
            if non_blind_check(intrusive[RAW]):
                exit(colorize("[x] unable to continue due to static responses%s" % (" (captcha)" if re.search(r"(?i)captcha", intrusive[RAW]) is not None else "")))
            elif challenge is None:
                exit(colorize("[x] host '%s' does not seem to be protected" % hostname))
            else:
                exit(colorize("[x] response not changing without JS challenge solved"))

    if options.fast and not non_blind:
        exit(colorize("[x] fast exit because of missing non-blind match"))

    if not intrusive[HTTPCODE]:
        print(colorize("[i] rejected summary: RST|DROP"))
    else:
        _ = "...".join(match.group(0) for match in re.finditer(GENERIC_ERROR_MESSAGE_REGEX, intrusive[HTML])).strip().replace("  ", " ")
        print(colorize(("[i] rejected summary: %d ('%s%s')" % (intrusive[HTTPCODE], ("<title>%s</title>" % intrusive[TITLE]) if intrusive[TITLE] else "", "" if not _ or intrusive[HTTPCODE] < 400 else ("...%s" % _))).replace(" ('')", "")))

    found = non_blind_check(intrusive[RAW] if intrusive[HTTPCODE] is not None else original[RAW])

    if not found:
        print(colorize("[-] non-blind match: -"))

    for item in DATA_JSON["payloads"]:
        info, payload = item.split("::", 1)
        counter += 1
        # IS_TTY = True
        if IS_TTY:
            sys.stdout.write(colorize("\r[i] running payload tests... (%d/%d)\r" % (counter, len(DATA_JSON["payloads"]))))
            sys.stdout.flush()

        if counter % VERIFY_OK_INTERVAL == 0:
            for i in xrange(VERIFY_RETRY_TIMES):
                if not check_payload(str(random.randint(1, 9)), protection_regex):
                    break
                elif i == VERIFY_RETRY_TIMES - 1:
                    exit(colorize("[x] host '%s' seems to be misconfigured or rejecting benign requests%s" % (hostname, (" (%d: '<title>%s</title>')" % (intrusive[HTTPCODE], intrusive[TITLE].strip())) if intrusive[TITLE] else "")))
                else:
                    time.sleep(5)
        # IS_TTY = False
        last = check_payload(payload, protection_regex)
        non_blind_check(intrusive[RAW])
        signature += struct.pack(">H", ((calc_hash(payload, binary=False) << 1) | last) & 0xffff)
        results += 'x' if last else '.'

        if last and info not in blocked:
            blocked.append(info)

    _ = calc_hash(signature)
    signature = "%s:%s" % (_.encode("hex") if not hasattr(_, "hex") else _.hex(), base64.b64encode(signature).decode("ascii"))

    print(colorize("%s[=] results: '%s'" % ("\n" if IS_TTY else "", results)))

    hardness = 100 * results.count('x') // len(results)
    print(colorize("[=] hardness: %s (%d%%)" % ("insane" if hardness >= 80 else ("hard" if hardness >= 50 else ("moderate" if hardness >= 30 else "easy")), hardness)))

    if blocked:
        print(colorize("[=] blocked categories: %s" % ", ".join(blocked)))

    if not results.strip('.') or not results.strip('x'):
        print(colorize("[-] blind match: -"))

        if re.search(r"(?i)captcha", original[HTML]) is not None:
            exit(colorize("[x] there seems to be an activated captcha"))
    else:
        print(colorize("[=] signature: '%s'" % signature))

        if signature in SIGNATURES:
            waf = SIGNATURES[signature]
            print(colorize("[+] blind match: '%s' (100%%)" % format_name(waf)))
        elif results.count('x') < MIN_MATCH_PARTIAL:
            print(colorize("[-] blind match: -"))
        else:
            matches = {}
            markers = set()
            decoded = base64.b64decode(signature.split(':')[-1])
            for i in xrange(0, len(decoded), 2):
                part = struct.unpack(">H", decoded[i: i + 2])[0]
                markers.add(part)

            for candidate in SIGNATURES:
                counter_y, counter_n = 0, 0
                decoded = base64.b64decode(candidate.split(':')[-1])
                for i in xrange(0, len(decoded), 2):
                    part = struct.unpack(">H", decoded[i: i + 2])[0]
                    if part in markers:
                        counter_y += 1
                    elif any(_ in markers for _ in (part & ~1, part | 1)):
                        counter_n += 1
                result = int(round(100.0 * counter_y / (counter_y + counter_n)))
                if SIGNATURES[candidate] in matches:
                    if result > matches[SIGNATURES[candidate]]:
                        matches[SIGNATURES[candidate]] = result
                else:
                    matches[SIGNATURES[candidate]] = result

            if chained:
                for _ in list(matches.keys()):
                    if matches[_] < 90:
                        del matches[_]

            if not matches:
                print(colorize("[-] blind match: - "))
                print(colorize("[!] probably chained web protection systems"))
            else:
                matches = [(_[1], _[0]) for _ in matches.items()]
                matches.sort(reverse=True)

                print(colorize("[+] blind match: %s" % ", ".join("'%s' (%d%%)" % (format_name(matches[i][1]), matches[i][0]) for i in xrange(min(len(matches), MAX_MATCHES) if matches[0][0] != 100 else 1))))

    print()

Let's look at the first part:

    global original

    hostname = options.url.split("//")[-1].split('/')[0].split(':')[0]

    if not hostname.replace('.', "").isdigit():
        print(colorize("[i] checking hostname '%s'..." % hostname))
        try:
            socket.getaddrinfo(hostname, None)
        except socket.gaierror:
            exit(colorize("[x] host '%s' does not exist" % hostname))

    results = ""
    signature = b""
    counter = 0
    original = retrieve(options.url)

    if 300 <= (original[HTTPCODE] or 0) < 400 and original[URL]:
        original = retrieve(original[URL])

    options.url = original[URL]

    if original[HTTPCODE] is None:
        exit(colorize("[x] missing valid response"))

    if not any((options.string, options.code)) and original[HTTPCODE] >= 400:
        non_blind_check(original[RAW])
        if options.debug:
            print("\r---%s" % (40 * ' '))
            print(original[HTTPCODE], original[RAW])
            print("---")
        exit(colorize("[x] access to host '%s' seems to be restricted%s" % (hostname, (" (%d: '<title>%s</title>')" % (original[HTTPCODE], original[TITLE].strip())) if original[TITLE] else "")))

    challenge = None
    if all(_ in original[HTML].lower() for _ in ("eval", "<script")):
        match = re.search(r"(?is)<body[^>]*>(.*)</body>", re.sub(r"(?is)<script.+?</script>", "", original[HTML]))
        if re.search(r"(?i)<(body|div)", original[HTML]) is None or (match and len(match.group(1)) == 0):
            challenge = re.search(r"(?is)<script.+</script>", original[HTML]).group(0).replace("\n", "\\n")
            print(colorize("[x] anti-robot JS challenge detected ('%s%s')" % (challenge[:MAX_JS_CHALLENGE_SNAPLEN], "..." if len(challenge) > MAX_JS_CHALLENGE_SNAPLEN else "")))

This is step one: request the hostname itself (e.g. baidu.com).
It first checks whether the hostname resolves at all, then stores the benign response in original via retrieve() (following one redirect if necessary) and runs a non_blind_check on it. If the benign request fails or is already rejected, it simply exit()s??? Pitfall +1.
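
The hostname extraction and the DNS check are easy to try in isolation; here is a small sketch using the same splitting logic as run():

import socket

url = "https://baidu.com:443/some/path"
# same splitting run() uses to pull the bare hostname out of the URL
hostname = url.split("//")[-1].split('/')[0].split(':')[0]

try:
    socket.getaddrinfo(hostname, None)  # raises socket.gaierror if it does not resolve
    print("hostname '%s' resolves" % hostname)
except socket.gaierror:
    print("host '%s' does not exist" % hostname)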

Reading on, the next part is the actual testing:

    protection_keywords = GENERIC_PROTECTION_KEYWORDS
    protection_regex = GENERIC_PROTECTION_REGEX % '|'.join(keyword for keyword in protection_keywords if keyword not in original[HTML].lower())

    print(colorize("[i] running basic heuristic test..."))
    if not check_payload(HEURISTIC_PAYLOAD):
        check = False
        if options.url.startswith("https://"):
            options.url = options.url.replace("https://", "http://")
            check = check_payload(HEURISTIC_PAYLOAD)
        if not check:
            if non_blind_check(intrusive[RAW]):
                exit(colorize("[x] unable to continue due to static responses%s" % (" (captcha)" if re.search(r"(?i)captcha", intrusive[RAW]) is not None else "")))
            elif challenge is None:
                exit(colorize("[x] host '%s' does not seem to be protected" % hostname))
            else:
                exit(colorize("[x] response not changing without JS challenge solved"))

    if options.fast and not non_blind:
        exit(colorize("[x] fast exit because of missing non-blind match"))

    if not intrusive[HTTPCODE]:
        print(colorize("[i] rejected summary: RST|DROP"))
    else:
        _ = "...".join(match.group(0) for match in re.finditer(GENERIC_ERROR_MESSAGE_REGEX, intrusive[HTML])).strip().replace("  ", " ")
        print(colorize(("[i] rejected summary: %d ('%s%s')" % (intrusive[HTTPCODE], ("<title>%s</title>" % intrusive[TITLE]) if intrusive[TITLE] else "", "" if not _ or intrusive[HTTPCODE] < 400 else ("...%s" % _))).replace(" ('')", "")))

    found = non_blind_check(intrusive[RAW] if intrusive[HTTPCODE] is not None else original[RAW])

    if not found:
        print(colorize("[-] non-blind match: -"))

    for item in DATA_JSON["payloads"]:
        info, payload = item.split("::", 1)
        counter += 1
        # IS_TTY = True
        if IS_TTY:
            sys.stdout.write(colorize("\r[i] running payload tests... (%d/%d)\r" % (counter, len(DATA_JSON["payloads"]))))
            sys.stdout.flush()

        if counter % VERIFY_OK_INTERVAL == 0:
            for i in xrange(VERIFY_RETRY_TIMES):
                if not check_payload(str(random.randint(1, 9)), protection_regex):
                    break
                elif i == VERIFY_RETRY_TIMES - 1:
                    exit(colorize("[x] host '%s' seems to be misconfigured or rejecting benign requests%s" % (hostname, (" (%d: '<title>%s</title>')" % (intrusive[HTTPCODE], intrusive[TITLE].strip())) if intrusive[TITLE] else "")))
                else:
                    time.sleep(5)
        # IS_TTY = False
        last = check_payload(payload, protection_regex)
        non_blind_check(intrusive[RAW])
        signature += struct.pack(">H", ((calc_hash(payload, binary=False) << 1) | last) & 0xffff)
        results += 'x' if last else '.'

        if last and info not in blocked:
            blocked.append(info)

First the protection regex is initialized (keywords that already appear in the benign response are excluded from it).
Then the HEURISTIC_PAYLOAD is sent as a malicious request:
  If check_payload() returns True, this branch needs no further handling.
  If it returns False, the fallback branch is entered:
    since the URL was (by default) requested over https://, it is retried over http://,
    and if that still fails, the tool gives up (static responses / host not protected / unsolved JS challenge).

I have never used --fast, so let's keep going.
As you can see, there is a case where the rejection produces no HTTP code at all (connection reset or dropped); that is practically a guaranteed WAF. If there is a code, a rejection summary is printed.
Then another non-blind check is performed.

Then the loop over the 45 payload tests begins (this is what makes identYwaf slow).
Note that every check_payload() call inside this loop uses the stripped-down protection_regex, not the default one.

Every 5 payloads, a near-benign request (just a single random digit) is sent. If even that is "rejected", the site has probably started blocking our requests, so the tool sleeps for a while and retries; after 3 failed retries it exits. Honestly, firing 45 attack payloads and getting your IP blocked by the WAF is to be expected. A proxy pool would mitigate this, but I can't afford proxies!
After that comes the regular check_payload() call for the actual payload; its hash is computed and the outcome is recorded: an 'x' is appended to the results string where a payload was blocked (a '.' where it passed), and the corresponding category is added to the blocked list.

Once the loop finishes, the final signature hash is computed and the results are printed.

calc_hash()
def calc_hash(value, binary=True):
    value = value.encode("utf8") if not isinstance(value, bytes) else value
    result = zlib.crc32(value) & 0xffff
    if binary:
        result = struct.pack(">H", result)
    return result
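
As a small standalone illustration of how the loop above builds the blind signature (calc_hash of the payload shifted left by one, with the blocked/passed bit as the lowest bit, two bytes per payload, then a CRC prefix plus base64 of the whole thing), assuming two example payloads taken from data.json:

import base64
import struct
import zlib

def calc_hash(value, binary=True):
    value = value.encode("utf8") if not isinstance(value, bytes) else value
    result = zlib.crc32(value) & 0xffff
    if binary:
        result = struct.pack(">H", result)
    return result

signature = b""
# pretend the first payload was blocked (True) and the second one passed (False)
for payload, blocked in (("<script>", True), ("1 AND 1", False)):
    signature += struct.pack(">H", ((calc_hash(payload, binary=False) << 1) | blocked) & 0xffff)

# final form, as printed by run(): "<crc-of-signature>:<base64-of-signature>"
print("%s:%s" % (calc_hash(signature).hex(), base64.b64encode(signature).decode("ascii")))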

What follows is the output and signature-matching logic (exact and partial blind matches against SIGNATURES); I won't go through it in detail.

Summary

identYwaf does have some clever tricks up its sleeve, but it also has plenty of problems. The biggest weakness is concentrated in retrieve(): the whole tool leans heavily on it, yet it only knows how to handle plain HTML responses, which clearly isn't enough anymore. The tool was written in 2019 and has seen updates since, but in 2024 the vast majority of sites deploy much stronger JavaScript-based anti-bot measures, and having no JS handling at all is an obvious shortcoming.

Also, running the 45 payload tests without a proxy very easily ends with:

[x] host 'ip' seems to be misconfigured or rejecting benign requests

And the 45 payload tests are slow; they simply take too long.
Improvement direction: run the 45 test cases in parallel.
There isn't much to be done about IP bans themselves; realistically only a proxy pool gets around them.

In one sentence: identYwaf falls quite a way short of wafw00f; wafw00f simply outclasses it.


Update 2024-11-22: some optimizations


Optimizations

I made some optimizations to identYwaf. Not to its detection strategy (I honestly don't know that much about WAF identification); what I did was turn the serial 45-payload test into a parallel one (and, along the way, replace all the urllib usage with the requests library). If you're wondering how a blocking library like requests can run concurrently, I recommend this article: 《asyncio 系列》7. 在 asyncio 中引入多线程 (on combining asyncio with thread pools).
Of course, I didn't switch everything to requests for any deep reason, purely convenience: porting urllib to requests is easy, going straight to aiohttp is much harder (converting requests code to aiohttp afterwards is actually not that bad, I just ran out of motivation once the urllib-to-requests port was done). A minimal sketch of the idea follows below.
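
A minimal sketch of that idea, assuming Python 3.9+ (for asyncio.to_thread) and the third-party requests library; check_payload_blocking and the tiny payload list are placeholders for illustration, not identYwaf's real functions:

import asyncio
import requests

TARGET = "http://example.com/"
PAYLOADS = ["<script>", "1 AND 1", "admin*)((|userpassword=*)"]  # a few payloads from data.json

def check_payload_blocking(payload):
    # placeholder for a single payload test: send the payload as a GET parameter with requests
    try:
        resp = requests.get(TARGET, params={"q": payload}, timeout=10)
        return resp.status_code != 200
    except requests.RequestException:
        return True  # no response / connection error is treated as "rejected"

async def run_all():
    # run the blocking requests calls in worker threads, concurrently
    tasks = [asyncio.to_thread(check_payload_blocking, p) for p in PAYLOADS]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    blocked = asyncio.run(run_all())
    print(''.join('x' if b else '.' for b in blocked))

Note that asyncio.gather() keeps the results in payload order, which matters because the blind signature depends on that order; the periodic benign-request verification every 5 payloads also has to be rethought once requests overlap in time.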

My optimized fork of identYwaf: mengsiIdentYwaf

The optimized code is on GitHub. There are still quite a few bugs and issues, and I probably won't maintain it much. The assignment isn't checked for a few more days, so if problems come up in that window and I see them I may fix them, but after that I definitely won't come back to this project; my changes are fairly rough anyway. It's my first time sharing something on GitHub, so consider it practice [facepalm].
