Extracting the Most Frequent IP from Massive Logs: a Python Implementation

Article link: https://blog.youkuaiyun.com/u010439949/article/details/8911201

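This post uses Python to extract the most frequently occurring IP from massive log data. It follows a strategy of log splitting, internal heap sort, and loser-tree merging, and walks through the whole process, including the code for each step and the results.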


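I recently read the article "教你如何迅速秒杀掉：99%的海量数据处理面试题" (How to quickly crack 99% of massive-data processing interview questions). The first problem in it is: given massive log data, extract the IP that visited Baidu the most times on a given day. In this post I implement my own approach to that problem.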

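Imagine that all identical IPs in the log file were adjacent to each other. A single scan of the file would then be enough to find the IP with the highest count. That is the idea behind this post.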
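A minimal sketch of that single scan, assuming the lines are already sorted so that equal IPs sit next to each other. The function name `most_frequent_in_sorted` is hypothetical and does not come from the original post.

```python
def most_frequent_in_sorted(lines):
    """Return (ip, count) for the longest run of equal adjacent lines.

    Assumes the input is sorted (or at least grouped), so every
    occurrence of an IP is adjacent to the others.
    """
    best_ip, best_count = None, 0
    current_ip, current_count = None, 0
    for line in lines:
        ip = line.strip()
        if not ip:
            continue  # ignore blank lines
        if ip == current_ip:
            current_count += 1
        else:
            # A new run starts here.
            current_ip, current_count = ip, 1
        if current_count > best_count:
            best_ip, best_count = current_ip, current_count
    return best_ip, best_count


sample = [
    "1.2.3.4\n", "1.2.3.4\n",
    "188.62.136.28\n", "188.62.136.28\n", "188.62.136.28\n",
    "9.9.9.9\n",
]
result = most_frequent_in_sorted(sample)
```

Only two counters are kept in memory, so the scan works on a file object just as well as on a list. On a tie, the run seen first wins.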

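Sorting happens to be a good way to make identical IPs adjacent. Sorting requires comparison, and an IP is a string such as "188.62.136.28", so how do we compare two of them? In Python, strings can be compared directly: the string whose first character is larger is the larger one, and if the first characters are equal, the comparison moves on to the following characters in turn. So we do not need to care where the '.' characters fall in an IP; we simply follow Python's own rules. For example, "9.131.255.66" > "255.255.255.255" is perfectly normal under these rules. The order is not numeric, but that does not matter here, because all we need is for equal strings to end up next to each other.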
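The comparison rule can be checked directly. This is a small illustration with made-up sample values, not code from the original post:

```python
a = "9.131.255.66"
b = "255.255.255.255"

# Character-by-character comparison: '9' > '2', so a compares greater
# than b, even though 9 < 255 numerically.
a_is_greater = a > b

ips = ["9.131.255.66", "255.255.255.255", "188.62.136.28", "9.131.255.66"]
# Sorting by plain string order still places equal IPs side by side.
ordered = sorted(ips)
```

After sorting, the two copies of "9.131.255.66" are adjacent at the end of the list, which is all the scanning step needs.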

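At this point the problem has become one of sorting a large data set. When the data is too large for the available memory, the only option is to break it into smaller pieces, handle each piece separately, and merge the results. To save time, this example uses a file containing only 1,000,000 IPs.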

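In this post the log file is split into equal parts, each resulting small file is sorted in memory, and the sorted files are then merged with a loser tree. Replacement selection sort could be used to reduce the number of small files produced; the next blog post implements it that way.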
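The split, sort, and merge pipeline can be sketched roughly as follows. This is not the implementation from this post: it uses the built-in `list.sort()` for the in-memory sort and the standard library's `heapq.merge` as a stand-in for the loser tree. The names `split_and_sort` and `merge_sorted`, the chunk file names, and the sample data are all hypothetical. It also assumes every line ends with a newline.

```python
import heapq
import itertools
import os
import tempfile


def split_and_sort(path, lines_per_chunk, tmp_dir):
    """Split path into chunk files of at most lines_per_chunk lines,
    sorting each chunk in memory before writing it out."""
    chunk_paths = []
    with open(path) as src:
        while True:
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            chunk.sort()  # plain string order, as described above
            chunk_path = os.path.join(tmp_dir, "chunk_%d.txt" % len(chunk_paths))
            with open(chunk_path, "w") as out:
                out.writelines(chunk)
            chunk_paths.append(chunk_path)
    return chunk_paths


def merge_sorted(chunk_paths, out_path):
    """K-way merge of sorted chunk files (heapq.merge, not a loser tree)."""
    files = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as out:
            # heapq.merge reads lazily, holding one line per chunk in memory.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()


# Tiny demonstration with made-up data.
tmp_dir = tempfile.mkdtemp()
log_path = os.path.join(tmp_dir, "ips.log")
sample = ["9.9.9.9", "1.1.1.1", "9.9.9.9", "2.2.2.2", "1.1.1.1", "9.9.9.9"]
with open(log_path, "w") as f:
    f.writelines(ip + "\n" for ip in sample)

chunks = split_and_sort(log_path, 2, tmp_dir)
sorted_path = os.path.join(tmp_dir, "sorted.log")
merge_sorted(chunks, sorted_path)
with open(sorted_path) as f:
    merged = f.read().split()
```

A loser tree and a binary heap both select the next smallest line among k chunks in O(log k) time; the loser tree needs fewer comparisons per step, which is why it is often preferred for external merging.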

The complete implementation follows.

1. Generating the log file: MakeIPs.py

Code:

#!/usr/bin/python
# Filename MakeIPs.py

__author__ = 'ihippy'
__email__ = 'ihippy@163.com'
__date__ =  '$Date: 2013.05.08 $'

import random

def makeRandom(firstNum, lastNum):
	return random.randint(firstNum, lastNum)

def makeIP(filePath, numberOfLines):
	# Append numberOfLines random IPv4 addresses to filePath, one per line.
	try:
		IP = []
		for i in range(numberOfLines):
			IP.append('.'.join(str(makeRandom(0, 255)) for _ in range(4)) + '\n')
		# Append mode: repeated calls keep extending the same file.
		with open(filePath, 'a') as file_handler:
			file_handler.writelines(IP)
	except IOError:
		# File errors raise IOError, not EOFError.
		print('Operate Failed!')

if __name__ == '__main__':
	import sys

	try:
		filePath = sys.argv[1]
		lineNum = int(sys.argv[2])
	except (IndexError, ValueError):
		# Missing argument, or a line count that is not an integer.
		print('Wrong Arguments!')
		print('''You need 2 Parameters in total.
	1. The path of the target file.
	2. The Number of lines in the target file.
You Should do like this:
	python MakeIPs.py /root/hehe/file 1000000''')
		sys.exit(1)

	from time import ctime
	print('The time now is: ' + ctime())
	print('Start...')

	# Generate at most 1,000,000 lines per call to bound memory use.
	if lineNum > 1000000:
		a = lineNum // 1000000
		b = lineNum % 1000000
		for i in range(0, a):
			makeIP(filePath, 1000000)
		makeIP(filePath, b)
	else:
		makeIP(filePath, lineNum)

	print('Work Done, and the time now is: ' + ctime())

Screenshot of a run: