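※ Dictionaries
A dictionary is a collection of key-value pairs; a value is looked up by its key.
◎ Creating a dictionary
1. Use { } to create a dictionary.
2. Use : to separate each key from its value (key: value).
3. Keys must be immutable (hashable) and unique; values can be of any type.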
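The three rules above can be tried out directly. The following is a minimal sketch; the variable names (phone_book, grid, dup, list_key_ok) are made up for illustration.

```python
# Create dictionaries with { } and key: value pairs
phone_book = {'John': 4533344, 'Bob': 54562554}
empty = {}

# Tuples are immutable, so they can serve as keys
grid = {(0, 0): 'origin', (1, 2): 'point'}

# Keys are unique: a repeated key keeps only the last value
dup = {'a': 1, 'a': 2}

# A list is mutable (unhashable), so using it as a key raises TypeError
try:
    bad = {[1, 2]: 'x'}
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(phone_book, empty, grid, dup, list_key_ok)
```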
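◎ Accessing a dictionary: use the [ ] operator with the key as the index. Looking up a missing key raises KeyError; assigning to a new key adds it.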
>>> my_dict = {'John':4533344,'Bob':54562554}
>>> print(my_dict['John'])
4533344
>>> print(my_dict['Tom'])
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    print(my_dict['Tom'])
KeyError: 'Tom'
>>> my_dict['Tom'] = 97799
>>> print(my_dict['Tom'])
97799
>>> my_dict
{'John': 4533344, 'Bob': 54562554, 'Tom': 97799}
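When a missing key should not raise KeyError, the built-in dict.get method is one alternative to the [ ] operator. This sketch reuses my_dict from the session above; the default value 0 is an arbitrary choice for illustration.

```python
my_dict = {'John': 4533344, 'Bob': 54562554}

# get() returns None for a missing key instead of raising KeyError
tom = my_dict.get('Tom')

# A second argument supplies a default for missing keys
tom_default = my_dict.get('Tom', 0)

# get() does not insert the key; the dictionary is unchanged
still_missing = 'Tom' not in my_dict

print(tom, tom_default, still_missing)
```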
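◎ Dictionary operators and methods:
1. len(my_dict): the number of key-value pairs in the dictionary
2. key in my_dict: tests whether key is one of the dictionary's keys, in O(1) time on average; Python 2 also offered the equivalent my_dict.has_key(key), which was removed in Python 3
3. for key in my_dict: iterates over the keys (in insertion order since Python 3.7; earlier versions guaranteed no particular order)
4. my_dict.items(): all key-value pairs
5. my_dict.keys(): all keys
6. my_dict.values(): all values
7. my_dict.clear(): removes every entry from the dictionary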
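The operators and methods listed above can be exercised on the dictionary from the earlier session. This is a small sketch; the result variable names are chosen here for illustration.

```python
my_dict = {'John': 4533344, 'Bob': 54562554, 'Tom': 97799}

count = len(my_dict)                    # number of key-value pairs
has_john = 'John' in my_dict            # membership test on keys
has_amy = 'Amy' in my_dict

keys_seen = [key for key in my_dict]    # iterating yields the keys
pairs = list(my_dict.items())           # (key, value) tuples
keys = list(my_dict.keys())
values = list(my_dict.values())

my_dict.clear()                         # the dictionary is now empty
print(count, has_john, has_amy, keys_seen, pairs, values, my_dict)
```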
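Example: letter counting: read a string and count how many times each letter occurs
Method 1: create 26 variables, one per letter, each holding that letter's count
Method 2: create a list of 26 elements and convert each letter to its index
s = 'ansgloememfutuv'
lst = [0] * 26
for i in s:
    # Map 'a'..'z' to index 0..25 (assumes lowercase ASCII letters only)
    lst[ord(i) - ord('a')] += 1
print(lst)
Method 3: create a dictionary with the letters as keys and their counts as values
s = 'ansgloememfutuv'
d = {}
for i in s:
    if i in d:
        d[i] += 1      # letter seen before: increment its count
    else:
        d[i] = 1       # first occurrence: start counting at 1
print(d)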
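The if/else in method 3 can be shortened. The sketch below shows two alternatives that are not part of the notes above: dict.get with a default of 0, and collections.Counter from the standard library.

```python
from collections import Counter

s = 'ansgloememfutuv'

# Variant A: get() returns 0 for a letter not seen yet
d = {}
for ch in s:
    d[ch] = d.get(ch, 0) + 1

# Variant B: Counter does the counting in one call
c = Counter(s)

print(d)
print(c)
```

Both variants produce the same counts as method 3; Counter is a dict subclass, so the operators and methods described earlier still apply to it.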
Example: word counting: read the novel "emma.txt" and print the 10 most common words
word_freq = {}
# The with statement closes the file even if an error occurs
with open('emma.txt') as f:
    for line in f:
        words = line.strip().split()
        for word in words:
            if word in word_freq:
                word_freq[word] += 1
            else:
                word_freq[word] = 1
# Build (frequency, word) tuples so that sorting orders by frequency first
freq_word = []
for word, freq in word_freq.items():
    freq_word.append((freq, word))
freq_word.sort(reverse=True)
for freq, word in freq_word[:10]:
    print(word)
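The count-then-sort pattern above can also be written with collections.Counter and its most_common method. This is a sketch on a small in-memory string rather than emma.txt; the sample text is invented for illustration.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of a file
text = """the cat sat on the mat
the dog sat on the log"""

word_freq = Counter()
for line in text.splitlines():
    # update() adds one count for every word in the list
    word_freq.update(line.strip().split())

# most_common(n) returns the n highest (word, count) pairs
top = word_freq.most_common(3)
for word, freq in top:
    print(word, freq)
```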
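Example: inverting a dictionary: build a new dictionary whose keys are the original dictionary's values and whose values are the original dictionary's keys (one value may correspond to several keys, so the keys are stored in a list)
d1 = {'Zhang': 123, 'Wang': 456, 'Li': 123, 'Zhao': 456}
d2 = {}
for name, room in d1.items():
    if room in d2:
        d2[room].append(name)    # room already present: extend its list
    else:
        d2[room] = [name]        # first name for this room: start a new list
print(d2)
Result: {123: ['Zhang', 'Li'], 456: ['Wang', 'Zhao']}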
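The same inversion can be written without the if/else by using dict.setdefault, which returns the existing list for a key or inserts and returns a new empty one. This variant is an alternative sketch, not the version given in the notes.

```python
d1 = {'Zhang': 123, 'Wang': 456, 'Li': 123, 'Zhao': 456}

d2 = {}
for name, room in d1.items():
    # setdefault inserts [] the first time a room is seen, then returns the list
    d2.setdefault(room, []).append(name)

print(d2)
```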
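◎ Sets: an unordered collection of unique elements (keys); similar to a dictionary, but without values
Creating: x = set(); x = {key1, key2, ...}
Adding and removing: x.add( ), x.remove( )
Operators: - difference; & intersection; | union; != not equal; == equal; in membership; for key in x iteration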
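A short sketch of the set operations listed above; the sets a, b, and x are sample values invented for illustration. Note that { } on its own creates an empty dictionary, so an empty set has to be written as set().

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

diff = a - b       # elements in a but not in b
inter = a & b      # elements in both
union = a | b      # elements in either

x = set()          # {} would create an empty dict, not a set
x.add('k')
x.add('k')         # adding a duplicate has no effect
size_after_add = len(x)
x.remove('k')      # remove() raises KeyError if the element is missing

same = {1, 2} == {2, 1}    # order does not matter for equality
print(diff, inter, union, size_after_add, x, same, 3 in a)
```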
Example: Chinese word segmentation: 我爱北京天安门 → 我/爱/北京/天安门
Algorithm: forward maximum matching
def load_dict(filename):
    # Read the lexicon (one word per line) into a set and track the longest word
    word_dict = set()
    max_len = 1
    with open(filename, encoding='utf-8') as f:
        for line in f:
            word = line.strip()
            if not word:
                continue
            word_dict.add(word)
            if len(word) > max_len:
                max_len = len(word)
    return max_len, word_dict

def fmm_word_seg(sent, max_len, word_dict):
    begin = 0
    words = []
    while begin < len(sent):
        # Try the longest candidate first, then shrink it one character at a time
        for end in range(min(begin + max_len, len(sent)), begin, -1):
            # A single character is always accepted, so unknown characters are kept
            if end == begin + 1 or sent[begin:end] in word_dict:
                words.append(sent[begin:end])
                break
        begin = end
    return words

max_len, word_dict = load_dict('lexicon.txt')
sent = input("Input a sentence : ")
words = fmm_word_seg(sent, max_len, word_dict)
for word in words:
    print(word)
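To try forward maximum matching without a lexicon file or keyboard input, the sketch below uses a tiny in-memory lexicon. The function name fmm_segment and the four-word lexicon are assumptions made for this illustration; unknown characters are emitted as single-character words.

```python
def fmm_segment(sent, word_dict):
    """Greedy forward maximum matching over an in-memory set of words."""
    max_len = max((len(w) for w in word_dict), default=1)
    words = []
    begin = 0
    while begin < len(sent):
        # Longest candidate first; fall back to one character if nothing matches
        for end in range(min(begin + max_len, len(sent)), begin, -1):
            if end == begin + 1 or sent[begin:end] in word_dict:
                words.append(sent[begin:end])
                break
        begin = end
    return words

# Hypothetical lexicon, just large enough for the example sentence
lexicon = {'我', '爱', '北京', '天安门'}

result = fmm_segment('我爱北京天安门', lexicon)
print('/'.join(result))

# A character missing from the lexicon is kept as its own token
with_unknown = fmm_segment('我爱X北京', lexicon)
print('/'.join(with_unknown))
```

Because the match is greedy, the result depends on the lexicon: a longer entry always wins over shorter ones that start at the same position.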