First, a note on an error I hit: unindent does not match any outer indentation level. This is usually caused by inconsistent indentation, with some lines indented with spaces and others with tabs; converting everything to spaces fixes it. (Spent half a day on this, ugh.)
1. Import the libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np
2. Prepare the data
Here data is a single passage of text, with \n separating the lines.
tokenizer = Tokenizer()
data="In the town of Athy one Jeremy Lanigan \n Battered away til he hadnt a pound. \nHis father died and made him a man again \n Left him a farm and ten acres of ground. \nHe gave a grand party for friends and relations \nWho didnt forget him when come to the wall, \nAnd if youll but listen Ill make your eyes glisten \nOf the rows and the ructions of Lanigans Ball. \nMyself to be sure got free invitation, \nFor all the nice girls and boys I might ask, \nAnd just in a minute both friends and relations \nWere dancing round merry as bees round a cask. \nJudy ODaly, that nice little milliner, \nShe tipped me a wink for to give her a call, \nAnd I soon arrived with Peggy McGilligan \nJust in time for Lanigans Ball. \nThere were lashings of punch and wine for the ladies, \nPotatoes and cakes; there was bacon and tea, \nThere were the Nolans, Dolans, OGradys \nCourting the girls and dancing away. \nSongs they went round as plenty as water, \nThe harp that once sounded in Taras old hall,\nSweet Nelly Gray and The Rat Catchers Daughter,\nAll singing together at Lanigans Ball. \nThey were doing all kinds of nonsensical polkas \nAll round the room in a whirligig. \nJulia and I, we banished their nonsense \nAnd tipped them the twist of a reel and a jig. \nAch mavrone, how the girls got all mad at me \nDanced til youd think the ceiling would fall. \nFor I spent three weeks at Brooks Academy \nLearning new steps for Lanigans Ball. \nThree long weeks I spent up in Dublin, \nThree long weeks to learn nothing at all,\n Three long weeks I spent up in Dublin, \nLearning new steps for Lanigans Ball. \nShe stepped out and I stepped in again, \nI stepped out and she stepped in again, \nShe stepped out and I stepped in again, \nLearning new steps for Lanigans Ball. \nBoys were all merry and the girls they were hearty \nAnd danced all around in couples and groups, \nTil an accident happened, young Terrance McCarthy \nPut his right leg through miss Finnertys hoops. \nPoor creature fainted and cried Meelia murther, \nCalled for her brothers and gathered them all. \nCarmody swore that hed go no further \nTil he had satisfaction at Lanigans Ball. \nIn the midst of the row miss Kerrigan fainted, \nHer cheeks at the same time as red as a rose. \nSome of the lads declared she was painted, \nShe took a small drop too much, I suppose. \nHer sweetheart, Ned Morgan, so powerful and able, \nWhen he saw his fair colleen stretched out by the wall, \nTore the left leg from under the table \nAnd smashed all the Chaneys at Lanigans Ball. \nBoys, oh boys, twas then there were runctions. \nMyself got a lick from big Phelim McHugh. \nI soon replied to his introduction \nAnd kicked up a terrible hullabaloo. \nOld Casey, the piper, was near being strangled. \nThey squeezed up his pipes, bellows, chanters and all. \nThe girls, in their ribbons, they got all entangled \nAnd that put an end to Lanigans Ball."
# lowercase everything and split on \n
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
# +1 because word indices start at 1; index 0 is reserved (for padding, or an OOV token if one were set)
total_words = len(tokenizer.word_index)+1
# print the word-to-index pairs
print(tokenizer.word_index)
print(total_words)
print(corpus)
{'and': 1, 'the': 2, 'a': 3, 'in': 4, 'all': 5, 'i': 6, 'for': 7, 'of': 8, 'lanigans': 9, 'ball': 10, 'were': 11, 'at': 12, 'to': 13, 'she': 14, 'stepped': 15, 'his': 16, 'girls': 17, 'as': 18, 'they': 19, 'til': 20, 'he': 21, 'again': 22, 'got': 23, 'boys': 24, 'round': 25, 'that': 26, 'her': 27, 'there': 28, 'three': 29, 'weeks': 30, 'up': 31, 'out': 32, 'him': 33, 'was': 34, 'spent': 35, 'learning': 36, 'new': 37, 'steps': 38, 'long': 39, 'away': 40, 'left': 41, 'friends': 42, 'relations': 43, 'when': 44, 'wall': 45, 'myself': 46, 'nice': 47, 'just': 48, 'dancing': 49, 'merry': 50, 'tipped': 51, 'me': 52, 'soon': 53, 'time': 54, 'old': 55, 'their': 56, 'them': 57, 'danced': 58, 'dublin': 59, 'an': 60, 'put': 61, 'leg': 62, 'miss': 63, 'fainted': 64, 'from': 65, 'town': 66, 'athy': 67, 'one': 68, 'jeremy': 69, 'lanigan': 70, 'battered': 71, 'hadnt': 72, 'pound': 73, 'father': 74, 'died': 75, 'made': 76, 'man': 77, 'farm': 78, 'ten': 79, 'acres': 80, 'ground': 81, 'gave': 82, 'grand': 83, 'party': 84, 'who': 85, 'didnt': 86, 'forget': 87, 'come': 88, 'if': 89, 'youll': 90, 'but': 91, 'listen': 92, 'ill': 93, 'make': 94, 'your': 95, 'eyes': 96, 'glisten': 97, 'rows': 98, 'ructions': 99, 'be': 100, 'sure': 101, 'free': 102, 'invitation': 103, 'might': 104, 'ask': 105, 'minute': 106, 'both': 107, 'bees': 108, 'cask': 109, 'judy': 110, 'odaly': 111, 'little': 112, 'milliner': 113, 'wink': 114, 'give': 115, 'call': 116, 'arrived': 117, 'with': 118, 'peggy': 119, 'mcgilligan': 120, 'lashings': 121, 'punch': 122, 'wine': 123, 'ladies': 124, 'potatoes': 125, 'cakes': 126, 'bacon': 127, 'tea': 128, 'nolans': 129, 'dolans': 130, 'ogradys': 131, 'courting': 132, 'songs': 133, 'went': 134, 'plenty': 135, 'water': 136, 'harp': 137, 'once': 138, 'sounded': 139, 'taras': 140, 'hall': 141, 'sweet': 142, 'nelly': 143, 'gray': 144, 'rat': 145, 'catchers': 146, 'daughter': 147, 'singing': 148, 'together': 149, 'doing': 150, 'kinds': 151, 'nonsensical': 152, 'polkas': 153, 'room': 154, 'whirligig': 155, 'julia': 156, 'we': 157, 'banished': 158, 'nonsense': 159, 'twist': 160, 'reel': 161, 'jig': 162, 'ach': 163, 'mavrone': 164, 'how': 165, 'mad': 166, 'youd': 167, 'think': 168, 'ceiling': 169, 'would': 170, 'fall': 171, 'brooks': 172, 'academy': 173, 'learn': 174, 'nothing': 175, 'hearty': 176, 'around': 177, 'couples': 178, 'groups': 179, 'accident': 180, 'happened': 181, 'young': 182, 'terrance': 183, 'mccarthy': 184, 'right': 185, 'through': 186, 'finnertys': 187, 'hoops': 188, 'poor': 189, 'creature': 190, 'cried': 191, 'meelia': 192, 'murther': 193, 'called': 194, 'brothers': 195, 'gathered': 196, 'carmody': 197, 'swore': 198, 'hed': 199, 'go': 200, 'no': 201, 'further': 202, 'had': 203, 'satisfaction': 204, 'midst': 205, 'row': 206, 'kerrigan': 207, 'cheeks': 208, 'same': 209, 'red': 210, 'rose': 211, 'some': 212, 'lads': 213, 'declared': 214, 'painted': 215, 'took': 216, 'small': 217, 'drop': 218, 'too': 219, 'much': 220, 'suppose': 221, 'sweetheart': 222, 'ned': 223, 'morgan': 224, 'so': 225, 'powerful': 226, 'able': 227, 'saw': 228, 'fair': 229, 'colleen': 230, 'stretched': 231, 'by': 232, 'tore': 233, 'under': 234, 'table': 235, 'smashed': 236, 'chaneys': 237, 'oh': 238, 'twas': 239, 'then': 240, 'runctions': 241, 'lick': 242, 'big': 243, 'phelim': 244, 'mchugh': 245, 'replied': 246, 'introduction': 247, 'kicked': 248, 'terrible': 249, 'hullabaloo': 250, 'casey': 251, 'piper': 252, 'near': 253, 'being': 254, 'strangled': 255, 'squeezed': 256, 'pipes': 257, 'bellows': 258, 'chanters': 259, 'ribbons': 260, 'entangled': 261, 'end': 262}
263
['in the town of athy one jeremy lanigan ', ' battered away til he hadnt a pound. ', 'his father died and made him a man again ', ' left him a farm and ten acres of ground. ', 'he gave a grand party for friends and relations ', 'who didnt forget him when come to the wall, ', 'and if youll but listen ill make your eyes glisten ', 'of the rows and the ructions of lanigans ball. ', 'myself to be sure got free invitation, ', 'for all the nice girls and boys i might ask, ', 'and just in a minute both friends and relations ', 'were dancing round merry as bees round a cask. ', 'judy odaly, that nice little milliner, ', 'she tipped me a wink for to give her a call, ', 'and i soon arrived with peggy mcgilligan ', 'just in time for lanigans ball. ', 'there were lashings of punch and wine for the ladies, ', 'potatoes and cakes; there was bacon and tea, ', 'there were the nolans, dolans, ogradys ', 'courting the girls and dancing away. ', 'songs they went round as plenty as water, ', 'the harp that once sounded in taras old hall,', 'sweet nelly gray and the rat catchers daughter,', 'all singing together at lanigans ball. ', 'they were doing all kinds of nonsensical polkas ', 'all round the room in a whirligig. ', 'julia and i, we banished their nonsense ', 'and tipped them the twist of a reel and a jig. ', 'ach mavrone, how the girls got all mad at me ', 'danced til youd think the ceiling would fall. ', 'for i spent three weeks at brooks academy ', 'learning new steps for lanigans ball. ', 'three long weeks i spent up in dublin, ', 'three long weeks to learn nothing at all,', ' three long weeks i spent up in dublin, ', 'learning new steps for lanigans ball. ', 'she stepped out and i stepped in again, ', 'i stepped out and she stepped in again, ', 'she stepped out and i stepped in again, ', 'learning new steps for lanigans ball. ', 'boys were all merry and the girls they were hearty ', 'and danced all around in couples and groups, ', 'til an accident happened, young terrance mccarthy ', 'put his right leg through miss finnertys hoops. ', 'poor creature fainted and cried meelia murther, ', 'called for her brothers and gathered them all. ', 'carmody swore that hed go no further ', 'til he had satisfaction at lanigans ball. ', 'in the midst of the row miss kerrigan fainted, ', 'her cheeks at the same time as red as a rose. ', 'some of the lads declared she was painted, ', 'she took a small drop too much, i suppose. ', 'her sweetheart, ned morgan, so powerful and able, ', 'when he saw his fair colleen stretched out by the wall, ', 'tore the left leg from under the table ', 'and smashed all the chaneys at lanigans ball. ', 'boys, oh boys, twas then there were runctions. ', 'myself got a lick from big phelim mchugh. ', 'i soon replied to his introduction ', 'and kicked up a terrible hullabaloo. ', 'old casey, the piper, was near being strangled. ', 'they squeezed up his pipes, bellows, chanters and all. ', 'the girls, in their ribbons, they got all entangled ', 'and that put an end to lanigans ball.']
3. Convert the text to sequences
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    print(token_list)
[4, 2, 66, 8, 67, 68, 69, 70]
[71, 40, 20, 21, 72, 3, 73]
[16, 74, 75, 1, 76, 33, 3, 77, 22]
[41, 33, 3, 78, 1, 79, 80, 8, 81]
[21, 82, 3, 83, 84, 7, 42, 1, 43]
[85, 86, 87, 33, 44, 88, 13, 2, 45]
[1, 89, 90, 91, 92, 93, 94, 95, 96, 97]
[8, 2, 98, 1, 2, 99, 8, 9, 10]
[46, 13, 100, 101, 23, 102, 103]
[7, 5, 2, 47, 17, 1, 24, 6, 104, 105]
[1, 48, 4, 3, 106, 107, 42, 1, 43]
[11, 49, 25, 50, 18, 108, 25, 3, 109]
[110, 111, 26, 47, 112, 113]
[14, 51, 52, 3, 114, 7, 13, 115, 27, 3, 116]
[1, 6, 53, 117, 118, 119, 120]
[48, 4, 54, 7, 9, 10]
[28, 11, 121, 8, 122, 1, 123, 7, 2, 124]
[125, 1, 126, 28, 34, 127, 1, 128]
[28, 11, 2, 129, 130, 131]
[132, 2, 17, 1, 49, 40]
[133, 19, 134, 25, 18, 135, 18, 136]
[2, 137, 26, 138, 139, 4, 140, 55, 141]
[142, 143, 144, 1, 2, 145, 146, 147]
[5, 148, 149, 12, 9, 10]
[19, 11, 150, 5, 151, 8, 152, 153]
[5, 25, 2, 154, 4, 3, 155]
[156, 1, 6, 157, 158, 56, 159]
[1, 51, 57, 2, 160, 8, 3, 161, 1, 3, 162]
[163, 164, 165, 2, 17, 23, 5, 166, 12, 52]
[58, 20, 167, 168, 2, 169, 170, 171]
[7, 6, 35, 29, 30, 12, 172, 173]
[36, 37, 38, 7, 9, 10]
[29, 39, 30, 6, 35, 31, 4, 59]
[29, 39, 30, 13, 174, 175, 12, 5]
[29, 39, 30, 6, 35, 31, 4, 59]
[36, 37, 38, 7, 9, 10]
[14, 15, 32, 1, 6, 15, 4, 22]
[6, 15, 32, 1, 14, 15, 4, 22]
[14, 15, 32, 1, 6, 15, 4, 22]
[36, 37, 38, 7, 9, 10]
[24, 11, 5, 50, 1, 2, 17, 19, 11, 176]
[1, 58, 5, 177, 4, 178, 1, 179]
[20, 60, 180, 181, 182, 183, 184]
[61, 16, 185, 62, 186, 63, 187, 188]
[189, 190, 64, 1, 191, 192, 193]
[194, 7, 27, 195, 1, 196, 57, 5]
[197, 198, 26, 199, 200, 201, 202]
[20, 21, 203, 204, 12, 9, 10]
[4, 2, 205, 8, 2, 206, 63, 207, 64]
[27, 208, 12, 2, 209, 54, 18, 210, 18, 3, 211]
[212, 8, 2, 213, 214, 14, 34, 215]
[14, 216, 3, 217, 218, 219, 220, 6, 221]
[27, 222, 223, 224, 225, 226, 1, 227]
[44, 21, 228, 16, 229, 230, 231, 32, 232, 2, 45]
[233, 2, 41, 62, 65, 234, 2, 235]
[1, 236, 5, 2, 237, 12, 9, 10]
[24, 238, 24, 239, 240, 28, 11, 241]
[46, 23, 3, 242, 65, 243, 244, 245]
[6, 53, 246, 13, 16, 247]
[1, 248, 31, 3, 249, 250]
[55, 251, 2, 252, 34, 253, 254, 255]
[19, 256, 31, 16, 257, 258, 259, 1, 5]
[2, 17, 4, 56, 260, 19, 23, 5, 261]
[1, 26, 61, 60, 262, 13, 9, 10]
4. Prepare the training data
Split each line into n-grams: the first two tokens, the first three tokens, and so on each become one training sample, which gives the network many more examples to learn from.
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
print(input_sequences)
[[4, 2], [4, 2, 66], [4, 2, 66, 8], [4, 2, 66, 8, 67], [4, 2, 66, 8, 67, 68], [4, 2, 66, 8, 67, 68, 69], [4, 2, 66, 8, 67, 68, 69, 70], [71, 40], [71, 40, 20], [71, 40, 20, 21], [71, 40, 20, 21, 72], [71, 40, 20, 21, 72, 3], [71, 40, 20, 21, 72, 3, 73], [16, 74], [16, 74, 75], [16, 74, 75, 1], [16, 74, 75, 1, 76], [16, 74, 75, 1, 76, 33], [16, 74, 75, 1, 76, 33, 3], [16, 74, 75, 1, 76, 33, 3, 77], [16, 74, 75, 1, 76, 33, 3, 77, 22], [41, 33], [41, 33, 3], [41, 33, 3, 78], [41, 33, 3, 78, 1], [41, 33, 3, 78, 1, 79], [41, 33, 3, 78, 1, 79, 80], [41, 33, 3, 78, 1, 79, 80, 8], [41, 33, 3, 78, 1, 79, 80, 8, 81], [21, 82], [21, 82, 3], [21, 82, 3, 83], [21, 82, 3, 83, 84], [21, 82, 3, 83, 84, 7], [21, 82, 3, 83, 84, 7, 42], [21, 82, 3, 83, 84, 7, 42, 1], [21, 82, 3, 83, 84, 7, 42, 1, 43], [85, 86], [85, 86, 87], [85, 86, 87, 33], [85, 86, 87, 33, 44], [85, 86, 87, 33, 44, 88], [85, 86, 87, 33, 44, 88, 13], [85, 86, 87, 33, 44, 88, 13, 2], [85, 86, 87, 33, 44, 88, 13, 2, 45], [1, 89], [1, 89, 90], [1, 89, 90, 91], [1, 89, 90, 91, 92], [1, 89, 90, 91, 92, 93], [1, 89, 90, 91, 92, 93, 94], [1, 89, 90, 91, 92, 93, 94, 95], [1, 89, 90, 91, 92, 93, 94, 95, 96], [1, 89, 90, 91, 92, 93, 94, 95, 96, 97], [8, 2], [8, 2, 98], [8, 2, 98, 1], [8, 2, 98, 1, 2], [8, 2, 98, 1, 2, 99], [8, 2, 98, 1, 2, 99, 8], [8, 2, 98, 1, 2, 99, 8, 9], [8, 2, 98, 1, 2, 99, 8, 9, 10], [46, 13], [46, 13, 100], [46, 13, 100, 101], [46, 13, 100, 101, 23], [46, 13, 100, 101, 23, 102], [46, 13, 100, 101, 23, 102, 103], [7, 5], [7, 5, 2], [7, 5, 2, 47], [7, 5, 2, 47, 17], [7, 5, 2, 47, 17, 1], [7, 5, 2, 47, 17, 1, 24], [7, 5, 2, 47, 17, 1, 24, 6], [7, 5, 2, 47, 17, 1, 24, 6, 104], [7, 5, 2, 47, 17, 1, 24, 6, 104, 105], [1, 48], [1, 48, 4], [1, 48, 4, 3], [1, 48, 4, 3, 106], [1, 48, 4, 3, 106, 107], [1, 48, 4, 3, 106, 107, 42], [1, 48, 4, 3, 106, 107, 42, 1], [1, 48, 4, 3, 106, 107, 42, 1, 43], [11, 49], [11, 49, 25], [11, 49, 25, 50], [11, 49, 25, 50, 18], [11, 49, 25, 50, 18, 108], [11, 49, 25, 50, 18, 108, 25], [11, 49, 25, 50, 18, 108, 25, 3], [11, 49, 25, 50, 18, 108, 25, 3, 109], [110, 111], [110, 111, 26], [110, 111, 26, 47], [110, 111, 26, 47, 112], [110, 111, 26, 47, 112, 113], [14, 51], [14, 51, 52], [14, 51, 52, 3], [14, 51, 52, 3, 114], [14, 51, 52, 3, 114, 7], [14, 51, 52, 3, 114, 7, 13], [14, 51, 52, 3, 114, 7, 13, 115], [14, 51, 52, 3, 114, 7, 13, 115, 27], [14, 51, 52, 3, 114, 7, 13, 115, 27, 3], [14, 51, 52, 3, 114, 7, 13, 115, 27, 3, 116], [1, 6], [1, 6, 53], [1, 6, 53, 117], [1, 6, 53, 117, 118], [1, 6, 53, 117, 118, 119], [1, 6, 53, 117, 118, 119, 120], [48, 4], [48, 4, 54], [48, 4, 54, 7], [48, 4, 54, 7, 9], [48, 4, 54, 7, 9, 10], [28, 11], [28, 11, 121], [28, 11, 121, 8], [28, 11, 121, 8, 122], [28, 11, 121, 8, 122, 1], [28, 11, 121, 8, 122, 1, 123], [28, 11, 121, 8, 122, 1, 123, 7], [28, 11, 121, 8, 122, 1, 123, 7, 2], [28, 11, 121, 8, 122, 1, 123, 7, 2, 124], [125, 1], [125, 1, 126], [125, 1, 126, 28], [125, 1, 126, 28, 34], [125, 1, 126, 28, 34, 127], [125, 1, 126, 28, 34, 127, 1], [125, 1, 126, 28, 34, 127, 1, 128], [28, 11], [28, 11, 2], [28, 11, 2, 129], [28, 11, 2, 129, 130], [28, 11, 2, 129, 130, 131], [132, 2], [132, 2, 17], [132, 2, 17, 1], [132, 2, 17, 1, 49], [132, 2, 17, 1, 49, 40], [133, 19], [133, 19, 134], [133, 19, 134, 25], [133, 19, 134, 25, 18], [133, 19, 134, 25, 18, 135], [133, 19, 134, 25, 18, 135, 18], [133, 19, 134, 25, 18, 135, 18, 136], [2, 137], [2, 137, 26], [2, 137, 26, 138], [2, 137, 26, 138, 139], [2, 137, 26, 138, 139, 4], [2, 137, 26, 138, 139, 4, 140], [2, 137, 26, 138, 139, 4, 
140, 55], [2, 137, 26, 138, 139, 4, 140, 55, 141], [142, 143], [142, 143, 144], [142, 143, 144, 1], [142, 143, 144, 1, 2], [142, 143, 144, 1, 2, 145], [142, 143, 144, 1, 2, 145, 146], [142, 143, 144, 1, 2, 145, 146, 147], [5, 148], [5, 148, 149], [5, 148, 149, 12], [5, 148, 149, 12, 9], [5, 148, 149, 12, 9, 10], [19, 11], [19, 11, 150], [19, 11, 150, 5], [19, 11, 150, 5, 151], [19, 11, 150, 5, 151, 8], [19, 11, 150, 5, 151, 8, 152], [19, 11, 150, 5, 151, 8, 152, 153], [5, 25], [5, 25, 2], [5, 25, 2, 154], [5, 25, 2, 154, 4], [5, 25, 2, 154, 4, 3], [5, 25, 2, 154, 4, 3, 155], [156, 1], [156, 1, 6], [156, 1, 6, 157], [156, 1, 6, 157, 158], [156, 1, 6, 157, 158, 56], [156, 1, 6, 157, 158, 56, 159], [1, 51], [1, 51, 57], [1, 51, 57, 2], [1, 51, 57, 2, 160], [1, 51, 57, 2, 160, 8], [1, 51, 57, 2, 160, 8, 3], [1, 51, 57, 2, 160, 8, 3, 161], [1, 51, 57, 2, 160, 8, 3, 161, 1], [1, 51, 57, 2, 160, 8, 3, 161, 1, 3], [1, 51, 57, 2, 160, 8, 3, 161, 1, 3, 162], [163, 164], [163, 164, 165], [163, 164, 165, 2], [163, 164, 165, 2, 17], [163, 164, 165, 2, 17, 23], [163, 164, 165, 2, 17, 23, 5], [163, 164, 165, 2, 17, 23, 5, 166], [163, 164, 165, 2, 17, 23, 5, 166, 12], [163, 164, 165, 2, 17, 23, 5, 166, 12, 52], [58, 20], [58, 20, 167], [58, 20, 167, 168], [58, 20, 167, 168, 2], [58, 20, 167, 168, 2, 169], [58, 20, 167, 168, 2, 169, 170], [58, 20, 167, 168, 2, 169, 170, 171], [7, 6], [7, 6, 35], [7, 6, 35, 29], [7, 6, 35, 29, 30], [7, 6, 35, 29, 30, 12], [7, 6, 35, 29, 30, 12, 172], [7, 6, 35, 29, 30, 12, 172, 173], [36, 37], [36, 37, 38], [36, 37, 38, 7], [36, 37, 38, 7, 9], [36, 37, 38, 7, 9, 10], [29, 39], [29, 39, 30], [29, 39, 30, 6], [29, 39, 30, 6, 35], [29, 39, 30, 6, 35, 31], [29, 39, 30, 6, 35, 31, 4], [29, 39, 30, 6, 35, 31, 4, 59], [29, 39], [29, 39, 30], [29, 39, 30, 13], [29, 39, 30, 13, 174], [29, 39, 30, 13, 174, 175], [29, 39, 30, 13, 174, 175, 12], [29, 39, 30, 13, 174, 175, 12, 5], [29, 39], [29, 39, 30], [29, 39, 30, 6], [29, 39, 30, 6, 35], [29, 39, 30, 6, 35, 31], [29, 39, 30, 6, 35, 31, 4], [29, 39, 30, 6, 35, 31, 4, 59], [36, 37], [36, 37, 38], [36, 37, 38, 7], [36, 37, 38, 7, 9], [36, 37, 38, 7, 9, 10], [14, 15], [14, 15, 32], [14, 15, 32, 1], [14, 15, 32, 1, 6], [14, 15, 32, 1, 6, 15], [14, 15, 32, 1, 6, 15, 4], [14, 15, 32, 1, 6, 15, 4, 22], [6, 15], [6, 15, 32], [6, 15, 32, 1], [6, 15, 32, 1, 14], [6, 15, 32, 1, 14, 15], [6, 15, 32, 1, 14, 15, 4], [6, 15, 32, 1, 14, 15, 4, 22], [14, 15], [14, 15, 32], [14, 15, 32, 1], [14, 15, 32, 1, 6], [14, 15, 32, 1, 6, 15], [14, 15, 32, 1, 6, 15, 4], [14, 15, 32, 1, 6, 15, 4, 22], [36, 37], [36, 37, 38], [36, 37, 38, 7], [36, 37, 38, 7, 9], [36, 37, 38, 7, 9, 10], [24, 11], [24, 11, 5], [24, 11, 5, 50], [24, 11, 5, 50, 1], [24, 11, 5, 50, 1, 2], [24, 11, 5, 50, 1, 2, 17], [24, 11, 5, 50, 1, 2, 17, 19], [24, 11, 5, 50, 1, 2, 17, 19, 11], [24, 11, 5, 50, 1, 2, 17, 19, 11, 176], [1, 58], [1, 58, 5], [1, 58, 5, 177], [1, 58, 5, 177, 4], [1, 58, 5, 177, 4, 178], [1, 58, 5, 177, 4, 178, 1], [1, 58, 5, 177, 4, 178, 1, 179], [20, 60], [20, 60, 180], [20, 60, 180, 181], [20, 60, 180, 181, 182], [20, 60, 180, 181, 182, 183], [20, 60, 180, 181, 182, 183, 184], [61, 16], [61, 16, 185], [61, 16, 185, 62], [61, 16, 185, 62, 186], [61, 16, 185, 62, 186, 63], [61, 16, 185, 62, 186, 63, 187], [61, 16, 185, 62, 186, 63, 187, 188], [189, 190], [189, 190, 64], [189, 190, 64, 1], [189, 190, 64, 1, 191], [189, 190, 64, 1, 191, 192], [189, 190, 64, 1, 191, 192, 193], [194, 7], [194, 7, 27], [194, 7, 27, 195], [194, 7, 27, 195, 1], [194, 7, 27, 195, 1, 196], [194, 
7, 27, 195, 1, 196, 57], [194, 7, 27, 195, 1, 196, 57, 5], [197, 198], [197, 198, 26], [197, 198, 26, 199], [197, 198, 26, 199, 200], [197, 198, 26, 199, 200, 201], [197, 198, 26, 199, 200, 201, 202], [20, 21], [20, 21, 203], [20, 21, 203, 204], [20, 21, 203, 204, 12], [20, 21, 203, 204, 12, 9], [20, 21, 203, 204, 12, 9, 10], [4, 2], [4, 2, 205], [4, 2, 205, 8], [4, 2, 205, 8, 2], [4, 2, 205, 8, 2, 206], [4, 2, 205, 8, 2, 206, 63], [4, 2, 205, 8, 2, 206, 63, 207], [4, 2, 205, 8, 2, 206, 63, 207, 64], [27, 208], [27, 208, 12], [27, 208, 12, 2], [27, 208, 12, 2, 209], [27, 208, 12, 2, 209, 54], [27, 208, 12, 2, 209, 54, 18], [27, 208, 12, 2, 209, 54, 18, 210], [27, 208, 12, 2, 209, 54, 18, 210, 18], [27, 208, 12, 2, 209, 54, 18, 210, 18, 3], [27, 208, 12, 2, 209, 54, 18, 210, 18, 3, 211], [212, 8], [212, 8, 2], [212, 8, 2, 213], [212, 8, 2, 213, 214], [212, 8, 2, 213, 214, 14], [212, 8, 2, 213, 214, 14, 34], [212, 8, 2, 213, 214, 14, 34, 215], [14, 216], [14, 216, 3], [14, 216, 3, 217], [14, 216, 3, 217, 218], [14, 216, 3, 217, 218, 219], [14, 216, 3, 217, 218, 219, 220], [14, 216, 3, 217, 218, 219, 220, 6], [14, 216, 3, 217, 218, 219, 220, 6, 221], [27, 222], [27, 222, 223], [27, 222, 223, 224], [27, 222, 223, 224, 225], [27, 222, 223, 224, 225, 226], [27, 222, 223, 224, 225, 226, 1], [27, 222, 223, 224, 225, 226, 1, 227], [44, 21], [44, 21, 228], [44, 21, 228, 16], [44, 21, 228, 16, 229], [44, 21, 228, 16, 229, 230], [44, 21, 228, 16, 229, 230, 231], [44, 21, 228, 16, 229, 230, 231, 32], [44, 21, 228, 16, 229, 230, 231, 32, 232], [44, 21, 228, 16, 229, 230, 231, 32, 232, 2], [44, 21, 228, 16, 229, 230, 231, 32, 232, 2, 45], [233, 2], [233, 2, 41], [233, 2, 41, 62], [233, 2, 41, 62, 65], [233, 2, 41, 62, 65, 234], [233, 2, 41, 62, 65, 234, 2], [233, 2, 41, 62, 65, 234, 2, 235], [1, 236], [1, 236, 5], [1, 236, 5, 2], [1, 236, 5, 2, 237], [1, 236, 5, 2, 237, 12], [1, 236, 5, 2, 237, 12, 9], [1, 236, 5, 2, 237, 12, 9, 10], [24, 238], [24, 238, 24], [24, 238, 24, 239], [24, 238, 24, 239, 240], [24, 238, 24, 239, 240, 28], [24, 238, 24, 239, 240, 28, 11], [24, 238, 24, 239, 240, 28, 11, 241], [46, 23], [46, 23, 3], [46, 23, 3, 242], [46, 23, 3, 242, 65], [46, 23, 3, 242, 65, 243], [46, 23, 3, 242, 65, 243, 244], [46, 23, 3, 242, 65, 243, 244, 245], [6, 53], [6, 53, 246], [6, 53, 246, 13], [6, 53, 246, 13, 16], [6, 53, 246, 13, 16, 247], [1, 248], [1, 248, 31], [1, 248, 31, 3], [1, 248, 31, 3, 249], [1, 248, 31, 3, 249, 250], [55, 251], [55, 251, 2], [55, 251, 2, 252], [55, 251, 2, 252, 34], [55, 251, 2, 252, 34, 253], [55, 251, 2, 252, 34, 253, 254], [55, 251, 2, 252, 34, 253, 254, 255], [19, 256], [19, 256, 31], [19, 256, 31, 16], [19, 256, 31, 16, 257], [19, 256, 31, 16, 257, 258], [19, 256, 31, 16, 257, 258, 259], [19, 256, 31, 16, 257, 258, 259, 1], [19, 256, 31, 16, 257, 258, 259, 1, 5], [2, 17], [2, 17, 4], [2, 17, 4, 56], [2, 17, 4, 56, 260], [2, 17, 4, 56, 260, 19], [2, 17, 4, 56, 260, 19, 23], [2, 17, 4, 56, 260, 19, 23, 5], [2, 17, 4, 56, 260, 19, 23, 5, 261], [1, 26], [1, 26, 61], [1, 26, 61, 60], [1, 26, 61, 60, 262], [1, 26, 61, 60, 262, 13], [1, 26, 61, 60, 262, 13, 9], [1, 26, 61, 60, 262, 13, 9, 10]]
Find the length of the longest sequence, then pad every sequence to that length; padding='pre' puts the zero padding at the front.
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
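As a quick, toy illustration of what padding='pre' does (not part of the pipeline; [4, 2, 66] is "in the town" from the first line):

from tensorflow.keras.preprocessing.sequence import pad_sequences
# pre-padding left-fills zeros up to maxlen
print(pad_sequences([[4, 2, 66]], maxlen=6, padding='pre'))
# [[ 0  0  0  4  2 66]]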
The last token of each sequence becomes the label and the preceding tokens the input sample; the labels are then one-hot encoded.
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
Print one example:
print(xs[5])
print(ys[5])
[ 0 0 0 0 4 2 66 8 67 68]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
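As a sanity check, the single 1 in ys[5] sits at index 69, which maps back to the word that follows "...athy one" in the lyric (index_word is the tokenizer's built-in reverse mapping):

print(np.argmax(ys[5]))          # 69, the label index
print(tokenizer.index_word[69])  # jeremy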
5. Build and train the network
Embedding was introduced before; see the earlier post "tensorflow实现循环神经网络".
The LSTM layer's parameters are as follows (in TF 2.x the layer lives at tf.keras.layers.LSTM, and recurrent_activation defaults to 'sigmoid' rather than the older 'hard_sigmoid'):
tf.keras.layers.LSTM(
    units, activation='tanh', recurrent_activation='sigmoid', use_bias=True,
    kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
    bias_initializer='zeros', unit_forget_bias=True,
    kernel_regularizer=None, recurrent_regularizer=None,
    bias_regularizer=None, activity_regularizer=None,
    kernel_constraint=None, recurrent_constraint=None,
    bias_constraint=None, dropout=0.0, recurrent_dropout=0.0,
    return_sequences=False, return_state=False, go_backwards=False,
    stateful=False, time_major=False, unroll=False)
- units: dimensionality of the output
- input_dim: dimensionality of the input; specify it (or the equivalent input_shape) when this layer is the first layer of a model
- return_sequences: boolean, default False; if True, return the full output sequence, otherwise only the last output of the sequence
- input_length: the length of the input sequences, when that length is fixed. It must be given if you want to connect a Flatten and then a Dense layer after this one, because otherwise the shape of the dense output cannot be computed.
Input shape
A 3D tensor of shape (samples, timesteps, input_dim).
Output shape
If return_sequences=True, a 3D tensor of shape (samples, timesteps, output_dim); otherwise, a 2D tensor of shape (samples, output_dim).
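A quick way to verify these shapes yourself (illustrative dimensions, not the ones used later):

import tensorflow as tf
x = tf.random.normal((2, 5, 8))  # (samples, timesteps, input_dim)
print(tf.keras.layers.LSTM(20)(x).shape)                         # (2, 20)
print(tf.keras.layers.LSTM(20, return_sequences=True)(x).shape)  # (2, 5, 20)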
The code is as follows:
# Embedding(total_words, 64, input_length=max_sequence_len-1): the embedding dimension is 64;
# the input length is max_sequence_len-1 because the last token of every sequence was split off as the label
# Bidirectional(LSTM(20)): a bidirectional LSTM whose per-direction output dimension is 20
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(20)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(xs, ys, epochs=500, verbose=1)
The network structure is shown below; because the LSTM is bidirectional, the recurrent layer's output dimension is 40 (2 × 20).
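To reproduce the structure printout yourself:

model.summary()
# the Bidirectional(LSTM(20)) row shows output shape (None, 40): the forward and backward 20-unit outputs are concatenated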
Training results:
6. Plot acc and loss
import matplotlib.pyplot as plt

acc = history.history['accuracy']
loss = history.history['loss']
epochs = range(len(acc))  # number of epochs

# Plot training accuracy and training loss per epoch
plt.plot(epochs, acc)
plt.title('Training accuracy')
plt.figure()
plt.plot(epochs, loss)
plt.title('Training loss')
plt.show()
7. Generate new sentences with the trained model
Here we give a seed_text and let the model generate the next 30 words one at a time.
predicted is a 1×263 vector; entry i is the predicted probability that the next word is the word with index i.
Example:
[[8.50096421e-06 7.16286252e-07 5.50950803e-02 1.15030166e-02
4.10492672e-03 6.49216175e-02 1.59416499e-03 5.00726055e-05
2.89412728e-05 1.50592905e-05 6.30908689e-05 4.15113143e-04
1.18053191e-04 3.42939282e-04 3.32375166e-05 1.08170134e-04
1.19209439e-01 7.48806167e-04 1.81267442e-05 1.46877894e-03
3.19624989e-04 7.57356174e-04 8.08515506e-06 3.29396166e-02
8.31917132e-05 2.32121465e-03 3.91389476e-04 5.43151582e-05
8.00796144e-04 6.18991209e-03 4.91656642e-03 1.11376476e-02
7.95460510e-05 3.45665576e-05 1.08460328e-02 1.60136464e-04
9.02625106e-06 4.37607559e-06 4.61141963e-06 4.98679256e-05
2.21002251e-01 3.90422065e-05 5.24130974e-06 6.26474844e-07
2.82854948e-04 2.35293337e-05 1.23894906e-05 8.20880232e-05
5.83870336e-03 8.25720467e-03 1.38092749e-02 4.37006215e-03
5.87564500e-06 8.01434071e-05 5.22225164e-03 6.49573121e-06
5.18256391e-04 8.38314882e-05 7.53371092e-03 1.04579849e-04
5.18555229e-04 2.37061431e-05 8.74641119e-05 1.45962422e-05
3.94322633e-05 1.88900856e-06 3.36848403e-04 1.57946633e-04
1.60712693e-06 1.35742454e-07 5.36794460e-07 1.05729132e-05
3.47209658e-04 7.25795473e-07 2.66393545e-05 1.58166557e-04
2.78033805e-03 7.80517198e-08 4.84984639e-06 3.50487608e-06
3.98596711e-08 6.86123849e-06 7.31361390e-04 4.85728742e-05
2.84225308e-07 1.14155619e-05 1.33413587e-05 1.01249555e-07
1.06525258e-05 5.03363321e-03 2.94635754e-07 2.55679737e-07
5.95104848e-06 2.57208876e-06 3.27769817e-06 1.66623659e-06
3.10719912e-07 2.39030010e-06 4.42960882e-04 1.28107949e-05
3.82946059e-02 3.91261819e-05 1.76561749e-04 2.66154075e-07
3.50948517e-06 1.89449487e-08 2.82306501e-05 1.18025773e-05
1.38865988e-04 2.08469169e-06 1.04158971e-05 2.73571804e-05
2.49022605e-06 2.56342491e-06 5.91095159e-06 4.15685754e-05
2.78132734e-06 5.66080814e-07 5.74803153e-06 1.37921313e-06
4.89223282e-07 3.67819972e-04 6.07045695e-05 4.02224250e-05
4.01963276e-04 7.05351795e-06 5.55546314e-04 1.53641668e-06
3.33369448e-04 7.94432071e-06 1.02056529e-05 5.03903766e-06
6.76978198e-06 8.31940179e-06 2.98991479e-04 5.95539495e-05
2.35950385e-04 4.95159620e-05 7.09227209e-07 1.40506359e-07
3.28283932e-05 1.03509464e-07 6.92275944e-06 2.81599467e-04
6.26122210e-06 6.21493791e-06 1.68048098e-06 2.87782532e-05
9.04235814e-04 1.66400729e-04 1.56990043e-03 1.16822694e-03
1.08010048e-04 2.25603549e-06 1.59127827e-04 1.24626479e-06
8.17185992e-06 1.61629487e-05 4.49765321e-05 4.49585241e-05
8.16381362e-06 8.76086972e-07 2.29440019e-07 8.08453024e-06
2.05958917e-04 2.24268320e-03 1.57039077e-03 9.68437744e-05
1.97600573e-04 1.06652014e-05 1.37939071e-06 6.86154294e-07
1.39618723e-03 2.54979695e-05 2.82213911e-02 1.64733123e-04
1.49078378e-05 6.10939553e-03 3.04850773e-03 1.72930805e-03
1.33842159e-05 3.20463405e-05 5.78149593e-06 4.00839563e-05
8.28964676e-06 3.46615775e-06 1.80110226e-06 1.02997528e-05
4.62294323e-04 9.91126126e-06 2.95325535e-05 5.52704267e-04
1.67603403e-07 1.02480868e-07 8.10750771e-06 2.80657230e-04
6.29159342e-03 6.17944215e-06 2.45039155e-06 1.71662862e-06
2.71430906e-07 1.80769888e-07 7.94417610e-06 1.03276817e-03
1.64518246e-06 1.73040928e-04 8.52121593e-06 2.89383593e-06
8.28177705e-02 3.55319607e-05 3.85063286e-05 1.44227999e-06
6.62803723e-06 6.87699066e-05 6.11362106e-04 1.85374950e-06
3.88421991e-04 1.92207208e-05 8.83871820e-08 3.83531301e-07
1.61578532e-06 8.90478077e-06 1.48065314e-01 2.73840669e-05
2.88753826e-02 7.55408546e-05 3.08748822e-05 2.90845419e-05
1.38593838e-04 1.00296552e-06 3.13110809e-06 1.60839281e-05
5.44386603e-06 8.04555748e-06 2.39647534e-07 6.58317504e-07
7.95926992e-03 8.48773681e-03 3.52568110e-04 2.24254516e-04
5.57300064e-06 6.52001700e-06 5.77207793e-05 1.38716402e-07
1.26101611e-07 2.81371655e-07 6.51019354e-06 4.04193344e-07
7.13499030e-03 3.00459396e-06 3.42631978e-07 2.97395140e-03
5.54446946e-04 1.13440816e-04 6.67652557e-06 1.68618179e-04
1.60850340e-03 1.09789397e-07 1.50303845e-06 8.38445885e-06
3.51204799e-05 7.89098121e-05 8.75113710e-06]]
seed_text = "Laurence went to dublin"
next_words = 30

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict(token_list, verbose=0)
    predicted2 = np.argmax(predicted, axis=1)[0]
    output_word = ""
    # match the predicted index against the tokenizer's word-index pairs;
    # once found, append that word to the running sentence
    for word, index in tokenizer.word_index.items():
        if index == predicted2:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)
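As an aside, the inner matching loop is not strictly necessary: the tokenizer keeps the reverse mapping in tokenizer.index_word, so the lookup can be a single line:

# equivalent lookup via the tokenizer's built-in reverse index
output_word = tokenizer.index_word.get(int(predicted2), "")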
Notes:
- 1. TF 2.6, the version used here, no longer provides the predict_classes method.
- 2. The difference between predict_classes and predict:
  predict() returns, for each sample, the probability of it belonging to each class.
  predict_classes() returns the class index, i.e. the label of the class the sample belongs to.
- 3. So numpy.argmax() is applied to the output of predict() to take the class with the highest probability as the predicted label.
  numpy.argmax(array, axis) returns the indices of the maximum values along the given axis.
  For a 1-D array no axis is needed; it simply returns the index of the maximum.
  For a 2-D array, axis must be given: axis=0 returns the index of each column's maximum,
  and axis=1 returns the index of each row's maximum (a small demo follows below).
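A small demo of the axis argument (toy array, for illustration only):

import numpy as np

a = np.array([[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2]])
print(np.argmax(a, axis=0))  # [1 0 0] -> row index of each column's maximum
print(np.argmax(a, axis=1))  # [1 0]   -> column index of each row's maximum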
The output is:
Laurence went to dublin away til youd think the would fall might ask glisten might ask ask might ask ask might ask ask might ask ask might ask ask might ask ask might ask
You can see that the first few predicted words are decent and carry some logic, but further along the model's output degenerates into repetition; this is largely because the dataset is so small.