cp16_2_Modeling Sequential Data_RNNs_Bidirectional LSTM_gz_colab_gpu_movie_eager_prefetchDataset_text generalization_Self-Attention

This post covers applying RNNs and LSTMs to a sentiment-analysis task and implements a complete text-generation project. It also takes a closer look at the Transformer model and its core mechanism, self-attention.

cp16_Model Sequential_Output_Hidden_Recurrent NNs_LSTM_aclImdb_IMDb_Embed_token_py_function_GRU_Gate:
https://blog.youkuaiyun.com/Linli522362242/article/details/113846940

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Dense
from tensorflow.keras.layers import GRU

model = Sequential()
model.add( Embedding(10000, 32) )           # 10,000-word vocabulary, 32-dim word vectors
model.add( GRU(32, return_sequences=True) ) # return the full sequence so the next GRU receives 3D input
model.add( GRU(32) )                        # many-to-one: only the final hidden state is returned
model.add( Dense(1) )
model.summary()
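
As a quick sanity check of the shape flow through the stacked GRUs, the model can be called on a dummy integer batch (a minimal sketch; the batch size and sequence length are made up):

import tensorflow as tf

# hypothetical batch: 4 sequences of 100 token IDs from the 10,000-word vocabulary
x = tf.random.uniform((4, 100), maxval=10000, dtype=tf.int32)
# Embedding:                  (4, 100)     -> (4, 100, 32)
# GRU(return_sequences=True): (4, 100, 32) -> (4, 100, 32)
# GRU:                        (4, 100, 32) -> (4, 32)
# Dense(1):                   (4, 32)      -> (4, 1)
print(model(x).shape)  # expected: (4, 1)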

Building an RNN model for the sentiment analysis task

     Since we have very long sequences, we are going to use an LSTM (long short-term memory) layer to account for long-term effects. In addition, we will put the LSTM layer inside a Bidirectional wrapper, which makes the recurrent layer pass over the input sequences in both directions, from start to end as well as in the reverse direction:

     A bidirectional LSTM, or biLSTM, is a sequence-processing model that consists of two LSTMs: one taking the input in a forward direction, and the other in a backward direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm (e.g., knowing which words immediately follow and precede a word in a sentence).

The Bidirectional layer wrapper provides the implementation of bidirectional LSTMs in Keras:

tf.keras.layers.Bidirectional(
    layer, merge_mode="concat", weights=None, backward_layer=None, **kwargs
)

Bidirectional wrapper for RNNs.

Arguments

  • layer: a keras.layers.RNN instance, such as keras.layers.LSTM or keras.layers.GRU. It could also be a keras.layers.Layer instance that meets the following criteria:
    1. Be a sequence-processing layer (accepts 3D+ inputs).
    2. Have go_backwards, return_sequences, and return_state attributes (with the same semantics as for the RNN class).
    3. Have an input_spec attribute.
    4. Implement serialization via get_config() and from_config(). Note that the recommended way to create new RNN layers is to write a custom RNN cell and use it with keras.layers.RNN, instead of subclassing keras.layers.Layer directly.
  • merge_mode: Mode by which outputs of the forward and backward RNNs will be combined. One of {'sum', 'mul', 'concat', 'ave', None}. If None, the outputs will not be combined, they will be returned as a list. Default value is 'concat'.
  • backward_layer: Optional keras.layers.RNN, or keras.layers.Layer instance to be used to handle backwards input processing. If backward_layer is not provided, the layer instance passed as the layer argument will be used to generate the backward layer automatically. Note that the provided backward_layer layer should have properties matching those of the layer argument; in particular, it should have the same values for stateful, return_state, return_sequences, etc. In addition, backward_layer and layer should have different go_backwards argument values. A ValueError will be raised if these requirements are not met (see the sketch below).
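
For completeness, a minimal sketch of supplying an explicit backward_layer (the sizes are made up; note the flipped go_backwards on the backward layer):

import tensorflow as tf

forward = tf.keras.layers.LSTM(16, return_sequences=True)
backward = tf.keras.layers.LSTM(16, return_sequences=True, go_backwards=True)
bi = tf.keras.layers.Bidirectional(forward, backward_layer=backward)
# default merge_mode='concat': 16 forward + 16 backward units = 32
print(bi(tf.random.normal((2, 10, 8))).shape)  # (2, 10, 32)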

It takes a recurrent layer (the first LSTM layer) as an argument, and you can also specify the merge mode, which describes how the forward and backward outputs should be merged before being passed on to the next layer. The options are (see the sketch after this list):

 'sum': The results are added together.

 'mul': The results are multiplied together.

 'concat' (the default): The results are concatenated together, providing double the number of outputs to the next layer.

 'ave': The average of the results is taken.
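
A minimal sketch (with assumed batch size, sequence length, and feature sizes) showing how each merge_mode affects the wrapper's output shape:

import tensorflow as tf

inputs = tf.random.normal((2, 10, 8))  # (batch, timesteps, features), made-up values

for mode in ['concat', 'sum', 'mul', 'ave']:
    bi = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16), merge_mode=mode)
    print(mode, bi(inputs).shape)  # 'concat' -> (2, 32); the others -> (2, 16)

# merge_mode=None returns the two directions separately as a list of (2, 16) tensors
fwd, bwd = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16), merge_mode=None)(inputs)
print(fwd.shape, bwd.shape)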

###################
 embedding_dim:

###################

embedding_dim = 20
vocab_size = len( token_counts ) + 2 # Vocab-size: 87007

tf.random.set_seed(1)

# build the model
bi_lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim = vocab_size, # n+2
        output_dim = embedding_dim, #use a vector of length=embedding_dim to represent each word
        name = 'embed-layer'
    ),# Output Shape ==> (None_batch_size, None_each_input_length, output_dim=20)
    
    tf.keras.layers.Bidirectional(
        # lstm-layer:
        # return_sequences=False == many-to-one: (None_each_input_length, output_dim=64)==>(64)
        tf.keras.layers.LSTM(64, name='lstm-layer'),
        name = 'bidir-lstm', # default merge_mode='concat' ==> (64)==>(128) 
    ),# Output Shape ==> (None_batch_size, output_dim=128) 
    
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

bi_lstm_model.summary()

(None_batch_size, None_each_input_length, output_dim=20) ==> 2 x LSTM(return_sequences=False) ==> 2 x (None_batch_size, 64) ==> Bidirectional(merge_mode='concat') ==> (None_batch_size, 128)
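
This shape flow can be verified by calling the (still untrained) model on a dummy batch of token IDs (a minimal sketch; the batch size and sequence length are made up):

import tensorflow as tf

dummy = tf.random.uniform((8, 120), maxval=vocab_size, dtype=tf.int32)  # 8 reviews, 120 tokens each
emb_out = bi_lstm_model.layers[0](dummy)   # (8, 120, 20)
bi_out  = bi_lstm_model.layers[1](emb_out) # (8, 128): 64 forward + 64 backward, concatenated
print(emb_out.shape, bi_out.shape, bi_lstm_model(dummy).shape)  # final: (8, 1)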

# compile and train:
bi_lstm_model.compile(
    optimizer = tf.keras.optimizers.Adam(1e-3),
    loss = tf.keras.losses.BinaryCrossentropy( from_logits=False),
    metrics=['accuracy']
)

history = bi_lstm_model.fit(
    train_data,
    validation_data = valid_data,
    epochs=10
)

... ... 

##################################################################
Since my computer runs the previous code very slowly, I use Colab.

Left figure: Colab without a GPU; right figure: my computer. A further figure shows Colab with a GPU.

To access files stored on Google Drive from Colab, authenticate with PyDrive first:

!pip install -U -q PyDrive 
  
from pydrive.auth import GoogleAuth 
from pydrive.drive import GoogleDrive 
from google.colab import auth 
from oauth2client.client import GoogleCredentials 
  
  
# Authenticate and create the PyDrive client. 
auth.authenticate_user() 
gauth = GoogleAuth() 
gauth.credentials = GoogleCredentials.get_application_default() 
drive = GoogleDrive(gauth)


##############################################
link: https://drive.google.com/file/d/1gClJCGP-l3Byp2y1p6lEGbq3q2rIKMZg/view?usp=sharing

OR https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch08/movie_data.csv.gz

!wget https://drive.google.com/file/d/1gClJCGP-l3Byp2y1p6lEGbq3q2rIKMZg/view?usp=sharing

wget on a share link like this retrieves the HTML viewer page rather than the actual file, so I use the following code instead:

!pip install PyDrive googledrivedownloader

from google_drive_downloader import GoogleDriveDownloader

link='https://drive.google.com/file/d/1gClJCGP-l3Byp2y1p6lEGbq3q2rIKMZg/view?usp=sharing'
# Mistake: file_id is given the whole share link instead of just the ID part
# ('1gClJCGP-l3Byp2y1p6lEGbq3q2rIKMZg'), which triggers the warning below
GoogleDriveDownloader.download_file_from_google_drive(file_id=link,
                                                      dest_path="./movie_data.csv.gz",
                                                      unzip=True)


Unzipping...
/usr/local/lib/python3.7/dist-packages/google_drive_downloader/google_drive_downloader.py:78: UserWarning: Ignoring `unzip` since "https://drive.google.com/file/d/1gClJCGP-l3Byp2y1p6lEGbq3q2rIKMZg/view?usp=sharing" does not look like a valid zip file warnings.warn('Ignoring `unzip` since "{}" does not look like a valid zip file'.format(file_id))

import os
import tarfile

# this fails: the downloaded file is not a real gzip archive
if not os.path.isfile('movie_data.csv'):
    with tarfile.open('movie_data.csv.gz', 'r:gz') as tar:
        tar.extractall()

ReadError: not a gzip file

import os
import gzip
import shutil
import pandas as pd

with gzip.open('movie_data.csv.gz', 'rb') as f_in, open ('movie_data.csv', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)


##############################################

Solution: download the movie_data.csv.gz file, unzip it with your own tools (e.g., WinRAR or 7-Zip), and upload the extracted movie_data.csv to Google Drive.

 

link = 'https://drive.google.com/file/d/17AYKouLCv7oOxloZ52COW0kEpVLRZikS/view?usp=sharing'
 
import pandas as pd 
  
# to get the id part of the file
# link.split("/")
# ['https:',
#  '',
#  'drive.google.com',
#  'file',
#  'd',
#  '17AYKouLCv7oOxloZ52COW0kEpVLRZikS',
#  'view?usp=sharing'] 
id = link.split("/")[-2] #==>'17AYKouLCv7oOxloZ52COW0kEpVLRZikS'
  
downloaded = drive.CreateFile({'id':id})  
downloaded.GetContentFile('movie_data.csv')   
  
df = pd.read_csv('movie_data.csv', encoding='utf-8') 
df.tail()

import tensorflow as tf

# Step 1: Create a dataset

target = df.pop('sentiment') # a pandas Series holding the labels
ds_raw = tf.data.Dataset.from_tensor_slices(
    (df.values, # array([ [...string...], [...string...], ...]
     target.values)
)# <TensorSliceDataset shapes: ((1,), ()), types: (tf.string, tf.int64)>

## inspection:
for item in ds_raw.take(3):
    # item[0].numpy() : array([...string...])
    tf.print( item[0].numpy()[0][:50], item[1] )

An alternative way to get the dataset: using tensorflow_datasets


import tensorflow_datasets as tfds

imdb_bldr = tfds.builder('imdb_reviews')
print(imdb_bldr.info)

imdb_bldr.download_and_prepare()

datasets = imdb_bldr.as_dataset(shuffle_files=False)

datasets.keys()

imdb_test = datasets['test']   # 25000,
imdb_train_valid = datasets['train'] # 25000,
    
# ## inspection:
for item in imdb_train_valid.take(1):
     tf.print( item )

 

prefetchDataset

imdb_test = datasets['test']   # 25000,
imdb_train_valid = datasets['train'] # 25000,

# tf.random.set_seed(1)

# ds_raw = imdb_train_valid.take(25000).shuffle(
#     25000, reshuffle_each_iteration=False
# ) # 25000 <== 0~24999

import numpy as np

# convert a tfds dataset of {'text', 'label'} dicts into a (text, label) tuple dataset
def transform_prefetchDataset(prefetchDataset):
    textList=[]
    labelList=[]
    for example in prefetchDataset:
        textList.append( example['text'].numpy() )
        labelList.append( example['label'].numpy() )
    textArr=np.array(textList)
    labelArr=np.array(labelList)
    
    return tf.data.Dataset.from_tensor_slices( ( textArr[..., np.newaxis],
                                                 labelArr
                                               ) )
ds_raw_train_valid = transform_prefetchDataset(imdb_train_valid)
ds_raw_test = transform_prefetchDataset(imdb_test)

ds_raw_train = ds_raw_train_valid.take(20000)
ds_raw_valid = ds_raw_train_valid.skip(20000)

# ## inspection:
for item in ds_raw_train.take(3):
     tf.print( item[0].numpy()[0][:50], item[1] )


###############################################

tf.random.set_seed(1)

ds_raw = ds_raw.shuffle(
    50000, reshuffle_each_iteration=False
) # 50000 <== 0~49999

ds_raw_test = ds_raw.take(25000)
ds_raw_train_valid = ds_raw.skip(25000)
ds_raw_train = ds_raw_train_valid.take(20000)
ds_raw_valid = ds_raw_train_valid.skip(20000)

# Step 2: find unique tokens  (words)

from collections import Counter
import tensorflow_datasets as tfds

token_counts = Counter()
tokenizer = tfds.features.text.Tokenizer() ##########################

for example in ds_raw_train:
    tokens = tokenizer.tokenize(example[0].numpy()[0])#numpy()[0] get first element in arr
    token_counts.update(tokens)

print('Vocab-size:', len(token_counts))


AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'
###############################################
Solution: tokenizer = tfds.deprecated.text.Tokenizer()

tf.random.set_seed(1)

ds_raw = ds_raw.shuffle(
    50000, reshuffle_each_iteration=False
) # 50000 <== 0~49999

ds_raw_test = ds_raw.take(25000)
ds_raw_train_valid = ds_raw.skip(25000)
ds_raw_train = ds_raw_train_valid.take(20000)
ds_raw_valid = ds_raw_train_valid.skip(20000)

# Step 2: find unique tokens  (words)

from collections import Counter
import tensorflow_datasets as tfds

token_counts = Counter()
tokenizer = tfds.deprecated.text.Tokenizer()

for example in ds_raw_train:
    tokens = tokenizer.tokenize(example[0].numpy()[0])#numpy()[0] get first element in arr
    token_counts.update(tokens)

print('Vocab-size:', len(token_counts))

The printed vocabulary size is different from the earlier Vocab-size: 87007, because the data has now been drawn from the shuffle buffer twice, so the training split is no longer the same.
Solution: re-run the code from
  link = 'https://drive.google.com/file/d/17AYKouLCv7oOxloZ52COW0kEpVLRZikS/view?usp=sharing'

# Step 3: encoding each unique token into integers
######encoder = tfds.features.text.TokenTextEncoder( token_counts )
encoder = tfds.deprecated.text.TokenTextEncoder( token_counts )
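
A quick sanity check of the encoder on a made-up sentence (a minimal sketch; tokens found in token_counts are mapped to their vocabulary indices, unknown tokens to an out-of-vocabulary bucket):

example_str = 'This is an example'
print(encoder.encode(example_str))  # a list of one integer token ID per token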

# Step 3-A: define the function for transformation

# function will treat the input tensors as if the eager execution mode is enabled
def encode(text_tensor, label):
    text = text_tensor.numpy()[0]
    encoded_text = encoder.encode(text) # encoder = tfds.deprecated.text.TokenTextEncoder( token_counts )
    return encoded_text, label

# Step 3-B: wrap the encode function to a TF Op that executes it eagerly
def encode_map_fn( text, label ):
    return tf.py_function( encode, inp=[text, label],
                                  Tout=(tf.int64, tf.int64))
    
ds_train = ds_raw_train.map( encode_map_fn ) # during mapping:  the eager execution will be disabled
ds_valid = ds_raw_valid.map( encode_map_fn ) # so wrap the encode function to a TF operator that executes it eagerly
ds_test = ds_raw_test.map( encode_map_fn )

tf.random.set_seed(1)
for example in ds_train.shuffle(1000).take(5):
    print('Sequence length:', example[0].shape)

example

# batching the datasets
train_data = ds_train.padded_batch(
    32, padded_shapes=([-1], # -1: pad each text sequence to the longest one in its batch of 32
                       [])   # labels are scalars, so nothing to pad
)

valid_data = ds_valid.padded_batch(
    32, padded_shapes=([-1],
                       [])
)

test_data = ds_test.padded_batch(
    32, padded_shapes=([-1],
                       [])
)
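
To make the effect of padded_batch concrete, here is a minimal sketch on a made-up toy dataset of variable-length sequences:

import tensorflow as tf

# toy dataset: three integer sequences of different lengths, with dummy labels
toy = tf.data.Dataset.from_generator(
    lambda: iter([([1, 2, 3], 0), ([4, 5], 1), ([6], 0)]),
    output_signature=(tf.TensorSpec(shape=(None,), dtype=tf.int64),
                      tf.TensorSpec(shape=(), dtype=tf.int64)))

for seqs, labels in toy.padded_batch(3, padded_shapes=([-1], [])):
    print(seqs.numpy())   # [[1 2 3] [4 5 0] [6 0 0]] -- zero-padded to the longest length
    print(labels.numpy()) # [0 1 0]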

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Dense

embedding_dim = 20
vocab_size = len( token_counts ) + 2 # Vocab-size: 87007

tf.random.set_seed(1)

# build the model
bi_lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim = vocab_size, # n+2
        output_dim = embedding_dim, #use a vector of length=embedding_dim to represent each word
        name = 'embed-layer'
    ),# Output Shape ==> (None_batch_size, None_each_input_length, output_dim=20)
    
    tf.keras.layers.Bidirectional(
        # lstm-layer:
        # return_sequences=False == many-to-one: (None_each_input_length, output_dim=64)==>(64)
        tf.keras.layers.LSTM(64, name='lstm-layer'),
        name = 'bidir-lstm', # default merge_mode='concat' ==> (64)==>(128) 
    ),# Output Shape ==> (None_batch_size, output_dim=128) 
    
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

bi_lstm_model.summary()

 

# compile and train:
bi_lstm_model.compile(
    optimizer = tf.keras.optimizers.Adam(1e-3),
    loss = tf.keras.losses.BinaryCrossentropy( from_logits=False),
    metrics=['accuracy']
)

history = bi_lstm_model.fit(
    train_data,
    validation_data = valid_data,
    epochs=10
)

# evaluate on the test data
test_results = bi_lstm_model.evaluate( test_data )
print( 'Test Acc.: {:.2f}%'.format(test_results[1]*100) )

 

     After training this model for 10 epochs, evaluation on the test data shows 82 percent accuracy. (Note that this result is not the best when compared with state-of-the-art methods on the IMDb dataset; the goal was simply to show how an RNN works.)

# if not os.path.exists('models'):
#     os.mkdir('models')
!mkdir models
    
bi_lstm_model.save('models/Birdir-LSTM-full-length-seq.h5')

from google.colab import drive
drive.mount('/content/gdrive')
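
Mounting Drive here presumably serves to persist the saved model across Colab sessions; a minimal sketch (the destination folder on Drive is an assumption):

import shutil

# copy the saved model into the mounted Drive (hypothetical destination path)
shutil.copy('models/Birdir-LSTM-full-length-seq.h5',
            '/content/gdrive/My Drive/Birdir-LSTM-full-length-seq.h5')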

