I. Summary
One-sentence summary:
【Convert text into a processable format】: Convert raw text into a format that a neural network can process.
【The Embedding layer of a Keras model】: Use the Embedding layer of a Keras model to learn task-specific token embeddings.
【Pre-trained word embeddings boost small NLP problems】: Use pre-trained word embeddings to get an extra performance boost on small natural-language-processing problems.
1. When should you use pre-trained word embeddings?
【Very little training data available】: Sometimes so little training data is available that the data at hand alone is not enough to learn a task-specific word embedding.
2. Is the rationale behind using pre-trained word embeddings in NLP the same as for using pre-trained convnets in image classification?
【The features are generic】: You don't have enough data to learn truly powerful features on your own, but the features you need are expected to be fairly generic, such as common visual features or semantic features.
3. How are pre-trained word embeddings computed?
【Word-occurrence statistics】: Pre-trained word embeddings are generally computed from word-occurrence statistics (observing which words co-occur in sentences or documents), using a variety of techniques, some involving neural networks and others not.
4. What is notable about the Word2Vec word-embedding algorithm?
【Its dimensions capture specific semantic properties】: The Word2Vec algorithm, developed by Tomas Mikolov at Google in 2013, produces embedding dimensions that capture specific semantic properties, such as gender.
II. Word embedding: using pre-trained word embeddings
Video location for the corresponding course lesson:
Using pre-trained word embeddings
Sometimes, you have so little training data available that you could never use your data alone to learn an appropriate task-specific embedding of your vocabulary. What to do then?
Instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a pre-computed embedding space that is known to be highly structured and to exhibit useful properties -- one that captures generic aspects of language structure. The rationale behind using pre-trained word embeddings in natural language processing is very much the same as for using pre-trained convnets in image classification: we don't have enough data available to learn truly powerful features on our own, but we expect the features that we need to be fairly generic, i.e. common visual features or semantic features. In this case it makes sense to reuse features learned on a different problem.
Such word embeddings are generally computed using word occurrence statistics (observations about what words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started really taking off in research and industry applications after the release of one of the most famous and successful word embedding schemes: the Word2Vec algorithm, developed by Tomas Mikolov at Google in 2013. Word2Vec dimensions capture specific semantic properties, e.g. gender.
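As a quick illustration of that structure (this snippet is not part of the original notebook), the sketch below uses the gensim library to query a pre-trained Word2Vec space for the classic "king - man + woman ≈ queen" analogy; the vector file name and its local path are assumptions you would adapt to your own setup.

# Minimal sketch, assuming gensim is installed and a local copy of the
# GoogleNews Word2Vec vectors (hypothetical path below) has been downloaded.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# In a well-structured embedding space, "king" - "man" + "woman" lands near "queen".
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))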
There are various pre-computed databases of word embeddings that you can download and start using in a Keras Embedding layer. Word2Vec is one of them. Another popular one is called "GloVe", developed by Stanford researchers in 2014. It stands for "Global Vectors for Word Representation", and it is an embedding technique based on factorizing a matrix of word co-occurrence statistics. Its developers have made available pre-computed embeddings for millions of English tokens, obtained from Wikipedia data or from Common Crawl data.
Let's take a look at how you can get started using GloVe embeddings in a Keras model. The same method will of course be valid for Word2Vec embeddings or any other word embedding database that you can download. We will also use this example to refresh the text tokenization techniques we introduced a few paragraphs ago: we will start from raw text, and work our way up.
Putting it all together: from raw text to word embeddings
We will be using a model similar to the one we just went over -- embedding sentences in sequences of vectors, flattening them and training a Dense layer on top. But we will do it using pre-trained word embeddings, and instead of using the pre-tokenized IMDB data packaged in Keras, we will start from scratch, by downloading the original text data.
Download the IMDB data as raw text
First, head to http://ai.stanford.edu/~amaas/data/sentiment/ and download the raw IMDB dataset (if the URL isn't working anymore, just Google "IMDB dataset"). Uncompress it.
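If you prefer to script the download instead of doing it by hand, here is a minimal sketch using only the standard library; the archive name reflects what the page currently serves, and the destination directory is an arbitrary assumption.

# Sketch: download and unpack the raw IMDB archive (Python 3 standard library only).
import os
import tarfile
import urllib.request

url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
archive_path = '/home/ubuntu/data/aclImdb_v1.tar.gz'  # assumed location, adjust as needed

if not os.path.exists(archive_path):
    urllib.request.urlretrieve(url, archive_path)

# Extracting next to the archive creates the aclImdb/ directory used below.
with tarfile.open(archive_path, 'r:gz') as archive:
    archive.extractall(os.path.dirname(archive_path))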
Now let's collect the individual training reviews into a list of strings, one string per review, and let's also collect the review labels (positive / negative) into a labels list:
import os

imdb_dir = '/home/ubuntu/data/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
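A quick sanity check (not in the original code) confirms that we collected the 25,000 training reviews with balanced labels:

# Optional check: 12,500 negative + 12,500 positive training reviews are expected.
print(len(texts), 'reviews,', len(labels), 'labels')
print('First review starts with:', texts[0][:100])
print('Label of first review:', labels[0])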
Tokenize the data
Let's vectorize the texts we collected, and prepare a training and validation split. We will merely be using the concepts we introduced earlier in this section.
Because pre-trained word embeddings are meant to be particularly useful on problems where little training data is available (otherwise, task-specific embeddings are likely to outperform them), we will add the following twist: we restrict the training data to its first 200 samples. So we will be learning to classify movie reviews after looking at just 200 examples...
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
maxlen = 100 # We will cut reviews after 100 words
training_samples = 200 # We will be training on 200 samples
validation_samples = 10000 # We will be validating on 10000 samples
max_words = 10000 # We will only consider the top 10,000 words in the dataset
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
# Split the data into a training set and a validation set
# But first, shuffle the data, since we started from data
# where sample are ordered (all negative first, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
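As an extra sanity check on the tokenization (an addition to the original code), you can map a padded sequence back to words by reversing word_index; padding positions and out-of-vocabulary indices simply have no entry:

# Decode the first training sample back into (truncated, lowercased) text.
reverse_word_index = dict((i, word) for word, i in word_index.items())
print(' '.join(reverse_word_index.get(i, '?') for i in x_train[0]))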
Download the GloVe word embeddings
Head to https://nlp.stanford.edu/projects/glove/ (where you can learn more about the GloVe algorithm), and download the pre-computed embeddings from 2014 English Wikipedia. It's an 822MB zip file named glove.6B.zip, containing 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Unzip it.
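As with the IMDB data, the download can be scripted if you prefer; the download URL and destination directory below are assumptions -- check the GloVe project page for the current link.

# Sketch: fetch and unpack glove.6B.zip with the standard library.
import os
import zipfile
import urllib.request

glove_url = 'http://nlp.stanford.edu/data/glove.6B.zip'  # assumed download link
glove_zip = '/home/ubuntu/data/glove.6B.zip'             # assumed location

if not os.path.exists(glove_zip):
    urllib.request.urlretrieve(glove_url, glove_zip)

with zipfile.ZipFile(glove_zip) as archive:
    archive.extractall('/home/ubuntu/data/')  # yields glove.6B.100d.txt, among others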
Pre-process the embeddings
Let's parse the unzipped file (it's a txt file) to build an index mapping words (as strings) to their vector representation (as number vectors).
glove_dir = '/home/ubuntu/data/'

embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
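A quick look at one entry (optional, not in the original code) confirms the vectors are 100-dimensional:

# 'the' is certainly present in the GloVe vocabulary.
print(embeddings_index['the'].shape)   # expected: (100,)
print(embeddings_index['the'][:5])     # first few coefficients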
Now let's build an embedding matrix that we will be able to load into an Embedding layer. It must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in our reference word index (built during tokenization). Note that index 0 is not supposed to stand for any word or token -- it's a placeholder.
embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
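It can be worth checking (an extra step, not in the original code) how many of the top max_words entries actually received a GloVe vector; any row left all-zero, including the index-0 placeholder, contributes nothing to the model:

# Count rows of the embedding matrix that are still all zeros.
missing = int(np.sum(~np.any(embedding_matrix, axis=1)))
print('%d of %d rows have no pre-trained vector (index 0 is always one of them).'
      % (missing, max_words))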
Define a model
We will be using the same model architecture as before:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Load the GloVe embeddings in the model
The Embedding layer has a single weight matrix: a 2D float matrix where each entry i is the word vector meant to be associated with index i. Simple enough. Let's just load the GloVe matrix we prepared into our Embedding layer, the first layer in our model:
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
Additionally, we freeze the Embedding layer (we set its trainable attribute to False), following the same rationale you are already familiar with from pre-trained convnet features: when parts of a model are pre-trained (like our Embedding layer) and parts are randomly initialized (like our classifier), the pre-trained parts should not be updated during training, to avoid forgetting what they already know. The large gradient updates triggered by the randomly initialized layers would be very disruptive to the already-learned features.
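Equivalently (a variant, not part of the original listing), Keras lets you pass the pre-trained matrix and the frozen flag directly when constructing the layer, so the model is built already initialized:

# Alternative: create the Embedding layer already initialized and frozen,
# then add it to the model as the first layer instead of calling set_weights.
embedding_layer = Embedding(max_words, embedding_dim,
                            input_length=maxlen,
                            weights=[embedding_matrix],
                            trainable=False)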
Train and evaluate
Let's compile our model and train it:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')
Let's plot its performance over time:
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
The model quickly starts overfitting, which is unsurprising given the small number of training samples. Validation accuracy has high variance for the same reason, but it seems to reach the high 50s.
Note that your mileage may vary: since we have so few training samples, performance is heavily dependent on which exact 200 samples we picked, and we picked them at random. If it worked really poorly for you, try picking a different random set of 200 samples, just for the sake of the exercise (in real life you don't get to pick your training data).
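If you want the split to be reproducible, or simply want to try a different subset (a small addition to the original code), you can seed the shuffle and re-slice before training again:

# Re-draw the 200-sample training subset with a fixed (arbitrary) seed.
np.random.seed(123)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]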
We can also try to train the same model without loading the pre-trained word embeddings and without freezing the embedding layer. In that case, we would be learning a task-specific embedding of our input tokens, which is generally more powerful than pre-trained word embeddings when lots of data is available. However, in our case, we have only 200 training samples. Let's try it:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Validation accuracy stalls in the low 50s. So in our case, pre-trained word embeddings do outperform jointly learned embeddings. If you increase the number of training samples, this will quickly stop being the case -- try it as an exercise.
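Here is a rough sketch of that exercise (the sample sizes are arbitrary): it rebuilds and retrains the jointly learned model on increasingly large training sets and reports the best validation accuracy for each.

# Sketch: compare validation accuracy as the training set grows.
# Validation data is taken from the end of the (already shuffled) arrays
# so that it never overlaps the training slice.
x_val_n = data[-validation_samples:]
y_val_n = labels[-validation_samples:]
for n_samples in (200, 2000, 15000):   # arbitrary sizes; 25,000 reviews in total
    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
    model.add(Flatten())
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(data[:n_samples], labels[:n_samples], epochs=10, batch_size=32,
                        validation_data=(x_val_n, y_val_n), verbose=0)
    print(n_samples, 'samples -> best val acc:', max(history.history['val_acc']))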
Finally, let's evaluate the model on the test data. First, we will need to tokenize the test data:
test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)
And let's load and evaluate the first model:
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)
We get an appalling test accuracy of 54%. Working with just a handful of training samples is hard!