I. Summary
One-sentence summary:
Use the texts_to_matrix method: one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
Here the one-hot output is a binary row per sample: a 1 in column k means the word with index k occurs in that sample, for example [0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Integer sequences from texts_to_sequences: [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
First 20 columns of the binary matrix for the two samples:
[0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
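A minimal sketch (not from the book; it assumes the same two sample sentences used throughout this post, and num_words=20 so the rows are exactly the 20 columns shown above) that ties the integer sequences to the binary rows:

from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
tokenizer = Tokenizer(num_words=20)  # 20 columns, matching the rows shown above
tokenizer.fit_on_texts(samples)

sequences = tokenizer.texts_to_sequences(samples)  # [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# A 1 in column k means the word with index k occurs in that sample,
# regardless of how often or in what order it appears.
for row, seq in zip(one_hot_results, sequences):
    assert all(row[k] == 1. for k in seq)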
1、What is the one-hot hashing trick?
[When the number of unique tokens is too large to handle explicitly]: A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the number of unique tokens in your vocabulary is too large to handle explicitly.
[No explicit index per word, no index dictionary]: Instead of explicitly assigning an index to each word and keeping these indices in a dictionary, this method hashes each word into a vector of fixed size, typically with a very simple hash function.
[Saves memory and allows online encoding]: The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can generate token vectors right away, before you have seen all of the data).
[Hash collisions]: The drawback is that it is susceptible to hash collisions: two different words may end up with the same hash value, and any machine learning model that looks at these hash values will not be able to tell those words apart. The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.
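To make the collision risk concrete, here is a small illustrative sketch (the word list and the deliberately tiny hashing space of 10 are made up for demonstration): when the hashing dimensionality is not much larger than the vocabulary, different words inevitably share an index.

words = ['cat', 'dog', 'mat', 'sat', 'ate', 'on', 'my', 'the', 'homework']
dimensionality = 10  # far too small for a real vocabulary, chosen to force collisions

buckets = {}
for word in words:
    index = abs(hash(word)) % dimensionality
    buckets.setdefault(index, []).append(word)

# Any bucket holding more than one word is a hash collision: a model that only
# sees the index cannot tell those words apart. With 9 words in 10 buckets,
# at least one collision is almost certain.
collisions = {i: ws for i, ws in buckets.items() if len(ws) > 1}
print(collisions)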
II. 6.1 one-hot-encoding
1、Word-level one-hot encoding (toy example):
import numpy as np

# This is our initial data; one entry per "sample"
# (in this toy example, a "sample" is just a sentence, but
# it could be an entire document).
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# First, build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # We simply tokenize the samples via the `split` method.
    # In real life, we would also strip punctuation and special characters
    # from the samples.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word.
            token_index[word] = len(token_index) + 1
            # Note that we don't attribute index 0 to anything.

# Next, we vectorize our samples.
# We will only consider the first `max_length` words in each sample.
max_length = 10

# This is where we store our results:
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

print(results)
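As a quick sanity check (my addition, not part of the book listing), the one-hot tensor can be decoded back to words by inverting token_index, reusing the variables defined above:

# Map each column position back to the word it encodes.
reverse_index = {index: word for word, index in token_index.items()}

for i in range(len(samples)):
    decoded = [reverse_index[int(np.argmax(vector))]       # column holding the 1
               for vector in results[i] if vector.any()]   # skip all-zero padding rows
    print(decoded)  # e.g. ['The', 'cat', 'sat', 'on', 'the', 'mat.'] for the first sample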
2、Character level one-hot encoding (toy example)
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # All printable ASCII characters.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

print(results.shape)
print(results)
# The shape is (2, 50, 101):
# 101 because string.printable contains 100 printable ASCII characters (indices 1-100), plus the unused index 0.
# 50 because max_length = 50 was specified.
# 2 because there are 2 sample sentences.
3、Using Keras for word-level one-hot encoding:
from tensorflow.keras.preprocessing.text import Tokenizer
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# We create a tokenizer, configured to only take
# into account the top-1000 most common words
tokenizer = Tokenizer(num_words=1000)
# This builds the word index
tokenizer.fit_on_texts(samples)
# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)
print(sequences)
# You could also directly get the one-hot binary representations.
# Note that other vectorization modes than one-hot encoding are supported!
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
print(one_hot_results[0][0:20])
print(one_hot_results[1][0:20])
# This is how you can recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
print(one_hot_results)
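As the comment above mentions, texts_to_matrix supports vectorization modes other than 'binary' as well ('count', 'freq' and 'tfidf'); a short sketch reusing the tokenizer fitted above:

# 'count' stores how many times each word occurs instead of a 0/1 presence flag;
# e.g. 'the' (index 1) occurs twice in the first sample, so column 1 holds 2.
count_results = tokenizer.texts_to_matrix(samples, mode='count')
print(count_results[0][0:20])

# 'tfidf' down-weights words that appear in many of the samples.
tfidf_results = tokenizer.texts_to_matrix(samples, mode='tfidf')
print(tfidf_results[0][0:20])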
4、Word-level one-hot encoding with hashing trick (toy example):
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We will store our words as vectors of size 1000.
# Note that if you have close to 1000 words (or more)
# you will start seeing many hash collisions, which
# will decrease the accuracy of this encoding method.
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a "random" integer index
        # that is between 0 and 1000.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.

print(results.shape)
print(results[0][0])
print(results)
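One practical caveat (my note, not from the book): Python's built-in hash() is randomized per process for strings, so the indices above change from run to run unless PYTHONHASHSEED is fixed. For a reproducible encoding you could hash with hashlib instead; a minimal sketch under that assumption, reusing the variables defined above:

import hashlib

def stable_index(word, dimensionality):
    # md5 of the UTF-8 bytes gives the same digest in every process,
    # unlike the salted built-in hash().
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, stable_index(word, dimensionality)] = 1.
print(results.shape)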