(Hands-On) Cleaning Text Data for Machine Learning with Python

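In natural language processing (NLP), you cannot go straight from raw text to fitting a machine learning or deep learning model. You must clean your text first, which means splitting it into words and handling punctuation and case.

In fact, there is a whole suite of text preparation methods that you may need to use, and the choice of methods really depends on your natural language processing task.

In this tutorial, you will discover how you can clean and prepare your text ready for modeling with machine learning. Specifically, you will learn:

How to get started by developing your own very simple text cleaning tools.
How to take a step up and use the more sophisticated methods in the NLTK library.
How to prepare text when using modern text representation methods like word embeddings.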

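This tutorial on cleaning text data is divided into 6 parts:

Dataset: Metamorphosis by Franz Kafka
Text Cleaning Is Task Specific
Manual Tokenization
Tokenization and Cleaning with NLTK
Additional Text Cleaning Considerations
Tips for Cleaning Text for Word Embedding

Let's get started.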

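Metamorphosis by Franz Kafka

Let's start off by selecting a dataset.

In this tutorial, we will use the text from the book Metamorphosis by Franz Kafka. It is short, I like it, and you may like it too.

The full text of Metamorphosis is available for free from Project Gutenberg:

Metamorphosis by Franz Kafka on Project Gutenberg

You can download the plain text version of the book here: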

Metamorphosis by Franz Kafka Plain Text UTF-8

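Download the file and place it in your current working directory with the file name "metamorphosis.txt".

The file contains header and footer information that we are not interested in, specifically copyright and license information. Open the file, delete the header and footer information, and save the file as "metamorphosis_clean.txt".

The start of the clean file should look like: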

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.

The file should end with:

And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.
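If you would rather not trim the header and footer by hand, the step can be roughly scripted. The sketch below is only an illustration: the marker prefixes "*** START OF" and "*** END OF" are assumptions about how the downloaded file delimits its body (check your copy, since the wording varies between files), and strip_gutenberg_boilerplate() is a hypothetical helper name. Any title or translator lines that follow the start marker would still need to be removed manually.

```python
def strip_gutenberg_boilerplate(raw, start_marker="*** START OF", end_marker="*** END OF"):
    """Return the text between the start-marker line and the end-marker line.

    The marker prefixes are assumptions; adjust them to match your file.
    If no markers are found, the whole text is returned (stripped).
    """
    lines = raw.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            start = i + 1  # the body begins on the line after the marker
        elif line.startswith(end_marker):
            end = i        # the body stops on the line before the marker
            break
    return "\n".join(lines[start:end]).strip()


# A tiny stand-in for the downloaded file.
sample = (
    "License text\n"
    "*** START OF THE EBOOK ***\n"
    "\n"
    "One morning...\n"
    "young body.\n"
    "\n"
    "*** END OF THE EBOOK ***\n"
    "More license text"
)
print(strip_gutenberg_boilerplate(sample))
```

Whichever way you trim the file, compare the first and last lines of the result against the two excerpts above.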

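Text Cleaning Is Task Specific

After actually getting a hold of your text data, the first step in cleaning it up is to have a strong idea about what you are trying to achieve, and to review the text in that context to see what exactly might help.

Take a moment to look at the text. What do you notice?

Here's what I see:

It's plain text, so there is no markup to parse (yay!).
The translation of the original German uses UK English (e.g. "travelling").
The lines are artificially wrapped with new lines at about 70 characters (meh).
There are no obvious typos or spelling mistakes.
There is punctuation like commas, apostrophes, quotes, question marks, and more.
There are hyphenated descriptions like "armour-like".
There is a lot of use of the em dash ("-") to continue sentences (maybe replace with commas?).
There are names (e.g. "Mr. Samsa").
There do not appear to be numbers that require handling (e.g. 1999).
There are section markers (e.g. "II" and "III"), and we have removed the first "I".

I'm sure there is a lot more going on to the trained eye.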
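One of the observations above, the artificial wrapping at about 70 characters, is easy to undo if you need whole paragraphs. Here is a minimal sketch, assuming paragraphs are separated by blank lines (true of many plain text files, but worth checking in yours); unwrap_paragraphs() is a hypothetical helper name.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines within each paragraph.

    Paragraphs are assumed to be separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # str.split() with no arguments collapses every run of whitespace,
    # including the artificial new lines inside a paragraph.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


sample = (
    "One morning, when Gregor Samsa woke from troubled dreams, he found\n"
    "himself transformed in his bed into a horrible vermin.\n"
    "\n"
    "He lay on his back."
)
print(unwrap_paragraphs(sample))
```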

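We are going to look at general text cleaning steps in this tutorial.

Nevertheless, consider some possible objectives we may have when working with this text document.

For example:

If we were interested in developing a Kafkaesque language model, we may want to keep all of the case, quotes, and other punctuation in place.
If we were interested in classifying documents as "Kafka" and "Not Kafka", maybe we would want to strip case and punctuation, and even trim words back to their stem.

Use your task as the lens through which you choose how to prepare your text data.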

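Manual Tokenization

Text cleaning is hard, but the text we have chosen to work with is pretty clean already.

We could just write some Python code to clean it up manually, and this is a good exercise for those simple problems that you encounter. Tools like regular expressions and splitting strings can get you a long way.

Load Data

Let's load the text data so that we can work with it.

The text is small and will load quickly and easily fit into memory. This will not always be the case, and you may need to write code to memory map the file. Tools like NLTK (covered in the next section) will make working with large files much easier.

We can load the entire "metamorphosis_clean.txt" into memory as follows: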

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

Running the example loads the whole file into memory, ready to work with.
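For files that are too large to read in one go, a lighter alternative to memory mapping is to stream the file one line at a time. The sketch below is an illustration under assumptions: load_text() and iter_lines() are hypothetical helper names, UTF-8 is assumed as the encoding, and a throwaway temporary file stands in for the book so that the code runs on its own.

```python
import os
import tempfile


def load_text(filename, encoding="utf-8"):
    """Read the whole file into memory; the with-block closes it for us."""
    with open(filename, "rt", encoding=encoding) as handle:
        return handle.read()


def iter_lines(filename, encoding="utf-8"):
    """Yield one line at a time so only a single line is held in memory."""
    with open(filename, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


# Demo on a temporary file so the sketch does not need the book on disk.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")
    with open(path, "wt", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\n")
    whole = load_text(path)
    lines = list(iter_lines(path))

print(len(whole), lines)
```

Using a with-block also means the file is closed even if reading raises an error, which the explicit close() call does not guarantee.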

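Split by Whitespace

Clean text often means a list of words or tokens that we can work with in our machine learning models.

This means converting the raw text into a list of words and saving it again.

A very simple way to do this is to split the document by white space, including " ", new lines, tabs and more. We can do this in Python with the split() function on the loaded string.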

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
print(words[:100])

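Running the example splits the document into a long list of words and prints the first 100 for us to review.

We can see that punctuation is preserved (e.g. "wasn't" and "armour-like"), which is nice. We can also see that end-of-sentence punctuation is kept with the last word (e.g. "thought."), which is not great.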

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']

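Select Words

Another approach might be to use the regex module (re) and split the document into words by selecting for strings of alphanumeric characters (a-z, A-Z, 0-9 and "_").

For example: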

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split based on words only
import re
words = re.split(r'\W+', text)
print(words[:100])

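Again, running the example we can see that we get our list of words. This time, we can see that "armour-like" is now two words, "armour" and "like" (fine), but contractions like "What's" are also two words, "What" and "s" (not great).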

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']
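If splitting contractions is a problem for your task, one option is to select words with a pattern that allows an apostrophe or hyphen between letters, rather than splitting on everything that is not a word character. This is a sketch of that idea, not the only reasonable pattern: WORD_PATTERN and find_words() are hypothetical names, and the pattern only covers ASCII letters and the straight apostrophe.

```python
import re

# One or more letters, optionally followed by groups of an apostrophe or
# hyphen plus more letters, so "wasn't" and "armour-like" stay whole.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def find_words(text):
    """Return every match of WORD_PATTERN in the text, in order."""
    return WORD_PATTERN.findall(text)


sample = '"What\'s happened to me?" he thought. It wasn\'t a dream. His armour-like back.'
print(find_words(sample))
```

Whether "armour-like" should be one token or two depends on your task; removing the hyphen from the character class splits it again.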

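Split by Whitespace and Remove Punctuation

Note: This example was written for Python 3.

We may want the words, but without the punctuation like commas and quotes. We also want to keep contractions together.

One way would be to split the document into words by white space (as in the "Split by Whitespace" section), then use string translation to replace all punctuation with nothing (i.e. remove it).

Python provides a constant called string.punctuation that provides a great list of punctuation characters. For example: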

import string
print(string.punctuation)

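Results in:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Python offers a function called translate() that maps one set of characters to another.

We can use the function maketrans() to create a mapping table. We can create an empty mapping table, but the third argument of this function allows us to list all of the characters to remove during the translation process. For example: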

table = str.maketrans('', '', string.punctuation)

We can put all of this together: load the text file, split it into words by white space, then translate each word to remove the punctuation.

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])

We can see that this has had the desired effect, mostly.

Contractions like "What's" have become "Whats", but "armour-like" has become "armourlike".

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']

If you know anything about regex, then you know things can get complex from here.
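Before reaching for regex, the translation table itself can be tuned. As a sketch of one possible trade-off (an assumption about what you might want, not a recommendation), the table below maps hyphens to spaces and leaves apostrophes alone, so "armour-like" becomes two words and "What's" stays intact. Note that the text is translated before it is split, because the hyphen turns into white space.

```python
import string

# Delete every punctuation character except the hyphen and the apostrophe.
to_delete = string.punctuation.replace('-', '').replace("'", '')
# The first two arguments map '-' to a space; the third lists characters to delete.
table = str.maketrans('-', ' ', to_delete)

sample = '"What\'s happened to me?" he thought, lying on his armour-like back.'
words = sample.translate(table).split()
print(words)
```

A side effect is that apostrophes used as single quotes also survive, so this variant suits text that quotes with double quotes, as this book does.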

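Normalizing Case

It is common to convert all words to one case.

This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. "Apple" the company vs "apple" the fruit is a commonly used example).

We can convert all words to lowercase by calling the lower() function on each word.

For example: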

filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

Running the example, we can see that all words are now lowercase.

['one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"what\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'it', "wasn't", 'a', 'dream.', 'his', 'room,', 'a', 'proper', 'human']

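Note

Cleaning text is really hard, problem specific, and full of tradeoffs.

Remember, simple is better.

Simpler text data, simpler models, smaller vocabularies. You can always make things more complex later to see if it results in better model skill.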
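A quick way to see the "smaller vocabularies" effect is to count distinct tokens before and after a cleaning step. The sketch below uses a made-up sample sentence and a hypothetical vocabulary_size() helper; on the real book you would pass in the token lists produced in the sections above.

```python
import string


def vocabulary_size(tokens):
    """Number of distinct tokens in a list."""
    return len(set(tokens))


# Made-up sample text, used only to illustrate the comparison.
sample = "The bedding was hardly able to cover it. The room was quiet, the bed was not."
raw_tokens = sample.split()

# Remove punctuation and lowercase each token.
table = str.maketrans('', '', string.punctuation)
clean_tokens = [w.translate(table).lower() for w in raw_tokens]

print(vocabulary_size(raw_tokens), vocabulary_size(clean_tokens))
```

Here "The" and "the" collapse into one entry, so the cleaned list has a smaller vocabulary than the raw one.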

Next, we'll look at some of the tools in the NLTK library that offer more than simple string splitting.

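Tokenization and Cleaning with NLTK

The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and modeling text.

It provides good tools for loading and cleaning text that we can use to get our data ready for working with machine learning and deep learning algorithms.

Install NLTK

You can install NLTK using your favorite package manager, such as pip: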

sudo pip install -U nltk

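After installation, you will need to install the data used with the library, including a great set of documents that you can use later for testing other tools in NLTK.

There are a few ways to do this, such as from within a script: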

import nltk
nltk.download()

Or from the command line:

python -m nltk.downloader all

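Split into Sentences

A useful first step is to split the text into sentences.

Some modeling tasks prefer input to be in the form of paragraphs or sentences, such as word2vec. You could first split your text into sentences, split each sentence into words, then save each sentence to file, one per line.

NLTK provides the sent_tokenize() function to split text into sentences.

The example below loads the "metamorphosis_clean.txt" file into memory, splits it into sentences, and prints the first sentence.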

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])

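Running the example, we can see that although the document is split into sentences, each sentence still preserves the new line from the artificial wrap of the lines in the original document.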

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.
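The leftover new lines matter if you want to save one sentence per line. A minimal sketch of flattening each sentence first is shown below; the hard-coded list stands in for the output of sent_tokenize() so that the example runs without NLTK, the second sentence is shortened for illustration, and flatten_sentence() is a hypothetical helper name.

```python
def flatten_sentence(sentence):
    """Collapse new lines and repeated spaces inside one sentence."""
    return " ".join(sentence.split())


# Stand-in for the list returned by sent_tokenize(text).
sentences = [
    "One morning, when Gregor Samsa woke from troubled dreams, he found\n"
    "himself transformed in his bed into a horrible vermin.",
    "He lay on his\narmour-like back.",
]

# One sentence per line, ready to be written to a file.
one_per_line = "\n".join(flatten_sentence(s) for s in sentences)
print(one_per_line)
```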

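Split into Words

NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words).

It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens. Contractions are split apart (e.g. "What's" becomes "What" and "'s"). Quotes are kept, and so on.

For example: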

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

Running the code, we can see that punctuation marks are now tokens that we could then decide to specifically filter out.

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '``', 'What', "'s", 'happened', 'to']

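Filter Out Punctuation

We can filter out all tokens that we are not interested in, such as all standalone punctuation.

This can be done by iterating over all tokens and only keeping those tokens that are all alphabetic. Python has the function isalpha() that can be used. For example: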

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])

Running the example, you can see that not only punctuation tokens, but examples like "armour-like" and "'s" were also filtered out.

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']

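Filter Out Stop Words

Stop words are those words that do not contribute to the deeper meaning of the phrase.

They are the most common words, such as "the", "a", and "is".

For some applications like document classification, it may make sense to remove stop words.

NLTK provides a list of commonly agreed upon stop words for a variety of languages, such as English. They can be loaded as follows: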

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

You can see the full list as follows:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

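You can see that they are all lower case and have punctuation removed.

You could compare your tokens to the stop words and filter them out, but you must ensure that your text is prepared the same way.

Let's demonstrate this with a small pipeline of text preparation, including:

Load the raw text.
Split into tokens.
Convert to lowercase.
Remove punctuation from each token.
Filter out remaining tokens that are not alphabetic.
Filter out tokens that are stop words.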

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

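Running this example, we can see that in addition to all of the other transforms, stop words like "a" and "to" have been removed.

I note that we are still left with tokens like "nt". The rabbit hole is deep; there's always more we can do.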

['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armourlike', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'nt', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards', 'viewer']

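Stem Words

Stemming refers to the process of reducing each word to its root or base.

For example, "fishing", "fished", and "fisher" can all be reduced to the stem "fish" (the exact result depends on the stemming algorithm).

Some applications, like document classification, may benefit from stemming in order to both reduce the vocabulary and to focus on the sense or sentiment of a document rather than deeper meaning.

There are many stemming algorithms, although a popular and long-standing method is the Porter Stemming algorithm. This method is available in NLTK via the PorterStemmer class.

For example: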

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

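Running the example, you can see that words have been reduced to their stems, such as "troubled" becoming "troubl". You can also see that the stemming implementation has reduced the tokens to lowercase, likely for internal look-ups in word tables.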

['one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'He', 'lay', 'on', 'hi', 'armour-lik', 'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', ',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg', ',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about', 'helplessli', 'as', 'he', 'look', '.', '``', 'what', "'s", 'happen', 'to']

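There is a nice suite of stemming and lemmatization algorithms to choose from in NLTK, if reducing words to their root is something you need for your project.

Additional Text Cleaning Considerations

We are only getting started.

Because the source text for this tutorial was reasonably clean to begin with, we skipped many concerns of text cleaning that you may need to deal with in your own project.

Here is a short list of additional considerations when cleaning text:

Handling large documents and large collections of text documents that do not fit into memory.
Extracting text from markup like HTML, PDF, or other structured document formats.
Transliteration of characters from other languages into English.
Decoding Unicode characters into a normalized form, such as UTF-8.
Handling of domain-specific words, phrases, and acronyms.
Handling or removing numbers, such as dates and amounts.
Locating and correcting common typos and misspellings.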
...

The list could go on.
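To make one of these considerations concrete, here is a rough sketch of normalizing Unicode text and reducing accented characters to plain ASCII with the standard unicodedata module. It is an illustration under assumptions: to_ascii() is a hypothetical helper name, and the approach simply drops any character that has no ASCII decomposition (for example "ß"), which may or may not be acceptable for your data.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    # NFKD splits a character such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards the combining accents.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Café Müller, naïve façade"))
```

If you only need consistent Unicode rather than ASCII, calling unicodedata.normalize("NFC", text) on its own keeps the accents and just unifies equivalent forms.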

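Hopefully, you can see that getting truly clean text is impossible, and that we are really doing the best we can based on the time, resources, and knowledge we have.

The idea of "clean" is really defined by the specific task or concern of your project.

A pro tip is to continually review your tokens after every transform. I have tried to show that in this tutorial and I hope you take that to heart.

Ideally, you would save a new file after each transform so that you can spend time with all of the data in the new form. Things always jump out at you when you take the time to review your data.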
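Saving a file after each transform can be as simple as writing one token per line. The sketch below shows the idea; save_tokens() and load_tokens() are hypothetical helpers, the file names are made up, and a temporary folder is used only so that the example cleans up after itself.

```python
import os
import tempfile


def save_tokens(tokens, filename):
    """Write one token per line so the file is easy to skim or diff."""
    with open(filename, "wt", encoding="utf-8") as handle:
        handle.write("\n".join(tokens))


def load_tokens(filename):
    """Read a file written by save_tokens() back into a list."""
    with open(filename, "rt", encoding="utf-8") as handle:
        return handle.read().splitlines()


with tempfile.TemporaryDirectory() as folder:
    # Step 1: split by white space and save.
    words = "One morning, when Gregor Samsa woke".split()
    save_tokens(words, os.path.join(folder, "step1_split.txt"))
    # Step 2: lowercase and save under a new name.
    lowered = [w.lower() for w in words]
    save_tokens(lowered, os.path.join(folder, "step2_lower.txt"))
    restored = load_tokens(os.path.join(folder, "step2_lower.txt"))

print(restored)
```

Keeping one file per step makes it easy to compare two stages with an ordinary diff tool.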

Have you done some text cleaning before? What is your preferred pipeline of transforms?
Let me know in the comments below.

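Tips for Cleaning Text for Word Embedding

Recently, the field of natural language processing has been moving away from bag-of-words models and word encoding toward word embeddings.

The benefit of word embeddings is that they encode each word into a dense vector that captures something about its relative meaning within the training text.

This means that variations of words like case, spelling, punctuation, and so on will automatically be learned to be similar in the embedding space. In turn, this can mean that the amount of cleaning required from your text may be less and perhaps quite different to classical text cleaning.

For example, it may no longer make sense to stem words or remove punctuation for contractions.

Tomas Mikolov is one of the developers of word2vec, a popular word embedding method. He suggests that only very minimal text cleaning is required when learning a word embedding model.

Below is his response when pressed with the question about how to best prepare text data for word2vec.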

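There is no universal answer. It all depends on what you plan to use the vectors for. In my experience, it is usually good to disconnect (or remove) punctuation from words, and sometimes also convert all characters to lowercase. One can also replace all numbers (possibly greater than some constant) with some single token such as .
All these pre-processing steps aim to reduce the vocabulary size without removing any important content (which in some cases may not be true when you lowercase certain words, i.e. "Bush" is different than "bush", while "Another" has usually the same sense as "another"). The smaller the vocabulary is, the lower is the memory complexity, and the more robustly are the parameters for the words estimated. You also have to pre-process the test data in the same manner.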
...
In short, you will understand all this much better if you run experiments.
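As a starting point for such experiments, the light-touch preparation described in the quote could be sketched as follows. This is my own illustration, not code from the quote: prepare_for_embedding() is a hypothetical helper, the placeholder "NUM" is an assumed name for the single number token, and every number is replaced regardless of size.

```python
import re

NUMBER_TOKEN = "NUM"  # assumed placeholder name; pick whatever suits your data


def prepare_for_embedding(text, lowercase=True):
    """Lowercase, replace numbers with one token, detach punctuation, split."""
    if lowercase:
        text = text.lower()
    # Replace digit runs (allowing "," or "." between digits) with the placeholder.
    text = re.sub(r"\d+(?:[.,]\d+)*", NUMBER_TOKEN, text)
    # Pad each punctuation character with spaces so it becomes its own token.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return text.split()


sample = 'In 1999, Gregor paid 1,250.50 for "samples".'
print(prepare_for_embedding(sample))
```

Note that this also detaches the apostrophe in contractions; switching the lowercase flag off, or keeping apostrophes attached, are exactly the kinds of variations worth comparing on your own task. Whatever you choose, apply the same function to both training and test text.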