Extracting Punctuation from a Specified DataFrame Column Using Regex

March 27, 2023, 2:58 PM • python-answer

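The steps for extracting punctuation from a specified DataFrame column with a regular expression are as follows:

Import the necessary libraries

First, import the pandas and re libraries: pandas is used to read and process the data, and re is used for regular expression matching.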

import pandas as pd
import re

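Read the data

Use pandas to read the data, for example a table stored in "example.csv". Assume the table has a column named "text" from which the punctuation should be extracted.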

df = pd.read_csv("example.csv")

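Write the regular expression

A regular expression is a syntax for describing patterns in text, and the pattern has to be correct to extract the desired information. Regex engines with Unicode property support write "any punctuation character" as [\p{P}]+, but Python's built-in re module does not support \p{...} and raises re.error (bad escape \p) for that pattern. With re, one workable substitute is a character class built from string.punctuation: "[" + re.escape(string.punctuation) + "]+". If Unicode punctuation is needed, the third-party regex package accepts \p{P}.

In this pattern, the character class matches any single ASCII punctuation character listed in string.punctuation, re.escape keeps characters such as ] and \ literal inside the class, and + matches one or more consecutive characters, so adjacent punctuation marks are returned as a single match.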
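
As a quick check on a single string, the following minimal sketch builds an ASCII-only punctuation class from string.punctuation and runs it with the built-in re module. The names PUNCT_PATTERN and sample are illustrative and do not come from the original code.

```python
import re
import string

# ASCII-only punctuation class; re.escape keeps characters such as ] and \ literal.
PUNCT_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]+")

sample = 'Wait... really? "Yes!"'

# findall returns each run of consecutive punctuation characters as one item.
matches = PUNCT_PATTERN.findall(sample)
print(matches)  # ['...', '?', '"', '!"']
```

Because of the + quantifier, the three dots and the closing !" each come back as one match rather than as separate characters.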

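Write the function

In pandas, the apply method of a Series applies the same function to every element of a column. You can therefore write a function that extracts the punctuation from a piece of text and apply it to the "text" column.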

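import string  # string.punctuation lists the ASCII punctuation characters

def extract_punctuation(text):
    # re has no \p{P}; match runs of ASCII punctuation via an escaped character class
    punctuation = re.findall("[" + re.escape(string.punctuation) + "]+", text)
    return " ".join(punctuation)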

df["punctuation"] = df["text"].apply(extract_punctuation)

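The function first calls re.findall to collect every match of the regular expression into a list, then uses " ".join to concatenate the list items into a single space-separated string.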
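
For columns that may contain missing values, the same extraction might also be written with pandas's vectorized string methods instead of apply. This is a minimal sketch under that assumption; the sample DataFrame and the ASCII-only pattern built from string.punctuation are illustrative, not taken from the original data.

```python
import re
import string

import pandas as pd

# ASCII-only punctuation class (an assumption of this sketch).
PUNCT_PATTERN = "[" + re.escape(string.punctuation) + "]+"

# Made-up sample data, including one missing value.
df = pd.DataFrame({"text": ["Hello, world!", None, "No marks here"]})

# Series.str.findall returns a list of matches per row (missing values stay missing);
# Series.str.join then joins each list with a single space.
df["punctuation"] = df["text"].str.findall(PUNCT_PATTERN).str.join(" ")
print(df)
```

With apply, a missing value would reach re.findall as a non-string and raise a TypeError; the .str methods pass missing values through instead.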

Output the results

Use pandas's to_csv method to write the results to a file.

df.to_csv("result.csv", index=False)
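
One detail worth checking when the output file is read back: rows without punctuation hold an empty string, which read_csv treats as a missing value by default. The sketch below round-trips a small, made-up DataFrame through an in-memory buffer (io.StringIO stands in for result.csv here) and passes keep_default_na=False to keep empty strings as they are.

```python
import io

import pandas as pd

# Made-up data: the second row has no punctuation, so its value is an empty string.
df = pd.DataFrame({"text": ["Hi, there!", "plain"], "punctuation": [", !", ""]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default behaviour: the empty field is read back as NaN.
default_reload = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps the empty field as an empty string.
reloaded = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(reloaded)
```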

Complete code and sample test data:

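import pandas as pd
import re
import string

# Read the data
df = pd.read_csv("testdata.csv")

# Define the function
def extract_punctuation(text):
    # re has no \p{P}; match runs of ASCII punctuation via an escaped character class
    punctuation = re.findall("[" + re.escape(string.punctuation) + "]+", text)
    return " ".join(punctuation)

# Extract the punctuation
df["punctuation"] = df["text"].apply(extract_punctuation)

# Write the results
df.to_csv("result.csv", index=False)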

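Sample contents of testdata.csv (the second row is wrapped in double quotes, with its inner quotes doubled, because it contains a comma and quotation marks):

text
This is a test sentence.
"This is another sentence, with some more punctuation? And quotes ""!"""
A third sentence doesn't have any punctuation or other symbols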

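Sample contents of result.csv, shown as a table (the apostrophe in "doesn't" counts as punctuation, so the third row is not empty):

text	punctuation
This is a test sentence.	.
This is another sentence, with some more punctuation? And quotes "!"	, ? "!"
A third sentence doesn't have any punctuation or other symbols	'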
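
For text containing non-ASCII punctuation (for example full-width Chinese marks), a Unicode-aware variant can be sketched with the standard-library unicodedata module, whose general categories starting with "P" correspond to the \p{P} class. The function name extract_unicode_punctuation is illustrative; this is one possible approach, not the original method.

```python
import unicodedata
from itertools import groupby


def is_punct(ch):
    # General categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P".
    return unicodedata.category(ch).startswith("P")


def extract_unicode_punctuation(text):
    """Return runs of adjacent Unicode punctuation in text, separated by spaces."""
    # groupby groups consecutive characters sharing the same is_punct value,
    # so each group flagged True is one run of adjacent punctuation.
    runs = ["".join(group) for flag, group in groupby(text, key=is_punct) if flag]
    return " ".join(runs)


print(extract_unicode_punctuation("你好，世界！"))
print(extract_unicode_punctuation('He said: "wow"...'))
```

Note that characters such as $, + and = belong to the Unicode symbol categories rather than punctuation, so this variant skips them, whereas string.punctuation includes them.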