Implementing a Co-occurrence Matrix in Python

This guide walks through implementing a co-occurrence matrix in Python.

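What is a co-occurrence matrix

A co-occurrence matrix describes relationships between the words of a text. Whenever two different words appear within the same window of the text, the count for that pair is incremented in the matrix, so each entry records how often two words occur near each other. These counts can then serve as features for tasks such as text classification and clustering.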
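To make the idea of a "window" concrete, here is a minimal sketch that lists the neighbors of one position in a token list. The function name context_window and the English sample sentence are assumptions chosen for illustration; they do not come from the original article.

```python
def context_window(tokens, position, window_size):
    """Return the tokens within window_size positions of tokens[position], excluding the token itself."""
    start = max(0, position - window_size)
    end = min(len(tokens), position + window_size + 1)
    return tokens[start:position] + tokens[position + 1:end]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(context_window(tokens, 2, window_size=2))  # ['the', 'cat', 'on', 'the']
print(context_window(tokens, 0, window_size=2))  # ['cat', 'sat']
```

Every word returned for a position counts as one co-occurrence with the word at that position. Near the edges of the text the window is simply truncated.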

Steps to implement a co-occurrence matrix in Python

The steps are as follows:

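Define a window size: the number of words before and after each word that count as its context;
Define a vocabulary that holds every distinct word in the text;
Traverse the text, counting how often each word occurs and adding each new word to the vocabulary;
Define the co-occurrence matrix as a square matrix whose side length equals the vocabulary size, initialized to all zeros;
Traverse the text again and, for each word, find the other words inside its window and add one to the corresponding matrix entry.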
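The steps above can be sketched in one small function that uses only the standard library. This is a rough illustration under stated assumptions, not the article's own implementation: the name build_cooccurrence, the nested-list matrix, and the toy token list are all hypothetical.

```python
def build_cooccurrence(tokens, window_size=2):
    """Build a vocabulary and a dense co-occurrence matrix (nested lists) from a token list."""
    vocabulary = []   # distinct words in order of first appearance
    index = {}        # word -> row/column position in the matrix
    for token in tokens:
        if token not in index:
            index[token] = len(vocabulary)
            vocabulary.append(token)

    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]

    for i, word in enumerate(tokens):
        # Visit up to window_size positions on each side of position i
        for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
            if i != j:
                matrix[index[word]][index[tokens[j]]] += 1
    return vocabulary, matrix

vocabulary, matrix = build_cooccurrence(["a", "b", "c", "a", "b"], window_size=1)
for word, row in zip(vocabulary, matrix):
    print(word, row)
# a [0, 2, 1]
# b [2, 0, 1]
# c [1, 1, 0]
```

Because the window extends equally in both directions, the resulting matrix is symmetric.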

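Examples of implementing a co-occurrence matrix in Python

Below are two examples of implementing a co-occurrence matrix in Python.

Example 1: co-occurrence matrix for a classical poem

We will use the following classical poem as the example: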

白日依山尽，
黄河入海流。
欲窮千里目，
更上一層樓。

First, we define the window size and a function that tokenizes the poem and strips punctuation:

import jieba

window_size = 2

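def tokenize(text):
    # str.isalpha() is True for Chinese characters and False for punctuation,
    # so this keeps the words and drops the punctuation marks
    return [word for word in jieba.cut(text) if word.isalpha()]

text = "白日依山尽，黄河入海流。欲窮千里目，更上一層樓。"
words = tokenize(text)
print(words)
# Example output (exact segmentation depends on the jieba version and dictionary): ['白日', '依山', '尽', '黄河', '入海', '流', '欲', '千里', '目', '更上一層樓']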

Next, we count how often each word occurs and add each word to the vocabulary:

vocabulary = []
word_count = {}

for word in words:
    word_count[word] = word_count.get(word, 0) + 1
    if word not in vocabulary:
        vocabulary.append(word)

print(word_count)
# Output: {'白日': 1, '依山': 1, '尽': 1, '黄河': 1, '入海': 1, '流': 1, '欲': 1, '千里': 1, '目': 1, '更上一層樓': 1}

print(vocabulary)
# Output: ['白日', '依山', '尽', '黄河', '入海', '流', '欲', '千里', '目', '更上一層樓']
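As a side note, the same counts and vocabulary can be obtained with standard-library helpers. The sketch below is one possible alternative, not the article's code; the short word list (with one repeated word) and the name word_to_index are assumptions for illustration.

```python
from collections import Counter

# Hypothetical token list with one repeated word
words = ["白日", "依山", "尽", "白日"]

word_count = Counter(words)                # word -> number of occurrences
vocabulary = list(dict.fromkeys(words))    # distinct words, first-appearance order preserved
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

print(word_count["白日"])      # 2
print(vocabulary)              # ['白日', '依山', '尽']
print(word_to_index["尽"])     # 2
```

A dictionary such as word_to_index gives constant-time lookups of a word's matrix position, which matters once the vocabulary grows beyond a few hundred words.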

Then we define the co-occurrence matrix and update it while traversing the text:

import numpy as np

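# Square matrix with one row and one column per vocabulary word, initialized to zeros
co_matrix = np.zeros((len(vocabulary), len(vocabulary)))

for i in range(len(words)):
    # Visit up to window_size words on each side of position i
    for j in range(max(0, i - window_size), min(len(words), i + window_size + 1)):
        if i == j:
            continue  # a word does not co-occur with itself at the same position
        x = vocabulary.index(words[i])
        y = vocabulary.index(words[j])
        co_matrix[x][y] += 1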

print(co_matrix)
# Output for the 10 distinct tokens listed above (each word co-occurs once
# with the words at most 2 positions away):
# [[0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
#  [1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
#  [1. 1. 0. 1. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 1. 0. 1. 1. 0. 0. 0. 0.]
#  [0. 0. 1. 1. 0. 1. 1. 0. 0. 0.]
#  [0. 0. 0. 1. 1. 0. 1. 1. 0. 0.]
#  [0. 0. 0. 0. 1. 1. 0. 1. 1. 0.]
#  [0. 0. 0. 0. 0. 1. 1. 0. 1. 1.]
#  [0. 0. 0. 0. 0. 0. 1. 1. 0. 1.]
#  [0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]]
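Once the matrix exists, a common next step is to read off which words co-occur most often with a given word. The following sketch shows one way to do that. The helper top_neighbors and the small hand-written matrix are assumptions for illustration, and nested lists are used here so the snippet runs without NumPy.

```python
def top_neighbors(word, vocabulary, matrix, k=3):
    """Return up to k (neighbor, count) pairs for word, highest count first."""
    row = matrix[vocabulary.index(word)]
    ranked = sorted(zip(vocabulary, row), key=lambda pair: pair[1], reverse=True)
    # Keep only words that actually co-occur with the given word
    return [(neighbor, count) for neighbor, count in ranked if count > 0][:k]

# Hypothetical 3-word vocabulary and its co-occurrence counts
vocabulary = ["a", "b", "c"]
matrix = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
print(top_neighbors("a", vocabulary, matrix, k=2))  # [('b', 2), ('c', 1)]
```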

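Example 2: co-occurrence matrix for Romance of the Three Kingdoms

We will use the text of Romance of the Three Kingdoms (《三国演义》) as the example: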

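import requests

# Note: a /blob/ URL typically serves an HTML page rather than the plain file;
# if only the text is wanted, the file's raw URL may be needed instead
url = "https://gitee.com/singwhatiwanna/sanguo/blob/master/sanguo.md"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error
text = response.text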

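We also need to change the earlier tokenization function: it now uses jieba.analyse.extract_tags, jieba's TF-IDF keyword extractor, and removes stopwords along the way: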

import jieba
import jieba.analyse

window_size = 2

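def tokenize(text):
    # stopwords.txt is expected to contain one stopword per line
    with open('stopwords.txt', encoding='utf-8') as f:
        stopwords = set(line.strip() for line in f)
    words = []
    # extract_tags returns keywords ranked by TF-IDF weight (top 20 by default), not in text order
    for keyword, weight in jieba.analyse.extract_tags(text, withWeight=True):
        if keyword not in stopwords:
            words.append(keyword)
    return words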

words = tokenize(text)
print(words[:10])
# Output: ['三国', '曹操', '孙权', '关公', '诸葛亮', '吕布', '东吴', '刘备', '张飞', '大会']

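The remaining steps are the same as in Example 1: replace the poem with the text of Romance of the Three Kingdoms and you get the co-occurrence matrix, so they are not repeated here. Note that extract_tags returns keywords ranked by weight rather than in their original order, so a window over its result covers neighbors in that ranking. To count co-occurrence within the running text, tokenize with jieba.cut as in Example 1 and then filter out the stopwords.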
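For a text as long as a novel, a dense matrix over the full vocabulary can become large, so it may be more practical to store only the pairs that actually occur. The sketch below is one possible approach, under these assumptions: the tokens arrive in running-text order (for instance from jieba.lcut), the function name sparse_cooccurrence is hypothetical, and the short token list and stopword set are made up for illustration.

```python
from collections import Counter

def sparse_cooccurrence(tokens, stopwords, window_size=2):
    """Count unordered word pairs that fall within window_size positions of each other."""
    # Drop stopwords first; note this moves the remaining words closer together
    kept = [token for token in tokens if token not in stopwords]
    counts = Counter()
    for i, word in enumerate(kept):
        # Look only forward so that each pair of positions is counted once
        for j in range(i + 1, min(len(kept), i + window_size + 1)):
            pair = tuple(sorted((word, kept[j])))
            counts[pair] += 1
    return counts

# Hypothetical pre-tokenized input and stopword set
tokens = ["曹操", "与", "刘备", "论", "英雄", "，", "刘备", "与", "曹操"]
stopwords = {"与", "，"}
counts = sparse_cooccurrence(tokens, stopwords, window_size=2)

print(counts[tuple(sorted(("曹操", "刘备")))])  # 2
print(counts[tuple(sorted(("刘备", "英雄")))])  # 2
print(counts[tuple(sorted(("曹操", "英雄")))])  # 1
```

Sorting each pair before counting treats (A, B) and (B, A) as the same key, which corresponds to storing only one triangle of the symmetric matrix.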
