Text Representation - TF-IDF - Computing Term Frequency-Inverse Document Frequency

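1. Introduction

In natural language processing (NLP), text representation is a basic and essential task: it converts text into a numerical form that a computer can process. TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used text feature extraction method that scores how important a word is to one document within a collection of documents. By combining term frequency (TF) with inverse document frequency (IDF), TF-IDF highlights words that occur often in a particular document but comparatively rarely across the collection as a whole. Such words usually characterize the document's topic better than words that appear everywhere.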

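2. How TF-IDF Works

2.1 Term Frequency (TF)

Term frequency is how often a word occurs in a document, relative to the document's length. It is computed as:
\[ \mathrm{TF}_{t,d} = \frac{n_{t,d}}{\sum_{i} n_{i,d}} \]
where \(n_{t,d}\) is the number of times word \(t\) occurs in document \(d\), and \(\sum_{i} n_{i,d}\) is the total number of word occurrences in document \(d\).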
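
The term frequency formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `term_frequency` and the tokenization rule (lowercase the text and keep only runs of the letters a-z) are assumptions made for this example.

```python
import re
from collections import Counter

def term_frequency(document):
    """Return {term: count / total_terms} for a single document."""
    # Assumed tokenization: lowercase, keep runs of letters, drop punctuation.
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("The cat sat on the mat.")
print(tf["the"])  # "the" accounts for 2 of the 6 tokens
print(tf["cat"])  # "cat" accounts for 1 of the 6 tokens
```

Because every count is divided by the same total, the TF values of one document always sum to 1.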

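2.2 Inverse Document Frequency (IDF)

Inverse document frequency measures how informative a word is across the whole collection. A word that appears in many documents gets a low IDF value; a word that appears in only a few documents gets a high one. It is computed as:
\[ \mathrm{IDF}_{t} = \log \frac{N}{df_{t}} \]
where \(N\) is the total number of documents in the collection and \(df_{t}\) is the number of documents that contain word \(t\).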
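
As a rough sketch of the IDF formula, the snippet below counts, for each word, the number of documents that contain it and then applies the logarithm. The function name `inverse_document_frequency`, the three-document toy corpus, the letters-only tokenization, and the choice of the natural logarithm are all assumptions made for illustration.

```python
import math
import re

def inverse_document_frequency(documents):
    """Return {term: log(N / df)} over a list of document strings."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # A set ensures each document counts a term at most once.
        for term in set(re.findall(r"[a-z]+", doc.lower())):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical toy corpus.
corpus = ["apple banana", "apple cherry", "apple banana date"]
idf = inverse_document_frequency(corpus)
print(idf["apple"])   # in all 3 documents, so log(3/3) = 0.0
print(idf["banana"])  # in 2 of 3 documents, so log(3/2)
print(idf["cherry"])  # in 1 of 3 documents, so log(3)
```

A word that occurs in every document gets an IDF of exactly 0, so it contributes nothing to any TF-IDF score regardless of how often it appears.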

2.3 TF-IDF Value

Multiplying term frequency by inverse document frequency gives the TF-IDF value:
\[ \text{TF-IDF}_{t,d} = \mathrm{TF}_{t,d} \times \mathrm{IDF}_{t} \]

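3. A Worked TF-IDF Example

Suppose the document collection consists of the following three documents:

Document 1: "The cat sat on the mat."
Document 2: "The dog chased the cat."
Document 3: "The bird flew in the sky."

Below we compute by hand the TF-IDF value of the word "cat" in Document 1.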

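3.1 Computing the Term Frequency (TF)

In Document 1, "cat" occurs once, and the document contains 6 words in total (counting both occurrences of "the"). The term frequency of "cat" in Document 1 is therefore:
\[ \mathrm{TF}_{cat,doc1} = \frac{1}{6} \approx 0.167 \]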

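3.2 Computing the Inverse Document Frequency (IDF)

The total number of documents is \(N = 3\), and the number of documents containing "cat" is \(df_{cat} = 2\). Using the natural logarithm, the inverse document frequency of "cat" is:
\[ \mathrm{IDF}_{cat} = \log \frac{3}{2} \approx 0.405 \]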

3.3 Computing the TF-IDF Value

The TF-IDF value of "cat" in Document 1 is:
\[ \text{TF-IDF}_{cat,doc1} = \mathrm{TF}_{cat,doc1} \times \mathrm{IDF}_{cat} \approx 0.167 \times 0.405 \approx 0.068 \]
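
The hand calculation can be checked numerically. The short sketch below plugs the same counts into the formulas; the variable names are chosen for this example only. It also shows that the result depends on the logarithm base: the figures in this section match the natural logarithm, while base 10 gives a smaller value.

```python
import math

tf_cat_doc1 = 1 / 6          # "cat" occurs once among 6 words
idf_cat = math.log(3 / 2)    # natural log; 3 documents, 2 contain "cat"
tf_idf_cat_doc1 = tf_cat_doc1 * idf_cat

print(round(tf_cat_doc1, 3))      # 0.167
print(round(idf_cat, 3))          # 0.405
print(round(tf_idf_cat_doc1, 3))  # 0.068

# With a base-10 logarithm the same counts give a different score.
tf_idf_base10 = tf_cat_doc1 * math.log10(3 / 2)
print(round(tf_idf_base10, 3))    # 0.029
```

The choice of base only rescales all scores by a constant factor, so it does not change how words rank against each other.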

4. Python Code Demonstration

import math
import re
from collections import Counter

# Document collection
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The bird flew in the sky."
]

# Split a document into lowercase words, dropping punctuation so that
# "cat" and "cat." are counted as the same word
def tokenize(document):
    return re.findall(r"[a-z]+", document.lower())

# Compute term frequency (TF) for a single document
def compute_tf(document):
    words = tokenize(document)
    word_counts = Counter(words)
    total_words = len(words)
    tf = {word: count / total_words for word, count in word_counts.items()}
    return tf

# Compute inverse document frequency (IDF) over the whole collection
def compute_idf(documents):
    num_docs = len(documents)
    word_doc_count = {}
    for doc in documents:
        # Use a set so each document counts a word at most once
        unique_words = set(tokenize(doc))
        for word in unique_words:
            word_doc_count[word] = word_doc_count.get(word, 0) + 1
    idf = {word: math.log(num_docs / count) for word, count in word_doc_count.items()}
    return idf

# Compute TF-IDF for every word in every document
def compute_tf_idf(documents):
    tf_list = [compute_tf(doc) for doc in documents]
    idf = compute_idf(documents)
    tf_idf_list = []
    for tf in tf_list:
        tf_idf = {word: tf[word] * idf[word] for word in tf}
        tf_idf_list.append(tf_idf)
    return tf_idf_list

# Compute the TF-IDF values
tf_idf_result = compute_tf_idf(documents)

# Print the results
for i, doc_tf_idf in enumerate(tf_idf_result):
    print(f"TF-IDF values for document {i + 1}:")
    for word, score in doc_tf_idf.items():
        print(f"  {word}: {score:.4f}")
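
Since the goal of text representation is a numerical form, it can help to see the scores laid out as fixed-length vectors, one per document, with one position per vocabulary word. The following is a self-contained sketch under stated assumptions: the helper names `tokenize` and `to_vector`, the alphabetically sorted vocabulary, and the letters-only tokenization are choices made for this example.

```python
import math
import re
from collections import Counter

docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The bird flew in the sky.",
]

def tokenize(text):
    # Assumed tokenization: lowercase, keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]

# One vector position per distinct word, in alphabetical order.
vocabulary = sorted({t for tokens in tokenized for t in tokens})

# Document frequency: number of documents containing each word.
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
idf = {t: math.log(len(docs) / doc_freq[t]) for t in vocabulary}

def to_vector(tokens):
    """Return the TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[t] / len(tokens) * idf[t] for t in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]

print(vocabulary)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

Every vector has the same length as the vocabulary, and the position for "the" is 0 in all three documents because the word appears in each of them.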

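5. Summary

TF-IDF is a simple yet effective text representation method. By combining term frequency with inverse document frequency, it highlights the important words in a document. Its main advantages and disadvantages are summarized below: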

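Advantages	Disadvantages
Easy to understand and cheap to compute	Ignores the semantic meaning of words
Filters out common words and highlights keywords	May perform poorly on short texts
Requires no additional training data	IDF values may be unreliable for newly appearing words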
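
The last disadvantage can be made concrete: with the plain formula, a word that appears in no document of the collection has a document frequency of 0, and \(\log(N / 0)\) is undefined. One common workaround, which is not part of the method described above, is a smoothed IDF that adds 1 to both counts and 1 to the result (scikit-learn's TF-IDF transformer uses this form by default). The function name `smoothed_idf` is an assumption for this sketch.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed variant: log((1 + N) / (1 + df)) + 1, defined even for df = 0."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(round(smoothed_idf(3, 0), 3))  # unseen word: log(4) + 1
print(round(smoothed_idf(3, 2), 3))  # word in 2 of 3 documents
print(smoothed_idf(3, 3))            # word in every document: 1.0
```

Note that under this variant a word found in every document still gets a weight of 1 rather than 0, so the resulting scores are not directly comparable with the values computed earlier.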

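TF-IDF is widely used in information retrieval, text classification, keyword extraction, and similar tasks. Applied appropriately, it makes text data easier to process and analyze.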
