Python Text Analysis of the Novel 斗破苍穹 (Battle Through the Heavens)


2023-03-21 14:00:03 Source: 博客园 (Cnblogs)


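This post uses Python to analyze the novel: word segmentation, word frequency, a word cloud, a ranking of characters by number of appearances, and so on.

1. Word segmentation

Segment the text and write the segmentation result to a text file.

Create a txt file of your own to serve as a custom dictionary (loaded as 1.txt in the code below).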

import re

import jieba

# Load the custom dictionary with load_userdict
jieba.load_userdict("1.txt")

# Read the novel (the source file is GBK-encoded)
with open("doupo.txt", "r", encoding="gbk") as f:
    text = f.read()

# Remove punctuation and whitespace with a regular expression
text = re.sub(r'[\s+.!/_,$%^*("]+|[+——！，。？、~@#￥%……&*（）：；《》“”()»〔〕-]+', "", text)

# Segment the text
words = jieba.cut(text)

# Drop empty tokens and join the rest with spaces
result = " ".join(word for word in words if word.strip())

# Save the segmentation result to a txt file
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(result)

Result: the segmented text is saved to output.txt, with tokens separated by spaces.
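The contents of the custom dictionary are not shown in this post. A jieba user dictionary is a UTF-8 text file with one entry per line: the word, then an optional frequency and an optional part-of-speech tag, separated by spaces. As a minimal sketch, the helper below writes a file in that layout. The function name `write_userdict`, the file name, and the sample entries are illustrative assumptions, not the author's actual 1.txt.

```python
import os
import tempfile


def write_userdict(path, entries):
    """Write entries in jieba's user-dictionary layout: word [freq] [tag]."""
    lines = []
    for word, freq, tag in entries:
        parts = [word]
        if freq is not None:
            parts.append(str(freq))
        if tag is not None:
            parts.append(tag)
        lines.append(" ".join(parts))
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return lines


# Illustrative entries only (hypothetical, not the contents of the original 1.txt)
sample_entries = [
    ("萧炎", 1000, "nr"),    # word, frequency, part-of-speech tag
    ("斗之力", None, None),  # frequency and tag may both be omitted
]
dict_path = os.path.join(tempfile.gettempdir(), "userdict_example.txt")
userdict_lines = write_userdict(dict_path, sample_entries)
print(userdict_lines)
```

A file written this way could then be passed to `jieba.load_userdict` in place of 1.txt.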

2. Word frequency

from collections import Counter

# Read the segmented text produced in step 1
with open("output.txt", "r", encoding="utf-8") as f:
    text = f.read()

# output.txt is already segmented and space-separated, so split on whitespace
# instead of running jieba again without the custom dictionary loaded
words = text.split()

# Count word frequencies, keeping only words longer than one character
word_count = Counter(word for word in words if len(word) > 1)

# Print each word with its count, most frequent first
for word, count in word_count.most_common():
    print(word, count)

Result:

There are too many tokens to list them all here.
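Since the full list is too long to display, one option is to print only the most frequent words. The sketch below uses `collections.Counter.most_common` for that. The function name `top_words`, the cutoff `n`, and the sample string are assumptions made for illustration.

```python
from collections import Counter


def top_words(segmented_text, n=20, min_len=2):
    """Return the n most frequent tokens that have at least min_len characters."""
    tokens = segmented_text.split()  # tokens are assumed to be space-separated
    counts = Counter(tok for tok in tokens if len(tok) >= min_len)
    return counts.most_common(n)


# Made-up sample in the same space-separated layout as output.txt
sample = "萧炎 斗气 萧炎 的 斗气 萧炎 大陆"
for word, count in top_words(sample, n=2):
    print(word, count)
```

For the real data, the string read from output.txt would take the place of `sample`.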

3. Drawing the word cloud

from collections import Counter

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the novel
with open("doupo.txt", "r", encoding="gbk") as f:
    text = f.read()

# Segment the text
words = jieba.lcut(text)

# Count word frequencies, keeping only words longer than one character
word_count = Counter(word for word in words if len(word) > 1)

# Create the word cloud from the frequency table
wc = WordCloud(
    font_path="云峰静龙行书.ttf",  # a font file that contains Chinese glyphs
    background_color="white",
    max_words=500,
    max_font_size=200,
    width=1000,
    height=800,
    margin=5,
).generate_from_frequencies(word_count)

# Display the word cloud without axes
plt.imshow(wc)
plt.axis("off")
plt.show()

Result: plt.show() displays the generated word cloud.
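A word cloud built from raw counts can end up dominated by common function words. WordCloud's `stopwords` argument is ignored by `generate_from_frequencies`, so any filtering has to be applied to the frequency table beforehand. A minimal sketch, assuming a hypothetical stopword set (the words and counts below are examples only, not data from the original post):

```python
def drop_stopwords(word_count, stopwords):
    """Return a copy of word_count without the given stopwords."""
    return {word: count for word, count in word_count.items() if word not in stopwords}


# Hypothetical stopword set and made-up counts for illustration
stopwords = {"一个", "没有", "自己"}
word_count = {"萧炎": 3, "一个": 5, "斗气": 2}
filtered = drop_stopwords(word_count, stopwords)
print(filtered)
```

The filtered dictionary could then be passed to `generate_from_frequencies` in place of `word_count`.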

4. Drawing a pie chart

Note: this is a pie chart of the ten characters who appear most often in the novel.

from collections import Counter

import jieba
import matplotlib.pyplot as plt

# Load the character-name list (one name per line)
with open("person_names.txt", "r", encoding="utf-8") as f:
    person_names = set(f.read().splitlines())

# Add the character names to jieba as a user dictionary
jieba.load_userdict("person_names.txt")

# Read the novel
with open("doupo.txt", "r", encoding="gbk") as f:
    text = f.read()

# Segment the text
words = jieba.lcut(text)

# Count how often each character name appears
name_freq = Counter(word for word in words if word in person_names)

# Take the 10 most frequent names
top_words = name_freq.most_common(10)
labels = [name for name, _ in top_words]
sizes = [count for _, count in top_words]

# Set the fonts before drawing so the labels pick them up
plt.rcParams["font.sans-serif"] = ["SimHei"]  # SimHei renders Chinese labels
plt.rcParams["font.size"] = 16

# Draw the pie chart
fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(sizes, labels=labels, autopct="%1.1f%%", startangle=150)
ax.set_title("斗破苍穹人物出场次数饼状图")
plt.show()

Result: plt.show() displays the pie chart of the ten most frequent characters.
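The code above uses person_names.txt both as a jieba user dictionary and as a plain list of names, which works only if each line holds nothing but the name. If a line also carried the optional frequency or tag columns, the membership test would not match that name. The sketch below takes the first whitespace-separated field of each line instead. `load_names` and `count_names` are hypothetical helper names, and the sample lines and tokens are made up.

```python
from collections import Counter


def load_names(lines):
    """Extract the name (first field) from each non-empty dictionary line."""
    return {line.split()[0] for line in lines if line.strip()}


def count_names(tokens, names, top=10):
    """Count tokens that are known names and return the most frequent ones."""
    counts = Counter(tok for tok in tokens if tok in names)
    return counts.most_common(top)


# Made-up dictionary lines and tokens for illustration
names = load_names(["萧炎 1000 nr", "药老", ""])
tokens = ["萧炎", "看着", "药老", "萧炎", "笑"]
ranking = count_names(tokens, names)
labels = [name for name, _ in ranking]
sizes = [count for _, count in ranking]
print(ranking)
```

The resulting `labels` and `sizes` have the same shape as the lists passed to the pie chart.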
