Python application: jieba word segmentation 1, batch text segmentation (艽野尘梦 better's blog, CSDN)

2023-03-31 13:06:03 Source: 博客园 (cnblogs)

Key points summary

os.walk()

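os.walk() walks a directory tree, either top-down or bottom-up, and yields the names of the directories and files it finds. It is a simple, easy-to-use traverser for files and directories, which makes batch file handling straightforward.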
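To make the traversal concrete, here is a minimal sketch of what os.walk() yields and how its topdown argument changes the visiting order. The helper names (walk_order, list_files) and the file names are made up for illustration; the tree is built in a temporary folder so the snippet can be run anywhere.

```python
import os
import tempfile

def walk_order(base, topdown=True):
    """Return the directories visited by os.walk, relative to base."""
    visited = []
    for root, dirs, files in os.walk(base, topdown=topdown):
        # root: current directory; dirs: its sub-directory names; files: its file names
        visited.append(os.path.relpath(root, base))
    return visited

def list_files(base):
    """Return every file under base as a path relative to base, sorted."""
    found = []
    for root, dirs, files in os.walk(base):
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), base))
    return sorted(found)

# Build a small throwaway tree: base/a.txt and base/sub/b.docx
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.docx")):
        with open(os.path.join(base, rel), "w", encoding="utf-8") as f:
            f.write("demo")
    top_down = walk_order(base, topdown=True)    # parent first, then "sub"
    bottom_up = walk_order(base, topdown=False)  # "sub" first, then parent
    all_files = list_files(base)
    print(top_down)
    print(bottom_up)
    print(all_files)
```

Because file names come back without their folder, they have to be joined with root before the file can be opened.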

win32com library

Reference:

Working with Word files in Python using the win32com module, Python 热爱者's blog, CSDN: https://blog.csdn.net/qdPython/article/details/114439716

jieba word segmentation

Reference:

An introduction to custom dictionaries in jieba, feng98ren's blog, CSDN: https://blog.csdn.net/feng98ren/article/details/80436791

Word clouds

Reference:

Generating word clouds in Python is easy | ready-to-use Python word cloud code, Zhihu (zhihu.com): https://zhuanlan.zhihu.com/p/353795160

Importing third-party libraries

This post mainly uses the pandas, jieba, wordcloud, numpy, matplotlib and imageio libraries; win32com is imported later for the Word conversion.

    import pandas as pd
    import matplotlib.pyplot as plt
    import jieba
    from PIL import Image
    from wordcloud import WordCloud, ImageColorGenerator
    from imageio import imread
    import numpy as np

    # The two % lines are Jupyter magics; keep comments off those lines.
    # Draw figures inline, so plt.show() can be omitted.
    %matplotlib inline
    plt.rcParams["font.sans-serif"] = ["SimHei"]  # a font that can display Chinese labels
    plt.rcParams["axes.unicode_minus"] = False    # render the minus sign correctly with that font
    # Render inline figures as SVG (vector graphics) so they stay sharp.
    %config InlineBackend.figure_format = "svg"

Reading all file names in a folder

Use the walk method of the os library to get the names of all files in a given folder. The traversal is built into the batch-conversion function in the next section.

Batch-converting Word documents to txt

The format conversion is done with win32com. First, write a function that converts a single Word file:

    import os
    from win32com import client as wc

    def Translate(filein, fileout):
        # Convert one Word document to plain text through Word's COM interface
        # (Windows only; Microsoft Word must be installed).
        wordapp = wc.Dispatch("Word.Application")
        # Word resolves relative paths against its own working directory,
        # so pass absolute paths.
        doc = wordapp.Documents.Open(os.path.abspath(filein))
        # File format 4 saves plain text, so that Python can later open the
        # file in "r" mode without garbled characters.
        doc.SaveAs(os.path.abspath(fileout), 4)
        doc.Close()

Then embed that function in a loop: a .docx file is converted, a .txt file is left untouched, and the function finally returns the names of all txt files in the folder:

    def Translate_all(file_dir):
        # Given a folder path, convert every .docx file to .txt and
        # return the names of all .txt files (bare names, without the folder).
        paths = []  # created once, so names from every visited folder are kept
        # os.walk yields a (root, dirs, files) triple for each directory
        for root, dirs, files in os.walk(file_dir):
            for file in files:
                split_file = os.path.splitext(file)  # (name, extension)
                if split_file[1] == ".docx":
                    Translate(os.path.join(root, file), os.path.join(root, split_file[0] + ".txt"))
                    paths.append(split_file[0] + ".txt")
                elif split_file[1] == ".txt":
                    paths.append(split_file[0] + ".txt")
        return paths
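The walk-and-convert loop described in this section can be tried without Word installed. The following is only a sketch under stated assumptions: collect_txt_names and fake_convert are hypothetical names, the converter is passed in as an argument so that a stand-in can replace the win32com-based function, and the file names are made up. Unlike the loop described above, it ignores the case of the extension and lists each txt name only once.

```python
import os
import tempfile

def collect_txt_names(folder, convert):
    """Walk folder, convert every .docx with convert(src, dst), return the .txt names.

    convert is a hypothetical stand-in for the Word-based converter.
    """
    names = []  # created once, before the walk starts
    for root, dirs, files in os.walk(folder):
        for name in files:
            stem, ext = os.path.splitext(name)
            ext = ext.lower()
            if ext == ".docx":
                convert(os.path.join(root, name), os.path.join(root, stem + ".txt"))
            elif ext != ".txt":
                continue  # neither .docx nor .txt: ignore the file
            txt_name = stem + ".txt"
            if txt_name not in names:  # a.docx next to an existing a.txt counts once
                names.append(txt_name)
    return sorted(names)

def fake_convert(src, dst):
    """Stand-in converter: writes a placeholder file instead of calling Word."""
    with open(dst, "w", encoding="utf-8") as f:
        f.write("converted from " + os.path.basename(src))

with tempfile.TemporaryDirectory() as folder:
    for name in ("policy1.docx", "policy2.txt", "notes.pdf"):
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write("demo")
    result = collect_txt_names(folder, fake_convert)
    created = os.path.exists(os.path.join(folder, "policy1.txt"))
    print(result)
```

Passing the converter as a parameter keeps the folder logic testable on any platform; on Windows the real conversion function can be passed instead of fake_convert.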

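Text segmentation

Finally, segment the text with jieba after loading the Chinese word list and my own user dictionary. Paddle mode is used here; for the other three modes (precise, full and search-engine), see the references in the key points summary at the top. All single-character words are removed, and the words and their frequencies are saved to a new txt file in descending order of frequency.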

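    base_dir = r"政策txt\25政策txt"
    out_dir = "词频"
    os.makedirs(out_dir, exist_ok=True)  # make sure the output folder exists

    # Paddle mode has to be enabled first (jieba >= 0.40, paddlepaddle installed);
    # otherwise use_paddle=True is silently ignored.
    jieba.enable_paddle()
    # load_userdict adds the listed words to jieba's dictionary so they are kept
    # as whole tokens; it does not remove stop words from the output.
    jieba.load_userdict("cn_stopwords.txt")  # Chinese stop-word file, loaded as dictionary entries
    jieba.load_userdict("userdict.txt")      # user-defined word list

    files = Translate_all(base_dir)
    for file in files:
        with open(os.path.join(base_dir, file), "r", encoding="utf-8", errors="ignore") as f:
            text = f.read()
        seg_list = jieba.lcut(text, use_paddle=True)  # segment in paddle mode
        stop_words_counts = {}
        for word in seg_list:
            if len(word) == 1:  # drop single-character words
                continue
            elif r"\u" in repr(word):  # skip tokens whose repr contains \u escapes, e.g. full-width spaces
                continue
            else:
                # get() returns 0 the first time a word is seen, so each occurrence adds 1
                stop_words_counts[word] = stop_words_counts.get(word, 0) + 1
        stop_words_counts = list(stop_words_counts.items())  # dict -> list of (word, count)
        stop_words_counts.sort(key=lambda x: x[1], reverse=True)  # descending by frequency
        data = pd.DataFrame(stop_words_counts, columns=["词", "词频"])
        # print(data.head())
        data.to_csv(os.path.join(out_dir, os.path.splitext(file)[0] + "词频.txt"), encoding="utf_8_sig", index=False)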
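The counting and filtering step can be looked at in isolation, without jieba. A minimal sketch, assuming the text has already been segmented into a list of tokens: count_words is a hypothetical helper, the token list is made up, and two techniques are swapped in, collections.Counter for the hand-written dictionary and str.isprintable() for the repr check. The latter is slightly stricter, since it also rejects tokens containing characters such as tabs or newlines.

```python
from collections import Counter

def count_words(tokens):
    """Count tokens, skipping single-character tokens and tokens
    that contain non-printable characters (e.g. full-width spaces)."""
    kept = [t for t in tokens if len(t) > 1 and t.isprintable()]
    # most_common() returns (word, count) pairs sorted by descending count
    return Counter(kept).most_common()

# Made-up token list standing in for the output of a segmenter
tokens = ["人才", "引进", "的", "政策", "人才", "\u3000\u3000", "政策", "人才"]
freq = count_words(tokens)
print(freq)  # [('人才', 3), ('政策', 2), ('引进', 1)]
```

The result has the same shape as the sorted list of (word, count) pairs built in the segmentation step, so it can be passed to pandas.DataFrame in the same way.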

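Extension: drawing a word cloud

Generate a word cloud from the segmentation result (words and their frequencies) in the shape of a given image. The source image should have clearly contrasting tones. A small tip: if the cloud should take the shape of a few characters, as in this post, create WordArt text in Word and take a screenshot of it. First convert the image to a NumPy array, in which each pixel holds its RGB values, then pass that array as the mask argument when creating the word cloud.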

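    # Draw the word cloud
    shape_img = Image.open("人才引进.png")  # image that defines the shape of the cloud
    mask = np.array(shape_img)              # convert the image to a NumPy array
    print(mask)
    my_cloud = WordCloud(
        background_color="white",  # white background
        width=1000, height=500,    # ignored when a mask is given; the mask's shape is used instead
        scale=2,                   # scaling factor: larger values give a sharper image
        font_path="simhei.ttf",    # SimHei, a font that can render Chinese
        mask=mask,
        max_words=500,
        # random_state=5           # seed for a reproducible layout and colouring
    ).generate_from_frequencies(dict(stop_words_counts))  # counts from the last file processed
    # image_colors = ImageColorGenerator(mask)  # optionally take the word colours from the image
    plt.subplots(figsize=(12, 8), dpi=500)
    plt.imshow(my_cloud, interpolation="bilinear")  # display the cloud with matplotlib
    plt.axis("off")  # hide the axes
    # plt.show()
    my_cloud.to_file("mywordcloud.png")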
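The word-cloud code takes its input from stop_words_counts, which holds the counts of whichever file the segmentation loop processed last. To draw a cloud for one particular file, or in a later session, the saved frequency table can be read back instead. A minimal sketch, assuming a hypothetical helper load_frequencies and a made-up demo file laid out like the saved tables (a header row, then word and count, encoded as UTF-8 with BOM):

```python
import csv
import os
import tempfile

def load_frequencies(path):
    """Read a two-column word/count CSV with a header row into a dict of word -> int."""
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return {word: int(count) for word, count in reader}

# Demo with a throwaway file that mimics one saved frequency table
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_freq.txt")
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows([["词", "词频"], ["人才", 3], ["政策", 2]])
    frequencies = load_frequencies(path)
    print(frequencies)
```

The resulting dictionary has the form that generate_from_frequencies expects, so it can replace dict(stop_words_counts) in the call.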
