Chinese Word Frequency Statistics

December 12, 2021

aqzt


Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2773

1. Download a long Chinese novel.

The text used here is a short chapter from Dream of the Red Chamber (红楼梦).

2. Read the text to be analyzed from a file.


with open('123.txt', 'r', encoding='utf-8') as f:
    text = f.read()
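
Downloaded novels are not always saved as UTF-8. Chinese text files are often GBK/GB18030, and in that case reading with `encoding='utf-8'` raises `UnicodeDecodeError`. A minimal sketch of a fallback reader follows; the helper name `read_text_with_fallback`, the list of encodings, and the sample text are assumptions for illustration, not part of the assignment.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "gb18030")):
    """Try each encoding in order and return the first successful decode."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demonstration with a throwaway GB18030-encoded file (hypothetical sample text).
sample = "红楼梦 第一回"
with tempfile.NamedTemporaryFile("w", encoding="gb18030", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

decoded = read_text_with_fallback(tmp_path)
os.remove(tmp_path)
print(decoded)
```

UTF-8 is tried first because it rejects most GBK byte sequences, so a successful UTF-8 decode is a reasonable signal that the file really is UTF-8.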

3. Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(text)


import jieba

wordsls = jieba.lcut(text)

4. Update the dictionary with vocabulary specific to the text being analyzed.

jieba.add_word('天罡北斗阵')  # add words one at a time

jieba.load_userdict(word_dict)  # word_dict: a dictionary text file

Reference dictionaries can be downloaded from: https://pinyin.sogou.com/dict/

Conversion code: scel_to_text

Dictionary: [image of the dictionary file]


# load_userdict takes the path of the dictionary text file (or a file-like object);
# call it before jieba.lcut so the added words affect segmentation
jieba.load_userdict('23.txt')
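
The jieba documentation describes a user dictionary as a UTF-8 text file with one entry per line: the word, then an optional frequency and an optional part-of-speech tag, separated by spaces. As a rough sketch, such a file could be produced from a plain word list as shown here. Apart from 天罡北斗阵, the terms are hypothetical, and the output path is an assumption.

```python
import os
import tempfile

# Hypothetical domain terms; in the article these come from a converted Sogou dictionary.
terms = ["天罡北斗阵", "大观园", "荣国府"]

dict_path = os.path.join(tempfile.gettempdir(), "userdict_demo.txt")  # assumed location

# One entry per line; the optional frequency and part-of-speech tag are omitted here.
with open(dict_path, "w", encoding="utf-8") as f:
    for term in terms:
        f.write(term + "\n")

# Read the file back to show the resulting layout.
with open(dict_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)
os.remove(dict_path)  # clean up the demo file
```

In a real run the file would be kept and its path passed to `jieba.load_userdict` before segmenting.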

5. Generate the word frequency counts.


wcdict = {}

for word in wordsls:
    # skip stop words (worddict2, see step 7) and single-character tokens
    if word in worddict2 or len(word) == 1:
        continue
    wcdict[word] = wcdict.get(word, 0) + 1
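
The same counting can be written with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the method the assignment prescribes; the sample tokens and stop words are hypothetical stand-ins for `wordsls` and `worddict2`.

```python
from collections import Counter

# Hypothetical stand-ins for the article's `wordsls` (jieba output) and `worddict2` (stop words).
wordsls = ["宝玉", "的", "黛玉", "宝玉", "了", "一个", "贾母", "宝玉", "黛玉", "，"]
worddict2 = {"的", "了", "一个"}

# Keep tokens that are longer than one character and are not stop words, then count them.
wcdict = Counter(
    word for word in wordsls
    if len(word) > 1 and word not in worddict2
)

print(wcdict)
```

A `Counter` is a `dict` subclass, so `wcdict.items()` in the sorting step works unchanged.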

6. Sort by frequency.


wcls = list(wcdict.items())
wcls.sort(key=lambda x: x[1], reverse=True)
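
Sorting on the count alone leaves words with equal counts in whatever order they were first inserted, because Python's sort is stable. If a reproducible ranking is wanted, one option is a compound key, sketched here with hypothetical counts standing in for `wcdict`; `top_n` is an assumed cut-off.

```python
# Hypothetical counts standing in for the article's `wcdict`.
wcdict = {"黛玉": 2, "宝玉": 3, "贾母": 1, "凤姐": 2}

# Sort by descending count, then by the word itself to break ties deterministically.
wcls = sorted(wcdict.items(), key=lambda item: (-item[1], item[0]))

top_n = 3  # assumed cut-off for this illustration
for word, count in wcls[:top_n]:
    print(word, count)
```

Slicing with `wcls[:top_n]` also avoids an `IndexError` when the text yields fewer distinct words than the cut-off.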

7. Exclude grammatical words: stop words such as pronouns, articles, and conjunctions.

Stop-word file: stops_chinese.txt


with open('stops_chinese.txt', encoding='utf-8') as f:
    worddict2 = {line.strip() for line in f}  # a set gives fast membership tests
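
As a small illustration of why the stop words are best held as a collection of whole words rather than one long string, the sketch below loads them into a set and filters a token list. The helper name `load_stopwords` and the file contents are hypothetical.

```python
import os
import tempfile


def load_stopwords(path):
    """Return the non-empty lines of a UTF-8 stop-word file as a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Throwaway file with hypothetical contents, standing in for stops_chinese.txt.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("的\n了\n\n我们\n")
    stop_path = tmp.name

stopwords = load_stopwords(stop_path)
os.remove(stop_path)

tokens = ["我们", "的", "大观园", "了"]
kept = [t for t in tokens if t not in stopwords]
print(kept)
```

With a set, `t not in stopwords` compares whole words. With a single string read from the file, the same expression would be a substring test and could drop words that merely appear inside a longer stop word.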

8. Output the 20 most frequent words and save the result to a file.


import pandas as pd

# save the 20 most frequent (word, count) pairs from the sorted list
pd.DataFrame(data=wcls[:20]).to_csv('E:/1234.csv', encoding='utf-8')
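
If pandas is not available, the same file can be written with the standard `csv` module. This is a sketch under assumptions: the sample pairs stand in for `wcls`, the output path is a temporary location rather than the article's `E:/1234.csv`, and the `utf-8-sig` encoding is a choice made here so that spreadsheet programs are more likely to detect the encoding.

```python
import csv
import os
import tempfile

# Hypothetical sorted (word, count) pairs standing in for the article's `wcls`.
wcls = [("宝玉", 3), ("黛玉", 2), ("贾母", 1)]
top_n = 2  # the article uses 20; a smaller value keeps the demo short

out_path = os.path.join(tempfile.gettempdir(), "word_freq_demo.csv")  # assumed output location

# utf-8-sig writes a byte order mark at the start of the file.
with open(out_path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])
    writer.writerows(wcls[:top_n])

# Read the file back to confirm what was written.
with open(out_path, newline="", encoding="utf-8-sig") as f:
    rows = list(csv.reader(f))

os.remove(out_path)
print(rows)
```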

9. Generate a word cloud.


wl_split = " ".join(wordsls)

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# if Chinese words render as empty boxes, pass font_path=<a font file with Chinese glyphs>
mywc = WordCloud().generate(wl_split)

plt.imshow(mywc)
plt.axis("off")
plt.show()
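
The word cloud above is built from every token in `wordsls`, so the stop words excluded in step 7 are still passed to it. One possible refinement, sketched here with hypothetical tokens and stop words, is to apply the step 5 filter before joining.

```python
# Hypothetical stand-ins for the article's `wordsls` and `worddict2`.
wordsls = ["宝玉", "我们", "黛玉", "的", "宝玉", "一个"]
worddict2 = {"我们", "的", "一个"}

# Apply the step 5 filter, then join with spaces as the original code does.
filtered = [w for w in wordsls if len(w) > 1 and w not in worddict2]
wl_split = " ".join(filtered)

print(wl_split)
```

The resulting string would then be passed to `WordCloud().generate(wl_split)` exactly as before.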

10. Final complete code and screenshot:


import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = open('D://123.txt', 'r', encoding='utf-8').read()

# load the custom dictionary before segmenting so its words are recognized
jieba.load_userdict('D://23.txt')

# stop words, one per line, held as a set of whole words
worddict2 = set(open('D://stops_chinese.txt', 'r', encoding='utf-8').read().split())

wordsls = jieba.lcut(text)

wcdict = {}

for word in wordsls:
    # skip stop words and single-character tokens
    if word in worddict2 or len(word) == 1:
        continue
    wcdict[word] = wcdict.get(word, 0) + 1

wcls = list(wcdict.items())
wcls.sort(key=lambda x: x[1], reverse=True)

# print the 25 most frequent words (fewer if the text is short)
for item in wcls[:25]:
    print(item)

wl_split = " ".join(wordsls)

mywc = WordCloud().generate(wl_split)

plt.imshow(mywc)
plt.axis("off")
plt.show()
