Complete Chinese and English Word Frequency Statistics

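Steps:

1. Prepare a UTF-8 encoded text file (file)

2. Read the file into a string (str)

3. Preprocess the text

4. Split the text into a list of words (list)

5. Count the words in a dictionary (set, dict)

6. Sort by frequency (list.sort(key=))

7. Exclude grammatical words such as pronouns, articles, and conjunctions, which carry no meaning of their own

8. Output the top 20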
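The steps above can be sketched as a single function. This is a minimal sketch, not the listing used in this post: it assumes English text, uses `collections.Counter` and a regular expression instead of manual `replace`/`count` calls, and the names `top_words` and `DEFAULT_STOP_WORDS` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; extend it for real texts.
DEFAULT_STOP_WORDS = frozenset({'a', 'the', 'i', 'it', 'of', 'to', 'in', 'and'})

def top_words(path, n=20, stop_words=DEFAULT_STOP_WORDS):
    """Return the n most frequent non-stop words in a UTF-8 text file."""
    # Steps 1-2: read the file into a string
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Steps 3-4: lowercase, then extract runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    # Steps 5 and 7: count the words, skipping stop words
    counts = Counter(w for w in words if w not in stop_words)
    # Steps 6 and 8: most_common() sorts by count, highest first
    return counts.most_common(n)
```

`Counter` counts every word in one pass over the list, whereas calling `list.count` once per word rescans the whole list each time.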

1. Word frequency statistics for an English song


str2 = '''I will run, I will climb, I will soar
I'm undefeated
Jumping out of my skin, pull the chord
Yeah I believe it
The past, is everything we were
don't make us who we are
So I'll dream, until I make it real,
and all I see is stars
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
possible
And we'll dream it possible
I will chase, I will reach, I will fly
Until I'm breaking, until I'm breaking
Out of my cage, like a bird in the night
I know I'm changing, I know I'm changing
In, into something big, better than before
And if it takes, takes a thousand lives
Then it's worth fighting for
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
it possible
From the bottom to the top
We're sparking wild fire's
Never quit and never stop
The rest of our lives
From the bottom to the top
We're sparking wild fire's
Never quit and never stop
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
possible
And we'll dream it possible'''.lower()

# Replace punctuation with spaces so that "run," and "run" count as the same word.
# Apostrophes are kept so that contractions such as "i'm" stay intact.
for ch in ',.?!"':
    str2 = str2.replace(ch, ' ')
print(str2)

# split() with no arguments splits on any whitespace (including newlines)
# and ignores leading and trailing whitespace, so no strip() is needed.
words = str2.split()
print(words)

print('Number of occurrences of each word:')
for word in set(words):  # iterate over unique words so each is printed once
    print(word, words.count(word))

# Remove stop words (pronouns, articles, prepositions, conjunctions, ...)
strSet = set(words)
newSet = {'a', 'will', 'it', 'out', 'of', 'my', 'the', 'i', 'in', 'to', 'when', 'and'}
strSet1 = strSet - newSet
print(strSet1)

# Word-count dictionary
strdict = {}
for word in strSet1:
    strdict[word] = words.count(word)
print(len(strdict), strdict)

# Sort the (word, count) pairs by count, highest first
strList = list(strdict.items())

def takesecond(elem):
    return elem[1]

# Equivalent with an anonymous function:
# strList.sort(key=lambda x: x[1], reverse=True)
strList.sort(key=takesecond, reverse=True)
print(strList)

# Top 20 (slicing is safe even if there are fewer than 20 words)
for item in strList[:20]:
    print(item)
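Because the words being counted come from a set, words with the same count can appear in a different order from run to run. A composite sort key makes the ranking reproducible. The sketch below is one possible approach; the function name `rank` and the sample counts are made up for illustration and are not the real counts for the lyrics.

```python
def rank(counts, n=20):
    """Sort (word, count) pairs by count descending, then by word ascending."""
    # Negating the count sorts high counts first while words stay alphabetical.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Illustrative counts only
sample = {'dream': 6, 'possible': 7, 'fly': 4, 'chase': 4}
print(rank(sample, 3))  # [('possible', 7), ('dream', 6), ('chase', 4)]
```

Here `chase` is listed before `fly` because the two are tied on count and the tie is broken alphabetically.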

2. Word frequency statistics for a Chinese novel


import jieba

# Read the UTF-8 encoded text file into a string
with open('doupo.txt', 'r', encoding='utf-8') as f:
    fo = f.read()
print(fo)

# Segment the text and count each word in a dictionary,
# skipping single-character tokens (mostly punctuation and particles)
doupols = jieba.lcut(fo)
doupodict = {}
for word in doupols:
    if len(word) == 1:
        continue
    doupodict[word] = doupodict.get(word, 0) + 1
print(doupodict)

# The three jieba segmentation modes
print(list(jieba.cut(fo)))                # accurate mode: the most precise split, suited to text analysis
print(list(jieba.cut(fo, cut_all=True)))  # full mode: scans out every possible word in the sentence
print(list(jieba.cut_for_search(fo)))     # search-engine mode: splits long words again on top of accurate mode to improve recall

# items() yields (key, value) tuples; convert them to a list so they can be sorted
wcList = list(doupodict.items())
wcList.sort(key=lambda x: x[1], reverse=True)  # sort by count, highest first
print(wcList)

# Print the five most frequent words
for item in wcList[:5]:
    print(item)
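The Chinese listing skips single-character tokens but does not apply a stop-word list the way the English one does. The sketch below shows one way to combine both filters. The function name `count_tokens` and its parameters are hypothetical, and the tokenizer is passed in as an argument so the example runs without jieba installed.

```python
from collections import Counter

def count_tokens(text, tokenize, min_len=2, stop_words=frozenset()):
    """Count the tokens produced by `tokenize`, skipping short tokens and stop words."""
    counts = Counter()
    for token in tokenize(text):
        token = token.strip()
        if len(token) < min_len or token in stop_words:
            continue
        counts[token] += 1
    return counts

# With jieba installed, the tokenizer would be jieba.lcut, for example:
#     import jieba
#     top5 = count_tokens(fo, jieba.lcut).most_common(5)
# Here a whitespace tokenizer stands in, applied to pre-segmented demo text.
demo = count_tokens('词频 统计 的 词频 ， 统计 词频', str.split)
print(demo.most_common(2))  # [('词频', 3), ('统计', 2)]
```

Passing the tokenizer as a parameter also makes it easy to compare jieba's accurate, full, and search-engine modes with the same counting code.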
