Complete Word Frequency Counting for English and Chinese Text


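Steps:

1. Prepare a UTF-8 encoded text file (file)

2. Read the file into a string (str)

3. Preprocess the text

4. Split the text and extract the words (list)

5. Count the words in a dictionary (set, dict)

6. Sort by frequency with list.sort(key=)

7. Exclude grammatical words that carry little meaning of their own: pronouns, articles, conjunctions, and so on

8. Print the top 20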
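The steps above can be sketched as one small function. This is only a minimal sketch, not the script used later in this post: the names `top_words` and `STOP_WORDS` are made up for illustration, the stop-word set is a tiny sample, and the standard-library `re` and `collections.Counter` stand in for the manual splitting, counting, and sorting.

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration; extend it for real texts.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "i", "it"}


def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-words in text as (word, count) pairs."""
    # Steps 3-4: lowercase, then extract runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Steps 5 and 7: count the words, skipping the stop words.
    counts = Counter(w for w in words if w not in stop_words)
    # Steps 6 and 8: most_common sorts by count, descending, and keeps n items.
    return counts.most_common(n)


print(top_words("I will run, I will climb, I will soar", n=1))  # [('will', 3)]
```

For steps 1 and 2, the text would first be read from a UTF-8 file with `open(path, encoding='utf-8')` and passed to the function as a string.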

1. English song: word frequency count


str2 = '''I will run, I will climb, I will soar
I'm undefeated
Jumpiing out of my skin, pull the chord
Yeah I believe it
The past, is everything we were
don't make us who we are
So I'll dream, until I make it real,
and all I see is stars
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
possible
And we'll dream it possible
I will chase, I will reach, I will fly
Until I'm breaking, until I'm breaking
Out of my cage, like a bird in the night
I know I'm changing, I know I'm changing
In, into something big, better than before
And if it takes, takes a thousand lives
Then it's worth fighting for
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
it possible
From the bottom to the top
We're sparking wild fire's
Never quit and never stop
The rest of our lives
From the bottom to the top
We're sparking wild fire's
Never quit and never stop
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
possible
And we'll dream it possible'''.lower()

# Replace punctuation with spaces (the apostrophe is kept so contractions survive)
for ch in '.,"?!':
    str2 = str2.replace(ch, ' ')
str2 = str2.replace('\n', ' ')
print(str2)

str2 = str2.strip()   # strip leading and trailing whitespace
str2 = str2.split()   # split on whitespace into a list of words
print(str2)

print('Occurrences of each word:')
for word in set(str2):   # iterate over unique words so each is printed once
    print(word, str2.count(word))

strSet = set(str2)
newSet = {'a', 'will', 'it', 'out', 'of', 'my', 'the', 'i', 'in', 'to', 'when', 'and'}
strSet1 = strSet - newSet   # drop prepositions and other function words
print(strSet1)

strdict = {}   # word -> count dictionary
for word in strSet1:
    strdict[word] = str2.count(word)
print(len(strdict), strdict)

strList = list(strdict.items())

def takesecond(elem):
    # sort key: the count in a (word, count) pair
    return elem[1]

# strList.sort(key=lambda x: x[1], reverse=True)   # equivalent, using a lambda
strList.sort(key=takesecond, reverse=True)   # sort by count, descending
print(strList)

for item in strList[:20]:   # top 20; slicing is safe even with fewer than 20 words
    print(item)
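The punctuation clean-up in the script above chains several `replace` calls. As an alternative, a single pass with `str.translate` can do the same job. The sketch below is only one way to do it: `PUNCT`, `TABLE`, and `clean` are names invented here, and keeping the apostrophe is an assumption about what should count as part of a word.

```python
import string

# Keep the apostrophe so contractions such as "i'm" stay intact
# (an assumption about what counts as a word here).
PUNCT = string.punctuation.replace("'", "")
# Map every remaining punctuation character to a space.
TABLE = str.maketrans(PUNCT, " " * len(PUNCT))


def clean(text):
    """Lowercase text and replace punctuation (except the apostrophe) with spaces."""
    return text.lower().translate(TABLE)


print(clean("Yeah, I'm here!").split())  # ['yeah', "i'm", 'here']
```

Because every punctuation character becomes a space, a plain `split()` afterwards yields the word list without any further stripping.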


2. Chinese novel: word frequency count


import jieba

# Prepare a UTF-8 encoded text file and read it into a string
with open('doupo.txt', 'r', encoding='utf-8') as f:
    fo = f.read()
print(fo)

# Count how often each word occurs, using a dictionary
doupols = jieba.lcut(fo)
doupodict = {}
for word in doupols:
    if len(word) == 1:   # skip single-character tokens (punctuation, particles)
        continue
    doupodict[word] = doupodict.get(word, 0) + 1
print(doupodict)

# The three jieba segmentation modes
print(list(jieba.cut(fo)))                # accurate mode: the most precise segmentation, suited to text analysis
print(list(jieba.cut(fo, cut_all=True)))  # full mode: scans out every possible word in the sentence
print(list(jieba.cut_for_search(fo)))     # search-engine mode: re-splits long words from accurate mode to improve recall

# Turn the dictionary into a list of (word, count) tuples
wcList = list(doupodict.items())
wcList.sort(key=lambda x: x[1], reverse=True)   # sort by count, descending
print(wcList)

# Print the five most frequent words
for item in wcList[:5]:
    print(item)
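The dictionary counting and sorting above can also be written with `collections.Counter`. The following is a hedged sketch rather than a replacement for the script: `count_tokens` is a hypothetical helper, and the hand-written `sample` list stands in for the output of `jieba.lcut(fo)` so that the example runs without jieba or the novel file.

```python
from collections import Counter


def count_tokens(tokens, n=5, min_len=2):
    """Count tokens of at least min_len characters and return the n most common."""
    # Skipping one-character tokens mirrors the len(word) == 1 check in the script.
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)


# Hand-written token list standing in for the output of jieba.lcut(fo).
sample = ["萧炎", "的", "斗气", "，", "萧炎", "修炼", "斗气", "萧炎"]
print(count_tokens(sample, n=2))  # [('萧炎', 3), ('斗气', 2)]
```

With real data, the call would be `count_tokens(jieba.lcut(fo))`, which returns the top five words in one step.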
