Deep Learning / NLP: Reading the Source Code of textrank4zh, a TextRank Module


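Contents

1. Reading the textrank4zh source code
2. Using the textrank4zh module
2.1 Installing textrank4zh
2.2 textrank4zh usage examples
1) Extracting keywords, key phrases and key sentences
2) Comparing the three word-segmentation modes of textrank4zh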

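TextRank is a text-ranking algorithm adapted from PageRank, Google's algorithm for ranking web pages by importance. Given a text, it can extract the keywords and key phrases of that text and, as an extractive summarization method, its key sentences. It was proposed in: Mihalcea R, Tarau P. TextRank: Bringing order into texts[C]. Association for Computational Linguistics, 2004. The paper can be downloaded from Baidu Scholar.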

For the basic principles of the TextRank algorithm, see the linked post.
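To make the ranking step concrete, here is a minimal sketch of the PageRank-style iteration that TextRank applies to an undirected, weighted graph. It is not taken from textrank4zh, which delegates this step to networkx. The function name `textrank_scores`, the dense list-of-lists graph and the normalized `(1 - d) / n` damping term are assumptions made for illustration, and isolated nodes are not handled.

```python
def textrank_scores(weights, d=0.85, max_iter=200, tol=1e-10):
    """Score nodes of an undirected weighted graph by damped power iteration.

    weights is a symmetric square matrix (list of lists); weights[i][j] is the
    edge weight between node i and node j, with 0 meaning "no edge".
    """
    n = len(weights)
    strength = [sum(row) for row in weights]  # total edge weight per node
    scores = [1.0 / n] * n                    # start from a uniform distribution
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge (j, i).
            incoming = sum(
                weights[j][i] / strength[j] * scores[j]
                for j in range(n)
                if weights[j][i] > 0
            )
            updated.append((1.0 - d) / n + d * incoming)
        change = sum(abs(new - old) for new, old in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores


# A star graph: node 0 is linked to nodes 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
scores = textrank_scores(star)
print(scores)
```

In the star graph the hub (node 0) ends up with the largest score and the three leaves share the same smaller score, which is the behaviour keyword and sentence ranking rely on: well-connected nodes rank higher.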

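1. Reading the textrank4zh source code

textrank4zh is a Python implementation of the TextRank algorithm for Chinese text. The module can be downloaded from its project page.
The source is walked through file by file below.
util.py: the utility code of textrank4zh; the core of the TextRank algorithm is implemented in this file.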


# -*- encoding:utf-8 -*-
"""
@author:   letian
@homepage: http://www.letiantian.me
@github:   https://github.com/someus/
"""
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import os
import math
import networkx as nx
import numpy as np
import sys

try:
    # Python 2 only. On Python 3 `reload` is not a builtin, so the error is swallowed.
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

sentence_delimiters = ['?', '!', ';', '？', '！', '。', '；', '……', '…', '\n']
allow_speech_tags = ['an', 'i', 'j', 'l', 'n', 'nr', 'nrfg', 'ns', 'nt', 'nz', 't', 'v', 'vd', 'vn', 'eng']

PY2 = sys.version_info[0] == 2
if not PY2:
    # Python 3.x and up
    text_type = str
    string_types = (str,)
    xrange = range


    def as_text(v):  # return a unicode string
        if v is None:
            return None
        elif isinstance(v, bytes):
            return v.decode('utf-8', errors='ignore')
        elif isinstance(v, str):
            return v
        else:
            raise ValueError('Unknown type %r' % type(v))


    def is_text(v):
        return isinstance(v, text_type)

else:
    # Python 2.x
    text_type = unicode
    string_types = (str, unicode)
    xrange = xrange


    def as_text(v):
        if v is None:
            return None
        elif isinstance(v, unicode):
            return v
        elif isinstance(v, str):
            return v.decode('utf-8', errors='ignore')
        else:
            raise ValueError('Invalid type %r' % type(v))


    def is_text(v):
        return isinstance(v, text_type)

__DEBUG = None


def debug(*args):
    # Print only when the environment variable DEBUG is set to '1'
    global __DEBUG
    if __DEBUG is None:
        try:
            if os.environ['DEBUG'] == '1':
                __DEBUG = True
            else:
                __DEBUG = False
        except:
            __DEBUG = False
    if __DEBUG:
        print(' '.join([str(arg) for arg in args]))


class AttrDict(dict):
    """Dict that can get attribute by dot"""

    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self


def combine(word_list, window=2):
    """Yield the word pairs that fall within `window`; they become the edges between words.
    Keyword arguments:
    word_list  --  list of str, the words of one sentence.
    window     --  int, window size.
    """
    if window < 2: window = 2
    for x in xrange(1, window):
        if x >= len(word_list):
            break
        word_list2 = word_list[x:]
        res = zip(word_list, word_list2)
        for r in res:
            yield r


def get_similarity(word_list1, word_list2):
    """Default function for computing the similarity of two sentences.
    Keyword arguments:
    word_list1, word_list2  --  the two sentences, each given as a list of words
    """
    words = list(set(word_list1 + word_list2))
    vector1 = [float(word_list1.count(word)) for word in words]
    vector2 = [float(word_list2.count(word)) for word in words]
    vector3 = [vector1[x] * vector2[x] for x in xrange(len(vector1))]
    vector4 = [1 for num in vector3 if num > 0.]
    co_occur_num = sum(vector4)  # number of distinct words shared by both sentences

    if abs(co_occur_num) <= 1e-12:
        return 0.

    denominator = math.log(float(len(word_list1))) + math.log(float(len(word_list2)))  # denominator

    if abs(denominator) < 1e-12:
        return 0.

    return co_occur_num / denominator


def sort_words(vertex_source, edge_source, window=2, pagerank_config={'alpha': 0.85, }):
    """Sort words by importance, from most to least important.
    Keyword arguments:
    vertex_source   --  2-D list; each sub-list is a sentence made of words, which become the PageRank nodes
    edge_source     --  2-D list; each sub-list is a sentence made of words, whose positions define the PageRank edges
    window          --  words within `window` consecutive positions of a sentence are linked pairwise
    pagerank_config --  PageRank settings
    """
    sorted_words = []
    word_index = {}
    index_word = {}
    _vertex_source = vertex_source
    _edge_source = edge_source
    words_number = 0
    # Assign an integer index to every distinct word
    for word_list in _vertex_source:
        for word in word_list:
            if not word in word_index:
                word_index[word] = words_number
                index_word[words_number] = word
                words_number += 1

    graph = np.zeros((words_number, words_number))

    # Fill the symmetric adjacency matrix from co-occurring word pairs
    for word_list in _edge_source:
        for w1, w2 in combine(word_list, window):
            if w1 in word_index and w2 in word_index:
                index1 = word_index[w1]
                index2 = word_index[w2]
                graph[index1][index2] = 1.0
                graph[index2][index1] = 1.0

    debug('graph:\n', graph)

    # from_numpy_matrix was removed in networkx 3.0; use nx.from_numpy_array there
    nx_graph = nx.from_numpy_matrix(graph)
    scores = nx.pagerank(nx_graph, **pagerank_config)  # this is a dict
    sorted_scores = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for index, score in sorted_scores:
        item = AttrDict(word=index_word[index], weight=score)
        sorted_words.append(item)

    return sorted_words


def sort_sentences(sentences, words, sim_func=get_similarity, pagerank_config={'alpha': 0.85, }):
    """Sort sentences by importance, from most to least important.
    Keyword arguments:
    sentences         --  list of sentences
    words             --  2-D list; each sub-list holds the words of the matching sentence in `sentences`
    sim_func          --  similarity function taking two lists of words
    pagerank_config   --  PageRank settings
    """
    sorted_sentences = []
    _source = words
    sentences_num = len(_source)
    graph = np.zeros((sentences_num, sentences_num))

    # Edge weights are the pairwise sentence similarities
    for x in xrange(sentences_num):
        for y in xrange(x, sentences_num):
            similarity = sim_func(_source[x], _source[y])
            graph[x, y] = similarity
            graph[y, x] = similarity

    # from_numpy_matrix was removed in networkx 3.0; use nx.from_numpy_array there
    nx_graph = nx.from_numpy_matrix(graph)
    scores = nx.pagerank(nx_graph, **pagerank_config)  # this is a dict
    sorted_scores = sorted(scores.items(), key=lambda item: item[1], reverse=True)

    for index, score in sorted_scores:
        item = AttrDict(index=index, sentence=sentences[index], weight=score)
        sorted_sentences.append(item)

    return sorted_sentences


if __name__ == '__main__':
    pass
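The two helpers that shape the graphs, combine and get_similarity, are easier to follow on a tiny input. The sketch below re-implements their logic without networkx or numpy so it runs on its own; the names `cooccurrence_pairs` and `overlap_similarity` and the sample word lists are made up for illustration and are not part of textrank4zh.

```python
import math


def cooccurrence_pairs(words, window=2):
    """Pair every word with the words up to window - 1 positions after it."""
    window = max(window, 2)
    pairs = []
    for offset in range(1, window):
        if offset >= len(words):
            break
        pairs.extend(zip(words, words[offset:]))
    return pairs


def overlap_similarity(words1, words2):
    """Number of distinct shared words divided by log(len1) + log(len2)."""
    shared = set(words1) & set(words2)
    if not shared:
        return 0.0
    denominator = math.log(len(words1)) + math.log(len(words2))
    if abs(denominator) < 1e-12:
        # Happens when both sentences contain exactly one word.
        return 0.0
    return len(shared) / denominator


words = ['a', 'b', 'c', 'd']
print(cooccurrence_pairs(words, window=2))
print(cooccurrence_pairs(words, window=3))

s1 = ['酒店', '位于', '北京']
s2 = ['北京', '酒店', '开业']
print(overlap_similarity(s1, s2))
```

With window=2 only adjacent words are linked; with window=3 words two positions apart are linked as well. The similarity counts each shared word once, however often it occurs, and the logarithms in the denominator keep long sentences from winning on length alone.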

Segmentation.py: contains the classes used for word segmentation and sentence splitting.


# -*-coding:utf-8-*-

# Bring newer-version features into the current version (Python 2 compatibility)
from __future__ import (absolute_import, division, print_function, unicode_literals)
# jieba's part-of-speech (POS) tagging component
import jieba.posseg as pseg
# Encoding-aware file reading
import codecs
# Operating-system helpers
import os
# Utility code of textrank4zh
from textrank4zh import util


def get_default_stop_words_file():
    """Return the path of the default stop-word file."""
    # Directory that contains this script
    d = os.path.dirname(os.path.realpath(__file__))
    # os.path.join combines several path components into one path
    return os.path.join(d, 'stopwords.txt')


class WordSegmentation(object):
    """Word segmentation."""

    def __init__(self, stop_words_file=None, allow_speech_tags=util.allow_speech_tags):
        """
        Load the POS tag list and the stop-word list.
        :param stop_words_file: path of the stop-word file (UTF-8, one stop word per line);
                                if it is not a str, the default stop-word file is used
        :param allow_speech_tags: list of POS tags to keep when POS filtering is enabled
        """
        allow_speech_tags = [util.as_text(item) for item in allow_speech_tags]
        # Use this list as the default POS filter
        self.default_speech_tags_filter = allow_speech_tags

        self.stop_words = set()
        # Start from the default stop-word file
        self.stop_words_file = get_default_stop_words_file()
        # Switch to the caller's file only when a str path was given
        if type(stop_words_file) is str:
            self.stop_words_file = stop_words_file
        # Read the file and add every stop word to the set
        for word in codecs.open(self.stop_words_file, 'r', 'utf-8', 'ignore'):
            self.stop_words.add(word.strip())

    def segment(self, text, lower=True, use_stop_words=True, use_speech_tags_filter=False):
        """
        Segment a text into a list of words.
        :param text: the text to segment
        :param lower: whether to lowercase words (relevant for English)
        :param use_stop_words: if True, remove the words found in the stop-word set
        :param use_speech_tags_filter: if True, keep only words whose POS tag is in the default tag list
        :return: the filtered list of words
        """
        text = util.as_text(text)
        # POS-tagged segmentation result
        jieba_result = pseg.cut(text)

        if use_speech_tags_filter == True:
            # Keep only the allowed POS tags
            jieba_result = [w for w in jieba_result if w.flag in self.default_speech_tags_filter]
        else:
            # No POS filtering
            jieba_result = [w for w in jieba_result]

        # Drop tokens tagged 'x' (non-morpheme tokens such as punctuation and symbols)
        # and strip surrounding whitespace
        word_list = [w.word.strip() for w in jieba_result if w.flag != 'x']
        # Drop empty strings
        word_list = [word for word in word_list if len(word) > 0]

        # Optionally lowercase English words
        if lower:
            word_list = [word.lower() for word in word_list]

        # Optionally remove stop words
        if use_stop_words:
            word_list = [word.strip() for word in word_list if word.strip() not in self.stop_words]

        return word_list

    def segment_sentences(self, sentences, lower=True, use_stop_words=True, use_speech_tags_filter=False):
        """
        Turn every sentence of `sentences` into a list of words.
        :param sentences: list of sentences
        :return: list whose elements are the filtered word lists
        """
        res = []
        for sentence in sentences:
            # Segment one sentence at a time and collect the word lists
            res.append(self.segment(text=sentence, lower=lower, use_stop_words=use_stop_words, use_speech_tags_filter=use_speech_tags_filter))
        return res


class SentenceSegmentation(object):
    """Sentence splitting."""

    def __init__(self, delimiters=util.sentence_delimiters):
        """
        :param delimiters: iterable of the delimiters used to split sentences
        """
        self.delimiters = set([util.as_text(item) for item in delimiters])

    def segment(self, text):
        """Split a text into sentences and return them as a list."""
        res = [util.as_text(text)]
        # Debug output
        util.debug(res)
        util.debug(self.delimiters)

        # Two nested loops: for each delimiter, split every piece obtained so far
        for sep in self.delimiters:
            text, res = res, []
            for seq in text:
                res += seq.split(sep)
        # Strip whitespace around each sentence and drop empty sentences
        res = [s.strip() for s in res if len(s.strip()) > 0]
        return res


class Segmentation(object):
    """Combined sentence splitting and word segmentation."""

    def __init__(self, stop_words_file=None, allow_speech_tags=util.allow_speech_tags, delimiters=util.sentence_delimiters):
        """
        :param stop_words_file: stop-word file
        :param allow_speech_tags: list of POS tags used for POS filtering
        :param delimiters: delimiters used to split sentences
        """
        # Word segmenter
        self.ws = WordSegmentation(stop_words_file=stop_words_file, allow_speech_tags=allow_speech_tags)
        # Sentence splitter
        self.ss = SentenceSegmentation(delimiters=delimiters)

    def segment(self, text, lower=False):
        text = util.as_text(text)
        # Split the text into a list of sentences
        sentences = self.ss.segment(text)
        # Word lists with no filtering at all
        words_no_filter = self.ws.segment_sentences(sentences=sentences, lower=lower, use_stop_words=False, use_speech_tags_filter=False)
        # Word lists with stop words removed
        words_no_stop_words = self.ws.segment_sentences(sentences=sentences, lower=lower, use_stop_words=True, use_speech_tags_filter=False)
        # Word lists with stop words removed and POS filtering applied
        words_all_filters = self.ws.segment_sentences(sentences=sentences, lower=lower, use_stop_words=True, use_speech_tags_filter=True)
        # Return all four results
        return util.AttrDict(sentences=sentences, words_no_filter=words_no_filter, words_no_stop_words=words_no_stop_words, words_all_filters=words_all_filters)


# Main module
if __name__ == '__main__':
    # Empty statement that keeps the program structure complete
    pass
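The sentence-splitting loop in SentenceSegmentation.segment can be tried without installing jieba. The following is a minimal standalone sketch of the same idea; the function name `split_sentences` is hypothetical, and the delimiter list is copied from util.sentence_delimiters.

```python
DELIMITERS = ['?', '!', ';', '？', '！', '。', '；', '……', '…', '\n']


def split_sentences(text, delimiters=DELIMITERS):
    """Split text on every delimiter in turn, then drop empty pieces."""
    pieces = [text]
    for sep in set(delimiters):
        # Re-split every piece produced so far on the current delimiter.
        pieces = [part for piece in pieces for part in piece.split(sep)]
    return [p.strip() for p in pieces if p.strip()]


text = '这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。'
for sentence in split_sentences(text):
    print(sentence)
```

Because each pass only cuts existing pieces further, the order in which the delimiters are visited does not change the result, which is why storing them in an unordered set is safe. Commas are not delimiters, so a comma-separated clause stays inside its sentence.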

TextRank4Keyword.py: contains the class used to extract keywords and key phrases.


#-*-coding:utf-8-*-

# Bring newer-version features into the current version (Python 2 compatibility)
from __future__ import (absolute_import, division, print_function, unicode_literals)
# Graph / complex-network library
import networkx as nx
# Numerical computing library
import numpy as np
# Utility code of textrank4zh
from textrank4zh import util
# The Segmentation class
from textrank4zh.Segmentation import Segmentation


class TextRank4Keyword(object):

    def __init__(self, stop_words_file=None, allow_speech_tags=util.allow_speech_tags, delimiters=util.sentence_delimiters):
        """
        :param stop_words_file: str, path of the stop-word file; for any other type the default stop-word file is used
        :param allow_speech_tags: list of POS tags used for POS filtering
        :param delimiters: delimiters used to split sentences, default `?!;？！。；…\n`
        """
        self.text = ''
        self.keywords = None
        # Segmentation helper
        self.seg = Segmentation(stop_words_file=stop_words_file, allow_speech_tags=allow_speech_tags, delimiters=delimiters)
        # List of sentences
        self.sentences = None
        # 2-D list obtained by segmenting every sentence in sentences
        self.words_no_filter = None
        # 2-D list obtained by removing the stop words from words_no_filter
        self.words_no_stop_words = None
        # 2-D list obtained by keeping only the allowed POS tags in words_no_stop_words
        self.words_all_filters = None

    def analyze(self, text, window=2, lower=False, vertex_source='all_filters', edge_source='no_stop_words', pagerank_config={'alpha': 0.85,}):
        """
        Analyze the text; this is where the algorithm is applied.
        :param text: the text
        :param window: int, window size used to build the edges between words, default 2
        :param lower: whether to lowercase English text, default False
        :param vertex_source: which of words_no_filter, words_no_stop_words, words_all_filters supplies the nodes of the PageRank graph. Default `'all_filters'`; allowed values `'no_filter', 'no_stop_words', 'all_filters'`. The keywords also come from `vertex_source`
        :param edge_source: which of words_no_filter, words_no_stop_words, words_all_filters supplies the edges between the nodes of the PageRank graph. Default `'no_stop_words'`; allowed values `'no_filter', 'no_stop_words', 'all_filters'`. Edges are built together with the `window` parameter.
        :param pagerank_config: PageRank settings; the default damping factor is 0.85
        """
        self.text = text
        self.word_index = {}
        self.index_word = {}
        # Keyword list
        self.keywords = []
        self.graph = None

        result = self.seg.segment(text=text, lower=lower)
        self.sentences = result.sentences
        self.words_no_filter = result.words_no_filter
        self.words_no_stop_words = result.words_no_stop_words
        self.words_all_filters = result.words_all_filters

        # Debug output
        util.debug(20 * '*')
        util.debug('self.sentences in TextRank4Keyword:\n', ' || '.join(self.sentences))
        util.debug('self.words_no_filter in TextRank4Keyword:\n', self.words_no_filter)
        util.debug('self.words_no_stop_words in TextRank4Keyword:\n', self.words_no_stop_words)
        util.debug('self.words_all_filters in TextRank4Keyword:\n', self.words_all_filters)

        # The available modes
        options = ['no_filter', 'no_stop_words', 'all_filters']
        # Pick the word lists for nodes and edges, falling back to the defaults
        if vertex_source in options:
            _vertex_source = result['words_' + vertex_source]
        else:
            _vertex_source = result['words_all_filters']
        if edge_source in options:
            _edge_source = result['words_' + edge_source]
        else:
            _edge_source = result['words_no_stop_words']

        self.keywords = util.sort_words(_vertex_source, _edge_source, window=window, pagerank_config=pagerank_config)

    def get_keywords(self, num=6, word_min_len=1):
        """
        Return the num most important keywords whose length is at least word_min_len.
        :param num: number of keywords to return
        :param word_min_len: minimum keyword length
        :return: list of keywords
        """
        result = []
        count = 0
        for item in self.keywords:
            if count >= num:
                break
            if len(item.word) >= word_min_len:
                result.append(item)
                count += 1
        return result

    def get_keyphrases(self, keywords_num=12, min_occur_num=2):
        """
        Build candidate phrases from the top keywords_num keywords and keep those that
        occur at least min_occur_num times in the original text.
        :param keywords_num: number of top keywords used to build phrases
        :param min_occur_num: minimum number of occurrences of a phrase in the text
        :return: list of key phrases
        """
        # Set of keywords
        keywords_set = set([item.word for item in self.get_keywords(num=keywords_num, word_min_len=1)])
        # Set of key phrases
        keyphrases = set()
        for sentence in self.words_no_filter:
            one = []
            for word in sentence:
                if word in keywords_set:
                    one.append(word)
                else:
                    if len(one) > 1:
                        # Join a run of adjacent keywords into a phrase
                        keyphrases.add(''.join(one))
                    if len(one) == 0:
                        continue
                    else:
                        one = []
            # Handle a run that reaches the end of the sentence
            if len(one) > 1:
                keyphrases.add(''.join(one))
        # Keep phrases that occur at least min_occur_num times in the original text
        return [phrase for phrase in keyphrases if self.text.count(phrase) >= min_occur_num]

# Main module
if __name__ == '__main__':
    # Empty statement that keeps the program structure complete
    pass
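The phrase-building rule in get_keyphrases (join runs of adjacent keywords, then keep the runs that occur often enough in the raw text) can be isolated as below. This is a standalone sketch rather than the library code: `merge_keyphrases` is a hypothetical name, and the tokenization and keyword set are written by hand, not produced by jieba or by the ranking step.

```python
def merge_keyphrases(sentences, keywords, text, min_occur_num=2):
    """Join runs of two or more adjacent keywords and filter by frequency."""
    keywords = set(keywords)
    phrases = set()
    for sentence in sentences:
        run = []
        for word in sentence:
            if word in keywords:
                run.append(word)
            else:
                if len(run) > 1:
                    phrases.add(''.join(run))
                run = []
        # A run may end together with the sentence.
        if len(run) > 1:
            phrases.add(''.join(run))
    return sorted(p for p in phrases if text.count(p) >= min_occur_num)


text = '张杰(微博)、谢娜(微博)夫妇'
# Hand-written tokens for the text above (punctuation already dropped).
sentences = [['张杰', '微', '博', '谢娜', '微', '博', '夫妇']]
print(merge_keyphrases(sentences, keywords={'微', '博'}, text=text))
```

Here the segmenter has split 微博 into two single-character words, both of which are keywords, so the run is glued back together as 微博. A run only counts as a phrase if the joined string also appears in the original text at least min_occur_num times, which discards runs whose words were separated by punctuation in the source.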

TextRank4Sentence.py: contains the class used to extract key sentences.


# -*- encoding:utf-8 -*-
"""
@author:   letian
@homepage: http://www.letiantian.me
@github:   https://github.com/someus/
"""
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import networkx as nx
import numpy as np

from . import util
from .Segmentation import Segmentation


class TextRank4Sentence(object):

    def __init__(self, stop_words_file=None,
                 allow_speech_tags=util.allow_speech_tags,
                 delimiters=util.sentence_delimiters):
        """
        Keyword arguments:
        stop_words_file  --  str, path of the stop-word file; if not a str, the default stop-word file is used
        delimiters       --  default `?!;？！。；…\n`, used to split the text into sentences.
        Object Var:
        self.sentences               --  list of sentences.
        self.words_no_filter         --  2-D list obtained by segmenting every sentence in sentences.
        self.words_no_stop_words     --  2-D list obtained by removing the stop words from words_no_filter.
        self.words_all_filters       --  2-D list obtained by keeping only the allowed POS tags in words_no_stop_words.
        """
        self.seg = Segmentation(stop_words_file=stop_words_file,
                                allow_speech_tags=allow_speech_tags,
                                delimiters=delimiters)

        self.sentences = None
        self.words_no_filter = None  # 2-D list
        self.words_no_stop_words = None
        self.words_all_filters = None

        self.key_sentences = None

    def analyze(self, text, lower=False,
                source='no_stop_words',
                sim_func=util.get_similarity,
                pagerank_config={'alpha': 0.85, }):
        """
        Keyword arguments:
        text                 --  the text, a string.
        lower                --  whether to lowercase the text. Default False.
        source               --  which of words_no_filter, words_no_stop_words, words_all_filters is used to compute the similarity between sentences.
                                 Default `'no_stop_words'`; allowed values `'no_filter', 'no_stop_words', 'all_filters'`.
        sim_func             --  the function used to compute sentence similarity.
        """

        self.key_sentences = []

        result = self.seg.segment(text=text, lower=lower)
        self.sentences = result.sentences
        self.words_no_filter = result.words_no_filter
        self.words_no_stop_words = result.words_no_stop_words
        self.words_all_filters = result.words_all_filters

        options = ['no_filter', 'no_stop_words', 'all_filters']
        if source in options:
            _source = result['words_' + source]
        else:
            _source = result['words_no_stop_words']

        self.key_sentences = util.sort_sentences(sentences=self.sentences,
                                                 words=_source,
                                                 sim_func=sim_func,
                                                 pagerank_config=pagerank_config)

    def get_key_sentences(self, num=6, sentence_min_len=6):
        """Return the num most important sentences whose length is at least sentence_min_len; they form the summary.
        Return:
        a list of sentences.
        """
        result = []
        count = 0
        for item in self.key_sentences:
            if count >= num:
                break
            if len(item['sentence']) >= sentence_min_len:
                result.append(item)
                count += 1
        return result


if __name__ == '__main__':
    pass
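get_key_sentences returns sentences in descending order of weight, which is not necessarily the order in which they appear in the text. The sketch below repeats the selection rule on plain dicts and adds an optional re-sort by index. That `document_order` flag is an extra assumed here for illustration, as are the function name `pick_key_sentences` and the sample data; none of them come from textrank4zh.

```python
def pick_key_sentences(ranked, num=6, sentence_min_len=6, document_order=False):
    """Keep the first num sufficiently long sentences of a weight-sorted list."""
    picked = []
    for item in ranked:
        if len(picked) >= num:
            break
        if len(item['sentence']) >= sentence_min_len:
            picked.append(item)
    if document_order:
        # Hypothetical extra step: restore the original reading order.
        picked = sorted(picked, key=lambda item: item['index'])
    return picked


# Made-up ranking result, already sorted by descending weight.
ranked = [
    {'index': 4, 'sentence': 'a fairly long sentence', 'weight': 0.40},
    {'index': 2, 'sentence': 'short', 'weight': 0.30},
    {'index': 0, 'sentence': 'another long sentence', 'weight': 0.20},
    {'index': 7, 'sentence': 'yet another long one', 'weight': 0.10},
]
summary = pick_key_sentences(ranked, num=2, document_order=True)
print([item['index'] for item in summary])
```

The second entry is skipped because it is shorter than sentence_min_len, so the two selected sentences are those with index 4 and 0; with document_order=True they are returned as 0 then 4, which usually reads better as a summary.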

2 Using the textrank4zh module

2.1 Installing textrank4zh

Several ways of installing a Python module are listed here for reference.


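1) python setup.py install --user
2) sudo python setup.py install
3) pip install textrank4zh --user
4) sudo pip install textrank4zh

textrank4zh works under both Python 2 and Python 3. The modules it depends on must satisfy:

jieba >= 0.35; numpy >= 1.7.1; networkx >= 1.9.1

Note that the util.py listing in section 1 calls nx.from_numpy_matrix, which was removed in networkx 3.0, so that code needs networkx < 3.0 unless the call is replaced with nx.from_numpy_array.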
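Before running the examples it can help to see which versions of these dependencies are actually installed. The following is a small sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper names `installed_version` and `report` are made up here, and the script only reports versions, it does not compare them against the minimums.

```python
from importlib import metadata

# Minimum versions as listed above.
REQUIRED = {'jieba': '0.35', 'numpy': '1.7.1', 'networkx': '1.9.1'}


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(required=REQUIRED):
    """Build one human-readable line per required package."""
    lines = []
    for name, minimum in sorted(required.items()):
        found = installed_version(name)
        status = found if found is not None else 'not installed'
        lines.append('%s: required >= %s, found %s' % (name, minimum, status))
    return lines


for line in report():
    print(line)
```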

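2.2 textrank4zh usage examples

The module is simple to use, so the examples are given directly as code. The code was tested by the author under Python 3.

1) Extracting keywords, key phrases and key sentences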


#-*-coding:utf-8-*-
"""
@author:taoshouzheng
@time:2018/5/18 8:20
@email:tsz1216@sina.com
"""
# System module
import sys

# Python 2 only: reload sys and set the default encoding to UTF-8. The setting is
# valid for this run only, because setdefaultencoding is deleted from sys once the
# interpreter has started. On Python 3 `reload` is not a builtin, so the error is
# caught and the block does nothing.
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except Exception:
    pass

"""
Demonstrates the main features of textrank4zh:
extracting keywords
extracting key phrases
extracting a summary (key sentences)
"""

# Encoding-aware file reading
import codecs
# Classes for keyword extraction and summary generation
from textrank4zh import TextRank4Keyword, TextRank4Sentence

# Text file to read: a news article
file = r'C:\Users\Tao Shouzheng\Desktop\01.txt'
# Open and read the file
text = codecs.open(file, 'r', 'utf-8').read()

# Create the keyword extractor
tr4w = TextRank4Keyword()
# Analyze the text with a window size of 2 and lowercase English words
tr4w.analyze(text=text, lower=True, window=2)

# Output
print('关键词为：')  # "Keywords:"
# Take the top 20 keywords
for item in tr4w.get_keywords(num=20, word_min_len=1):
    # Print each keyword and its weight
    print(item.word, item.weight)
print('\n')

print('关键短语为：')  # "Key phrases:"
# Build key phrases from the top 20 keywords
for phrase in tr4w.get_keyphrases(keywords_num=20, min_occur_num=2):
    print(phrase)
print('\n')

# Create the key-sentence extractor
tr4s = TextRank4Sentence()
# Lowercase English words, apply POS filtering and remove stop words
tr4s.analyze(text=text, lower=True, source='all_filters')

print('摘要为：')  # "Summary:"
# Take 3 sentences as the summary
for item in tr4s.get_key_sentences(num=3):
    # Print the index, weight and text of each sentence
    print(item.index, item.weight, item.sentence)

The output is:

关键词为：
媒体 0.02155864734852778
高圆圆 0.020220281898126486
微 0.01671909730824073
宾客 0.014328439104001788
赵又廷 0.014035488254875914
答谢 0.013759845912857732
谢娜 0.013361244496632448
现身 0.012724133346018603
记者 0.01227742092899235
新人 0.01183128428494362
北京 0.011686712993089671
博 0.011447168887452668
展示 0.010889176260920504
捧场 0.010507502237123278
礼物 0.010447275379792245
张杰 0.009558332870902892
当晚 0.009137982757893915
戴 0.008915271161035208
酒店 0.00883521621207796
外套 0.008822082954131174

关键短语为：
微博


摘要为：
0 0.07097195571711616 中新网北京12月1日电(记者 张曦) 30日晚，高圆圆和赵又廷在京举行答谢宴，诸多明星现身捧场，其中包括张杰(微博)、谢娜(微博)夫妇、何炅(微博)、蔡康永(微博)、徐克、张凯丽、黄轩(微博)等
6 0.05410372364148859 高圆圆身穿粉色外套，看到大批记者在场露出娇羞神色，赵又廷则戴着鸭舌帽，十分淡定，两人快步走进电梯，未接受媒体采访
27 0.04904283129838876 记者了解到，出席高圆圆、赵又廷答谢宴的宾客近百人，其中不少都是女方的高中同学

2) Comparing the three word-segmentation modes of textrank4zh


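The three segmentation modes are:

words_no_filter: plain segmentation; stop words are kept and no part-of-speech filtering is applied

words_no_stop_words: stop words are removed

words_all_filters (the default node source for keyword extraction): stop words are removed and part-of-speech filtering is applied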
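The relationship between the three modes can be sketched on a hand-tagged sentence, without calling jieba. The POS tags and the one-word stop-word set below are assumptions chosen by hand for illustration (real tags come from jieba.posseg and real stop words from stopwords.txt), and `three_views` is a hypothetical helper name.

```python
ALLOWED_TAGS = {'an', 'i', 'j', 'l', 'n', 'nr', 'nrfg', 'ns', 'nt', 'nz',
                't', 'v', 'vd', 'vn', 'eng'}


def three_views(tagged_words, stop_words, allowed_tags=ALLOWED_TAGS):
    """Return the no_filter, no_stop_words and all_filters word lists."""
    # Tokens tagged 'x' (punctuation, symbols) are dropped in every mode.
    no_filter = [w for w, tag in tagged_words if tag != 'x']
    no_stop_words = [w for w in no_filter if w not in stop_words]
    all_filters = [w for w, tag in tagged_words
                   if tag != 'x' and tag in allowed_tags and w not in stop_words]
    return no_filter, no_stop_words, all_filters


# Hand-assigned (word, tag) pairs; the tags are illustrative guesses.
tagged = [('答谢', 'v'), ('宴于', 'v'), ('晚上', 't'), ('8', 'm'),
          ('点', 'm'), ('开始', 'v'), ('。', 'x')]
stop_words = {'开始'}  # assumed stop word

for name, words in zip(('no_filter', 'no_stop_words', 'all_filters'),
                       three_views(tagged, stop_words)):
    print(name + ':', '/'.join(words))
```

Each mode is a subset of the previous one: removing stop words shrinks words_no_filter, and the POS filter then shrinks words_no_stop_words further by dropping tags such as numerals that are not in the allowed list.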


#-*-coding:utf-8-*-
"""
@author:taoshouzheng
@time:2018/5/18 14:52
@email:tsz1216@sina.com
"""

import codecs

from textrank4zh import TextRank4Keyword, TextRank4Sentence

import sys
# Python 2 only; on Python 3 `reload` is not a builtin and the error is caught
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except Exception:
    pass

"""Compare the results of the three segmentation modes"""

text = '这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。'
tr4w = TextRank4Keyword()

tr4w.analyze(text=text, lower=True, window=2)
# The text split into a list of sentences
print('sentences:')
for s in tr4w.sentences:
    print(s)
print('\n')

# Segment each sentence without any filtering
print('words_no_filter:')
# words is a list of words; tr4w.words_no_filter is a list of such lists
for words in tr4w.words_no_filter:
    print('/'.join(words))
print('\n')

# Word lists with stop words removed
print('words_no_stop_words:')
for words in tr4w.words_no_stop_words:
    print('/'.join(words))
print('\n')

# Word lists with stop words removed and POS filtering applied
print('words_all_filters:')
for words in tr4w.words_all_filters:
    print('/'.join(words))

The output is:


sentences:
这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足
答谢宴于晚上8点开始


words_no_filter:
这/间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
答谢/宴于/晚上/8/点/开始


words_no_stop_words:
间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
答谢/宴于/晚上/8/点


words_all_filters:
酒店/位于/北京/东三环/摆放/雕塑/文艺/气息
答谢/宴于/晚上
