Lucene-based case development: collecting Zongheng novel intro pages


If you repost this article, please credit the source: http://blog.csdn.net/xiaojimanman/article/details/44851419

http://www.llwjy.com/blogdetail/1b5ae17c513d127838c2e02102b5bb87.html

My personal blog is now online at www.llwjy.com. Comments and criticism are welcome.

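In the previous post we did a simple crawl of the update list page of Zongheng Chinese novels and obtained the URLs of the novel intro pages. This post therefore covers collecting the information on a Zongheng intro page. Example URL: http://book.zongheng.com/book/362857.html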

Page analysis

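Before starting, I suggest you take a look at the intro page yourself. The figure below shows only the region that holds the information we want to collect.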

[Figure: the region of the intro page that holds the information to collect]

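From this region we need the book name, author name, category, word count, synopsis, latest chapter name, chapter list page URL, and tags. Right-clicking the page and choosing "View page source" reveals the following: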

[Figure: the head section of the page source, with the novel's key information in meta tags]

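To do SEO for 360 search, Zongheng places some of the novel's key information in the head of the page, which greatly reduces the complexity of the regular expressions we have to write. Since these regular expressions are nearly identical, only the book name is explained here; the others can be found in the source code at the end. The book name is on line 33 of the screenshot above, and we need to extract the 飞仙诀 in the middle, so the regular expression for it is:

<meta name="og:novel:book_name" content="(.*?)"/>

The other fields use similar expressions. From this part of the source we can easily get the book name, author name, latest chapter, synopsis, category, and chapter list page URL. For the tags and the word count we have to keep analyzing the source further down. A quick search finds the source shown in the figure below, which contains both of these fields.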
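As a rough illustration of how that expression behaves, here is a minimal sketch using only java.util.regex. The HTML fragment, the author value, and the class name MetaRegexDemo are made up for this example and are not taken from the real page or from the author's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaRegexDemo {
    // Same expression as in the text: capture group 1 is the content attribute.
    private static final Pattern BOOK_NAME =
            Pattern.compile("<meta name=\"og:novel:book_name\" content=\"(.*?)\"/>");

    /** Returns the captured book name, or an empty string when nothing matches. */
    static String bookName(String html) {
        Matcher m = BOOK_NAME.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Hypothetical head fragment, modelled on the description above.
        String html = "<head>\n"
                + "<meta name=\"og:novel:book_name\" content=\"飞仙诀\"/>\n"
                + "<meta name=\"og:novel:author\" content=\"someAuthor\"/>\n"
                + "</head>";
        System.out.println(bookName(html));
    }
}
```

The lazy quantifier in (.*?) stops at the first closing quote, so the capture does not run on into the next meta tag.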

[Figure: the part of the page source that contains the word count and the tags]

The word count can be obtained with a simple regular expression:

<span itemprop="wordCount">(\d*?)</span>

The tags, however, take two steps.

Step 1: get the HTML that contains the keywords, i.e. line 234 in the figure above. The regular expression for this step is:

<div class="keyword">(.*?)</div>

Step 2: extract the wanted content from the HTML fragment obtained in step 1. The regular expression for this step is:

<a.*?>(.*?)</a>
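The word count expression and the two-step tag extraction can be sketched as follows. This is only an illustration with java.util.regex: the sample HTML, the tag names, and the class KeywordRegexDemo are assumptions, and the keyword block is assumed to sit on a single line (the dot does not match line breaks by default).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordRegexDemo {
    private static final Pattern WORD_COUNT =
            Pattern.compile("<span itemprop=\"wordCount\">(\\d*?)</span>");
    private static final Pattern KEYWORD_BLOCK =
            Pattern.compile("<div class=\"keyword\">(.*?)</div>");
    private static final Pattern KEYWORD_LINK = Pattern.compile("<a.*?>(.*?)</a>");

    /** Returns the word count, or 0 when the span is missing or empty. */
    static int wordCount(String html) {
        Matcher m = WORD_COUNT.matcher(html);
        if (m.find() && !m.group(1).isEmpty()) {
            return Integer.parseInt(m.group(1));
        }
        return 0;
    }

    /** Step 1 isolates the keyword div, step 2 collects the text of each link in it. */
    static List<String> keywords(String html) {
        List<String> result = new ArrayList<>();
        Matcher block = KEYWORD_BLOCK.matcher(html);
        if (!block.find()) {
            return result;
        }
        Matcher link = KEYWORD_LINK.matcher(block.group(1));
        while (link.find()) {
            result.add(link.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical fragment; the real page layout may differ.
        String html = "<span itemprop=\"wordCount\">123456</span>"
                + "<div class=\"keyword\"><a href=\"/tag/a\">tagA</a> "
                + "<a href=\"/tag/b\">tagB</a></div>";
        System.out.println(wordCount(html));
        System.out.println(keywords(html));
    }
}
```

Restricting the second expression to the fragment from step 1 is what keeps it from picking up every other link on the page.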

Code implementation

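All page crawlers other than the one for the update list page inherit from the CrawlBase class. For how to disguise the requests, see the previous post. Here we focus on two methods of the DoRegex class.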

Method 1:


String getFirstString(String dealStr, String regexStr, int n)


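The first parameter is the string to process, which here is the page source. The second is the regular expression describing what to look for. The third is the position within the regular expression of the content to extract. The method finds the first match of the regular expression in the given string and returns the requested part of that match.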
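The source of DoRegex is not shown in this post, so the following is only a guess at how a method with this signature could be written on top of java.util.regex. The class name FirstStringSketch, the choice to return an empty string when nothing matches, and the reading of n as a capture group index are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstStringSketch {
    /**
     * Hypothetical stand-in for DoRegex.getFirstString: returns capture
     * group n of the first match, or "" when there is no usable match.
     */
    static String getFirstString(String dealStr, String regexStr, int n) {
        if (dealStr == null || regexStr == null || n < 1) {
            return "";
        }
        Matcher m = Pattern.compile(regexStr).matcher(dealStr);
        // Only the first match is inspected; n must name an existing group.
        if (n > m.groupCount() || !m.find()) {
            return "";
        }
        String group = m.group(n);
        return group == null ? "" : group;
    }

    public static void main(String[] args) {
        // Made-up input with two matches; only the first one is returned.
        String html = "<meta name=\"og:novel:author\" content=\"first\"/>"
                + "<meta name=\"og:novel:author\" content=\"second\"/>";
        String regex = "<meta name=\"og:novel:author\" content=\"(.*?)\"/>";
        System.out.println(getFirstString(html, regex, 1));
    }
}
```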

Method 2:


String getString(String dealStr, String regexStr, String splitStr, int n)


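Parameters 1, 2 and 4 correspond to parameters 1, 2 and 3 of method 1, and splitStr is the separator. The method finds every piece of content in the given string that matches the regular expression and returns the results joined with the given separator.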
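In the same spirit, a method with this signature might look roughly like the sketch below. Again this is not the author's DoRegex code: the class name JoinedStringSketch, the empty-string fallback, and the handling of a null separator are assumptions made for illustration.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JoinedStringSketch {
    /**
     * Hypothetical stand-in for DoRegex.getString: collects capture group n
     * of every match and joins the results with splitStr.
     */
    static String getString(String dealStr, String regexStr, String splitStr, int n) {
        if (dealStr == null || regexStr == null || n < 1) {
            return "";
        }
        Matcher m = Pattern.compile(regexStr).matcher(dealStr);
        if (n > m.groupCount()) {
            return "";
        }
        StringJoiner joined = new StringJoiner(splitStr == null ? "" : splitStr);
        while (m.find()) {
            String group = m.group(n);
            if (group != null) {
                joined.add(group);
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Made-up keyword fragment, as it might come out of step 1 above.
        String keyHtml = "<a href=\"/tag/a\">tagA</a> <a href=\"/tag/b\">tagB</a>";
        System.out.println(getString(keyHtml, "<a.*?>(.*?)</a>", " ", 1));
    }
}
```

With a single space as the separator, as in the keyWords() method of the source code, the tags come back as one space-delimited string.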

Run result

[Figure: console output of the run]

Source code

With the two methods above explained, the source code below should be easy to follow.


/**
 * @Description: intro page
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

public class IntroPage extends CrawlBase {
    // Patterns are kept as published, including the trailing space after "/>".
    private static final String NAME = "<meta name=\"og:novel:book_name\" content=\"(.*?)\"/> ";
    private static final String AUTHOR = "<meta name=\"og:novel:author\" content=\"(.*?)\"/> ";
    private static final String DESC = "<meta property=\"og:description\" content=\"(.*?)\"/> ";
    private static final String TYPE = "<meta name=\"og:novel:category\" content=\"(.*?)\"/> ";
    private static final String LATESTCHAPTER = "<meta name=\"og:novel:latest_chapter_name\" content=\"(.*?)\"/> ";
    private static final String CHAPTERLISTURL = "<meta name=\"og:novel:read_url\" content=\"(.*?)\"/> ";
    private static final String WORDCOUNT = "<span itemprop=\"wordCount\">(\\d*?)</span>";
    // KEYWORDS isolates the tag block; KEYWORD extracts each tag from it.
    private static final String KEYWORDS = "<div class=\"keyword\">(.*?)</div>";
    private static final String KEYWORD = "<a.*?>(.*?)</a>";
    private String pageUrl;

    private static HashMap<String, String> params;
    /**
     * Add request headers to disguise the request
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public IntroPage(String url) throws IOException {
        readPageByGet(url, "utf-8", params);
        this.pageUrl = url;
    }

    /**
     * @return
     * @Author:lulei
     * @Description: get the book name
     */
    private String getName() {
        return DoRegex.getFirstString(getPageSourceCode(), NAME, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: get the author name
     */
    private String getAuthor() {
        return DoRegex.getFirstString(getPageSourceCode(), AUTHOR, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: book synopsis
     */
    private String getDesc() {
        return DoRegex.getFirstString(getPageSourceCode(), DESC, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: book category
     */
    private String getType() {
        return DoRegex.getFirstString(getPageSourceCode(), TYPE, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: latest chapter
     */
    private String getLatestChapter() {
        return DoRegex.getFirstString(getPageSourceCode(), LATESTCHAPTER, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: chapter list page URL
     */
    private String getChapterListUrl() {
        return DoRegex.getFirstString(getPageSourceCode(), CHAPTERLISTURL, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: word count
     */
    private int getWordCount() {
        String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
        return ParseUtil.parseStringToInt(wordCount, 0);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: tags
     */
    private String keyWords() {
        // Step 1: cut out the keyword block; step 2: join the tags with spaces.
        String keyHtml = DoRegex.getFirstString(getPageSourceCode(), KEYWORDS, 1);
        return DoRegex.getString(keyHtml, KEYWORD, " ", 1);
    }

    public static void main(String[] args) throws IOException {
        IntroPage intro = new IntroPage("http://book.zongheng.com/book/362857.html");
        System.out.println(intro.pageUrl);
        System.out.println(intro.getName());
        System.out.println(intro.getAuthor());
        System.out.println(intro.getDesc());
        System.out.println(intro.getType());
        System.out.println(intro.getLatestChapter());
        System.out.println(intro.getChapterListUrl());
        System.out.println(intro.getWordCount());
        System.out.println(intro.keyWords());
    }
}



PS: I recently noticed that other sites may repost this blog without a link to the source. To read more of the Lucene-based case development series, visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html
