Lucene-based case development: scraping the Zongheng novel reading page


My personal blog is now online at www.llwjy.com. Comments and criticism are welcome.

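The previous three posts covered scraping Zongheng's update list page, novel introduction page, and chapter list page. This post focuses on the most important one, the reading page. As before, we work from a single example URL: http://book.zongheng.com/chapter/362857/6001264.html .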

Page analysis

The page at the URL above looks like this:

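Like the chapter list page, the reading page does not expose its content through a simple right-click -> View Page Source, so we again use F12 -> Network -> Ctrl+F5 to find the page source. The result is shown in the screenshot below: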

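A quick search of the page source shows that the title, word count, and chapter content sit on lines 47, 141, and 145 respectively (the line numbers may differ slightly from page to page, so check them against the page you are actually looking at).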

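The regular expressions for these three parts are much the same as in the earlier posts, and the method has already been explained there, so only the final results are given here: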


    // Regex for the chapter content
    private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
    // Regex for the title
    private static final String TITLE = "chapterName=\"(.*?)\"";
    // Regex for the word count
    private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";
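To see how these three patterns behave, here is a minimal sketch that applies them with the standard `java.util.regex` classes. The helper `firstGroup` is a hypothetical stand-in for the project's `DoRegex.getFirstString`, whose implementation is not shown in this post. The HTML fragment is made up and only mimics the attributes the patterns look for. Using `Pattern.DOTALL` is my own assumption, so that `.` can span line breaks in the chapter text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReadPageRegexDemo {
    // The three patterns from the article, unchanged.
    static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
    static final String TITLE = "chapterName=\"(.*?)\"";
    static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";

    // Hypothetical stand-in for DoRegex.getFirstString: returns the requested
    // group of the first match, or an empty string when nothing matches.
    static String firstGroup(String source, String regex, int group) {
        // DOTALL (an assumption) lets "." match line breaks inside the chapter text.
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(source);
        return matcher.find() ? matcher.group(group) : "";
    }

    public static void main(String[] args) {
        // Made-up fragment, not the real Zongheng page source.
        String html = "<body chapterName=\"Chapter 1\">\n"
                + "<span itemprop=\"wordCount\">2048</span>\n"
                + "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">"
                + "<p>First line.</p>\n<p>Second line.</p></div>\n"
                + "</body>";
        System.out.println(firstGroup(html, TITLE, 1));
        System.out.println(firstGroup(html, WORDCOUNT, 1));
        System.out.println(firstGroup(html, CONTENT, 1));
    }
}
```

The lazy quantifier `(.*?)` stops at the first closing `</div>`, so this only works as long as the chapter content itself contains no nested `div`.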


**Run result**

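Looking at the screenshot of the run result, you may notice that the chapter content still contains some HTML tags. Because this case ultimately displays the content on a web page, I took a shortcut and left them in. If you need to remove the tags, you can replace them using String's replaceAll method.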
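The replaceAll approach mentioned above could look roughly like the sketch below. The class and method names are hypothetical, and the pattern `<[^>]*>` is a simple assumption that treats anything between angle brackets as a tag. It does not decode HTML entities and is not a substitute for a real HTML parser.

```java
class StripTagsDemo {
    // Removes everything that looks like an HTML tag and trims the result.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        // Made-up chapter fragment for illustration.
        String content = "<p>First paragraph.</p><p>Second <b>paragraph</b>.</p>";
        System.out.println(stripTags(content));
    }
}
```

If paragraph breaks should survive, one option is to replace `</p>` with a newline before stripping the remaining tags.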

Source code

For the latest source code, see: http://www.llwjy.com/source/com.lulei.crawl.novel.zongheng.ReadPage.html


    /**
     * @Description: Reading page
     */
    package com.lulei.crawl.novel.zongheng;

    import java.io.IOException;
    import java.util.HashMap;

    import com.lulei.crawl.CrawlBase;
    import com.lulei.util.DoRegex;
    import com.lulei.util.ParseUtil;

    public class ReadPage extends CrawlBase {
        private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
        private static final String TITLE = "chapterName=\"(.*?)\"";
        private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";
        private String pageUrl;
        private static HashMap<String, String> params;

        /**
         * Add request headers so the request looks like it comes from a browser
         */
        static {
            params = new HashMap<String, String>();
            params.put("Referer", "http://book.zongheng.com");
            params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
        }

        public ReadPage(String url) throws IOException {
            readPageByGet(url, "utf-8", params);
            this.pageUrl = url;
        }

        /**
         * @return
         * @Author:lulei
         * @Description: chapter title
         */
        private String getTitle() {
            return DoRegex.getFirstString(getPageSourceCode(), TITLE, 1);
        }

        /**
         * @return
         * @Author:lulei
         * @Description: word count
         */
        private int getWordCount() {
            String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
            return ParseUtil.parseStringToInt(wordCount, 0);
        }

        /**
         * @return
         * @Author:lulei
         * @Description: chapter body text
         */
        private String getContent() {
            return DoRegex.getFirstString(getPageSourceCode(), CONTENT, 1);
        }

        public static void main(String[] args) throws IOException {
            ReadPage readPage = new ReadPage("http://book.zongheng.com/chapter/362857/6001264.html");
            System.out.println(readPage.pageUrl);
            System.out.println(readPage.getTitle());
            System.out.println(readPage.getWordCount());
            System.out.println(readPage.getContent());
        }
    }
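The class above relies on `CrawlBase.readPageByGet`, which belongs to the author's project and is not shown in this post. For readers without that code, the sketch below shows one way a GET request could carry the same Referer and User-Agent headers using the JDK's `HttpURLConnection`. It is an assumption about what such a helper might do, not the actual `CrawlBase` implementation, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

class HeaderRequestSketch {
    // Builds a GET connection with the given headers set, without sending it.
    static HttpURLConnection prepare(String url, Map<String, String> headers) {
        try {
            // openConnection() only creates the connection object; nothing is
            // sent until connect() is called or the response is read.
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            for (Map.Entry<String, String> header : headers.entrySet()) {
                conn.setRequestProperty(header.getKey(), header.getValue());
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Referer", "http://book.zongheng.com");
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
        HttpURLConnection conn = prepare("http://book.zongheng.com/chapter/362857/6001264.html", headers);
        // Reading conn.getInputStream() here would actually perform the request.
        System.out.println(conn.getRequestProperty("Referer"));
    }
}
```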

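PS: I recently noticed that other sites may repost this blog without a link back to the source. For more posts in the Lucene-based case development series, see http://www.aiuxian.com/catalog/p-322798.html, or visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html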
