Lucene-based case development: scraping the Zongheng novel chapter list


When reposting, please credit the source: http://blog.csdn.net/xiaojimanman/article/details/44854719

http://www.llwjy.com/blogdetail/ddcad68eeb91034247ffa331eb461213.html

My personal blog is now online at www.llwjy.com. Feedback and criticism are welcome.

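The previous two posts covered scraping the update list page and the summary page of Zongheng Chinese novels. This post covers scraping the chapter list page, which is the next hop obtained from the summary page. Example URL: http://book.zongheng.com/showchapter/362857.html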

Page analysis

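Analyzing the page, we can determine that the area shown in the figure below contains both the information we need to scrape and the next-hop URLs.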

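When we try to right-click and choose "View page source", we find that the page has disabled the right-click menu, so we need another way to view the source and analyze the page. On the current page, press F12 to open a new panel, the same "Inspect Element" panel mentioned in earlier posts. Select the Network tab and press Ctrl + F5; the following appears: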

Click the entry highlighted in red to view the page source, as shown below:

A quick look at the page source makes it easy to find the section that holds the chapter information, as shown below:

Each chapter's information is stored in a td tag, so the final regular expression for this part is:

<td class="chapterBean" chapterId="\d*" chapterName="(.*?)" chapterLevel="\d*" wordNum="(.*?)" updateTime="(.*?)"><a href="(.*?)" title=".*?">
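To make the four capture groups concrete, here is a minimal sketch that applies the expression with the standard java.util.regex classes. The class name ChapterRegexDemo, the helper firstChapter and the sample row are made up for illustration; the sample only imitates the markup described above and is not copied from the real page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChapterRegexDemo {
    // The expression from the text, written as a Java string literal.
    static final Pattern CHAPTER = Pattern.compile(
        "<td class=\"chapterBean\" chapterId=\"\\d*\" chapterName=\"(.*?)\" chapterLevel=\"\\d*\""
        + " wordNum=\"(.*?)\" updateTime=\"(.*?)\"><a href=\"(.*?)\" title=\".*?\">");

    // Returns {chapterName, wordNum, updateTime, href} for the first match, or null if none.
    static String[] firstChapter(String html) {
        Matcher m = CHAPTER.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        // Hypothetical sample row (assumption): same attribute order as the real markup.
        String sample = "<td class=\"chapterBean\" chapterId=\"1001\" chapterName=\"Chapter One\""
            + " chapterLevel=\"0\" wordNum=\"3021\" updateTime=\"2015-04-01 12:00\">"
            + "<a href=\"http://example.com/chapter/1.html\" title=\"Chapter One\">Chapter One</a></td>";
        for (String field : firstChapter(sample)) {
            System.out.println(field);
        }
    }
}
```

Because the expression spells out the attributes in a fixed order with single spaces between them, it will silently stop matching if the site reorders or reformats the td attributes.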

Code implementation

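To scrape the chapter list page we use the same approach as for the summary page: create a subclass of CrawlBase and use it to collect the relevant information. For request disguising and similar operations, see the post on the update list page. Here I only describe one method of the DoRegex class: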


List<String[]> getListArray(String dealStr, String regexStr, int[] array)


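The first parameter is the string to search, the second is the regular expression, and the third gives the positions within the regular expression of the information to extract (that is, which capture groups to return). Overall, the method returns every piece of information in the string that matches the expression.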
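The body of DoRegex.getListArray is not shown in this post, so the following is only a sketch of how a method with that signature could behave, assuming the int array holds capture group indices. The class name RegexListSketch is made up, and the real DoRegex may use different regex flags or error handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexListSketch {
    // For every match of regexStr in dealStr, collect the capture groups listed in array.
    static List<String[]> getListArray(String dealStr, String regexStr, int[] array) {
        List<String[]> result = new ArrayList<>();
        if (dealStr == null || regexStr == null || array == null) {
            return result; // nothing to search, return an empty list
        }
        Matcher m = Pattern.compile(regexStr).matcher(dealStr);
        while (m.find()) {
            String[] row = new String[array.length];
            for (int i = 0; i < array.length; i++) {
                // Throws IndexOutOfBoundsException if the index is not a group in the pattern.
                row[i] = m.group(array[i]);
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two matches; groups are returned in the order requested: group 2, then group 1.
        for (String[] row : getListArray("a=1;b=2;", "(\\w)=(\\d);", new int[] {2, 1})) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Passing the group indices as a separate array lets one method serve patterns with any number of groups and lets the caller pick the order of the returned fields.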

Run results

Source code

For the latest source code, see: http://www.llwjy.com/source/com.lulei.crawl.novel.zongheng.ChapterPage.html


/**
 * @Description: chapter list page
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;

public class ChapterPage extends CrawlBase {
    // Groups 1-4 capture chapterName, wordNum, updateTime and the chapter URL
    private static final String CHAPTER = "<td class=\"chapterBean\" chapterId=\"\\d*\" chapterName=\"(.*?)\" chapterLevel=\"\\d*\" wordNum=\"(.*?)\" updateTime=\"(.*?)\"><a href=\"(.*?)\" title=\".*?\">";
    private static final int[] ARRAY = {1, 2, 3, 4};
    private static HashMap<String, String> params;
    /**
     * Add request headers to disguise the request
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public ChapterPage(String url) throws IOException {
        readPageByGet(url, "utf-8", params);
    }

    public List<String[]> getChaptersInfo() {
        return DoRegex.getListArray(getPageSourceCode(), CHAPTER, ARRAY);
    }

    public static void main(String[] args) throws IOException {
        ChapterPage chapterPage = new ChapterPage("http://book.zongheng.com/showchapter/362857.html");
        for (String[] ss : chapterPage.getChaptersInfo()) {
            for (String s : ss) {
                System.out.println(s);
            }
            System.out.println("----------------------------------------------------   ");
        }
    }
}
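The internals of CrawlBase.readPageByGet are covered in the update list post rather than here. As a rough stand-in, the sketch below shows how the same two disguise headers could be attached to a GET request with the standard HttpURLConnection. The class name HeaderSketch and the method prepare are made up, and this is not claimed to be how CrawlBase does it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderSketch {
    // The same headers that ChapterPage puts into its params map.
    static final Map<String, String> HEADERS = new LinkedHashMap<>();
    static {
        HEADERS.put("Referer", "http://book.zongheng.com");
        HEADERS.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    // Builds a GET request carrying the headers; nothing is sent until the response is read.
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            for (Map.Entry<String, String> header : HEADERS.entrySet()) {
                conn.setRequestProperty(header.getKey(), header.getValue());
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare("http://book.zongheng.com/showchapter/362857.html");
        // Only inspects the prepared request; no network traffic happens here.
        System.out.println(conn.getRequestProperty("Referer"));
    }
}
```

Calling conn.getInputStream() on the returned connection would actually send the request, at which point the server sees a browser-like User-Agent and a Referer from the same site.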



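PS: I recently noticed that other sites may repost this blog without linking back to the source. For more posts in the Lucene-based case development series, visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html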
