转载请注明出处:http://www.voidcn.com/article/p-msdnvuff-bkz.html
http://www.llwjy.com/blogdetail/efda32f346445dd8423a942aa4c8c2cd.html
个人博客站已经上线了,网址 www.llwjy.com 欢迎各位吐槽
首先和大家说一生抱歉,由于最近经常在外面出差,博客断更了很长时间,后面不出意外的话,博客会恢复更新。
在上次的博客中已经介绍了纵横小说的数据库表结构,这里需要说明的是,我在设计数据表的时候,取消了数据表之间的外键,至于为什么这样做这里就不再多说,感兴趣的可以自行百度下。下面我们就开始今天的介绍:
模版类
在介绍数据库的操作之前,我们首先看一下定义的模版(javabean),这里定义了四个模版分别为抓取入口信息模版、小说简介页模版、小说章节列表模版、小说阅读页模版,类中只有一些简单的set和put方法,下面就看下具体的代码实现:
1.CrawlListInfo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31 1 /**
2 *@Description:
3 */
4package com.lulei.crawl.novel.zongheng.model;
5
6public class CrawlListInfo {
7 private String url;
8 private String info;
9 private int frequency;
10
11 public String getUrl() {
12 return url;
13 }
14 public void setUrl(String url) {
15 this.url = url;
16 }
17 public String getInfo() {
18 return info;
19 }
20 public void setInfo(String info) {
21 this.info = info;
22 }
23 public int getFrequency() {
24 return frequency;
25 }
26 public void setFrequency(int frequency) {
27 this.frequency = frequency;
28 }
29}
30
31
1
2 12.NovelIntroModel
2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80 1 /**
2 *@Description:
3 */
4package com.lulei.crawl.novel.zongheng.model;
5
6public class NovelIntroModel {
7 private String md5Id;
8 private String name;
9 private String author;
10 private String description;
11 private String type;
12 private String lastChapter;
13 private String chapterlisturl;
14 private int wordCount;
15 private String keyWords;
16 private int chapterCount;
17
18 public String getMd5Id() {
19 return md5Id;
20 }
21 public void setMd5Id(String md5Id) {
22 this.md5Id = md5Id;
23 }
24 public String getName() {
25 return name;
26 }
27 public void setName(String name) {
28 this.name = name;
29 }
30 public String getAuthor() {
31 return author;
32 }
33 public void setAuthor(String author) {
34 this.author = author;
35 }
36 public String getDescription() {
37 return description;
38 }
39 public void setDescription(String description) {
40 this.description = description;
41 }
42 public String getType() {
43 return type;
44 }
45 public void setType(String type) {
46 this.type = type;
47 }
48 public String getLastChapter() {
49 return lastChapter;
50 }
51 public void setLastChapter(String lastChapter) {
52 this.lastChapter = lastChapter;
53 }
54 public String getChapterlisturl() {
55 return chapterlisturl;
56 }
57 public void setChapterlisturl(String chapterlisturl) {
58 this.chapterlisturl = chapterlisturl;
59 }
60 public int getWordCount() {
61 return wordCount;
62 }
63 public void setWordCount(int wordCount) {
64 this.wordCount = wordCount;
65 }
66 public String getKeyWords() {
67 return keyWords;
68 }
69 public void setKeyWords(String keyWords) {
70 this.keyWords = keyWords;
71 }
72 public int getChapterCount() {
73 return chapterCount;
74 }
75 public void setChapterCount(int chapterCount) {
76 this.chapterCount = chapterCount;
77 }
78}
79
80
1
2 13.NovelChapterModel
2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31 1 /**
2 *@Description:
3 */
4package com.lulei.crawl.novel.zongheng.model;
5
6
7public class NovelChapterModel {
8 private String url;
9 private int chapterId;
10 private long time;
11 public String getUrl() {
12 return url;
13 }
14 public void setUrl(String url) {
15 this.url = url;
16 }
17 public int getChapterId() {
18 return chapterId;
19 }
20 public void setChapterId(int chapterId) {
21 this.chapterId = chapterId;
22 }
23 public long getTime() {
24 return time;
25 }
26 public void setTime(long time) {
27 this.time = time;
28 }
29}
30
31
1
2 14.NovelReadModel
2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30 1 /**
2 *@Description:
3 */
4package com.lulei.crawl.novel.zongheng.model;
5
6public class NovelReadModel extends NovelChapterModel {
7 private String title;
8 private int wordCount;
9 private String content;
10 public String getTitle() {
11 return title;
12 }
13 public void setTitle(String title) {
14 this.title = title;
15 }
16 public int getWordCount() {
17 return wordCount;
18 }
19 public void setWordCount(int wordCount) {
20 this.wordCount = wordCount;
21 }
22 public String getContent() {
23 return content;
24 }
25 public void setContent(String content) {
26 this.content = content;
27 }
28}
29
30
1
2 1 **数据库操作**
2
这里的数据库操作我们使用博客《基于lucene的案例开发:数据库连接池》介绍的数据库连接池,在采集这个业务过程中,主要是插入和查询操作,当然还有记录的状态值的更新操作,下面我们就每一个操作介绍一个方法,方面大家理解如何使用我们自己的数据库连接池操作来完成数据库的增删改查操作。
1.数据表查询:随机获取一条记录
我们之后的爬虫希望可以做成分布式的采集,因此这里我们在获取简介页的URL时候,我们可以每次获取一个随机值,这样在线程之间出现同时采集一个URL的情况就会大大降低,至于Mysql中的随机我们可以使用 order by rand() limit n 来获取前n条记录,其他的数据库实现方式稍微有点差异。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22 1 /**
2 * @param state
3 * @return
4 * @Author:lulei
5 * @Description: 随机获取一个简介url
6 */
7 public String getRandIntroPageUrl(int state) {
8 DBServer dbServer = new DBServer(POOLNAME);
9 try {
10 String sql = "select * from novelinfo where state = '" + state + "' order by rand() limit 1";
11 ResultSet rs = dbServer.select(sql);
12 while (rs.next()) {
13 return rs.getString("url");
14 }
15 } catch (Exception e) {
16 e.printStackTrace();
17 } finally{
18 dbServer.close();
19 }
20 return null;
21 }
22
1
2 1 在这个方法中,我们直接使用DBServer中的select(String sql)方法即可执行对应的sql语句,他的返回值就是查询的结果集。
2
2.数据表更新:修改简介页的抓取状态
在简介页一次采集完成之后或者更新列表页检测到该简介页有更新的时候,我们需要对小说的简介页的抓取状态进行修改,标识这个简介页已经完成采集或需要采集,我们直接使用DBServer中的update(String sql)方法即可执行对应的sql语句。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18 1 /**
2 * @param md5Id
3 * @param state
4 * @Author:lulei
5 * @Description: 修改简介页的抓取状态
6 */
7 public void updateInfoState(String md5Id, int state) {
8 DBServer dbServer = new DBServer(POOLNAME);
9 try {
10 String sql = "update novelinfo set state = '" + state + "' where id = '" + md5Id + "'";
11 dbServer.update(sql);
12 } catch (Exception e) {
13 e.printStackTrace();
14 } finally{
15 dbServer.close();
16 }
17 }
18
1
2 13.数据表插入:保存小说阅读页信息
2
在完成小说阅读页数据解析之后,我们需要将解析后的数据持久化到数据库中,这里我们可以使用DBServer中的insert(String table, String columns, HashMap<Integer, Object> params)方法即可执行相关的插入操作。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36 1 /**
2 * @param novel
3 * @Author:lulei
4 * @Description: 保存小说阅读页信息
5 */
6 public void saveNovelRead(NovelReadModel novel) {
7 if (novel == null) {
8 return;
9 }
10 DBServer dbServer = new DBServer(POOLNAME);
11 try {
12 HashMap<Integer, Object> params = new HashMap<Integer, Object>();
13 int i = 1;
14 String md5Id = ParseMD5.parseStrToMd5L32(novel.getUrl());
15 //如果已经存在,则直接返回
16 if (haveReadUrl(md5Id)) {
17 return;
18 }
19 long now = System.currentTimeMillis();
20 params.put(i++, md5Id);
21 params.put(i++, novel.getUrl());
22 params.put(i++, novel.getTitle());
23 params.put(i++, novel.getWordCount());
24 params.put(i++, novel.getChapterId());
25 params.put(i++, novel.getContent());
26 params.put(i++, novel.getTime());
27 params.put(i++, now);
28 params.put(i++, now);
29 dbServer.insert("novelchapterdetail", "id,url,title,wordcount,chapterid,content,chaptertime,createtime,updatetime", params);
30 } catch (Exception e) {
31 e.printStackTrace();
32 } finally{
33 dbServer.close();
34 }
35 }
36
1
2 1 **庐山真面目**
2
完整的纵横小说数据库操作类代码如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380 1 /**
2 *@Description: 纵横中文小说数据库操作
3 */
4package com.lulei.db.novel.zongheng;
5
6import java.sql.ResultSet;
7import java.util.ArrayList;
8import java.util.HashMap;
9import java.util.List;
10
11import com.lulei.crawl.novel.zongheng.model.CrawlListInfo;
12import com.lulei.crawl.novel.zongheng.model.NovelChapterModel;
13import com.lulei.crawl.novel.zongheng.model.NovelIntroModel;
14import com.lulei.crawl.novel.zongheng.model.NovelReadModel;
15import com.lulei.db.manager.DBServer;
16import com.lulei.util.ParseMD5;
17
18public class ZonghengDb {
19 private static final String POOLNAME = "proxool.test";
20
21 /**
22 * @param urls
23 * @Author:lulei
24 * @Description: 保存更新列表采集到的URL
25 */
26 public void saveInfoUrls(List<String> urls) {
27 if (urls == null || urls.size() < 1) {
28 return;
29 }
30 for (String url : urls) {
31 String md5Id = ParseMD5.parseStrToMd5L32(url);
32 if (haveInfoUrl(md5Id)) {
33 updateInfoState(md5Id, 1);
34 } else {
35 insertInfoUrl(md5Id, url);
36 }
37 }
38 }
39
40 /**
41 * @param state
42 * @return
43 * @Author:lulei
44 * @Description: 随机获取一个简介url
45 */
46 public String getRandIntroPageUrl(int state) {
47 DBServer dbServer = new DBServer(POOLNAME);
48 try {
49 String sql = "select * from novelinfo where state = '" + state + "' order by rand() limit 1";
50 ResultSet rs = dbServer.select(sql);
51 while (rs.next()) {
52 return rs.getString("url");
53 }
54 } catch (Exception e) {
55 e.printStackTrace();
56 } finally{
57 dbServer.close();
58 }
59 return null;
60 }
61
62 /**
63 * @param state
64 * @return
65 * @Author:lulei
66 * @Description: 随机获取一个章节信息
67 */
68 public NovelChapterModel getRandReadPageUrl(int state) {
69 DBServer dbServer = new DBServer(POOLNAME);
70 try {
71 String sql = "select * from novelchapter where state = '" + state + "' order by rand() limit 1";
72 ResultSet rs = dbServer.select(sql);
73 while (rs.next()) {
74 NovelChapterModel chapter = new NovelChapterModel();
75 chapter.setChapterId(rs.getInt("chapterid"));
76 chapter.setTime(rs.getLong("chaptertime"));
77 chapter.setUrl(rs.getString("url"));
78 return chapter;
79 }
80 } catch (Exception e) {
81 e.printStackTrace();
82 } finally{
83 dbServer.close();
84 }
85 return null;
86 }
87
88 /**
89 * @param novel
90 * @Author:lulei
91 * @Description: 保存小说阅读页信息
92 */
93 public void saveNovelRead(NovelReadModel novel) {
94 if (novel == null) {
95 return;
96 }
97 DBServer dbServer = new DBServer(POOLNAME);
98 try {
99 HashMap<Integer, Object> params = new HashMap<Integer, Object>();
100 int i = 1;
101 String md5Id = ParseMD5.parseStrToMd5L32(novel.getUrl());
102 //如果已经存在,则直接返回
103 if (haveReadUrl(md5Id)) {
104 return;
105 }
106 long now = System.currentTimeMillis();
107 params.put(i++, md5Id);
108 params.put(i++, novel.getUrl());
109 params.put(i++, novel.getTitle());
110 params.put(i++, novel.getWordCount());
111 params.put(i++, novel.getChapterId());
112 params.put(i++, novel.getContent());
113 params.put(i++, novel.getTime());
114 params.put(i++, now);
115 params.put(i++, now);
116 dbServer.insert("novelchapterdetail", "id,url,title,wordcount,chapterid,content,chaptertime,createtime,updatetime", params);
117 } catch (Exception e) {
118 e.printStackTrace();
119 } finally{
120 dbServer.close();
121 }
122 }
123
124 /**
125 * @return
126 * @Author:lulei
127 * @Description: 获取监控的更新列表页
128 */
129 public List<CrawlListInfo> getCrawlListInfos(){
130 List<CrawlListInfo> infos = new ArrayList<CrawlListInfo>();
131 DBServer dbServer = new DBServer(POOLNAME);
132 try {
133 String sql = "select * from crawllist where state = '1'";
134 ResultSet rs = dbServer.select(sql);
135 while (rs.next()) {
136 CrawlListInfo info = new CrawlListInfo();
137 infos.add(info);
138 info.setFrequency(rs.getInt("frequency"));
139 info.setInfo(rs.getString("info"));
140 info.setUrl(rs.getString("url"));
141 }
142 } catch (Exception e) {
143 e.printStackTrace();
144 } finally{
145 dbServer.close();
146 }
147 return infos;
148 }
149
150 /**
151 * @param bean
152 * @Author:lulei
153 * @Description: 更新简介页记录
154 */
155 public void updateInfo(NovelIntroModel bean) {
156 if (bean == null) {
157 return;
158 }
159 DBServer dbServer = new DBServer(POOLNAME);
160 try {
161 HashMap<Integer, Object> params = new HashMap<Integer, Object>();
162 int i = 1;
163 params.put(i++, bean.getName());
164 params.put(i++, bean.getAuthor());
165 params.put(i++, bean.getDescription());
166 params.put(i++, bean.getType());
167 params.put(i++, bean.getLastChapter());
168 params.put(i++, bean.getChapterCount());
169 params.put(i++, bean.getChapterlisturl());
170 params.put(i++, bean.getWordCount());
171 params.put(i++, bean.getKeyWords());
172 long now = System.currentTimeMillis();
173 params.put(i++, now);
174 params.put(i++, "0");
175 String columns = "name, author, description, type, lastchapter, chaptercount, chapterlisturl, wordcount, keywords, updatetime, state";
176 String condition = "where id = '" + bean.getMd5Id() + "'";
177 dbServer.update("novelinfo", columns, condition, params);
178 } catch (Exception e) {
179 e.printStackTrace();
180 } finally{
181 dbServer.close();
182 }
183 }
184
185 /**
186 * @param chapters
187 * @Author:lulei
188 * @Description: 保存章节列表信息
189 */
190 public void saveChapters(List<String[]> chapters) {
191 if (chapters == null) {
192 return;
193 }
194 DBServer dbServer = new DBServer(POOLNAME);
195 try {
196 for (int i = 0; i < chapters.size(); i++) {
197 String[] chapter = chapters.get(i);
198 if (chapter.length != 4) {
199 continue;
200 }
201 //name、wordcount、time、url
202 String md5Id = ParseMD5.parseStrToMd5L32(chapter[3]);
203 if (!haveChapterUrl(md5Id)) {
204 insertChapterUrl(chapter, i);
205 }
206 }
207 } catch (Exception e) {
208 e.printStackTrace();
209 } finally{
210 dbServer.close();
211 }
212 }
213
214 /**
215 * @param md5Id
216 * @param state
217 * @Author:lulei
218 * @Description: 修改简介页的抓取状态
219 */
220 public void updateInfoState(String md5Id, int state) {
221 DBServer dbServer = new DBServer(POOLNAME);
222 try {
223 String sql = "update novelinfo set state = '" + state + "' where id = '" + md5Id + "'";
224 dbServer.update(sql);
225 } catch (Exception e) {
226 e.printStackTrace();
227 } finally{
228 dbServer.close();
229 }
230 }
231
232 /**
233 * @param md5Id
234 * @param state
235 * @Author:lulei
236 * @Description: 更新章节列表采集状态
237 */
238 public void updateChapterState(String md5Id, int state) {
239 DBServer dbServer = new DBServer(POOLNAME);
240 try {
241 String sql = "update novelchapter set state = '" + state + "' where id = '" + md5Id + "'";
242 dbServer.update(sql);
243 } catch (Exception e) {
244 e.printStackTrace();
245 } finally{
246 dbServer.close();
247 }
248 }
249
250 /**
251 * @param md5Id
252 * @param url
253 * @Author:lulei
254 * @Description: 新增一个抓取简介页
255 */
256 private void insertInfoUrl(String md5Id, String url) {
257 DBServer dbServer = new DBServer(POOLNAME);
258 try {
259 HashMap<Integer, Object> params = new HashMap<Integer, Object>();
260 int i = 1;
261 params.put(i++, md5Id);
262 params.put(i++, url);
263 long now = System.currentTimeMillis();
264 params.put(i++, now);
265 params.put(i++, now);
266 params.put(i++, "1");
267 dbServer.insert("novelinfo", "id, url, createtime, updatetime, state", params);
268 } catch (Exception e) {
269 e.printStackTrace();
270 } finally{
271 dbServer.close();
272 }
273 }
274
275 /**
276 * @param md5Id
277 * @return
278 * @Author:lulei
279 * @Description: 判断简介页是否存在
280 */
281 private boolean haveInfoUrl(String md5Id) {
282 DBServer dbServer = new DBServer(POOLNAME);
283 try {
284 ResultSet rs = dbServer.select("select sum(1) as count from novelinfo where id = '" + md5Id + "'");
285 if (rs.next()) {
286 int count = rs.getInt("count");
287 return count > 0;
288 }
289 return false;
290 } catch (Exception e) {
291 e.printStackTrace();
292 return true;
293 } finally{
294 dbServer.close();
295 }
296 }
297
298 /**
299 * @param md5Id
300 * @return
301 * @Author:lulei
302 * @Description: 判断阅读页信息是否存在
303 */
304 private boolean haveReadUrl(String md5Id) {
305 DBServer dbServer = new DBServer(POOLNAME);
306 try {
307 ResultSet rs = dbServer.select("select sum(1) as count from novelchapterdetail where id = '" + md5Id + "'");
308 if (rs.next()) {
309 int count = rs.getInt("count");
310 return count > 0;
311 }
312 return false;
313 } catch (Exception e) {
314 e.printStackTrace();
315 return true;
316 } finally{
317 dbServer.close();
318 }
319 }
320
321 /**
322 * @param chapter
323 * @param chapterId
324 * @Author:lulei
325 * @Description: 插入章节列表页信息
326 */
327 private void insertChapterUrl(String[] chapter, int chapterId) {
328 //name、wordcount、time、url
329 DBServer dbServer = new DBServer(POOLNAME);
330 try {
331 HashMap<Integer, Object> params = new HashMap<Integer, Object>();
332 int i = 1;
333 params.put(i++, ParseMD5.parseStrToMd5L32(chapter[3]));
334 params.put(i++, chapter[3]);
335 params.put(i++, chapter[0]);
336 params.put(i++, chapter[1]);
337 params.put(i++, chapterId);
338 params.put(i++, chapter[2]);
339 long now = System.currentTimeMillis();
340 params.put(i++, now);
341 params.put(i++, "1");
342 dbServer.insert("novelchapter", "id, url, title, wordcount, chapterid, chaptertime, createtime, state", params);
343 } catch (Exception e) {
344 e.printStackTrace();
345 } finally{
346 dbServer.close();
347 }
348 }
349
350 /**
351 * @param md5Id
352 * @return
353 * @Author:lulei
354 * @Description: 是否存在章节信息
355 */
356 private boolean haveChapterUrl(String md5Id) {
357 DBServer dbServer = new DBServer(POOLNAME);
358 try {
359 ResultSet rs = dbServer.select("select sum(1) as count from novelchapter where id = '" + md5Id + "'");
360 if (rs.next()) {
361 int count = rs.getInt("count");
362 return count > 0;
363 }
364 return false;
365 } catch (Exception e) {
366 e.printStackTrace();
367 return true;
368 } finally{
369 dbServer.close();
370 }
371 }
372
373 public static void main(String[] args) {
374 // TODO Auto-generated method stub
375
376 }
377
378}
379
380
1
2 1 对于上面的代码还希望大家可以认真的阅读下,里面有一些简单的去重操作;在下一篇博客中我们将会介绍如何基于这写数据库操作来实现分布式采集。
2
ps:最近发现其他网站可能会对博客转载,上面并没有源链接,如想查看更多关于基于lucene的案例开发 请点击这里。或访问网址http://blog.csdn.net/xiaojimanman/article/category/2841877 或 http://www.llwjy.com/blogtype/lucene.html
小福利
个人在极客学院上《Lucene案例开发》课程已经上线了(目前上线到第二课),欢迎大家吐槽~
第一课:Lucene概述
第二课:Lucene 常用功能介绍