Integrating Lucene into a Web Application


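Next, we develop a Web application that uses Lucene to search HTML documents stored on a file server. Before starting, prepare the following environment:

Eclipse IDE
Tomcat 5.0
Lucene Library
JDK 1.5

This example uses Eclipse to develop the Web application, which ultimately runs on Tomcat 5.0. With the required environment in place, we can begin development.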

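1. Create a dynamic Web project

In Eclipse, select File > New > Project, and then choose Dynamic Web Project in the dialog that appears, as shown in Figure 2.

Figure 2: Creating a dynamic Web project

After the dynamic Web project is created, you will see its structure, as shown in Figure 3. The project is named sample.dw.paper.lucene.

Figure 3: Structure of the dynamic Web project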

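2. Design the architecture of the Web project

In our design, the system is divided into the following four subsystems:

User interface: This subsystem provides the user interface through which users submit search requests to the Web application server and through which the search results are displayed. We implement it with a page named search.jsp.
Request manager: This subsystem manages the search requests sent from the client and dispatches them to the search subsystem. The search results come back from the search subsystem and are finally sent on to the user interface subsystem. We implement this subsystem with a servlet.
Search subsystem: This subsystem searches the index files and passes the search results to the request manager. We implement it with the API provided by Lucene.
Indexing subsystem: This subsystem creates the index for the HTML pages. We build it with the Lucene API and an HTML parser that ships with Lucene.

Figure 4 shows the details of our design. The user interface subsystem lives in the webContent directory, where you will find a page named search.jsp. The request manager subsystem is in the package sample.dw.paper.lucene.servlet, and the class SearchController implements its functionality. The search subsystem is in the package sample.dw.paper.lucene.search and contains two classes, SearchManager and SearchResultBean: the first implements the search, and the second describes the structure of a search result. The indexing subsystem is in the package sample.dw.paper.lucene.index, where the class IndexManager creates the index for the HTML files. It relies on the methods getTitle and getContent of the class HTMLDocParser, in the package sample.dw.paper.lucene.util, to parse the HTML pages.

Figure 4: Architecture design of the project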

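3. Implementing the subsystems

Having gone through the architecture, we now look at the implementation details.

User interface: This subsystem is implemented by a JSP file named search.jsp, which has two parts. The first part provides a form for submitting search requests to the Web application server, as shown in Figure 5. Note that the search request is sent to a servlet named SearchController. The mapping between the servlet name and its implementing class is specified in web.xml.

Figure 5: Submitting a search request to the Web server

The second part of the JSP displays the search results to the user, as shown in Figure 6:

Figure 6: Displaying the search results

Request manager: A servlet named SearchController implements this subsystem. Listing 6 gives the source code of this class.

Listing 6: Implementation of the request manager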

package sample.dw.paper.lucene.servlet;

import java.io.IOException;
import java.util.List;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import sample.dw.paper.lucene.search.SearchManager;

/**
 * This servlet is used to deal with the search request
 * and return the search results to the client
 */
public class SearchController extends HttpServlet {

    private static final long serialVersionUID = 1L;

    public void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        String searchWord = request.getParameter("searchWord");
        SearchManager searchManager = new SearchManager(searchWord);
        List searchResult = null;
        searchResult = searchManager.search();
        RequestDispatcher dispatcher = request.getRequestDispatcher("search.jsp");
        request.setAttribute("searchResult", searchResult);
        dispatcher.forward(request, response);
    }

    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        doPost(request, response);
    }
}

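In Listing 6, the doPost method reads the search word from the client request and creates an instance of SearchManager, a class defined in the search subsystem. It then calls SearchManager's search method. Finally, the search results are stored as a request attribute and the request is forwarded to search.jsp, which returns them to the client.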
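One detail of this flow is that request.getParameter returns null when the parameter is absent, and doPost hands that value directly to SearchManager. The following is a minimal sketch, not part of the original code, of a guard that could be applied first. The class name SearchWordGuard and the method normalize are hypothetical names introduced only for illustration.

```java
// Hypothetical helper, not part of the article's original code.
final class SearchWordGuard {

    private SearchWordGuard() {
    }

    /**
     * Returns the trimmed search word, or null when the input is
     * null or contains only whitespace.
     */
    static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim();
        return trimmed.length() == 0 ? null : trimmed;
    }

    public static void main(String[] args) {
        // Prints the cleaned-up word, then null for a blank input.
        System.out.println(normalize("  lucene  "));
        System.out.println(normalize("   "));
    }
}
```

With such a guard, doPost could forward to search.jsp with an empty result list when normalize returns null, instead of constructing a SearchManager with an unusable search word.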

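Search subsystem: This subsystem defines two classes, SearchManager and SearchResultBean. The first implements the search, and the second is a JavaBean that describes the structure of a search result. Listing 7 gives the source code of SearchManager.

Listing 7: Implementation of the search function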

package sample.dw.paper.lucene.search;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import sample.dw.paper.lucene.index.IndexManager;

/**
 * This class is used to search the
 * Lucene index and return search results
 */
public class SearchManager {

    private String searchWord;
    private IndexManager indexManager;
    private Analyzer analyzer;

    public SearchManager(String searchWord) {
        this.searchWord = searchWord;
        this.indexManager = new IndexManager();
        this.analyzer = new StandardAnalyzer();
    }

    /**
     * do search
     */
    public List search() {
        List searchResult = new ArrayList();
        // Create the index first if it does not exist yet
        if (false == indexManager.ifIndexExist()) {
            try {
                if (false == indexManager.createIndex()) {
                    return searchResult;
                }
            } catch (IOException e) {
                e.printStackTrace();
                return searchResult;
            }
        }

        IndexSearcher indexSearcher = null;
        try {
            indexSearcher = new IndexSearcher(indexManager.getIndexDir());
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }

        QueryParser queryParser = new QueryParser("content", analyzer);
        Query query = null;
        try {
            query = queryParser.parse(searchWord);
        } catch (ParseException e) {
            e.printStackTrace();
        }

        // Search only when both the query and the searcher are available
        if (null != query && null != indexSearcher) {
            try {
                Hits hits = indexSearcher.search(query);
                for (int i = 0; i < hits.length(); i++) {
                    SearchResultBean resultBean = new SearchResultBean();
                    resultBean.setHtmlPath(hits.doc(i).get("path"));
                    resultBean.setHtmlTitle(hits.doc(i).get("title"));
                    searchResult.add(resultBean);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return searchResult;
    }
}

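In Listing 7, note that the class has three private fields. The first, searchWord, holds the search word sent by the client. The second, indexManager, is an instance of the class IndexManager defined in the indexing subsystem. The third, analyzer, is the analyzer used to parse the search word. Now let us focus on the search method. It first checks whether the index files already exist. If they do, it searches the existing index. If they do not, it first calls the method provided by IndexManager to create the index and then searches the newly created index. Once the hits come back, the method extracts the needed fields from each hit and creates one SearchResultBean instance per hit. Finally, these SearchResultBean instances are put into a list and returned to the request manager.

The class SearchResultBean has two properties, htmlPath and htmlTitle, together with their get and set methods. This also means that each search result carries two properties: htmlPath is the path of the HTML file, and htmlTitle is the title of the HTML file.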
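The source of SearchResultBean is not listed in the article. A minimal sketch reconstructed from the description above might look like the following. The property names and accessors are taken from the text and from the calls in SearchManager; everything else, including the omission of the package declaration, is an assumption made to keep the example self-contained.

```java
// Sketch reconstructed from the article's description; the original
// source of this class is not shown. In the project it would be declared
// public and placed in the package sample.dw.paper.lucene.search.
class SearchResultBean {

    // Path of the HTML file that matched the query
    private String htmlPath;

    // Title of the HTML file that matched the query
    private String htmlTitle;

    public String getHtmlPath() {
        return htmlPath;
    }

    public void setHtmlPath(String htmlPath) {
        this.htmlPath = htmlPath;
    }

    public String getHtmlTitle() {
        return htmlTitle;
    }

    public void setHtmlTitle(String htmlTitle) {
        this.htmlTitle = htmlTitle;
    }
}
```

Because the class follows the JavaBean naming conventions, a JSP page can read the two properties by name when rendering each result.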

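Indexing subsystem: The class IndexManager implements this subsystem. Listing 8 gives the source code of this class.

Listing 8: Implementation of the indexing subsystem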

package sample.dw.paper.lucene.index;

import java.io.File;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import sample.dw.paper.lucene.util.HTMLDocParser;

/**
 * This class is used to create an index for HTML files
 */
public class IndexManager {

    // the directory that stores HTML files
    private final String dataDir = "c:\\dataDir";

    // the directory that is used to store a Lucene index
    private final String indexDir = "c:\\indexDir";

    /**
     * create index
     */
    public boolean createIndex() throws IOException {
        if (true == ifIndexExist()) {
            return true;
        }
        File dir = new File(dataDir);
        if (!dir.exists()) {
            return false;
        }
        File[] htmls = dir.listFiles();
        Directory fsDirectory = FSDirectory.getDirectory(indexDir, true);
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriter indexWriter = new IndexWriter(fsDirectory, analyzer, true);
        for (int i = 0; i < htmls.length; i++) {
            String htmlPath = htmls[i].getAbsolutePath();
            if (htmlPath.endsWith(".html") || htmlPath.endsWith(".htm")) {
                addDocument(htmlPath, indexWriter);
            }
        }
        indexWriter.optimize();
        indexWriter.close();
        return true;
    }

    /**
     * Add one document to the Lucene index
     */
    public void addDocument(String htmlPath, IndexWriter indexWriter) {
        HTMLDocParser htmlParser = new HTMLDocParser(htmlPath);
        String path = htmlParser.getPath();
        String title = htmlParser.getTitle();
        Reader content = htmlParser.getContent();

        Document document = new Document();
        document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
        document.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("content", content));
        try {
            indexWriter.addDocument(document);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * judge if the index exists already
     */
    public boolean ifIndexExist() {
        File directory = new File(indexDir);
        // listFiles() returns null when the directory does not exist
        File[] files = directory.listFiles();
        return files != null && files.length > 0;
    }

    public String getDataDir() {
        return this.dataDir;
    }

    public String getIndexDir() {
        return this.indexDir;
    }
}

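This class has two private fields, dataDir and indexDir. dataDir is the path of the directory that holds the HTML pages waiting to be indexed, and indexDir is the path of the directory that holds the Lucene index files. IndexManager provides three main methods: createIndex, addDocument, and ifIndexExist. If the index does not exist, you can use createIndex to create a new one, and addDocument to add a document to an index. In our scenario, a document is an HTML page. The addDocument method calls the methods provided by the class HTMLDocParser to parse the HTML document. You can use the last method, ifIndexExist, to check whether the Lucene index already exists.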
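The file selection inside createIndex compares the path against lowercase extensions only, so a file named INDEX.HTML would be skipped, and File.listFiles returns null when the path is not a directory. The sketch below shows one possible case-insensitive, null-safe variant using only java.io. The class HtmlFileLister and its methods are hypothetical names introduced here, not part of the original code.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

// Hypothetical helper, not part of the article's original code.
final class HtmlFileLister {

    private HtmlFileLister() {
    }

    /** Returns true when the name ends with .html or .htm, ignoring case. */
    static boolean isHtml(String name) {
        String lower = name.toLowerCase(Locale.ENGLISH);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    /**
     * Returns the HTML files directly under dir, or an empty array
     * when dir does not exist or is not a directory.
     */
    static File[] listHtmlFiles(File dir) {
        File[] files = dir.listFiles(new FilenameFilter() {
            public boolean accept(File parent, String name) {
                // Keep regular files only; skip directories with matching names
                return isHtml(name) && new File(parent, name).isFile();
            }
        });
        return files == null ? new File[0] : files;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] htmls = listHtmlFiles(dir);
        for (int i = 0; i < htmls.length; i++) {
            System.out.println(htmls[i].getAbsolutePath());
        }
    }
}
```

A loop over listHtmlFiles(new File(dataDir)) could then call addDocument for each entry without a separate extension check.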

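Now let us look at the class HTMLDocParser in the package sample.dw.paper.lucene.util. This class extracts the text from an HTML file. It has three methods: getContent, getTitle, and getPath. The first returns the text content with the HTML markup removed, the second returns the title of the HTML file, and the last returns the path of the HTML file. Listing 9 gives the source code of this class.

Listing 9: The HTML parser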

package sample.dw.paper.lucene.util;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

import org.apache.lucene.demo.html.HTMLParser;

public class HTMLDocParser {

    private String htmlPath;
    private HTMLParser htmlParser;

    public HTMLDocParser(String htmlPath) {
        this.htmlPath = htmlPath;
        initHtmlParser();
    }

    private void initHtmlParser() {
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(htmlPath);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        if (null != inputStream) {
            try {
                htmlParser = new HTMLParser(new InputStreamReader(inputStream, "utf-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
    }

    public String getTitle() {
        if (null != htmlParser) {
            try {
                return htmlParser.getTitle();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return "";
    }

    public Reader getContent() {
        if (null != htmlParser) {
            try {
                return htmlParser.getReader();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    public String getPath() {
        return this.htmlPath;
    }
}
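To make concrete what kind of values getTitle and getContent are expected to yield, here is a deliberately naive illustration that uses only java.util.regex. It is not how Lucene's demo HTMLParser works, and regex-based stripping does not decode entities and breaks on malformed HTML, so treat it purely as a sketch. The class NaiveHtmlText and its methods title and text are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not a replacement for a real HTML parser.
final class NaiveHtmlText {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Captures the text between <title> and </title>
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);

    // Whole elements whose body is not page content
    private static final Pattern NON_CONTENT =
            Pattern.compile("<(script|style|title)[^>]*>.*?</\\1>", FLAGS);

    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private NaiveHtmlText() {
    }

    /** Returns the trimmed title text, or an empty string when there is none. */
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Returns the page text with tags removed and whitespace collapsed. */
    static String text(String html) {
        String s = NON_CONTENT.matcher(html).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>Lucene <b>rocks</b></p></body></html>";
        System.out.println(title(html));
        System.out.println(text(html));
    }
}
```

In the indexing subsystem, the title would go into the stored, tokenized title field and the stripped text into the content field that the search subsystem queries.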

5. Run the application on Tomcat 5.0

Now we can run the finished application on Tomcat 5.0.