A problem with Lucene highlighting when using TermVector


I am using the Lucene 3.0.2 core package and the highlighter package. The main code is as follows:

    Highlighter highlighter = new Highlighter(
            new SimpleHTMLFormatter("<font color=\"red\">", "</font>"),
            new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(50));

    // Rebuild a token stream from the term vector stored for this document
    TermPositionVector termFreqVector =
            (TermPositionVector) reader.getTermFreqVector(id, fieldName);
    TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector);

    // Highlight the stored field text using that token stream
    String content = hitDoc.get(fieldName);
    String result = highlighter.getBestFragments(tokenStream, content, 5, "…");

Testing showed the following.

Search that highlights correctly:
Query: 复件 索引
Title: 复件 (12) 索引测试新建文档1.txt
Tokens: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
Result: 复件 (12) 索引测试新建文档1.txt, with only 复件 and 索引 highlighted.

 

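Search that highlights incorrectly:
Query: 索引 文档

Example 1 title: 索引测试新建文档1.txt
Tokens: [(1,8,9), (1.txt,8,13), (文档,6,8), (新建,4,6), (测试,2,4), (索引,0,2), (txt,10,13)]
Result: 索引测试新建文档1.txt, with the whole run 索引测试新建文档 highlighted instead of just the two query terms.

Example 2 title: 复件 (12) 索引测试新建文档1.txt
Tokens: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
Result: 复件 (12) 索引测试新建文档1.txt, again with the whole run 索引测试新建文档 highlighted.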

Stepping through the highlighter source code in a debugger, I found this:

    for (boolean next = tokenStream.incrementToken();
            next && (offsetAtt.startOffset() < maxDocCharsToAnalyze);
            next = tokenStream.incrementToken())
    {
        if ((offsetAtt.endOffset() > text.length())
                || (offsetAtt.startOffset() > text.length()))
        {
            throw new InvalidTokenOffsetsException("Token " + termAtt.term()
                    + " exceeds length of provided text sized " + text.length());
        }

        if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct()))
        {
            // the current token is distinct from previous tokens -
            // markup the cached token group info
            startOffset = tokenGroup.matchStartOffset;
            endOffset = tokenGroup.matchEndOffset;
            tokenText = text.substring(startOffset, endOffset);
            String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
            // store any whitespace etc from between this and last group
            if (startOffset > lastEndOffset)
                newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
            newText.append(markedUpText);
            lastEndOffset = Math.max(endOffset, lastEndOffset);
            tokenGroup.clear();

            // check if current token marks the start of a new fragment
            if (textFragmenter.isNewFragment())
            {
                currentFrag.setScore(fragmentScorer.getFragmentScore());
                // record stats for a new fragment
                currentFrag.textEndPos = newText.length();
                currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
                fragmentScorer.startFragment(currentFrag);
                docFrags.add(currentFrag);
            }
        }

        tokenGroup.addToken(fragmentScorer.getTokenScore());

        // if (lastEndOffset > maxDocBytesToAnalyze)
        // {
        //     break;
        // }
    }
    currentFrag.setScore(fragmentScorer.getFragmentScore());

    if (tokenGroup.numTokens > 0)
    {
        // flush the accumulated text (same code as in above loop)
        startOffset = tokenGroup.matchStartOffset;
        endOffset = tokenGroup.matchEndOffset;
        tokenText = text.substring(startOffset, endOffset);
        String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
        // store any whitespace etc from between this and last group
        if (startOffset > lastEndOffset)
            newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
        newText.append(markedUpText);
        lastEndOffset = Math.max(lastEndOffset, endOffset);
    }

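The highlighter works from the tokens' offset information. A token group is only flushed and marked up when the current token is distinct from it, that is, when the current token starts at or after the largest end offset seen so far in the group. When tokens arrive out of offset order, later tokens are not distinct and keep being merged into the same group, so at the end everything from the start offset of the earliest matching term to the end offset of the last matching term is highlighted as one block.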
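The grouping behaviour described above can be reproduced in a small standalone model. The sketch below is not Lucene code: `Tok`, `highlightSpans` and the other names are hypothetical, and the two-letter ASCII strings `sy`, `cs`, `xj`, `wd` stand in for 索引, 测试, 新建, 文档 so that the offsets match example 1. It assumes a group is closed only when the next token starts at or after the group's largest end offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of offset-based token grouping. NOT the Lucene implementation.
class GroupingSketch {

    // Hypothetical helper type: a term with its start and end character offsets.
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Returns the highlighted spans as {start, end} pairs.
    static List<int[]> highlightSpans(List<Tok> tokens, Set<String> queryTerms) {
        List<int[]> spans = new ArrayList<int[]>();
        boolean open = false;  // is there a current group?
        int groupEnd = 0;      // largest end offset in the current group
        int matchStart = -1;   // smallest start offset of a matching token, -1 if none
        int matchEnd = -1;     // largest end offset of a matching token
        for (Tok t : tokens) {
            if (open && t.start >= groupEnd) {
                // The token is distinct from the group: flush the group.
                if (matchStart >= 0) {
                    spans.add(new int[] {matchStart, matchEnd});
                }
                open = false;
            }
            if (!open) {
                open = true;
                groupEnd = t.end;
                matchStart = -1;
                matchEnd = -1;
            } else {
                groupEnd = Math.max(groupEnd, t.end);
            }
            if (queryTerms.contains(t.term)) {
                matchStart = (matchStart < 0) ? t.start : Math.min(matchStart, t.start);
                matchEnd = Math.max(matchEnd, t.end);
            }
        }
        if (open && matchStart >= 0) {
            spans.add(new int[] {matchStart, matchEnd});  // flush the last group
        }
        return spans;
    }

    // Formats spans as e.g. "[0,2][6,8]".
    static String describe(List<int[]> spans) {
        StringBuilder sb = new StringBuilder();
        for (int[] s : spans) {
            sb.append('[').append(s[0]).append(',').append(s[1]).append(']');
        }
        return sb.toString();
    }

    // Tokens in the order shown in the example 1 token dump.
    static List<Tok> termVectorOrder() {
        return Arrays.asList(new Tok("1", 8, 9), new Tok("1.txt", 8, 13),
                new Tok("wd", 6, 8), new Tok("xj", 4, 6), new Tok("cs", 2, 4),
                new Tok("sy", 0, 2), new Tok("txt", 10, 13));
    }

    // The same tokens in ascending offset order.
    static List<Tok> offsetOrder() {
        return Arrays.asList(new Tok("sy", 0, 2), new Tok("cs", 2, 4),
                new Tok("xj", 4, 6), new Tok("wd", 6, 8), new Tok("1", 8, 9),
                new Tok("1.txt", 8, 13), new Tok("txt", 10, 13));
    }

    static Set<String> query() {
        return new HashSet<String>(Arrays.asList("sy", "wd"));
    }

    public static void main(String[] args) {
        System.out.println(describe(highlightSpans(termVectorOrder(), query())));
        System.out.println(describe(highlightSpans(offsetOrder(), query())));
    }
}
```

With the term-vector order the model yields a single span covering offsets 0 to 8, which is the whole of 索引测试新建文档. With the same tokens in offset order it yields two spans, 0 to 2 and 6 to 8, one per query term.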

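Analysis:

Search that highlights correctly:
Query: 复件 索引
Tokens (matching terms are 复件 and 索引): [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
Result: 复件 (12) 索引测试新建文档1.txt
Here 复件 (0,2) is flushed as its own group as soon as (12,4,6) arrives, so 复件 and 索引 end up marked separately.

Search that highlights incorrectly:
Query: 索引 文档
Tokens (matching terms are 文档 and 索引): [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
Result: 复件 (12) 索引测试新建文档1.txt
Here 文档 (14,16) and 索引 (8,10) both fall into the group opened by (1,16,17), so the highlighted range runs from offset 8 to offset 16.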

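So the question was: should I modify the highlighter class itself, or sort the tokens by position when fetching them and only then highlight?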
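For the second option, sorting the tokens before they reach the highlighter, a minimal sketch might look like the following. It is only an illustration under assumptions: `Tok` and `sortByOffset` are hypothetical names, the ASCII strings `sy`, `cs`, `xj`, `wd` stand in for 索引, 测试, 新建, 文档 with the offsets from example 1, and wrapping the sorted tokens back into a `TokenStream` is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch: put term-vector tokens back into document order before highlighting.
class SortByOffsetSketch {

    // Hypothetical helper type: a term with its start and end character offsets.
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Returns a new list ordered by start offset, then by end offset.
    static List<Tok> sortByOffset(List<Tok> tokens) {
        List<Tok> sorted = new ArrayList<Tok>(tokens);
        Collections.sort(sorted, new Comparator<Tok>() {
            public int compare(Tok a, Tok b) {
                if (a.start != b.start) {
                    return Integer.compare(a.start, b.start);
                }
                return Integer.compare(a.end, b.end);
            }
        });
        return sorted;
    }

    // Joins the terms with single spaces, for easy inspection.
    static String terms(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.term);
        }
        return sb.toString();
    }

    // Tokens in the order shown in the example 1 token dump.
    static List<Tok> termVectorOrder() {
        return Arrays.asList(new Tok("1", 8, 9), new Tok("1.txt", 8, 13),
                new Tok("wd", 6, 8), new Tok("xj", 4, 6), new Tok("cs", 2, 4),
                new Tok("sy", 0, 2), new Tok("txt", 10, 13));
    }

    public static void main(String[] args) {
        System.out.println(terms(sortByOffset(termVectorOrder())));
    }
}
```

Sorting on the start offset with the end offset as a tie-breaker keeps overlapping tokens such as `1` and `1.txt` in a stable, predictable order.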

Checking the API, I found this overload:

    public static TokenStream getTokenStream(TermPositionVector tpv,
            boolean tokenPositionsGuaranteedContiguous)

GuaranteedContiguous means the token positions are guaranteed to be contiguous (my English is terrible, haha).

 
Changing

    TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector);

to

    TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector, true);

fixed the highlighting.
