Improving the Search Experience with a Custom Similarity Scorer

johnnyhg


Categories: Java, search engines and related topics

http://www.gbsou.com/2011/11/01/8048.html

score(q,d) = coord(q,d) · queryNorm(q) · ∑ (t in q) [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]

For details, see the related article: http://blog.chenlb.com/2009/08/lucene-scoring-architecture.html

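Three factors are considered here. The first two are coord(q,d) and tf(t in d): the more query terms a document matches, the larger coord becomes, and the more often a term occurs in the document, the larger tf becomes. The third is norm(t,d), which mainly reflects the document boost and the field boost. The larger its value, the more it weighs on the overall score.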
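To make the interaction of these factors concrete, here is a minimal standalone sketch. It does not call Lucene; it only mirrors the formulas used by Lucene 3.x's DefaultSimilarity (coord = overlap / maxOverlap, tf = sqrt(freq), idf = ln(numDocs / (docFreq + 1)) + 1). The class name `ScoreSketch`, the `score` helper, and all input numbers are hypothetical, and queryNorm and the per-term boost are left out for brevity.

```java
// Standalone sketch of the main scoring factors; not the Lucene implementation itself.
final class ScoreSketch {

    // coord: fraction of the query terms that the document matches.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // tf: grows with the square root of the term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf: rarer terms (small docFreq) weigh more.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Simplified score: coord * sum over matched terms of tf * idf^2 * norm.
    // freqs[i] and docFreqs[i] describe the i-th matched term.
    static float score(int queryTerms, float[] freqs, int[] docFreqs, int numDocs, float norm) {
        float sum = 0f;
        for (int i = 0; i < freqs.length; i++) {
            float termIdf = idf(docFreqs[i], numDocs);
            sum += tf(freqs[i]) * termIdf * termIdf * norm;
        }
        return coord(freqs.length, queryTerms) * sum;
    }

    public static void main(String[] args) {
        // Hypothetical index of 1000 documents and a three-term query.
        float oneHit = score(3, new float[] {1f}, new int[] {10}, 1000, 1f);
        float twoHits = score(3, new float[] {1f, 1f}, new int[] {10, 10}, 1000, 1f);
        System.out.println("one matched term:  " + oneHit);
        System.out.println("two matched terms: " + twoHits);
    }
}
```

With everything else equal, the document matching two terms scores higher both because it sums two term contributions and because its coord is 2/3 rather than 1/3.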

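First, the effect of tf on search results:

The site's search initially used the default scorer, which showed some shortcomings. Site search mainly covers video titles and tags. For a video document, words repeated within the title, or repeated between the title and the tags, carry no extra meaning. For example, with the title "美女美女美女美女" (the word for "beauty" four times in a row) and the tag "美女", giving tf more influence would obviously raise the document's score, which is not what a video site wants. We would rather have the results be more diverse, because that is what raises the click-through rate. So tf should not depend on frequency for any document containing a matched term: set tf = 1.0f, as in the custom scorer below.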

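```java
import org.apache.lucene.search.DefaultSimilarity; // Lucene 3.x package

/**
 * @author yuzhy
 *
 * Custom scorer: how many times a term repeats in a document
 * does not affect the score.
 */
public class MySolrSimilarity extends DefaultSimilarity {

    // Ignore the term frequency: every matching document gets the same tf.
    @Override
    public float tf(float freq) {
        return 1.0f;
    }

    @Override
    public float tf(int freq) {
        return 1.0f;
    }
}
```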

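Do not underestimate this small piece of code. With this scorer, how often a term occurs in a document has no effect on the score, so there is no need to worry about cheating by keyword stuffing: on this factor every matching document gets the same value. Earlier I had also considered handling repeated strings in titles and tags by using a suffix tree to find common substrings, but this approach turned out to be much simpler.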
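The effect can be checked numerically with a small standalone sketch. It assumes the default tf is sqrt(freq), as in Lucene's DefaultSimilarity; the class name `FlatTfSketch` and the frequencies are hypothetical.

```java
// Compares the default tf curve with the flattened tf used by the custom scorer.
final class FlatTfSketch {

    // Default behaviour: tf grows with the square root of the frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Custom behaviour: every matching document gets the same tf.
    static float flatTf(float freq) {
        return 1.0f;
    }

    public static void main(String[] args) {
        float stuffed = 4f; // a title that repeats the term four times
        float normal = 1f;  // a title that contains the term once
        System.out.println("default: " + defaultTf(stuffed) + " vs " + defaultTf(normal));
        System.out.println("flat:    " + flatTf(stuffed) + " vs " + flatTf(normal));
    }
}
```

Under the default curve the stuffed title gets twice the tf of the normal one; with the flat version the two tie on this factor, so the other factors decide the order.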

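Because the search service is built on Solr, the first step is to replace Solr's default Similarity class. Near the end of Solr's configuration file schema.xml, modify or add the similarity declaration so that it points to the custom scorer.

After Solr is restarted, the custom scorer takes effect. Searching for "美女" no longer puts the "美女美女美女美女" document near the top.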
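In schema.xml the global scorer is declared with a `<similarity>` element. A minimal sketch, assuming the class was compiled into a hypothetical package `com.example.search` (the real package name is not given in the post):

```xml
<!-- schema.xml, near the end, inside the <schema> element -->
<!-- com.example.search is a placeholder package name -->
<similarity class="com.example.search.MySolrSimilarity"/>
```

The jar containing the class also has to be on Solr's classpath for the declaration to load.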

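Next, the effect of coord:

Consider a search for the three terms "htc Incredible S". Since no document matches all of them, a relaxed rule applies: documents matching even a single term are returned and ranked. With the earlier scoring, the first few results were: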

```xml
<doc>
  <str name="Subject">S.H.E爱而为一的魔力 幕后全纪录</str>
  <str name="tag">she selina hebe ella 爱而为一</str>
  <int name="public_time">1103150000</int>
  <int name="times">370</int>
  <int name="hd">1</int>
</doc>
<doc>
  <str name="Subject">1000种死法-S04-01.1024X576.x264</str>
  <str name="tag">1000种死法</str>
  <int name="public_time">1103140000</int>
  <int name="times">692</int>
  <int name="hd">1</int>
</doc>
<doc>
  <str name="Subject">p-s-1</str>
  <str name="tag"></str>
  <int name="public_time">1103150000</int>
  <int name="times">58</int>
  <int name="hd">1</int>
</doc>
```

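As can be seen, documents that match only the term S are ranked near the top. Documents matching more terms should score higher, but these three documents gain so much from other factors that their final scores exceed those of documents matching several terms, and they end up first. That makes for a poor search experience: the more terms a document matches, the higher it should rank. So I first reduced the value computed for norm(t,d), tuning the weights in testing so that coord accounts for a larger share of the score. The improvement was immediate, and the top three results became: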

```xml
<doc>
  <str name="Subject">不可思议htc Incredible 对比 apple iphone4</str>
  <str name="tag">Incredible htc apple iphone4 苹果</str>
  <int name="public_time">1009250000</int>
  <int name="times">29758</int>
  <int name="hd">0</int>
</doc>
<doc>
  <str name="Subject">不可思议 htc Incredible 比拼 苹果 iphone 3gs</str>
  <str name="tag">不可思议 Incredible htc 苹果 apple</str>
  <int name="public_time">1009250000</int>
  <int name="times">20231</int>
  <int name="hd">0</int>
</doc>
<doc>
  <str name="Subject">HTC incredible拆解全过程</str>
  <str name="tag">手机 HTC incredible DROID系列</str>
  <int name="public_time">1005030000</int>
  <int name="times">3649</int>
  <int name="hd">0</int>
</doc>
```

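Here the documents that match the two terms htc and Incredible are ranked first, which clearly fits what users need. Even without a full match, the results come closer in relevance.

Finally, a word on norm(t,d):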

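Having no norms means that document boost, field boost, and length normalization are disabled at index time. The benefit is lower memory use: at search time there is no need to keep one byte of norms information for every field of every document in the index. However, norms for a field can only be dropped if they are disabled for that field in every document. As soon as one document indexes the field with norms, the documents that disabled them will also store a default norms value. The reason is that, to make norms lookups fast, Lucene computes the offset as the document number multiplied by the size of each document's norms entry. If one document were missing, the offset could not be computed. In other words, the norms for a field are either stored for all documents or for none.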

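norm(t,d) packs together several index-time boost and length factors:

Document boost - the boost for the whole document, set with doc.setBoost() before indexing
Field boost - the boost for a field, also set before indexing, with field.setBoost()
lengthNorm(field) - computed from the number of tokens in the field so that shorter fields score higher; calculated by Similarity.lengthNorm at index time.

All of these factors are multiplied together to give the norm value. If a document contains several fields with the same name, their boosts are multiplied as well:

norm(t,d) = doc.getBoost() · lengthNorm(field) · ∏ (field f in d named as t) f.getBoost()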
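The product above can be sketched in a few lines of standalone Java. This assumes lengthNorm = 1 / sqrt(numTerms), as in Lucene's DefaultSimilarity, and ignores the fact that Lucene stores the result in a single byte and so loses precision. The class name `NormSketch` and the sample boosts are hypothetical.

```java
// Standalone sketch of norm(t,d) = docBoost * lengthNorm(field) * product of field boosts.
final class NormSketch {

    // Shorter fields get a larger length norm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // fieldBoosts holds one boost per field instance with the same name in the document.
    static float norm(float docBoost, int numTerms, float... fieldBoosts) {
        float product = 1.0f;
        for (float boost : fieldBoosts) {
            product *= boost;
        }
        return docBoost * lengthNorm(numTerms) * product;
    }

    public static void main(String[] args) {
        // A four-token field with no boosts at all.
        System.out.println(norm(1.0f, 4, 1.0f));
        // Same field length, document boost 2.0, two same-named fields boosted 1.5 and 2.0.
        System.out.println(norm(2.0f, 4, 1.5f, 2.0f));
    }
}
```

The second call shows how quickly boosts compound: the document and field boosts multiply, so the norm is six times that of the unboosted document.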

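The search component is dismax. The document-level boost (bf) is computed from three fields:

public_time (video publish time)^15, times (video play count)^15, hd (HD video)^4

The field weights (qf) are:

qf=Subject^1+tag^0.3

To let coord dominate the ranking, the values used for the document boost and field boost should be lowered by one order of magnitude.

They were changed to:

public_time (video publish time)^1.5, times (video play count)^1.5, hd (HD video)^0.4

This makes the value computed for norm far smaller than coord, so that documents matching more terms score higher.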
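Why a tenfold reduction is enough to flip the order can be seen in a toy model. The sketch below assumes the final score is the coord-weighted text score plus a weighted boost value, which matches how dismax adds bf results to the relevance score. The class name `BoostScaleSketch` and every number in it are made up for illustration and are not taken from the site's index.

```java
// Toy model: final score = coord * textScore + boostWeight * boostValue.
final class BoostScaleSketch {

    static double score(int matched, int queryTerms, double textScore,
                        double boostWeight, double boostValue) {
        double coord = matched / (double) queryTerms;
        return coord * textScore + boostWeight * boostValue;
    }

    public static void main(String[] args) {
        // Document A matches 1 of 3 terms but is recent (larger boost value).
        // Document B matches 2 of 3 terms but is older (smaller boost value).
        double[] weights = {15.0, 1.5};
        for (double weight : weights) {
            double a = score(1, 3, 3.0, weight, 0.10);
            double b = score(2, 3, 3.0, weight, 0.02);
            System.out.println("weight " + weight + ": A=" + a + ", B=" + b);
        }
    }
}
```

With weight 15 the boost term outweighs the difference in coord and A ranks first; with weight 1.5 the text match dominates and B ranks first.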



2011-11-04 20:35