Calculating a score based on search result relevance and age

Date: 2014-04-13 21:35:51

Tags: math lucene

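I'm building a search engine with Lucene and it's going well, but I have to implement an algorithm that scores results based on their relevance and age. I have three inputs: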

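  • Relevance score, for example 2.68065834
  • Document age (as a UNIX epoch timestamp, i.e. seconds since 1970), for example 1380979800
  • Age skew (a value between 0 and 10, specified by the user, that lets them control how much the document's age affects the overall score)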

What I'm currently doing is basically:

    ageOfDocumentInHours = age / 3600; // convert seconds to hours to avoid any overflow
    ageModifier = ageOfDocumentInHours * ageScew + 1; // a skew of 0 results in relevancy * 1
    overallScore = relevancy * ageModifier;
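
To make the arithmetic concrete, here is a minimal Python sketch of the three lines above, run on the example inputs from the question. The function name `overall_score` and its parameter names are made up for illustration. It suggests that, because `age` is a raw epoch timestamp, any non-zero skew produces a modifier in the hundreds of thousands, so the relevance score ends up contributing very little to the ordering.

```python
def overall_score(relevancy, age_epoch_seconds, age_skew):
    """Port of the question's formula; names are illustrative only."""
    age_in_hours = age_epoch_seconds / 3600      # seconds since 1970 -> hours
    age_modifier = age_in_hours * age_skew + 1   # a skew of 0 gives a modifier of 1
    return relevancy * age_modifier


if __name__ == "__main__":
    relevancy = 2.68065834
    age = 1380979800
    for skew in (0, 1, 10):
        print(skew, overall_score(relevancy, age, skew))
```

With a skew of 0 the relevance score is returned unchanged; with a skew of 1 the modifier is already 383606.5.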

I know nothing about statistics. Is there a better way to do this?

Thanks,

1 Answer:

Answer 0: (score: 0)

This is what I ended up doing:

    public override float CustomScore(int doc, float subQueryScore, float valSrcScore)
    {
        float contentScore = subQueryScore;

        double start = 1262307661d; //2010

        if (_dateVsContentModifier == 0)
        {
            return base.CustomScore(doc, subQueryScore, valSrcScore);
        }

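        // Use UtcNow: subtracting a UTC epoch from local DateTime.Now would be off by the time zone offset
        long epoch = (long)(DateTime.UtcNow - new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)).TotalSeconds;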
        long docSinceStartHours = (long)Math.Ceiling((valSrcScore - start) / 3600);
        long nowSinceStartHours = (long)Math.Ceiling((epoch - start) / 3600);

        float ratio = (float)docSinceStartHours / (float)nowSinceStartHours; // Get a fraction where a document that was created this hour has a value of 1
        float ageScore = (ratio * _dateVsContentModifier) + 1; // We add 1 because we don't want the squaring below to make the value smaller

        float ageScoreAdjustedSoNewerIsBetter = 1;

        if (_newerContentModifier > 0)
        {
            // Here we square it, multiply it and then take the square root.
            // Note: sqrt(ageScore^2 * m) equals ageScore * sqrt(m), so the boost is still linear in ageScore rather than exponential
            ageScoreAdjustedSoNewerIsBetter =  (float)Math.Sqrt((ageScore * ageScore) * _newerContentModifier);
        }

        return ageScoreAdjustedSoNewerIsBetter * contentScore;
    }

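The basic idea is that the age score is a fraction where 0 is the first day of 2010 and 1 is now. This decimal value is then multiplied by _dateVsContentModifier, which optionally boosts the date relative to the relevance score.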

The age score is then squared, multiplied by _newerContentModifier, and square-rooted. This causes newer content to score higher than older content.
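
For experimenting with the two modifiers outside Lucene, the scoring above can be sketched as a standalone Python function. This is a rough port rather than the original implementation: the function name, the snake_case parameter names, and passing `now_epoch` in explicitly are assumptions made for illustration. Where the C# code defers to `base.CustomScore` for a zero date modifier, this sketch simply returns the content score unchanged, which is also an assumption.

```python
import math

# Reference point taken from the answer's code (early on 1 January 2010, UTC).
START_EPOCH = 1262307661


def age_adjusted_score(content_score, doc_epoch, now_epoch,
                       date_vs_content_modifier, newer_content_modifier):
    """Combine a relevance score with a document timestamp (illustrative port)."""
    if date_vs_content_modifier == 0:
        # Assumption: ignore age entirely instead of calling a base class.
        return content_score

    # Whole hours elapsed since the reference point, rounded up as in the answer.
    doc_hours = math.ceil((doc_epoch - START_EPOCH) / 3600)
    now_hours = math.ceil((now_epoch - START_EPOCH) / 3600)

    ratio = doc_hours / now_hours                     # 0 at the reference point, 1 for "now"
    age_score = ratio * date_vs_content_modifier + 1  # never below 1 for documents after START_EPOCH

    adjusted = 1.0
    if newer_content_modifier > 0:
        # Equivalent to age_score * sqrt(newer_content_modifier) for age_score >= 0.
        adjusted = math.sqrt(age_score * age_score * newer_content_modifier)

    return adjusted * content_score


if __name__ == "__main__":
    now = START_EPOCH + 3600 * 1000   # 1000 hours after the reference point
    old = START_EPOCH + 3600 * 500    # halfway between the reference point and now
    print(age_adjusted_score(2.0, now, now, 2, 1))
    print(age_adjusted_score(2.0, old, now, 2, 1))
```

In this example a document created "now" gets an age score of 3 and one created halfway back gets 2, so the same relevance of 2.0 becomes 6.0 and 4.0 respectively.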