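Lucene's default Similarity implementation: when a document is scored, its length factor is not computed by calling lengthNorm. For reference, lengthNorm is defined as: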
public float lengthNorm(String fieldName, int numTerms) {
    return (float) (1.0 / Math.sqrt(numTerms));
}
Source walkthrough:
Method in the TermQuery class:
public Explanation explain(IndexReader reader, int doc) throws IOException {
    ...
    // Read the norm bytes from the index; they were written when the index was built
    byte[] fieldNorms = reader.norms(field);
    float fieldNorm =
        fieldNorms != null ? Similarity.decodeNorm(fieldNorms[doc]) : 1.0f;
    ...
}
// decodeNorm: decodes a byte into a float; at query time the byte stored in the index is decoded back into a float
public static float decodeNorm(byte b) {
    return NORM_TABLE[b & 0xFF];  // & 0xFF maps negative bytes to positive above 127
}
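// NORM_TABLE is defined as:
private static final float[] NORM_TABLE = new float[256];
// NORM_TABLE holds a fixed set of 256 values, so scoring looks the norm up here instead of calling lengthNorm; documents of different lengths can therefore end up with the same score
static {
    for (int i = 0; i < 256; i++)
        NORM_TABLE[i] = SmallFloat.byte315ToFloat((byte)i);
}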
// Encodes a float as a byte; this byte is what gets written to the index at indexing time
public static byte encodeNorm(float f) {
    return SmallFloat.floatToByte315(f);
}
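To make the table lookup concrete, here is a minimal standalone sketch of the decode path. The class name NormTableDemo is hypothetical, and byte315ToFloat is copied from the SmallFloat source so that the snippet compiles without Lucene on the classpath; it is an illustration, not Lucene's own class.

```java
// Hypothetical demo class; mirrors Similarity.decodeNorm and its NORM_TABLE.
final class NormTableDemo {
    private static final float[] NORM_TABLE = new float[256];

    static {
        // One precomputed float for each of the 256 possible byte values.
        for (int i = 0; i < 256; i++)
            NORM_TABLE[i] = byte315ToFloat((byte) i);
    }

    // Copied from SmallFloat.byte315ToFloat.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    static float decodeNorm(byte b) {
        // Java bytes are signed, so (byte) -1 must be masked to index 255.
        return NORM_TABLE[b & 0xFF];
    }

    public static void main(String[] args) {
        System.out.println(((byte) -1) & 0xFF);      // 255
        System.out.println(decodeNorm((byte) 124));  // 1.0
        System.out.println(decodeNorm((byte) -1));   // largest representable norm
    }
}
```

Without the mask, a negative byte would be used directly as an array index and throw ArrayIndexOutOfBoundsException.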
Two methods in the SmallFloat class:
//
// Some specializations of the generic functions follow.
// The generic functions are just as fast with current (1.5)
// -server JVMs, but still slower with client JVMs.
//
/** floatToByte(b, mantissaBits=3, zeroExponent=15)
* <br>smallest non-zero value = 5.820766E-10
* <br>largest value = 7.5161928E9
* <br>epsilon = 0.125
*/
public static byte floatToByte315(float f) {
    int bits = Float.floatToRawIntBits(f);
    int smallfloat = bits >> (24-3);
    if (smallfloat < (63-15)<<3) {
        return (bits<=0) ? (byte)0 : (byte)1;
    }
    if (smallfloat >= ((63-15)<<3) + 0x100) {
        return -1;
    }
    return (byte)(smallfloat - ((63-15)<<3));
}
/** byteToFloat(b, mantissaBits=3, zeroExponent=15) */
public static float byte315ToFloat(byte b) {
    // on Java1.5 & 1.6 JVMs, prebuilding a decoding array and doing a lookup
    // is only a little bit faster (anywhere from 0% to 7%)
    if (b == 0) return 0.0f;
    int bits = (b&0xff) << (24-3);
    bits += (63-15) << 24;
    return Float.intBitsToFloat(bits);
}
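The effect of the one-byte encoding is easiest to see with a round trip. The sketch below copies lengthNorm, floatToByte315 and byte315ToFloat from the listings above into a hypothetical class, NormRoundTrip, so it runs without Lucene; it prints the exact lengthNorm value, the stored byte, and the decoded value for a few document lengths.

```java
// Hypothetical demo class; the three static methods are copied from the listings above.
final class NormRoundTrip {
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat < (63 - 15) << 3) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (int numTerms = 1; numTerms <= 12; numTerms++) {
            float exact = lengthNorm(numTerms);      // value computed at index time
            byte stored = floatToByte315(exact);     // what the index actually keeps
            float decoded = byte315ToFloat(stored);  // what scoring sees
            System.out.printf("numTerms=%d exact=%.6f byte=%d decoded=%s%n",
                    numTerms, exact, stored, decoded);
        }
    }
}
```

In this sketch, documents with 8, 9 and 10 terms all store byte 117 and decode to 0.3125, which is why documents of different lengths can receive the same length factor.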
So if you need the document length computed the standard way, or have other requirements, you have to implement your own Similarity.
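As a rough illustration of what such a custom implementation might look like, the sketch below overrides lengthNorm with a tunable exponent. Everything here is an assumption for illustration: SimilarityStandIn is a hypothetical stand-in that only mirrors the lengthNorm signature shown at the top, so the snippet compiles without Lucene, and TunableLengthSimilarity is an invented name. In a real project the class would extend the Similarity implementation of your Lucene version instead.

```java
// Hypothetical stand-in for Lucene's Similarity; only the lengthNorm hook is modelled.
abstract class SimilarityStandIn {
    public abstract float lengthNorm(String fieldName, int numTerms);
}

// Hypothetical custom similarity: lengthNorm = numTerms ^ (-exponent).
final class TunableLengthSimilarity extends SimilarityStandIn {
    private final double exponent;

    TunableLengthSimilarity(double exponent) {
        this.exponent = exponent;
    }

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        // exponent = 0.5 reproduces 1 / sqrt(numTerms); exponent = 0 ignores length.
        return (float) Math.pow(numTerms, -exponent);
    }

    public static void main(String[] args) {
        SimilarityStandIn sameAsDefault = new TunableLengthSimilarity(0.5);
        SimilarityStandIn ignoreLength = new TunableLengthSimilarity(0.0);
        System.out.println(sameAsDefault.lengthNorm("body", 4));
        System.out.println(ignoreLength.lengthNorm("body", 4));
    }
}
```

Keep in mind that whatever lengthNorm returns is still passed through encodeNorm at indexing time, so changing the formula alone does not remove the one-byte precision limit.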