RawComparator

okie-dokie

License: CC 4.0 BY-SA

Category: hadoop. Tags: Hadoop, Apache

Article link: https://blog.youkuaiyun.com/quiii/article/details/83682041

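This article explains how RawComparator is used to compare Writable objects, including its relationship to WritableComparable and WritableComparator. It demonstrates byte-wise comparison with RawComparator through an experiment and looks at how the Text class implements serialization.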

RawComparator is used to compare Writable objects, for example:

Job.setSortComparatorClass(Class<? extends RawComparator>);
Job.setGroupingComparatorClass(Class<? extends RawComparator>);

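A Writable that can be used as a key has the following characteristics:

It must implement the WritableComparable interface;

It usually contains a comparator class that extends WritableComparator.

The WritableComparator class, in turn, implements the RawComparator interface.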

public interface WritableComparable<T> extends Writable, Comparable<T>;

public interface RawComparator<T> extends Comparator<T>;

public class WritableComparator implements RawComparator;

One of its methods deserves a closer look:

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

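This method compares two Writable objects in their serialized byte form, without deserializing them. For each object, b is the buffer holding the serialized bytes, s is the start offset within that buffer, and l is the serialized length.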
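To make "byte-wise" concrete, here is a minimal plain-Java sketch of a lexicographic comparison over two byte ranges, with each byte treated as unsigned. It uses the same parameter shape as the method above, but the class `ByteRangeCompare` is a hypothetical illustration written for this article, not Hadoop's own code.

```java
import java.nio.charset.StandardCharsets;

final class ByteRangeCompare {

    /** Compares b1[s1, s1+l1) with b2[s2, s2+l2) lexicographically, bytes as unsigned. */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;   // mask to read the byte as 0..255
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;                  // common prefix is equal: the shorter range sorts first
    }

    public static void main(String[] args) {
        byte[] x = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] y = "apricot".getBytes(StandardCharsets.UTF_8);
        // 'p' (0x70) < 'r' (0x72) at index 2, so "apple" sorts first
        System.out.println(compare(x, 0, x.length, y, 0, y.length) < 0); // prints true
    }
}
```

No objects are created during the comparison, which is the main reason a raw comparator is attractive when many keys have to be sorted.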

Let's run an experiment:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

...
private static final Logger log = LoggerFactory.getLogger(...class);

public static void main (String[] args) {
	Text text = new Text(
		"01234567890123456789012345678901234567890123456789"
		+ "01234567890123456789012345678901234567890123456789"
		+ "01234567890123456789012345678901234567890123456789"
		+ "01234567890123456789012345678901234567890123456789"
		+ "01234567890123456789012345678901234567890123456789"
		+ "01234567890123456789012345678901234567890123456789");

	/*
	// Manual equivalent of the serialization: VInt length prefix, then the UTF-8 bytes
	CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder()
				.onMalformedInput(CodingErrorAction.REPORT)
				.onUnmappableCharacter(CodingErrorAction.REPORT);
	CharBuffer charBuffer = CharBuffer.wrap(text.toString().toCharArray());
	ByteBuffer byteBuffer = encoder.encode(charBuffer);
	int l1 = byteBuffer.limit();

	byte[] byteArray = byteBuffer.array();
	DataOutputBuffer out = new DataOutputBuffer();
	WritableUtils.writeVInt(out, l1);
	out.write(byteArray, 0, l1);
	out.close();
	byte[] b1 = out.getData();
	*/
	// Serialized form: VInt length prefix followed by the UTF-8 payload
	byte[] b1 = WritableUtils.toByteArray(text);

	int s1 = 0;
	// Number of bytes occupied by the VInt length prefix
	int n1 = WritableUtils.decodeVIntSize(b1[s1]);
	// Total serialized length (prefix + payload), as a RawComparator would receive it
	int l1 = n1 + text.getLength();

	log.info("[{}, {}]", l1, n1);

	// Skip the prefix to recover the payload
	byte[] b2 = Arrays.copyOfRange(b1, s1 + n1, s1 + l1);
	log.info(new String(b2, StandardCharsets.UTF_8));
}

Output:

[303, 3]
012345678901234567890123456789012345678901...

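When a Text is serialized, it writes the actual byte length of the string, encoded as a variable-length integer (VInt), at the very beginning of the byte array. The commented-out block in the example above does the same thing by hand; it mirrors the following method of Text: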

class Text:
public void write(DataOutput out) throws IOException {
	WritableUtils.writeVInt(out, length);
	out.write(bytes, 0, length);
}
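The `[303, 3]` result can be reproduced without Hadoop. The following is a simplified plain-Java sketch of the VInt length prefix as I understand the layout used by WritableUtils: values up to 127 take one byte, larger values take a marker byte followed by the value bytes. The class `VIntSketch` and its methods are hypothetical names for illustration, and the sketch deliberately handles non-negative values only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class VIntSketch {

    /** Encodes a non-negative int as a VInt-style prefix (assumed layout, sketch only). */
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value <= 127) {
            out.write(value);                        // small values fit in a single byte
            return out.toByteArray();
        }
        int count = 0;                               // number of value bytes needed
        for (int tmp = value; tmp != 0; tmp >>>= 8) {
            count++;
        }
        out.write(-112 - count);                     // marker byte: -113 = 1 value byte, -114 = 2, ...
        for (int i = count - 1; i >= 0; i--) {
            out.write((value >>> (8 * i)) & 0xff);   // value bytes, most significant first
        }
        return out.toByteArray();
    }

    /** Total prefix size (marker plus value bytes), derived from the first byte alone. */
    static int prefixSize(byte first) {
        if (first >= -112) return 1;
        if (first < -120) return -119 - first;       // range used for negative values
        return -111 - first;
    }

    /** Length prefix followed by the UTF-8 payload, like the write method above. */
    static byte[] serialize(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        byte[] prefix = encode(payload.length);
        byte[] result = new byte[prefix.length + payload.length];
        System.arraycopy(prefix, 0, result, 0, prefix.length);
        System.arraycopy(payload, 0, result, prefix.length, payload.length);
        return result;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 30; i++) {
            sb.append("0123456789");                 // 300 ASCII characters = 300 UTF-8 bytes
        }
        byte[] b1 = serialize(sb.toString());
        System.out.println("[" + b1.length + ", " + prefixSize(b1[0]) + "]"); // prints [303, 3]
    }
}
```

A length of 300 does not fit in one byte, so the prefix is one marker byte (-114) plus two value bytes (1 and 44), and the serialized form is 3 + 300 = 303 bytes long.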

RawComparator<Text> comparator = new RawComparator<Text>() {

	@Override
	public int compare(Text t1, Text t2) {
		return t1.toString().compareTo(t2.toString());
	}

	@Override
	public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
		// Size of the VInt length prefix at the start of each serialized Text
		int n1 = WritableUtils.decodeVIntSize(b1[s1]);
		int n2 = WritableUtils.decodeVIntSize(b2[s2]);

		// This is how Text implements the comparison:
		// WritableComparator.compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);

		// In fact, it could just as well be done like this
		// (at the cost of copying and decoding both keys):
		byte[] _b1 = Arrays.copyOfRange(b1, s1 + n1, s1 + l1);
		byte[] _b2 = Arrays.copyOfRange(b2, s2 + n2, s2 + l2);
		String t1 = new String(_b1, StandardCharsets.UTF_8);
		String t2 = new String(_b2, StandardCharsets.UTF_8);
		return compare(new Text(t1), new Text(t2));
	}

};
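One caveat when treating the two approaches as interchangeable: comparing UTF-8 bytes and comparing decoded strings with String.compareTo agree for ASCII text, but they can disagree once characters outside the Basic Multilingual Plane are involved, because compareTo works on UTF-16 code units. The plain-Java sketch below illustrates this; the class `OrderCaveat` and its helper are hypothetical names, and the byte comparison is a stand-in for the byte-wise comparison discussed above.

```java
import java.nio.charset.StandardCharsets;

final class OrderCaveat {

    /** Lexicographic comparison of the UTF-8 encodings, bytes as unsigned. */
    static int byteOrder(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        // ASCII: both orders agree
        System.out.println(Integer.signum(byteOrder("abc", "abd")));     // prints -1
        System.out.println(Integer.signum("abc".compareTo("abd")));      // prints -1

        String bmp = "\uFF5E";                                 // U+FF5E, UTF-8: EF BD 9E
        String supp = new String(Character.toChars(0x1F600));  // U+1F600, UTF-8: F0 9F 98 80, UTF-16: D83D DE00

        // Byte order follows code points: U+FF5E sorts before U+1F600
        System.out.println(Integer.signum(byteOrder(bmp, supp)));        // prints -1
        // compareTo sees the surrogate D83D, which is smaller than FF5E
        System.out.println(Integer.signum(bmp.compareTo(supp)));         // prints 1
    }
}
```

If the sort comparator and the object-level comparison must agree for arbitrary Unicode keys, comparing the serialized bytes directly is the safer choice.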