Accelerating Comparison by Providing RawComparator


When a job is in its sorting or merging phase, Hadoop uses a RawComparator for the map output key to compare keys. Built-in Writable classes such as IntWritable have byte-level comparator implementations that are fast because they don't require the byte form of the object to be unmarshalled to object form for the comparison. When writing your own Writable, it may be tempting to implement the WritableComparable interface, because it's easy to implement without knowing the in-memory layout of your custom Writable. Unfortunately, it requires unmarshalling objects from their byte form, which makes comparisons inefficient.


In this blog post, I'll show you how to implement a custom RawComparator to avoid these inefficiencies. For comparison, I'll implement the WritableComparable interface first, then implement RawComparator for the same custom object.


Suppose you have a custom Writable called Person. To make it comparable, you implement WritableComparable like this:

import org.apache.hadoop.io.WritableComparable;

import java.io.*;

public class Person implements WritableComparable<Person> {

    private String firstName;
    private String lastName;

    public Person() {
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.lastName = in.readUTF();
        this.firstName = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(lastName);
        out.writeUTF(firstName);
    }

    @Override
    public int compareTo(Person other) {
        int cmp = this.lastName.compareTo(other.lastName);
        if (cmp != 0) {
            return cmp;
        }
        return this.firstName.compareTo(other.firstName);
    }

    public void set(String lastName, String firstName) {
        this.lastName = lastName;
        this.firstName = firstName;
    }
}

The trouble with this comparator is that MapReduce stores your intermediate map output data in byte form, and every time it needs to sort your data, it has to unmarshall that data back into Writable form to perform the comparison. This unmarshalling is expensive because it recreates your objects purely for comparison purposes.
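To make the cost concrete, here is a plain-Java sketch of what the default comparison path effectively does for Person: rebuild both objects from their byte form, then compare field by field. The class and method names here are illustrative, not Hadoop API; Hadoop's WritableComparator does the equivalent internally with reusable Writable instances.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class DeserializingCompare {

    // Serialize the same way Person.write(...) does: last name, then first name
    static byte[] serialize(String lastName, String firstName) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeUTF(lastName);
        out.writeUTF(firstName);
        return baos.toByteArray();
    }

    // What the default comparator effectively does: deserialize both
    // records back into fields, then compare them as objects.
    static int compare(byte[] b1, byte[] b2) throws IOException {
        DataInputStream in1 = new DataInputStream(new ByteArrayInputStream(b1));
        DataInputStream in2 = new DataInputStream(new ByteArrayInputStream(b2));
        String last1 = in1.readUTF(), first1 = in1.readUTF();
        String last2 = in2.readUTF(), first2 = in2.readUTF();
        int cmp = last1.compareTo(last2);
        return cmp != 0 ? cmp : first1.compareTo(first2);
    }

    public static void main(String[] args) throws IOException {
        byte[] a = serialize("Smith", "Alice");
        byte[] b = serialize("Smith", "Bob");
        System.out.println(compare(a, b) < 0); // true: Alice sorts before Bob
    }
}
```

Every call to compare here allocates new streams and new String objects, and this happens for every key comparison during the sort.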


To write a byte-level comparator for the Person class, we have to implement the RawComparator interface. Let's revisit the Person class and look at how to do this. In the Person class, we store the two fields, firstName and lastName, as strings, and use DataOutput's writeUTF method to write them out.

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(lastName);
        out.writeUTF(firstName);
    }

If you read the javadoc of writeUTF(String str, DataOutput out), you will see the following statement:

     * First, two bytes are written to out as if by the <code>writeShort</code>
     * method giving the number of bytes to follow. This value is the number of
     * bytes actually written out, not the length of the string. Following the
     * length, each character of the string is output, in sequence, using the
     * modified UTF-8 encoding for the character. If no exception is thrown, the
     * counter <code>written</code> is incremented by the total number of
     * bytes written to the output stream. This will be at least two
     * plus the length of <code>str</code>, and at most two plus
     * thrice the length of <code>str</code>.

This simply means that the writeUTF method writes two bytes containing the length in bytes of the encoded string, followed by the byte form of the string.
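You can verify this layout with a small standalone sketch (the class name here is just for illustration): serialize a string with writeUTF and inspect the raw bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteUTFLayout {

    // Serialize a single string with writeUTF and return the raw bytes
    static byte[] utfBytes(String s) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeUTF(s);
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = utfBytes("Smith");
        // The first two bytes hold the encoded length, big-endian
        int len = ((bytes[0] & 0xff) << 8) + (bytes[1] & 0xff);
        System.out.println("length field = " + len);          // 5
        System.out.println("total bytes  = " + bytes.length); // 2 + 5 = 7
        System.out.println("payload      = "
                + new String(bytes, 2, len, StandardCharsets.UTF_8)); // Smith
    }
}
```

For "Smith" the stream is seven bytes: a two-byte length field holding 5, followed by the five UTF-8 bytes of the string.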


Assuming you want to perform a lexicographical comparison that includes both the last and the first name, you cannot do this over the entire byte array, because the string lengths are also encoded in the array. Instead, the comparator needs to be smart enough to skip over the string lengths, as the code below shows:

import org.apache.hadoop.io.WritableComparator;

public class PersonBinaryComparator extends WritableComparator {
    protected PersonBinaryComparator() {
        super(Person.class, true);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
                       int l2) {
        
        // Compare last name
        int lastNameResult = compare(b1, s1, b2, s2);

        // If the last names differ, return that result
        if (lastNameResult != 0) {
            return lastNameResult;
        }

        // Read the size of the last name from the byte array
        int b1l1 = readUnsignedShort(b1, s1);
        int b2l1 = readUnsignedShort(b2, s2);

        // Return the comparison result on the first name
        return compare(b1, s1 + b1l1 + 2, b2, s2 + b2l1 + 2);
    }

    // Compare string in byte form
    public static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        // Read the size of the UTF-8 string in byte array
        int b1l1 = readUnsignedShort(b1, s1);
        int b2l1 = readUnsignedShort(b2, s2);

        // Perform lexicographical comparison of the UTF-8 binary data
        // with the WritableComparator.compareBytes(...) method
        return compareBytes(b1, s1 + 2, b1l1, b2, s2 + 2, b2l1);
    }

    // Read a big-endian unsigned short (two bytes)
    public static int readUnsignedShort(byte[] b, int offset) {
        int ch1 = b[offset] & 0xff;     // mask to avoid sign extension
        int ch2 = b[offset + 1] & 0xff;
        return (ch1 << 8) + ch2;
    }
}
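To sanity-check this byte-level logic without a Hadoop dependency, here is a plain-Java sketch that serializes two (lastName, firstName) pairs with writeUTF and compares them using the same skip-the-length-field trick. The class name is illustrative, and compareBytes is re-implemented here as an unsigned lexicographic byte comparison, mirroring what WritableComparator.compareBytes does.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawCompareDemo {

    // Serialize the same way Person.write(...) does
    static byte[] serialize(String lastName, String firstName) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeUTF(lastName);
        out.writeUTF(firstName);
        return baos.toByteArray();
    }

    static int readUnsignedShort(byte[] b, int offset) {
        return ((b[offset] & 0xff) << 8) + (b[offset + 1] & 0xff);
    }

    // Unsigned lexicographic comparison, mirroring WritableComparator.compareBytes
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) return a - b;
        }
        return l1 - l2;
    }

    // Compare one writeUTF-encoded string: read its length, skip the
    // two-byte length field, then compare the payload bytes
    static int compareUTF(byte[] b1, int s1, byte[] b2, int s2) {
        int l1 = readUnsignedShort(b1, s1);
        int l2 = readUnsignedShort(b2, s2);
        return compareBytes(b1, s1 + 2, l1, b2, s2 + 2, l2);
    }

    static int comparePersons(byte[] b1, byte[] b2) {
        int cmp = compareUTF(b1, 0, b2, 0);        // last name first
        if (cmp != 0) return cmp;
        int skip1 = 2 + readUnsignedShort(b1, 0);  // skip over the last name field
        int skip2 = 2 + readUnsignedShort(b2, 0);
        return compareUTF(b1, skip1, b2, skip2);   // then first name
    }

    public static void main(String[] args) throws IOException {
        byte[] a = serialize("Smith", "Alice");
        byte[] b = serialize("Smith", "Bob");
        byte[] c = serialize("Jones", "Zoe");
        System.out.println(comparePersons(a, b) < 0); // true: Alice before Bob
        System.out.println(comparePersons(a, c) > 0); // true: Smith after Jones
    }
}
```

Note that no String or Person objects are created on the comparison path; everything happens directly on the serialized bytes, which is exactly the property that makes a RawComparator fast.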


Final note: using writeUTF is limited, because it can only support strings whose modified UTF-8 encoding fits in 65535 bytes (the maximum value of the two-byte length field). If you need to work with larger strings, you should look at Hadoop's Text class, which can support much larger strings. The implementation of Text's comparator is similar to what we completed in this blog post.
