Processing binary files in Spark with a custom Hadoop InputFormat

The blog author is porting a Hadoop MapReduce solution that processes a 10 GB binary file to Spark. They have already developed a custom InputFormat and RecordReader, but ran into trouble using them from Spark. An attempt with Spark's newAPIHadoopFile failed because the business logic inside the map function could not read the data correctly. The author later made some progress by switching to NewHadoopRDD with mapPartitionsWithInputSplit, but then hit a new error when accessing an HDFS file from inside the Spark map function.


I have developed a Hadoop-based solution that processes a binary file using the classic Hadoop MapReduce technique. The binary file is about 10 GB and is divided into 73 HDFS blocks, and the business logic, written as the map step, operates on each of these 73 blocks. We have developed a custom InputFormat and a custom RecordReader in Hadoop that return a key (IntWritable) and a value (BytesWritable) to the map function. The value is simply the contents of one HDFS block (binary data), and the business logic knows how to read this data.
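The post does not include the custom InputFormat itself. Below is a hypothetical sketch of what a whole-block RandomAccessInputFormat and its RecordReader might look like with the new mapreduce API (one record per split); the class name matches the one used later in the question, but the implementation details are assumptions, not the author's code.

import org.apache.hadoop.fs.FSDataInputStream
import org.apache.hadoop.io.{BytesWritable, IntWritable}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Hypothetical whole-block input format: one record per split,
// key = rough block index, value = the raw bytes of that split.
class RandomAccessInputFormat extends FileInputFormat[IntWritable, BytesWritable] {
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[IntWritable, BytesWritable] =
    new WholeBlockRecordReader
}

class WholeBlockRecordReader extends RecordReader[IntWritable, BytesWritable] {
  private var fileSplit: FileSplit = _
  private var in: FSDataInputStream = _
  private val key = new IntWritable(0)
  private val value = new BytesWritable()
  private var processed = false

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    fileSplit = split.asInstanceOf[FileSplit]
    val fs = fileSplit.getPath.getFileSystem(context.getConfiguration)
    in = fs.open(fileSplit.getPath)
  }

  override def nextKeyValue(): Boolean = {
    if (processed) {
      false
    } else {
      val length = fileSplit.getLength.toInt
      val buffer = new Array[Byte](length)
      in.readFully(fileSplit.getStart, buffer)                    // read only this split's bytes
      key.set((fileSplit.getStart / fileSplit.getLength).toInt)   // rough block index (last split may be shorter)
      value.set(buffer, 0, length)
      processed = true
      true
    }
  }

  override def getCurrentKey(): IntWritable = key
  override def getCurrentValue(): BytesWritable = value
  override def getProgress(): Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = if (in != null) in.close()
}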

Now I would like to port this code to Spark. I am a Spark beginner and could run simple examples (word count, the Pi example), but I could not find a straightforward example for processing binary files in Spark. I see two possible solutions for this use case. The first: avoid the custom input format and record reader, find a method (approach) in Spark that creates an RDD from those HDFS blocks, and use a map-like method that feeds the HDFS block content to the business logic. If this is not possible, I would like to reuse the custom input format and custom record reader via methods such as the Hadoop API wrappers, HadoopRDD, etc. My problem: I do not know whether the first approach is possible or not. If it is, can anyone please provide some pointers with examples? I tried the second approach but was highly unsuccessful. Here is the code snippet I used:

package org {

  import org.apache.hadoop.io.{BytesWritable, IntWritable}
  import org.apache.spark.{SparkConf, SparkContext}

  object Driver {

    def myFunc(key: IntWritable, content: BytesWritable): Int = {
      println(key.get())
      println(content.getSize())
      return 1
    }

    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://:7077")
      val sc = new SparkContext(conf)
      println(sc)

      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat",
        classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map(x => myFunc(x._1, x._2)).reduce(_ + _)
      println("The count is *****************************" + count)
    }
  }
}

Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.

Can someone tell me where I am going wrong here? I think I am not using the API the right way, but I failed to find documentation or usage examples.
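For reference on the first approach mentioned above, Spark does ship built-in helpers such as sc.binaryFiles (whole files as streams) and sc.binaryRecords (fixed-length records). Neither splits input per HDFS block, so whether they fit depends on the business logic. A minimal sketch follows; the paths and the recordLength value are assumptions for illustration only.

import org.apache.spark.{SparkConf, SparkContext}

object FirstApproachSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BinaryNoCustomFormat"))

    // Whole files as streams: one record per file, not per HDFS block.
    val files = sc.binaryFiles("hdfs:///user/hadoop/")        // RDD[(String, PortableDataStream)]
    files.map { case (path, stream) =>
      val bytes = stream.toArray()                            // pulls the whole file into memory
      bytes.length
    }.collect().foreach(println)

    // Fixed-length records: only useful if the binary layout is record-oriented.
    val recordLength = 1024                                   // hypothetical record size in bytes
    val records = sc.binaryRecords("hdfs:///user/hadoop/myBin.dat", recordLength) // RDD[Array[Byte]]
    println(records.count())

    sc.stop()
  }
}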

Solution

I have made some progress on this issue. I am now using the code below, which does the job:

var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
  classOf[IntWritable],
  classOf[BytesWritable],
  job.getConfiguration())

val count = hRDD.mapPartitionsWithInputSplit { (split, iter) => myfuncPart(split, iter) }.collect()
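The myfuncPart function is not shown in the post. Below is a hypothetical sketch of a partition function with the shape mapPartitionsWithInputSplit expects; everything in it (the name, the logging, the placeholder logic) is assumed rather than taken from the original.

import org.apache.hadoop.io.{BytesWritable, IntWritable}
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.FileSplit

// Hypothetical per-partition function: receives the InputSplit plus the
// (key, value) iterator for that split and returns one result per record.
def myfuncPart(split: InputSplit,
               iter: Iterator[(IntWritable, BytesWritable)]): Iterator[Int] = {
  val fileSplit = split.asInstanceOf[FileSplit]
  println(s"Processing split ${fileSplit.getPath} @ ${fileSplit.getStart}")
  iter.map { case (key, value) =>
    // copyBytes() materialises the payload; Hadoop reuses Writable instances.
    val bytes = value.copyBytes()
    bytes.length // placeholder for the real business logic
  }
}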

However, I ended up with another error, the details of which I have posted here:

Issue in accessing HDFS file inside spark map function

15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
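The "No FileSystem for scheme: spark" message means a URI with the scheme spark:// (for example, the master URL) ended up where Hadoop's FileSystem API expected an hdfs:// or similar path. A hedged sketch of opening an HDFS file from inside a partition function with an explicit URI follows; the namenode address and path are assumptions, not values from the original post.

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Runs on an executor: build the FileSystem from an explicit hdfs:// URI
// rather than whatever default scheme happens to be configured on that node.
def readSideFile(): Array[Byte] = {
  val conf = new Configuration()
  // Hypothetical namenode address; replace with the cluster's actual value.
  val fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf)
  val in = fs.open(new Path("/user/hadoop/side-data.bin"))   // hypothetical path
  try {
    val out = new java.io.ByteArrayOutputStream()
    org.apache.hadoop.io.IOUtils.copyBytes(in, out, conf, false)
    out.toByteArray
  } finally {
    in.close()
  }
}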
