Error when building LabeledPoint from LibSVM-format data in Spark: requirement failed: Index 2287 out of bounds for vector of size 27

Latest recommended article published on 2020-12-18 11:09:40

License: CC 4.0 BY-SA

Tags: #LabeledPoint #Spark #推荐系统开发实战 (Recommender System Development in Practice) #Thinkgamer

Background

I was building LabeledPoint records from data in LibSVM format. For example, my LibSVM data looks like this (the largest feature index is 3000):

790718 1:1 2:1 4:1 5:1 6:1 7:1 9:1 11:1 13:1 16:1 19:1 21:1 28:1 31:1 43:1 64:1 65:1 140:1 164:1 184:1 296:1 463:1 481:1 642:1 813:1 1093:1 2288:1
692384 9:1 10:1 16:1 19:1 30:1 31:1 54:1 56:1 69:1 140:1 142:1 224:1 232:1 307:1 601:1 649:1 692:1 2851:1

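However, when constructing the LabeledPoint data I overlooked the size that the sparse vector should be created with, and used the following code: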

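val dataset = rdd
    // Parse "idx:val" pairs and shift the 1-based LibSVM indices to 0-based
    .map( l => ( l._1, l._2.split(" ").map(_.split(":")).map(e => (e(0).toInt-1, e(1).toDouble)) ) )
    // Bug: l._2.length is the number of index:value pairs on the line, not the feature dimension
    .map(l => LabeledPoint(l._1.toDouble, Vectors.sparse(l._2.length, l._2.filter(_._2!=0).map(_._1), l._2.filter(_._2!=0).map(_._2))))
    .toDF("label", "features")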

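This raised the error java.lang.IllegalArgumentException: requirement failed: Index 2287 out of bounds for vector of size 27. The first line contains 27 index:value pairs, so l._2.length made the vector size 27, while its largest zero-based index is 2287 (2288 - 1), which does not fit in a vector of that size.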
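To see where the 27 and the 2287 in the message come from, here is a minimal sketch in plain Scala (no Spark required) that parses the first sample line the same way the code above does. The object name LibsvmSizeCheck and the helper parse are hypothetical names introduced only for this illustration.

```scala
object LibsvmSizeCheck {
  // Parse "label idx:val idx:val ..." into (label, zero-based (index, value) pairs).
  def parse(line: String): (Double, Array[(Int, Double)]) = {
    val tokens = line.trim.split("\\s+")
    val entries = tokens.tail
      .map(_.split(":"))
      .map(e => (e(0).toInt - 1, e(1).toDouble)) // LibSVM indices are 1-based
    (tokens.head.toDouble, entries)
  }

  // First sample line from the post.
  val sample: String =
    "790718 1:1 2:1 4:1 5:1 6:1 7:1 9:1 11:1 13:1 16:1 19:1 21:1 28:1 31:1 43:1 " +
      "64:1 65:1 140:1 164:1 184:1 296:1 463:1 481:1 642:1 813:1 1093:1 2288:1"

  def main(args: Array[String]): Unit = {
    val (_, entries) = parse(sample)
    val numEntries = entries.length      // what the buggy code used as the vector size
    val maxIndex = entries.map(_._1).max // largest zero-based index on this line
    // prints: entries=27, maxIndex=2287, minimum size=2288
    println(s"entries=$numEntries, maxIndex=$maxIndex, minimum size=${maxIndex + 1}")
  }
}
```

The number of entries on a line says nothing about the feature dimension. The vector size has to be at least the largest zero-based index plus one, taken across the whole dataset.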

Solution

Change the vector size used when creating the LabeledPoint data to 3000, as follows:

val dataset = rdd
    .map( l => ( l._1, l._2.split(" ").map(_.split(":")).map(e => (e(0).toInt-1, e(1).toDouble)) ) )
    // The vector size must be larger than the largest zero-based index; here the largest 1-based index is 3000
    .map(l => LabeledPoint(l._1.toDouble, Vectors.sparse(3000, l._2.filter(_._2!=0).map(_._1), l._2.filter(_._2!=0).map(_._2))))
    .toDF("label", "features")
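Hardcoding 3000 works only while the largest index is known in advance. As a possible alternative, the sketch below derives the size from the data itself. It assumes the ml-package LabeledPoint and Vectors (the post does not say whether ml or mllib was used), a local SparkSession, and a small made-up rdd of (label, features) string pairs standing in for the post's rdd. The names LibsvmToLabeledPoint, parseEntries and requiredSize are hypothetical.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LibsvmToLabeledPoint {
  // Parse "idx:val idx:val ..." into zero-based (index, value) pairs, dropping zeros.
  def parseEntries(feats: String): Array[(Int, Double)] =
    feats.trim.split("\\s+")
      .map(_.split(":"))
      .map(e => (e(0).toInt - 1, e(1).toDouble))
      .filter(_._2 != 0)

  // Smallest vector size that fits one row: largest zero-based index + 1 (0 if the row is empty).
  def requiredSize(entries: Array[(Int, Double)]): Int =
    if (entries.isEmpty) 0 else entries.map(_._1).max + 1

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration
      .appName("libsvm-to-labeledpoint")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the post's rdd of (label, features) strings.
    val rdd = spark.sparkContext.parallelize(Seq(
      ("790718", "1:1 2:1 4:1 1093:1 2288:1"),
      ("692384", "9:1 10:1 649:1 692:1 2851:1")
    ))

    val parsed = rdd
      .map { case (label, feats) => (label.toDouble, parseEntries(feats)) }
      .cache() // reused twice: once for the size, once to build the vectors

    // One pass over the data to find the feature dimension.
    val numFeatures = parsed.map { case (_, entries) => requiredSize(entries) }.max()

    val dataset = parsed
      .map { case (label, entries) =>
        LabeledPoint(label, Vectors.sparse(numFeatures, entries.map(_._1), entries.map(_._2)))
      }
      .toDF() // column names come from the case class: label, features

    dataset.printSchema()
    dataset.show()
    spark.stop()
  }
}
```

This costs one extra pass over the data, and the derived size can differ between datasets, so a fixed, known dimension is still preferable when training and prediction data must share the same feature space. When the data is stored as a file in LibSVM format, Spark's built-in libsvm data source with its numFeatures option may also be worth considering instead of parsing by hand.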

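The printed schema and data are shown below (note that the vectors report size 3275 rather than 3000, so this output appears to come from a run that used a different size value):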

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)

+---------+--------------------+
|    label|            features|
+---------+--------------------+
| 790718.0|(3275,[0,1,3,4,5,...|
| 692384.0|(3275,[8,9,15,18,...|
| 672331.0|(3275,[0,1,2,7,8,...|
|1646601.0|   (3275,[30],[1.0])|
|1740585.0|(3275,[0,3,6,9,11...|
| 615659.0|(3275,[2,4,5,30,4...|
| 169763.0|(3275,[1,2,3,4,7,...|
| 639653.0|(3275,[1,2,4,10,1...|
|1774993.0|(3275,[6,11,13,14...|
|1680621.0|(3275,[11,16,31],...|
+---------+--------------------+
only showing top 10 rows

Problem solved! I hope this post helps you.

I have to admit that I am still not very familiar with how some of Spark's data formats are used.
