Crossing the river by feeling the stones: Java & Scala & Spark [0]

This post documents setting up a Spark development environment on Windows, including fixing a Scala version mismatch, a wrong JDK version, and a Spark-vs-Scala version conflict, and finally loading data in libsvm format.

SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("RandomForestTest")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.internal.config.TypedConfigBuilder.checkValue(Lscala/Function1;Ljava/lang/String;)Lorg/apache/spark/internal/config/TypedConfigBuilder;
    at org.apache.spark.sql.internal.SQLConf$.<init>(SQLConf.scala:276)
    at org.apache.spark.sql.internal.SQLConf$.<clinit>(SQLConf.scala)
    at org.apache.spark.sql.internal.StaticSQLConf$.<init>(StaticSQLConf.scala:31)
    at org.apache.spark.sql.internal.StaticSQLConf$.<clinit>(StaticSQLConf.scala)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:935)

This looks like a version mismatch between the Spark artifacts and the Scala library on the classpath.
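To pin down which side is out of step, a check I find handy (my own addition, not from the original post) is to print the Scala library version actually on the classpath and compare it with the _2.xx suffix of the spark-* artifacts in pom.xml; scala.util.Properties is the standard Scala library API and is callable from Java through its static forwarder:

// Minimal sketch: print the Scala library version visible on the classpath.
// It must match the _2.xx suffix of every spark-* artifact in pom.xml
// (e.g. spark-core_2.11 needs a 2.11.x scala-library).
System.out.println("scala-library on classpath: "
        + scala.util.Properties.versionNumberString());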

I wanted to check which Scala version was installed in my local Windows environment; running scala from a cmd prompt normally prints the version.

C:\Users\A585043>scala
\scala\bin\scala.bat) was unexpected at this time.

Reference: fixing the "\scala\bin\scala.bat) was unexpected at this time" error after installing Scala on Windows

The gist: after installing Scala, this error appears if the installation directory contains spaces.

That was my case too; it was installed under

C:\Program Files (x86)\scala

Anyway, I uninstalled it and reinstalled Scala 2.11 into a directory without spaces.

C:\Users\A585043>scala
Welcome to Scala 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161).
Type in expressions for evaluation. Or try :help.

With Scala reinstalled, the next step was resolving the version conflict itself.

Reference: on Spark and Scala version conflict issues

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/network/util/ByteUnit : Unsupported major.minor version 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

The JDK version is also off: class-file version 52.0 corresponds to Java 8, so the Spark 2.x classes were compiled for Java 8 while the program was running on an older JVM.
Changing the Java version to 1.8 under Project Facets fixed it.
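Besides the compiler-level setting, it is worth double-checking which JVM actually launches the program; a two-line check (my addition, dropped into any main method) makes the mismatch obvious:

// Spark 2.x class files are version 52.0 (Java 8), so the runtime JVM
// must report 1.8 or later; java.home shows which installation is used.
System.out.println("java.version = " + System.getProperty("java.version"));
System.out.println("java.home    = " + System.getProperty("java.home"));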

Then the next problem: reading the libsvm file also threw an error.

Dataset<Row> data = spark.read().format("libsvm").option("numFeatures", "780").load(libsvmFile);
// Dataset<Row> data = spark.read().format("libsvm").load(libsvmFile);

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: libsvm. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)

The libsvm data source has only existed since Spark 1.6. Checking pom.xml confirmed the declared Spark version was indeed too old.

The Maven central repository entry:
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11/2.3.0

Along the way, when I was out of other ideas, I read through the Spark Java API docs:
http://spark.apache.org/docs/latest/api/java/

public class LibSVMDataSource
extends Object

libsvm package implements Spark SQL data source API for loading LIBSVM data as DataFrame. The loaded DataFrame has two columns: label containing labels stored as doubles and features containing feature vectors stored as Vectors.

To use LIBSVM data source, you need to set "libsvm" as the format in DataFrameReader and optionally specify options, for example:


   // Scala
   val df = spark.read.format("libsvm")
     .option("numFeatures", "780")
     .load("data/mllib/sample_libsvm_data.txt")

   // Java
   Dataset<Row> df = spark.read().format("libsvm")
     .option("numFeatures, "780")
     .load("data/mllib/sample_libsvm_data.txt");


LIBSVM data source supports the following options:
- "numFeatures": number of features. If unspecified or nonpositive, the number of features will be determined automatically at the cost of one additional pass. This is also useful when the dataset is already split into multiple files and you want to load them separately, because some features may not be present in certain files, which leads to inconsistent feature dimensions.
- "vectorType": feature vector type, "sparse" (default) or "dense".
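Combining the two options in the Java API looks like the snippet below; the feature count and the libsvmFile path are just placeholders for whatever data the job actually reads:

// "numFeatures" skips the extra counting pass and keeps dimensions consistent
// across split files; "vectorType" chooses sparse (default) or dense vectors.
Dataset<Row> data = spark.read().format("libsvm")
    .option("numFeatures", "780")
    .option("vectorType", "sparse")
    .load(libsvmFile);
data.printSchema();   // root |-- label: double |-- features: vector
data.show(5, false);  // peek at the first rows without truncation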

Finally, the pom.xml dependencies, for reference:

<dependency>  
        <groupId>org.apache.spark</groupId>  
        <artifactId>spark-core_2.11</artifactId>  
        <version>2.3.0</version> 
    </dependency>  
     <dependency> 
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.3.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>