A translation of the official documentation: https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started
This tutorial teaches you how to run a .NET for Apache Spark app using .NET Core on Windows.
In this tutorial, you learn how to:
- Prepare your Windows environment for .NET for Apache Spark
- Download the Microsoft.Spark.Worker
- Build and run a simple .NET for Apache Spark application
Prepare your environment
Before you begin, make sure you can run dotnet, java, mvn, and spark-shell from your command line. If your environment is already prepared, you can skip to the next section. If you cannot run any or all of the commands, follow the steps below.
- Download and install the .NET Core 2.1x SDK. Installing the SDK adds the dotnet toolchain to your PATH. Use the PowerShell command dotnet --version to verify the installation.
- Install Visual Studio 2017 or Visual Studio 2019 with the latest updates. You can use Community, Professional, or Enterprise. The Community version is free. Choose the following minimum requirements during installation:
  - .NET desktop development
    - All required components
    - .NET Framework 4.6.1 Development Tools
  - .NET Core cross-platform development
    - All required components
- Install Java 1.8.
  - Select the appropriate version for your operating system. For example, select jdk-8u201-windows-x64.exe for a Windows x64 machine.
  - Use the PowerShell command java -version to verify the installation.
- Install Apache Maven 3.6.0+.
  - Download Apache Maven 3.6.0.
  - Extract it to a local directory. For example, c:\bin\apache-maven-3.6.0\.
  - Add Apache Maven to your PATH environment variable. If you extracted to c:\bin\apache-maven-3.6.0\, you would add c:\bin\apache-maven-3.6.0\bin to your PATH.
  - Use the PowerShell command mvn -version to verify the installation.
- Install Apache Spark 2.3+. Apache Spark 2.4+ isn't supported.
  - Download Apache Spark 2.3+ and extract it into a local folder using a tool like 7-Zip or WinZip. For example, you might extract it to c:\bin\spark-2.3.2-bin-hadoop2.7\. (Translator's note: the download is a .tgz file, which 7-Zip or WinZip can extract directly.)
  - Add Apache Spark to your PATH environment variable. If you extracted to c:\bin\spark-2.3.2-bin-hadoop2.7\, you would add c:\bin\spark-2.3.2-bin-hadoop2.7\bin to your PATH.
  - Add a new environment variable called SPARK_HOME. If you extracted to C:\bin\spark-2.3.2-bin-hadoop2.7\, use C:\bin\spark-2.3.2-bin-hadoop2.7\ for the variable value.
  - Verify that you can run spark-shell from your command line.
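The Spark environment-variable steps above can be scripted from PowerShell instead of the System Properties dialog. A minimal sketch, assuming the example extract path c:\bin\spark-2.3.2-bin-hadoop2.7\ (adjust to your actual location); it writes user-scoped variables, so open a new terminal afterwards to pick them up:

```powershell
# Point SPARK_HOME at the Spark install root (user scope, persisted).
[Environment]::SetEnvironmentVariable("SPARK_HOME", "C:\bin\spark-2.3.2-bin-hadoop2.7\", "User")

# Append Spark's bin folder to the user PATH so spark-shell resolves.
$userPath = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$userPath;C:\bin\spark-2.3.2-bin-hadoop2.7\bin", "User")
```

User scope is used here rather than Machine scope so the commands do not require an elevated prompt.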
- Set up WinUtils.
  - Download the winutils.exe binary from the WinUtils repository. Select the version of Hadoop the Spark distribution was compiled with. For example, use hadoop-2.7.1 for Spark 2.3.2. The Hadoop version is annotated at the end of your Spark install folder name. (Translator's note: my machine has spark-2.3.3-bin-hadoop2.7 installed, so the Spark version is 2.3.3 and the Hadoop version is 2.7.)
  - Save the winutils.exe binary to a directory of your choice. For example, c:\hadoop\bin.
  - Set HADOOP_HOME to the directory containing winutils.exe, without the bin portion. For example, c:\hadoop.
  - Set the PATH environment variable to include %HADOOP_HOME%\bin.
Double-check that you can run dotnet, java, mvn, and spark-shell from your command line before you move on to the next section.
(Translator's note: verified on my machine.)
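The double-check above can be done with one version query per tool. A quick sketch; each command should print a version banner rather than a "not recognized" error (spark-shell forwards --version to spark-submit, which prints the Spark version and exits):

```powershell
# Each of these should succeed and report a version.
dotnet --version
java -version
mvn -version
spark-shell --version
```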
Download the Microsoft.Spark.Worker release
- Download the Microsoft.Spark.Worker release from the .NET for Apache Spark GitHub Releases page to your local machine. For example, you might download it to the path c:\bin\Microsoft.Spark.Worker\.
- Create a new environment variable called DotnetWorkerPath and set it to the directory where you downloaded and extracted the Microsoft.Spark.Worker, for example, c:\bin\Microsoft.Spark.Worker.
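As with the earlier variables, DotnetWorkerPath can be set from PowerShell. A sketch, assuming the example extract location c:\bin\Microsoft.Spark.Worker (adjust to yours):

```powershell
# Persist DotnetWorkerPath for the current user; open a new terminal
# afterwards so spark-submit sees the variable.
[Environment]::SetEnvironmentVariable("DotnetWorkerPath", "C:\bin\Microsoft.Spark.Worker", "User")
```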
Clone the .NET for Apache Spark GitHub repo
Use the following GitBash command to clone the .NET for Apache Spark repo to your machine.
git clone https://github.com/dotnet/spark.git c:\github\dotnet-spark
Write a .NET for Apache Spark app
- Open Visual Studio and navigate to File > Create New Project > Console App (.NET Core). Name the application HelloSpark.
- Install the Microsoft.Spark NuGet package. For more information on installing NuGet packages, see Different ways to install a NuGet package.
- In Solution Explorer, open Program.cs and write the following C# code:
  var spark = SparkSession.Builder().GetOrCreate();
  var df = spark.Read().Json("people.json");
  df.Show();
- Build the solution.
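Assembled as a complete Program.cs, the snippet above might look like the following sketch (the namespace and class names are arbitrary; the SparkSession types come from the Microsoft.Spark.Sql namespace in the Microsoft.Spark NuGet package):

```csharp
using Microsoft.Spark.Sql;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
            // Get (or create) a Spark session; when launched through
            // spark-submit and the DotnetRunner, this connects to the
            // JVM-side Spark driver.
            var spark = SparkSession.Builder().GetOrCreate();

            // Read the JSON-lines file created in the next section
            // from the app's working directory.
            var df = spark.Read().Json("people.json");

            // Print the DataFrame contents to the console.
            df.Show();
        }
    }
}
```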
Run your .NET for Apache Spark app
- Open PowerShell and change the directory to the folder where your app is stored.
  cd <your-app-output-directory>
- Create a file called people.json with the following content:
  {"name":"Michael"}
  {"name":"Andy", "age":30}
  {"name":"Justin", "age":19}
- Use the following PowerShell command to run your app:
  spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  microsoft-spark-2.4.x-<version>.jar `
  HelloSpark
Translator's note: my machine has Spark 2.3.3 installed, so the command I ran is:
spark-submit `
--class org.apache.spark.deploy.DotnetRunner `
--master local `
microsoft-spark-2.3.x-0.2.0.jar `
HelloSpark
Translator's note: it ran successfully. The screenshot is too large to include here.
Congratulations! You successfully authored and ran a .NET for Apache Spark app.