Installing Apache Spark and integrating it with IPython Notebook (Python)

This article walks through installing Apache Spark in a local environment and configuring IPython Notebook to work with Spark, covering downloading and extracting Spark, setting environment variables, downloading and installing Anaconda Python, and creating a PySpark profile.

From: http://jmdvinodjmd.blogspot.in/2015/08/installing-ipython-notebook-with-apache.html
1. First of all, download Apache Spark from the official downloads page.
Select a pre-built version such as spark-1.4.1-bin-hadoop2.6.tgz; otherwise you will need to build Spark from source.

2. Now extract Spark from the downloaded archive and place it at your desired location (I placed it in the C drive).

3. Create an environment variable named 'SPARK_HOME' whose value is the path to the folder containing Apache Spark. (I placed Spark in the C drive, so I set SPARK_HOME to 'C:\spark-1.4.1-bin-hadoop2.6'.)
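
If you prefer the command prompt over the System Properties dialog, a command along these lines should create the variable for the current user (adjust the path to wherever you placed Spark, and open a new command prompt afterwards so the change takes effect):

setx SPARK_HOME "C:\spark-1.4.1-bin-hadoop2.6"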

4. Download Anaconda Python distribution from here.
You can download either the 2.x or the 3.x version. I suggest downloading 2.x, since Python 3 may not be supported by this version of Apache Spark. I downloaded the Windows 32-bit, Python 2.7, graphical installer.

5. Install Anaconda. This installs Python 2.7, IPython and the other necessary Python libraries. It should also set the environment variables required to access Python from the command prompt.
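
To confirm that the installation worked and the PATH changes took effect, you can open a new command prompt and check the versions with the standard commands:

python --version
ipython --version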

6. Open a command prompt and enter the following command:

ipython profile create pyspark

This should create a pyspark profile where we need to make some changes.
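
If you are not sure where the profile was created, a command like the following should print the profile directory, which the next steps refer to:

ipython locate profile pyspark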

Now we need to edit two configuration files: ipython_notebook_config.py and 00-pyspark-setup.py.

7. Locate the ipython_notebook_config.py file at C:\Users\<your_user_name>\.ipython\profile_pyspark\ipython_notebook_config.py (replace <your_user_name> with the Windows user currently logged in) and add the following lines to it:

c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = True
c.NotebookApp.port = 8880 # or whatever you want; be aware of conflicts with CDH

8. Now create a file named 00-pyspark-setup.py in C:\Users\Vinod\.ipython\profile_pyspark\startup
(Note: this path contains 'Vinod', which is the author's user name; replace it with the user name currently logged in to Windows.)
Add the following contents to 00-pyspark-setup.py:

import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Note: all lines should stay the same except the second-to-last one, which contains the location of the py4j library. Inside the Spark folder, open python\lib, check which py4j version is shipped there, and adjust that line accordingly. A version-independent variant is sketched below.
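
If you would rather not hard-code the py4j version, the startup script can locate the archive itself. The following is a minimal sketch of that variation (keeping the article's Python 2 execfile call), assuming exactly one py4j-*-src.zip sits under python\lib of your Spark installation:

import glob
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
# pick up whichever py4j version ships with this Spark release
py4j_zips = glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*-src.zip'))
if not py4j_zips:
    raise ValueError('no py4j-*-src.zip found under python/lib')
sys.path.insert(0, py4j_zips[0])
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))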

9. Now open the command prompt and type the following command to run the IPython Notebook:

ipython notebook --profile=pyspark

This should launch the IPython Notebook in your browser.

10. Create a notebook, type 'sc' in a cell, and run it. You should get output like:

<pyspark.context.SparkContext at 0x4f4a490>
This indicates that your IPython Notebook and Spark are successfully integrated. If the cell output is empty instead, the integration was unsuccessful.
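Beyond checking that 'sc' prints a SparkContext, you can run a small job in another cell to confirm that Spark actually executes work. A minimal sketch (the numbers are arbitrary):

sc.parallelize(range(100)).map(lambda x: x * x).sum()

This should return 328350, the sum of the squares of 0 through 99.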
NOTE: If IPython Notebook was installed separately rather than through Anaconda as described above, you might get the following exception:
Java gateway process exited before sending the driver its port number.

Reposted from: https://www.cnblogs.com/fuxiaotong/p/5115704.html

### Python + Spark integration for data analysis

#### Installing the dependency

To use Apache Spark from a Python environment, install the `pyspark` package. This can be done with pip:

```bash
pip install pyspark
```

For more involved development setups, such as working in the PyCharm IDE, you also need to configure the project interpreter and add `pyspark` as an external library[^2].

#### Initializing a SparkSession

Create an entry point for writing distributed applications. In the newer API, `SparkSession` is recommended in place of the older `SQLContext` and `HiveContext`:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ExampleApp")
    .config("spark.some.config.option", "some-value")  # optional configuration
    .getOrCreate()
)
```

#### Loading data and basic operations

Files of different types, such as CSV or JSON, can be read into a DataFrame structure for running queries; you can also build RDDs directly and use transformation functions (map/filter/groupByKey, ...) for lower-level control.

```python
from pyspark.sql.functions import col, count, isnan, when

df = spark.read.csv('file_path', header=True, inferSchema=True)

# Show the first few rows
df.show(5)

# Count the missing values in each column
null_counts = df.select(
    [count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]
).collect()
print(null_counts)
```

To filter out unwanted records for a specific need, for example removing every tuple whose second field equals zero, such as (`region`, `0`):

```python
filtered_rdd = rdd.filter(lambda x: x[1] != 0)
```

#### Running more complex computations

With the built-in machine learning library MLlib or the GraphX component, it is straightforward to carry out clustering, classification/prediction, and graph-algorithm workflows. That material, however, will be covered in detail in a follow-up article[^1].
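
As a further illustration of the DataFrame queries mentioned above, here is a small aggregation sketch; the column names `region` and `amount` are assumed for illustration only and are not part of the original example:

```python
# Hypothetical columns: group rows by 'region' and sum 'amount' within each group
df.groupBy("region").agg({"amount": "sum"}).show()
```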