R formulas in Spark and un-nesting data in SparklyR: Nice and handy!

This post shares two handy tricks: first, using RFormula for feature selection, which automatically one-hot encodes string features; second, solving the problem of reading nested data in sparklyR by invoking Spark's select and col methods to flatten the data.


Intro

In an earlier post I talked about Spark and sparklyR and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfect sense to use Spark through the sparklyR interface. However, using Spark through the pySpark interface certainly has its benefits: it exposes much more of the Spark functionality, and I find the concept of ML Pipelines in Spark very elegant.

While using Spark I came across two little tricks that I'd like to share with you below.

The RFormula feature selector

As an R user you have to get used to using Spark through pySpark; moreover, I had to brush up some of my rusty Python knowledge. For training machine learning models there is some help, though, in the form of an RFormula.

R users know the concept of model formulae in R; it can be a handy way to formulate predictive models concisely. In Spark you can use this concept as well. Only a limited set of the R formula operators is available (+, . and :), but it is enough to be useful. The code below shows a simple example.

from pyspark.ml.feature import RFormula

# Model formula: predict Targetf from paidDuration and Gender
f1 = "Targetf ~ paidDuration + Gender"
formula = RFormula(formula=f1)
train2 = formula.fit(train).transform(train)


A handy thing about an RFormula in Spark (just like a formula in R's lm and some other modeling functions) is that string features used in the RFormula are automatically one-hot encoded, so that they can be used directly in Spark's machine learning algorithms.
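As a minimal sketch of how this works (the data below are invented for illustration, matching the column names in the formula above), the snippet builds a toy DataFrame and shows that the string column Gender ends up one-hot encoded inside the features vector:

from pyspark.sql import SparkSession
from pyspark.ml.feature import RFormula

spark = SparkSession.builder.getOrCreate()

# Toy training data; Gender is a string feature.
train = spark.createDataFrame(
    [(120.0, "M", 1.0), (95.0, "F", 0.0), (240.0, "F", 1.0)],
    ["paidDuration", "Gender", "Targetf"])

formula = RFormula(formula="Targetf ~ paidDuration + Gender")
train2 = formula.fit(train).transform(train)

# 'features' holds paidDuration plus the one-hot encoded Gender;
# 'label' holds Targetf, ready for Spark ML algorithms.
train2.select("features", "label").show(truncate=False)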

Nested (hierarchical) data in sparklyR

Sometimes you may find yourself with nested (hierarchical) data. In pySpark you can flatten this hierarchy if needed: after reading in, say, a parquet file with a nested structure, you can select the nested fields as top-level columns, as in the first sketch below.

In sparklyR, however, reading the same parquet file results in something that isn't useful to work with at first sight. If you open the table viewer to see the data, you will see rows containing only <environment>. Fortunately, the facilities that sparklyR uses internally to call Spark are available to the end user, so you can invoke additional Spark methods when needed. In particular, we can invoke the select and col methods ourselves to flatten the hierarchy, as in the second sketch below. After registering the output object, it is visible in the Spark interface and you can view its content.
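The schema and code from the original screenshots are not reproduced here; instead, here is a minimal pySpark sketch, assuming a hypothetical nested column person with fields name and age:

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical nested data standing in for the parquet file.
df = spark.createDataFrame([
    Row(person=Row(name="John", age=35)),
    Row(person=Row(name="Mary", age=42))])
df.printSchema()   # person: struct<name:string, age:bigint>

# Flatten by selecting the struct fields as top-level columns.
flat = df.select(
    col("person.name").alias("name"),
    col("person.age").alias("age"))
flat.show()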
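And a hedged sparklyR sketch of the same idea, using the same hypothetical person struct and a placeholder parquet path; invoke(), invoke_static(), spark_dataframe() and sdf_register() are the sparklyr facilities referred to above:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical path; the original post read nested parquet data on EMR.
nested_tbl <- spark_read_parquet(sc, name = "nested", path = "path/to/file.parquet")

# Build Column objects with the static functions.col() method ...
name_col <- invoke_static(sc, "org.apache.spark.sql.functions", "col", "person.name")
age_col  <- invoke_static(sc, "org.apache.spark.sql.functions", "col", "person.age")

# ... and pass them to the Dataset's select() method to flatten the struct.
flat_sdf <- nested_tbl %>%
  spark_dataframe() %>%                # the underlying Spark Dataset object
  invoke("select", list(name_col, age_col))

# Register the result so it shows up as an ordinary sparklyr table.
flat_tbl <- sdf_register(flat_sdf, "flattened")
head(flat_tbl)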

Thanks for reading about my two tricks. Cheers, Longhow.

Source: https://longhowlam.wordpress.com/2017/02/15/r-formulas-in-spark-and-un-nesting-data-in-sparklyr-nice-and-handy/

Reposted from: https://www.cnblogs.com/payton/p/6405543.html
