Spark configuration
SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> airports = sc.textFile("in/airports.text");
Creating RDDs
rdd = sc.parallelize([1, 2, 3, 4])
rdd = sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
- or s3n://, hdfs://
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
can also create from:
- JDBC
- Cassandra
- HBase
- Elasticsearch
- JSON, CSV, sequence files, …
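For reference, the same two basic creation paths in the Java API used throughout these notes (a small sketch; the file path is the one from the example above):
// From an in-memory collection
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));
// From a text file (local paths, s3n:// and hdfs:// URIs all work)
JavaRDD<String> textLines = sc.textFile("in/airports.text");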
Transformation and Action
## Transformations
Transformations will return a new RDD
map and filter
- filter
JavaRDD<String> cleanedLines = lines.filter(line -> !line.isEmpty());
- map
JavaRDD<String> URLs = sc.textFile("in/urls.text");
URLs.map(url -> makeHttpRequest(url));
// the return type of the map function is not necessarily the same as its input type
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<Integer> lengths = lines.map(line -> line.length());
airport example
public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> airports = sc.textFile("in/airports.text");
    JavaRDD<String> airportsInUSA = airports.filter(line -> line.split(",")[3].equals("USA"));

    JavaRDD<String> airportsNameAndCityNames = airportsInUSA.map(line -> {
        String[] splits = line.split(",");
        return StringUtils.join(new String[]{splits[1], splits[2]}, ",");
    });

    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text");
}
- flatMap: first map, then flatten
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
System.out.println(entry.getKey() + " : " + entry.getValue());
}
Function types
Function: one input and one output => map and filter
Function2: two inputs and one output => aggregate and reduce
FlatMapFunction: one input, 0 or more outputs => flatMap
Example
lines.filter(new Function<String, Boolean>() {
    public Boolean call(String line) throws Exception {
        return line.startsWith("Friday");
    }
});
lines.filter(line -> line.startsWith("Friday"));
lines.filter(new StartsWithFriday());
static class StartsWithFriday implements Function<String, Boolean> {
public Boolean call(String line) {
return line.startsWith("Friday");
}
}
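The other function interfaces can be used in the same three ways; a minimal sketch (assuming the usual imports from org.apache.spark.api.java.function) showing an explicit Function2 with reduce and a FlatMapFunction with flatMap:
// Function2: two Integer inputs, one Integer output (used by reduce)
Integer sum = sc.parallelize(Arrays.asList(1, 2, 3, 4))
        .reduce(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });
// FlatMapFunction: one String input, zero or more String outputs
JavaRDD<String> tokens = sc.parallelize(Arrays.asList("hello world", "hi"))
        .flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String line) {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });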
set operations
- sample: the sample operation creates a random sample from an RDD.
sample(boolean withReplacement, double fraction)
- distinct: returns the distinct rows from the input RDD. Expensive, since it requires shuffling all the data across partitions to ensure that we receive only one copy of each element.
- union: returns an RDD consisting of the data from both input RDDs; it keeps duplicates.
- intersection: returns only the elements found in both input RDDs and removes all duplicates.
- subtract: returns the elements of the source RDD that are not present in the other RDD.
- cartesian (Cartesian product): returns all possible pairs (a, b) where a is in the source RDD and b is in the other RDD.
JavaRDD<String> aggregatedLogLines = julyFirstLogs.union(augustFirstLogs);
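A minimal sketch of these operations on two small, made-up integer RDDs (the comments show what each call would produce):
JavaRDD<Integer> a = sc.parallelize(Arrays.asList(1, 2, 2, 3));
JavaRDD<Integer> b = sc.parallelize(Arrays.asList(2, 3, 4));
JavaRDD<Integer> sampled = a.sample(false, 0.5);         // random ~50% sample, no replacement
JavaRDD<Integer> unique = a.distinct();                  // 1, 2, 3 (requires a shuffle)
JavaRDD<Integer> combined = a.union(b);                  // 1, 2, 2, 3, 2, 3, 4 (duplicates kept)
JavaRDD<Integer> common = a.intersection(b);             // 2, 3 (deduplicated)
JavaRDD<Integer> onlyInA = a.subtract(b);                // 1
JavaPairRDD<Integer, Integer> allPairs = a.cartesian(b); // every (x, y) with x from a, y from b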
## Actions
- collect: retrieves the entire RDD and returns it to the driver program in the form of a regular collection or value (e.g., a String RDD becomes a list of Strings).
List<String> inputWords = Arrays.asList("spark", "hadoop", "hive", "pig");
JavaRDD<String> wordRDD = sc.parallelize(inputWords);
List<String> words = wordRDD.collect();
for (String word : words) {
System.out.println(word);
}
- count
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
long wordCounts = words.count();
- countByValue: looks at the unique values in each row of the RDD and returns a map of each unique value to its count.
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
System.out.println(entry.getKey() + " : " + entry.getValue());
}
- take: takes n elements from an RDD. It tries to access as few partitions as possible, so it may return a biased collection.
List<String> inputWords = Arrays.asList("spark", "hadoop", "hive", "pig");
JavaRDD<String> wordRDD = sc.parallelize(inputWords);
List<String> words = wordRDD.take(3);
for (String word : words) {
System.out.println(word);
}
- reduce: takes a function that operates on two elements of the type in the input RDD and returns a new element of the same type. Applied repeatedly over the RDD's data, it reduces them to a single value, so we can perform different types of aggregations with it.
List<Integer> inputIntegers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
Integer product = integerRdd.reduce((x, y) -> x * y);
RDD features
RDDs are immutable: they cannot be changed after they are created.
RDDs are distributed: an RDD is broken into multiple pieces called partitions, and these partitions are distributed across the cluster.
RDDs are a deterministic function of their input, so they can be recreated at any time. If a node in the cluster goes down, Spark can recompute the lost parts of the RDDs from the input and pick up where it left off; RDDs are therefore fault tolerant.
lazy evaluation
Lazy evaluation: transformations just build up the computation graph.
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<String> linesWithFriday = lines.filter(line -> line.startsWith("Friday"));
// Spark scans the file only until the first line starting with "Friday" is detected, so it doesn't need to go through the entire file.
String firstLineWithFriday = linesWithFriday.first();
- Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action.
- Rather than thinking of an RDD as containing specific data, it might be better to think of each RDD as consisting of instructions on how to compute the data that we build up through transformations.
- Spark uses lazy evaluation to reduce the number of passes it has to take over our data by grouping operations together.
Transformations return RDDs, whereas actions return some other data type.
// Transformations
JavaRDD<String> linesWithFriday = lines.filter(line -> line.contains("Friday"));
JavaRDD<Integer> lengths = lines.map(line -> line.length());
// Actions
List<String> words = wordRdd.collect();
String firstLine = lines.first();
### Persistence
persist: caches the RDD at the given storage level so that subsequent actions can reuse it instead of recomputing it.
List<Integer> inputIntegers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
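// Cache the RDD in memory so the two actions below (reduce and count) reuse it instead of recomputing it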
integerRdd.persist(StorageLevel.MEMORY_ONLY());
integerRdd.reduce((x, y) -> x * y);
integerRdd.count();
Pair RDD
A pair RDD is a particular type of RDD that stores key-value pairs.
create Pair RDDs
From tuples
Tuple2<Integer, String> tuple = new Tuple2<>(12, "value");
Integer key = tuple._1();
String value = tuple._2();
// list of tuple
public <K, V> JavaPairRDD<K, V> parallelizePairs(List<Tuple2<K, V>> list)
List<Tuple2<String, Integer>> tuple = Arrays.asList(new Tuple2<>("Lily", 23), new Tuple2<>("Jack", 29), new Tuple2<>("Mary", 26));
JavaPairRDD<String, Integer> pairRDD = sc.parallelizePairs(tuple);
pairRDD.coalesce(1).saveAsTextFile("out/pair_rdd_from_tuple_list");
- coalesce(numPartitions): returns a new RDD that is reduced to numPartitions partitions.
- collectAsMap: returns the results of a pair RDD as a Map collection. Since it returns a Map, you only get pairs with unique keys; pairs with duplicate keys are dropped. If you use collect instead, it returns an array of tuples without losing any of your pairs.
From regular RDDs
List<String> inputStrings = Arrays.asList("Lily 23", "Jack 29", "Mary 26");
JavaRDD<String> regularRDDs = sc.parallelize(inputStrings);
JavaPairRDD<String, Integer> pairRDD = regularRDDs.mapToPair(getNameAndAgePair());
private static PairFunction<String, String, Integer> getNameAndAgePair() {
    return (PairFunction<String, String, Integer>) s -> new Tuple2<>(s.split(" ")[0], Integer.valueOf(s.split(" ")[1]));
}
transformations on Pair RDD
- Pair RDDs are allowed to use all the transformations available to regular RDDs, and thus support the same functions as regular RDDs.
- Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements.
filter
JavaRDD<String> airportsRDD = sc.textFile("in/airports.text");
JavaPairRDD<String, String> airportPairRDD = airportsRDD.mapToPair(getAirportNameAndCountryNamePair());
JavaPairRDD<String, String> airportsNotInUSA = airportPairRDD.filter(keyValue -> !keyValue._2().equals("\"United States\""));

private static PairFunction<String, String, String> getAirportNameAndCountryNamePair() {
    return (PairFunction<String, String, String>) line -> new Tuple2<>(line.split(Utils.COMMA_DELIMITER)[1], line.split(Utils.COMMA_DELIMITER)[3]);
}
map and mapValues
mapValues
JavaRDD<String> airportsRDD = sc.textFile("in/airports.text");
JavaPairRDD<String, String> airportPairRDD = airportsRDD.mapToPair(getAirportNameAndCountryNamePair());
JavaPairRDD<String, String> upperCase = airportPairRDD.mapValues(countryName -> countryName.toUpperCase());
private static PairFunction<String, String, String> getAirportNameAndCountryNamePair() {
    return (PairFunction<String, String, String>) line -> new Tuple2<>(line.split(Utils.COMMA_DELIMITER)[1], line.split(Utils.COMMA_DELIMITER)[3]);
}
reduceByKey
- When our dataset is described as key-value pairs, it is quite common to want to aggregate statistics across all elements with the same key.
- We have looked at the reduce action on regular RDDs; there is a similar operation for pair RDDs called reduceByKey.
- reduceByKey runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key.
- Since input datasets can have a huge number of keys, reduceByKey is not implemented as an action that returns a value to the driver program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.
public static void main(String[] args) throws Exception {
Logger.getLogger("org").setLevel(Level.ERROR);
SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
}
Problem: wordCounts might be too large to fit into memory
Solution: reduceByKey
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> wordRdd = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
JavaPairRDD<String, Integer> wordPairRdd = wordRdd.mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1));
JavaPairRDD<String, Integer> wordCounts = wordPairRdd.reduceByKey((Function2<Integer, Integer, Integer>) (x, y) -> x + y);
Map<String, Integer> wordCountsMap = wordCounts.collectAsMap();
JavaRDD<String> lines = sc.textFile("in/RealEstate.csv");
JavaRDD<String> cleanedLines = lines.filter(line -> !line.contains("Bedrooms"));
JavaPairRDD<String, AvgCount> housePricePairRdd = cleanedLines.mapToPair(
line -> new Tuple2<>(line.split(",")[3], new AvgCount(1, Double.parseDouble(line.split(",")[2]))));
JavaPairRDD<String, AvgCount> housePriceTotal = housePricePairRdd.reduceByKey(
(x, y) -> new AvgCount(x.getCount() + y.getCount(), x.getTotal() + y.getTotal()));
JavaPairRDD<String, Double> housePriceAvg = housePriceTotal.mapValues(avgCount -> avgCount.getTotal()/avgCount.getCount());
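The AvgCount class used above is not shown in these notes; a minimal sketch that matches the constructor and getters used here:
// Simple holder for a running (count, total) pair; Serializable so Spark can ship it with the tasks
public class AvgCount implements Serializable {
    private final int count;
    private final double total;

    public AvgCount(int count, double total) {
        this.count = count;
        this.total = total;
    }

    public int getCount() { return count; }
    public double getTotal() { return total; }
}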
groupByKey
JavaPairRDD<String, String> countryAndAirportNamePair =
    lines.mapToPair(line -> new Tuple2<>(line.split(Utils.COMMA_DELIMITER)[3], line.split(Utils.COMMA_DELIMITER)[1]));
JavaPairRDD<String, Iterable<String>> airportsByCountry = countryAndAirportNamePair.groupByKey();
groupByKey versus reduceByKey
For counting, reduceByKey is usually preferred: it combines values within each partition before the shuffle (map-side combine), so it moves much less data than groupByKey followed by a per-key aggregation.
List<String> words = Arrays.asList("one", "two", "two", "three", "three", "three");
JavaPairRDD<String, Integer> wordsPairRdd = sc.parallelize(words).mapToPair(word -> new Tuple2<>(word, 1));
List<Tuple2<String, Integer>> wordCountsWithReduceByKey = wordsPairRdd.reduceByKey((x, y) -> x + y).collect();
List<Tuple2<String, Integer>> wordCountsWithGroupByKey = wordsPairRdd.groupByKey().mapValues(intIterable -> Iterables.size(intIterable)).collect();
sortByKey
- sortByKey in reverse order: wordsPairRdd.sortByKey(false); (the boolean argument is the ascending flag)
JavaRDD<String> lines = sc.textFile("in/RealEstate.csv");
JavaRDD<String> cleanedLines = lines.filter(line -> !line.contains("Bedrooms"));
JavaPairRDD<Integer, AvgCount> housePricePairRdd = cleanedLines.mapToPair(
line -> new Tuple2<>(Integer.valueOf(line.split(",")[3]), new AvgCount(1, Double.parseDouble(line.split(",")[2]))));
JavaPairRDD<Integer, AvgCount> housePriceTotal = housePricePairRdd.reduceByKey(
(x, y) -> new AvgCount(x.getCount() + y.getCount(), x.getTotal() + y.getTotal()));
JavaPairRDD<Integer, Double> housePriceAvg = housePriceTotal.mapValues(avgCount -> avgCount.getTotal()/avgCount.getCount());
JavaPairRDD<Integer, Double> sortedHousePriceAvg = housePriceAvg.sortByKey();
aggregateByKey
Important Points
- Performance-wise, aggregateByKey is an optimized transformation.
- aggregateByKey is a wider transformation.
- We should use aggregateByKey when aggregation is required and the types of the input and output RDDs are different.
- We can use reduceByKey when the input and output RDD types are the same.
# Creating PairRDD student_rdd with key value pairs
student_rdd = sc.parallelize([
("Joseph", "Maths", 83), ("Joseph", "Physics", 74), ("Joseph", "Chemistry", 91), ("Joseph", "Biology", 82),
("Jimmy", "Maths", 69), ("Jimmy", "Physics", 62), ("Jimmy", "Chemistry", 97), ("Jimmy", "Biology", 80),
("Tina", "Maths", 78), ("Tina", "Physics", 73), ("Tina", "Chemistry", 68), ("Tina", "Biology", 87),
("Thomas", "Maths", 87), ("Thomas", "Physics", 93), ("Thomas", "Chemistry", 91), ("Thomas", "Biology", 74),
("Cory", "Maths", 56), ("Cory", "Physics", 65), ("Cory", "Chemistry", 71), ("Cory", "Biology", 68),
("Jackeline", "Maths", 86), ("Jackeline", "Physics", 62), ("Jackeline", "Chemistry", 75), ("Jackeline", "Biology", 83),
("Juan", "Maths", 63), ("Juan", "Physics", 69), ("Juan", "Chemistry", 64), ("Juan", "Biology", 60)], 3)
# Defining the sequence operation and combiner operation
# Sequence operation: finding the maximum marks within a single partition
def seq_op(accumulator, element):
    if accumulator > element[1]:
        return accumulator
    else:
        return element[1]

# Combiner operation: finding the maximum marks across the partition-wise accumulators
def comb_op(accumulator1, accumulator2):
    if accumulator1 > accumulator2:
        return accumulator1
    else:
        return accumulator2
# Zero Value: Zero value in our case will be 0 as we are finding Maximum Marks
zero_val = 0
aggr_rdd = student_rdd.map(lambda t: (t[0], (t[1], t[2]))).aggregateByKey(zero_val, seq_op, comb_op)
# Check the output
for tpl in aggr_rdd.collect():
print(tpl)
# Output
# (Tina,87)
# (Thomas,93)
# (Jackeline,83)
# (Joseph,91)
# (Juan,69)
# (Jimmy,97)
# (Cory,71)
#####################################################
# Let's Print Subject name along with Maximum Marks #
#####################################################
# Defining the sequence operation and combiner operation
# Sequence operation: finding the (subject, marks) pair with maximum marks within a single partition
def seq_op(accumulator, element):
    if accumulator[1] > element[1]:
        return accumulator
    else:
        return element

# Combiner operation: finding the maximum marks across the partition-wise accumulators
def comb_op(accumulator1, accumulator2):
    if accumulator1[1] > accumulator2[1]:
        return accumulator1
    else:
        return accumulator2
# Zero value: an empty subject name with 0 marks, since we are finding the maximum
zero_val = ('', 0)
aggr_rdd = student_rdd.map(lambda t: (t[0], (t[1], t[2]))).aggregateByKey(zero_val, seq_op, comb_op)
# Check the output
for tpl in aggr_rdd.collect():
print(tpl)
# Output
# ('Thomas', ('Physics', 93))
# ('Tina', ('Biology', 87))
# ('Jimmy', ('Chemistry', 97))
# ('Juan', ('Physics', 69))
# ('Joseph', ('Chemistry', 91))
# ('Cory', ('Chemistry', 71))
# ('Jackeline', ('Maths', 86))
#####################################################################
# Printing over all percentage of all students using aggregateByKey #
#####################################################################
# Defining the sequence operation and combiner operation
# Sequence operation: accumulating a (total marks, subject count) pair within a single partition
def seq_op(accumulator, element):
    return (accumulator[0] + element[1], accumulator[1] + 1)

# Combiner operation: merging the partition-wise (total, count) accumulators
def comb_op(accumulator1, accumulator2):
    return (accumulator1[0] + accumulator2[0], accumulator1[1] + accumulator2[1])

# Zero value: (0, 0), i.e. zero total marks over zero subjects
zero_val = (0, 0)
aggr_rdd = student_rdd.map(lambda t: (t[0], (t[1], t[2]))) \
    .aggregateByKey(zero_val, seq_op, comb_op) \
    .map(lambda t: (t[0], t[1][0] / t[1][1] * 1.0))
# Check the output
for tpl in aggr_rdd.collect():
print(tpl)
# Output
# ('Thomas', 86.25)
# ('Tina', 76.5)
# ('Jimmy', 77.0)
# ('Juan', 64.0)
# ('Joseph', 82.5)
# ('Cory', 65.0)
# ('Jackeline', 76.5)
combineByKey
Create a Combiner
lambda value: (value, 1)
The first required argument in the combineByKey method is a function to be used as the very first aggregation step for each key. The argument of this function corresponds to the value in a key-value pair. If we want to compute the sum and count using combineByKey, then we can create this “combiner” to be a tuple in the form of (sum, count). The very first step in this aggregation is then (value, 1), where value is the first RDD value that combineByKey comes across and 1 initializes the count.
Merge a Value
lambda x, value: (x[0] + value, x[1] + 1)
The next required function tells combineByKey what to do when a combiner is given a new value. The arguments to this function are a combiner and a new value. The structure of the combiner is defined above as a tuple in the form of (sum, count), so we merge the new value by adding it to the first element of the tuple and incrementing the second element of the tuple by 1.
Merge two Combiners
lambda x, y: (x[0] + y[0], x[1] + y[1])
The final required function tells combineByKey how to merge two combiners. In this example, with tuples as combiners in the form of (sum, count), all we need to do is add the first and second elements together.
data = sc.parallelize( [(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)] )
sumCount = data.combineByKey(lambda value: (value, 1),
lambda x, value: (x[0] + value, x[1] + 1),
lambda x, y: (x[0] + y[0], x[1] + y[1]))
averageByKey = sumCount.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))
print(averageByKey.collectAsMap())
combineByKey versus aggregateByKey
combineByKey is more general than aggregateByKey. In fact, aggregateByKey, reduceByKey and groupByKey are implemented with combineByKey. aggregateByKey is similar to reduceByKey, but you can provide an initial value when performing the aggregation.
As the name suggests, aggregateByKey is suitable for computing aggregations by key, for example sum, avg, etc. The rule here is that the extra computation spent on the map-side combine can reduce the size of the data sent out to other nodes and to the driver. If your function satisfies this rule, you should probably use aggregateByKey.
combineByKey is more general, and you have the flexibility to specify whether you'd like to perform a map-side combine. However, it is more complex to use: at minimum, you need to implement three functions: createCombiner, mergeValue, and mergeCombiners.
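For completeness, a rough Java equivalent of the Python average-by-key example above, using the same (sum, count) combiner idea (a sketch, not part of the original notes):
JavaPairRDD<Integer, Double> data = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(0, 2.0), new Tuple2<>(0, 4.0),
        new Tuple2<>(1, 0.0), new Tuple2<>(1, 10.0), new Tuple2<>(1, 20.0)));
// createCombiner, mergeValue, mergeCombiners: the combiner is a (sum, count) tuple
JavaPairRDD<Integer, Tuple2<Double, Integer>> sumCount = data.combineByKey(
        value -> new Tuple2<>(value, 1),
        (acc, value) -> new Tuple2<>(acc._1() + value, acc._2() + 1),
        (acc1, acc2) -> new Tuple2<>(acc1._1() + acc2._1(), acc1._2() + acc2._2()));
JavaPairRDD<Integer, Double> averageByKey = sumCount.mapValues(sumAndCount -> sumAndCount._1() / sumAndCount._2());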
partition
partitionBy: re-partitions a pair RDD using the supplied Partitioner; when the result is persisted, later key-based operations can reuse that partitioning instead of shuffling again.
JavaPairRDD<String, Integer> partitionedWordPairRDD = wordsPairRdd.partitionBy(new HashPartitioner(4));
partitionedWordPairRDD.persist(StorageLevel.DISK_ONLY());
partitionedWordPairRDD.groupByKey().mapToPair(word -> new Tuple2<>(word._1(), getSum(word._2()))).collect();
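getSum is not defined in these notes; a minimal sketch of such a helper, assuming it simply sums the grouped counts:
// Hypothetical helper: sums an Iterable<Integer> of per-word counts
private static Integer getSum(Iterable<Integer> counts) {
    int total = 0;
    for (Integer count : counts) {
        total += count;
    }
    return total;
}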
join operations
join, leftOuterJoin, rightOuterJoin, fullOuterJoin
JavaPairRDD<String, Integer> ages = sc.parallelizePairs(Arrays.asList(new Tuple2<>("Tom", 29), new Tuple2<>("John", 22)));
JavaPairRDD<String, String> addresses = sc.parallelizePairs(Arrays.asList(new Tuple2<>("James", "USA"), new Tuple2<>("John", "UK")));
JavaPairRDD<String, Tuple2<Integer, String>> join = ages.join(addresses);
JavaPairRDD<String, Tuple2<Integer, Optional<String>>> leftOuterJoin = ages.leftOuterJoin(addresses);
JavaPairRDD<String, Tuple2<Optional<Integer>, String>> rightOuterJoin = ages.rightOuterJoin(addresses);
JavaPairRDD<String, Tuple2<Optional<Integer>, Optional<String>>> fullOuterJoin = ages.fullOuterJoin(addresses);
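With this sample data, each join keeps the following keys (Tom only has an age, James only has an address, John has both):
// join:           keys present in both RDDs                                        -> John
// leftOuterJoin:  every key from ages; the address side may be an empty Optional   -> Tom, John
// rightOuterJoin: every key from addresses; the age side may be an empty Optional  -> James, John
// fullOuterJoin:  every key from either RDD; both sides are Optionals              -> Tom, John, James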