Spark configuration
SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> airports = sc.textFile("in/airports.text");
Creating RDDs
rdd = sc.parallelize([1, 2, 3, 4])
rdd = sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
- or s3n://, hdfs://
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
can also create from:
- JDBC
- Cassandra
- HBase
- Elasticsearch
- JSON, CSV, sequence files, …
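For reference, the same two basic creation paths in the Java API used throughout these notes (a small sketch; the file path is the one from the example above):
// From an in-memory collection
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));
// From a text file (local paths, s3n:// and hdfs:// URIs all work)
JavaRDD<String> textLines = sc.textFile("in/airports.text");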
Transformation and Action
## Transformations
Transformations will return a new RDD
map and filter
- filter
JavaRDD<String> cleanedLines = lines.filter(line -> !line.isEmpty());
- map
JavaRDD<String> URLs = sc.textFile("in/urls.text");
URLs.map(url -> makeHttpRequest(url));
// the return type of the map function is not necessarily the same as its input type
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<Integer> lengths = lines.map(line -> line.length());
airport example
public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> airports = sc.textFile("in/airports.text");
    JavaRDD<String> airportsInUSA = airports.filter(line -> line.split(",")[3].equals("USA"));

    JavaRDD<String> airportsNameAndCityNames = airportsInUSA.map(line -> {
        String[] splits = line.split(",");
        return StringUtils.join(new String[]{splits[1], splits[2]}, ",");
    });

    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text");
}
- flatMap: first map, then flatten
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
System.out.println(entry.getKey() + " : " + entry.getValue());
}
Function types
Function: one input and one output => map and filter
Function2: two inputs and one output => aggregate and reduce
FlatMapFunction: one input, 0 or more outputs => flatMap
Example
lines.filter(new Function<String, Boolean>() {
    public Boolean call(String line) throws Exception {
        return line.startsWith("Friday");
    }
});
lines.filter(line -> line.startsWith("Friday"));
lines.filter(new StartsWithFriday());
static class StartsWithFriday implements Function<String, Boolean> {
public Boolean call(String line) {
return line.startsWith("Friday");
}
}
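The other function interfaces can be used in the same three ways; a minimal sketch (assuming the usual imports from org.apache.spark.api.java.function) showing an explicit Function2 with reduce and a FlatMapFunction with flatMap:
// Function2: two Integer inputs, one Integer output (used by reduce)
Integer sum = sc.parallelize(Arrays.asList(1, 2, 3, 4))
        .reduce(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });
// FlatMapFunction: one String input, zero or more String outputs
JavaRDD<String> tokens = sc.parallelize(Arrays.asList("hello world", "hi"))
        .flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String line) {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });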
set operations
- sample: the sample operation creates a random sample from an RDD.
sample(boolean withReplacement, double fraction)
- distinct: returns the distinct rows from the input RDD. Expensive, since it requires shuffling all the data across partitions to ensure that we receive only one copy of each element.
- union: returns an RDD consisting of the data from both input RDDs; it keeps duplicates.
- intersection: returns only the elements found in both input RDDs and removes all duplicates.
- subtract: returns the elements of the source RDD that are not present in the other RDD.
- cartesian (Cartesian product): returns all possible pairs (a, b) where a is in the source RDD and b is in the other RDD.
JavaRDD<String> aggregatedLogLines = julyFirstLogs.union(augustFirstLogs);
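A minimal sketch of these operations on two small, made-up integer RDDs (the comments show what each call would produce):
JavaRDD<Integer> a = sc.parallelize(Arrays.asList(1, 2, 2, 3));
JavaRDD<Integer> b = sc.parallelize(Arrays.asList(2, 3, 4));
JavaRDD<Integer> sampled = a.sample(false, 0.5);         // random ~50% sample, no replacement
JavaRDD<Integer> unique = a.distinct();                  // 1, 2, 3 (requires a shuffle)
JavaRDD<Integer> combined = a.union(b);                  // 1, 2, 2, 3, 2, 3, 4 (duplicates kept)
JavaRDD<Integer> common = a.intersection(b);             // 2, 3 (deduplicated)
JavaRDD<Integer> onlyInA = a.subtract(b);                // 1
JavaPairRDD<Integer, Integer> allPairs = a.cartesian(b); // every (x, y) with x from a, y from b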
## Actions
- collect: retrieves the entire RDD and returns it to the driver program in the form of a regular collection or value (e.g., a String RDD becomes a list of Strings).
List<String> inputWords = Arrays.asList("spark", "hadoop", "hive", "pig");
JavaRDD<String> wordRDD = sc.parallelize(inputWords);
List<String> words = wordRDD.collect();
for (String word : words) {
System.out.println(word);
}
- count
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
long wordCounts = words.count();
- countByValue: looks at the unique values in each row of the RDD and returns a map of each unique value to its count.
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
System.out.println(entry.getKey() + " : " + entry.getValue());
}
- take: takes n elements from an RDD. It tries to access as few partitions as possible, so it may return a biased collection.
List<String> inputWords = Arrays.asList("spark", "hadoop", "hive", "pig");
JavaRDD<String> wordRDD = sc.parallelize(inputWords);
List<String> words = wordRDD.take(3);
for (String word : words) {
System.out.println(word);
}
- reduce: takes a function that operates on two elements of the type in the input RDD and returns a new element of the same type. Applied repeatedly over the RDD's data, it reduces them to a single value, so we can perform different types of aggregations with it.
List<Integer> inputIntegers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
Integer product = integerRdd.reduce((x, y) -> x * y);
RDD features
RDDs are immutable: they cannot be changed after they are created.
RDDs are distributed: an RDD is broken into multiple pieces called partitions, and these partitions are distributed across the cluster.
RDDs are a deterministic function of their input, so they can be recreated at any time. If a node in the cluster goes down, Spark can recompute the lost parts of the RDDs from the input and pick up where it left off; RDDs are therefore fault tolerant.
lazy evaluation
Lazy evaluation: transformations just build up the computation graph.
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<String> linesWithFriday = lines.filter(line -> line.startsWith("Friday"));
// Spark scans the file only until the first line starting with "Friday" is detected, so it doesn't need to go through the entire file.
String firstLineWithFriday = linesWithFriday.first();
- Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action.
- Rather than thinking of an RDD as containing specific data, it might be better to think of each RDD as consisting of instructions on how to compute the data that we build up through transformations.
- Spark uses lazy evaluation to reduce the number of passes it has to take over our data by grouping operations together.
Transformations return RDDs, whereas actions return some other data type.
// Transformations
JavaRDD<String> linesWithFriday = lines.filter(line -> line.contains("Friday"));
JavaRDD<Integer> lengths = lines.map(line -> line.length());
// Actions
List<String> words = wordRdd.collect();
String firstLine = lines.first();
### Persistence
persist: caches the RDD at the given storage level so that subsequent actions can reuse it instead of recomputing it.
List<Integer> inputIntegers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
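// Cache the RDD in memory so the two actions below (reduce and count) reuse it instead of recomputing it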
integerRdd.persist(StorageLevel.MEMORY_ONLY());
integerRdd.reduce((x, y) -> x * y);
integerRdd.count();
Pair RDD
A pair RDD is a particular type of RDD that stores key-value pairs.
create Pair RDDs
From tuples
Tuple2<Integer, String> tuple = new Tuple2<>(12, "value");
Integer key = tuple._1();
String value = tuple._2();
// list of tuple
public <K, V> JavaPairRDD<K, V> parallelizePairs(List<Tuple2<K, V>> list)
List<Tuple2<String, Integer>> tuple = Arrays.asList(new Tuple2<>("Lily", 23), new Tuple2<>("Jack", 29), new Tuple2<>("Mary", 26));
JavaPairRDD<String, Integer> pairRDD = sc.parallelizePairs(tuple);
pairRDD.coalesce(1).saveAsTextFile("out/pair_rdd_from_tuple_list");
- coalesce(numPartitions): returns a new RDD that is reduced to numPartitions partitions.
- collectAsMap: returns the results of a pair RDD as a Map collection. Since it returns a Map, you only get pairs with unique keys; pairs with duplicate keys are dropped. If you use collect instead, it returns an array of tuples without losing any of your pairs.
From regular RDDs
List<String> inputStrings = Arrays.asList("Lily 23", "Jack 29", "Mary 26");
JavaRDD<String> regularRDDs = sc.parallelize(inputStrings);
JavaPairRDD<String, Integer> pairRDD = regularRDDs.mapToPair(getNameAndAgePair());
private static PairFunction<String, String, Integer> getNameAndAgePair() {
    return (PairFunction<String, String, Integer>) s -> new Tuple2<>(s.split(" ")[0], Integer.valueOf(s.split(" ")[1]));
}
transformations on Pair RDD
- Pair RDDs are allowed to use all the transformations available to regular RDDs, and thus support the same functions as regular RDDs.
- Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements.
filter
JavaRDD<String> airportsRDD = sc.textFile("in/airports.text");
JavaPairRDD<String, String> airportPairRDD = airportsRDD.mapToPair(getAirportNameAndCountryNamePair());
JavaPairRDD<String, String> airportsNotInUSA = airportPairRDD.filter(keyValue -> !keyValue._2().equals("\"United States\""));

private static PairFunction<String, String, String> getAirportNameAndCountryNamePair() {
    return (PairFunction<String, String, String>) line -> new Tuple2<>(line.split(Utils.COMMA_DELIMITER)[1], line.split(Utils.COMMA_DELIMITER)[3]);
}
map and mapValues
mapValues
JavaRDD<String> airportsRDD = sc.textFile("in/airports.text");
JavaPairRDD<String, String> airportPairRDD = airportsRDD.mapToPair(getAirportNameAndCountryNamePair());
JavaPairRDD<String, String> upperCase = airportPairRDD.mapValues(countryName -> countryName.toUpperCase());
private static PairFunction<String, String, String> getAirportNameAndCountryNamePair() {
    return (PairFunction<String, String, String>) line -> new Tuple2<>(line.split(Utils.COMMA_DELIMITER)[1], line.split(Utils.COMMA_DELIMITER)[3]);
}
reduceByKey
- When our dataset is described as key-value pairs, it is quite common to want to aggregate statistics across all elements with the same key.
- We have looked at the reduce action on regular RDDs; there is a similar operation for pair RDDs called reduceByKey.
- reduceByKey runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key.
- Since input datasets can have a huge number of keys, reduceByKey is not implemented as an action that returns a value to the driver program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.
public static void main(String[] args) throws Exception {
Logger.getLogger("org").setLevel(Level.ERROR);
SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
}
Problem: wordCounts might be too large to fit into memory
Solution: reduceByKey
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> wordRdd = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
JavaPairRDD<String, Integer> wordPairRdd = wordRdd.mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1));
JavaPairRDD<String, Integer> wordCounts = wordPairRdd.reduceByKey((Function2<Integer, Integer, Integer>) (x, y) -> x + y);
Map<String, Integer> wordCountsMap = wordCounts.collectAsMap();
JavaRDD<String> lines = sc.textFile("in/RealEstate.csv");
JavaRDD<String> cleanedLines = lines.filter(line -> !line.contains("Bedrooms"));
JavaPairRDD<String, AvgCount> housePricePairRdd = cleanedLines.mapToPair(
line -> new Tuple2<>(line.split(",")[3], new AvgCount(1, Double.parseDouble(line.split(",")[2]))));
JavaPairRDD<String, AvgCount> housePriceTotal = housePricePairRdd.reduceByKey(
(x, y) -> new AvgCount(x.getCount() + y.getCount(), x.getTotal() + y.getTotal()));
JavaPairRDD<String, Double> housePriceAvg = housePriceTotal.mapValues(avgCount -> avgCount.getTotal()/avgCount.getCount());
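The AvgCount class used above is not shown in these notes; a minimal sketch that matches the constructor and getters used here:
// Simple holder for a running (count, total) pair; Serializable so Spark can ship it with the tasks
public class AvgCount implements Serializable {
    private final int count;
    private final double total;

    public AvgCount(int count, double total) {
        this.count = count;
        this.total = total;
    }

    public int getCount() { return count; }
    public double getTotal() { return total; }
}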
groupByKey
JavaPairRDD<String, String> countryAndAirportNamePair =
    lines.mapToPair(line -> new Tuple2<>(line.split(Utils.COMMA_DELIMITER)[3], line.split(Utils.COMMA_DELIMITER)[1]));
JavaPairRDD<String, Iterable<String>> airportsByCountry = countryAndAirportNamePair.groupByKey();
groupByKey versus reduceByKey
For counting, reduceByKey is usually preferred: it combines values within each partition before the shuffle (map-side combine), so it moves much less data than groupByKey followed by a per-key aggregation.
List<String> words = Arrays.asList("one", "two", "two", "three", "three", "three");
JavaPairRDD<String, Integer> wordsPairRdd = sc.parallelize(words).mapToPair(word -> new Tuple2<>(word, 1));
List<Tuple2<String, Integer>> wordCountsWithReduceByKey = wordsPairRdd.reduceByKey((x, y) -> x + y).collect();
List<Tuple2<String, Integer>> wordCountsWithGroupByKey = wordsPairRdd.groupByKey().mapValues(intIterable -> Iterables.size(intIterable)).collect();
sortByKey
- sortByKey in reverse order: wordsPairRdd.sortByKey(false); (the boolean argument is the ascending flag)
JavaRDD<String> lines = sc.textFile("in/RealEstate.csv");
JavaRDD<String> cleanedLines = lines.filter(line -> !line.contains("Bedrooms"));
JavaPairRDD<Integer, AvgCount> housePricePairRdd = cleanedLines.mapToPair(
line -> new Tuple2<>(Integer.valueOf(line.split(",")[3]), new AvgCount(1, Double.parseDouble(line.split(",")[2]))));
JavaPairRDD<Integer, AvgCount> housePriceTotal = housePricePairRdd.reduceByKey(
(x, y) -> new AvgCount(x.getCount() + y.getCount(), x.getTotal() + y.getTotal()));
JavaPairRDD<Integer, Double> housePriceAvg = housePriceTotal.mapValues(avgCount -> avgCount.getTotal()/avgCount.getCount());
JavaPairRDD<Integer, Double> sortedHousePriceAvg = housePriceAvg.sortByKey();
aggregateByKey
Important Points
- Performance-wise, aggregateByKey is an optimized transformation.
- aggregateByKey is a wider transformation.
- We should use aggregateByKey when aggregation is required and the types of the input and output RDDs are different.
- We can use reduceByKey when the input and output RDD types are the same.
# Creating PairRDD student_rdd with key value pairs
student_rdd = sc.parallelize([
("Joseph", "Maths", 83), ("Joseph", "Physics", 74), ("Joseph", "Chemistry", 91), ("Joseph", "Biology", 82),
("Jimmy", "Maths", 69), ("Jimmy", "Physics", 62), ("Jimmy", "Chemistry", 97), ("Jimmy", "Biology", 80),
("Tina", "Maths", 78), ("Tina", "Physics", 73), ("Tina", "Chemistry", 68), ("Tina", "Biology", 87),
("Thomas", "Maths", 87), ("Thomas", "Physics", 93), ("Thomas", "Chemistry", 91), ("Thomas", "Biology", 74),
("Cory", "Maths", 56), ("Cory", "Physics", 65), ("Cory", "Chemistry", 71), ("Cory", "Biology", 68),
("Jackeline", "Maths", 86), ("Jackeline", "Physics", 62), ("Jackeline", "Chemistry", 75), ("Jackeline", "Biology", 83),
("Juan", "Maths", 63), ("Juan", "Physics", 69), ("Juan", "Chemistry", 64), ("Juan", "Biology", 60)], 3)
# Defining the sequence operation and combiner operation
# Sequence operation: finding the maximum marks within a single partition
def seq_op(accumulator, element):
    if accumulator > element[1]:
        return accumulator
    else:
        return element[1]

# Combiner operation: finding the maximum marks across the partition-wise accumulators
def comb_op(accumulator1, accumulator2):
    if accumulator1 > accumulator2:
        return accumulator1
    else:
        return accumulator2
# Zero Value: Zero value in our case will be 0 as we are finding Maximum Marks
zero_val = 0
aggr_rdd = student_rdd.map(lambda t: (t[0], (t[1], t[2]))).aggregateByKey(zero_val, seq_op, comb_op)
# Check the output
for tpl in aggr_rdd.collect():
print(tpl)
# Output
# (Tina,87)
# (Thomas,93)
# (Jackeline,83)
# (Joseph,91)
# (Juan,69)
# (Jimmy,97)
# (Cory,71)
#####################################################
# Let's Print Subject name along with Maximum Marks #
#####################################################
# Defining the sequence operation and combiner operation
# Sequence operation: finding the (subject, marks) pair with maximum marks within a single partition
def seq_op(accumulator, element):
    if accumulator[1] > element[1]:
        return accumulator
    else:
        return element

# Combiner operation: finding the maximum marks across the partition-wise accumulators
def comb_op(accumulator1, accumulator2):
    if accumulator1[1] > accumulator2[1]:
        return accumulator1
    else:
        return accumulator2
# Zero value: an empty subject name with 0 marks, since we are finding the maximum
zero_val = ('', 0)
aggr_rdd = student_rdd.map(lambda t: (t[0], (t[1], t[2]))).aggregateByKey(zero_val, seq_op, comb_op)
# Check the output
for tpl in aggr_rdd.collect():
print(tpl)
# Output
# ('Thomas', ('Physics', 93))
# ('Tina', ('Biology', 87))
# ('Jimmy', ('Chemistry', 97))
# ('Juan', ('Physics', 69))
# ('Joseph', ('Chemistry', 91))
# ('Cory', ('Chemistry', 71))
# ('Jackeline', ('Maths', 86))
#####################################################################
# Printing over all percentage of all students using aggregateByKey #
#####################################################################
# Defining the sequence operation and combiner operation
# Sequence operation: accumulating a (total marks, subject count) pair within a single partition
def seq_op(accumulator, element):
    return (accumulator[0] + element[1], accumulator[1] + 1)

# Combiner operation: merging the partition-wise (total, count) accumulators
def comb_op(accumulator1, accumulator2):
    return (accumulator1[0] + accumulator2[0], accumulator1[1] + accumulator2[1])

# Zero value: (0, 0), i.e. zero total marks over zero subjects
zero_val = (0, 0)
aggr_rdd = student_rdd.map(lambda t: (t[0], (t[1], t[2]))) \
    .aggregateByKey(zero_val, seq_op, comb_op) \
    .map(lambda t: (t[0], t[1][0] / t[1][1] * 1.0))
# Check the output
for tpl in aggr_rdd.collect():
print(tpl)
# Output
# ('Thomas', 86.25)
# ('Tina', 76.5)
# ('Jimmy', 77.0)
# ('Juan', 64.0)
# ('Joseph', 82.5)
# ('Cory', 65.0)
# ('Jackeline', 76.5)
combineByKey
Create a Combiner
lambda value: (value, 1)
The first required argument in the combineByKey method is a function to be used as the very first aggregation step for each key. The argument of this function corresponds to the value in a key-value pair. If we want to compute the sum and count using combineByKey, then we can create this “combiner” to be a tuple in the form of (sum, count). The very first step in this aggregation is then (value, 1), where value is the first RDD value that combineByKey comes across and 1 initializes the count.
Merge a Value
lambda x, value: (x[0] + value, x[1] + 1)
The next required function tells combineByKey what to do when a combiner is given a new value. The arguments to this function are a combiner and a new value. The structure of the combiner is defined above as a tuple in the form of (sum, count), so we merge the new value by adding it to the first element of the tuple and incrementing the second element of the tuple by 1.
Merge two Combiners
lambda x, y: (x[0] + y[0], x[1] + y[1])
The final required function tells combineByKey how to merge two combiners. In this example, with tuples as combiners in the form of (sum, count), all we need to do is add the first and second elements together.
data = sc.parallelize( [(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)] )
sumCount = data.combineByKey(lambda value: (value, 1),
lambda x, value: (x[0] + value, x[1] + 1),
lambda x, y: (x[0] + y[0], x[1] + y[1]))
averageByKey = sumCount.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))
print(averageByKey.collectAsMap())
combineByKey versus aggregateByKey
combineByKey is more general than aggregateByKey. In fact, aggregateByKey, reduceByKey and groupByKey are implemented with combineByKey. aggregateByKey is similar to reduceByKey, but you can provide an initial value when performing the aggregation.
As the name suggests, aggregateByKey is suitable for computing aggregations by key, for example sum, avg, etc. The rule here is that the extra computation spent on the map-side combine can reduce the size of the data sent out to other nodes and to the driver. If your function satisfies this rule, you should probably use aggregateByKey.
combineByKey is more general, and you have the flexibility to specify whether you'd like to perform a map-side combine. However, it is more complex to use: at minimum, you need to implement three functions: createCombiner, mergeValue, and mergeCombiners.
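For completeness, a rough Java equivalent of the Python average-by-key example above, using the same (sum, count) combiner idea (a sketch, not part of the original notes):
JavaPairRDD<Integer, Double> data = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(0, 2.0), new Tuple2<>(0, 4.0),
        new Tuple2<>(1, 0.0), new Tuple2<>(1, 10.0), new Tuple2<>(1, 20.0)));
// createCombiner, mergeValue, mergeCombiners: the combiner is a (sum, count) tuple
JavaPairRDD<Integer, Tuple2<Double, Integer>> sumCount = data.combineByKey(
        value -> new Tuple2<>(value, 1),
        (acc, value) -> new Tuple2<>(acc._1() + value, acc._2() + 1),
        (acc1, acc2) -> new Tuple2<>(acc1._1() + acc2._1(), acc1._2() + acc2._2()));
JavaPairRDD<Integer, Double> averageByKey = sumCount.mapValues(sumAndCount -> sumAndCount._1() / sumAndCount._2());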
partition
partitionBy: re-partitions a pair RDD using the supplied Partitioner; when the result is persisted, later key-based operations can reuse that partitioning instead of shuffling again.
JavaPairRDD<String, Integer> partitionedWordPairRDD = wordsPairRdd.partitionBy(new HashPartitioner(4));
partitionedWordPairRDD.persist(StorageLevel.DISK_ONLY());
partitionedWordPairRDD.groupByKey().mapToPair(word -> new Tuple2<>(word._1(), getSum(word._2()))).collect();
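getSum is not defined in these notes; a minimal sketch of such a helper, assuming it simply sums the grouped counts:
// Hypothetical helper: sums an Iterable<Integer> of per-word counts
private static Integer getSum(Iterable<Integer> counts) {
    int total = 0;
    for (Integer count : counts) {
        total += count;
    }
    return total;
}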
join operations
join, leftOuterJoin, rightOuterJoin, fullOuterJoin
JavaPairRDD<String, Integer> ages = sc.parallelizePairs(Arrays.asList(new Tuple2<>("Tom", 29), new Tuple2<>("John", 22)));
JavaPairRDD<String, String> addresses = sc.parallelizePairs(Arrays.asList(new Tuple2<>("James", "USA"), new Tuple2<>("John", "UK")));
JavaPairRDD<String, Tuple2<Integer, String>> join = ages.join(addresses);
JavaPairRDD<String, Tuple2<Integer, Optional<String>>> leftOuterJoin = ages.leftOuterJoin(addresses);
JavaPairRDD<String, Tuple2<Optional<Integer>, String>> rightOuterJoin = ages.rightOuterJoin(addresses);
JavaPairRDD<String, Tuple2<Optional<Integer>, Optional<String>>> fullOuterJoin = ages.fullOuterJoin(addresses);
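With this sample data, each join keeps the following keys (Tom only has an age, James only has an address, John has both):
// join:           keys present in both RDDs                                        -> John
// leftOuterJoin:  every key from ages; the address side may be an empty Optional   -> Tom, John
// rightOuterJoin: every key from addresses; the age side may be an empty Optional  -> James, John
// fullOuterJoin:  every key from either RDD; both sides are Optionals              -> Tom, John, James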