Lecture 15: Big Data: Spark

  • PageRank in Spark
    • written in Scala
    • input: a link graph
    • runs the rank computation in parallel, iteratively

  • run the example

  • page1 ends up with the highest rank
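
  • the input file "in" is the link graph, one "fromURL toURL" edge per line,
    whitespace separated; a made-up example of the format (not the lecture's
    actual file) in which page1 collects the most inbound links:

page2 page1
page3 page1
page3 page2
page4 page1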

// read the link graph from the file "in"; each line is one edge
val lines = spark.read.textFile("in").rdd

// res: MapPartitionsRDD[5]
// not actually executed yet! Spark only records a lazy lineage graph

// collect() is an action: it forces execution and returns the contents as an array
lines.collect()

// parse each line into a (from, to) pair
val links1 = lines.map{ s => val parts = s.split("\\s+"); (parts(0), parts(1)) }

// drop duplicate edges
val links2 = links1.distinct()

// group outgoing links by source page: (from, Iterable[to])
val links3 = links2.groupByKey()

// persist to disk? no: cache() keeps the grouped links in memory so each iteration can reuse them
val links4 = links3.cache()
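
// aside on the "persist to disk?" question: cache() is shorthand for
// persist(StorageLevel.MEMORY_ONLY), so the RDD lives in RAM and is recomputed
// if evicted; a sketch (not from the lecture) of allowing spill to disk
// instead, via an explicit storage level:
import org.apache.spark.storage.StorageLevel
val links4alt = links3.persist(StorageLevel.MEMORY_AND_DISK)   // "links4alt" is an illustrative name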

// initialize every page's rank to 1.0
var ranks = links4.mapValues(v => 1.0)

// join: (url, (outgoing urls, current rank))
val jj = links4.join(ranks)

// each page sends rank / (number of outlinks) to every page it links to
val contribs = jj.values.flatMap{ case (urls, rank) => urls.map(url => (url, rank / urls.size)) }

// new rank = 0.15 + 0.85 * (sum of incoming contributions)
ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)

val output = ranks.collect()

output.foreach(tup => println(s"${tup._1} has rank: ${tup._2} ."))
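
  • the session above runs a single rank update; the real PageRank repeats the
    join / contribs / reduceByKey step until the ranks settle; a minimal sketch
    of the loop (10 iterations is an arbitrary choice, not from the lecture):

for (i <- 1 to 10) {
  val contribs = links4.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}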

  • the lineage graph
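    • a sketch reconstructed from the code above (arrows are the transformations):

      lines --map--> links1 --distinct--> links2 --groupByKey--> links3 --cache--> links4
      links4 --mapValues--> ranks
      (links4, ranks) --join--> jj --flatMap--> contribs --reduceByKey / mapValues--> new ranks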

  • input is stored in HDFS (similar to GFS)
  • RDD operations run in parallel, one task per partition
    • read
    • map
  • shuffle
    • expensive: data has to move between workers over the network
    • needed by distinct(), groupByKey(), join(), reduceByKey()
    • these create wide dependencies (see the sketch after this list)
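
  • narrow vs. wide, annotated on calls from the example above (the names
    outDegrees and regrouped are mine, for illustration):

// narrow dependency: each output partition is computed from exactly one
// parent partition, so a lost partition can be recomputed on its own
val outDegrees = links3.mapValues(urls => urls.size)

// wide dependency: each output partition reads from many parent partitions,
// which forces a shuffle; distinct, groupByKey, reduceByKey, and join are wide
val regrouped = links2.groupByKey()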

  • normal case: the computation runs to completion

  • failed worker
    • narrow dependencies: recompute only the lost partitions from their parents
    • wide dependencies are the problem: recomputing a lost partition may need
      data from every partition of the parent RDD

  • checkpoint to HDFS - fault-tolerant storage
    • periodically write an intermediate RDD (e.g. ranks) to HDFS
    • if a worker fails, we still have the lineage graph, so lost partitions
      can be reconstructed from the checkpoint instead of from the raw input
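
  • a minimal checkpointing sketch (the HDFS path is made up; sc is the
    SparkContext that spark-shell provides):

// tell Spark where to write checkpoints; HDFS makes them fault tolerant
sc.setCheckpointDir("hdfs:///spark-checkpoints")

// mark ranks for checkpointing; the write happens the next time an action runs it
ranks.checkpoint()
ranks.count()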