Page Rank Algorithm Implementation in the Apache Spark Cloud Computing System: Week 4 Updates

Hi Readers!

Sorry for the super long post from last week. I doubt anyone actually read through it all and that's my fault. So, this time I will try to just review what I did this week and only give a little bit of content. Hopefully it will be a lot shorter and easier to understand this time around.

First off, I am learning the Scala programming language, which runs on the Java Virtual Machine and was designed in part to address some of Java's shortcomings. Although my final program will be written in Java, not Scala, I still need to learn Scala in order to understand Spark's source code and the GraphX documentation, so that I will be able to use Spark properly. This is the Scala tutorial that I am using.

Secondly, I am still reading the book "Mining of Massive Datasets" (see my syllabus for the citation; I will refer to the book as MMDS for short). Here is the link to the online PDF of MMDS if, for whatever reason, you want to read a college textbook on data mining (maybe for a bit of light, 513-page reading).

The big picture that I got out of MMDS this week was about MapReduce (again, please note that some of the description may come directly from the text). MapReduce is a programming system that is central to the new software stack. Implementations of MapReduce let many of the most common calculations on large-scale data run efficiently on computing clusters, and they tolerate hardware failures during the computation through redundant file storage and by dividing computations into tasks.

MapReduce is composed of two parts, Map and Reduce, as the name suggests. The system manages the execution and coordination of the tasks that execute Map or Reduce, and it also deals with the possibility that one of those tasks fails (if you would like more depth about any of this, please leave a comment below and I will elaborate to the best of my ability).

A Map task takes an input consisting of certain elements and produces zero or more key-value pairs. The types of the keys and values are arbitrary, and the "keys" do not have to be unique: a Map task can produce several key-value pairs with the same key, even from the same element. Once all of the Map tasks have completed, the key-value pairs are grouped by key and the values for each key are compiled into a list.

Let's say that r is the number of Reduce tasks. A master controller picks a hash function that applies to keys and produces a bucket number from 0 to r-1. Each key that is output by a Map task is hashed and its key-value pair is put in one of r local files, each file destined for one of the Reduce tasks.
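
To make the bucket idea concrete, here is a minimal sketch in plain Java of how keys could be assigned to one of r buckets. This is only my own illustration, not code from MMDS or from any MapReduce implementation: the class name HashPartitionSketch, the method names, and the choice of String.hashCode() as the hash function are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HashPartitionSketch {
    // Maps a key to a bucket number in the range 0..r-1.
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    // Assumption: String.hashCode() stands in for the hash function the master picks.
    static int bucketFor(String key, int r) {
        return Math.floorMod(key.hashCode(), r);
    }

    // Distributes keys across r buckets, one bucket per Reduce task.
    static List<List<String>> partition(List<String> keys, int r) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < r; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(bucketFor(key, r)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("spark", "graph", "rank", "spark", "page");
        List<List<String>> buckets = partition(keys, 3);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("bucket " + i + ": " + buckets.get(i));
        }
    }
}
```

The important property is that the same key always hashes to the same bucket, so every pair with that key ends up at the same Reduce task.
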
The Reduce function's input is a pair consisting of a key and its associated list of values. The output is a sequence of zero or more key-value pairs. That is an extremely shortened version of what I got out of MMDS.
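
The Map, grouping, and Reduce steps described above can be sketched as a tiny word count that runs on a single machine in plain Java. This is a toy simulation under my own assumptions, not Hadoop or Spark code: the class WordCountSketch and its methods are hypothetical names, and nothing here is distributed or fault tolerant.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map: one input element (a line of text) -> zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Grouping: collect all values that share a key into one list.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: (key, list of values) -> a single (key, sum) pair.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return new SimpleEntry<String, Integer>(key, sum);
    }

    // Runs Map on every line, groups the pairs by key, then runs Reduce per key.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groupByKey(pairs).entrySet()) {
            Map.Entry<String, Integer> result = reduce(group.getKey(), group.getValue());
            counts.put(result.getKey(), result.getValue());
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=2, ran=1, sat=1, the=2}
        System.out.println(wordCount(Arrays.asList("the cat sat", "the cat ran")));
    }
}
```

Notice that the Map step emits the key "the" twice, once per line, which is exactly the "keys do not have to be unique" point from MMDS.
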

Thirdly, I read up on the GraphX documentation. Basically, GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge. Here is some background on graph-parallel computation: by restricting the types of computation that can be expressed and introducing new techniques to partition and distribute graphs, graph-parallel systems can execute graph algorithms orders of magnitude faster than more general data-parallel systems.

However, these restrictions also make it difficult to express many of the important stages of a typical graph-analytics pipeline. The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API.
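
To picture what "a directed multigraph with properties attached to each vertex and edge" means, here is a rough single-machine sketch in plain Java. To be clear, this is not the GraphX API (GraphX is written in Scala and its graphs are distributed); the class PropertyGraphSketch and all of its members are hypothetical names that I made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PropertyGraphSketch<V, E> {
    // One directed edge from srcId to dstId, carrying a property.
    static final class Edge<T> {
        final long srcId;
        final long dstId;
        final T attr;

        Edge(long srcId, long dstId, T attr) {
            this.srcId = srcId;
            this.dstId = dstId;
            this.attr = attr;
        }
    }

    // Vertex id -> vertex property.
    private final Map<Long, V> vertices = new HashMap<>();
    // A list (not a set) so that parallel edges between the same two vertices are allowed.
    private final List<Edge<E>> edges = new ArrayList<>();

    void addVertex(long id, V attr) {
        vertices.put(id, attr);
    }

    void addEdge(long srcId, long dstId, E attr) {
        edges.add(new Edge<E>(srcId, dstId, attr));
    }

    V vertexAttr(long id) {
        return vertices.get(id);
    }

    int edgeCount() {
        return edges.size();
    }

    // Number of edges leaving the given vertex.
    int outDegree(long id) {
        int count = 0;
        for (Edge<E> edge : edges) {
            if (edge.srcId == id) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Vertices are pages (property: a made-up page name); edges are links (property: link text).
        PropertyGraphSketch<String, String> web = new PropertyGraphSketch<>();
        web.addVertex(1L, "home");
        web.addVertex(2L, "about");
        web.addVertex(3L, "contact");
        web.addEdge(1L, 2L, "About us");
        web.addEdge(1L, 2L, "Learn more"); // a second, parallel edge from 1 to 2
        web.addEdge(2L, 3L, "Contact");
        System.out.println("out-degree of home: " + web.outDegree(1L)); // prints out-degree of home: 2
    }
}
```

The two edges from vertex 1 to vertex 2 are what make this a multigraph rather than an ordinary directed graph.
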

Lastly, I need to fix Eclipse. I should have it up and running by the end of this weekend, at which point I plan to begin practicing simple programs in Scala, revising my Java, and plotting out the setup of my final program.

See? This was sort of shorter than last week's post.

Anyway, if you have any questions or comments, please feel free to leave a comment below and I will try to get back to you as soon as possible. The same goes for any clarification of topics or concepts. And if you go to Google Maps right now, Pegman is dressed up as Link from The Legend of Zelda (at least I think that's what it is).

Have a great day!


Friday, March 4, 2016
