Sunday, March 27, 2016

Week 7 Updates

Hello Readers,

I am working on debugging the Page Rank program. In addition, I am going to be testing it using data provided by an ASU website, which hosts network datasets for various websites (primarily social media). The datasets come as CSV (comma-separated values) files, which can be opened through programs such as Excel or loaded as plain text files (which is how I will most likely be reading them when testing my program).
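As a rough idea of how one of these datasets can be read into Spark, here is a minimal sketch in Java. It assumes the downloaded file is an edge list with one "source,target" pair per line; the file name edges.csv and the local master setting are just placeholders for testing, not the actual dataset I will use.

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LoadEdgeList {
    public static void main(String[] args) {
        // "local[*]" just runs Spark on the local machine for testing.
        SparkConf conf = new SparkConf().setAppName("LoadEdgeList").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the downloaded dataset as plain text; "edges.csv" is a placeholder name.
        JavaRDD<String> lines = sc.textFile("edges.csv");

        // Assume each line is "sourceNode,targetNode" and turn it into a pair.
        JavaPairRDD<String, String> edges = lines.mapToPair(line -> {
            String[] parts = line.split(",");
            return new Tuple2<>(parts[0].trim(), parts[1].trim());
        });

        System.out.println("Number of edges: " + edges.count());
        sc.stop();
    }
}

From a pair RDD like this, the links can be grouped by source page and fed into the Page Rank program.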

If you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Title Change

Hello Readers,

Based on the focus of my project, I do not think that the current title accurately describes my topic anymore. Therefore I have changed my title and description.

Here is the new title: Page Rank Algorithm Implementation in the Apache Spark Cloud Computing System for use in Big Data Analysis

Here is the new description: As technology becomes a larger part of our everyday lives, the cloud, and therefore cloud computing, becomes essential. Large-scale data processing is an important part of cloud computing, and can be run through frameworks such as Spark. Spark provides libraries that developers use to write a driver program; that driver program runs on Spark and can quickly process large amounts of information. One such program is the Page Rank algorithm, which measures the importance of a web page by assigning it a rank based on the outgoing and incoming links to and from other web pages. This algorithm was made famous by Google, which uses it to order search results by importance.

Sunday, March 20, 2016

Week 6 Updates

Hello Readers,

Sorry for the late post. This is for last week. I'll just give a very brief overview of what I did.

I wrote a naive implementation of the Page Rank algorithm as a Java program utilizing Spark's libraries. I plan to test it using a small set of sample web pages, and I will run it on Amazon Web Services (AWS) EC2 servers. I made an AWS account recently; basically, it allows you to use virtual servers for development, computation, storage, and networking. As long as you stay under 750 hours/month, it is free for the first year of use.
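For anyone curious what such a naive implementation looks like, here is a minimal sketch along the lines of the standard Spark Page Rank example (not necessarily my exact program). The input format (one "source target" pair per line), the damping factor of 0.85, and the fixed number of iterations are all assumptions for illustration.

import java.util.ArrayList;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class NaivePageRank {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("NaivePageRank");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each input line is assumed to look like "sourcePage targetPage".
        JavaRDD<String> lines = sc.textFile(args[0]);

        // Link structure: page -> list of pages it links to.
        JavaPairRDD<String, Iterable<String>> links = lines
                .mapToPair(line -> {
                    String[] parts = line.split("\\s+");
                    return new Tuple2<>(parts[0], parts[1]);
                })
                .distinct()
                .groupByKey()
                .cache();

        // Every page starts with a rank of 1.0.
        JavaPairRDD<String, Double> ranks = links.mapValues(v -> 1.0);

        int iterations = Integer.parseInt(args[1]);
        for (int i = 0; i < iterations; i++) {
            // Each page splits its current rank evenly among its outgoing links.
            JavaPairRDD<String, Double> contribs = links.join(ranks)
                    .values()
                    .flatMapToPair(pair -> {
                        List<Tuple2<String, Double>> out = new ArrayList<>();
                        int outDegree = 0;
                        for (String neighbor : pair._1()) outDegree++;
                        for (String neighbor : pair._1()) {
                            out.add(new Tuple2<>(neighbor, pair._2() / outDegree));
                        }
                        return out; // newer Spark versions expect out.iterator() here
                    });

            // New rank = 0.15 + 0.85 * (sum of the contributions a page receives).
            ranks = contribs.reduceByKey((a, b) -> a + b)
                            .mapValues(sum -> 0.15 + 0.85 * sum);
        }

        for (Tuple2<String, Double> page : ranks.collect()) {
            System.out.println(page._1() + " has rank " + page._2());
        }
        sc.stop();
    }
}

You would run it with spark-submit, passing the edge-list file and the number of iterations as arguments. Note that this naive version simply ignores dangling pages (pages with no outgoing links), which is one of the things a more careful implementation has to handle.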

I also wrote a naive implementation of a simple word count program in Java that also utilizes Spark, but this was more for getting familiar with Spark in Java than it was for the actual project. The word count program is Spark's 'Hello World' program.
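For reference, the word count program is only a few lines when written against Spark's Java API. This is a generic sketch of it rather than my exact file; the input path comes from the command line.

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);

        // Split each line into words, pair each word with a count of 1,
        // then add up the counts for each word.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+"))) // newer Spark versions expect an Iterator here
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        for (Tuple2<String, Integer> entry : counts.collect()) {
            System.out.println(entry._1() + ": " + entry._2());
        }
        sc.stop();
    }
}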

Finally, I worked on resolving some issues within my VirtualBox setup that involved importing an AWS security key onto my local disk, so that I will be able to launch EC2 server instances from a program.

Anyway, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Tuesday, March 8, 2016

Week 5 Quick Updates

I just wanted to let you all know that I fixed Eclipse (with some help).
Here is the error that was listed when trying to open Eclipse, found in the configuration log inside the Eclipse folder.
This is the important part: java.lang.UnsatisfiedLinkError: Cannot load 64-bit SWT libraries on 32-bit JVM. This is saying that the system cannot load the Eclipse Standard Widget Toolkit (SWT) libraries, which in this case are 64-bit, on a 32-bit Java Virtual Machine. The bitness of the JVM has to match the bitness of Eclipse's SWT libraries: a 32-bit JVM cannot load 64-bit native libraries, and vice versa. I have both a 32-bit and a 64-bit JVM installed, since sometimes you intentionally use a 32-bit JVM for developing certain programs. The catch is that installing other Java-based products may change your path, which can result in a different Java VM being used the next time you launch Eclipse. I'm pretty sure that in my case, the "other Java-based product" was Android Studio.
To fix this issue, you need to fix the target JVM in Eclipse.
To create a Windows shortcut to an installed Eclipse:
1. Navigate to eclipse.exe in Windows Explorer and use Create Shortcut on the context menu.
2. Select the shortcut and edit its Properties. In the Target: field append the command line arguments.
This is a typical command line argument: eclipse -vm c:\jdk6u22\jre\bin\javaw
The following is what I changed my target to (the -vm portion at the end is what I added):

E:\Programs\EclipseForAndroid\adt-bundle-windows-x86_64-20140702\eclipse\eclipse.exe -vm "C:\Program Files\Java\jdk1.7.0_05\bin"

Note the quotes around C:\ ... bin. This is because there is a space between the words Program and Files in the folder name "Program Files". Since that is a standard Windows folder, you should not just rename it; instead, add quotes around the intended command line path. If the quotes are not added, then the system throws an error that "C:\Program" was not found.
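As a side note, another way to do the same thing (instead of editing the shortcut) is to add a -vm entry to the eclipse.ini file that sits next to eclipse.exe. I did not go this route, so treat it as an untested alternative; the JDK path below is just the one from my target above, and it has to go on its own line right after -vm, before the -vmargs line:

-vm
C:\Program Files\Java\jdk1.7.0_05\bin\javaw.exe
-vmargs
...

One nice thing about eclipse.ini is that you do not need the quotes there, even with the space in "Program Files", because each line is read as a single entry.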

In addition, I downloaded the newest version of Eclipse, Mars. I still have Juno to use for other projects; however, I installed Eclipse Mars because I wanted to install the Scala IDE, and its newest release (4.3.0) supports only Eclipse 4.4 and 4.5 (Luna and Mars, respectively). For information on downloading Scala as a plug-in for Eclipse, please refer to the following links: Getting Started and Download. The two links provide a step-by-step explanation of how to download Scala.

I have also begun to write some basic Scala programs (where I had similar Java programs, I attempted to translate them into Scala). So far it is going OK, but certain declarations that were easy in Java are different, and somewhat more complicated, in Scala.

Friday, March 4, 2016

Week 4 Updates

Hi Readers!

Sorry for the super long post from last week. I doubt anyone actually read through it all and that's my fault. So, this time I will try to just review what I did this week and only give a little bit of content. Hopefully it will be a lot shorter and easier to understand this time around.

First off, I am learning the Scala programming language, which is rooted in Java and was created to address some of Java's shortcomings. Although my final program will be written in Java, not Scala, I will still need to learn Scala in order to understand Spark's code and the GraphX documentation, so that I will be able to properly implement Spark. This is the Scala tutorial that I am using.

Secondly, I am still reading the book "Mining of Massive Datasets" (see my syllabus for the citation; I will refer to the book as MMDS for short). Here is the link to the online PDF of MMDS if, for whatever reason, you want to read a college textbook on data mining (maybe for a bit of light 513-page reading). The big picture that I got out of MMDS this week was about MapReduce (again, please note that some of the description may come directly out of the text). MapReduce is a programming system that is central to the new software stack. Implementations of MapReduce enable many of the most common calculations on large-scale data to be performed on computing clusters efficiently, and in a way that is tolerant of hardware failures during the computation; this tolerance comes from redundant file storage and from dividing computations into tasks. MapReduce is composed of two parts, Map and Reduce, as the name suggests. The system manages the execution and coordination of the tasks that run Map or Reduce, and it also deals with the possibility of failure during execution (if you would like more depth about any of this, please leave a comment below and I will elaborate to the best of my ability).

A Map task takes an input consisting of certain elements and produces some number of key-value pairs. The types of keys and values are each arbitrary, and the keys themselves do not have to be unique: a Map task can produce several key-value pairs with the same key, even from the same element. As soon as the Map tasks are complete, the key-value pairs are grouped by key and the values for each key are compiled into a list. Let's say that r is the number of Reduce tasks. A Master controller picks a hash function that applies to keys and produces a bucket number from 0 to r-1. Each key output by a Map task is hashed, and its key-value pair is put into one of r local files, each file destined for one of the Reduce tasks. The Reduce function's input is a pair consisting of a key and its associated list of values, and its output is a sequence of some number of key-value pairs. That is an extremely shortened version of what I got out of MMDS.
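To make the Map and Reduce steps a bit more concrete, here is a tiny, single-machine sketch of the flow described above, using word counting as the example. This is only an illustration of the idea from MMDS, not a real MapReduce implementation; an actual system distributes the Map and Reduce tasks across a cluster and hashes keys into r buckets, one per Reduce task.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {

    // Map task: takes one input element (here, a line of text) and emits key-value pairs.
    // The key is a word and the value is the count 1; keys do not have to be unique.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce task: takes a key plus the list of all values grouped under that key
    // and emits the combined result (here, the total count for the word).
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new AbstractMap.SimpleEntry<>(key, sum);
    }

    public static void main(String[] args) {
        String[] input = {"the cat sat", "the cat ran"};

        // Grouping step: collect every value emitted by the Map tasks under its key.
        // (A real system hashes each key into one of r buckets, one per Reduce task.)
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Run a Reduce task for each key and print the results, e.g. "the=2".
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(reduce(entry.getKey(), entry.getValue()));
        }
    }
}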

Thirdly, I read up on the GraphX Documentation. Basically, GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge. Some background on graph-parallel computation: graph-parallel systems restrict the types of computation that can be expressed and introduce new techniques to partition and distribute graphs. In this way, graph-parallel systems can execute graph algorithms orders of magnitude faster than more general data-parallel systems.
However, graph-parallel systems also have certain shortcomings that make it difficult to express many of the important stages of a typical graph-analytics pipeline. The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API.

Lastly, I need to fix Eclipse. I should have it up and running by the end of this weekend, at which point I plan to begin practicing simple programs in Scala, revising my Java, and plotting out the setup of my final program.

See? This was sort of shorter than last week's post. 

Anyway, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts. And if you go to Google Maps right now, Peg Man is dressed up as Link from the Legend of Zelda, at least I think that's what it is.

Have a great day!