Monday, April 25, 2016

Week 11 Concluding Post

Hello Readers,

Next week is my presentation, which means that I have reached the end of my Senior Research Project and, therefore, the end of updating this blog. Although I may post a couple of additional updates, this is most likely the final post. My program is finalized and works well, with no issues as far as I can tell.

Thank you all for reading. If you have any questions about anything, just comment on this post. If your question is about a specific post, reference the title of that post in your comment.

Thanks for reading and have a great day!


Tuesday, April 12, 2016

Week 10 Quick Updates

Hello Readers,

I believe that I figured out the problem in the program. After more testing with the edited algorithm from the USC paper and with a different, smaller dataset, the output ranks align much better with what I expect. Now, the greatest rank is around 0.003 or so, and the total of all the ranks is about 0.9. The smaller dataset has only about 10,000 nodes, just an eighth of the other dataset that I was testing with. I switched to this smaller one because all of its nodes would load into the program, while the larger dataset would be truncated. The smaller one performed well when run through the program: it looks as though all nodes were loaded and ranked. I also reduced the number of iterations; it appears that as you increase the number of iterations, the total rank decreases.

I tried to do the same with the larger dataset, to see whether increasing the number of iterations would have any effect on the output ranks. However, I ran into an issue. The most iterations that I had run so far was 100, so I tried running 300 iterations. It ran for about 20 minutes, then crashed with a stack overflow error (java.lang.StackOverflowError). This happens when the call stack pointer exceeds the stack bound: the call stack consists of a limited amount of predefined space, and if a program attempts to access memory beyond the call stack's bounds, the program crashes. This is usually due to infinite recursion (a function calling itself so many times that the space needed to store its variables and bookkeeping information exceeds the stack limit). However, since my program has worked successfully with lower iteration counts, it was not infinite recursion. More likely it was very deep recursion: a recursive function that will, in theory, terminate, but requires too much stack space to execute completely. For the sake of being thorough, I will mention the third major cause of stack overflow errors: very large stack variables, i.e., an attempt to allocate more memory than will fit on the stack. For example, if you declare an array with some huge size, the program may run into a stack overflow error.

Anyway, back to the stack overflow in my case. I tried again with 250 iterations, then with 200, and once more with 150, but calculating the PageRank iteratively for 80,000 nodes 150 times requires a lot of space. In any case, I think I solved my problem, but if I want to try it out on the large dataset, I will need to increase the heap size of the JVM (Java Virtual Machine).
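
For reference, this is roughly how I would raise those limits from the command line when I get to it (the jar and class names here are just placeholders, and the sizes are untuned guesses, not tested values):

    java -Xmx4g -Xss64m -cp pagerank.jar PageRank large_dataset.txt 300

-Xmx sets the maximum heap size (4 GB here) and -Xss sets the per-thread stack size (64 MB here), so a deeply recursive run gets more room before it overflows.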

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Monday, April 11, 2016

Week 9 Updates

Hello Readers,

Last week was just more debugging. There is something wrong, I think, with the algorithm that I have implemented in the program. The problem, again, is that some of the ranks it outputs are greater than one, the largest around 60. The ranks are supposed to sum to one altogether, and with close to 80,000 nodes, each rank should be very small. I found a research paper from USC (the University of Southern California) with a PageRank implementation that is different from the algorithms I have seen in other papers. I tried that one out and it seems to work, but I need to conduct further testing. The ranks are now greatly reduced, but they don't seem to add up to 1; instead, they total around 0.5.
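
To make the "sum to one" property concrete, here is a minimal plain-Java sketch of the textbook iterative update (not the exact USC variant; outLinks is a hypothetical adjacency list mapping each node to the nodes it points to, and I am assuming every node has at least one out-link):

    import java.util.*;

    public class PageRankSketch {
        // ranks[j] converges to (1 - d)/n + d * (sum over i -> j of ranks[i] / outDegree(i)).
        // With no dangling nodes, the ranks sum to 1 after every iteration.
        static double[] pageRank(Map<Integer, List<Integer>> outLinks, int n, int iterations) {
            final double d = 0.85;               // the usual damping factor
            double[] ranks = new double[n];
            Arrays.fill(ranks, 1.0 / n);         // start uniform, summing to 1
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                Arrays.fill(next, (1 - d) / n);  // the "teleport" term
                for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
                    List<Integer> targets = e.getValue();
                    double share = d * ranks[e.getKey()] / targets.size();
                    for (int j : targets) next[j] += share;  // each out-link gets an equal share
                }
                ranks = next;  // a loop, not recursion, so there is no deep call stack
            }
            return ranks;
        }
    }

If the ranks come out totaling 0.5 instead of 1, my first guess would be that some rank mass is leaking somewhere (for example, at nodes with no out-links), but I still need to verify that against the USC formulation.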

Besides that, I am just working on my presentation as the end of the research project approaches (in early May, so I have about three more weeks).

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Monday, April 4, 2016

Week 8 Updates

Hello Readers,

Sorry for the late post. This post is for last week.

Not too much to add from last week. I finished debugging the PageRank program, so it outputs a massive list of the ranks for each node and finds the top ten. I also added code for outputting the top 10 nodes by degree (the number of neighbors a node has). I added this to determine whether there is a relationship between rank and number of neighbors, i.e., whether the node with the highest rank also has the highest degree. I completed this code and it worked; however, I found another bug that I think is external. The file I have been testing with should have 80,000 nodes, but Excel can only display about a million rows (1,048,576, to be exact). That may sound like plenty, but it isn't, because each node can have several thousand neighbors, so the entire CSV file cannot load. I tried to load the CSV file in WordPad, and it works after loading for several minutes; but when I run the file through my program, it outputs nodes with ranks greater than one, the largest being a rank of 30, which violates the algorithm. All ranks are supposed to be less than one and sum to a total of one. I may end up rewriting the algorithm and using a different method for determining PageRank in my code.
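
For anyone curious, the degree-ranking code boils down to something like this (a simplified sketch, not my literal code; degrees is a hypothetical map from node id to neighbor count):

    import java.util.*;

    public class TopDegrees {
        // Return the ten node ids with the largest degree, largest first.
        static List<Integer> topTen(Map<Integer, Integer> degrees) {
            // Min-heap ordered by degree, capped at size 10.
            PriorityQueue<Map.Entry<Integer, Integer>> heap =
                new PriorityQueue<>(Comparator.comparingInt((Map.Entry<Integer, Integer> e) -> e.getValue()));
            for (Map.Entry<Integer, Integer> e : degrees.entrySet()) {
                heap.offer(e);
                if (heap.size() > 10) heap.poll();  // evict the current smallest
            }
            List<Integer> top = new ArrayList<>();
            while (!heap.isEmpty()) top.add(heap.poll().getKey());
            Collections.reverse(top);  // the heap pops smallest first, so reverse
            return top;
        }
    }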

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Sunday, March 27, 2016

Week 7 Updates

Hello Readers,

I am working on debugging the PageRank program. In addition, I am going to be testing it using data provided by an ASU website, which provides network datasets for various websites (primarily social media). They load as CSV (comma-separated values) files, which can be opened through programs such as Excel or loaded as text files (which is most likely how I will load them when testing my program).
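
As a rough sketch, loading an edge list like that in Java looks something like the following (this assumes each line is simply "source,target", which is the shape of the files I have seen; the path is a placeholder):

    import java.io.*;
    import java.util.*;

    public class EdgeListLoader {
        // Build an adjacency list from a "source,target" CSV edge list.
        static Map<Integer, List<Integer>> load(String path) throws IOException {
            Map<Integer, List<Integer>> adj = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split(",");
                    int src = Integer.parseInt(parts[0].trim());
                    int dst = Integer.parseInt(parts[1].trim());
                    adj.computeIfAbsent(src, k -> new ArrayList<>()).add(dst);
                }
            }
            return adj;
        }
    }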

If you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Title Change

Hello Readers,

Based on the focus of my project, I do not think that the current title accurately describes my topic anymore. Therefore, I have changed my title and description.

Here is the new title: Page Rank Algorithm Implementation in the Apache Spark Cloud Computing System for use in Big Data Analysis

Here is the new description: As technology becomes a larger part of our everyday lives, the cloud, and therefore cloud computing, becomes essential. Large-scale data processing is an important part of cloud computing and can be run through programs such as Spark. Spark is a shell that allows developers to write a program that utilizes Spark libraries; this allows the driver program to be run on Spark and quickly process large amounts of information. One such program is the PageRank algorithm, which analyzes the importance of a web page by assigning it a rank based on the outgoing and incoming links to and from other webpages. The algorithm was made famous by Google and is used to order listed search results by importance.

Sunday, March 20, 2016

Week 6 Updates

Hello Readers,

Sorry for the late post. This is for last week. I'll just give a very brief overview of what I did.

I wrote a naive implementation of the PageRank algorithm in a Java program utilizing Spark's libraries. I plan to test it using a small set of sample web pages, and I will run it on Amazon Web Services (AWS) EC2 servers. I made an AWS account recently; basically, it allows you to use virtual servers for development, computation, storage, and networking. As long as you stay under 750 hours/month, it is free for the first year of use.
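
I won't reproduce my exact program here, but the shape of a naive Spark PageRank in Java is roughly the following, modeled on the example bundled with Spark (Spark 1.x Java API; the input format of "source target" per line and the fixed 10 iterations are placeholder choices):

    import java.util.*;
    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.*;

    public class NaivePageRank {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("NaivePageRank"));

            // Each input line is "sourcePage targetPage".
            JavaPairRDD<String, Iterable<String>> links = sc.textFile(args[0])
                .mapToPair(s -> {
                    String[] parts = s.split("\\s+");
                    return new Tuple2<>(parts[0], parts[1]);
                })
                .distinct()
                .groupByKey()
                .cache();

            // Classic formulation: every page starts with rank 1.0.
            JavaPairRDD<String, Double> ranks = links.mapValues(v -> 1.0);

            for (int i = 0; i < 10; i++) {
                // Each page sends rank / outDegree to each of its neighbors.
                JavaPairRDD<String, Double> contribs = links.join(ranks).values()
                    .flatMapToPair(pair -> {
                        int outDegree = 0;
                        for (String ignored : pair._1()) outDegree++;
                        List<Tuple2<String, Double>> out = new ArrayList<>();
                        for (String neighbor : pair._1()) {
                            out.add(new Tuple2<>(neighbor, pair._2() / outDegree));
                        }
                        return out;  // Spark 1.x flatMap functions return an Iterable
                    });
                // Damping: newRank = 0.15 + 0.85 * (sum of contributions).
                ranks = contribs.reduceByKey((a, b) -> a + b)
                                .mapValues(sum -> 0.15 + 0.85 * sum);
            }

            for (Tuple2<String, Double> t : ranks.collect()) {
                System.out.println(t._1() + ": " + t._2());
            }
            sc.stop();
        }
    }

Note that this classic version produces ranks that sum to (roughly) the number of pages rather than to one, unless you normalize at the end.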

I also wrote a naive implementation of a simple word count program in Java that also utilizes Spark, but this was more for getting familiar with Spark in Java than it was for the actual project. The word count program is Spark's 'Hello World' program.
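
For completeness, Spark's "Hello World" looks roughly like this in Java (again a sketch against the Spark 1.x API; input and output paths are placeholders):

    import java.util.Arrays;
    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.*;

    public class WordCount {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("WordCount"));

            JavaPairRDD<String, Integer> counts = sc.textFile(args[0])
                .flatMap(line -> Arrays.asList(line.split("\\s+")))  // Spark 1.x: returns an Iterable
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

            counts.saveAsTextFile(args[1]);
            sc.stop();
        }
    }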

Finally, I worked on resolving certain issues within my VirtualBox setup that pertained to importing an AWS security key onto my local disk, in order to be able to run instances of EC2 servers from a program.

Anyway, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Tuesday, March 8, 2016

Week 5 Quick Updates

I just wanted to let you all know that I fixed Eclipse (with some help).
Here is the error that was listed when trying to open Eclipse, found in the configuration file of the Eclipse folder.
This is the important part: java.lang.UnsatisfiedLinkError: Cannot load 64-bit SWT libraries on 32-bit JVM. This is basically saying that the system cannot load the 64-bit Eclipse Standard Widget Toolkit (SWT) on a 32-bit Java Virtual Machine. When Eclipse is downloaded, both a 32-bit and a 64-bit JVM are downloaded. A 64-bit JVM is backward compatible with 32-bit, but not vice versa, and sometimes you intentionally use a 32-bit JVM for developing certain programs; that is why Eclipse downloads both. Installing other Java-based products can sometimes change your path, which can result in a different Java VM being used the next time you launch Eclipse. I'm pretty sure that in my case, the "other Java-based product" was Android Studio.
To fix this issue, you need to fix the target JVM in Eclipse.
To create a Windows shortcut to an installed Eclipse:
1. Navigate to eclipse.exe in Windows Explorer and use Create Shortcut on the context menu.
2. Select the shortcut and edit its Properties. In the Target: field append the command line arguments.
This is a typical command line argument: eclipse -vm c:\jdk6u22\jre\bin\javaw
The following is what I changed my target to (the highlighted portion is what I added):

E:\Programs\EclipseForAndroid\adt-bundle-windows-x86_64-20140702\eclipse\eclipse.exe -vm "C:\Program Files\Java\jdk1.7.0_05\bin"

Note the quotes around C:\ ... bin. They are needed because there is a space in the folder name "Program Files". As this is an auto-generated path, you should not rename the folder; instead, add quotes around the intended command line. If the quotes are not added, the system throws an error that "C:\Program" was not found.

In addition, I downloaded the newest version of Eclipse, Mars. I still have Juno to use for other projects; however, I installed Eclipse Mars because I wanted to install the Scala IDE and the newest IDE (4.3.0 Release) supports only Eclipse 4.4 and 4.5 (Luna and Mars, respectively). For information on downloading Scala as a plug-in for Eclipse, please refer to the following links: Getting Started and Download. The two links provide a step-by-step explanation of how to download Scala.

I have also begun to write some basic Scala programs (I attempted to translate similar Java programs into Scala). So far it is going OK, but certain declarations that were easy in Java are different, and somewhat more complicated, in Scala.
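
To give one small example of what I mean, here are a few basic Java declarations with their Scala equivalents as comments (the Scala is from the tutorial I am following, so take it as illustrative):

    public class Declarations {
        public static void main(String[] args) {
            int count = 5;                // Scala: var count = 5   (or, with an explicit type: var count: Int = 5)
            final String name = "Spark";  // Scala: val name = "Spark"   (val means immutable)
            int[] nums = {1, 2, 3};       // Scala: val nums = Array(1, 2, 3)
            System.out.println(name + " " + count);
        }
    }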

Friday, March 4, 2016

Week 4 Updates

Hi Readers!

Sorry for the super long post from last week. I doubt anyone actually read through it all and that's my fault. So, this time I will try to just review what I did this week and only give a little bit of content. Hopefully it will be a lot shorter and easier to understand this time around.

First off, I am learning the Scala programming language, which is rooted in Java and was created to address the shortcomings of Java. Although my final program will be written in Java, not Scala, I will still need to learn Scala in order to understand Spark's code and the GraphX documentation, so that I will be able to properly implement Spark. This is the Scala tutorial that I am using.

Secondly, I am still reading the book "Mining of Massive Datasets" (see my syllabus for the citation; I will refer to the book as MMDS for short). Here is the link to the online PDF of MMDS if, for whatever reason, you want to read a college textbook on data mining (maybe for a bit of light, 513-page reading). The big picture that I got out of MMDS this week was about MapReduce (again, please note that some descriptions may come directly out of the text). MapReduce is a programming system that is central to the new software stack. Implementations of MapReduce enable many of the most common calculations on large-scale data to be performed on computing clusters efficiently and in a way that is tolerant of hardware failures during the computation; they achieve this fault tolerance through redundant file storage and division of computations into tasks. MapReduce is composed of two parts, Map and Reduce, as the name suggests. The system manages the execution and coordination of the tasks that run Map or Reduce, and it also deals with the possibility of failure during execution (if you would like more depth on any of this, please leave a comment below and I will elaborate to the best of my ability).

A Map task takes an input consisting of certain elements and produces some number of key-value pairs. The types of keys and values are each arbitrary, and the keys themselves do not have to be unique: a Map task can produce several key-value pairs with the same key, even from the same element. As soon as each Map task is complete, the key-value pairs are grouped by key and the values are compiled into a list. Let's say that r is the number of Reduce tasks. A master controller picks a hash function that applies to keys and produces a bucket number from 0 to r-1. Each key output by a Map task is hashed, and its key-value pair is put into one of r local files, each file destined for one of the Reduce tasks. The Reduce function's input is a pair consisting of a key and its associated list of values; its output is a sequence of some number of key-value pairs. That is an extremely shortened version of what I got out of MMDS, with a small sketch below.
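
As a tiny illustration of the Map and Reduce signatures described above, using word count (plain Java with made-up types, not any real framework's API):

    import java.util.*;

    public class MapReduceSketch {
        // Map: one input element -> any number of key-value pairs (keys need not be unique).
        static List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : line.split("\\s+")) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
            return pairs;
        }

        // The master controller hashes each key into one of r buckets, one per Reduce task.
        static int bucketFor(String key, int r) {
            return Math.floorMod(key.hashCode(), r);
        }

        // Reduce: a key plus the list of all values grouped under it -> output pairs.
        static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
            int sum = 0;
            for (int v : values) sum += v;
            return new AbstractMap.SimpleEntry<>(key, sum);
        }
    }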

Thirdly, I read up on the GraphX documentation. Basically, GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge. Some background on graph-parallel computation: graph-parallel systems restrict the types of computation that can be expressed and introduce new techniques to partition and distribute graphs. In this way, graph-parallel systems can execute graph algorithms orders of magnitude faster than more general data-parallel systems.
However, graph-parallel systems also have certain shortcomings that make it difficult to express many of the important stages of a typical graph-analytics pipeline. The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API.

Lastly, I need to fix Eclipse. I should have it up and running by the end of this weekend. At which point, I plan to begin practicing simple programs in Scala and revising my Java and also begin plotting out the setup of my final program.

See? This was sort of shorter than last week's post. 

Anyway, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts. And if you go to Google Maps right now, Peg Man is dressed up as Link from the Legend of Zelda, at least I think that's what it is.

Have a great day!

Friday, February 26, 2016

Eclipse Issues

Hello again readers!

I am also having an issue with Eclipse, an integrated development environment (IDE) that is used for programming Java. When I click on the application, it says that it is loading, then stops loading without ever opening. I'm not sure what the issue is, but I think that it may be related to my recent download of Android Studio (an IDE for programming Android applications). I feel like the Java Development Kit (JDK) was disconnected from Eclipse and redirected to Android Studio, but I am not really sure. This will probably go over most everyone's head, but if you have any ideas as to how to fix this, I would appreciate any comments.

Week 3 Updates

Hello Readers!

This week, I completed reading the articles that I mentioned in my post "Week 2 Updates," and I continued reading the 3 books that were mentioned in the previous post. I finished reading the following articles: "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" and "Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality." I continued reading the books on Java and C++, but I focused primarily on "Mining of Massive Datasets." For citations of each text, please see the tab on the top crossbar labeled "Syllabus."

Here's a quick overview of what I learned this week. (Note: some definitions and phrases may be taken directly from the article/text.)

1) "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs enable efficient data reuse in a broad range of applications. They are motivated by the inefficiencies of current computing frameworks in two types of applications: iterative applications and interactive data mining tools. In both cases, keeping data in memory can improve performance by a factor of 10. Fault tolerance is the property that enables a system to continue operating properly in the event of a failure within one or more of its systems. To achieve fault-tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. Coarse-grained transformations are desirable, in this case, over fine-grained updates because a fine-grained update will update one record in a database at a time, while a coarse update will use a general functional operator to update the entire database. Although a coarse update is not as flexible as a fine update, it comes with the benefit of working around the cost of updating billions of records one at a time. With fine grained updates, you cannot recompute because the cost of saving the updates could potentially be as much as the cost of saving the data itself. When I say cost, I don't necessarily mean monetary cost. Here, cost can also mean additional time or data or server capacity. Additionally, the researcher show that RDDs are expressive enough to capture a wide class of computations. They then go on to demonstrate the efficiency of RDDs through their implementation on the Spark system, which was evaluated through a variety of user applications and benchmarks.

2) "Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality." This article had a lot of math, proofs and lemmas, that I did not quite understand, so instead, I attempted to take away the concepts that the researchers were demonstrating through the math. MapReduce/Hadoop is a simple, yet powerful, framework for processing large-scale datasets in a distributed and parallel fashion. A production MapReduce cluster may even consist of tens of thousands of machines. The stored data are typically organized on distributed file systems, which divide a large dataset into chunks, and store multiple replicas of each chunk on different machines. A data processing request under the MapReduce framework, called a job, consists of two types of tasks: map and reduce. In assigning map tasks, a critical consideration is to place map tasks on or close to machines that store the input data chunks, a problem called data locality. For each task, we call a machine a local machine for the task if the data chunk associated with the task is stored locally, and we call this task a local task on the machine; otherwise, the machine is called a remote machine for the task and correspondingly this task is called a remote task on the machine. The term locality is also used to refer to the fraction of tasks that run on local machines. Improving locality can reduce both processing time and network traffic load. But assigning all task to local machines can lead to an uneven distribution of tasks among machines. So, there is a need to strike a balance between data locality and load balancing in MapReduce.

3) "Mining of Massive Datasets." I am currently on the second chapter of this book, so I will give a brief summary of the terms and concepts covered in the first chapter.
What is Data Mining?
Data mining is the discovery of "models" for data. Statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn. Data mining is not synonymous with machine learning: although the two share some similarities, there are instances in which machine learning is not a useful approach to mining data. More recently, computer scientists have viewed data mining as an algorithmic problem, where the model of the data is simply the answer to a complex query about it.
Summarization
One of the most interesting forms of summarization is the PageRank idea, which made Google successful. In this form of Web mining, the entire complex structure of the Web is summarized by a single number for each page. The "PageRank" of a page is the probability that a random walker on the graph would be at that page at any given time. The remarkable property of this ranking is that it reflects very well the "importance" of the page: the degree to which typical searchers would like that page returned as an answer to their search query. I went more in depth here, as it is directly related to my final deliverable. Another form of summary is clustering. Here, data is viewed as points in a multidimensional space, and points that are "close" in this space are assigned to the same cluster.
Bonferroni's Principle
Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. These occurrences are "bogus," in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren't. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data. Calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.
TF.IDF
The measure called TF.IDF lets us identify words in a collection of documents that are useful for determining the topic of each document. A word has a high TF.IDF score in a document if it appears in relatively few documents overall but appears in this one, and when it appears in a document, it tends to appear many times.
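In symbols, as MMDS defines it: suppose word i appears f_ij times in document j, out of a collection of N documents. Then TF_ij = f_ij / max_k f_kj (the term frequency, normalized by the most frequent word in that document), and if word i appears in n_i of the N documents, IDF_i = log2(N / n_i). The TF.IDF score for word i in document j is TF_ij x IDF_i.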
Hash Functions
A hash function maps hash-keys of some data type to integer bucket numbers. A good hash function distributes the possible hash-key values approximately evenly among buckets. Any data type can be the domain of a hash function.
Indexes
An index is a data structure that allows us to store and retrieve data records efficiently, given the value in one or more of the fields of the record. Hashing is one way to build an index.
Storage on Disk
When data must be stored on disk (secondary memory), it takes much more time to access a desired data item than if the same data were stored in main memory. When the data is large, it is important that algorithms strive to keep needed data in main memory.
Power Laws
Many phenomena obey a law that can be expressed as y = cx^a for some power a, often around −2. Such phenomena include the sales of the xth most popular book, or the number of in-links to the xth most popular page.

So, that was probably more than what you might call a brief summary, but whatever.

This week I also realized that the Spark system is not just a scheduler; it is actually a parallel computing framework that contains a scheduler. I also found that Spark covers most programming languages with APIs, but to do PageRank, I would need to use its GraphX portion. I am currently learning more about GraphX, mostly from this link: (https://spark.apache.org/docs/0.9.0/graphx-programming-guide.html). This section only covers Java and Scala. This means that I can use the Java programming language to write my program and therefore do not need to learn Scala, although I may continue to learn Scala just for the sake of knowing the language. Below is an image of the Spark stack:
As you can see in the image, Spark powers a stack of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. The Spark Core is powered by a standalone scheduler.

If you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Have a great rest of the day!


Saturday, February 20, 2016

Week 2 Updates

Hello Readers!

This week I did a lot of reading up on cloud computing and scheduling. I read and completed 3 articles and am almost done with 2 others. I am continuing to read 3 other books: "Mining of Massive Datasets," a book about Java, and a book about C++.

Here are the three articles that I read: "Sparrow: Distributed, Low Latency Scheduling," "Load Balancing in Public Cloud by Division of Cloud Based on the Geographical Location," and "Review of Load Balancing in Cloud Computing." (Please see the tap on the top crossbar labeled Syllabus for citations of each text).

Here's a quick overview and terminology from each complete article (Note: some definitions and phrases may be directly from the article):

1) "Sparrow: Distributed, Low Latency Scheduling." Sparrow is a scheduler. A scheduler is a software application for the allocation of jobs. Basically, a scheduler is an application that takes a job off of a queue and then decides onto which machine to place it, with the goal of achieving the shortest possible run-time. See image, below (if you have difficulty viewing this image, click on it and a larger version will appear).
The researchers demonstrated that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. With implementation on a 110-machine cluster, the researchers demonstrated that Sparrow performs within 12% of an ideal scheduler. 
Terminology: 
Worker Machines execute tasks.
Schedulers assign tasks to worker machines.
A job consists of m tasks that are each allocated to a worker machine. Jobs can be handled by any scheduler.
If a worker machine is assigned more tasks than it can run concurrently, then it will queue new tasks until it has enough available resources to handle the next one.
Wait time is the time from which a task is submitted to the scheduler to when the task begins executing.
Service time is the amount of time that the task spends executing on a worker machine.
Job Response Time is the amount of time from when the job is submitted to the scheduler to when the last task within the job is finished executing.
Delay is the total delay within a job due to both scheduling and queuing. 

The next two articles reviewed load balancing within cloud computing.
2) "Load Balancing in Public Cloud by Division of Cloud Based on the Geographical Location." Load Balancing is a method of controlling the traffic in a cloud environment. Effective load balancing results in the efficient allocation of resources to all competing jobs. The researchers describe load balancing in a public cloud by partitioning the cloud into several sub-clouds. This division is done by geographical location. A Central Controlling System (CCS) monitors each of the sub-clouds. Every sub-cloud has a balancing system which monitors and allocates resources to competing jobs and relays information back to the CCS. From this information, the CCS can choose the optimal sub-cloud.

3) "Review of Load Balancing in Cloud Computing." This paper gave an overview of various aspects of cloud computing, its evolution, and its issues, particularly issues related to load balancing. The paper focused on one issue of load balancing: the consequences of inefficient load balancing may result in the detriment of a business's performance in the cloud environment.

This was just a sample of what I learned from these articles.

I am also beginning to learn the Scala programming language, as this is the language in which the Spark system is rooted. Hopefully, my prior experience with the Java programming language will help with the foundations of Scala.

If you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts. 

Wednesday, February 17, 2016

Week 2 Quick Update

Hi everybody!

A couple of days ago, I downloaded Oracle VM VirtualBox. This is a type of virtual machine, which is essentially a program that emulates a computer and stores its information on an allocated area of your disk. On this virtual machine, I installed Ubuntu, which I will use as the operating system (OS) of my virtual machine (VM). Ubuntu is a Debian-based (meaning it is a Unix-like OS) Linux OS, and it is an open-source software platform. The virtual machine basically puts Windows users and Linux users on the same page in terms of operating system.

Within this virtual machine, I downloaded the Spark system. Because Spark's native language is Scala, which is built on Java, I needed to install Oracle Java 7 in order to utilize Spark's resources. After installing Java, I also needed to install Scala.

To ensure that Spark had installed successfully, I wrote a very simple program in Scala that finds the quote "To be, or" in a text file of Hamlet. In addition, I wrote a small program that counts word frequencies and outputs the top 20 occurrences.

Over the next few weeks, I hope to learn more about Virtual Machines and Scala, in addition to the Spark System.

As always, if you have any questions or comments, feel free to leave a comment below and I will try to get back to you as soon as possible. 

Thanks!

Friday, February 12, 2016

Week 1 Updates

Although it is possible that my plan may change and evolve over the course of my research, as of now I plan to create a program that ranks webpages according to their importance, similar to the way in which Google ranks its webpages. In my program, I will use the Apache Spark system to process large amounts of data.

An explanation of page ranking:
Think of a series of webpages as an interconnected web: a connected, directed graph. Each webpage is a vertex on the graph, and each in-link (a directed edge) is a vote. A link's vote is proportional to the importance of its source page, so a vote from an important page is worth more, and a page is important if it is pointed to by other important pages.

If node j, with importance r_j, has n out-links, each out-link carries r_j/n votes. Node j's own importance is then the sum of the votes on its in-links: r_j = sum over all nodes i that link to j of r_i/d_i, where d_i is the number of out-links of node i.
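
As a tiny worked example: take two pages, A and B, where A links only to B and B links only to A. Each has exactly one out-link, so r_A = r_B and r_B = r_A; combined with the requirement that all the ranks sum to 1, this gives r_A = r_B = 0.5.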


How does Google rank Webpages?
It ranks webpages according to their importance. How do we quantify the importance of a webpage? An important webpage should have a lot of important webpages pointing to it.

The Web as a Graph
Treat the Web as a directed graph: nodes are webpages and edges are hyperlinks (graph theory!).
Here's an example of a small portion of a web acting as a directed graph.
Ranking Nodes on the Graph
Not all web pages are equally "important."

Feel free to leave a comment or reply. Let me know if this is confusing and I will try to clarify it for you. Thanks for reading!

Source: www.mmds.org

Senior Research Project Syllabus

Student: Farhan Khan

Project Title: Scheduling in the Spark Cloud Computing System

Location: ASU Goldwater Center

BASIS Advisor: Dr. Pearson

On-site Advisor: Dr. Lei Ying

On-site Advisor Contact Information:
Office: 436 Goldwater Center
Phone (o): 480-965-7003
Fax: 480-965-3837
Email: lei.ying.2@asu.edu
Mail: Arizona State University
PO Box 875706
Tempe, AZ 85287-5706

Mode of Daily Contact: Blog

Course Goals: Scheduling in the Spark Cloud Computing System has a few main objectives: first, I will get familiar with the Spark system and its scheduling algorithm; second, I will answer the question of how scheduling can improve cloud computing systems; and third, I will attempt to implement scheduling in the Spark Cloud Computing System. In pursuit of these goals, I will first answer the question: what is networking for big data? Second, I will familiarize myself with the programming language, syntax, and the different types of computations within data centers. Third, I will learn the Spark system, understand its architecture, and learn to use Spark to process big data. All of this information will be learned from books, scholarly articles, and Dr. Lei Ying and his students.

Course Texts:
Leskovec, Jure, Anand Rajaraman, Jeffrey Ullman. Mining of Massive Datasets. Palo Alto: Cambridge University Press, 2014. Print.

Ousterhout, K., Wendell, P., Zaharia, M., & Stoica, I. (2013, September 24). Sparrow: Distributed, Low Latency Scheduling. Retrieved from http://www.eecs.berkeley.edu/~keo/publications/sosp13-final17.pdf

Flanagan, David. Java in a Nutshell, Second Edition. Sebastopol: O’Reilly & Associates, Inc., 1997. Print.

Lippman, Stanley B. C++ Primer. 2nd ed. Reading: Addison-Wesley, 1997. Print.

Wang, Weina, Kai Zhu, Lei Ying, Jian Tan, Li Zhang. Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality. PDF.

Zaharia, Matei, Mosharaf Chowdury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. PDF.

Purohit, Abhijeet, Md. Abdul Waheed, Asma Parveen. “Load Balancing in Public Cloud by Division of Cloud Based on the Geographical Location.” International Journal of Research in Engineering and Technology 3.3 (2014): 316-320. Print.

Begum, Suriya, Dr. Prashanth. “Review of Load Balancing in Cloud Computing.” International Journal of Research in Engineering and Technology 10.1 (2013): 343 – 352. Print.

Project Product Description:
I will write code for a program that conveys my understanding of the programming language, scheduling, and load balancing. The final product will be a working program that has been thoroughly tested.

Senior Research Project Proposal

Statement of Purpose
In this project, I want to determine how scheduling can improve cloud computing systems. Additionally, I hope to implement scheduling in the Spark Cloud Computing System.  

BASIS Advisor:                           Patricia Pearson, Ph.D.
Arizona State University Advisor: Lei Ying, Ph.D.

Background
I am interested in combining scheduling with cloud computing because, with technology becoming a larger part of our everyday lives, more data needs to be stored. Advances in cloud computing will decrease the time it takes to process large amounts of data. I have taken courses in the Java programming language and have researched the “Sparrow” randomized scheduling algorithm.

Significance
The Spark system is relatively new, with a speed up to one hundred times greater than that of its predecessor, Hadoop MapReduce. Spark is advanced in both software and hardware, but because it is so new, there is not yet much work on scheduling for it. The implementation of scheduling algorithms in this system can help to increase speed and, therefore, decrease expense.


Research Methodology
For this project, I will primarily utilize websites, scholarly research, and library research. In my internship, I will be working closely with a professor at ASU and some of the PhD students working with him. They will help to guide me in my project and explain difficult concepts. I will consult scholarly research surrounding randomized scheduling algorithms and cloud computing methods. Through these readings, I will collect relevant information and compile it. I will also consult the source code of the Spark program, as it is open source online. 

Anticipated Problems
I anticipate that I will have problems understanding the source code for Spark and scheduling/randomized scheduling algorithms. In addition, I may have trouble building up knowledge in math (statistics and probability are the primary forms of mathematics utilized in this area) and in the concepts of Spark and the language in which Spark is written. For issues in understanding Spark and other concepts, I will consult with Professor Ying and some of his PhD students. In order to overcome obstacles in learning statistics, I will utilize online resources for examples and lessons. The language in which Spark is written is rooted in the Java programming language. Although I have experience with Java, I may have some trouble with some of the specific classes and keywords utilized in Spark’s source code. To resolve these issues, I will use websites that provide information on specific programming questions, such as Stack Exchange.

Bibliography:
1. Ousterhout, K., Wendell, P., Zaharia, M., & Stoica, I. (2013, September 24). Sparrow: Distributed, Low Latency Scheduling. Retrieved from http://www.eecs.berkeley.edu/~keo/publications/sosp13-final17.pdf
2. Over 800 contributors (2015). Apache Spark (Version 1.5.2) [Software]. Available from http://spark.apache.org/downloads.html 
3. Leskovec, Jure, Anand Rajaraman, Jeffrey Ullman. Mining of Massive Datasets. Palo Alto: Cambridge University Press, 2014. Print.
4. Flanagan, David. Java in a Nutshell, Second Edition. Sebastopol: O’Reilly & Associates, Inc., 1997. Print.
5. Lippman, Stanley B. C++ Primer. 2nd ed. Reading: Addison-Wesley, 1997. Print.
6. Wang, Weina, Kai Zhu, Lei Ying, Jian Tan, Li Zhang. Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality. PDF.
7. Zaharia, Matei, Mosharaf Chowdury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. PDF.
8. Purohit, Abhijeet, Md. Abdul Waheed, Asma Parveen. “Load Balancing in Public Cloud by Division of Cloud Based on the Geographical Location.” International Journal of Research in Engineering and Technology 3.3 (2014): 316-320. Print.
9. Begum, Suriya, Dr. Prashanth. “Review of Load Balancing in Cloud Computing.” International Journal of Research in Engineering and Technology 10.1 (2013): 343 – 352. Print.