Friday, February 26, 2016

Eclipse Issues

Hello again readers!

I am also having an issue with Eclipse, an integrated development environment (IDE) used for programming in Java. When I click on the application, it says that it is loading, then stops without ever opening. I'm not sure what the issue is, but I think it may be related to my recent download of Android Studio (an IDE for programming Android applications). My guess is that the Java Development Kit (JDK) was disconnected from Eclipse and redirected to Android Studio, but I am not really sure. This probably goes over most readers' heads, but if you have any ideas as to how to fix it, I would appreciate any comments.

Week 3 Updates

Hello Readers!

This week, I completed reading the articles that I mentioned in my post "Week 2 Updates," and I continued reading the 3 books that were mentioned in the previous post. I finished reading the following articles: "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" and "Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality." I continued reading the books on Java and C++, but I focused primarily on "Mining of Massive Datasets." For citations of each text, please see the tab on the top crossbar labeled "Syllabus."

Here's a quick overview of what I learned this week. (Note: some definitions and phrases may be taken directly from the articles and texts.)

1) "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs enable efficient data reuse in a broad range of applications. They are motivated by the inefficiencies of current computing frameworks in two types of applications: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by a factor of 10. Fault tolerance is the property that enables a system to continue operating properly when one or more of its components fail. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. Coarse-grained transformations are desirable here because a fine-grained update changes one record in a database at a time, while a coarse-grained transformation applies a single functional operator to the entire dataset. Although a coarse-grained transformation is not as flexible as a fine-grained update, it works around the cost of updating billions of records one at a time. With fine-grained updates, you cannot cheaply recompute lost data, because logging the updates could cost as much as storing the data itself. By cost, I don't necessarily mean monetary cost; here, cost can also mean additional time, data, or server capacity. Additionally, the researchers show that RDDs are expressive enough to capture a wide class of computations. They then demonstrate the efficiency of RDDs through their implementation in the Spark system, which was evaluated through a variety of user applications and benchmarks.
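To make the lineage idea concrete, here is a toy sketch (my own illustration, not Spark's actual implementation) of a dataset that records coarse-grained transformations, so a lost result can be recomputed by replaying them over the base data instead of logging every individual update:

```python
# Toy "RDD": stores the base data plus a lineage of whole-dataset operations.

class ToyRDD:
    def __init__(self, base, lineage=()):
        self.base = list(base)      # original input data
        self.lineage = lineage      # sequence of coarse-grained operations

    def map(self, fn):
        # Coarse-grained: one functional operator applied to every record.
        return ToyRDD(self.base, self.lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.base, self.lineage + (("filter", pred),))

    def compute(self):
        # Recover (or compute) the result by replaying the lineage; this is
        # what makes recomputation cheap compared with logging fine-grained
        # updates record by record.
        data = self.base
        for op, fn in self.lineage:
            data = [fn(x) for x in data] if op == "map" else [x for x in data if fn(x)]
        return data

nums = ToyRDD(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.compute())  # [0, 4, 16, 36, 64]
```

If a machine holding the result fails, nothing is lost: the lineage plus the base data is enough to rebuild it.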

2) "Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality." This article had a lot of math (proofs and lemmas) that I did not quite understand, so instead I tried to take away the concepts that the researchers were demonstrating through the math. MapReduce/Hadoop is a simple yet powerful framework for processing large-scale datasets in a distributed and parallel fashion. A production MapReduce cluster may consist of tens of thousands of machines. The stored data are typically organized on distributed file systems, which divide a large dataset into chunks and store multiple replicas of each chunk on different machines. A data processing request under the MapReduce framework, called a job, consists of two types of tasks: map and reduce. In assigning map tasks, a critical consideration is to place them on or close to the machines that store their input data chunks, a problem called data locality. For each task, a machine is called a local machine for the task if the task's data chunk is stored there, and the task is called a local task on that machine; otherwise, the machine is a remote machine for the task and the task is a remote task on it. The term locality is also used to refer to the fraction of tasks that run on local machines. Improving locality can reduce both processing time and network traffic load, but assigning all tasks to local machines can lead to an uneven distribution of tasks among machines. So there is a need to strike a balance between data locality and load balancing in MapReduce.
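As a rough sketch of that trade-off (my own toy heuristic, not the paper's algorithm), a scheduler might prefer a task's local machines but fall back to the least-loaded remote machine when the local queues grow too long; the `threshold` knob here is my own assumption:

```python
# Greedy map-task placement balancing data locality against load.

def assign_tasks(tasks, machines, locality, threshold=2):
    """tasks: list of task ids; machines: list of machine ids;
    locality: task id -> set of machines storing that task's data chunk."""
    queues = {m: [] for m in machines}
    for t in tasks:
        local = [m for m in locality[t] if m in queues]
        least_loaded = min(queues, key=lambda m: len(queues[m]))
        best_local = min(local, key=lambda m: len(queues[m])) if local else None
        # Take a local machine if its queue is at most `threshold` tasks
        # longer than the shortest queue anywhere; otherwise go remote.
        if best_local is not None and len(queues[best_local]) - len(queues[least_loaded]) <= threshold:
            queues[best_local].append(t)      # local task
        else:
            queues[least_loaded].append(t)    # remote task
    return queues

locality = {1: {"A"}, 2: {"A"}, 3: {"A"}, 4: {"B"}}
print(assign_tasks([1, 2, 3, 4], ["A", "B", "C"], locality))
```

With a tiny threshold every task would spread evenly but many would run remotely; with a huge threshold everything piles onto local machines. The paper's contribution is doing this balancing in a throughput-optimal way.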

3) "Mining of Massive Datasets." I am currently on the second chapter of this book, so I will give a brief summary of the terms and concepts covered in the first chapter.
What is Data Mining?
Data mining is the discovery of "models" for data. Statisticians view data mining as the construction of a statistical model, or an underlying distribution from which the visible data is drawn. Data mining is not synonymous with machine learning. Although the two overlap, there are many data-mining problems for which machine-learning methods are not the right tool. More recently, computer scientists have viewed data mining as an algorithmic problem: the model of the data is simply the answer to a complex query about it.
Summarization
One of the most interesting forms of summarization is the PageRank idea, which made Google successful. In this form of Web mining, the entire complex structure of the Web is summarized by a single number for each page. The "PageRank" of a page is the probability that a random walker on the graph would be at that page at any given time. The remarkable property of this ranking is that it reflects very well the "importance" of the page, that is, the degree to which typical searchers would like that page returned as an answer to their search query. I went more in depth here as it is directly related to my final deliverable. Another form of summary is clustering. Here, data is viewed as points in a multidimensional space. Points that are "close" in this space are assigned to the same cluster.
Bonferroni's Principle
Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur, even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. These occurrences are "bogus," in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren't. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data. Calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.
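Here is a small worked example in the spirit of the book's running example; the population, day, and hotel numbers are made up purely for illustration:

```python
# Expected number of "suspicious" coincidences in purely random data:
# pairs of people who were at the same hotel on two different days.
from math import comb

people = 10**6          # people under observation
days = 1000             # days of data
hotels = 100            # hotels available each day
p_visit = 0.01          # chance a given person visits some hotel on a day

# Probability two given people are at the same hotel on a given day:
p_same = (p_visit ** 2) / hotels
# Expected (pair of people, pair of days) coincidences in random data:
expected = comb(people, 2) * comb(days, 2) * p_same ** 2
print(round(expected))  # roughly 250,000 bogus "meetings" from pure chance
```

If you only expect a handful of real conspiracies, a quarter of a million random-looking hits tells you almost everything you find this way will be bogus.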
TF.IDF
The measure called TF.IDF lets us identify words in a collection of documents that are useful for determining the topic of each document. A word has high TF.IDF score in a document if it appears in relatively few documents, but appears in this one, and when it appears in a document it tends to appear many times.
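A minimal sketch of how such a score can be computed (using the common log-based IDF; the toy documents here are my own):

```python
# TF.IDF: term frequency (normalized by the most frequent term in the
# document) times inverse document frequency (rarer across docs => larger).
from math import log2

docs = [
    "spark makes cluster computing fast".split(),
    "spark scheduling on a cluster".split(),
    "the cat sat on the mat".split(),
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / max(doc.count(w) for w in doc)
    n_containing = sum(term in d for d in docs)
    idf = log2(len(docs) / n_containing)
    return tf * idf

# "spark" appears in 2 of the 3 docs, "cat" in only 1,
# so "cat" scores higher in its document.
print(tf_idf("spark", docs[0], docs))
print(tf_idf("cat", docs[2], docs))
```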
Hash Functions
A hash function maps hash-keys of some data type to integer bucket numbers. A good hash function distributes the possible hash-key values approximately evenly among buckets. Any data type can be the domain of a hash function.
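A quick sketch of this behavior, using MD5 as a stand-in for "a good hash function":

```python
# A good hash function spreads keys roughly evenly over the buckets.
import hashlib

def bucket(key, num_buckets):
    # MD5 digest interpreted as an integer, reduced to a bucket number.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets

B = 8
counts = [0] * B
for i in range(10000):
    counts[bucket(f"record-{i}", B)] += 1
print(counts)  # each bucket holds roughly 10000/8 = 1250 keys
```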
Indexes
An index is a data structure that allows us to store and retrieve data records efficiently, given the value in one or more of the fields of the record. Hashing is one way to build an index.
Storage on Disk
When data must be stored on disk (secondary memory), it takes much more time to access a desired data item than if the same data were stored in main memory. When data is large, it is important that algorithms strive to keep needed data in main memory.
Power Laws
Many phenomena obey a law that can be expressed as y = cx^a for some power a, often around −2. Such phenomena include the sales of the xth most popular book, or the number of in-links to the xth most popular page.
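A quick numeric illustration (the constant c is arbitrary here; only the shape matters): with a = -2, every doubling of x divides y by 4.

```python
# Power law y = c * x**a with exponent a = -2.
c, a = 1000.0, -2.0
for x in [1, 2, 4, 8]:
    print(x, c * x ** a)   # 1000.0, 250.0, 62.5, 15.625
```

On a log-log plot these points fall on a straight line of slope a, which is the usual way to spot a power law in real data.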

So, that was probably more than what you might call a brief summary, but whatever.

This week I also realized that the Spark system is not just a scheduler. It is actually a parallel computing framework, which contains a scheduler. I also found that Spark provides APIs for most major programming languages, but to do PageRank, I would need to use its GraphX component. I am currently learning more about GraphX, mostly from this link: (https://spark.apache.org/docs/0.9.0/graphx-programming-guide.html). This section only covers Java and Scala. This means that I can use the Java programming language to write my program, and therefore do not need to learn Scala, although I may continue learning Scala just to know the language. Below is an image of the Spark stack:
As you can see in the image, Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. The Spark Core is powered by a Standalone Scheduler.

If you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Have a great rest of the day!


Saturday, February 20, 2016

Week 2 Updates

Hello Readers!

This week I did a lot of reading up on cloud computing and scheduling. I read and completed 3 articles and am almost done with 2 others. I am continuing to read 3 other books: "Mining of Massive Datasets," a book about Java, and a book about C++.

Here are the three articles that I read: "Sparrow: Distributed, Low Latency Scheduling," "Load Balancing in Public Cloud by Division of Cloud Based on the Geographical Location," and "Review of Load Balancing in Cloud Computing." (Please see the tab on the top crossbar labeled Syllabus for citations of each text.)

Here's a quick overview and terminology from each complete article (Note: some definitions and phrases may be directly from the article):

1) "Sparrow: Distributed, Low Latency Scheduling." Sparrow is a scheduler. A scheduler is a software application for the allocation of jobs: it takes a job off of a queue and decides onto which machine to place it, with the goal of achieving the shortest possible run time. See the image below.
The researchers demonstrated that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. With implementation on a 110-machine cluster, the researchers demonstrated that Sparrow performs within 12% of an ideal scheduler. 
Terminology: 
Worker Machines execute tasks.
Schedulers assign tasks to worker machines.
A job consists of m tasks that are each allocated to a worker machine. Jobs can be handled by any scheduler.
If a worker machine is assigned more tasks than it can run concurrently, then it will queue new tasks until it has enough available resources to handle the next one.
Wait time is the time from when a task is submitted to the scheduler until the task begins executing.
Service time is the amount of time that the task spends executing on a worker machine.
Job Response Time is the amount of time from when the job is submitted to the scheduler to when the last task within the job is finished executing.
Delay is the total delay within a job due to both scheduling and queuing. 
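To illustrate the sampling idea at the heart of Sparrow (this sketch is my own simplification and leaves out Sparrow's batch sampling and late binding), each task can probe a few random workers and join the shortest probed queue, so no scheduler ever needs a global view of the cluster:

```python
# "Power of d choices" placement: probe d random workers per task,
# enqueue on the least-loaded one probed.
import random

def schedule(num_tasks, num_workers, d, seed=1):
    random.seed(seed)
    queues = [0] * num_workers          # queue length at each worker machine
    for _ in range(num_tasks):
        probes = random.sample(range(num_workers), d)
        target = min(probes, key=lambda w: queues[w])
        queues[target] += 1
    return queues

random_q = schedule(1000, 100, d=1)     # d=1: placement uniformly at random
sampled_q = schedule(1000, 100, d=2)    # d=2: two probes per task
print(max(random_q), max(sampled_q))    # sampling typically shortens the longest queue
```

Even going from one probe to two dramatically shrinks the longest queue, which is why a decentralized, randomized design can get close to an ideal centralized scheduler.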

The next two articles reviewed load balancing within cloud computing.
2) "Load Balancing in Public Cloud by Division of Cloud Based on the Geographical Location." Load Balancing is a method of controlling the traffic in a cloud environment. Effective load balancing results in the efficient allocation of resources to all competing jobs. The researchers describe load balancing in a public cloud by partitioning the cloud into several sub-clouds. This division is done by geographical location. A Central Controlling System (CCS) monitors each of the sub-clouds. Every sub-cloud has a balancing system which monitors and allocates resources to competing jobs and relays information back to the CCS. From this information, the CCS can choose the optimal sub-cloud.
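As a toy illustration of that decision (the sub-cloud names and the utilization metric here are my own assumptions, not the paper's):

```python
# Each sub-cloud's balancing system reports its load to the Central
# Controlling System (CCS); the CCS picks the least-utilized sub-cloud.
sub_clouds = {
    "us-west": {"capacity": 100, "running": 83},
    "eu-central": {"capacity": 80, "running": 41},
    "ap-south": {"capacity": 120, "running": 95},
}

def pick_sub_cloud(status):
    return min(status, key=lambda s: status[s]["running"] / status[s]["capacity"])

print(pick_sub_cloud(sub_clouds))  # "eu-central" (about 51% utilized)
```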

3) "Review of Load Balancing in Cloud Computing." This paper gave an overview of various aspects of cloud computing, its evolution, and its issues, particularly issues related to load balancing. The paper focused on one consequence in particular: inefficient load balancing can significantly harm a business's performance in the cloud environment.

This was just a sample of what I learned from these articles.

I am also beginning to learn the Scala programming language, as this is the language in which the Spark system is rooted. Hopefully, my prior experience with the Java programming language will help with the foundations of Scala.

If you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts. 

Wednesday, February 17, 2016

Week 2 Quick Update

Hi everybody!

A couple of days ago, I downloaded Oracle VM VirtualBox. This is a type of virtual machine, which is essentially a program that emulates a computer and stores its information in an allocated area of your disk. On this virtual machine, I installed Ubuntu, which I will use as the Operating System (OS) on my Virtual Machine (VM). Ubuntu is a Debian-based (meaning it is a Unix-like OS) Linux OS, and it is an open source software platform. The virtual machine basically puts Windows users and Linux users on the same page in terms of operating system.

Within this virtual machine, I downloaded the Spark system. Because Spark's native language is Scala, which runs on the Java Virtual Machine, I needed to install Oracle Java 7 in order to utilize Spark's resources. After installing Java, I also needed to install Scala.

To ensure that Spark had installed successfully, I wrote a very simple program in Scala that finds the quote "To be, or" in the text file of Hamlet. In addition, I wrote a small program that counted word frequencies and produced the top 20 occurrences.
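My actual programs were in Scala, but here is the same logic sketched in Python, so you can follow it without installing anything (punctuation is left attached to words for simplicity):

```python
# 1) find a quote in the text; 2) count word frequencies.
from collections import Counter

text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune"""

print("To be, or" in text)   # True: the quote is present

words = text.lower().split()
print(Counter(words).most_common(3))
```

The Spark versions do the same thing, just spread across a cluster: the file becomes a distributed dataset and the counting becomes a map over words followed by a reduce by key.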

Over the next few weeks, I hope to learn more about Virtual Machines and Scala, in addition to the Spark System.

As always, if you have any questions or comments, feel free to leave a comment below and I will try to get back to you as soon as possible. 

Thanks!

Friday, February 12, 2016

Week 1 Updates

Although my plan may change and evolve over the course of my research, as of now I plan to create a program that ranks webpages according to their importance, similar to the way in which Google ranks its webpages. In my program, I will implement the Apache Spark system to process large amounts of data.

An explanation of page ranking:
Think of a series of webpages as an interconnected web: a connected, directed graph. Each webpage is a vertex on the graph and each in-link (a directed edge) is a vote. Each link's vote is proportional to the importance of its source page, so a 'vote' from an important page is worth more. A page is important if it is pointed to by other important pages.

If node j has n out-links, each out-link carries Rj/n votes, where n is the number of out-links from node j.

Node j's importance, Rj, is the sum of the votes on its in-links.
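This voting scheme can be turned into a small program by repeatedly passing votes around a toy three-page web until the ranks settle (this simple version omits the damping factor that full PageRank adds to handle dead ends and spider traps):

```python
# PageRank power iteration on a tiny toy web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> out-links
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}           # start with equal ranks

for _ in range(50):                 # iterate until the ranks settle
    new = {p: 0.0 for p in pages}
    for p, outs in links.items():
        for q in outs:              # p sends rank[p] / (its out-degree) to q
            new[q] += rank[p] / len(outs)
    rank = new

print({p: round(r, 3) for p, r in rank.items()})  # {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

A and C end up tied and more important than B, because each receives a full vote along some in-link while B only ever receives half of A's vote.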


How does Google rank Webpages?
It ranks webpages according to their importance. How do we quantify the importance of a webpage? An important webpage is one that many other important webpages point to.

A Web as a Graph
Think of the Web as a directed graph: nodes are webpages and edges are hyperlinks (graph theory!).
Here's an example of a small portion of a web acting as a directed graph.
Ranking Nodes on the Graph
Not all web pages are equally "important."

Feel free to leave a comment or reply. Let me know if this is confusing and I will try to clarify it to you. Thanks for reading!

Source: www.mmds.org

Senior Research Project Syllabus

Student: Farhan Khan

Project Title: Scheduling in the Spark Cloud Computing System

Location: ASU Goldwater Center

BASIS Advisor: Dr. Pearson

On-site Advisor: Dr. Lei Ying

On-site Advisor Contact Information:
Office: 436 Goldwater Center
Phone (o): 480-965-7003
Fax: 480-965-3837
Email: lei.ying.2@asu.edu
Mail: Arizona State University
PO Box 875706
Tempe, AZ 85287-5706

Mode of Daily Contact: Blog

Course Goals: Scheduling in the Spark Cloud Computing System has a few main objectives: first, I will become familiar with the Spark system and its scheduling algorithm; second, I will answer the question of how scheduling can improve cloud computing systems; and third, I will attempt to implement scheduling in the Spark cloud computing system. In pursuit of these goals, I will first answer the question: what is networking for big data? Second, I will familiarize myself with the programming language, syntax, and the different types of computations within data centers. Third, I will learn the Spark system, understand its architecture, and learn to use Spark to process big data. All of this information will come from books, scholarly articles, and Dr. Lei Ying and his students.

Course Texts:
Leskovec, Jure, Anand Rajaraman, Jeffrey Ullman. Mining of Massive Datasets. Palo Alto: Cambridge University Press, 2014. Print.

Ousterhout, K., Wendell, P., Zaharia, M., & Stoica, I. (2013, September 24). Sparrow: Distributed, Low Latency Scheduling. Retrieved from http://www.eecs.berkeley.edu/~keo/publications/sosp13-final17.pdf

Flanagan, David. Java in a Nutshell, Second Edition. Sebastopol: O'Reilly & Associates, Inc., 1997. Print.

Lippman, Stanley B. C++ Primer. 2nd ed. Reading: Addison-Wesley, 1997. Print.

Wang, Weina, Kai Zhu, Lei Ying, Jian Tan, Li Zhang. Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality. PDF.

Zaharia, Matei, Mosharaf Chowdury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. PDF.

Purohit, Abhijeet, Md. Abdul Waheed, Asma Parveen. "Load Balancing in Public Cloud by Division of Cloud Based on the Geographical Location." International Journal of Research in Engineering and Technology 3.3 (2014): 316–320. Print.

Begum, Suriya, Dr. Prashanth. “Review of Load Balancing in Cloud Computing.” International Journal of Research in Engineering and Technology 10.1 (2013): 343 – 352. Print.

Project Product Description:
I will write code for a program that conveys my understanding of the programming language, scheduling, and load balancing. The final product will be a working program that has been thoroughly tested.

Senior Research Project Proposal

Statement of Purpose
In this project, I want to determine how scheduling can improve cloud computing systems. Additionally, I hope to implement scheduling in the Spark Cloud Computing System.  

BASIS Advisor:                           Patricia Pearson, Ph.D.
Arizona State University Advisor: Lei Ying, Ph.D.

Background
I am interested in combining scheduling with cloud computing because, with technology becoming a larger part of our everyday lives, more and more data needs to be stored. Advances in cloud computing will decrease the time it takes to process large amounts of data. I have taken courses in the Java programming language, and have researched the "Sparrow" randomized scheduling algorithm.

Significance
The Spark system is relatively new, with reported speeds up to one hundred times greater than those of its predecessor, Hadoop MapReduce. Because it is so new, Spark is advanced in software and hardware support, but there is not yet much work on scheduling for it. Implementing scheduling algorithms in this system can help increase speed and, therefore, decrease expense.


Research Methodology
For this project, I will primarily utilize websites, scholarly research, and library research. In my internship, I will be working closely with a professor at ASU and some of the PhD students working with him. They will help to guide me in my project and explain difficult concepts. I will consult scholarly research surrounding randomized scheduling algorithms and cloud computing methods. Through these readings, I will collect relevant information and compile it. I will also consult the source code of the Spark program, as it is open source online. 

Anticipated Problems
I anticipate that I will have problems understanding the source code for Spark and scheduling/randomized scheduling algorithms. In addition, I may have trouble building up knowledge in math (statistics and probability are the primary forms of mathematics utilized in this area) and in the concepts of Spark and the language in which Spark is written. For issues in understanding Spark and other concepts, I will consult with Professor Ying and some of his PhD students. To overcome obstacles in learning statistics, I will utilize online resources for examples and lessons. The language in which Spark is written, Scala, is rooted in the Java programming language. Although I have experience with Java, I may have some trouble with some of the specific classes and keywords utilized in Spark's source code. To resolve these issues, I will use websites that provide information on specific programming questions, such as Stack Exchange.

Bibliography:
1. Ousterhout, K., Wendell, P., Zaharia, M., & Stoica, I. (2013, September 24). Sparrow: Distributed, Low Latency Scheduling. Retrieved from http://www.eecs.berkeley.edu/~keo/publications/sosp13-final17.pdf
2. Over 800 contributors (2015). Apache Spark (Version 1.5.2) [Software]. Available from http://spark.apache.org/downloads.html 
3. Leskovec, Jure, Anand Rajaraman, Jeffrey Ullman. Mining of Massive Datasets. Palo Alto: Cambridge University Press, 2014. Print.
4. Flanagan, David. Java in a Nutshell, Second Edition. Sebastopol: O'Reilly & Associates, Inc., 1997. Print.
5. Lippman, Stanley B. C++ Primer. 2nd ed. Reading: Addison-Wesley, 1997. Print.
6. Wang, Weina, Kai Zhu, Lei Ying, Jian Tan, Li Zhang. Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality. PDF.
7. Zaharia, Matei, Mosharaf Chowdury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. PDF.
8. Purohit, Abhijeet, Md. Abdul Waheed, Asma Parveen. "Load Balancing in Public Cloud by Division of Cloud Based on the Geographical Location." International Journal of Research in Engineering and Technology 3.3 (2014): 316–320. Print.
9. Begum, Suriya, Dr. Prashanth. “Review of Load Balancing in Cloud Computing.” International Journal of Research in Engineering and Technology 10.1 (2013): 343 – 352. Print.