Friday, February 26, 2016

Week 3 Updates

Hello Readers!

This week, I finished the articles I mentioned in my post "Week 2 Updates," and I continued reading the three books mentioned in the previous post. The articles I finished were "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" and "Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality." I also kept reading the books on Java and C++, but I focused primarily on "Mining of Massive Datasets." For citations of each text, please see the tab labeled "Syllabus" on the navigation bar at the top.

Here's a quick overview of what I learned this week. (Note: some definitions and phrases are taken directly from the articles and texts.)

1) "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs enable efficient data reuse in a broad range of applications. They are motivated by the inefficiencies of current computing frameworks in two types of applications: iterative applications and interactive data mining tools. In both cases, keeping data in memory can improve performance by a factor of 10. Fault tolerance is the property that enables a system to continue operating properly in the event of a failure within one or more of its systems. To achieve fault-tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. Coarse-grained transformations are desirable, in this case, over fine-grained updates because a fine-grained update will update one record in a database at a time, while a coarse update will use a general functional operator to update the entire database. Although a coarse update is not as flexible as a fine update, it comes with the benefit of working around the cost of updating billions of records one at a time. With fine grained updates, you cannot recompute because the cost of saving the updates could potentially be as much as the cost of saving the data itself. When I say cost, I don't necessarily mean monetary cost. Here, cost can also mean additional time or data or server capacity. Additionally, the researcher show that RDDs are expressive enough to capture a wide class of computations. They then go on to demonstrate the efficiency of RDDs through their implementation on the Spark system, which was evaluated through a variety of user applications and benchmarks.

2) "Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality." This article had a lot of math, proofs and lemmas, that I did not quite understand, so instead, I attempted to take away the concepts that the researchers were demonstrating through the math. MapReduce/Hadoop is a simple, yet powerful, framework for processing large-scale datasets in a distributed and parallel fashion. A production MapReduce cluster may even consist of tens of thousands of machines. The stored data are typically organized on distributed file systems, which divide a large dataset into chunks, and store multiple replicas of each chunk on different machines. A data processing request under the MapReduce framework, called a job, consists of two types of tasks: map and reduce. In assigning map tasks, a critical consideration is to place map tasks on or close to machines that store the input data chunks, a problem called data locality. For each task, we call a machine a local machine for the task if the data chunk associated with the task is stored locally, and we call this task a local task on the machine; otherwise, the machine is called a remote machine for the task and correspondingly this task is called a remote task on the machine. The term locality is also used to refer to the fraction of tasks that run on local machines. Improving locality can reduce both processing time and network traffic load. But assigning all task to local machines can lead to an uneven distribution of tasks among machines. So, there is a need to strike a balance between data locality and load balancing in MapReduce.

3) "Mining of Massive Datasets." I am currently on the second chapter of this book, so I will give a brief summary of the terms and concepts covered in the first chapter.
What is Data Mining?
Data mining is the discovery of "models" for data. Statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn. Data mining is not synonymous with machine learning: although the two overlap, machine learning is not always the best approach to a data-mining problem. More recently, computer scientists have viewed data mining as an algorithmic problem, where the model of the data is simply the answer to a complex query about it.
Summarization
One of the most interesting forms of summarization is the PageRank idea, which made Google successful. In this form of Web mining, the entire complex structure of the Web is summarized by a single number for each page. The "PageRank" of a page is the probability that a random walker on the graph of the Web would be at that page at any given time. The remarkable property of this ranking is that it reflects very well the "importance" of the page, that is, the degree to which typical searchers would like that page returned as an answer to their search query. I went more in depth here since it is directly related to my final deliverable. Another form of summary is clustering. Here, data is viewed as points in a multidimensional space, and points that are "close" in this space are assigned to the same cluster.
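To see how the random-walker idea turns into numbers, here is a tiny PageRank sketch in plain Java on a made-up three-page graph. The 0.85 "follow a link" probability is the commonly used damping value; the graph itself is purely hypothetical.

    import java.util.*;

    public class TinyPageRank {
        public static void main(String[] args) {
            // Hypothetical 3-page web: page 0 -> {1, 2}, page 1 -> {2}, page 2 -> {0}
            Map<Integer, int[]> outLinks = new HashMap<>();
            outLinks.put(0, new int[]{1, 2});
            outLinks.put(1, new int[]{2});
            outLinks.put(2, new int[]{0});

            int n = 3;
            double beta = 0.85;             // probability the walker follows a link
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);     // walker starts at a random page

            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                Arrays.fill(next, (1 - beta) / n);   // "teleport" term
                for (Map.Entry<Integer, int[]> e : outLinks.entrySet()) {
                    int[] dests = e.getValue();
                    double share = beta * rank[e.getKey()] / dests.length;
                    for (int d : dests) next[d] += share;
                }
                rank = next;
            }
            // Stationary probabilities, summing to ~1: these are the PageRanks.
            System.out.println(Arrays.toString(rank));
        }
    }

After enough iterations the rank vector stops changing, and those stationary probabilities are the PageRanks: pages that many other pages point to (directly or indirectly) end up with more of the walker's time.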
Bonferroni's Principle
Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur even if the data is completely random, and the number of such occurrences will grow as the size of the data grows. These occurrences are "bogus," in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren't. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data: calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.
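As a toy illustration of that last step (all numbers here are made up, not from the book), the check is just a comparison between the count you expect under pure randomness and the count you hope is real:

    public class BonferroniCheck {
        public static void main(String[] args) {
            // Hypothetical numbers for illustration only.
            double candidates   = 1e9;    // how many items/pairs we examine
            double pCoincidence = 1e-7;   // chance a random item happens to look "suspicious"
            long   realEvents   = 10;     // how many genuine events we hope to find

            double expectedBogus = candidates * pCoincidence;  // = 100 here
            System.out.printf("Expected bogus hits: %.0f, real events sought: %d%n",
                    expectedBogus, realEvents);
            // Since 100 >> 10, almost everything we "find" would be a statistical artifact.
        }
    }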
TF.IDF
The measure called TF.IDF lets us identify words in a collection of documents that are useful for determining the topic of each document. A word has a high TF.IDF score in a document if it appears in relatively few documents overall, but it does appear in this one, and when it appears it tends to appear many times.
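Here is a small sketch of that computation in Java, using the book's definitions as I understand them: term frequency normalized by the count of the most frequent word in the document, and IDF as log base 2 of (number of documents / number of documents containing the word). The toy documents are my own made-up input.

    import java.util.*;

    public class TfIdfSketch {
        // docs: each document is a list of lower-cased words (assumed already tokenized)
        static Map<String, Double> tfIdfForDoc(List<List<String>> docs, int docIndex) {
            List<String> doc = docs.get(docIndex);

            // Term frequency: count of each word, normalized by the most frequent word in this doc.
            Map<String, Integer> counts = new HashMap<>();
            for (String w : doc) counts.merge(w, 1, Integer::sum);
            int maxCount = Collections.max(counts.values());

            // Document frequency: in how many documents does each word appear?
            Map<String, Integer> docFreq = new HashMap<>();
            for (List<String> d : docs)
                for (String w : new HashSet<>(d)) docFreq.merge(w, 1, Integer::sum);

            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf  = (double) e.getValue() / maxCount;
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey())) / Math.log(2);
                scores.put(e.getKey(), tf * idf);
            }
            return scores;
        }

        public static void main(String[] args) {
            List<List<String>> docs = Arrays.asList(
                    Arrays.asList("spark", "runs", "pagerank", "pagerank"),
                    Arrays.asList("hadoop", "runs", "mapreduce"),
                    Arrays.asList("the", "cat", "runs"));
            System.out.println(tfIdfForDoc(docs, 0));
        }
    }

In the toy example, "pagerank" gets the highest score in document 0: it appears twice there and nowhere else, while a word like "runs" scores 0 because it appears in every document, so it tells us nothing about the topic.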
Hash Functions
A hash function maps hash-keys of some data type to integer bucket numbers. A good hash function distributes the possible hash-key values approximately evenly among buckets. Any data type can be the domain of a hash function.
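For example, here is a small sketch in Java; the keys and the bucket count are arbitrary, and Java's built-in hashCode() stands in for the hash function:

    public class BucketHash {
        // Map any hash-key to one of numBuckets buckets; floorMod keeps the result non-negative.
        static int bucketOf(Object key, int numBuckets) {
            return Math.floorMod(key.hashCode(), numBuckets);
        }

        public static void main(String[] args) {
            int numBuckets = 8;
            for (String key : new String[]{"spark", "hadoop", "graphx", "rdd"}) {
                System.out.println(key + " -> bucket " + bucketOf(key, numBuckets));
            }
        }
    }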
Indexes
An index is a data structure that allows us to store and retrieve data records efficiently, given the value in one or more of the fields of the record. Hashing is one way to build an index.
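A hash-based index is essentially a hash table from a field value to the records that have that value. Here is a small sketch in Java with a made-up Person record:

    import java.util.*;

    public class HashIndexSketch {
        // A toy record type with a field we want to look up by.
        static class Person {
            final String name; final int age;
            Person(String name, int age) { this.name = name; this.age = age; }
        }

        public static void main(String[] args) {
            // Index on the "name" field: hashing the key leads straight to the matching records.
            Map<String, List<Person>> byName = new HashMap<>();
            List<Person> people = Arrays.asList(
                    new Person("Ada", 36), new Person("Alan", 41), new Person("Ada", 29));
            for (Person p : people) {
                byName.computeIfAbsent(p.name, k -> new ArrayList<>()).add(p);
            }
            // Retrieval by field value, without scanning every record.
            for (Person p : byName.get("Ada")) System.out.println(p.name + " " + p.age);
        }
    }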
Storage on Disk
When data must be stored on disk (secondary memory), it takes much more time to access a desired data item than if the same data were stored in main memory. When the data is large, it is important that algorithms strive to keep the needed data in main memory.
Power Laws
Many phenomena obey a law that can be expressed as y = cx^a for some power a, often around −2. Such phenomena include the sales of the xth most popular book, or the number of in-links to the xth most popular page.
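One handy consequence is that a power law becomes a straight line on a log-log plot, since log y = log c + a log x. Here is a tiny sketch with made-up constants that prints a few points:

    public class PowerLawSketch {
        public static void main(String[] args) {
            double c = 1_000_000, a = -2;   // hypothetical constants
            for (int x = 1; x <= 5; x++) {
                double y = c * Math.pow(x, a);
                // On a log-log scale the relationship is linear: log y = log c + a * log x
                System.out.printf("x=%d  y=%.0f  log10(y)=%.2f%n", x, y, Math.log10(y));
            }
        }
    }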

So, that was probably more than what you might call a brief summary, but whatever.

This week I also realized that the Spark system is not just a scheduler; it is actually a parallel computing framework that contains a scheduler. I also found that Spark provides APIs for several programming languages, but to do PageRank I would need to use its GraphX component. I am currently learning more about GraphX, mostly from this link: (https://spark.apache.org/docs/0.9.0/graphx-programming-guide.html). This section only covers Java and Scala, which means I can use the Java programming language to write my program and do not need to learn Scala, although I may continue learning Scala just to know the language. Below is an image of the Spark stack:
As you can see in the image, Spark powers a stack of libraries, including Spark SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and these libraries can be combined seamlessly in the same application. Spark Core itself runs on top of a cluster manager, such as its built-in standalone scheduler.

If you have any questions or comments, please feel free to leave a comment below, and I will try to get back to you as soon as possible. The same goes for any requests to clarify topics or concepts.

Have a great rest of the day!

