Page Rank Algorithm Implementation in the Apache Spark Cloud Computing System

Monday, April 25, 2016

Week 11 Concluding Post

Hello Readers,

My presentation is next week, which means that I have reached the end of my Senior Research Project and, therefore, of updating this blog. Although I may post a couple of additional updates, this is most likely the final post. My program is finalized and, as far as I can tell, works without issues.

Thank you all for reading. If you have any questions about anything, just comment on this post. If your question is about a specific post, reference the title of that post in your comment.

Thanks for reading and have a great day!

Tuesday, April 12, 2016

Week 10 Quick Updates

Hello Readers,

I believe that I figured out the problem in the program. After more testing with the edited algorithm from the USC paper and with a different, smaller dataset, I think that the output ranks align better with what should be coming up. Now the greatest rank is around 0.003, and the total of all the ranks is about 0.9. The smaller dataset has only about 10,000 nodes, an eighth the size of the dataset that I was testing with before. I switched to this smaller one because all of its nodes load into the program, while the larger dataset was truncated. The smaller one performed well when run through the program: it looks as though all nodes were loaded and ranked.

I also reduced the number of iterations. It appears that as the number of iterations increases, the total rank decreases. I tried increasing the iterations with the larger dataset to see whether that would have any effect on the output ranks. However, I ran into an issue. The most iterations that I had run so far was 100, so I tried running 300. The program ran for about 20 minutes and then hit a stack overflow error.

A stack overflow happens when the call stack pointer exceeds the stack bound. Essentially, the call stack is a limited amount of predefined space, and if a program attempts to use memory beyond the call stack's bounds, the program crashes. This is usually due to infinite recursion (a function calls itself so many times that the space needed to store its variables and information exceeds the stack limit). However, since my program has worked successfully with lower iteration counts, it was not infinite recursion. More likely it was very deep recursion: a recursive function that will, in theory, terminate, but requires more stack space than is available to execute completely. For the sake of being thorough, I will mention the third major cause of stack overflow errors: very large stack variables. This is an attempt to allocate more memory than will fit on the stack. For example, if you declare a local array with some huge size, the program may run into a stack overflow error.

Anyway, back to stack overflow for my purposes. I tried again with 250 iterations, then with 200, and once more with 150, but calculating the page rank iteratively for 80,000 nodes 150 times requires a lot of space. In any case, I think I solved my problem, but if I want to try it out on the large dataset, I will need to give the JVM (Java Virtual Machine) more stack space (the -Xss option), since a stack overflow concerns the thread stack rather than the heap.
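To make the difference between deep recursion and a plain loop concrete, here is a small sketch in plain Java. It is not code from the PageRank program; the class name, method names, and the depth of 50 million are made up for illustration. The recursive version needs one stack frame per step, so a large enough input overflows the stack even though the recursion would eventually terminate, while the loop computes the same sum in constant stack space.

```java
class RecursionDepthSketch {
    // Sums 1..n recursively: every call adds a stack frame until n reaches 0.
    static long sumRecursive(long n) {
        return n == 0 ? 0 : n + sumRecursive(n - 1);
    }

    // Same result with a loop: stack usage does not grow with n.
    static long sumIterative(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    // Returns true if the recursive version runs out of stack for this n.
    static boolean overflows(long n) {
        try {
            sumRecursive(n);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("recursive, n = 1000: " + sumRecursive(1000));
        System.out.println("iterative, n = 50,000,000: " + sumIterative(50_000_000L));
        System.out.println("recursive overflows at n = 50,000,000: " + overflows(50_000_000L));
    }
}
```

The depth at which the recursive version fails depends on the thread stack size, which is why raising that limit (or restructuring the work as a loop) can make a deep but finite computation succeed.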

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Monday, April 11, 2016

Week 9 Updates

Hello Readers,

Last week was just more debugging. There is something wrong, I think, with the algorithm that I have implemented within the program. The problem, again, is that some of the ranks that it is outputting are greater than one, the largest being around 60. The ranks are supposed to add up to one altogether, and with close to 80,000 nodes, each rank should be very small. I found a research paper from USC (University of Southern California) with a page rank implementation that is different from the algorithms that I have seen in other papers. I tried that one out and it seems to work, but I need to conduct further testing. Now the ranks are greatly reduced, but they don't seem to add up to 1; instead, they total around 0.5.
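For reference, here is a minimal sketch in plain Java (no Spark) of a PageRank formulation in which the ranks sum to one. It is not the program from this project and not the algorithm from the USC paper; the class name, the damping factor of 0.85, and the tiny three-node graph are assumptions chosen for illustration. Every rank starts at 1/N, and the rank held by nodes with no outgoing links is spread evenly over all nodes. When that redistribution step is switched off, rank leaks out of the graph on every iteration and the total falls below one, which is one possible (unconfirmed) explanation for totals like 0.5.

```java
import java.util.Arrays;

class PageRankSketch {
    // links[i] lists the nodes that node i links to.
    static double[] pageRank(int[][] links, int iterations, double damping,
                             boolean redistributeDangling) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start as a uniform probability distribution

        for (int it = 0; it < iterations; it++) {
            double[] received = new double[n];
            double danglingMass = 0.0;
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    danglingMass += rank[i]; // node with no outgoing links
                } else {
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        received[target] += share;
                    }
                }
            }
            double base = (1.0 - damping) / n;
            double danglingShare = redistributeDangling ? damping * danglingMass / n : 0.0;
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                next[i] = base + damping * received[i] + danglingShare;
            }
            rank = next;
        }
        return rank;
    }

    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] links = { {1, 2}, {2}, {} }; // node 2 has no outgoing links
        double[] kept = pageRank(links, 20, 0.85, true);
        double[] leaked = pageRank(links, 20, 0.85, false);
        System.out.println("with redistribution:    " + Arrays.toString(kept) + " total " + sum(kept));
        System.out.println("without redistribution: " + Arrays.toString(leaked) + " total " + sum(leaked));
    }
}
```

By contrast, variants that start every rank at 1 and use a constant term of (1 - damping) instead of (1 - damping)/N produce ranks that average about one, so individual ranks well above one are expected there.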

Besides that, I am working on my presentation, as the end of the research project approaches (in early May, so I have about three more weeks).

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Monday, April 4, 2016

Week 8 Updates

Hello Readers,

Sorry for the late post. This post is for last week.

Not too much to add from last week. I finished debugging the PageRank program, so it outputs a massive list of the ranks for each node and finds the top ten. I also added code for outputting the top 10 nodes in terms of degree (the number of neighbors that the node has). I added this to determine whether there is a relationship between rank and number of neighbors; that is, will the node with the highest rank also have the highest degree? I completed this code and it worked; however, I found another bug that I think is external. The file that I have been testing with should have 80,000 nodes, but Excel can only read about 1 million lines (1,048,576 rows). This may seem like enough, but it isn't, because each node can have several thousand neighbors, so the entire CSV file cannot load. I tried to load the CSV file in WordPad, and it works after loading for several minutes. But when I run it through my program, I find that it is outputting nodes with ranks greater than one, the largest being a rank of 30, which should not happen under the algorithm. All ranks are supposed to be less than one and sum to a total of one. I may end up having to rewrite the algorithm and use a different method for determining page rank in my code.
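As a rough illustration of the degree ranking described above, here is a minimal sketch in plain Java (no Spark). It is not the code from the project. It assumes, hypothetically, that the dataset is an edge list with one "node,node" pair per line and that each undirected edge appears once; the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DegreeSketch {
    // Counts neighbors per node from lines of the assumed form "a,b".
    static Map<String, Integer> degrees(List<String> edgeLines) {
        Map<String, Integer> degree = new HashMap<>();
        for (String line : edgeLines) {
            String[] parts = line.trim().split(",");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            degree.merge(parts[0].trim(), 1, Integer::sum);
            degree.merge(parts[1].trim(), 1, Integer::sum);
        }
        return degree;
    }

    // Returns the k node ids with the highest score, ties broken by node id.
    static List<String> topK(Map<String, Integer> scores, int k) {
        return scores.entrySet().stream()
                .sorted((a, b) -> {
                    int byValue = Integer.compare(b.getValue(), a.getValue());
                    return byValue != 0 ? byValue : a.getKey().compareTo(b.getKey());
                })
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> edges = Arrays.asList("1,2", "1,3", "1,4", "2,3");
        Map<String, Integer> degree = degrees(edges);
        System.out.println("degrees: " + degree);
        System.out.println("top 2 by degree: " + topK(degree, 2));
    }
}
```

The same top-k step could be applied to a map of ranks, which would make it easy to compare the two top-ten lists side by side.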

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Sunday, March 27, 2016

Week 7 Updates

Hello Readers,

I am working on debugging the page rank program. In addition, I am going to be testing it using information provided by an ASU website. This site provides network datasets for various websites (primarily social media). The datasets come as CSV (comma-separated values) files, which can be opened in programs such as Excel or loaded as plain text files (the form that I will most likely be using when testing my program).

If you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Title Change

Hello Readers,

Based on the focus of my project, I do not think that the current title accurately describes my topic anymore. Therefore I have changed my title and description.

Here is the new title: Page Rank Algorithm Implementation in the Apache Spark Cloud Computing System for use in Big Data Analysis

Here is the new description: As technology becomes a larger part of our everyday lives, the cloud, and therefore cloud computing, becomes essential. Large-scale data processing is an important part of cloud computing, and can be run through systems such as Spark. Spark is a cluster computing framework whose libraries developers can use in their own programs. This allows the driver program to be run on Spark and quickly process large amounts of information. One such program is the Page Rank algorithm, which analyzes the importance of a web page by assigning it a rank based on the outgoing and incoming links to and from other webpages. This algorithm was made famous by Google and is used to order the listed search results by importance.

Sunday, March 20, 2016

Week 6 Updates

Hello Readers,

Sorry for the late post. This is for last week. I'll just give a very brief overview of what I did.

I wrote a naive implementation of the Page Rank algorithm in a Java program utilizing Spark's libraries. I plan to test it using a small set of sample web pages. I will run it on Amazon Web Services (AWS) EC2 servers. I made an AWS account recently; basically, it gives you virtual servers for development, computation, storage, and networking. As long as you stay under 750 hours/month, it is free for the first year of use.

I also wrote a naive implementation of a simple word count program in Java that also utilizes Spark, but this was more for getting familiar with Spark in Java than it was for the actual project. The word count program is Spark's 'Hello World' program.

Finally, I worked on resolving certain issues within my VirtualBox setup that pertained to importing an AWS security key onto my local disk in order to be able to run instances of EC2 servers from a program.

Anyway, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!