Monday, April 25, 2016

Week 11 Concluding Post

Hello Readers,

Next week is my presentation, which means I have reached the end of my Senior Research Project and, therefore, of updating my blog. Although I may post a couple of additional updates, this is most likely the final post. My program is finalized and, as far as I can tell, works well with no issues.

Thank you all for reading. If you have any questions about anything, just comment on this post. If your question is about a specific post, reference the title of that post in your comment.

Thanks for reading and have a great day!


Tuesday, April 12, 2016

Week 10 Quick Updates

Hello Readers,

I believe that I have figured out the problem in the program. After more testing with the edited algorithm from the USC paper and with a different, smaller dataset, the outputted ranks align better with what should be coming up. Now the greatest rank is around 0.003, and the total of all the ranks is about 0.9. The smaller dataset has only about 10,000 nodes, just an eighth of the dataset I had been testing with. I switched to this smaller one because all of its nodes would load into the program, while the larger dataset would be truncated. The smaller one performed well when run through the program: it looks as though all nodes were loaded and ranked. I also reduced the number of iterations; it appears that as the number of iterations increases, the total rank decreases.

I then tried to see whether increasing the number of iterations would have any effect on the outputted ranks for the larger dataset. However, I ran into an issue. The most iterations I had run so far was 100, so I tried running 300. The program ran for about 20 minutes, then crashed with a StackOverflowError. This happens when the call stack pointer exceeds the stack's bounds: the call stack has a limited amount of predefined space, and if a program attempts to use stack memory beyond that bound, it crashes. The usual cause is infinite recursion, where a function calls itself endlessly until the space needed to store each call's variables and return information exceeds the stack limit. Since my program has worked successfully with lower iteration counts, though, infinite recursion was not the culprit. More likely it was very deep recursion, which is a recursive function that will, in theory, terminate, but requires too much stack space to run to completion. For the sake of being thorough, I will mention the third major cause of stack overflow errors: very large stack variables, i.e. attempting to fit more data on the stack than it can hold. For example, declaring a local array with an enormous size can trigger a stack overflow in languages that place arrays on the stack (in Java, arrays live on the heap, so an oversized array throws an OutOfMemoryError instead).

Anyway, back to the stack overflow in my program. I tried again with 250 iterations, then 200, then 150, but calculating the page rank iteratively for 80,000 nodes 150 times requires a lot of space allocation. In any case, I think I solved my problem, but if I want to try it out on the large dataset, I will need to raise the stack size of the JVM (Java Virtual Machine), and possibly its heap size as well.
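Since the power iteration behind PageRank is naturally a sequence of passes, the deep recursion can usually be avoided entirely by driving the iterations with a plain loop instead of having each iteration call the next. Here is a minimal sketch of the two styles; the method names are hypothetical, and `updateOnce` is just an identity placeholder rather than a real PageRank update step:

```java
import java.util.HashMap;
import java.util.Map;

public class IterationDriver {
    // Recursive style: iteration k calls iteration k-1, so 300 iterations
    // means roughly 300 stack frames, plus each frame's local variables.
    static Map<Integer, Double> rankRecursive(Map<Integer, Double> ranks, int left) {
        if (left == 0) return ranks;
        return rankRecursive(updateOnce(ranks), left - 1);
    }

    // Loop style: the stack stays one frame deep no matter how many
    // iterations run; only the rank maps live on the heap.
    static Map<Integer, Double> rankLoop(Map<Integer, Double> ranks, int iterations) {
        for (int i = 0; i < iterations; i++) {
            ranks = updateOnce(ranks);
        }
        return ranks;
    }

    // Placeholder update step; a real implementation would apply the
    // PageRank formula here instead of copying the map unchanged.
    static Map<Integer, Double> updateOnce(Map<Integer, Double> ranks) {
        Map<Integer, Double> next = new HashMap<>();
        for (Map.Entry<Integer, Double> e : ranks.entrySet())
            next.put(e.getKey(), e.getValue());
        return next;
    }
}
```

The loop version uses constant stack depth regardless of the iteration count. Separately, the JVM's per-thread stack size can be raised with the `-Xss` flag (e.g. `java -Xss8m PageRank`) and the heap with `-Xmx` (e.g. `java -Xmx4g PageRank`).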

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Monday, April 11, 2016

Week 9 Updates

Hello Readers,

Last week was just more debugging. There is something wrong, I think, with the algorithm that I have implemented in the program. The problem, again, is that some of the ranks it outputs are greater than one, the largest around 60. The ranks are supposed to sum to one altogether, and with close to 80,000 nodes, each individual rank should be very small. I found a research paper from USC (University of Southern California) with a PageRank implementation that is different from the algorithms I have seen in other papers. I tried that one out and it seems to work, but I need to conduct further testing. The ranks are now greatly reduced, but they don't seem to add up to 1; instead, the total is around 0.5.
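For what it's worth, one common reason iterative PageRank totals drift well below 1 is rank "leaking" out of dangling nodes (nodes with no outgoing edges): their rank is never passed on, so the total shrinks with each pass. Here is a sketch of one iteration that redistributes the dangling mass evenly; the names and data layout are illustrative, not the program's actual ones:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRankStep {
    // One power-iteration step with dangling-node redistribution.
    // graph maps each node to its out-neighbors; ranks maps node -> current rank;
    // d is the damping factor (typically 0.85).
    static Map<Integer, Double> step(Map<Integer, List<Integer>> graph,
                                     Map<Integer, Double> ranks, double d) {
        int n = ranks.size();
        Map<Integer, Double> next = new HashMap<>();
        for (Integer node : ranks.keySet()) next.put(node, 0.0);

        double danglingMass = 0.0;
        for (Map.Entry<Integer, Double> e : ranks.entrySet()) {
            List<Integer> out = graph.get(e.getKey());
            if (out == null || out.isEmpty()) {
                danglingMass += e.getValue();       // no out-links: spread evenly later
            } else {
                double share = e.getValue() / out.size();
                for (Integer nb : out)
                    next.merge(nb, share, Double::sum);
            }
        }
        // Combine the teleport term, the link term, and the redistributed
        // dangling mass; this keeps the total rank at exactly 1.
        for (Integer node : next.keySet())
            next.put(node, (1 - d) / n + d * (next.get(node) + danglingMass / n));
        return next;
    }
}
```

Because the dangling mass is folded back in, each pass preserves a total of 1 rather than letting it decay toward values like 0.5.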

Besides that, I am just working on my presentation as the end of the research project approaches in early May, so I have just around three more weeks.

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!

Monday, April 4, 2016

Week 8 Updates

Hello Readers,

Sorry for the late post. This post is for last week.

Not too much to add from last week. I finished debugging the PageRank program, so it outputs a massive list of the ranks for each node and finds the top ten. I also added code for outputting the top 10 nodes by degree (the number of neighbors a node has). I added this to determine whether there is a relationship between rank and number of neighbors, i.e. whether the node with the highest rank also has the highest degree. I completed this code and it worked; however, I found another bug that I think is external. The file I have been testing with should have 80,000 nodes, but Excel can only display about a million rows (1,048,576). This may seem like enough, but it isn't, because each node can have several thousand neighbors, so the entire CSV file cannot load. I tried opening the CSV file in WordPad, and it works after loading for several minutes. But when I run the file through my program, it outputs nodes with ranks greater than one, the largest being a rank of about 30, which violates the algorithm's invariants: all ranks are supposed to be less than one and sum to a total of one. I may end up rewriting the algorithm and using a different method for determining page rank in my code.
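As a side note, one way to check that the full edge list is really being read, without relying on Excel or WordPad, is to stream the CSV line by line and tally degrees as you go, so the whole file never has to fit in memory at once. Here is a rough sketch, assuming a simple `source,target` edge format per line (that format is an assumption, not necessarily my file's actual layout):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

public class DegreeCounter {
    // Streams edges one line at a time (never holds the whole file in
    // memory) and tallies each node's degree, i.e. its neighbor count.
    static Map<String, Integer> countDegrees(Reader input) throws IOException {
        Map<String, Integer> degree = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                if (parts.length < 2) continue;     // skip malformed rows
                degree.merge(parts[0].trim(), 1, Integer::sum);
                degree.merge(parts[1].trim(), 1, Integer::sum);
            }
        }
        return degree;
    }
}
```

The size of the returned map is the true node count, which can be compared against the expected roughly 80,000 to confirm nothing was truncated.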

As always, if you have any questions or comments please feel free to leave a comment below and I will try to get back to you as soon as possible. Same goes for any clarification of topics or concepts.

Thanks for reading and have a great day!