Monday, April 4, 2016

Week 8 Updates

Hello Readers,

Sorry for the late post. This post is for last week.

Not too much to add from last week. I finished debugging the PageRank program: it now outputs the full list of ranks for every node and finds the top ten. I also added code to output the top ten nodes by degree (the number of neighbors a node has). I added this to see whether there is a relationship between rank and degree, as in whether the node with the highest rank also has the highest degree. That code works; there is a rough sketch of the idea at the end of this post.

However, I then found another bug that I think is external. The file I have been testing with should have 80,000 nodes, but Excel can only read about a million rows. That may sound like plenty, but it isn't: each node can have several thousand neighbors, so the edge list runs far past that limit and the entire CSV file cannot load. I was able to open the CSV in WordPad after letting it load for several minutes, but when I run the file through my program, some nodes come out with ranks greater than one, the largest being a rank of 30. That violates the algorithm's invariant: every rank is supposed to be less than one, and all the ranks should sum to one. I may end up rewriting the algorithm and using a different method for computing PageRank in my code.
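
For anyone curious, here is a rough sketch of those checks in Python. This is not my actual project code: it assumes the edge list is a CSV with one "source,target" pair per line, "graph.csv" is a placeholder file name, the damping factor and iteration count are just typical values, and I count out-degree as the degree.

    # A rough sketch of the checks described above, not my project code.
    # Assumes an edge-list CSV with one "source,target" pair per line;
    # "graph.csv" is a placeholder name.
    import csv
    from collections import defaultdict

    DAMPING = 0.85     # assumed typical damping factor
    ITERATIONS = 50    # assumed iteration count

    def load_graph(path):
        """Read the edge list into an adjacency map and a node set."""
        neighbors = defaultdict(set)
        nodes = set()
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                source, target = row[0], row[1]
                neighbors[source].add(target)
                nodes.update((source, target))
        return neighbors, nodes

    def pagerank(neighbors, nodes):
        """Plain power iteration; ranks sum to one by construction."""
        n = len(nodes)
        rank = {node: 1.0 / n for node in nodes}
        for _ in range(ITERATIONS):
            # Rank held by dangling nodes (no out-edges) is spread evenly.
            dangling = sum(rank[v] for v in nodes if not neighbors.get(v))
            base = (1.0 - DAMPING) / n + DAMPING * dangling / n
            new_rank = {node: base for node in nodes}
            for node in nodes:
                outs = neighbors.get(node)
                if outs:
                    share = DAMPING * rank[node] / len(outs)
                    for target in outs:
                        new_rank[target] += share
            rank = new_rank
        return rank

    neighbors, nodes = load_graph("graph.csv")
    rank = pagerank(neighbors, nodes)

    # Sanity check: every rank is below one and the total is ~1.
    # A node with a rank of 30 would fail here immediately.
    assert all(r < 1.0 for r in rank.values())
    assert abs(sum(rank.values()) - 1.0) < 1e-6

    # Top ten by rank and by degree, to compare the two orderings.
    top_by_rank = sorted(rank, key=rank.get, reverse=True)[:10]
    degree = {node: len(neighbors.get(node, ())) for node in nodes}
    top_by_degree = sorted(degree, key=degree.get, reverse=True)[:10]
    print("Top 10 by rank:  ", top_by_rank)
    print("Top 10 by degree:", top_by_degree)

The point of the two asserts is that a correct power iteration keeps the ranks summing to one at every step, so a check like this can help tell whether a rank of 30 comes from the input file itself or from somewhere in the update step.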

As always, if you have any questions or comments, please feel free to leave a comment below and I will try to get back to you as soon as possible. The same goes for requests to clarify any topics or concepts.

Thanks for reading and have a great day!

1 comment:

  1. Hi Farhan,
    Thank you for sharing your experience. I am doing almost the same thing, but I implemented the PageRank algorithm in Java on Spark, starting with 2 nodes. Did you use any tools for monitoring or measuring execution time, CPU usage, and memory usage? I am looking for a tool other than the Spark UI for monitoring a Spark application deployed in standalone cluster mode. Thank you in advance.
