Friday, February 12, 2016

Week 1 Updates

Although it is possible that it may change and evolve over the course of my research, as of now, I plan to create a program that ranks webpages according to their importance, similar to the way in which Google ranks its webpages. In my program, I will implement the Apache Spark system to process large amounts of data. 

An explanation of page ranking:
Think of a series of webpages as an interconnected web; a connected, directed graph. Each webpage is a vertex on the graph and each in-link (a directed edge) is a vote. Links from important pages count more. Each link’s vote is proportional to the importance of its source page, so a ‘vote’ from an important page is worth more. A page is important if it is pointed to by other important pages. 

If node j has n out-links, each link gets Rj/n value.

Where n is the number of in-links from a node. Node j's importance is the sum of the votes on its in-links.


How does Google rank Webpages?
It ranks webpages according to their importance. How to quantify the importance of a webpage? An important webpage should have a lot of important webpages point to it.

A Web as a Graph
Web as a directed graph. Nodes are webpages and edges are hyperlinks (Graph theory!). 
Here's an example of a small portion of a web acting as a directed graph.
Ranking Nodes on the Graph
All web pages are not equally "important"

Feel free to leave a comment or reply. Let me know if this is confusing and I will try to clarify it to you. Thanks for reading!

Source: www.mmds.org

3 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. How do you calculate the actual values of these scores? I can see some kind of approximation using iterations but maybe there's a better way like a system of equations?

    ReplyDelete
  3. There is a better way of calculating page rank, with such methods as the Power Iteration Method. Unfortunately, I am still researching and reading up on these methods, so I cannot answer properly.

    Yes, you could use a system of equations, at least for smaller examples, but it is not scalable, so you would have to use other methods for larger examples, like in real life. It would take a lot of processing power to run a system of equations for, say, 1 TB of information, so it would be more efficient to utilize other methods for faster run times.

    Hopefully that answered your question. If not, feel free to leave another comment and I'll try to answer to the best of my ability.

    ReplyDelete