
Google : The Most Resilient System for the Most Intricately Flawed Web Artifact

by wlqmfl 2022. 10. 2.
The PageRank Citation Ranking: Bringing Order to the Web

[BP] Sergey Brin and Larry Page. Google search engine.

January 29, 1998


Google
Some might think that the expression “the most intricately flawed web artifact” is too offensive. However, this is the expression that the paper uses to explain what the “web structure” is. Google, arguably the world’s best search engine, turned this artifact into something highly flexible and comfortable to use. That’s how Google started its journey ahead of all the well-known software companies.
Google was founded in 1998 by Sergey Brin and Larry Page. They expanded the company by successively launching Google News, Gmail, Google Maps and more. Eventually it became a subsidiary of a holding company called Alphabet Inc., still heading towards the future. However, before all these efforts, the Google search engine was built on the idea of a single paper, “The PageRank Citation Ranking: Bringing Order to the Web”.

Abstract, but more like a summary
The paper proposes PageRank, a method that rates the importance of each web page and ultimately uses that rating to rank pages in search results. The key to PageRank is importance, which approximates the relative human interest in and attention to each web page. Google is, or once was, a search engine built to test the utility of PageRank. Moreover, Google is a full-text search engine, not a simple title-based engine.
Note that importance is derived solely from the link structure of the web. The paper assumes that highly linked pages are more important than pages with few links. So the abstract definition of PageRank turns out to be: a page has high rank if the sum of the ranks of its backlinks is high. However, with this simple definition, two serious problems occur while iterating over the web graph: rank sinks and dangling links. The paper addresses each of them before it actually implements the calculation of PageRank.
First, consider a trap of two or more pages that form a loop by linking to each other. If this loop has no outgoing edges but has an incoming edge pointing to one of its pages, then the iteration keeps accumulating rank inside the loop without ever distributing it to the rest of the web. This is called a rank sink, and the paper introduces an “E factor” into the calculation of importance to deal with it. In the real world, humans never go through the same links over and over the way a rank sink would require. Instead, they eventually jump to a new page. From a human point of view, the E factor turns PageRank into a random surfer model: the surfer follows links for a while and jumps to a random page when bored. Moreover, the E factor can be used to customize the ranking by assigning it a different value on each page.
Second, links that point to a page with no outgoing links are called dangling links. The problem with dangling links is that it is not clear where their weight should be distributed. So during the calculation, the paper first iterates with all dangling links removed, even though there are a lot of them, and then runs some more iterations after adding them back.
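To make the rank sink and the random surfer a bit more concrete, here is a minimal sketch in Python. It is not the paper's exact algorithm: it uses the commonly seen damped formulation, where `damping` (0.85 here, an assumed value) is the probability of following a link and `jump` plays the role of the E factor. The function name `pagerank`, its parameters, and the tiny example graph are all made up for illustration. Also, instead of removing dangling links and adding them back as described above, this sketch simply hands the rank of pages without outgoing links to the jump distribution.

```python
def pagerank(links, jump=None, damping=0.85, tol=1e-10, max_iter=200):
    """Damped PageRank sketch (an illustrative variant, not the paper's code).

    links: dict mapping each page to the list of pages it links to.
    jump:  dict giving the random-jump distribution (the "E factor");
           assumed to sum to 1, uniform over all pages if omitted.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    if jump is None:
        jump = {p: 1.0 / n for p in pages}
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        new = {p: 0.0 for p in pages}
        lost = 0.0  # rank held by pages with no outgoing links
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = rank[p] / len(outs)  # split rank evenly over forward links
                for q in outs:
                    new[q] += share
            else:
                lost += rank[p]
        for p in pages:
            # follow a link with probability `damping`, otherwise jump via E
            new[p] = damping * (new[p] + lost * jump[p]) + (1 - damping) * jump[p]
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# A points into a loop B <-> C that has no outgoing edge: a rank sink.
sink = {"A": ["B"], "B": ["C"], "C": ["B"]}
for page, score in pagerank(sink).items():
    print(page, round(score, 4))
```

With `damping=1.0` (no jumps at all) the loop between B and C ends up holding all of the rank and A is left with nothing, which is exactly the rank sink problem. Passing a non-uniform `jump` dictionary is one way to try the customization idea mentioned above.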

Review
Google is a search engine that had a lot of potential, and it has only grown bigger since. It was surprising to see how they crawl the web and how well they manage the database despite its considerable size. Also, they found convergence properties suggesting that PageRank will scale very well even for extremely large collections, as the scaling factor is roughly linear in log n.
These days, as a subsidiary of Alphabet Inc., Google seems to focus only on the business prosperity of its parent company. However, as Sergey and Larry said, “Google is not a conventional company. We do not intend to become one.” I am always rooting for Alphabet Inc.’s prosperity, and as a user of all the applications it provides, I just want to say: Google and Alphabet will keep shining as long as they keep their origin, the search engine, modern.