Search engines and linear algebra

Search engines and linear algebra
July 10, 2014

Nowadays information is at hand. Just open your browser, type some words about what you want to know, and thousands of pages will arise to satisfy your information needs; and surprisingly, you will find the most relevant information in the first proposed pages. It’s not a mystery the way the search engines work to make their magic, it’s just mathematics. Behind their complex algorithms is hidden the linear algebra and other mathematical theories.

Every search engine needs three basic elements: a web crawler, a database to save the data it finds, and an algorithm to determine the order of pages returned by any search inquiry. The two first elements can be automated easily, the main problem lies on the third one, and here is where the mathematical tools are the key to the solution.

Undoubtedly one of the most successful search engines is Google, and its success is attributed, mainly, to its search algorithm, which provides the most relevant results first. The heart of this algorithm, known as PageRank, evaluates each webpage and accesses its importance relative to other web pages, where the importance of a webpage can be determined by the importance of pages linking to it. Google’s true PageRank algorithm is extremely complex, and it’s updated continuously, however, we pretend to give you the basic mathematical ideas behind it.

The basic idea

For any page P, its importance is equal to the total sum of all importance values from pages linking to P. In addition, P will impart its own importance to all pages that it links to, which value, for each page, can be computed as the quotient between the importance of page P and the total of links. For any network of pages, the importance of these pages can be represented as a matrix.

Let’s look an example. Suppose we have a theoretical network of 5 websites. We can create a graph for this network, where sites are represented as nodes, and links between pages are depicted as arrows.

network

We can see that page 1 links to page 5, page 2 links to pages 1, 3 and 4, page 3 links to page 4 and 5, page 4 links to page 3, and page 5 links to pages 1 and 2. To express this graph as a matrix, let’s assume that the importance of each page is 1 for now, and build the n-th column of the matrix using the outgoing links for the n-th page; so, the third column, for instance, will have 1/2 at cells four and five, because page 3 links to page 4 and 5, and it imparts its own importance (1) to all pages that it links to, dividing its importance into equal parts. Using this representation, the n-th row represents the incoming links for the n-th page. The matrix A is the matrix representation of the previous network.

    ┌                     ┐
    │ 0  1/3    0  0  1/2 │
    │ 0    0    0  0  1/2 │
A = │ 0  1/3    0  1    0 │
    │ 0  1/3  1/2  0    0 │
    │ 1    0  1/2  0    0 │
    └                     ┘

Thinking about our previous established concepts, we can see that the importance of a page can be computed as the sum of its row; for example, the importance of page 5 would be x1 + 1/2•x3, where x1 and x3 are the importance values for page 1 and page 3, respectively.

This lead us to a system of linear equations, where the constant vector of the system is the own vector x (the vector composed by the importance of each page), and that can be written as Ax = x. This problem is reduced to find the eigenvector with an eigenvalue of 1.

The resulting answer is:

    ┌     ┐
    │ 2/3 │
    │ 1/3 │
x = │ 2/3 │
    │ 1/3 │
    │   1 │
    └     ┘

You can see that page 5, the one corresponding to the last value of the vector, is the most important page. If we sort the pages according to their importance, the order will be: 5, 1, 3, 2, 4; and this would be the order returned by the search engine.

This implementation needs some considerations to guaranteed it always works, and some more complex ideas to support the real model, but we only pretend to give you a brief idea aimed to show real applications of Linear Algebra. If you are interesting on how PageRank works in details, there are several articles that will help you to deep on this amazing subject. To find them, just Google.

Related posts

1 Comment


Your comment

We'll never share your email with anyone else.
Comments are firstly moderated before they are made visible to everyone. Profile picture is obtained from Gravatar using your email.
Rick
Rick
May 16, 2015
Very interesting this post. Google is the max