How does a Search Engine work?

sunita.parbhu
10 min read · Mar 26, 2020

What’s behind the magic of a search engine?

I deepened my knowledge of search engines, with the goal of being able to explain how they work to someone who is curious but not in the weeds of technology — like my kids or my parents.

https://www.google.com/search/howsearchworks/

What is a search engine supposed to do?

There’s a lot of publicly available content on the internet. A lot of WEB PAGES. People need to find the ones that are relevant to answering their specific question.

  • “What is the weather in San Mateo?”, “What’s the latest on the coronavirus?”, “Easy HIIT workout” or “Boys black dress shoes”.

There’s vastness on both sides of this problem: vastness in the range of questions people ask, called QUERIES, and vastness in the number of web pages. In 2018 there were 1.7 billion websites.

How does a search engine do that?

There are three big buckets involved in getting the right pages for a query.

  1. Collect all the pages that are publicly available. This involves CRAWLING. Google crawls hundreds of billions of webpages, keeping up with pages that are new, have changed, or have become dead links.
  2. Put the pages into a format where they can be easily looked up, in readiness for when a specific query comes in. This involves INDEXING to produce a Search index. As Google says, “The index is like a library, except it contains more information than in all the world’s libraries put together.”
  3. Build a system to sort through hundreds of billions of webpages in the Search index to find the most relevant, useful results in a fraction of a second, for any query. This involves a series of ALGORITHMS.

There’s one more bucket that appears trivial compared to the computer science difficulty of the previous steps, but is hugely important in giving the user exactly what they want:

4. Giving the user exactly what they want involves presenting the information in a way that makes sense, or VISUALIZING the results. If I ask for “San Mateo weather”, the search engine could give me a long list of links to weather websites. Sure, that would be pretty useful. But giving me exactly what I want means showing the weather in San Mateo right on the search results page, not that long list of links.
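
To make these buckets concrete, here is a minimal sketch of all four as Python functions. Everything in it (the two hard-coded “pages”, the word-count scoring) is illustrative, not how a real search engine is implemented.

```python
def crawl(seed_urls):
    # 1. CRAWLING: collect publicly available pages.
    # (Pretend these two pages were fetched from the seed URLs.)
    return {
        "weather-site.example/san-mateo": "weather forecast for san mateo today",
        "fitness-site.example/hiit": "easy hiit workout at home",
    }

def build_index(pages):
    # 2. INDEXING: map each word to the set of pages that contain it.
    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def rank(index, query):
    # 3. ALGORITHMS: score pages by how many query words they contain.
    scores = {}
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] = scores.get(url, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

def present(results):
    # 4. VISUALIZING: here, just a numbered list of links.
    for position, url in enumerate(results, start=1):
        print(f"{position}. {url}")

pages = crawl(["https://example.com"])
present(rank(build_index(pages), "San Mateo weather"))
```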

Going a bit deeper on Algorithms

Multiple algorithms are involved in figuring out how to find the right content for any query. Google ranking systems sort through hundreds of billions of webpages in their Search index to find the most relevant, useful results in a fraction of a second. These ranking systems are made up of not one, but a whole series of algorithms.

There are numerous factors that are involved in bubbling up the best content for any specific query. It’s helpful to think of factors on the user side, related to the query, and on the content side, related to web pages.

— User-side factors include understanding the meaning of the user’s query itself. What is the user actually looking for? As Google says, “We build language models to try to decipher WHAT STRINGS OF WORDS (emphasis added) we should look up in the index.”

  • A system detects and corrects misspellings
  • A synonym system determines that “Change a lightbulb” and “Replace a lightbulb” are about the same thing, but not the same as “Change the brightness of my screen”.
  • A categorization process determines what broad category the query fits into: is it a question (who is the Secretary of State)? Is it looking for fresh content (what is the score in the Liverpool game)? Is it looking for broad or specific information?

None of these systems are easy to build. Google says about its synonym system: “This system took over five years to develop”.
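
To see what these steps might look like in miniature, here is a toy sketch. The vocabulary, synonym table, and question words are hand-written for illustration; real systems use learned language models rather than tiny hard-coded lists.

```python
import difflib

VOCAB = ["change", "replace", "lightbulb", "brightness", "screen"]
SYNONYMS = {"change": ["replace", "swap"], "replace": ["change", "swap"]}

def correct_spelling(word):
    # Snap a misspelled word to the closest known vocabulary word.
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else word

def expand_synonyms(words):
    # Add synonyms so "change a lightbulb" also matches "replace a lightbulb".
    expanded = set(words)
    for w in words:
        expanded.update(SYNONYMS.get(w, []))
    return expanded

def categorize(query):
    # Crude intent detection: is this a question or something else?
    question_words = ("who", "what", "when", "where", "why", "how")
    return "question" if query.split()[0] in question_words else "other"

words = [correct_spelling(w) for w in "chnage a lightbulb".split()]
print(words)                    # ['change', 'a', 'lightbulb']
print(expand_synonyms(words))   # now also includes 'replace' and 'swap'
print(categorize("who is the secretary of state"))  # 'question'
```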

— Page-side factors include understanding the relevance of the web page to the query, the trustworthiness or quality of the page, and the usability of the page.

  • Initially the page relevance system was based around matching the words in the query to what is on the page, using things like word proximity and density. If I asked for “Beyonce songs” the pages related to this would be the ones that had those words or similar words repeated multiple times.
  • It has since become much more sophisticated.
  • Relevance of the page is now also determined through a Knowledge Graph that Google maintains. This contains information about, and relationships between, real-world people, places and things (about 1B things and 50B facts about them): for example, that Beyonce is a singer, the songs she’s performed, her age, etc. Using this, algorithms can better assess whether a page is about an entity or not. If I ask for “Beyonce songs”, the pages related to this don’t need to have those exact words repeated multiple times.
  • Trustworthiness of the page is determined through PageRank, which is based on the idea of authority: authoritative content will be referred to (or linked to) by many other websites. Interestingly, by some estimates, 95% of the pages on the internet are spam or fake pages, designed simply to generate ad revenue for their owners rather than to be genuinely useful. Spam detection algorithms are used to filter these out. (A bare-bones version of PageRank is sketched below.)
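
Here is that bare-bones sketch: the classic power-iteration form of PageRank over a made-up three-page web. The link graph and the damping value are illustrative; real PageRank also handles pages with no outgoing links and runs over billions of pages.

```python
# Toy PageRank: a page is important if important pages link to it.

links = {            # who links to whom (made-up three-page web)
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}       # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:                  # each page passes its
                new_rank[target] += share            # authority to its links
        rank = new_rank
    return rank

print(pagerank(links))  # "c" ends up highest: both "a" and "b" link to it
```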

Crawling

Crawling is the discovery process. Search engines send robots (known as crawlers or spiders!) to scour the internet for new and updated content, looking over the code or content for each URL they find. Content can vary — it could be a webpage, an image, a video, a PDF, etc.
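
The shape of that discovery loop fits in a page of Python. This toy crawler uses only the standard library and skips everything a production crawler needs, such as robots.txt politeness, scheduling, deduplication of near-identical pages, and rendering JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=5):
    seen, frontier = set(), [seed]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)                 # take the next URL to visit
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                          # dead link: skip and move on
        collector = LinkCollector()
        collector.feed(html)
        # Newly discovered links go into the frontier for later visits.
        frontier.extend(urljoin(url, link) for link in collector.links)
    return seen

print(crawl("https://example.com"))
```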

Beyond Crawling: Feeds & Structured Data

While search engines started with crawling, they have since gone a step further: not all of the information that appears in Search results was gathered by crawling. There are direct ways for content owners to transmit the information on their web pages to the search engine.

Did you know that a web page has two components? One component is the information (known as data) and the other component is the formatting (how it looks visually, known as rendering). What the search engine cares about most is the information / data.

Feeds: The website sends data to Google directly, using an agreed format, as a feed.

  • Shopping — data about products is submitted by merchants via a feed
  • Transit information — data about transit information, such as schedules and routes, is submitted by agencies that provide public transportation services, like Caltrain or BART. Google tells agencies how they should format and send the feed, so it’s clear and accurate.
  • Flight status — data about flight status is supplied by a company that already gathers this information, flightstats.com
  • Weather — data about weather is supplied by a different company that already gathers this information, weather.com
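
At its simplest, a feed is just rows of agreed-upon fields. The sketch below emits a tab-separated product feed; the field names and format here are illustrative only, since each feed type (shopping, transit, flights, weather) has its own published specification.

```python
import csv
import sys

# Made-up product data: one row per product, fields agreed in advance.
products = [
    {"id": "sku-001", "title": "Boys black dress shoes",
     "price": "34.99 USD", "availability": "in stock",
     "link": "https://shoe-store.example/sku-001"},
    {"id": "sku-002", "title": "Girls red sneakers",
     "price": "29.99 USD", "availability": "out of stock",
     "link": "https://shoe-store.example/sku-002"},
]

writer = csv.DictWriter(sys.stdout, fieldnames=list(products[0]), delimiter="\t")
writer.writeheader()        # first row names the fields
writer.writerows(products)  # one tab-separated row per product
```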

Sometimes, Google even buys companies to get the relevant information. You can look up all kinds of flight planning and price information on Google Flights. This is because Google acquired a company, ITA Software, which built software that used algorithms to generate up-to-date flight information from multiple airlines, such as prices and seat availability.

Structured Data: One great thing about crawling is that all the work to find your website and understand its content is done by Google. But if the web page owner does some work to help the process, it becomes much easier for Google to get information accurately, and perhaps in more detail than crawling alone can provide. The web page owner puts in effort to “mark up” different elements on the page to let Google know what they are. Think of this as “highlighting and labelling” the various elements on the web page so that Google knows exactly what is what.

  • For example, if I have a website that lists movies, I would go through my movie listing webpage and mark up (or label) the various elements, such as movie title, release date, audience rating, director, actors, etc. This way Google’s crawling and indexing process can interpret the meaning of the page content without ambiguity or error.

Structured data is most helpful when all the websites in a particular field — say, all the websites that offer information about movies — use the same labelling scheme. Schema.org is an industry organization that creates and maintains the labelling scheme in a cooperative way.
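
For the movie example, that labelling can be done with schema.org’s Movie vocabulary. The sketch below just prints the JSON-LD snippet a site owner would embed in their page’s HTML; the movie details are invented.

```python
import json

movie = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Example Movie",
    "datePublished": "2020-03-26",
    "director": {"@type": "Person", "name": "Jane Doe"},
    "aggregateRating": {"@type": "AggregateRating",
                        "ratingValue": "8.1", "ratingCount": "1234"},
}

# Print the block a site owner would paste into their page's HTML.
print('<script type="application/ld+json">')
print(json.dumps(movie, indent=2))
print("</script>")
```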

Structured data has been implemented for numerous categories, making it much easier for users to find what they are looking for.

Visualizing

Search was initially about presenting the most relevant links on the page, often referred to as the “10 blue links”. It was a basic format and very useful.

There’s a ton being done to really answer the user’s question, as mentioned above, like showing the weather in San Mateo rather than links to weather websites.

Thought is given to how to save the user from having to click multiple times to get to their answer, how to help them understand the information more easily, how to help them act on it more quickly, and how to save their device’s bandwidth.

With the increased ability to understand the user’s query and categorize their intent, there is more innovation in how to really answer the user’s question. If a street address is typed into the search bar, the user is likely looking for directions to a certain place, or its business hours, so it makes sense to show a map, phone number and business hours all on the search results page, saving the user multiple clicks and steps to collect all that information.

Sometimes the answer to a question is several things — for example “Former US Presidents” — and these are best represented in rich lists, rather than as typical links on the page.

There are numerous examples of these visualization efforts to make the user’s experience much more frictionless, and a few of them are depicted below.

New formats to present information

Indexing and Algorithms are connected

It’s worth noting that how you store data (Indexing) and how quickly you can search through it (Algorithms) are related.

For instance, if I were to ask you to hand me something from your closet — like “a red sweater” — how quickly you could do that depends on how you stored your clothing in the first place. If it was simply by the order in which clothing was purchased, then sweaters, shirts, jeans, etc. would all be intermingled. When I ask you for a “red sweater” you are going to start at the first cubby, then look in every cubby until you find a red sweater, and you won’t even be sure there is (or isn’t) one until you’ve searched all the cubbies, one by one. Perhaps instead you decided to store according to type — jeans in one section, shirts in another, and sweaters in a third. Now, when I ask for a red sweater, you go to the sweater section immediately, and then through each of the cubbies in that section, one by one. That would be a lot more efficient. Perhaps you decided to store by color — all the red items in one section, all the blue items in another, and so on. There’s a lot you can do in choosing how to store the items to make the subsequent search more efficient. There’s an art to these decisions, and that’s the realm of computer scientists who work on implementing efficient data structures and search algorithms.
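
The closet analogy translates directly into code. Below, a linear scan checks every “cubby” one by one, while an inverted index (the organized closet) jumps straight to the right section; the three “pages” are made up.

```python
pages = {
    "page1": "red sweater wool",
    "page2": "blue jeans denim",
    "page3": "red shirt cotton",
}

# Slow way: look in every cubby, one by one.
def scan(pages, word):
    return [url for url, text in pages.items() if word in text.split()]

# Fast way: organize first (an inverted index), then look up directly.
index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

def lookup(index, *words):
    # Intersect the page sets for each word: "red" AND "sweater".
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(scan(pages, "sweater"))           # checks all three pages
print(lookup(index, "red", "sweater"))  # {'page1'} found by direct lookup
```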

Why is searching on Google different (better) than on Yahoo or Bing?

5-10 years ago you might not have cared which engine you used. Now you probably do have a preference. Try searching for something on all 3 search engines to see how different the experience is.

Some of the reasons that Google has higher quality and market share arise from a data network effect:

  • More usage means more data for training Google’s algorithms, which produce better search results, which drives more usage. It becomes a virtuous cycle.

The other network effect is around revenue.

  • More usage means more ad revenue — driven by two factors: more users means more ads served (ad volume) and more relevance means advertisers willing to pay more per ad (ad price). The revenue advantage drives investments — attracting and retaining teams of engineers and scientists, acquisitions, and primary research and development — which drives more usage. Also, a virtuous cycle.

p.s: What inspired me to write about Search?

I’ve been involved in search since 2007 when I joined a startup that built software that executed large-scale digital marketing campaigns in the largest, most competitive digital advertising industries. Our data science team used machine learning to understand intent, amongst other things.

Oddly, though, I hadn’t ever explained how a search engine works to anyone.

With time at home to do new things, I decided to dig deeper. It also connected to two personal learning tasks: it gave me the opportunity to apply material from CS50, as well as to create a new speech for my Toastmasters club.
