What is a Search Engine?

Today we launch Part I of our 3-part series:
Part I: What is a Search Engine? by Nitin Karandikar (Mon)
Part II: What is Not a Search Engine? by Kaila Colbin (Tues)
Part III: What is an Alternative Search Engine? by me (Wed)
Introduction
The Internet - and especially the Web - has revolutionized the world we live in. Leading this charge are the Search engines: first, Excite, Lycos and Hotbot, in the early days; then later, Alta Vista; and now Google and Yahoo!. These services are ubiquitous, free and easy to use, and they help us make sense of the huge, amorphous cloud of global information that’s available at our fingertips. They help us with the findability of meaningful information, which in the current state of the art means locating relevant documents, images, and media of interest.
There is a lot of interest in these Web search engines, with numerous observers globally watching and writing about their every move, every new feature, every mis-step. But what exactly is a Web Search Engine? In this post, we try to define the term in more detail.
What makes a car, a car?
Defining things is not always easy. Science and academia strive to define terms and concepts very crisply, but the end results often tend to be too dry and contrived. The human brain has an amazing capacity for recognizing patterns while tolerating ambiguity, so that we can understand and use terms without defining them completely.
Take a look at the grouping of images below; most of us would have no trouble identifying each of these objects as belonging to the set “automobile”, in spite of the dramatic variations among them.
One way to define this set is in terms of functionality and core features. An automobile provides transportation over land for individuals and small groups; in addition, I think most of us would agree that some basic systems need to be represented in any machine that calls itself a car - wheels, an engine, a fuel feed system, a throttle, a steering system and a braking system. For practical purposes, you almost certainly need gears for climbing and reversing, and a differential for turning.
Other items - like headlights and rear-view mirrors - are highly desirable, but their absence does not prevent it from being called a car.
The core features of Search
In the same way, one could define the core features that are required for any service to be considered a Web Search Engine, along with a set of highly desirable features for practical use. I’ve previously proposed an abstract architecture for Search Engines, composed of basic building blocks; here we will try to identify the requirements for a minimal search engine.
To start with, a search engine requires the availability of a web crawler - either one of its own or an external one that it leverages. The need for an index of the retrieved metadata and content (or some other form of access) is a given. At its most general, a search engine must give the user some way to provide the search specification and to access the search results. A UI for the search results is highly desirable, but not essential; an API-only web search engine is still valid.
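As a rough illustration of these building blocks, here is a minimal sketch in Python. It assumes the crawler has already run and handed back a small dictionary of pages; the URLs, page texts and function names are all made up for the example, and a real engine would add ranking, stemming and much more. It shows only the three essentials: crawled content, an index over it, and a way to submit a query and get results back (API-only, no UI).

```python
import re
from collections import defaultdict

# Hypothetical stand-in for crawler output: URL -> page text.
CRAWLED_PAGES = {
    "http://example.com/cars": "A car has wheels, an engine and a steering system",
    "http://example.com/search": "A search engine needs a crawler and an index",
    "http://example.com/model-t": "The Ford Model T had an engine but no power steering",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: term -> set of URLs containing that term."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


INDEX = build_index(CRAWLED_PAGES)
```

Calling search(INDEX, "engine steering") returns the two car pages, while a query with a term that appears nowhere returns an empty list. Swapping the dictionary for the output of an external crawler would not change the rest of the sketch, which is the point made above about leveraging someone else's crawler.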
Given below are the criteria I use to define a Web Search Engine; a web service is a Search Engine only if it satisfies all of them.
Criteria for defining a Search Engine:
1. It enhances findability of relevant web content for the user
2. It searches the entire web or a large subset thereof
(this excludes publisher search engines that search only a single site or group of sites)
3. Searches are specified using a keyword, phrase or question, or using input parameters, without the need for undue navigation
(I don’t consider pure directories like dmoz to be Search Engines)
4. It provides search results on demand, not periodically
5. It provides some kind of unique or special processing of its own: either in the search algorithm, or in UI improvements, or both
(this excludes pure Rollyo or Google Coop-based search engine subsets)
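The "all five or nothing" test above can be written down as a simple checklist. The sketch below is only an illustration: the field names are my own shorthand for the five criteria, and the judgment of whether a given service meets each one still has to be made by a person.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ServiceProfile:
    """One boolean per criterion, in the order listed above (names are hypothetical)."""
    enhances_findability: bool       # criterion 1
    covers_large_part_of_web: bool   # criterion 2
    query_driven_input: bool         # criterion 3
    results_on_demand: bool          # criterion 4
    adds_own_processing: bool        # criterion 5


def failed_criteria(profile):
    """Return the names of the criteria the service does not satisfy."""
    return [f.name for f in fields(profile) if not getattr(profile, f.name)]


def is_web_search_engine(profile):
    """A service qualifies only if it satisfies every criterion."""
    return not failed_criteria(profile)


# Illustrative profiles based on the exclusions named in the list.
GENERAL_ENGINE = ServiceProfile(True, True, True, True, True)
SINGLE_SITE_SEARCH = ServiceProfile(True, False, True, True, True)
PURE_DIRECTORY = ServiceProfile(True, True, False, True, True)
```

Here failed_criteria(SINGLE_SITE_SEARCH) reports the coverage criterion and failed_criteria(PURE_DIRECTORY) reports the query-input criterion, matching the two parenthetical exclusions in the list.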
Search Engines of the Future
The criteria described above will not remain static; as technology progresses, Search Engines will need to support increasing levels of functionality to be taken seriously.
Going back to the car analogy, would you buy a car today that does not have an electronic ignition or power-assisted steering? Yet the early versions of the Ford Model T, the Google of automobiles in its day, had neither of these available.
In the same way, web search users of the next decade will consider today’s search behavior as archaic; they will expect many more features to be included as a matter of course. Given below is a list of ones that could easily become required over time.
Essential Features for the future:
- Personalization (but without storing personal info)
- Social Input / Wisdom-of-Crowds (which has its pitfalls)
- Semantic Processing: of both the query AND the content
(will this let the Search Engine find Answers that we never knew we had?)
- Parametric Input: including freshness, source and domain-specific
- Rich content types: audio, video, images, news, blogs, …
- UI enhancements: better visualization of results
- Findability support: notifications of interest, a database of intentions
- Follow-up: results clustering and drill-down
- Repeat queries (as Greg Linden points out)
- Trusted sources: e.g. a slider to select the level of trust, from high to low
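To make two of these items concrete, here is a minimal sketch of parametric input (freshness and source) combined with a trust threshold, which is one way a trust "slider" could be wired up. Everything in it is an assumption for illustration: the Result fields, the 0.0 to 1.0 trust score and the sample data are invented, and how an engine would actually compute a trust score is left open.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    url: str
    source: str      # e.g. "news" or "blog" (hypothetical labels)
    published: date
    trust: float     # hypothetical score from 0.0 (low) to 1.0 (high)


def filter_results(results, *, today, max_age_days=None, sources=None, min_trust=0.0):
    """Keep results that satisfy every parameter the user supplied."""
    kept = []
    for r in results:
        if max_age_days is not None and (today - r.published).days > max_age_days:
            continue  # too old for the requested freshness
        if sources is not None and r.source not in sources:
            continue  # not from a requested source type
        if r.trust < min_trust:
            continue  # below the position of the trust slider
        kept.append(r)
    return kept


# Made-up sample data.
TODAY = date(2007, 7, 30)
RESULTS = [
    Result("http://example.com/a", "news", date(2007, 7, 29), 0.9),
    Result("http://example.com/b", "blog", date(2007, 7, 1), 0.4),
    Result("http://example.com/c", "news", date(2006, 1, 15), 0.7),
]
```

For example, filter_results(RESULTS, today=TODAY, max_age_days=30, sources={"news"}, min_trust=0.5) keeps only the first result. Leaving a parameter at its default simply switches that filter off.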
Which of these features are the most valuable for Web Search in the future? That is left as an exercise for the reader! If you have an opinion, vote here in our online poll and let us know!
Guest Author Nitin Karandikar has his own blog: Software Abstractions.

July 30th, 2007 at 11:57 am
To “Rich Content Types” you can add mathematics. There are many, many technical documents and research papers on the web that contain mathematical equations. These are at least as important as the words for the purpose of search. While equations have often been represented in web pages using graphical file formats like GIF, mathematical notation is more like text than pictures. More and more equations are being represented in pages using MathML, the W3C’s XML standard for math, making search possible.
Paul Topping
Design Science, Inc.
July 30th, 2007 at 3:17 pm
[…] Abstractions has just gone live! Make sure to have a look at Nitin’s excellent dissection of what constitutes search engines—he really did a thought-provoking job of proposing a definition for something we all […]
July 30th, 2007 at 6:23 pm
[…] has launched a fascinating series today. It’s a 3-part series attempting to define what is a search engine. While it’s focused on 2007, the series will also address what a search engine might look […]
July 31st, 2007 at 1:31 am
I will add that a search engine has to have its own index of the Web or build it.
July 31st, 2007 at 3:13 am
Paul: Great point, thanks for pointing that out. Mathematical symbols and formulae are indeed a different kind of rich content, as are Chemical formulae - which Charles recently wrote about on this blog (http://altsearchengines.com/2007/07/27/a-molecular-search-engine-chemxseer/). Using a specialized XML schema or microformat will allow Search Engines to parse and make semantic sense of this data, rather than treating it as an image.
Yakov: Great comment! So does a web service that builds its own innovative UI engine, but leverages an underlying web index from another search engine (say Google or Yahoo!) not count as a Search Engine itself? The authors of this series of posts have been debating exactly this point!
I submit that any service that provides *innovative* features - whether in the UI, algorithms or even data sources - should count as a real search engine. If it’s purely re-purposing an existing service [e.g. Google Co-op or Rollyo], then I would agree that it’s not a search engine in itself.
But this is surely a debatable point!
July 31st, 2007 at 3:50 am
You’re completely wrong, I don’t know why on earth you’d try to reclassify what a search engine is when we’ve known what search engines are for a long time.
A search engine is simply “an information retrieval system designed to help find information stored on a computer system” (Wikipedia).
1. It enhances findability of relevant web content for the user
- It doesn’t need to have anything to do with the web. Findability is not a word, even in italics.
2. It searches the entire web or a large subset thereof
(this excludes publisher search engines that search only a single site or group of sites)
- No search engine searches the entire web. Don’t listen to the Google PR machine so much, and again, it doesn’t need to touch the web to be a search engine. Plus you’re on AltSearchEngines here… how many verticals do you guys cover?
3. Searches are specified using a keyword, phrase or question, or using input parameters, without the need for undue navigation
(I don’t consider pure directories like dmoz to be Search Engines)
- So you’re saying you need an input to get an output? That’s genius.
4. It provides search results on demand, not periodically
- I don’t even know what the hell you’re trying to say this for. It’s still wrong. Why does it have to do as a person asks it?
5. It provides some kind of unique or special processing of its own: either in the search algorithm, or in UI improvements, or both
(this excludes pure Rollyo or Google Coop-based search engine subsets)
- This is far and away the worst thing you’ve written, you’re clearly grasping at straws. That is until you said:
The criteria described above will not remain static; as technology progresses, Search Engines will need to support increasing levels of functionality to be taken seriously.
- No, I’m afraid a search engine will always be a search engine. No matter how technology progresses, it will still be a search engine.
The article you should have written is, “What search engines should have on my holidays”.
Yakov: A search engine doesn’t need to have its own index of the web or build it. A crawler of some description is responsible for building an index - that can take many forms and is often included in the search engine software itself. If you want examples of search engines without their own index, then take a look at the recent Digg API contest for some examples.
I’m hoping Charles gives you a massive kick up the backside and stops you writing what essentially is a load of bollocks.
July 31st, 2007 at 8:12 am
[…] Search Engines is offering an interesting three-part series: Part I: What is a Search Engine? Part II: What is Not a Search Engine? Part III: What is an Alternative Search Engine? (to appear […]
July 31st, 2007 at 10:08 am
[…] Even if one can currently consider that microformats help with search engine optimization, this is not explicitly mentioned. But the future lies in semantic search, in order to refine results and answer queries better. It is an enormous challenge, since there is a wide gulf between indexing content and understanding what it says. This semantic search will apply to text as well as to audio (music, interviews, etc.) and video resources. Hakia is interested in semantic music search, while elsewhere semantic search is considered a necessary feature for the future…. […]
July 31st, 2007 at 10:12 am
1. It enhances findability of relevant web content for the user
Can we define what web content is? I would define it as a collection of texts, rich texts, photos, blogs and videos, plus one more important item: web services (email, buy, sell, download, yp, weather, map, search, etc.).
2. It searches the entire web or a large subset thereof
(this excludes publisher search engines that search only a single site or group of sites)
Search Engines can crawl texts and rich texts. But audio, photos and videos are not visible to crawlers. Web sites like Flickr and YouTube allow users to manually tag their content.
At Command Engine (http://www.commandengine.com/) we allow users to tag web services manually as commands.
In our opinion, having a search crawler is optional.
July 31st, 2007 at 12:02 pm
I think entirely too much is made in the tech community of personalized search and semantic search.
First, if you want personalized search, you’re going to have to let the search engine know something about you, otherwise the results can’t be personalized. And, since most people won’t go to the trouble of letting the search engine learn about them, or feel that would invade too much on their privacy, it’s never going to happen on a wide scale. So to put personalized search without stored data as an aspect of a future search engine seems silly. Unless the search engine will temporarily read your mind while you’re using it, it’s not going to happen.
Second, the whole thought of semantic search is great, but the truth is that people are unpredictable and will often search for things in unpredictable ways. There are times when I look at some of the search queries on our search engine Bessed and, as a human, I can’t figure out what the person was searching for. No matter how good semantic search gets, it will never have the human mind figured out. That’s not to bash the notion of semantic search making search engines better, but a lot of techies seem to see it as some Holy Grail of search, and I think they’re fooling themselves in terms of whether it can be accomplished and whether the vast majority of searchers even care about it anyway. No search engine is going to beat out Google because of superior semantic search.
July 31st, 2007 at 2:42 pm
Phill:
Thank you for your comment! You make some great points, although they’re somewhat obscured by the aggressive personal attack. Let me see if I can clarify:
The first big disconnect is that I interpret the term “Search Engine” to mean Web Search Engine. Call it artistic license - although the usage is consistent with Charles’s blog. So the article looks at the Search space in that light.
Second, as I’ve explicitly said in the article, I was setting out *not* to come up with a rigorous, academic definition of a Web Search Engine, but rather, to attempt to tease out an engaging, practical understanding of the concept - in order to provide a framework for the ongoing discussion about Search.
So I define it by looking at the features that need to be supported, as a simple test of Search Engines.
Were you disappointed because the article was too casual, not rigorous enough? For a more detailed and technical treatment of Web Search, check out these two articles: Top 17 Search Innovations on Read/WriteWeb and A Conceptual Architecture for Search on the Software Abstractions blog, respectively. Their urls are as follows:
http://www.readwriteweb.com/archives/top_17_search_innovations.php
http://blog.softwareabstractions.com/the_software_abstractions/2007/05/a_conceptual_ar.html
July 31st, 2007 at 4:19 pm
I don’t remember making an aggressive personal attack, why are you somewhat bruised?
It matters not whether you mean ‘search engine’ to be ‘web search engine’. Even a web search engine does not need to search the web. It can be web based and search something else entirely.
I’m at a loss to find anywhere you explicitly explain that you are trying to “tease out an engaging, practical understanding” as opposed to a “definition”, but I understand the two as being pretty similar no matter which adjectives you attach. Indeed I would, “try to define the term in more detail”, but you’re not able to read your own article apparently.
I was disappointed because the article was bollocks, and now I’m disappointed because you’re pointing me to urls without having done a little research. I suggest you check out my blog - it’s occasionally a bit aggressive but I’ve been told I’m like that.
July 31st, 2007 at 4:26 pm
Nice comment Adam, I have to agree with you.
I’ve developed a semantic layer for the search tech I’m currently working on and it’s pretty awesome - but by itself it’s no use at all.
Just like any other search engine it needs indexed data to work from. This is the kicker: by manipulating exactly which data you store in your semantic index it allows you to change the targeting of the semantic search.
So the semantic layer is tied into the indexing technology: it uses user history, or pre-selected fields indicating their preferred thinking/interests, to index only the data that follows their own semantic patterning. Hopefully this pulls out results that are much closer to how they search…
This then needs to be interleaved with traditional ranking methods (based on Salton’s vector space model), timeframes, even moods. Depending on whether you’re searching at work or at home, you’ll be in a different mindset, right?
The collaborative approach doesn’t rely on just the semantic index - it’s merely an extra ranking tool, perhaps in future on top of that there may be NLP, or facial recognition?
July 31st, 2007 at 5:42 pm
The future of search is about finding the most relevant content for users, and you pointed out some of the smart ways of achieving that. I think the most essential factors will be personalization and the semantic web, both making search even simpler and more relevant rather than making the search process more complicated with user/social inputs. For “advanced” users, social inputs would help with relevancy, but for the majority of users, social input is just too much work.
August 1st, 2007 at 2:19 am
In all honesty, I was expecting this to be an entertaining read about how information retrieval will be done better/differently in the future. I must admit I agree with Phill on this one (maybe not the tone); all you’ve managed to do is define what search currently is. I think it’s mildly entertaining to think about how current search engines will evolve, but I don’t think that’s going to be as relevant or interesting as how we will find information in the future. The key here is that Search is a term we’ve given to the process of finding information. We only have to Search because there is nothing better yet. I don’t think we should have to look for information; it should just be available. Highly specialised IR systems will provide us with the majority of information we need depending on the context we are in. Search engines will become less important, simply because the first relevant information we get is all we need most of the time. There are ways of doing that without ever having to search. If I get the info I need, I rarely go looking for more. I’ve linked to it once before from R/WW, but I’ll do it again…. http://dpn.name/index.php/2007/07/28/the-state-of-research/ .
August 1st, 2007 at 8:46 am
[…] what Nitin said and ask yourself, “are these all search engines?” Would Kaila say that any of them […]
August 1st, 2007 at 7:03 pm
[…] to my article and the 3 part series that we just concluded this morning on “What is a Search Engine?” (Nitin Karandikar) The other two parts covered “What is Not a Search Engine?” […]
August 5th, 2007 at 6:29 pm
Paul Midwinter said…
[You’re completely wrong, I don’t know why on earth you’d try to reclassify what a search engine is when we’ve known what search engines are for a long time.]
Amen to that, Paul. The authors of this site have no clue at all about what a search engine is. I myself have implemented a number of search engine algorithms, and there are tons of them available in the literature. New, improved versions (with lower error rates) of existing algorithms, as well as completely new ones, are also published from time to time. Search engines evolve, and so do vendors such as Google and the rest.
August 5th, 2007 at 6:46 pm
Ratu Mariappan said…
[But, audios, photos and videos are not visible to crawlers. Flicker and YouTube kind of web sites allow users to manually tag their content.]
Maybe you’re out of your depth here, Ratu. Yes, unstructured data (audio, image, video) can be searched and indexed. This domain draws largely on algorithms from Digital Signal Processing, Digital Image Processing, Computer Vision, Machine Learning & Data Mining. Research in this area is very active. There are a number of commercial vendors with products in this area, such as BMat from Spain, with their music recommendation engine. The algorithms search the binary file directly (the digital signal) rather than the annotated words (or strings) used to tag the songs; this is pattern recognition based on spectral analysis of the signal. E.g., the user could upload a sample of Michael Jackson’s song Beat It, and the engine retrieves songs that have similar tunes.
- BMat
http://www.bmat.com/music-search-bguide.html
All their peer-reviewed publications on the subject are freely available to download from here. The algorithms for developing such a system are found in those publications:
- Publications
http://mtg.upf.edu/publicacions.php
The same goes for image indexing and retrieval; just do a Google search on the topic and tons of peer-reviewed papers will be returned.
I specialise in Signal Processing, Data Mining and Machine Learning, and that is how I know these things: I have previously come across this published work in the literature.
August 5th, 2007 at 7:11 pm
Adam Jusko said…
[Unless the search engine will temporarily read your mind while you’re using it, it’s not going to happen.]
Amen. This is what the author of this site is proposing: a search engine that reads the mind. Even a 5-year-old knows that is wishful thinking. I wonder if the authors of this site are new age thinkers, proposing things that are unachievable in human life.
Adam Jusko said…
[No search engine is going to beat out Google because of superior semantic search.]
Umm, maybe Microsoft one day, and it is a big MAYBE. Google and all the other vendors (including Microsoft & Yahoo) are working on Semantic Web technology to improve search. I am impressed with the R&D work that Microsoft Asia is doing. Microsoft does publish its research in various international peer-reviewed journals, such as those of the ACM, IEEE, etc. There are very few publications from Google & Yahoo, which prefer to keep their R&D closely guarded. I read lots of research papers from Microsoft all the time and have contacted some of their authors to ask for clarification about the algorithm derivations in their papers. They are always keen to collaborate with you (or anyone, really) who wants to implement their algorithms, answering any question you fire at them. Two of those generous researchers from Microsoft are:
- Dr. Benyu Zhang
http://research.microsoft.com/users/byzhang/
- Dr. Zheng Chen
http://research.microsoft.com/~zhengc/
Check out the topics of their research. One just has to think of how Microsoft is putting a huge amount of money into R&D, and this is why I think that MAYBE Microsoft will knock off Google at some stage in the future.
Finally, I think that the authors of this website are some type of new age thinkers. They write irrelevant posts on search engines without reading peer-reviewed research papers on the subject. I recommend that the authors of this site start reading peer-reviewed articles on search engines, to familiarise themselves with the technology, before making ridiculous comments here about search engines.
August 7th, 2007 at 7:06 am
[…] of a crawler and an index. (Nitin Karandikar has an interesting post on Alt Search Engine called “What is a Search Engine” in which he lays out a search engine’s essential elements.) So a search engine without a […]
August 15th, 2007 at 1:26 am
3-Part Series: What Is a Search Engine?…
I’ve been busy and away from this blog for a while. But I’m back now! Over on Alt Search Engines, the 3-part series ran as planned. Here are the links: What is a Search Engine? by yours truly What is…
August 26th, 2007 at 3:31 pm
[…] the question of crawling the web keeps resurfacing, as it did in Nitin’s post about “What is a search engine?”. Do you have to crawl the entire web to be a “true” search engine? Are some of these […]