Beyond Words: Waiting for the new Internet Research Engines

"Words are blunt" Shakespeare wrote. I found that on the Internet--by being precise, and entering "words are blunt" because I knew it existed in that "didn't someone famous say that?" kind of way.

For that search, for that moment, Google was perfect for my purposes. Took less than four seconds from start to finish--an incredible achievement of information technology.

But words *are* a blunt tool for anything beyond a mere fact.

Blunt words are the pulp that search engines use to determine what they think you want in return for your search term. They use many other strategies as well--link analysis, linguistic analysis, keyword-to-non-keyword ratios, recency of content, and much more.

They implement strategies to go beyond blunt words.

My purpose today is to talk about the underlying strategies and limitations of a handful of search engines, so that you can more effectively judge the new engines that arrive tomorrow. First I'll talk about engines that address the Web as a whole, and after a little show and tell, I'll demonstrate some more focused, targeted engines as a way of seeing what's coming soon.

I write code, but I'm not going to be talking as a programmer. Rather, I'll look at it as a social scientist, almost as the interaction between an individual and her tribe.

The results of any search on any engine can be thought of almost as a small temporary tribe, with diverse relationships between multiple documents. It's the result of many priorities, paradigms, ethical stances, commercial purposes, familial relationships, and environmental pressures.

Take Google, for example--I think of its "result tribe" (of which my search term is a new member) in this way: "People in this tribe who are highly respected--by other generally respected people--are likely to be people I want to get to know. And someone who knows lots of respected people is also someone I'd like to know."

What made Google rock was that it was the first engine to pay deep attention to who links to whom. Google computes a "site" or "page" value in at least two ways: how many links to me from how many respected places, and how many links from me, to respected places, I provide.

There's much more massaging that's done, but that core feat established a means to be sure to show, in the top 10, both authoritative sites linked-to by many respected hubs, and the "hubs" linking to lots of respected material.

This creates deep strength for fact-based general research, at least of the "free to the world" content (which is all I'm discussing today, nothing about Lexis-Nexis and the like).
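To make the "respected people who know respected people" idea concrete, here is a minimal sketch of hub and authority scoring on a toy link graph. It is an illustration of the general technique (often called HITS), not Google's actual ranking code, and the page names and link graph are invented for the example.

```python
import math


def hubs_and_authorities(links, iterations=50):
    """Score pages as hubs and authorities.

    `links` maps each page to the list of pages it links to.
    Page names here are hypothetical; this is a sketch, not any engine's code.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page is an authority if respected hubs link to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]

        # A page is a hub if it links to respected authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}

        # Rescale both score sets so the numbers stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth


# Invented toy web: three pages that link out, two that are only linked to.
links = {
    "directory": ["health-agency", "world-body", "blog"],
    "portal": ["health-agency", "world-body"],
    "blog": ["health-agency"],
}

hub, auth = hubs_and_authorities(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print("top authority:", best_authority)
print("top hub:", best_hub)
```

In this toy graph the page everyone points to comes out as the top authority, and the page that points to the most well-regarded pages comes out as the top hub. A real engine layers a great deal of additional massaging on top of anything like this.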

General searching will always be useful, but I hope to see more "niche search engines," tailored to particular types of general content, and to see that become as important as "niche publishing" is--that 20% of publishing that is generally about ideas.

The information exploration habits of engineers vary dramatically from those in the medical or social sciences. How can research engines be developed to adapt to differing research styles?

As members of those audiences, we need to explore these tools, respond to them, and ask for the development of more of these tools, within the communities that use them--universities, organizations like mine, foundations, specialist communities, and the like.

The commercial reality is that search engines are far more likely to make some money (via clickthroughs, targeted advertising, book sales, etc.) from a Britney Spears searcher than from the science writer researching on the Web. Folks like you use up precious resources someone else could be using trying to find the all-chocolate diet.

That reality is one reason that we haven't seen many superpowered "research engines" in the public sphere.

Commercial realities also dictate a lot about the tribes a search engine can congregate. The vast majority of editorially controlled material published worldwide is not available to them. "Dynamically generated pages" often are not picked up by search engines.

But these are relatively temporary hesitations in an inevitable acceleration of capabilities.

I'm certain that we're still in the beginning stages of a much larger knowledge revolution. Evolutionary niches are beginning to develop. What fills them will enrich the gene pool of even the dominant species. That is, Google & AltaVista and others not yet born will be improved by the techniques of niche research engines.

For example, the search and discovery tools that PubMed Central applies to their data emphasize discipline-specific characteristics of biomedical publication: highly structured bibliographic information, predictable abstract structure, semi-contained categories described within journals, and the like. It's designed for biomedical researchers.

Citeseer is another example, predominantly structured for computer- and information-scientists, applying bibliographic content in creative ways for a particular kind of researcher.

But all of these apply systems which make presumptions about what matters to you. Those presumptions dictate the "temporary tribes" of material that is returned to you.

As I said, my purpose is to help clarify some of those presumptions, in the hopes that over the next year or two, that will help you judge and select the best research tools for your purposes.

So some quick samples, and some show & tell. I used the term "smallpox virus" and "smallpox vaccine" throughout these examples.

WebBrain is way-cool-looking, and I was immediately attracted to the notion of visible interconnections between ideas. But the results were a small set, and it seems that WebBrain may be troubled by overprecision and small content, likely specifically coded by human editors. My tribe metaphor might be: "This is a hand-picked culture from which we generate tribes likely to be of interest to you." For a researcher, it may be utilitarian in collecting connected ideas.

Citeseer is that example of "niche searching" in that it's mostly focused on computer science, and weights content in a variety of ways. It allows participation by submitters in the weighting of resources and external influence on "respectability." It analyzes references and recency as well, and boosts value accordingly. My tribe metaphor is: "This is a participatory democracy where we vote on which of our friends are likely to get along well with each other and with you..."

WebCrawler is an example of a "meta-engine," not unlike AskJeeves, Webdog, and others. It integrates the results of multiple search engines, and processes the results into a unified result set. The tribe might be described this way: "Those individuals respected by multiple respected tribes are likely to be people I want to meet." It's a little parasitic, but it's also a business.

Teoma is more explicit about the "hubs and target" model than Google is--that is, it shows the targets, categorizes the hubs as "Resources" and then processes key terms (either from the history of other people's searches, or from the content of the target set). The tribe model might be: "Here's a tribe of likely folks, but if they don't look right, here's some other tribes..."

These are examples of some of the ways that engines are trying to get around the bluntness of words, the limits of language, and the appalling fact of billions of diverse documents in multiple formats, all of uncertain provenance.
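The meta-engine approach mentioned above, where results from several engines are folded into one list, can be sketched in a few lines. This is not how WebCrawler or any particular meta-engine works internally; it uses a simple, well-known stand-in called reciprocal rank fusion, and the engine result lists and URLs below are made up for illustration.

```python
from collections import defaultdict


def merge_results(result_lists, k=60):
    """Merge ranked result lists with reciprocal rank fusion.

    Each list is ordered best-first. A document earns 1 / (k + rank) from
    every engine that returns it, so documents ranked well by several
    engines rise to the top. `k` is a smoothing constant (an assumption
    here; 60 is a commonly used default).
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical top results from three engines for "smallpox vaccine".
engine_a = ["a.example/smallpox", "b.example/vaccine", "c.example/history"]
engine_b = ["b.example/vaccine", "d.example/faq", "a.example/smallpox"]
engine_c = ["b.example/vaccine", "a.example/smallpox", "e.example/news"]

merged = merge_results([engine_a, engine_b, engine_c])
for position, url in enumerate(merged, start=1):
    print(position, url)
```

The pages that all three engines agree on float to the top, which is the "respected by multiple respected tribes" idea in miniature.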

More than fifteen years ago I was at a CDROM trade show, and a search engine salesman (an engine for a CDROM, that is) gave me a great explanation. If I search the sports pages of the NYTimes for the last year for "baseball," I'll get only 30% of the articles, because most baseball stories never use the word. Instead, it's something like "The Twins and the Yankees split a doubleheader yesterday in a rain-delayed bruiser full of steals and strikes."

Add to that problem the differences between PDF, html, xml, txt, xhtml, postscript, etc., and it's very difficult to fully engage all the relevant material in the information universe.
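The baseball example is easy to reproduce. The sketch below compares a literal keyword match against a match on a small related vocabulary. The vocabulary is hand-picked purely for illustration, and the threshold of two matching words is an arbitrary assumption; it is not a description of the salesman's engine or of any real one.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def literal_match(article, term):
    """True only if the exact search term appears as a word in the article."""
    return term.lower() in tokens(article)


def vocabulary_match(article, vocabulary, min_hits=2):
    """True if the article uses at least `min_hits` words from the topic vocabulary."""
    return len(tokens(article) & vocabulary) >= min_hits


article = (
    "The Twins and the Yankees split a doubleheader yesterday "
    "in a rain-delayed bruiser full of steals and strikes."
)

# Hand-picked illustrative vocabulary (an assumption, not a real lexicon).
BASEBALL_VOCABULARY = {
    "baseball", "doubleheader", "inning", "innings",
    "steals", "strikes", "pitcher", "homer",
}

found_by_keyword = literal_match(article, "baseball")
found_by_vocabulary = vocabulary_match(article, BASEBALL_VOCABULARY)
print("literal 'baseball' search finds it:", found_by_keyword)
print("vocabulary search finds it:", found_by_vocabulary)
```

The literal search misses the article entirely, while the vocabulary search catches it on "doubleheader," "steals," and "strikes." Building and maintaining such vocabularies for every topic is, of course, where the real work lies.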

Within a small universe, however, different things are possible. For the last part of this talk I will discuss smaller, more controlled data sets, because it's there, I think, that we can look for what will increasingly be possible in the larger information environment.

For the past five years I've been the Director of Publishing Technologies at the National Academies Press. Working with a budget underwritten completely by book sales, we have made over half a million book pages searchable, browsable, and printable online. If you haven't visited our site, you should do so. We gave away 50 million book pages and millions of other PDF and HTML pages last year. You can search all the books, any single book, or any single chapter. Gripe about the page image if you want, but it's free and successful and continuing to improve--by summer we'll have replaced most page images with text.

The NAP is a controlled, predictable data set, and so we could put in some useful, deep searching. Because we have all the books in one consistent format, we can code to those expectations. For example, we're able to provide reporters like you a different kind of access than the average reader, and can produce special research-engine applications for users like you.

Let me show you the National Academies News Gateway--an opt-in service that can provide a reporter with a broad array of special services.

That's one example of something that can be done with a predictable set of content. Because I know the structure of the file naming conventions, the database records, and the like, I'm able to write scripts which run through those files and do something to them. Because we've developed some linguistic analysis tools, we can apply those tools to this particular (and highly useful) purpose, based on the predictable structures underlying the content.

But let's take this one step further. About a week ago we made available the first fully working version of a National Academies-wide "discovery engine." It was put up for the Executive Council and presidents of the Academies to see, before making it more public. This is the first public presentation of this discovery engine.

This is only possible, as I said, because we know about the underlying content--what the urls signify, where a cached copy of the content resides, what characteristics it has, etc.

That's a mountain of data, but think of the information needs of a webwide search!

But we're supposed to have 5GHz processors later this year. Storage is ever cheaper. Parallel processing systems are being developed. It seems to me a sure thing that more and more of these sorts of Research Engines will be arriving. They'll have their own biases, and generate their own "tribes," but they will also nudge the more general search engines to improve their "research capabilities," and that will help you, just like these other things can.

Words are blunt indeed, but they can lead to cracking the more complex process of discovery and research.

Hope this has been useful. Thanks.