Beyond Words: Waiting for the New Internet Research Engines
see links page for demonstrations
"Words are blunt" Shakespeare wrote.
I found that on the Internet--by being precise, and entering "words
are blunt" because I knew it existed in that "didn't someone famous
say that?" kind of way.
For that search, for that moment, Google was perfect for
my purposes. It took less than four seconds from start to finish--an
incredible achievement of information technology.
But words *are* a blunt tool for anything beyond a mere fact.
Blunt words are the pulp that search engines chew on to determine what
they think you want in return for your search term.
They use many other strategies as well--link
analysis, linguistic analysis, keyword-to-non-keyword ratios, recency
of content, and much more--all attempts to go beyond blunt words.
My purpose today is to talk about the underlying strategies and
limitations of a handful of search engines, so that you can more
effectively judge the new engines that arrive tomorrow. First I'll
talk about engines that address the Web as a whole, and after a little
show and tell, I'll demonstrate some more
focused, targeted engines as a way of seeing what's coming soon.
I write code, but I'm not going to be talking as a programmer.
Rather, I'll look at it as a social scientist, almost as the
interaction between an individual and her tribe.
The results of any search on any engine can be thought of
almost as a small temporary tribe, with diverse relationships between
multiple documents. It's the result of many priorities, paradigms,
ethical stances, commercial purposes, familial relationships, and
environmental pressures.
Take Google, for example--I think of its "result tribe" (of which my
search term is a new member) in this way:
"People in this tribe who are highly respected--by other generally
respected people--are likely to be
people I want to get to know. And someone who knows lots of respected
people is also someone I'd like to know."
What made Google rock was that it was the first engine to pay deep attention
to who links to whom. Google computes a "site" or "page" value in at
least two ways: how many respected places link to me, and how many
links I provide to respected places.
There's much more massaging that's done, but that core feat established a
means to show, in the top 10, both the authoritative sites
linked to by many respected hubs and the "hubs" linking to
lots of respected material.
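Here's a minimal sketch of that hubs-and-authorities idea, in the
spirit of Kleinberg's HITS algorithm. Google's actual computation is
proprietary and far more elaborate; the page names and link structure
below are invented for illustration:

    # Toy hubs-and-authorities scoring (HITS flavor, not Google's code).
    # `links` maps each page to the pages it links to.
    links = {
        "hubpage": ["authority1", "authority2"],
        "authority1": ["authority2"],
        "authority2": [],
    }

    pages = list(links)
    hub = {p: 1.0 for p in pages}   # how many respected places I point to
    auth = {p: 1.0 for p in pages}  # how many respected places point to me

    for _ in range(20):  # iterate until the scores settle down
        # a page's authority is the sum of the hub scores pointing at it
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # a page's hub score is the sum of the authority scores it points to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # normalize so the numbers don't run away
        for d in (auth, hub):
            total = sum(d.values()) or 1.0
            for p in d:
                d[p] /= total

    print(sorted(pages, key=auth.get, reverse=True))  # authorities first

After a few passes, pages pointed at by strong hubs surface as
authorities, and pages pointing at strong authorities surface as hubs.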
This creates deep strength for fact-based general research, at least
of the "free to the world" content (which is all I'm discussing today,
nothing about Lexis-Nexis and the like).
General searching will always be useful, but I hope to see more "niche
search engines," tailored to particular types of content, and to
see them become as important as "niche publishing" is--that 20% of publishing
that is generally about ideas.
The information exploration habits of engineers differ dramatically from
those of researchers in the medical or social sciences. How can research
engines be developed to adapt to differing research styles?
As members of those audiences, we need to explore these tools,
respond to them, and ask for the development of more of these tools, within
the communities that use them--universities, organizations like mine,
foundations, specialist communities, and the like.
The commercial reality is that search engines are far more likely to
make some money (via
clickthroughs, targeted advertising, book sales, etc.) from a Britney
Spears searcher than from the science writer researching on the Web. Folks
like you use up precious resources someone else could be using to
find the all-chocolate diet.
That reality is one reason we haven't seen many superpowered "research
engines" in the public sphere.
Commercial realities also dictate a lot about the tribes a search engine
can convene. The vast majority of editorially controlled material
published worldwide is not available to them. "Dynamically generated
pages" often are not picked up by search engines.
But these are relatively temporary hesitations in an inevitable
acceleration of capabilities.
I'm certain that we're still in the beginning stages
of a much larger knowledge revolution. Evolutionary niches are
beginning to develop. What fills them will enrich the gene pool of even
the dominant species. That is, Google, AltaVista, and others not yet
born will be improved by the techniques of niche research engines.
For example, the search and discovery tools that PubMed Central applies
to its data emphasize discipline-specific characteristics of
biomedical publication:
highly structured bibliographic information, predictable abstract
structure, semi-contained categories described within journals, and
the like. It's designed for biomedical researchers.
Citeseer (citeseer.nj.nec.com) is another example, predominantly
structured for computer and information scientists, applying
bibliographic content in creative ways for a particular kind of researcher.
But all of these systems make presumptions about what matters
to you. Those presumptions dictate the "temporary tribes" of material
that are returned to you.
As I said, my purpose is to help clarify some of those
presumptions, in the hope that doing so will help you judge and select
the best research tools for your purposes over the next year or two.
So some quick samples, and some show & tell. I used the terms "smallpox
virus" and "smallpox vaccine" throughout these examples.
(most of the following done ad lib):
WebBrain is way-cool-looking, and I was immediately attracted to the
notion of visible interconnections between ideas. But the results were
a small set, and it seems that WebBrain may be troubled by
overprecision and a small content base, likely hand-coded by human
editors. My tribe metaphor might be:
"This is a hand-picked culture from which we generate tribes likely to
be of interest to you." For a researcher, it may be useful for
collecting connected ideas.
Citeseer (citeseer.nj.nec.com) is that example of "niche searching" in
that it's mostly focused on computer science, and weights content in a
variety of ways. It allows participation by submitters in the weighting of
resources, and external influence on "respectability." It analyzes
references and recency as well, and boosts value accordingly. My
tribe metaphor is: "This is a
participatory democracy where we vote on which of our friends are likely
to get along well with each other and with you..."
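To give the flavor of that kind of weighting--this is not Citeseer's
actual formula, which is more sophisticated; the weight and the example
numbers are invented:

    # Hypothetical scoring: reward citation count, tempered by age so
    # recent work isn't buried under older classics.
    def score(citations, year, current_year=2002, recency_weight=0.1):
        age = max(current_year - year, 0)
        return citations / (1.0 + recency_weight * age)

    # An older, heavily cited paper vs. a newer, lightly cited one:
    print(score(200, 1998))  # 200 / 1.4 ~= 142.9
    print(score(60, 2001))   # 60 / 1.1 ~= 54.5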
WebCrawler (www.webcrawler.com) is an example of a "meta-engine," not
unlike AskJeeves, Webdog, and others. It integrates the results of
multiple search engines, and processes them into a unified
result set. The tribe might be described this way:
"Those individuals respected by multiple respected tribes
are likely to be people I want to meet." It's a little parasitic, but
it's also a business.
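A bare-bones sketch of how a meta-engine might fuse ranked lists,
assuming a simple reciprocal-rank credit scheme (the engine names and
URLs are invented; real meta-engines also dedupe, re-fetch, and
re-score):

    # Each engine returns a ranked list; a URL earns credit from every
    # engine that lists it, weighted by its rank there.
    results = {
        "engineA": ["url1", "url2", "url3"],
        "engineB": ["url2", "url1", "url4"],
        "engineC": ["url2", "url5"],
    }

    scores = {}
    for ranking in results.values():
        for rank, url in enumerate(ranking, start=1):
            # reciprocal-rank credit: first place counts most
            scores[url] = scores.get(url, 0.0) + 1.0 / rank

    merged = sorted(scores, key=scores.get, reverse=True)
    print(merged)  # url2 first: respected by multiple tribes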
Teoma (www.teoma.com) is more explicit about the "hubs and targets"
model than Google is--that is, it shows the targets, categorizes the hubs
as "Resources," and then processes key terms (either from the history
of other people's searches, or from the content of the target set).
The tribe model might be:
"Here's a tribe of likely folks, but if they don't look right, here
are some other tribes..."
These are examples of some of the ways that engines are trying to get
around the bluntness of words, the limits of language, and the
appalling fact of billions of diverse documents in multiple formats,
all of uncertain provenance.
More than fifteen years ago I was at a CDROM trade show, and a
search engine salesman (an engine for a CDROM, that is) gave me a
great explanation. If I search the sports pages of the NYTimes for the
last year for "baseball," I'll get only 30% of the articles. The rest
never use the word; instead, they read something like "The Twins and
the Yankees split a doubleheader yesterday in a rain-delayed bruiser
full of steals and strikes."
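The salesman's point, in a few lines of Python--a literal keyword match
misses the article that never uses the word (the articles here are
invented examples):

    articles = [
        "Baseball season opens with record attendance.",
        "The Twins and the Yankees split a doubleheader yesterday "
        "in a rain-delayed bruiser full of steals and strikes.",
    ]

    # naive keyword search: only articles containing the literal term
    hits = [a for a in articles if "baseball" in a.lower()]
    print(len(hits), "of", len(articles))  # 1 of 2 -- the second is invisible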
Add to that problem the differences between PDF, HTML, XML, TXT,
XHTML, PostScript, etc., and it's very difficult to fully engage all
the relevant material in the information universe.
Within a small universe, however, different things are possible. For
the last part of this talk I will discuss smaller, more controlled data
sets, because it's there, I think, that we can look for what will
increasingly be possible in the larger information environment.
For the past five years I've been the Director of Publishing Technologies
at the National Academies Press. Working with a budget underwritten
completely by book sales, we have made over half a million book pages
searchable, browsable, and printable online. If you haven't gone to
www.nap.edu, you should do so. We gave away 50 million book pages and
millions of other PDF and HTML pages last year. You can search all the
books, any single book, or any single chapter. Gripe about the page
image if you want, but it's free and successful and continuing to
improve--by summer we'll have replaced most page images with text.
The NAP is a controlled, predictable data set, and so we could put in some
useful, deep searching. Because we have all the books in one
consistent format, we can code to those expectations. For example,
we're able to provide reporters like you a different kind of access
than the average reader, and can produce special research-engine
applications for users like you.
Let me show you the National Academies News Gateway--an opt-in service
that can provide a reporter with a broad array of special services.
log in, note the topical info, etc.
demo the Reference Finder
That's one example of something that can be done with a predictable
set of content. Because I know the structure of the file naming
conventions, the database records, and the like, I'm able to write
scripts which run through those files and do something to them.
Because we've developed some linguistic analysis tools, we can apply
those tools to this particular (and highly useful) purpose, based on
the predictable structures underlying the content.
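To sketch the kind of script a predictable layout makes possible--the
naming convention here is hypothetical (our real conventions differ),
but the point stands: when every file matches a known pattern, a few
lines can walk the whole collection:

    import re
    from pathlib import Path

    # Hypothetical convention: book12_ch03.html holds chapter 3 of book 12.
    CHAPTER = re.compile(r"book(\d+)_ch(\d+)\.html$")

    def index_chapters(root):
        """Map (book, chapter) pairs to the files found under `root`."""
        index = {}
        for path in Path(root).rglob("*.html"):
            m = CHAPTER.search(path.name)
            if m:
                index[(int(m.group(1)), int(m.group(2)))] = path
        return index

    # e.g. index_chapters("/var/books")[(12, 3)] -> the file for book 12, ch. 3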
But let's take this one step further. About a week ago
we made available the first fully working version of
a National Academies-wide "discovery engine."
It was put up for the Executive Council and presidents of the
Academies to see, before making it more public. This is the first
public presentation of this discovery engine.
Demo of discovery engine
This is only possible, as I said, because we know about the underlying
content--what the urls signify, where a cached copy of the content
resides, what characteristics it has, etc.
That's a mountain of data, but think of the information needs of a
Web-wide search!
But we're supposed to have 5GHz processors later this year. Storage is
ever cheaper. Parallel processing systems are being developed.
It seems to me a sure thing that more and more of these sorts of
Research Engines will be arriving. They'll have their own biases, and
generate their own "tribes," but they will also nudge the more general
search engines to improve their "research capabilities," and that will
help you, just as these niche tools already can.
Words are blunt indeed, but they can still crack open
the more complex process of discovery and research.
Hope this has been useful. Thanks.