| A tool that could search the book would be a nice tool in its own
right. There would be many cases, though, where searching the book
would be a bit awkward. For example, if I did a SEARCH BOOK "FILE",
the tool might return 4 zillion instances where the word "file" was
used.
|
| A search-the-text function could be a very powerful tool, but I
suspect that it needs at least the following features:
1) the ability to find only those places where at least `m' of a list
of `n' terms or phrases occur `sufficiently near' to each other; without
this it seems likely that the signal-to-noise ratio will frequently be
too low for the search to be useful.
2) the ability to efficiently retry a search with modified criteria,
typically for use when a search finds too many places (e.g.,
requiring that more of the items on the list be found in one
place, or adding additional items to the list). This implies
that the search function must announce how many places it has found
rather than just starting to display them. It would also seem
desirable to retain the original (or all intermediate?) search
results, so that one can efficiently return to them and/or try
additional variations.
3) an automatic `thesaurus' function, so that instead of searching
only for the exact terms that the user specified the search will
find all `sufficiently closely related' terms (this relationship
may need to be decided independently for each book). This should
include terms which don't even appear in the book, if they are ones
that users are likely to try to search for; it should certainly
include spelling variations. Without this it may be excessively
difficult to determine the `correct' search specification to
find any given bit of information.
4) in order to do this sort of search efficiently enough to be useful
for real-time navigation through a book, it is probably necessary
to maintain an index for each possible term to all of the sections
of the book in which it occurs (`section' here meaning the minimum
size chunk of the book that the search function will resolve --
which is most likely also used to define `sufficiently near'
in #1). This probably requires setting up standard lists of
`uninteresting' words (a, an, the, ...) and of related words
(disk, disc, winchester, floppy, ...) which can be modified by
directives in the source of each book, and manually setting up
those directives (or modifying the standard lists) based on the
actual usage in each book. It may also be necessary to flag
particular sections of a book as specifically matching or not
matching a given term when a literal check of the text of that
section would give a misleading result.
All of which looks like a real pain for the implementor and the
writer to deal with, but the results could be well worth it.
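
To make the four points a little more concrete, here is a rough sketch of how they might fit together. It is only an illustration under my own assumptions, not a design for any real tool: the names STOP_WORDS, RELATED, normalize, build_index and search are all made up, the word lists are toy examples, and `sufficiently near' is simply taken to mean `in the same section', as suggested in #4.

```python
from collections import defaultdict
import re

# Hypothetical standard lists; each book would adjust these via directives.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is"}
RELATED = {"disc": "disk", "winchester": "disk", "floppy": "disk"}

def normalize(word):
    """Lowercase a word and map it to its canonical related term, if any."""
    word = word.lower()
    return RELATED.get(word, word)

def build_index(sections):
    """Map each interesting term to the set of section numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(sections):
        for word in re.findall(r"[A-Za-z']+", text):
            term = normalize(word)
            if term not in STOP_WORDS:
                index[term].add(number)
    return index

def search(index, terms, m):
    """Return the sorted section numbers holding at least m of the terms."""
    wanted = {normalize(t) for t in terms} - STOP_WORDS
    hits = defaultdict(int)
    for term in wanted:
        for number in index.get(term, ()):
            hits[number] += 1
    return sorted(n for n, count in hits.items() if count >= m)

# Toy "book" of four sections (invented text, for illustration only).
sections = [
    "Open the file before reading it.",
    "Format the floppy disc, then copy the file to it.",
    "A Winchester drive holds more than a floppy.",
    "Close the file when done.",
]
index = build_index(sections)

history = []                                 # retained results, as in #2
history.append(search(index, ["file"], 1))   # broad search: too many places
print(len(history[-1]), "places found")      # announce the count first
history.append(search(index, ["file", "disk"], 2))  # retry, stricter criteria
print(len(history[-1]), "places found")
```

Because the index is built once, retrying with a larger `m' or a longer term list only costs a few set lookups, which is what makes the retry loop in #2 practical. The per-section override flags mentioned at the end of #4 are not shown here.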
|