Topic: problems with information retrieval


Problems of information retrieval include scale, relevance, the tradeoff between precision and recall, archiving information, and junk knowledge. (cbb 11/07)
Subtopic: measuring retrieval effectiveness

Quote: determine recall by exhaustive search, preidentified relevant documents, random sample in a relevant domain [»blaiDC_1990]
Quote: can use Zipf's law to determine retrieval system effectiveness; want Zipfian rank:frequency for context and subject description usage [»blaiDC_1990]
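A rank:frequency check of the kind the Zipf's law quote alludes to can be sketched as follows. This is a minimal illustration, not the procedure from the cited source: the function names and the usage counts are hypothetical, and the source does not say how close to Zipfian a distribution must be to count as effective. The sketch only fits the slope of log(frequency) against log(rank); a slope near -1 is the classic Zipfian shape.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent term."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_slope(pairs):
    """Least-squares slope of log(frequency) versus log(rank).

    Needs at least two distinct ranks; a value near -1 suggests Zipfian usage.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical descriptor usage log: term at rank r is used 120 / r times.
usage = []
for rank, count in enumerate([120, 60, 40, 30, 24], start=1):
    usage.extend([f"term{rank}"] * count)

slope = zipf_slope(rank_frequency(usage))
print(round(slope, 3))
```

In practice the input would be a log of which subject or context descriptors were actually assigned or searched, which is an assumption about how such a check would be applied.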

Subtopic: precision/recall tradeoff

Quote: high precision if narrow query, low precision if wide query [»saltG7_1986]
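The tradeoff can be made concrete with a small sketch. The collection, index terms, and relevance judgments below are invented for illustration; only the definitions of precision (relevant retrieved / retrieved) and recall (relevant retrieved / relevant) are standard.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection: document id -> set of index terms.
DOCS = {
    1: {"brake", "failure", "train"},
    2: {"brake", "test"},
    3: {"train", "schedule"},
    4: {"brake", "failure", "report"},
    5: {"signal", "failure"},
    6: {"train", "brake"},
}
RELEVANT = {1, 4, 5}  # assumed relevance judgments for one information need


def search_all(terms):
    """Narrow query: a document must contain every term (AND)."""
    return {doc for doc, index in DOCS.items() if set(terms) <= index}


def search_any(terms):
    """Wide query: a document may contain any of the terms (OR)."""
    return {doc for doc, index in DOCS.items() if set(terms) & index}


narrow = precision_recall(search_all(["brake", "failure"]), RELEVANT)
wide = precision_recall(search_any(["brake", "failure", "train"]), RELEVANT)
print("narrow query: precision=%.2f recall=%.2f" % narrow)
print("wide query:   precision=%.2f recall=%.2f" % wide)
```

In this toy data the narrow query returns only relevant documents but misses one of them, while the wide query finds every relevant document at the cost of returning the whole collection.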

Subtopic: problem of relevance

Quote: only the seeker of information is sure of what he or she is looking for; the push concept of information delivery only sells ads [»dvorJC3_1997]
Quote: only the user can judge what is relevant to his or her own need [»greeR9_1995]
Quote: relevance is being potentially helpful to a user in the resolution of a need [»greeR9_1995]
Quote: a document may be rejected as irrelevant, even though it would help resolve the need at hand

Subtopic: problem of scale

Quote: as a document retrieval system becomes larger, queries require intersecting terms to satisfy the futility point [»blaiDC_1990]
Quote: with pervasive networking will want access to a trillion resources; use agents
Quote: compare describing someone to meet at an airport gate vs. someone attending a baseball game; like information retrieval [»blaiDC_1990]
Quote: pre-tests of STAIRS were successful because of small-scale databases [»blaiDC3_1985]
Quote: vector space model does not work well on the web; returns short documents [»brinS4_1998]
Quote: the fallacy of abundance: in a large information retrieval system, it is hard to write reasonable queries that do not retrieve at least some relevant documents [»blaiDC1_1996]
Quote: the fallacy of abundance: the ease of retrieving information about a subject creates an illusion that little remains hidden [»swanDR10_1960]
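One reason a vector space ranking can favor short documents is length normalization: with cosine similarity, a document that is little more than the query has almost nothing else in its vector to dilute the match. The sketch below assumes raw term counts as weights (no idf) and uses two invented documents; real systems weight terms differently, so treat it only as an illustration of the effect.

```python
import math
from collections import Counter


def cosine(query_tokens, doc_tokens):
    """Cosine similarity between raw term-count vectors of a query and a document."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[term] * d[term] for term in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


query = "train brake".split()

# Hypothetical documents: both mention each query term exactly once.
docs = {
    "short": "train brake notes".split(),
    "long": (
        "the train arrived late because a brake inspection delayed "
        "departure from the downtown station this morning"
    ).split(),
}

scores = {name: cosine(query, tokens) for name, tokens in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Both documents match the query equally often, yet the three-word document ranks first because its vector is much shorter.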

Subtopic: brute fact vs. meaning

Quote: information retrieval is based on the brute facts of documents; it cannot capture the meaning of a document

Subtopic: data vs. document retrieval

Quote: in data retrieval, queries and data descriptions are fairly precise; simple matching is sufficient [»blaiDC1_1996]
Quote: in document retrieval, queries and data descriptions are imprecise; especially for documents with certain intellectual content

Subtopic: location/application vs. keyword/pattern

Quote: users overwhelmingly prefer to locate files by location or application rather than by keyword or filename pattern [»barrD7_1995]
Quote: people prefer location-based searching over keywords and filenames; better reminding; place files where they will be seen [»barrD7_1995]

Subtopic: problem of recall

Quote: the implicit assumption of simple full-text retrieval systems is that we recall words and phrases in a document exactly; but psychologists have shown that memory is inexact [»blaiDC1_1996]

Subtopic: problem of manipulation

Quote: problem of manipulating search engines for profit; e.g., metadata is easily abused since it is invisible [»brinS4_1998]

Subtopic: problem of junk

Quote: if an organization keeps all of its documents, searchers must wade through irrelevant information to find important documents; this noise degrades search performance [»blaiDC1_1996]

Subtopic: problems with archiving information

Quote: since an individual's total information needs are large and complex, his Smalltalk system will also be large and complex
Quote: designers need help in keeping track of notes and recalling them appropriately; requires understanding of the design process and the problem [»soloE5_1984]
Quote: out-of-control users of electronic mail are archivers: they read mail often, read and often file everything, and keep a large inbox, yet can't find messages [»mackWE10_1988]
Quote: archiver-type users of electronic mail try to read and file everything; many distribution lists; problems with finding old mail [»mackWE10_1988]
Quote: personal computer files are ephemeral, working, or archival; users archived little information

Subtopic: problem of change

Quote: a fully successful, manual index would imply that knowledge can be organized by an immutable and unambiguous indexing scheme; but, knowledge and language change [»swanDR10_1960]

Subtopic: multiple index terms -- anchor vs. qualifiers

Quote: search queries for STAIRS may have four or five intersecting terms; performed poorly [»blaiDC3_1985]
Quote: inquirers will tend to fix an anchor set of terms and add additional ones; since they can't judge the anchor set, they blame the added terms [»blaiDC_1990]
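The effect of intersecting terms can be sketched with a synthetic collection. Everything here is an assumption for illustration: the term assignments, the relevance rule, and the reading of the "futility point" as the largest result set a searcher is willing to browse. The sketch adds ANDed terms until the result set is small enough and records recall at each step.

```python
def build_collection(n_docs=1000):
    """Synthetic collection: term membership and relevance follow simple divisibility rules."""
    docs = {}
    for i in range(1, n_docs + 1):
        terms = set()
        if i % 2 == 0:
            terms.add("t1")  # anchor term, present in every relevant document
        if i % 3 == 0:
            terms.add("t2")  # qualifier
        if i % 7 == 0:
            terms.add("t3")  # qualifier
        docs[i] = terms
    relevant = {i for i in docs if i % 10 == 0}
    return docs, relevant


def narrow_until_manageable(docs, relevant, terms, futility_point):
    """AND terms in order until the result set is at most futility_point documents.

    Returns a list of (terms_used, documents_retrieved, recall) tuples.
    """
    history = []
    for k in range(1, len(terms) + 1):
        query = set(terms[:k])
        retrieved = {doc for doc, index in docs.items() if query <= index}
        recall = len(retrieved & relevant) / len(relevant)
        history.append((k, len(retrieved), recall))
        if len(retrieved) <= futility_point:
            break
    return history


docs, relevant = build_collection()
history = narrow_until_manageable(docs, relevant, ["t1", "t2", "t3"], futility_point=30)
for used, size, recall in history:
    print(f"{used} term(s): {size} documents, recall {recall:.2f}")
```

In this synthetic data the result set only becomes browsable once three terms are intersected, and by then most relevant documents have been excluded. The searcher sees a short, plausible-looking list and has no direct evidence of what was lost.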

Subtopic: STAIRS study of information retrieval

Quote: twenty percent recall during evaluation of STAIRS full-text document-retrieval system [»blaiDC3_1985]
Quote: the STAIRS study used interactive retrieval; searchers could revise their queries until they believed that they had retrieved all of the documents they wanted [»blaiDC1_1996]
Quote: the STAIRS study used the lawyers and paralegals who selected the 40,000 documents in the collection; like a personal document collection [»blaiDC1_1996]
Quote: in STAIRS, users believed they were retrieving 75% instead of actual 20% [»blaiDC3_1985]
Quote: searches in the first half of the STAIRS study had the same mean level of success as searches in the second half; evidence that searchers were operating at the best of their ability [»blaiDC1_1996]
Quote: lawyers very surprised at low recall rate for STAIRS [»blaiDC3_1985]
Quote: in STAIRS many search terms would retrieve ten thousand documents [»blaiDC3_1985]
Quote: the STAIRS database concerned San Francisco's BART system [»blaiDC1_1996]
Quote: the lawsuit between San Francisco and BART contractors was settled before the STAIRS evaluation

Subtopic: studies of information retrieval

Quote: all known indexing procedures produce relatively mediocre results [»saltG4_1970]
Quote: tested retrieval with questions that used words from text and/or headings, or neither [»eganDE5_1989, OK]
Quote: in Viewdata study, half of questions answered incorrectly and search strategy often failed [»graySH2_1989]
Quote: early comparison of full-text, computerized search with a manual index [»swanDR10_1960]
Quote: in a 100 article collection, manual and full-text search retrieved less than half of the relevant documents on average
Quote: full-text search retrieved more relevant documents than a manual index

Subtopic: estimating recall

Quote: a candidate set is formed by negating one or more query terms; the STAIRS study estimated recall by sampling the candidate sets; usually small enough and rich enough in unretrieved, relevant documents to sample confidently [»blaiDC1_1996]
Quote: recall studies of large document retrieval systems depended on the persistence of the evaluators and where they looked for unretrieved, relevant documents [»blaiDC1_1996]
Quote: in a large collection, the percentage of unretrieved, relevant documents is too low to sample with confidence
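Sampling candidate sets to estimate recall might look like the sketch below. It is a simplified reading of the idea, not the STAIRS protocol itself: it negates exactly one query term at a time, samples each resulting candidate set, and extrapolates the number of unretrieved, relevant documents. The function names, the `is_relevant` judgment callback, and the toy data are all hypothetical.

```python
import random


def candidate_sets(docs, query_terms):
    """For each query term, the documents matching every other term but not that one."""
    sets = {}
    for negated in query_terms:
        others = set(query_terms) - {negated}
        sets[negated] = {
            doc for doc, index in docs.items()
            if others <= index and negated not in index
        }
    return sets


def estimate_recall(docs, query_terms, is_relevant, sample_size, rng):
    """Estimate recall of an AND query by sampling its single-negation candidate sets."""
    query = set(query_terms)
    retrieved = {doc for doc, index in docs.items() if query <= index}
    found = sum(1 for doc in retrieved if is_relevant(doc))

    missed_estimate = 0.0
    for candidates in candidate_sets(docs, query_terms).values():
        if not candidates:
            continue
        sample = rng.sample(sorted(candidates), min(sample_size, len(candidates)))
        rate = sum(1 for doc in sample if is_relevant(doc)) / len(sample)
        missed_estimate += rate * len(candidates)  # extrapolate to the whole set

    total = found + missed_estimate
    return found / total if total else 0.0


# Hypothetical toy data: document id -> index terms, plus assumed judgments.
docs = {
    1: {"a", "b", "c"}, 2: {"a", "b", "c"}, 3: {"a", "b"}, 4: {"a", "c"},
    5: {"b", "c"}, 6: {"a"}, 7: {"a", "b"},
}
relevant = {1, 3, 4, 6}

estimate = estimate_recall(
    docs, ["a", "b", "c"], lambda doc: doc in relevant,
    sample_size=10, rng=random.Random(0),
)
true_recall = 1 / len(relevant)  # the query retrieves only documents 1 and 2
print(round(estimate, 3), true_recall)
```

Relevant documents that fall outside every candidate set (document 6 here) are never sampled, so an estimate built this way tends to overstate recall. That matches the caution in the quotes above that results depend on where evaluators look for unretrieved, relevant documents.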

Subtopic: directed search vs. boolean queries

Quote: scan and select did as well as boolean queries when searching an electronic encyclopedia [»marcG1_1988]

Subtopic: implicit structure

Quote: with named relationships, one cannot follow paths implicitly defined by the data [»kentW_1978]

Subtopic: automatic indexing

Quote: automatic indexing must input and verify twenty times as much data as manual indexing

Related Topics

Group: meaning and truth   (18 topics, 634 quotes)
Group: problems with hypertext   (7 topics, 98 quotes)

Topic: comparing paper to electronic access to information (35 items)
Topic: loosely structured data (20 items)
Topic: private language argument for skepticism about meaning (34 items)
Topic: problem of assigning names (25 items)
Topic: problem of classifying information (42 items)
Topic: problem of information overload (23 items)
Topic: problem of screen size (12 items)
Topic: problems with reading hypertext (9 items)
Topic: searching the Web (53 items)
Topic: skepticism about knowledge (34 items)

Updated barberCB 3/06
Copyright © 2002-2008 by C. Bradford Barber. All rights reserved.
Thesa is a trademark of C. Bradford Barber.