Topic: approximate string matching and pattern matching with errors

topics > computer science > programming > Group: patterns

information retrieval

hash filter
pattern matching
search algorithms
searching compressed data
signature files
spelling errors
string operations
suffix trie and suffix array


Text contains spelling errors and users make spelling errors in entering a query. Genetic sequence alignment is another application of approximate string matching. Similar sequences indicate a common ancestor.

Here are several algorithms for approximate string matching. (cbb 4/98)

Subtopic: approximate matching up

Quote: simple, fast algorithm for approximate string matching; special cases k-break periodic strings; uses suffix tree; short, overlapping text partitions; pattern at least 5k^3 where k is the maximum edit distance [»coleR9_2002]
Quote: survey of approximate string matching; focus on edit distance; includes performance tests [»navaG3_2001]
Quote: Glimpse features a small index and approximate matching; divide data into blocks, index the blocks, full text search of possible matches [»manbU10_1993]
Quote: about approximate matching of strings, e.g., due to spelling mistakes [»hallPA12_1980]
Quote: fast algorithms for approximate string matching with hashing [»duMW8_1994]
Quote: simple, efficient algorithm for pattern matching with k mismatches; based on shift-add and character skip; O(nk) [»elmaN6_1996]
Quote: use trie methods for k-approximate string matching; 4 times faster than agrep for k=1, otherwise slower [»shanH8_1996]
Quote: use dynamic suffix arrays to detect, on-line, the occurrences of a string subject to incremental insertions and deletions [»ferrP1_1995]
Quote: perform approximate matching on words in index, then use inverted index to locate data [»manbU10_1993]
Quote: approximate matching with k errors by exact matching on k+1 parts and search the resulting matches [»manbU4_1997]
Quote: use trie methods for k-approximate string matching; 4 times faster than agrep for k=1, otherwise slower [»shanH8_1996]
Quote: improved filtration algorithm for approximate string matching with moderate error levels; faster than agrep [»navaG1_1999]

Subtopic: approximate range up

Quote: optimal, approximate range query using skip quadtrees; produces a convex or non-convex k-fat region [»eppsD6_2005]

Subtopic: compressed text up

Quote: first practical algorithm for approximate pattern matching over Ziv-Lempel compressed text; multipattern search; faster than decompressing plus searching [»navaG3_2001a]

Subtopic: n-grams, proper names, membership up

Quote: gives approximate membership tester that does not need an independent sequence of hash functions [»cartL5_1978]
Quote: Synoname matches personal names by heuristics derived from correct matches of personal names [»grosAD4_1991]
Quote: n-grams of text streams are efficient, robust, and capture statistically significant phrases [»coheJD12_1998]

Subtopic: misspelling, and transcription errors up

Quote: the number of distinct words in a text is nearly linear in the size of the text; e.g., misspellings [»moffA3_1994]
Quote: the largest source of errors in nucleotide sequence data is transcription errors during publication

Related Topics up

Group: information retrieval   (25 topics, 674 quotes)

Topic: hash filter (18 items)
Topic: pattern matching (42 items)
Topic: search algorithms (40 items)
Topic: searching compressed data (9 items)
Topic: signature files (21 items)
Topic: spelling errors (18 items)
Topic: string operations (20 items)
Topic: suffix trie and suffix array
(20 items)

Updated barberCB 3/04
Copyright © 2002-2008 by C. Bradford Barber. All rights reserved.
Thesa is a trademark of C. Bradford Barber.