An efficient algorithm for pattern discovery in large text databases

Dan Li, Kefei Wang, Jitender S. Deogun, Ruben O. Donis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we present novel text mining algorithms that are useful for pattern discovery in large gene sequence databases. Our approach allows us to work with a small subset of all possible patterns, thus improving space and time complexity. We call this algorithm Generating All Frequent Patterns, GAFP. Representative subword association rules are introduced to express associations between subword patterns and user-specified target conditions. A rule is of the form P ⇒ C, where P is a subword association pattern in the form of (α1, α2, ⋯, αk, d), and C is a target condition. Pattern (α1, α2, ⋯, αk, d) is called a k-subword association pattern, where αi are subwords from input text sequences and d is the distance constraint, which specifies the maximum distance between two subwords adjacent in the pattern. GAFP presents an efficient approach for computing frequent patterns that optimize the rule confidence.
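
The rule form P ⇒ C described in the abstract can be illustrated with a small sketch. This is not GAFP itself (the abstract does not give the algorithm); it only shows how a single k-subword association pattern might be matched and how the confidence of its rule could be computed. The function names, the toy data, and the reading of d as the maximum number of characters between the end of one subword and the start of the next are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def matches(sequence: str, subwords: Sequence[str], d: int) -> bool:
    """Check whether subwords occur in order in `sequence`, with at most d
    characters between the end of one subword and the start of the next.
    (This gap definition is an assumption, not taken from the paper.)"""

    def search(idx: int, start_min: int, start_max: int) -> bool:
        if idx == len(subwords):
            return True
        word = subwords[idx]
        pos = sequence.find(word, start_min)
        # Try every occurrence of the current subword inside the allowed window.
        while pos != -1 and pos <= start_max:
            end = pos + len(word)
            if search(idx + 1, end, end + d):
                return True
            pos = sequence.find(word, pos + 1)
        return False

    # The first subword may start anywhere in the sequence.
    return search(0, 0, len(sequence))


def support_and_confidence(
    labelled: List[Tuple[str, bool]], subwords: Sequence[str], d: int
) -> Tuple[float, float]:
    """Support: fraction of sequences matching P.
    Confidence of P => C: fraction of matching sequences that satisfy C."""
    if not labelled:
        return 0.0, 0.0
    hits = [satisfies_c for seq, satisfies_c in labelled if matches(seq, subwords, d)]
    support = len(hits) / len(labelled)
    confidence = sum(hits) / len(hits) if hits else 0.0
    return support, confidence


# Hypothetical toy data: (sequence, whether it satisfies target condition C).
DATA = [
    ("ACGTTGCA", True),
    ("ACGAAAATGCA", False),
    ("TTTTACGTGCA", True),
    ("GGGGGGGG", False),
]

if __name__ == "__main__":
    # 2-subword pattern (ACG, TGC, d=1)
    print(support_and_confidence(DATA, ("ACG", "TGC"), 1))
```

Under these assumptions, loosening d admits more matching sequences, which can raise support while lowering confidence; a frequent-pattern miner would have to weigh the two.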

Original language: English (US)
Title of host publication: Proceedings of the International Conference on Information and Knowledge Engineering 2003
Editors: N. Goharian, N. Goharian
Pages: 96-102
Number of pages: 7
State: Published - Dec 1 2003
Event: Proceedings of the International Conference on Information and Knowledge Engineering 2003 - Las Vegas, NV, United States
Duration: Jun 23 2003 → Jun 26 2003

Publication series

Name: Proceedings of the International Conference on Information and Knowledge Engineering
Volume: 1

Conference

Conference: Proceedings of the International Conference on Information and Knowledge Engineering 2003
Country: United States
City: Las Vegas, NV
Period: 6/23/03 → 6/26/03

Fingerprint

Association rules
Genes
confidence
time

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Li, D., Wang, K., Deogun, J. S., & Donis, R. O. (2003). An efficient algorithm for pattern discovery in large text databases. In N. Goharian, & N. Goharian (Eds.), Proceedings of the International Conference on Information and Knowledge Engineering 2003 (pp. 96-102). (Proceedings of the International Conference on Information and Knowledge Engineering; Vol. 1).

An efficient algorithm for pattern discovery in large text databases. / Li, Dan; Wang, Kefei; Deogun, Jitender S.; Donis, Ruben O.

Proceedings of the International Conference on Information and Knowledge Engineering 2003. ed. / N. Goharian; N. Goharian. 2003. p. 96-102 (Proceedings of the International Conference on Information and Knowledge Engineering; Vol. 1).

Li, D, Wang, K, Deogun, JS & Donis, RO 2003, An efficient algorithm for pattern discovery in large text databases. in N Goharian & N Goharian (eds), Proceedings of the International Conference on Information and Knowledge Engineering 2003. Proceedings of the International Conference on Information and Knowledge Engineering, vol. 1, pp. 96-102, Proceedings of the International Conference on Information and Knowledge Engineering 2003, Las Vegas, NV, United States, 6/23/03.
Li D, Wang K, Deogun JS, Donis RO. An efficient algorithm for pattern discovery in large text databases. In Goharian N, Goharian N, editors, Proceedings of the International Conference on Information and Knowledge Engineering 2003. 2003. p. 96-102. (Proceedings of the International Conference on Information and Knowledge Engineering).
Li, Dan ; Wang, Kefei ; Deogun, Jitender S. ; Donis, Ruben O. / An efficient algorithm for pattern discovery in large text databases. Proceedings of the International Conference on Information and Knowledge Engineering 2003. editor / N. Goharian ; N. Goharian. 2003. pp. 96-102 (Proceedings of the International Conference on Information and Knowledge Engineering).
@inproceedings{dde2ffbb2bc640ed891e52a3607b6128,
title = "An efficient algorithm for pattern discovery in large text databases",
abstract = "In this paper, we present novel text mining algorithms that are useful for pattern discovery in large gene sequence databases. Our approach allows us to work with a small subset of all possible patterns, thus improving space and time complexity. We call this algorithm Generating All Frequent Patterns, GAFP. Representative subword association rules are introduced to express associations between subword patterns and user-specified target conditions. A rule is of the form P ⇒ C, where P is a subword association pattern in the form of (α1, α2, ⋯, αk, d), and C is a target condition. Pattern (α1, α2, ⋯, αk, d) is called a k-subword association pattern, where αi are subwords from input text sequences and d is the distance constraint, which specifies the maximum distance between two subwords adjacent in the pattern. GAFP presents an efficient approach for computing frequent patterns that optimize the rule confidence.",
author = "Dan Li and Kefei Wang and Deogun, {Jitender S.} and Donis, {Ruben O.}",
year = "2003",
month = "12",
day = "1",
language = "English (US)",
isbn = "1932415076",
series = "Proceedings of the International Conference on Information and Knowledge Engineering",
pages = "96--102",
editor = "N. Goharian and N. Goharian",
booktitle = "Proceedings of the International Conference on Information and Knowledge Engineering 2003",

}

TY  - GEN
T1  - An efficient algorithm for pattern discovery in large text databases
AU  - Li, Dan
AU  - Wang, Kefei
AU  - Deogun, Jitender S.
AU  - Donis, Ruben O.
PY  - 2003/12/1
Y1  - 2003/12/1
AB  - In this paper, we present novel text mining algorithms that are useful for pattern discovery in large gene sequence databases. Our approach allows us to work with a small subset of all possible patterns, thus improving space and time complexity. We call this algorithm Generating All Frequent Patterns, GAFP. Representative subword association rules are introduced to express associations between subword patterns and user-specified target conditions. A rule is of the form P ⇒ C, where P is a subword association pattern in the form of (α1, α2, ⋯, αk, d), and C is a target condition. Pattern (α1, α2, ⋯, αk, d) is called a k-subword association pattern, where αi are subwords from input text sequences and d is the distance constraint, which specifies the maximum distance between two subwords adjacent in the pattern. GAFP presents an efficient approach for computing frequent patterns that optimize the rule confidence.
UR  - http://www.scopus.com/inward/record.url?scp=1642337790&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=1642337790&partnerID=8YFLogxK
M3  - Conference contribution
AN  - SCOPUS:1642337790
SN  - 1932415076
T3  - Proceedings of the International Conference on Information and Knowledge Engineering
SP  - 96
EP  - 102
BT  - Proceedings of the International Conference on Information and Knowledge Engineering 2003
A2  - Goharian, N.
A2  - Goharian, N.
ER  -