A Digital Filter Model for Data Mining of Text Documents

Jeffrey Alan Goldman
Computer Science
1998

Carlo Zaniolo
Edward Stabler
Co-Chair D. Stott Parker
Co-Chair Wesley W. Chu

Abstract

With the proliferation of data, whether in the form of records, contingency tables, objects, images, audio, video, or free-form text, it is important to extract meaningful knowledge. The field of Knowledge Discovery in Databases is a new science that addresses this problem. It attempts to answer the cry "we are drowning in data but starving for knowledge." While much of the attention has focused on traditional databases, only a scattering of research addresses Data Mining of free-form text documents.

This dissertation introduces a new model and architecture, the Digital Filter, to add a new medium to the field. The Digital Filter uses ideas from Data Mining, Information Retrieval, and Computational Linguistics in a cogent theory of Text Mining. When documents meet the conditions that make them amenable to knowledge discovery and to the Digital Filter approach, the methodology exploits their inherent information and word distributions to advance hypotheses that may lead to knowledge. The Digital Filter is capable of finding anomalous words, phrases, or other linguistic structures within a given context. It is also able to detect unusual distributions of objects. The overall knowledge discovery architecture of the Digital Filter uses an iterative process with a facilitator in the loop.

In the dissertation, we present results from the Digital Filter on real-world text data collections. Some of the discoveries come from collections of newsletters, journal case reports, data from free-text fields in relational databases, court decisions, and movie reviews. We also explore two cases in which the Digital Filter led to a significant discovery. The first was a collection of thoracic lung cancer patient data and the second was a year of earthquake activity reports. Moreover, the Digital Filter is a general model applicable to problems in Information Retrieval such as indexing and routing, as well as problems in Stylometrics such as authorship and plagiarism detection.