A Digital Filter Model for Data Mining of Text Documents 
Jeffrey Alan Goldman 
Computer Science 
1998 

Carlo Zaniolo
Edward Stabler
Co-Chair D. Stott Parker
Co-Chair Wesley W. Chu


Abstract
  
With the proliferation of data, whether in the form of records, 
contingency tables, objects, images, audio, video, or free-form text, 
it is important to extract meaningful knowledge.
The field of Knowledge Discovery in Databases is a new
science that addresses this problem.  It attempts to answer the
cry "we are drowning in data but starving for knowledge."
While much of the attention has focused on traditional databases,
only a scattering of research addresses Data Mining of free-form 
text documents.  This dissertation introduces a new
model and architecture, the Digital Filter, to add a new medium
to the field.
The Digital Filter combines ideas from Data Mining, Information Retrieval,
and Computational Linguistics into a cogent theory of Text Mining.
The methodology of the Digital Filter exploits the inherent information
and word distributions of text documents
to advance hypotheses that may lead to knowledge,
provided the documents meet conditions that make them
amenable to knowledge discovery and to the Digital Filter approach.
The Digital Filter is capable of finding anomalous words, phrases, or
other linguistic structures within a given context.  It is also 
able to detect unusual distributions of objects.
The overall knowledge discovery architecture of the Digital
Filter uses an iterative process with a
facilitator in the loop. 
In the dissertation, we present results
from the Digital Filter on real-world text data collections.
Some of the discoveries come from collections of newsletters, journal
case reports, data from free-text fields in relational databases,
court decisions, and movie reviews.
We also explore two cases in which the Digital Filter
led to a significant discovery.
The first involved a collection of thoracic lung cancer patient
data, and the second a year of earthquake activity reports.
Moreover, the Digital Filter is a general model applicable to problems
in Information Retrieval, such as indexing and routing, as well as problems
in Stylometrics, such as authorship and plagiarism detection.