Tonia Durfee
Appalachian State University
Date and Time: Friday, November 30, 2007 @ 2:00 PM
Place: 126 Tyler Hall
Abstract:
Web 2.0, with its wikis and blogs, allows many users to produce, store,
share, manipulate and retrieve many types, shapes and sizes of data.
Previous generations of digitalization and the read-only Internet allowed
fast access to mostly structured numeric or alphanumeric data types.
Structured data came from many transaction-based or measurement systems
aimed at creating and providing accurate access to facts. As opposed to
structured data, which typically resides in tightly controlled
applications, unstructured data such as text and video either does not
have a specific structure or has a structure that is not easily readable
and interpretable by a computer. Unstructured data is an appealing and
natural way to convey messages among people. Storage systems and
computer processing power enable the creation and storage of massive
quantities of both structured and unstructured data. This data holds a
tremendous potential for analysis and knowledge sharing. Computer
scientists and statisticians have promoted this as a computer's ability
to discover previously unknown patterns that can be useful for
particular purposes. As long as a business or a scientific unit
accumulates enough data, they are promised the ability to discover
"golden nuggets" of information to provide sustainable competitive
advantage and answers to challenging problems. This discovery process has
been branded as knowledge discovery from databases, data mining, or
predictive analytics. Data discovery builds on pattern recognition,
association, classification, and prediction techniques applied to
various data types: structured data streams such as stock market time
series or biological sequences, multirelational data such as graphs and
social networks, and spatial and multimedia data such as text, audio and
video.
Text is the most popular vehicle of modern communication, and its
semantic structure is open to a multitude of interpretations. The
meaning of any textual document depends on a context and the
comprehension ability of a reader. Structural principles exist in the
formation of words, in the creation of grammatical sentences, and in the
representation of meaning. The authors and readers of the text often
represent the same semantics using different words or describe different
meanings using words that have various meanings. Morphology and syntax
form a foundation for modern information retrieval systems such as
document management systems, automatic thesauruses, and search engines
based on keywords, co-occurrences, indexes or meta text properties, such
as author, subject, type, word count, printed page count, and time last
written. Keyword matching approaches, while powerful, remain bound to
word counts, dictionary composition or word choice. Automatic tagging,
keyword-based association analysis, and clustering should be employed
for multi-level analysis of the complexity of syntactic constructions and
the multiplicity of interpretations discussed above. Text mining
technologies aim to discover previously unknown golden nuggets
in large volumes of text that reside freely on the Internet, in
corporate intranets, online libraries and emails. The user of text mining
technology is furnished with an ability to automatically categorize,
prioritize, and compare documents, and to understand and utilize the meaning of
any particular document without browsing, reading and analyzing an
entire document collection. Text mining pattern recognition is built on
the recognition of specific word choices, grammatical constructions and other
stylistic characteristics that are inherent to any given author.
As a result, text mining is widely used for authorship attribution,
deception and plagiarism detection, summarization, and content comparison
and visualization. Text mining has its roots in computational linguistics,
natural language processing, content analysis, cognitive psychology,
information retrieval, machine learning, statistics, and information and
library sciences.
The focus of my talk will be an overview of the main techniques and
cutting-edge approaches that have been developed for text mining, a
presentation of sample text mining applications and a discussion of the
limitations of text mining.