DELVE - Data Exploration, Learning, and Visualization Environment
This is a proposal for an Academic Equipment Grant from Sun Microsystems,
Inc. There are two sections to this proposal:
- Introduction to Datamining
- Proposed Work
Introduction to Datamining
Datamining is the extraction of interesting information
from large repositories of data. Such data can be generated from
various sources such as retail sales transactions in supermarkets, service
center call logs, web access records, weather measurements, etc.
Prior to the advent of datamining, such data was primarily retained for
record keeping. Statistical analyses of such records were largely
perfunctory and usually yielded limited information about the structural
aspects of the domain that lie beneath the data. Logs were eventually
purged from disks onto long-term storage devices such as tapes, or the
data was summarized and the bulk discarded. However, with the creation of
new tools that
draw upon resources from machine learning (artificial intelligence), statistics
and databases, administrators can now discover deep rules that underlie
the observational records of their business enterprise. The automated
discovery of patterns in data can then be used to improve the efficiency,
efficacy and profitability of the venture.
Irrespective of the context of the application, decision makers can
use datamining tools to make at least three different (but related) kinds
of inferences [1]:
- Associations: These are rules that group objects together based on some
relation between them. Consider the following scenario: a grocery store
manager, whose database records individual purchase transactions, is
interested in the buying patterns of her customers. Association rules
would tell her that the purchase of certain items at her store is usually
accompanied (and hence associated) with the purchase of other items.
While some of these may be mere common sense (salsa sales are usually
associated with the sale of corn chips), others may be surprising (people
who buy diapers tend to also buy beer). Such rules can help with product
placement, advertising, pricing, and so on.
- Predictions: Prediction (or classification) rules are used to categorize
data into disjoint groups. Consider the situation at a bank that makes
loans. Based on the performance of loans made in the past, it may create
profiles for safe and for risky loan seekers. When a new customer applies
for a loan, the bank has to make a decision about the creditworthiness of
the applicant based on the information made available to it. By matching
the profile of this customer against prior customer profiles, the bank
might classify the loan application for approval or for rejection.
- Sequences: Association rules and prediction rules usually involve
intra-record data; that is, they do not concern themselves with the order
of the data, and the time between transactions is not considered a factor.
There are, however, situations where the occurrence of a prior event or
transaction influences the occurrence of subsequent transactions. Sequence
problems are those in which such temporal connections need to be uncovered.
Such situations frequently occur in the stock market, or in domains with
seasonal variance.
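To make the association-rule idea concrete, here is a minimal sketch, in Java (the language we propose to use for our tools), of how a rule such as "diapers implies beer" could be scored against a set of transactions. The transaction data, item names, and thresholds are illustrative assumptions, not drawn from any real dataset:

```java
import java.util.List;
import java.util.Set;

// Sketch: scoring the association rule "diapers -> beer" with the two
// standard measures, support and confidence. All data here is invented.
public class AssociationSketch {

    // Number of transactions containing every item in the given itemset.
    static long count(List<Set<String>> transactions, Set<String> itemset) {
        return transactions.stream().filter(t -> t.containsAll(itemset)).count();
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("diapers", "beer", "chips"),
                Set.of("diapers", "beer"),
                Set.of("salsa", "chips"),
                Set.of("diapers", "milk"),
                Set.of("beer", "chips"));

        // Support: fraction of all transactions containing both items.
        double support = count(transactions, Set.of("diapers", "beer"))
                / (double) transactions.size();
        // Confidence: of the transactions with diapers, how many also have beer.
        double confidence = count(transactions, Set.of("diapers", "beer"))
                / (double) count(transactions, Set.of("diapers"));

        System.out.printf("support=%.2f confidence=%.2f%n", support, confidence);
    }
}
```

Practical mining algorithms avoid scoring every candidate itemset by pruning on a minimum-support threshold; this sketch only shows the two measures involved.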
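The loan-screening scenario can likewise be sketched as matching an applicant against the nearer of two averaged profiles. The features (income in thousands, debt ratio) and every number below are hypothetical:

```java
// Sketch: classify a loan applicant by comparing against averaged "safe"
// and "risky" profiles. Features and figures are invented for illustration.
public class LoanSketch {

    // Euclidean distance between an applicant and a profile
    // over the feature pair (income, debt ratio).
    static double distance(double[] a, double[] b) {
        double di = a[0] - b[0], dd = a[1] - b[1];
        return Math.sqrt(di * di + dd * dd);
    }

    // Approve when the applicant is at least as close to the safe profile.
    static boolean approve(double[] applicant, double[] safe, double[] risky) {
        return distance(applicant, safe) <= distance(applicant, risky);
    }

    public static void main(String[] args) {
        double[] safeProfile  = {70.0, 0.2};  // avg income (k$), avg debt ratio
        double[] riskyProfile = {30.0, 0.6};
        double[] applicant    = {65.0, 0.3};
        System.out.println(approve(applicant, safeProfile, riskyProfile)
                ? "approve" : "reject");
    }
}
```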
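Finally, a simple sequence question ("is event A usually followed by event B within a short window?") can be sketched as a single scan over an event log; the event names and window size below are assumptions for illustration:

```java
import java.util.List;

// Sketch: measure how often event 'first' is followed by event 'second'
// within the next 'window' events. Event names are illustrative.
public class SequenceSketch {

    // Fraction of occurrences of 'first' that are followed by 'second'
    // within the next 'window' events.
    static double followRate(List<String> events, String first,
                             String second, int window) {
        int firsts = 0, followed = 0;
        for (int i = 0; i < events.size(); i++) {
            if (!events.get(i).equals(first)) continue;
            firsts++;
            for (int j = i + 1; j < events.size() && j <= i + window; j++) {
                if (events.get(j).equals(second)) { followed++; break; }
            }
        }
        return firsts == 0 ? 0.0 : followed / (double) firsts;
    }

    public static void main(String[] args) {
        List<String> events = List.of("A", "B", "C", "A", "C", "B", "A", "B");
        System.out.println(followRate(events, "A", "B", 2));
    }
}
```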
Datamining draws upon techniques initially developed in the machine learning
and statistics communities. These methods, however precise, suffer from one
serious limitation: they cannot handle large quantities of data. Datamining
alleviates this performance issue by adding the database perspective, where
performance has long been a central concern.
Proposed Work
We propose to create new tools/methods
and modify existing techniques for datamining. In particular we aim
to:
- Create scalable and flexible mining algorithms that can incorporate
existing knowledge about the enterprise: We will draw upon our background
in machine learning to enhance existing algorithms for datamining. By
using rich knowledge representations such as branching programs, we expect
to contribute innovative algorithms to the datamining community.
- Develop cross-platform visualization tools using Java: A key aspect of
datamining is the visual representation of both the raw data and the
discovered rules. The computationally intensive mining process will run on
servers, while lightweight Java clients will display the results.
-
Promote XML as a standard vehicle
for data representation: One of the hindrances to the development
of datamining is the absence of a well-accepted standard for the representation
of data. Recently, XML (eXtensible Markup Language) was suggested
as a universal standard for data representation in datamining applications
[2].
We intend to go one step further as use XML as a representation for mined
information as well. We will use Java-based tools to browse, edit
and manipulate XML-formatted data.
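As a sketch of what XML-formatted mined information might look like, the fragment below encodes one association rule and parses it back with the JDK's standard DOM parser. The element and attribute names (rule, antecedent, consequent, confidence) are our own invention for illustration, not an established schema:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch: one mined association rule serialized in a hypothetical XML
// schema, then parsed back with the JDK's DOM parser.
public class XmlRuleSketch {

    // Parse one XML-encoded rule and render it as "antecedent -> consequent".
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            Element rule = doc.getDocumentElement();
            return rule.getElementsByTagName("antecedent").item(0).getTextContent()
                    + " -> "
                    + rule.getElementsByTagName("consequent").item(0).getTextContent()
                    + " (confidence " + rule.getAttribute("confidence") + ")";
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical encoding of a mined rule; the schema is illustrative.
        String xml = "<rule support=\"0.40\" confidence=\"0.67\">"
                + "<antecedent>diapers</antecedent>"
                + "<consequent>beer</consequent>"
                + "</rule>";
        System.out.println(describe(xml));  // diapers -> beer (confidence 0.67)
    }
}
```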
The tools we develop will be applicable in a wide variety of domains. We,
however, anticipate testing our tools on three principal classes of data
sources: student data from academia, transactional data from retail vendors,
and demographic data from geographic information systems.
This project will involve graduate students in the departments
of the two co-investigators. The source code for the project will be open
to all, and results from the project will be disseminated in conferences
and over the Internet. We will also host any extensions made by others
to the toolkits developed in this project.
References
[1] R. Agrawal, T. Imielinski, A. Swami: "Database
Mining: A Performance Perspective", IEEE Transactions on Knowledge and
Data Engineering, Special issue on Learning and Discovery in Knowledge-Based
Databases, Vol. 5, No. 6, December 1993, 914-925.
[2] R. Agrawal: "Data Mining: Crossing the Chasm". Invited talk at the
5th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining
(KDD-99), San Diego, California, August 1999.