DELVE - Data Exploration, Learning, and Visualization Environment
This is a proposal for an Academic Equipment Grant from Sun Microsystems,
Inc. There are two sections to this proposal:
- Introduction to Datamining
- Proposed Work
Introduction to Datamining
Datamining is the extraction of interesting information
from large repositories of data. Such data can be generated from
various sources such as retail sales transactions in supermarkets, service
center call logs, web access records, weather measurements, etc.
Prior to the advent of datamining, such data was primarily retained for
record keeping. Statistical analyses of such records were largely
perfunctory and usually yielded limited information about the structural
aspects of the domain that lie beneath the data. Logs were eventually
purged from disks onto long-term storage devices such as tapes, or the
data was summarized and the bulk discarded. However, with the creation of
new tools that
draw upon resources from machine learning (artificial intelligence), statistics
and databases, administrators can now discover deep rules that underlie
the observational records of their business enterprise. The automated
discovery of patterns in data can then be used to improve the efficiency,
efficacy and profitability of the venture.
Irrespective of the context of the application, decision makers can
use datamining tools to make at least three different (but related) kinds
of inferences [1]:
- Associations: These are rules that group objects together based on some
relation between them. Consider the following scenario: a grocery store
manager, whose database records individual purchase transactions, is
interested in the buying patterns of her customers. Association rules
would tell her that the purchase of certain items at her store is usually
accompanied (and hence associated) with the purchase of other items.
While some of these may be mere common sense (salsa sales are usually
associated with the sale of corn chips), others may be surprising (people
who buy diapers tend to also buy beer). Such rules can help with product
placement, advertising, pricing, and so on.
- Predictions: Prediction (or classification) rules are used to categorize
data into disjoint groups. Consider the situation at a bank that makes
loans. Based on the performance of loans made in the past, it may create
profiles for safe and for risky loan seekers. When a new customer applies
for a loan, the bank has to make a decision about the creditworthiness of
the applicant based on the information made available to it. By matching
the profile of this customer against prior customer profiles, the bank
might classify the loan application for approval or for rejection.
- Sequences: Association rules and prediction rules usually involve
intra-record data; that is, they do not concern themselves with the order
of the data, and the time between transactions is not considered a factor.
There are, however, situations where the occurrence of a prior event or
transaction influences the occurrence of subsequent transactions. Sequence
problems are those in which such temporal connections need to be uncovered.
Such situations frequently occur in the stock market, or in domains with
seasonal variance.
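To make the association-rule idea concrete, here is a minimal sketch, in Java (the language we propose to use for our tools), of how a rule such as "diapers implies beer" could be scored against a set of transactions. The transaction data, item names, and thresholds are illustrative assumptions, not drawn from any real dataset:

```java
import java.util.List;
import java.util.Set;

// Sketch: scoring the association rule "diapers -> beer" with the two
// standard measures, support and confidence. All data here is invented.
public class AssociationSketch {

    // Number of transactions containing every item in the given itemset.
    static long count(List<Set<String>> transactions, Set<String> itemset) {
        return transactions.stream().filter(t -> t.containsAll(itemset)).count();
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("diapers", "beer", "chips"),
                Set.of("diapers", "beer"),
                Set.of("salsa", "chips"),
                Set.of("diapers", "milk"),
                Set.of("beer", "chips"));

        // Support: fraction of all transactions containing both items.
        double support = count(transactions, Set.of("diapers", "beer"))
                / (double) transactions.size();
        // Confidence: of the transactions with diapers, how many also have beer.
        double confidence = count(transactions, Set.of("diapers", "beer"))
                / (double) count(transactions, Set.of("diapers"));

        System.out.printf("support=%.2f confidence=%.2f%n", support, confidence);
    }
}
```

Practical mining algorithms avoid scoring every candidate itemset by pruning on a minimum-support threshold; this sketch only shows the two measures involved.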
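The loan-screening scenario can likewise be sketched as matching an applicant against the nearer of two averaged profiles. The features (income in thousands, debt ratio) and every number below are hypothetical:

```java
// Sketch: classify a loan applicant by comparing against averaged "safe"
// and "risky" profiles. Features and figures are invented for illustration.
public class LoanSketch {

    // Euclidean distance between an applicant and a profile
    // over the feature pair (income, debt ratio).
    static double distance(double[] a, double[] b) {
        double di = a[0] - b[0], dd = a[1] - b[1];
        return Math.sqrt(di * di + dd * dd);
    }

    // Approve when the applicant is at least as close to the safe profile.
    static boolean approve(double[] applicant, double[] safe, double[] risky) {
        return distance(applicant, safe) <= distance(applicant, risky);
    }

    public static void main(String[] args) {
        double[] safeProfile  = {70.0, 0.2};  // avg income (k$), avg debt ratio
        double[] riskyProfile = {30.0, 0.6};
        double[] applicant    = {65.0, 0.3};
        System.out.println(approve(applicant, safeProfile, riskyProfile)
                ? "approve" : "reject");
    }
}
```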
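Finally, a simple sequence question ("is event A usually followed by event B within a short window?") can be sketched as a single scan over an event log; the event names and window size below are assumptions for illustration:

```java
import java.util.List;

// Sketch: measure how often event 'first' is followed by event 'second'
// within the next 'window' events. Event names are illustrative.
public class SequenceSketch {

    // Fraction of occurrences of 'first' that are followed by 'second'
    // within the next 'window' events.
    static double followRate(List<String> events, String first,
                             String second, int window) {
        int firsts = 0, followed = 0;
        for (int i = 0; i < events.size(); i++) {
            if (!events.get(i).equals(first)) continue;
            firsts++;
            for (int j = i + 1; j < events.size() && j <= i + window; j++) {
                if (events.get(j).equals(second)) { followed++; break; }
            }
        }
        return firsts == 0 ? 0.0 : followed / (double) firsts;
    }

    public static void main(String[] args) {
        List<String> events = List.of("A", "B", "C", "A", "C", "B", "A", "B");
        System.out.println(followRate(events, "A", "B", 2));
    }
}
```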
Datamining draws upon techniques initially developed in the machine learning
and statistics communities. These methods, however precise, suffer from one
serious limitation: they cannot handle large quantities of data. Datamining
alleviates this performance issue by adding the database perspective, where
performance has long been a central concern.
Proposed Work
We propose to create new tools/methods
and modify existing techniques for datamining. In particular we aim
to:
- Create scalable and flexible mining algorithms that can incorporate
existing knowledge about the enterprise: We will draw upon our background
in machine learning to enhance existing algorithms for datamining. By
using rich knowledge representations such as branching programs, we expect
to contribute innovative algorithms to the datamining community.
- Develop cross-platform visualization tools using Java: A key aspect of
datamining is the visual representation of both the raw data and the
discovered rules. The computationally intensive mining process will run on
servers, while lightweight Java clients will display the results.
-
Promote XML as a standard vehicle
for data representation: One of the hindrances to the development
of datamining is the absence of a well-accepted standard for the representation
of data. Recently, XML (eXtensible Markup Language) was suggested
as a universal standard for data representation in datamining applications
[2].
We intend to go one step further as use XML as a representation for mined
information as well. We will use Java-based tools to browse, edit
and manipulate XML-formatted data.
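As a sketch of what XML-formatted mined information might look like, the fragment below encodes one association rule and parses it back with the JDK's standard DOM parser. The element and attribute names (rule, antecedent, consequent, confidence) are our own invention for illustration, not an established schema:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch: one mined association rule serialized in a hypothetical XML
// schema, then parsed back with the JDK's DOM parser.
public class XmlRuleSketch {

    // Parse one XML-encoded rule and render it as "antecedent -> consequent".
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            Element rule = doc.getDocumentElement();
            return rule.getElementsByTagName("antecedent").item(0).getTextContent()
                    + " -> "
                    + rule.getElementsByTagName("consequent").item(0).getTextContent()
                    + " (confidence " + rule.getAttribute("confidence") + ")";
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical encoding of a mined rule; the schema is illustrative.
        String xml = "<rule support=\"0.40\" confidence=\"0.67\">"
                + "<antecedent>diapers</antecedent>"
                + "<consequent>beer</consequent>"
                + "</rule>";
        System.out.println(describe(xml));  // diapers -> beer (confidence 0.67)
    }
}
```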
The tools we develop will be applicable in a wide variety of domains. We,
however, anticipate testing our tools on three principal classes of data
sources: student data from academia, transactional data from retail vendors,
and demographic data from geographic information systems.
This project will involve graduate students in the departments
of the two co-investigators. The source code for the project will be open
to all, and results from the project will be disseminated in conferences
and over the Internet. We will also host any extensions made by others
to the toolkits developed in this project.
References
[1] R. Agrawal, T. Imielinski, A. Swami: "Database
Mining: A Performance Perspective", IEEE Transactions on Knowledge and
Data Engineering, Special issue on Learning and Discovery in Knowledge-Based
Databases, Vol. 5, No. 6, December 1993, 914-925.
[2] R. Agrawal: "Data Mining: Crossing the Chasm". Invited talk at the
5th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining
(KDD-99), San Diego, California, August 1999.