DELVE - Data Exploration, Learning, and Visualization Environment

A project by the Datamining Group at Hofstra University

People

Krishnan Pillaipakkamnatt, Assistant Professor
Tomasz Dolinski, Graduate Student
Rona Eisenberg, Graduate Student
Lisa Zanella, Graduate Student
Anthony Iadevaia, independent software consultant

Link to Sun AEG proposal

Intro to Datamining

Datamining is the extraction of interesting information from large repositories of data. Such data can come from sources such as retail sales transactions in supermarkets, service center call logs, web access records, and weather measurements. With the creation of new tools that draw upon machine learning (artificial intelligence), statistics, and databases, administrators can now discover the rules that underlie the observational records of their business enterprise.

Irrespective of the context of the application, decision makers can use datamining tools to make at least three different (but related) kinds of inferences:

Associations: These are rules that group objects together based on some relation between them. Consider a grocery store manager whose database records individual purchase transactions. The manager may be interested in a rule such as: people who buy diapers tend to also buy beer. Such rules can help with product placement, advertising, pricing, and so on.
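As a rough illustration of how a rule like "diapers implies beer" might be scored, the sketch below computes two commonly used measures, support and confidence, over a handful of made-up baskets. The function names and the data are assumptions for illustration only; they are not taken from DELVE or from any particular system.

```python
from typing import List, Set


def support(transactions: List[Set[str]], items: Set[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    if not transactions:
        return 0.0
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(
    transactions: List[Set[str]], antecedent: Set[str], consequent: Set[str]
) -> float:
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(transactions, antecedent | consequent) / base


# Hypothetical purchase records, one set of items per transaction.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

print(support(baskets, {"diapers", "beer"}))       # 2 of 4 baskets -> 0.5
print(confidence(baskets, {"diapers"}, {"beer"}))  # 2 of the 3 diaper baskets
```

A rule is usually considered interesting only when both numbers exceed thresholds chosen by the analyst; the thresholds themselves depend on the application.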

Predictions: Prediction (or classification) rules are used to categorize data into disjoint groups. Consider the situation at a bank that makes loans. Based on the performance of loans made in the past, it may create profiles for safe and for risky loan seekers. When a new customer applies for a loan, the bank has to decide on the creditworthiness of the applicant based on the information made available to it. By matching the profile of this customer against prior customer profiles, the bank might classify the loan application for approval or for rejection.
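One simple way to picture the profile matching described above is a nearest-centroid classifier: average the past loans in each group to form a profile, then assign a new applicant to the closest profile. This is only a sketch of one possible technique, not necessarily what a bank or DELVE would use. The two features (an income score and a debt ratio, both assumed to be pre-scaled to the range 0 to 1), the labels, and the data are all hypothetical.

```python
from math import dist
from typing import Dict, List, Tuple

Features = Tuple[float, ...]


def build_profiles(history: List[Tuple[Features, str]]) -> Dict[str, Features]:
    """Average the feature vectors of past loans, one profile per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for features, label in history:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(total / counts[label] for total in acc)
        for label, acc in sums.items()
    }


def classify(profiles: Dict[str, Features], applicant: Features) -> str:
    """Return the label whose profile is nearest (Euclidean) to the applicant."""
    return min(profiles, key=lambda label: dist(profiles[label], applicant))


# Hypothetical past loans: (income_score, debt_ratio) and the observed outcome.
history = [
    ((0.9, 0.1), "safe"),
    ((0.8, 0.2), "safe"),
    ((0.3, 0.8), "risky"),
    ((0.2, 0.7), "risky"),
]

profiles = build_profiles(history)
print(classify(profiles, (0.7, 0.3)))  # closer to the "safe" profile
print(classify(profiles, (0.3, 0.9)))  # closer to the "risky" profile
```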

Sequences: Association rules and prediction rules usually involve intra-record data. There are, however, situations where the occurrence of a prior event or transaction influences the occurrence of subsequent transactions. Sequence problems are those in which such temporal connections need to be uncovered. Such situations frequently occur in the stock market and in domains that show seasonal variation.
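To make the idea of a temporal connection concrete, the sketch below counts how many time-ordered histories contain one event followed shortly by another. It is a deliberately naive illustration under assumed inputs: the function name, the `window` parameter, and the purchase histories are hypothetical, and practical sequence mining algorithms are considerably more elaborate.

```python
from collections import Counter
from typing import Iterable, List


def follow_counts(sequences: Iterable[List[str]], window: int = 1) -> Counter:
    """Count ordered pairs (a, b) where b occurs within `window` steps
    after a. Each pair is counted at most once per sequence."""
    counts: Counter = Counter()
    for events in sequences:
        seen = set()
        for i, first in enumerate(events):
            # Look only at the next `window` events after position i.
            for later in events[i + 1 : i + 1 + window]:
                seen.add((first, later))
        counts.update(seen)
    return counts


# Hypothetical per-customer purchase histories, oldest event first.
histories = [
    ["tv", "dvd_player", "speakers"],
    ["tv", "speakers"],
    ["dvd_player", "tv"],
]

counts = follow_counts(histories, window=2)
print(counts[("tv", "speakers")])  # 2 histories have speakers soon after a tv
print(counts[("speakers", "tv")])  # 0: the reverse order never occurs
```

Unlike the association example, order matters here: the pair (tv, speakers) and the pair (speakers, tv) are counted separately.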

Datamining draws upon techniques initially developed in the machine learning and statistics communities. These methods, however precise they may claim to be, suffer from one serious flaw: they cannot handle large quantities of data. Datamining alleviates this performance problem by adding the database perspective, in which performance on large data sets has long been addressed.

References

R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant: "The Quest Data Mining System", Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August, 1996.
R. Agrawal, T. Imielinski, A. Swami: "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, Special issue on Learning and Discovery in Knowledge-Based Databases, Vol. 5, No. 6, December 1993, 914-925.
R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994. Expanded version available as IBM Research Report RJ9839, June 1994.