DELVE - Data Exploration, Learning, and Visualization Environment
A project by the Datamining Group at Hofstra University
People
-
Krishnan Pillaipakkamnatt, Assistant Professor
-
Tomasz Dolinski, Graduate Student
-
Rona Eisenberg, Graduate Student
-
Lisa Zanella, Graduate Student
-
Anthony Iadevaia, independent software consultant
Intro to Datamining
Datamining is the extraction of interesting information
from large repositories of data. Such data can be generated from
various sources such as retail sales transactions in supermarkets, service
center call logs, web access records, weather measurements, etc.
With new tools that draw on machine learning (artificial intelligence),
statistics, and databases, administrators can now discover the deeper
rules that underlie the observational records of their business enterprise.
Irrespective of the context of the application, decision makers can
use datamining tools to make at least three different (but related) kinds
of inferences:
-
Associations: These are rules that group objects together
based on some relation between them. Consider the following scenario: a
grocery store manager whose database records individual purchase transactions
may be interested in a rule such as: people who buy diapers
tend to also buy beer. Such rules can help with product placement,
advertising, pricing, and so on.
-
Predictions: Prediction (or classification) rules are used
to categorize data into disjoint groups. Consider the situation at
a bank that makes loans. Based on the performance of loans made in the
past, it may create profiles for safe and for risky loan seekers.
When a new customer applies for a loan, the bank has to decide on the
creditworthiness of the applicant based on the information made
available to it. By matching the profile of this customer against
prior customer profiles, the bank might classify the loan application
for approval or for rejection.
-
Sequences: Association rules and prediction rules usually involve
data within a single record. There are, however, situations where the
occurrence of a prior event or transaction influences the occurrence of
subsequent transactions. Sequence problems are those in which such temporal
connections need to be uncovered. Such situations frequently occur
in the stock market and in other domains with seasonal variation.
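The three kinds of inference above can be illustrated with a minimal Python
sketch. It is not the method used by DELVE or by any particular tool. All
data, feature choices, and function names (support, confidence, classify,
follow_counts) are hypothetical and chosen only for illustration. The
association part uses the common support and confidence measures, the
prediction part uses a simple nearest-profile match, and the sequence part
counts which event immediately follows which.

```python
from collections import Counter

# Hypothetical toy data: each set is one purchase transaction.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Hypothetical past loans: (income in $1000s, debt-to-income ratio) -> outcome.
past_loans = [
    ((85, 0.10), "safe"),
    ((60, 0.20), "safe"),
    ((30, 0.55), "risky"),
    ((25, 0.70), "risky"),
]

def classify(applicant, past_loans):
    """Label an applicant with the outcome of the most similar past profile."""
    def distance(a, b):
        # Divide income by 100 so both features have comparable ranges
        # (an arbitrary scaling chosen for this sketch).
        return ((a[0] - b[0]) / 100) ** 2 + (a[1] - b[1]) ** 2
    _, label = min(past_loans, key=lambda loan: distance(applicant, loan[0]))
    return label

# Hypothetical event log, listed in time order.
events = ["rate_cut", "rally", "flat", "rate_cut",
          "rally", "selloff", "rate_cut", "flat"]

def follow_counts(events):
    """Count how often each event is immediately followed by each other event."""
    return Counter(zip(events, events[1:]))

print(confidence({"diapers"}, {"beer"}, transactions))
print(classify((70, 0.15), past_loans))
print(follow_counts(events).most_common(1))
```

In this toy data, two of the three transactions containing diapers also
contain beer, which is what the confidence of the rule "diapers, then beer"
measures. Practical tools add thresholds, pruning, and much larger inputs.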
Datamining draws upon techniques initially developed in the machine learning
and statistics communities. These methods, however precise they may be,
share one serious flaw: they scale poorly to large quantities of data.
Datamining alleviates this performance issue by adding the database
perspective, in which the efficient processing of large data sets has been
addressed for a long time.
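As a rough illustration of the database perspective, the sketch below pushes
the counting step of an association search into a database engine instead of
looping over records in application code. It uses Python's standard sqlite3
module with an in-memory database. The table name, column names, and data are
hypothetical, and the query assumes each (txn_id, item) pair appears at most
once. It is an assumption-laden example, not the approach of any system cited
here.

```python
import sqlite3

# In-memory database holding one row per (transaction, item).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (txn_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [
        (1, "diapers"), (1, "beer"), (1, "milk"),
        (2, "diapers"), (2, "beer"),
        (3, "diapers"), (3, "bread"),
        (4, "beer"), (4, "chips"),
    ],
)
# An index lets the engine find matching rows without a full scan per lookup.
conn.execute("CREATE INDEX idx_purchases_txn ON purchases (txn_id)")

# Self-join on the transaction id to enumerate item pairs bought together,
# then keep only pairs that occur in at least two transactions.
pair_counts = conn.execute(
    """
    SELECT a.item, b.item, COUNT(*) AS n
    FROM purchases AS a
    JOIN purchases AS b
      ON a.txn_id = b.txn_id AND a.item < b.item
    GROUP BY a.item, b.item
    HAVING COUNT(*) >= 2
    ORDER BY n DESC
    """
).fetchall()

print(pair_counts)
```

The condition a.item < b.item keeps each unordered pair once and excludes an
item paired with itself.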
References
-
R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant: "The
Quest Data Mining System", Proc. of the 2nd Int'l Conference on Knowledge
Discovery in Databases and Data Mining, Portland, Oregon, August, 1996.
-
R. Agrawal, T. Imielinski, A. Swami: "Database Mining: A Performance Perspective",
IEEE Transactions on Knowledge and Data Engineering, Special issue on Learning
and Discovery in Knowledge-Based Databases, Vol. 5, No. 6, December 1993,
914-925.
-
R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules",
Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile,
Sept. 1994. Expanded version available as IBM Research Report RJ9839, June
1994.