NSF Grant IRI-96-31952 Data Warehousing and Decision Support
Jeffrey D. Ullman
Department of Computer Science
Stanford University
Contact Information
411 Gates Hall, 4A Wing
Department of Computer Science
Stanford University
Stanford, CA 94305-9040
Phone: (650) 725-4802
Fax : (650) 725-2588
Email: ullman@cs.stanford.edu
WWW PAGE
NSF Grant IRI-96-31952.
Keywords
Data cube, materialized view, information integration, semistructured data,
data warehouse.
Project Award Information
Began: 9/1/96; ends: 8/31/99.
Project Summary
In business, government, and all aspects of life where planning is important,
there is a need for database management systems that can handle ``decision-support
queries.'' These complex queries involving the aggregation of very large
amounts of data. Often, special ``data warehouses,`` or copies of information
from a variety of sources, are created for the purpose of asking decision-support
queries, while the original databases are used for transaction processing:
the fast, frequent operations that each update or query tiny fractions
of the data. Some new database products implement ``data cubes,'' which
are special organizations of data that facilitate common forms of decision-support
queries. There is need for theory and techniques addressing a spectrum
of new problems associated with decision support. One important area is
data-cube design, where it is feasible to materialize a selected set of
views to optimize average query time, for a given amount of storage. Selection
of materialized views for data cubes is a major research challenge. More
generally, when views in a warehouse are maintained as the underlying databases
change, it is desirable to have at the warehouse a selection of views so
that they can all be updated without having to issue queries at the underlying
databases (so-called ``self-maintainability'' of views). Moreover, the
warehouse should be maintainable ``incrementally,'' without having to recompute
the warehouse as the underlying data changes. Finally, decision-support
queries themselves can often be optimized by techniques recently discovered
and, it appears, some new techniques that need investigation.
Goals, Objectives, and Targeted Activities
-
Techniques for query optimization of very complex queries that are described
using the ``query flocks'' approach.
-
Improved algorithms for incremental maintenance of materialized views.
-
Renew efforts to find efficient ways to store summaries of data-cubes to
speed up queries.
Indication of Success
It is hard to point to successes from a long-range research project currently
at its midway point. However, earlier NSF-sponsored work of which the current
project is a natural evolution have led to a number of good things.
-
Work on theory of information integration led to the founding of a company,
Junglee Corp., in June 1996, by three former students (Ashish Gupta, Venky
Harinarayan, and Anand Rajaraman) who had been supported by the predecessor
NSF grant. This company has been quite successful integrating information
on the web, with Yahoo, the New York Times, and the Washington
Post numbered among several dozen clients. There are currently over
50 employees.
-
The paper Implementing
data cubes efficiently
The data-cube paper mentioned above was recently cited in a note from Redbrick
Inc., a startup data-warehousing company, as the source for the ``design
advisor'' they are now marketing.
I'm not sure what
GRPA is, but when the objective of the research is to explore new worlds,
it is hard to claim we have failed to do so. Like Capt. Kirk, it was hard
to predict exactly what we would discover in advance, but I believe we
have been adequately successful in inventing some useful artifacts and
ideas and have been recognized by the research community as having done
so.
Again, it is far
too early to tell about the future impact of this project. Four students,
one a woman, have been partially supported so far. Graduates supported
by previous NSF grants of the PI have been in great demand, both as summer
interns and as full-time employees after graduation. The PI is the 1998
winner of the ACM Karl Karlstrom award, and part of the citation indicates
the award is for the large number of doctoral students he has graduated
(almost all of them with at least partial NSF support).
Brin, S., R. Motwani,
and C. Silverstein [1997]. ``Beyond market baskets: generalizing association
rules to correlations,''
Proc. ACM SIGMOD Conf., pp. 265-276.Postscript
copy.
Brin, S., R. Motwani, J. D. Ullman, and S. Tsur [1997]. ``Dynamic itemset
counting and implication rules for market basket data,''
Proc. ACM SIGMOD
Conf., pp. 255-264.Postscript
copy.
Gupta, H. [1997]. ``Selection of views to materialize in a data warehouse,''
Proc.
Intl. Conf. on Database Theory, Athens, Jan., 1997, pp. 98-112.
Postscript
copy.
Gupta, H., V. Harinarayan, A. Rajaraman, and J. D. Ullman [1997]. ``Index
selection for OLAP,''
Intl. Conf. on Data Engineering, Birmingham,
UK, April, 1997.
Postscript
copy.
Nestorov, S., S. Abiteboul, and R. Motwani [1997]. ``Inferring structure
in semistructured data,''
Proc. Workshop on Management of Semistructured
Data, Tucson, May, 1997.
Postscript
copy. Also, a follow-on paper ``Extracting structure from semistructured
data'' by the same authors.
Nestorov, S., J. D. Ullman, J. Weiner, and S. Chawathe [1997]. ``Representative
objects: concise representations of semistructured, hierarchical data,''
Intl.
Conf. on Data Engineering, Birmingham, UK, April, 1997.
Postscript
copy.
Tsur, S., J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov,
and A. Rosenthal [1998]. ``Query flocks: a generalization of association-rule
mining,'' to appear in 1998 SIGMOD.
Postscript
copy. Vassalos, V. and Y. Papakonstantinou [1997]. ``Describing and
using query capabilities of heterogeneous sources,''
Proc. VLDB,
Athens, Greece, Aug., 1997.
Postscript
copy, extended version.
Area Background
My advice is to study database theory, especially logic, Datalog, conjunctive
queries and the like. This theory, while it may never spawn conventional
DBMS's, has been used in several information integration systems. An understanding
of relational algebra is also important, because a lot of the problems
are optimization problems. Extensions of the algebra figure prominently
in data-warehousing implementations.
Area References
A tutorial on database theory, much of it relevant to information integration,
can be found in My
CS345 notes.
The following reference is a more concise survey of some of these ideas:
Ullman, J. D. [1997]. ``Information integration using logical views,''
Proc.
Intl. Conf. on Database Theory, Athens, Jan., 1997, pp. 19-40.
Postscript
copy.
Garcia-Molina, H., Y. Papakonstantinou, D. Quass, A Rajaraman, Y. Sagiv,
J. D. Ullman, V. Vassalos, and J. Widom [1997]. ``The TSIMMIS approach
to mediation: data models and languages,''
J. Intelligent Information
Systems8:2, pp. 117-132.
Postscript
copy.
A book, soon to be published by MIT Press, has been written by Ashish
Gupta and Inderpal Mumick (two former NSF-supported students of mine);
the title is Materialized Views, Techniques, Implementations, and Applications.