NSF Grant IRI-96-31952 Data Warehousing and Decision Support

Jeffrey D. Ullman

Department of Computer Science
Stanford University

Contact Information

411 Gates Hall, 4A Wing
Department of Computer Science
Stanford University
Stanford, CA 94305-9040
Phone: (650) 725-4802
Fax : (650) 725-2588
Email: ullman@cs.stanford.edu


NSF Grant IRI-96-31952.


Data cube, materialized view, information integration, semistructured data, data warehouse.

Project Award Information

Began: 9/1/96; ends: 8/31/99.

Project Summary

In business, government, and all aspects of life where planning is important, there is a need for database management systems that can handle ``decision-support queries.'' These complex queries involving the aggregation of very large amounts of data. Often, special ``data warehouses,`` or copies of information from a variety of sources, are created for the purpose of asking decision-support queries, while the original databases are used for transaction processing: the fast, frequent operations that each update or query tiny fractions of the data. Some new database products implement ``data cubes,'' which are special organizations of data that facilitate common forms of decision-support queries. There is need for theory and techniques addressing a spectrum of new problems associated with decision support. One important area is data-cube design, where it is feasible to materialize a selected set of views to optimize average query time, for a given amount of storage. Selection of materialized views for data cubes is a major research challenge. More generally, when views in a warehouse are maintained as the underlying databases change, it is desirable to have at the warehouse a selection of views so that they can all be updated without having to issue queries at the underlying databases (so-called ``self-maintainability'' of views). Moreover, the warehouse should be maintainable ``incrementally,'' without having to recompute the warehouse as the underlying data changes. Finally, decision-support queries themselves can often be optimized by techniques recently discovered and, it appears, some new techniques that need investigation.

Goals, Objectives, and Targeted Activities

  1. Techniques for query optimization of very complex queries that are described using the ``query flocks'' approach.
  2. Improved algorithms for incremental maintenance of materialized views.
  3. Renew efforts to find efficient ways to store summaries of data-cubes to speed up queries.

Indication of Success

It is hard to point to successes from a long-range research project currently at its midway point. However, earlier NSF-sponsored work of which the current project is a natural evolution have led to a number of good things. I'm not sure what GRPA is, but when the objective of the research is to explore new worlds, it is hard to claim we have failed to do so. Like Capt. Kirk, it was hard to predict exactly what we would discover in advance, but I believe we have been adequately successful in inventing some useful artifacts and ideas and have been recognized by the research community as having done so.

Project Impact and Outcome

Again, it is far too early to tell about the future impact of this project. Four students, one a woman, have been partially supported so far. Graduates supported by previous NSF grants of the PI have been in great demand, both as summer interns and as full-time employees after graduation. The PI is the 1998 winner of the ACM Karl Karlstrom award, and part of the citation indicates the award is for the large number of doctoral students he has graduated (almost all of them with at least partial NSF support).

Project References

Brin, S., R. Motwani, and C. Silverstein [1997]. ``Beyond market baskets: generalizing association rules to correlations,'' Proc. ACM SIGMOD Conf., pp. 265-276.Postscript copy.

Brin, S., R. Motwani, J. D. Ullman, and S. Tsur [1997]. ``Dynamic itemset counting and implication rules for market basket data,'' Proc. ACM SIGMOD Conf., pp. 255-264.Postscript copy.

Gupta, H. [1997]. ``Selection of views to materialize in a data warehouse,'' Proc. Intl. Conf. on Database Theory, Athens, Jan., 1997, pp. 98-112. Postscript copy.

Gupta, H., V. Harinarayan, A. Rajaraman, and J. D. Ullman [1997]. ``Index selection for OLAP,'' Intl. Conf. on Data Engineering, Birmingham, UK, April, 1997. Postscript copy.

Nestorov, S., S. Abiteboul, and R. Motwani [1997]. ``Inferring structure in semistructured data,'' Proc. Workshop on Management of Semistructured Data, Tucson, May, 1997. Postscript copy. Also, a follow-on paper ``Extracting structure from semistructured data'' by the same authors.

Nestorov, S., J. D. Ullman, J. Weiner, and S. Chawathe [1997]. ``Representative objects: concise representations of semistructured, hierarchical data,'' Intl. Conf. on Data Engineering, Birmingham, UK, April, 1997. Postscript copy.

Tsur, S., J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal [1998]. ``Query flocks: a generalization of association-rule mining,'' to appear in 1998 SIGMOD. Postscript copy. Vassalos, V. and Y. Papakonstantinou [1997]. ``Describing and using query capabilities of heterogeneous sources,'' Proc. VLDB, Athens, Greece, Aug., 1997. Postscript copy, extended version.

Area Background

My advice is to study database theory, especially logic, Datalog, conjunctive queries and the like. This theory, while it may never spawn conventional DBMS's, has been used in several information integration systems. An understanding of relational algebra is also important, because a lot of the problems are optimization problems. Extensions of the algebra figure prominently in data-warehousing implementations.

Area References

A tutorial on database theory, much of it relevant to information integration, can be found in My CS345 notes.

The following reference is a more concise survey of some of these ideas: Ullman, J. D. [1997]. ``Information integration using logical views,'' Proc. Intl. Conf. on Database Theory, Athens, Jan., 1997, pp. 19-40. Postscript copy.

Garcia-Molina, H., Y. Papakonstantinou, D. Quass, A Rajaraman, Y. Sagiv, J. D. Ullman, V. Vassalos, and J. Widom [1997]. ``The TSIMMIS approach to mediation: data models and languages,'' J. Intelligent Information Systems8:2, pp. 117-132. Postscript copy.

A book, soon to be published by MIT Press, has been written by Ashish Gupta and Inderpal Mumick (two former NSF-supported students of mine); the title is Materialized Views, Techniques, Implementations, and Applications.