NSF Grant IRI-96-31952 Data Warehousing and Decision Support

Jeffrey D. Ullman

Department of Computer Science
Stanford University

Contact Information

Jeffrey D. Ullman
411 Gates Hall
Department of Computer Science
Stanford University
Stanford, CA 94305-9040
Phone: (650) 725-4802
Fax : (650) 725-2588
Email: ullman@cs.stanford.edu
URL: http://www-db.stanford.edu/~ullman


NSF Grant IRI-96-31952

List of Supported Students and Staff (optional)

Project Award Information


Data cube, materialized view, information integration, semistructured data, data warehouse.

Project Summary

In business, government, and all aspects of life where planning is important, there is a need for database management systems that can handle ``decision-support queries.'' These complex queries involving the aggregation of very large amounts of data. Often, special ``data warehouses,`` or copies of information from a variety of sources, are created for the purpose of asking decision-support queries, while the original databases are used for transaction processing: the fast, frequent operations that each update or query tiny fractions of the data. Some new database products implement ``data cubes,'' which are special organizations of data that facilitate common forms of decision-support queries. There is need for theory and techniques addressing a spectrum of new problems associated with decision support. One important area is data-cube design, where it is feasible to materialize a selected set of views to optimize average query time, for a given amount of storage. Selection of materialized views for data cubes is a major research challenge. More generally, when views in a warehouse are maintained as the underlying databases change, it is desirable to have at the warehouse a selection of views so that they can all be updated without having to issue queries at the underlying databases (so-called ``self-maintainability'' of views). Moreover, the warehouse should be maintainable ``incrementally,'' without having to recompute the warehouse as the underlying data changes. Finally, decision-support queries themselves can often be optimized by techniques recently discovered and, it appears, some new techniques that need investigation.

Publications and Products

Publications are listed under Project Publications below. Several startups that have been spawned by the technology developed under this grant or predecessors are mentioned under Project Impact.

Project Impact

The 1998 Report mentioned several achievements of the PI and students supported by previous NSF grants.

Last year, we mentioned two startups that developed from research under this and predecesor grants, Junglee Corp., bought by Amazon.com in 1998, applied information-integration technology to the Web. Google is a search engine company whose growth has brought it to the first rank, and that is growing faster than any of its competitors. Its core technology, which allows it to find pages far more accurately than other search engines, was partially supported by this grant.

This year, Enosys has started operation to provide XML-based information-integration in a particular vertical marketplace; it uses semistructured-data technology partially suppored by the grant.

Goals, Objectives, and Targeted Activities

In addition to several works on data warehousing reported last year, that were published this year [1, 2], we have addressed semistructured data integration, and query optimization for mediator systems. [5] examines the design of good query plans when using semistructured data. [6] introduces a logical language that looks like Datalog, but with placeholders for arbitrary numbers of arguments in predicates; its purpose is to model semistructured data in which the schema is not fixed. A major result is that this language can model a source that accepts full select-project-join queries, while that sort of capability cannot be modeled in Datalog.

The other direction has been optimization of query plans, using a query-centric mediator system such as the Stanford TSIMMIS. In [7, 8] we introduced the possibility that sources would only answer queries in which certain arguments are bound. [4] shows how to answer a path query by getting only the essential sources off the path that can contribute bindings; those bindings allow us to ask queries along the path with bindings that we couldn't get by following the path. [5] Answers the question whether by so doing we get all obtainable answers to the query, and several related issues.

The grant is terminating, and a proposal to continue work in the direction of query optimization for mediator and warehouse systems has been filed.


Project References

  1. Gupta, H. and I. S. Mumick, ``Selection of views to materialize under a maintenance-time constraint,'' Proc. 1999 ICDT. Postscript.

  2. Gupta, H. and D. Srivastava, ``Data warehouse of newsgroups,'' ibid. Postscript.

  3. Li, C., ``Computing complete answers to queries with binding restrictions,'' unpublished memorandum, Stanford Univ., Dept. of CS, 2000.

  4. Li, C. and E. Chang, ``Query planning with limited source capabilities,'' to appear in 2000 ICDE. Postscript.

  5. Papakonstantinou, Y. and V. Vassalos, ``Query rewriting for semistructured data,'' Proc. 1999 SIGMOD, pp. 455-466. Postscript.

  6. Vassalos, V. and Y. Papakonstantinou, ``Expressive capabilities description languages and query rewriting algorithms,'' J. Logic Programming 43:1, pp. 75-122. Postscript.

  7. Yerneni, R., C. Li, H. Garcia-Molina, and J. D. Ullman, ``Optimizing large join queries in mediation systems,'' Proc. 1999 ICDE, pp. 348-364. Postscript.

  8. Yerneni, R., C. Li, H. Garcia-Molina, and J. D. Ullman, ``Computing capabilities of mediators,'' Proc. 1999 SIGMOD, pp. 443-454. Postscript.

Area Background

I recommend the Survey by Alon Levy and my own Course Notes on the subject of information integration.