NSF Grant IRI-96-31952 Data Warehousing and Decision Support

Jeffrey D. Ullman
Department of Computer Science
Stanford University

Contact Information
Jeffrey D. Ullman
411 Gates Hall
Department of Computer Science
Stanford University
Stanford, CA 94305-9040
Phone: (650) 725-4802
Fax: (650) 725-2588
Email: ullman@cs.stanford.edu
URL: http://www-db.stanford.edu/~ullman

NSF Grant IRI-96-31952

List of Supported Students and Staff (optional)

Project Award Information

Data cube, materialized view, information integration, sernistructured data, data warehouse.

Project Summary
In business, government, and all aspects of life where planning is important, there is a need for database management systems that can handle "decision-support queries." These complex queries involving the aggregation of very large amounts of data. Often, special "data warehouses," or copies of information from a variety of sources, are created for the purpose of asking decision-support queries, while the original databases are used for transaction processing: the fast, frequent operations that each update or query tiny fractions of the data. Some new database products implement "data cubes," which are special organizations of data that facilitate common forms of decisionsupport queries. There is need for theory and techniques addressing a spectrum of new problems associated with decision support. One important area is data-cube design, where it is feasible to materialize a selected set of views to optimize average query time, for a given amount of storage. Selection of materialized views for data cubes is a major research challenge. More generally, when views in a warehouse are maintained as the underlying databases change, it is desirable to have at the warehouse a selection of views so that they can all be updated without having to issue queries at the underlying databases (so-called "self-maintainability" of views). Moreover, the warehouse should be maintainable "incrementally," without having to recompute the warehouse as the underlying data changes. Finally, decision-support queries themselves can often be optimized by techniques recently discovered and, it appears, some new techniques that need investigation.

Goals, Objectives, and Targeted Activities

  1. Improved algorithms for maintenance of materialized views.
  2. Exploiting the World-Wide-Web as a database.
  3. Extending database technology to sernistructured data.

Indication of Success
We have had good success with publication in major conferences this past year. In addition, there are some new stories of startups based on NSF-supported research in the next paragraph.

Project Impact
The Previous Report mentioned several achievements of the Pl and students supported by previous NSF grants.
In addition, we note that the company Junglee Inc., which exploited information-integration technology developed under a previous grant, was bought by Amazon.com in August, 1998. Their technology is moving Amazon into the role of retailer, not just of books, but as an on-line "Macys. "
Finally, the google project has also gone commercial as Google.com.

GPRA Outcome Goals
See the Previous Report.

Project References

  1. S. Abiteboul, J. McHugh, M. Rys, V. Vassalos, and J. Wiener, "Incremental maintenance of sernistructured views over sernistructured data," VLDB, Aug., 1998. Extends incremental view maintenance to semistructured views. Also explores what the proper notion of views of sernistructured data should be. postscript.
  2. S. Brin, "Extracting patterns and relations from the world wide web," Proceedings of WebDB Workshop at EDBT`98, Valencia, Spain, March 1998. postscript.
  3. S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Proceedings of WWW7, Australia, 1998. postscript.
  4. H. Gupta and 1. S. Murnick, "Selection of views to materialize under a maintenance-time constraint," ICDT, Jan., 1999. Extends earlier work on optimizing the selection of materialized views to a model where the critical resource is the (usually short) time that the system can be shut down to update the warehouse. postscript.
  5. H. Gupta and D. Srivastava, "Data warehouse of newsgroups," ICDT, Jan., 1999. A "newsgroup" is a set of documents from some large repository that is of interest to one or more subscribers. The problem addressed is how to select some sets of documents to materialize under risource constraints. postcript .
  6. C. Li, R. Yemeni, H. Garcia-Molina, and J. D. Ullman, "Optimizing largejoin queries in mediation systems," ICDT, Jan., 1999. Shows how to take into account the binding patterns that are needed to answer a query at a source in the Tsimmis mediator system. The result is a way to optimize queries and yet not allow the plan to ask a source query that cannot be answered. postscript.
  7. C. Li, R. Yemeni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. D. Ullman, and M. Valiveti, " Capability-based mediation in Tsimmis," SIGMOD, June, 1998. Published material regarding a demo that illustrated the use of binding patterns in the Tsimmis query optimizer. postscript.
  8. S. Nestorov, S. Abiteboul, and R. Motwani, "Extracting structure from semistructured data," SIGMOD, June, 1998. The greatest-fixedpoint of monadic Datalog programs is used as a method to deduce a reasonable set of types for a sernistructured database. The effect is a reasonable description of the "typical" nodes or objects in a sernistructured DB. postscript.
  9. Y. Papakonstantinou and V. Vassalos, "Query rewriting using sernistructured views," to appear in SIGMOD 99. Extends to sernistructured data the techniques for sythesizing query answers from views that are described in terms of global predicates (which were developed by Levy and others for Datalog). postscript.
  10. Y. Papakonstantinou and V. Vassalos, " Expressive-capabilities description languages and query rewriting algorithms," to appear in J Logic Programming. Compares proposed languages for expressing the ability of a source to answer certain kinds of queries. postscript.
  11. S. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal, "Query flocks: a generalization of association-rule mining," SIGMOD, June, 1998. Presents a framework for studying complex queries and their optimization. The major idea is a generalization of the "a-priori" technique used in association-rule data mining. postscript.

Area Background
See the Previous Report for useful background and references.