NSF Grant IRI-96-31952 Data Warehousing and Decision Support
Jeffrey D. Ullman
Department of Computer Science
Stanford University
Contact Information
Jeffrey D. Ullman
411 Gates Hall
Department of Computer Science
Stanford University
Stanford, CA 94305-9040
Phone: (650) 725-4802
Fax: (650) 725-2588
Email: ullman@cs.stanford.edu
URL: http://www-db.stanford.edu/~ullman
WWW PAGE
NSF Grant IRI-96-31952
List of Supported Students and Staff (optional)
- Jeffrey D. Ullman
, PI.
- Sergey Brin,
Doctoral Student.
- Himanshu Gupta,
Doctoral Student.
- Chen Li
, Doctoral Student.
- Svedozar Nestorov
, Doctoral Student.
- Vasilis Vassalos
, Doctoral Student.
Project Award Information
- Award Number: NSF IRI-96-31952
- Duration: 9/l/96--8/31/1999
- Title: Data Warehousing and Decision Support
Keywords
Data cube, materialized view, information integration,
sernistructured data, data warehouse.
Project Summary
In business, government, and all aspects of life where planning is
important, there is a need for database management systems that can handle
"decision-support queries." These complex queries involving the aggregation
of very large amounts of data. Often, special "data warehouses," or copies
of information from a variety of sources, are created for the purpose of
asking decision-support queries, while the original databases are used for
transaction processing: the fast, frequent operations that each update
or query tiny fractions of the data. Some new database products
implement "data cubes," which are special organizations of data that
facilitate common forms of decisionsupport queries. There is need for
theory and techniques addressing a spectrum of new problems associated with
decision support. One important area is data-cube design, where it is
feasible to materialize a selected set of views to optimize average query
time, for a given amount of storage. Selection of materialized views for
data cubes is a major research challenge. More generally, when views in a
warehouse are maintained as the underlying databases change, it is
desirable to have at the warehouse a selection of views so that they can
all be updated without having to issue queries at the underlying databases
(so-called "self-maintainability" of views). Moreover, the warehouse should
be maintainable "incrementally," without having to recompute the warehouse
as the underlying data changes. Finally, decision-support queries
themselves can often be optimized by techniques recently discovered and, it
appears, some new techniques that need investigation.
Goals, Objectives, and Targeted Activities
- Improved algorithms for maintenance of materialized views.
- Exploiting the World-Wide-Web as a database.
- Extending database technology to sernistructured data.
Indication of Success
We have had good success with publication in major conferences this
past year. In addition, there are some new stories of startups based on
NSF-supported research in the next paragraph.
Project Impact
The Previous Report mentioned several achievements of the Pl
and students supported by previous NSF grants.
In addition, we note that the company Junglee Inc., which exploited
information-integration technology developed under a previous grant, was
bought by Amazon.com in August, 1998. Their technology is moving
Amazon into the role of retailer, not just of books, but as an on-line
"Macys. "
Finally, the google project has also gone commercial as
Google.com.
GPRA Outcome Goals
See the Previous Report.
Project References
- S. Abiteboul, J. McHugh, M. Rys, V. Vassalos, and J. Wiener,
"Incremental maintenance of sernistructured views over sernistructured
data," VLDB, Aug., 1998. Extends incremental view maintenance to
semistructured views. Also explores what the proper notion of views of
sernistructured data should be. postscript.
- S. Brin, "Extracting patterns and relations from the world wide web,"
Proceedings of WebDB Workshop at EDBT`98, Valencia, Spain, March
1998. postscript.
- S. Brin and L. Page, "The anatomy of a large-scale hypertextual
web search engine," Proceedings of WWW7, Australia, 1998.
postscript.
- H. Gupta and 1. S. Murnick, "Selection of views to materialize
under a maintenance-time constraint," ICDT, Jan., 1999. Extends earlier
work on optimizing the selection of materialized views to a model where the
critical resource is the (usually short) time that the system can be shut
down to update the warehouse. postscript.
- H. Gupta and D. Srivastava, "Data warehouse of newsgroups," ICDT,
Jan., 1999. A "newsgroup" is a set of documents from some large repository
that is of interest to one or more subscribers. The problem addressed is
how to select some sets of documents to materialize under risource
constraints. postcript .
- C. Li, R. Yemeni, H. Garcia-Molina, and J. D. Ullman, "Optimizing
largejoin queries in mediation systems," ICDT, Jan., 1999. Shows how to
take into account the binding patterns that are needed to answer a query at
a source in the Tsimmis mediator system. The result is a way to optimize
queries and yet not allow the plan to ask a source query that cannot be
answered. postscript.
- C. Li, R. Yemeni, V. Vassalos, H. Garcia-Molina, Y.
Papakonstantinou, J. D. Ullman, and M. Valiveti, " Capability-based
mediation in Tsimmis," SIGMOD, June, 1998. Published material regarding a
demo that illustrated the use of binding patterns in the Tsimmis query
optimizer. postscript.
- S. Nestorov, S. Abiteboul, and R. Motwani, "Extracting structure
from semistructured data," SIGMOD, June, 1998. The greatest-fixedpoint of
monadic Datalog programs is used as a method to deduce a reasonable set of
types for a sernistructured database. The effect is a reasonable
description of the "typical" nodes or objects in a sernistructured DB.
postscript.
- Y. Papakonstantinou and V. Vassalos, "Query rewriting using
sernistructured views," to appear in SIGMOD 99. Extends to sernistructured
data the techniques for sythesizing query answers from views that are
described in terms of global predicates (which were developed by Levy and
others for Datalog). postscript.
- Y. Papakonstantinou and V. Vassalos, "
Expressive-capabilities description languages and query rewriting
algorithms," to appear in J Logic Programming. Compares proposed
languages for expressing the ability of a source to answer certain kinds of
queries. postscript.
- S. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S.
Nestorov, and A. Rosenthal, "Query flocks: a generalization of
association-rule mining," SIGMOD, June, 1998. Presents a framework for
studying complex queries and their optimization. The major idea is a
generalization of the "a-priori" technique used in association-rule data
mining. postscript.
Area Background
See the Previous Report for useful background and
references.