NSF Grant IRI-96-31952 Data Warehousing and Decision SupportJeffrey D. Ullman
Contact InformationJeffrey D. Ullman
WWW PAGENSF Grant IRI-96-31952
List of Supported Students and Staff (optional)
Project Award Information
KeywordsData cube, materialized view, information integration, semistructured data, data warehouse.
Project SummaryIn business, government, and all aspects of life where planning is important, there is a need for database management systems that can handle ``decision-support queries.'' These complex queries involving the aggregation of very large amounts of data. Often, special ``data warehouses,`` or copies of information from a variety of sources, are created for the purpose of asking decision-support queries, while the original databases are used for transaction processing: the fast, frequent operations that each update or query tiny fractions of the data. Some new database products implement ``data cubes,'' which are special organizations of data that facilitate common forms of decision-support queries. There is need for theory and techniques addressing a spectrum of new problems associated with decision support. One important area is data-cube design, where it is feasible to materialize a selected set of views to optimize average query time, for a given amount of storage. Selection of materialized views for data cubes is a major research challenge. More generally, when views in a warehouse are maintained as the underlying databases change, it is desirable to have at the warehouse a selection of views so that they can all be updated without having to issue queries at the underlying databases (so-called ``self-maintainability'' of views). Moreover, the warehouse should be maintainable ``incrementally,'' without having to recompute the warehouse as the underlying data changes. Finally, decision-support queries themselves can often be optimized by techniques recently discovered and, it appears, some new techniques that need investigation.
Publications and ProductsPublications are listed under Project Publications below. Several startups that have been spawned by the technology developed under this grant or predecessors are mentioned under Project Impact.
Project ImpactThe 1998 Report mentioned several achievements of the PI and students supported by previous NSF grants.
Last year, we mentioned two startups that developed from research under this and predecesor grants, Junglee Corp., bought by Amazon.com in 1998, applied information-integration technology to the Web. Google is a search engine company whose growth has brought it to the first rank, and that is growing faster than any of its competitors. Its core technology, which allows it to find pages far more accurately than other search engines, was partially supported by this grant.
This year, Enosys has started operation to provide XML-based
information-integration in a particular vertical marketplace; it uses
semistructured-data technology partially suppored by the grant. Goals, Objectives, and Targeted Activities
Goals, Objectives, and Targeted ActivitiesIn addition to several works on data warehousing reported last year, that were published this year [1, 2], we have addressed semistructured data integration, and query optimization for mediator systems.  examines the design of good query plans when using semistructured data.  introduces a logical language that looks like Datalog, but with placeholders for arbitrary numbers of arguments in predicates; its purpose is to model semistructured data in which the schema is not fixed. A major result is that this language can model a source that accepts full select-project-join queries, while that sort of capability cannot be modeled in Datalog.
The other direction has been optimization of query plans, using a query-centric mediator system such as the Stanford TSIMMIS. In [7, 8] we introduced the possibility that sources would only answer queries in which certain arguments are bound.  shows how to answer a path query by getting only the essential sources off the path that can contribute bindings; those bindings allow us to ask queries along the path with bindings that we couldn't get by following the path.  Answers the question whether by so doing we get all obtainable answers to the query, and several related issues.
The grant is terminating, and a proposal to continue work in the direction of query optimization for mediator and warehouse systems has been filed.
I recommend the Survey by Alon Levy and my own Course Notes on the subject of information integration.