PoLAR: Postdiction of Large Archival Repositories

Overview

The ever-increasing demand to store and use data in perpetuity is limited by storage cost, which is decreasing slowly compared with the exponential growth of computational power. Under these circumstances, the deliberate loss of detail in data as it ages (referred to as data decay) is useful because it allows the cost of storing data to decrease alongside the data’s utility. Data postdiction is a data decay method that uses machine learning techniques to recover previously deleted values from data storage. This project proposes and evaluates a new pipeline that combines clustering, outlier detection, machine learning, and accuracy tuning to implement effective data postdiction for archival data. The overall goal is to train machine learning models to estimate database features, so that entire columns can be deleted and later reconstructed, within some threshold of accuracy, from the stored models. We evaluate the effectiveness of our postdiction pipeline in terms of storage reduction and data recovery accuracy using a real healthcare dataset. Our preliminary results show that the order in which outlier detection, clustering, and machine learning methods are applied leads to different trade-offs between storage and recovery accuracy.

For details, please refer to our first paper, published at ADBIS’23.
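
To make the idea of column postdiction concrete, the following is a minimal sketch, not the project's actual pipeline. It assumes a single numeric predictor column, substitutes a plain least-squares line for the machine learning model, and omits clustering entirely. The function names (fit_line, decay_column, postdict_column), the column names, and the sample rows are all hypothetical and made up for illustration. Values that the model cannot recover within the tolerance are treated as outliers and stored verbatim, so every reconstructed value stays within the chosen accuracy threshold.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def decay_column(rows, predictor, target, tolerance):
    """Delete `target` from every row, keeping a model and the outliers.

    Returns the decayed rows, the fitted (slope, intercept) model, and a
    dict mapping row index to the exact value for rows whose prediction
    error exceeds `tolerance`.
    """
    xs = [row[predictor] for row in rows]
    ys = [row[target] for row in rows]
    slope, intercept = fit_line(xs, ys)
    exceptions = {
        i: y
        for i, (x, y) in enumerate(zip(xs, ys))
        if abs(slope * x + intercept - y) > tolerance
    }
    decayed = [{k: v for k, v in row.items() if k != target} for row in rows]
    return decayed, (slope, intercept), exceptions


def postdict_column(decayed, predictor, target, model, exceptions):
    """Rebuild `target` from the stored model, preferring stored outliers."""
    slope, intercept = model
    restored = []
    for i, row in enumerate(decayed):
        value = exceptions.get(i, slope * row[predictor] + intercept)
        restored.append({**row, target: value})
    return restored


# Made-up sample records; the fifth row is deliberately anomalous.
rows = [
    {"age": 25, "systolic_bp": 112.0},
    {"age": 32, "systolic_bp": 116.0},
    {"age": 41, "systolic_bp": 121.0},
    {"age": 48, "systolic_bp": 125.0},
    {"age": 56, "systolic_bp": 171.0},
    {"age": 63, "systolic_bp": 134.0},
]
tolerance = 5.0

decayed, model, exceptions = decay_column(rows, "age", "systolic_bp", tolerance)
restored = postdict_column(decayed, "age", "systolic_bp", model, exceptions)
```

In this sketch the storage saved is the deleted column minus the two model parameters and the stored outliers, so a tighter tolerance generally means more stored exceptions and less saving. Detecting and removing outliers before fitting, rather than after as done here, would change both the fitted model and the number of exceptions, which is one simple way to see why the ordering of the pipeline stages matters.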

People

Faculty:

Panos K. Chrysanthis
Constantinos Costa

Graduate Students:

Brian T. Nixon
Trevor Petersen
Robbie Fishel

Undergraduate Students:

Anna Baskin
Scott Heyman