
Remembering the Forgotten: Clustering, Outlier Detection, and Accuracy Tuning in a Postdiction Pipeline

DocUID: 2023-004

Authors: Anna Baskin, Scott Heyman, Brian T. Nixon, Constantinos Costa, Panos K. Chrysanthis

Abstract: The ever-increasing demand to use and store data in perpetuity is limited by storage cost, which is decreasing slowly compared to computational power's exponential growth. Under these circumstances, the deliberate loss of detail in data as it ages (referred to as data decay) is useful because it allows the cost of storing data to decrease alongside the data's utility. The idea of data postdiction as a data decay method uses machine learning techniques to recover previously deleted values from data storage. This paper proposes and evaluates a new pipeline using clustering, outlier detection, machine learning, and accuracy tuning to implement an effective data postdiction for archiving data. Overall, the goal is to train a machine learning model to estimate database features, allowing for the deletion of entire columns, which can later be reconstructed within some threshold of accuracy using the stored models. We evaluate the effectiveness of our postdiction pipeline in terms of storage reduction and data recovery accuracy using a real healthcare dataset. Our preliminary results show that the order in which outlier detection, clustering, and machine learning methods are applied leads to different trade-offs in terms of storage and recovery accuracy.
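Illustrative sketch (not taken from the paper): the abstract's core idea, deleting a column and later reconstructing it from a stored model within an accuracy threshold, can be shown on a toy scale. Everything here is an assumption made for illustration: the function names `fit_line`, `decay_column`, and `postdict_column` are hypothetical, a one-predictor least-squares line stands in for the machine learning model, a median-based residual rule stands in for outlier detection, and the clustering stage is omitted entirely.

```python
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least squares for y ~ a*x + b with a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0  # constant predictor: fall back to the mean
    return a, my - a * mx


def decay_column(xs, ys, tolerance):
    """Replace column ys by a small model plus the rows it cannot recover.

    Assumed stand-ins: a residual cutoff of 3x the median residual plays the
    role of outlier detection, and a straight line plays the role of the model.
    """
    # Pass 1: fit on all rows and flag rows with unusually large residuals.
    a, b = fit_line(xs, ys)
    residuals = [abs(a * x + b - y) for x, y in zip(xs, ys)]
    cutoff = 3 * median(residuals)
    inliers = [i for i, r in enumerate(residuals) if r <= cutoff]

    # Pass 2: refit on the inliers only, so outliers do not skew the model.
    a, b = fit_line([xs[i] for i in inliers], [ys[i] for i in inliers])

    # Accuracy threshold: keep verbatim any value the model misses by more
    # than `tolerance`; every other value in the column can be deleted.
    exceptions = {
        i: y
        for i, (x, y) in enumerate(zip(xs, ys))
        if abs(a * x + b - y) > tolerance
    }
    return (a, b), exceptions


def postdict_column(xs, model, exceptions):
    """Reconstruct the deleted column from the model and the kept rows."""
    a, b = model
    return [exceptions.get(i, a * x + b) for i, x in enumerate(xs)]


# Toy data: a roughly linear column with one anomalous value at index 3.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.0, 4.1, 5.9, 20.0, 10.1, 11.9, 14.0]

model, exceptions = decay_column(xs, ys, tolerance=0.25)
restored = postdict_column(xs, model, exceptions)
print(model, exceptions, restored)
```

In this sketch the storage cost after decay is the two model parameters plus the rows kept as exceptions, so tightening `tolerance` trades storage for recovery accuracy. Refitting after the outliers are set aside is one possible ordering of the stages; the abstract reports that the choice of ordering changes this trade-off.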

Keywords: Data postdiction, Data Decaying, Lossy Compression, Clustering, Outlier Detection

Published In: European Conference on Advances in Databases and Information Systems (ADBIS)

ISBN: 978-3-031-42941-5

Volume: 1850

Pages: 46-55

Place Published: Barcelona, Spain

Year Published: 2023

DOI: 10.1007/978-3-031-42941-5_5

Project: Data Postdiction

Subject Area: Machine Learning, Data Decay, Big Data Challenges

Publication Type: Conference Paper

Sponsor: Others

Citation: Anna Baskin, Scott Heyman, Brian T. Nixon, Constantinos Costa, and Panos K. Chrysanthis. Remembering the Forgotten: Clustering, Outlier Detection, and Accuracy Tuning in a Postdiction Pipeline. European Conference on Advances in Databases and Information Systems (ADBIS). 1850:46-55. 2023. Barcelona, Spain. DOI: 10.1007/978-3-031-42941-5_5.