Privacy-preserving Data Sanitization

Monday, 07/23/2012 6:00am to 8:00am
Ph.D. Dissertation Proposal Defense

Wentian Lu

Computer Science Building, Room 151

With the emergence of digital devices and the internet, much of our information is maintained by industries such as healthcare, insurance, and IT. Not surprisingly, when people request access to this data for monitoring, understanding, and analysis, privacy concerns are one of the main barriers to such activities. How to enable third parties to monitor and analyze these data sets while still preserving privacy is therefore an important and practical problem. In this proposal, we explore three privacy-preserving problems that face this difficulty and discuss the design of data sanitization processes to overcome it.

In the first, we address the problem of auditing databases under retention restrictions. Auditing the changes to a database is critical for identifying malicious behavior, maintaining data quality, and improving system performance. But an accurate audit log is a historical record of the past that can also pose a serious threat to privacy. Policies that limit data retention conflict with the goal of accurate auditing, and data owners must carefully balance policy compliance against audit accuracy. We propose a framework for auditing the changes to a database system while respecting data retention policies. Our framework includes a historical data model that supports flexible audit queries, along with a language for retention policies that can hide individual attribute values or remove entire tuples from the history. Under retention policies, the audit history is partially incomplete, so audit queries on the protected history may return imprecise results.
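As a toy illustration of this idea (a simplified sketch, not the proposal's actual data model), consider a tuple history in which a retention policy redacts an attribute's old values; an audit query over the protected history must then admit a third answer, "unknown":

```python
from dataclasses import dataclass

REDACTED = object()  # marker for values hidden by a retention policy


@dataclass
class Version:
    """One historical version of a tuple: a timestamp and its attribute values."""
    time: int
    values: dict


def apply_policy(history, attr, before):
    """Hide attribute `attr` in every version recorded before time `before`."""
    return [Version(v.time,
                    {k: (REDACTED if k == attr and v.time < before else val)
                     for k, val in v.values.items()})
            for v in history]


def audit(history, attr, time, expected):
    """Did `attr` equal `expected` at `time`? Returns True, False, or "unknown"."""
    current = None
    for v in sorted(history, key=lambda v: v.time):
        if v.time <= time:
            current = v.values.get(attr)
    if current is REDACTED or current is None:
        return "unknown"
    return current == expected


history = [Version(1, {"salary": 50}), Version(5, {"salary": 60})]
protected = apply_policy(history, "salary", before=3)
```

On the full history, `audit(history, "salary", 2, 50)` answers True; on the protected history the same query returns "unknown", because the value in force at time 2 was redacted by the policy.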

Next, we consider the problem of privately generating synthetic data sets on which researchers can run analyses freely. For tabular data, we focus on system performance evaluation tasks. Consider the challenge faced by a database vendor or researcher who has designed a novel database technology and would like to evaluate its performance in the context of a real enterprise in order to measure performance gains. Such an evaluation would ideally be carried out using the actual data and query workloads of the enterprise. However, proprietary data sets are usually not available to third-party analysts, for privacy and other reasons. To overcome this barrier, we propose techniques for synthesizing a relational database instance that matches the performance properties of the original database, especially with respect to a given target workload of SQL queries. For sanitization, we adopt a strong privacy definition, differential privacy, and our contribution lies in adapting it to our framework.
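The standard building block for differential privacy is the Laplace mechanism; the sketch below (illustrative only, not the proposal's algorithm) releases a counting statistic, such as the cardinality of a workload query's result, with epsilon-differential privacy:

```python
import math
import random


def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverse-transform sampling."""
    u = random.random() - 0.5
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))


def noisy_count(records, predicate, epsilon):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one record
    changes the answer by at most 1), so Laplace noise with scale
    1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller epsilon means stronger privacy but noisier counts; a synthesizer built on such noisy statistics inherits their privacy guarantee by post-processing.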

For network data, researchers are interested in a set of graph properties or simulations. Because networks often contain personal information about their participants, it is generally difficult for institutions to release their network data. To overcome these privacy obstacles to sharing network data, we propose a model-based graph generation method that selects an expressive model of the network, sanitizes it under differential privacy, and releases the sanitized model. Analysts can then use the model to generate multiple synthetic networks, allowing them to analyze the ensemble of graphs consistent with the published model.
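The fit-sanitize-release-sample pipeline can be sketched with the simplest possible graph model, an Erdős-Rényi edge density (the proposal's actual models are more expressive; this is purely illustrative). The model parameter is perturbed under edge-level differential privacy and then used to sample synthetic graphs:

```python
import math
import random


def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverse-transform sampling."""
    u = random.random() - 0.5
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))


def private_edge_probability(n, num_edges, epsilon):
    """Estimate an n-node graph's edge density under edge-level
    differential privacy: adding or removing one edge changes the edge
    count by at most 1, so Laplace(1/epsilon) noise suffices; the
    result is clamped to a valid probability."""
    noisy = num_edges + laplace_noise(1.0 / epsilon)
    max_edges = n * (n - 1) / 2.0
    return min(1.0, max(0.0, noisy / max_edges))


def sample_graph(n, p):
    """Draw one synthetic graph (as an edge list) from the released G(n, p) model."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if random.random() < p]
```

Because the noisy parameter is released once, analysts can sample as many synthetic graphs as they like without consuming additional privacy budget.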

Advisor: Gerome Miklau