Affordable Access

Publishing set-valued dataset : strengthening the Disassociation approach to improve both privacy preservation and utility

  • Awad, Nancy
Publication Date
Sep 17, 2020
External links


This thesis addresses the problematic of anonymization for set-valued datasets, also known as transactional data. The work is based on an anonymization technique specific for set-valued data defined by Terrovitis as “Disassociation”. This technique works under the assumption that data values should not be altered, contrary to differential privacy, or suppressed, unlike k-anonymity. The duality character of disassociation is investigated. First, the position of disassociation facing data utility and knowledge extraction is evaluated and improved. Second, the truthfulness of disassociation towards protection of individuals’ private life under its own privacy model, is studied and adjusted. On a first observation on disassociation, the utility of the information in a disassociated dataset is investigated. By reason of probabilistic analysis, it is proven that various associations in a disassociated dataset suffer from information loss. Therefore, to increase the utility value of a predefined set of associations, specified as “utility rules” by the user, the clustering process of disassociation is optimized, using ant-based clustering for the utility rules in question. Disassociation suffers from a privacy breach for homogeneity attacks, defined as the “cover problem” in 2016. To address this problem, a solution is proposed by using partial suppression and noise addition. The correctness of the solution is investigated and proven, where every cover problem is resolved and no new cover problem is generated by the proposed solution. Finally, as disassociation isn’t a common data form, it is hard for machine Learning algorithms and data analyst to extract information and exploit the data in its current form. Re-expressing the data of the anonymized set-value datasets by disassociation in its original form, is a theoretical solution that can bring back data analysis techniques closer to anonymized data. A probabilistic re-association algorithm is thus proposed, sensitive to the probabilistic distribution of the associations in a cluster. This solution relies on an elaborated definition of neighbor datasets to prove its sensitivity and respect to the privacy constraints. The fidelity of the solution to data utility preservation is evaluated using the most exploited data analysis techniques over set-value data: mining frequent itemsets and association rules. In conclusion, this work digs deep in the field of anonymization for set-valued datasets. Starting from a defined anonymization technique known as disassociation, a privacy breach, the “cover problem”, is addressed for a solution and data utility is investigated within the disassociated dataset and for future uses. Results are impressive in terms of data utility and privacy preservation.

Report this publication


Seen <100 times