This project addresses fundamental challenges in privacy-enhancing technologies related to the statistical disclosure of sensitive data in the publication of surveys, or of any database combining demographic and confidential attributes. Its potential impact stems from its broad applicability to practical information systems designed for the collection, analysis or dissemination of anonymized data for the ultimate purpose of statistical study, in contexts including, but not limited to, socioeconomics, electronic voting, healthcare, targeted advertising, personalized content recommendation and social networks.

The objective of the project is twofold, encompassing two research aspects directly applicable to the development of practical mechanisms for statistical disclosure control: one concerning computation, the other concerning user trust. Together, these aspects significantly widen the range of practical applicability in modern information systems. A representative example of the simultaneous application of both aspects is the anonymization of large-scale demographic surveys with uncertain respondent participation.

More precisely, the scope of the project lies within the field of statistical disclosure control, which concerns the postprocessing of the demographic portion of the statistical results of surveys containing sensitive personal information, in order to effectively safeguard the anonymity of the participating respondents. The literature on the subject abounds with algorithms designed to this end, along with quantitative evaluations on various standardized datasets. Unfortunately, most published work pays little attention to the development of specific mechanisms aimed at carefully managing the substantial computational costs incurred on large datasets, often quadratic in the number of records.
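To make the cost concern concrete, the following is a minimal sketch of an MDAV-style microaggregation heuristic (purely illustrative; not the project's own algorithm, and the function name is hypothetical). The repeated distance computations over the remaining records are what make such heuristics roughly quadratic in the number of records.

```python
import numpy as np

def mdav_microaggregate(records, k):
    """Illustrative MDAV-style k-anonymous microaggregation sketch:
    repeatedly pick the record farthest from the overall centroid,
    group it with its k-1 nearest neighbours, and replace each group
    by its centroid. Each iteration scans all remaining records,
    which makes the cost roughly quadratic in the number of records."""
    records = np.asarray(records, dtype=float)
    anonymized = records.copy()
    remaining = list(range(len(records)))
    while len(remaining) >= 2 * k:
        pts = records[remaining]
        centroid = pts.mean(axis=0)
        # Record farthest from the centroid seeds the next group.
        far = remaining[int(np.argmax(np.linalg.norm(pts - centroid, axis=1)))]
        # Group it with its k-1 nearest neighbours (itself included, distance 0).
        d = np.linalg.norm(records[remaining] - records[far], axis=1)
        group = [remaining[i] for i in np.argsort(d)[:k]]
        anonymized[group] = records[group].mean(axis=0)
        remaining = [i for i in remaining if i not in group]
    if remaining:  # fold the leftover records (fewer than 2k) into one group
        anonymized[remaining] = records[remaining].mean(axis=0)
    return anonymized
```

Each published record is thus the centroid of a group of at least k original records, so any record is indistinguishable from at least k − 1 others within the aggregated attributes.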
Further, such anonymization algorithms are conceived under the assumption that the computational system executing them is also in possession of the entirety of the raw data, unequivocally linking respondent identities with confidential information. This assumption inherently limits their applicability to the not always realistic scenario in which respondents are willing to disclose personal data to a trusted centralizing entity. It also raises the question of whether the data could be preprocessed locally by each respondent, prior to its delivery to an untrusted server. If so, this would profoundly influence the willingness of respondents to disclose confidential data, and the overall safety of the anonymization process. Ultimately, it would greatly widen the feasible range of applications for the collection of sensitive user data across a broad diversity of information systems.

This project addresses both the practical, traditional anonymization of large datasets and the new paradigm of local preprocessing for wider applicability. More technically, it addresses the operational improvement of traditional k-anonymous microaggregation for large datasets, as well as the development of more general variants based on statistical models of uncertain respondent participation, in which k-anonymity is accordingly enforced with a probabilistic guarantee. The combination of both aspects permits large-scale data collection, analysis and publication in which demographic attributes may be anonymized more efficiently and safely, even before they are collected.
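As one illustration of a probabilistic k-anonymity guarantee under uncertain participation (a hypothetical model sketched for exposition, not necessarily the formulation developed in the project), suppose each invited respondent participates independently with probability p. One may then size each anonymization group so that at least k respondents actually participate with probability at least 1 − δ:

```python
from math import comb

def required_group_size(k, p, delta):
    """Smallest group size m such that, if each of m invited respondents
    participates independently with probability p, at least k of them
    actually participate with probability >= 1 - delta. The model and
    parameter names are illustrative, not taken from the cited work."""
    def prob_at_least_k(m):
        # P[Binomial(m, p) >= k] = 1 - sum_{i < k} C(m, i) p^i (1-p)^(m-i)
        return 1.0 - sum(comb(m, i) * p**i * (1 - p)**(m - i)
                         for i in range(k))
    m = k  # at least k invitations are needed in any case
    while prob_at_least_k(m) < 1.0 - delta:
        m += 1
    return m
```

Under this toy model, lower participation rates or stricter confidence requirements inflate the group size beyond the deterministic k, which is the price of enforcing k-anonymity probabilistically before the data are actually collected.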
Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016
Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Proyectos de I+D+i para jóvenes investigadores sin vinculación o con vinculación temporal
Gobierno de España. Ministerio de Economía y Competitividad (MINECO)
Pallares, E.; Rebollo-Monedero, D.; Rodríguez-Hoyos, A.; Estrada, J.; Mezher, A.; Forne, J. Expert Systems with Applications, vol. 144, pp. 113086:1–113086:17, DOI: 10.1016/j.eswa.2019.113086. Date of publication: 2019-11-11. Journal article.