Re: [cross-project-issues-dev] Anonymisation of public data
On 27/04/2018 12:57, Gunnar Wagenknecht wrote:
I was one of the people asking off-list because I have a concern with encryption as a technology for anonymizing data. It immediately raises a red flag for me because it allows the data to be de-anonymized. Thus, I would like to see data masking techniques such as hashing used instead of encryption. To be clearer, I find it suspicious that reversible anonymization must be used in the first place.

This anonymisation mechanism is not only meant for the Eclipse datasets.
It's meant to be used by other teams and projects too, hence the
requirement/feature. In the specific context of the Eclipse datasets,
we'll not even *save* the key so it's rather safe, especially
considering we're talking about public data.
And I'm not certain hashing is better than encrypting (assuming the key
is really thrown away) because of rainbow tables and similar techniques.
And since we're talking about public data, cracking the encryption (or
hashing) is a *lot* harder than simply reading the public sources.
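
To illustrate the rainbow-table point, here is a minimal Python sketch. It is not the actual Crossminer mechanism: the function names and the example addresses are hypothetical, and a keyed HMAC with a random key stands in for "encryption with a key that is thrown away", simply because it needs nothing beyond the standard library. It shows that an unsalted hash of a public identifier can be reversed by hashing every candidate from the public sources, whereas a keyed value cannot be matched once the key is gone.

```python
import hashlib
import hmac
import secrets
from typing import List, Optional


def plain_hash(value: str) -> str:
    """Unsalted SHA-256: the same input always gives the same token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_pseudonym(value: str, key: bytes) -> str:
    """HMAC-SHA256 with a secret key (assumed stand-in for keyed encryption)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(token: str, candidates: List[str]) -> Optional[str]:
    """Hash every known candidate and return the one matching the token."""
    for candidate in candidates:
        if plain_hash(candidate) == token:
            return candidate
    return None


# Hypothetical addresses, as an attacker could collect them from public logs.
public_addresses = ["alice@example.org", "bob@example.org"]

# Plain hashing: the token is reversed by a simple lookup.
hashed = plain_hash("alice@example.org")
recovered = dictionary_attack(hashed, public_addresses)

# Keyed masking: stable while the key exists, so records can still be linked.
key = secrets.token_bytes(32)
masked = keyed_pseudonym("alice@example.org", key)
masked_again = keyed_pseudonym("alice@example.org", key)

# Throw the key away; the same lookup no longer finds a match.
del key
unresolved = dictionary_attack(masked, public_addresses)
```

In this sketch `recovered` is the original address while `unresolved` is `None`, which is the asymmetry described above. It does not change the other point: anyone can still read the same identifiers directly from the public sources.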
*All* the data we retrieve is public, and I confirm no auth is required to
access it. The sources will be Git, Bugzilla, Forums, CI, SonarQube, mailing
lists -- all of which can be accessed by anyone publicly.
Can you also be more specific about what public data and which API endpoints you are going to use?
I assume it's anything that is public in Git already, which makes this discussion obsolete as everything is already public. But I want to confirm that none of the API endpoints require authentication to get data you wouldn't get without authentication.
This does not make the discussion obsolete, however. Even if the
information is public I do NOT want to ease the work of spammers or
malicious people (i.e. it'd be easier for them to read CSVs than git log).
Hence the anonymisation, even if it's on public data.
I'd be happy to have reviewers for the datasets, by the way. So if
anybody is willing to double check results or be part of the process,
please let me know.
One last thing: Crossminer is an EU-funded project, and they pay
attention to privacy (especially with the upcoming GDPR), so the context
is rather safe, and you actually do not even need to trust me.
Once again, your concern and associated feedback are welcome, and I'm
happy to discuss that.