Re: [cross-project-issues-dev] Anonymisation of public data

From: Boris Baldassari <boris@xxxxxxxxxxxxxx>
Date: Fri, 27 Apr 2018 16:06:33 +0200

On 27/04/2018 12:57, Gunnar Wagenknecht wrote:

Hi Boris,

Hi Gunnar,

I was one of the people asking off-list because I have a concern with encryption as a technology for anonymizing data. It immediately raises a red flag for me because it allows the data to be de-anonymized. Thus, I would like to see data masking techniques such as hashing used instead of encryption. To be clearer, I find it suspicious that reversible anonymization must be used in the first place.

This anonymisation mechanism is not only meant for the Eclipse datasets. It's meant to be used by other teams and projects too, hence the requirement/feature. In the specific context of the Eclipse datasets, we'll not even *save* the key, so it's rather safe, especially considering we're talking about public data.

And I'm not certain hashing is better than encrypting (assuming the key is really thrown away) because of rainbow tables and similar techniques. And since we're talking about public data, cracking the encryption (or hashing) is a *lot* harder than simply reading the public sources.
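
To illustrate the rainbow-table point, here is a minimal Python sketch. It is not the Crossminer implementation, which as described above uses encryption with a key that is not saved. As an assumption for illustration, it swaps in a keyed hash (HMAC-SHA256) to stand for "any transformation whose key is thrown away", and compares it with a plain unkeyed SHA-256. The function names and the example addresses are hypothetical.

```python
import hashlib
import hmac
import secrets


def plain_hash(value: str) -> str:
    """Unkeyed SHA-256: anyone can recompute it for a guessed value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_pseudonym(value: str, key: bytes) -> str:
    """HMAC-SHA256: cannot be recomputed without the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(target_digest: str, candidates: list[str]):
    """Try to recover the input by plain-hashing a list of known values.

    This is the idea behind rainbow tables: the attacker already holds
    a list of plausible inputs (here, publicly visible addresses).
    """
    for candidate in candidates:
        if plain_hash(candidate) == target_digest:
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical addresses an attacker could collect from public sources.
    known_addresses = ["bob@example.org", "alice@example.org"]
    email = "alice@example.org"

    # The plain hash is reversed by simple lookup.
    print(dictionary_attack(plain_hash(email), known_addresses))

    # With a random key that is never stored, the same lookup fails.
    key = secrets.token_bytes(32)
    print(dictionary_attack(keyed_pseudonym(email, key), known_addresses))
    del key  # drop the reference; the key is never written to disk
```

Within one run the same key maps the same address to the same pseudonym, so records can still be joined across the dataset. Once the key is gone, the mapping cannot be rebuilt from the published data alone.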

Can you also be more specific about what public data and which API endpoints you are going to use?
I assume it's anything that is public in Git already, which makes this discussion obsolete as everything is already public. But I want to confirm that none of the API endpoints require authentication to get data you wouldn't get without authentication.

*All* the data we retrieve is public, and I confirm no auth is required to access it. There will be Git, Bugzilla, Forums, CI, SonarQube, mailing lists -- all of which can be accessed by anyone publicly.

This does not make the discussion obsolete, however. Even if the information is public, I do NOT want to ease the work of spammers or malicious people (i.e. it'd be easier for them to read CSVs than git log). Hence the anonymisation, even if it's on public data.

I'd be happy to have reviewers for the datasets, by the way. So if anybody is willing to double check results or be part of the process, please let me know.

One last thing: Crossminer is an EU-funded project, and they pay attention to privacy (especially with the upcoming GDPR), so the context is rather safe, and you actually do not even need to trust me personally. :-)

Once again, your concern and associated feedback are welcome, and I'm happy to discuss that.


Cheers!


--
boris


Best,
Gunnar
