Eclipse Community Forums
[CR] How about data privacy? [message #638938] Sat, 13 November 2010 15:19
Marcel Bruch
Messages: 230
Registered: July 2009
Senior Member
Quote:

Security is one of the issues in the data mining part of the project. Do you already have some policy suggestions? Maybe a place to discuss them?



A good starting place (until the project is created) is this forum, so please put all your comments into this thread.

Just a few initial thoughts concerning data privacy.

Currently we have two different kinds of systems: those where the user stays in control over what she contributes to the community, and those that leverage implicit knowledge, for instance by analyzing code.

Two examples for the former kind of systems:
First, the example code search engine. Like every search engine, it stores only request-related information such as the request id and which examples have been looked at.

Second, large parts of the crowd-sourced documentation platform are based on user-provided content or explicit feedback (where a user rates something as being good or bad).

In both systems, people are absolutely in control of their data.



Intelligent code completion and usage-driven javadoc are somewhat different. Right now we extract the data for these tools by analyzing open-source projects that leverage Eclipse technology only. However, this does not scale if virtually every framework should be supported.

The question is: How do we get the data to learn how others used the APIs in question?

What we currently extract is the following:

1. For each local variable in code we record all methods invoked on this instance, all method calls in which this object is passed as a parameter, and the context in which this object is used (like PreferencePage.performOk()).

2. For each class we collect the name of the superclass, implemented interfaces, and overridden methods.
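
To make the two points above more concrete, here is a minimal Java sketch of what such records might look like. All class and field names (VariableUsage, ClassUsage, UsageRecordDemo) and the sample values are made up for illustration; this is not the project's actual data model.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point comes first so the file can be run directly.
final class UsageRecordDemo {
    // Builds one illustrative record; the values are invented sample data.
    static VariableUsage sample() {
        VariableUsage usage = new VariableUsage(
                "org.eclipse.jface.preference.IPreferenceStore",
                "PreferencePage.performOk()");
        usage.invokedMethods.add("setValue(String, String)");
        usage.passedAsArgumentTo.add("ExampleHelper.save(IPreferenceStore)"); // hypothetical callee
        return usage;
    }

    public static void main(String[] args) {
        VariableUsage usage = sample();
        System.out.println(usage.declaredType + " used in " + usage.enclosingContext);
        System.out.println("calls: " + usage.invokedMethods);
        System.out.println("passed to: " + usage.passedAsArgumentTo);
    }
}

// Point 1: what is recorded for each local variable (hypothetical shape).
final class VariableUsage {
    final String declaredType;       // type of the variable
    final String enclosingContext;   // method in which the variable is used
    final List<String> invokedMethods = new ArrayList<>();      // methods called on the variable
    final List<String> passedAsArgumentTo = new ArrayList<>();  // calls that receive it as a parameter

    VariableUsage(String declaredType, String enclosingContext) {
        this.declaredType = declaredType;
        this.enclosingContext = enclosingContext;
    }
}

// Point 2: what is recorded for each class (hypothetical shape).
final class ClassUsage {
    final String superclass;
    final List<String> interfaces;
    final List<String> overriddenMethods;

    ClassUsage(String superclass, List<String> interfaces, List<String> overriddenMethods) {
        this.superclass = superclass;
        this.interfaces = List.copyOf(interfaces);
        this.overriddenMethods = List.copyOf(overriddenMethods);
    }
}
```

Note that in this sketch only type names, method signatures and the enclosing method are stored; no source text, literals or variable names appear in the record.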

That's basically all the information we collect at the moment. I wonder whether this information would be considered too private, especially if we offered a package-based inclusion/exclusion filter like include: org.eclipse.**, java.**; exclude: com.sap.**
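
A rough Java sketch of how such a filter could behave, just to have something to discuss. The class PackageFilter and the matching rules are assumptions: here "a.b.**" matches the package a.b and everything below it, excludes win over includes, and anything that is not explicitly included is dropped.

```java
import java.util.List;

// Hypothetical package-based include/exclude filter; not an existing API.
final class PackageFilter {
    private final List<String> includes;
    private final List<String> excludes;

    PackageFilter(List<String> includes, List<String> excludes) {
        this.includes = List.copyOf(includes);
        this.excludes = List.copyOf(excludes);
    }

    // Assumed semantics: "a.b.**" matches "a.b" and any name starting with "a.b.";
    // a pattern without the ".**" suffix matches only that exact name.
    static boolean matches(String pattern, String typeName) {
        if (pattern.endsWith(".**")) {
            String prefix = pattern.substring(0, pattern.length() - 3);
            return typeName.equals(prefix) || typeName.startsWith(prefix + ".");
        }
        return typeName.equals(pattern);
    }

    // Assumption: an exclude always wins, and unmatched names are rejected.
    boolean accepts(String typeName) {
        for (String pattern : excludes) {
            if (matches(pattern, typeName)) {
                return false;
            }
        }
        for (String pattern : includes) {
            if (matches(pattern, typeName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PackageFilter filter = new PackageFilter(
                List.of("org.eclipse.**", "java.**"),
                List.of("com.sap.**"));
        System.out.println(filter.accepts("org.eclipse.swt.widgets.Button"));
        System.out.println(filter.accepts("com.sap.internal.Secret"));
        System.out.println(filter.accepts("javax.swing.JButton"));
    }
}
```

With this rule, data about a type would only leave the workstation if its package is on the include list, so a company could start with a very narrow list and widen it later.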

But what's your feeling about this kind of data collection? Do you have another proposal?


tw: @MarcelBruch
tw: @Recommenders

[Updated on: Sat, 13 November 2010 22:32]


Re: [CR] How about data privacy? [message #639084 is a reply to message #638938] Mon, 15 November 2010 09:10
Maxime Jeanmart
Messages: 35
Registered: November 2010
Member
Here are some thoughts and ideas.

While you stay limited to open source projects, I think the level of security you have is enough. However, some additional security levels should probably be added if you want to also reach the enterprise world.
Some companies may be reluctant to transmit information outside their sphere of influence. They'll be afraid that the data may be reused against them by competition, for example. I don't even think that all companies will trust their developers in making the right choice when it comes to deciding whether data can be openly submitted or not.
So I see several possibilities to overcome this issue, which can be used separately or combined:
- Private mining and data servers, working on a LAN, VPN or subnet
- Security profiles:
-> Data profiles (e.g.: this data has the privacy level 'internal use')
-> User profiles (e.g.: this user can only submit 'internal use' data)
-> Code profiles (e.g.: this project can only submit 'internal use' data)

Technically, this will clearly make the tools more complex to manage. For the private servers, that means Eclipse should ideally consolidate data coming from several servers. For the security profiles, it requires different result sets for the different security levels and some administrative work. If the administrative work is too heavy, then the tools won't be used either. Some balance must be found.

We also need to acknowledge that there are different types of developers and frameworks. Some people are framework implementers, some extend the framework, and some are framework users. The examples will be different according to the usage pattern. We could assign some relevant tags to the pattern to refine the results.
Besides, some frameworks are internal to a company and some are public. It would be useful to implement different levels of privacy in some way. For example, we could have:
- public data: open to anyone.
- restricted data: open to the code owning company and affiliates.
- team data: open to the team only.
- private data: not shared.

Using the restricted level, companies can benefit from the tools without giving information about the framework internals to competition.
Using the team level, developers can use patterns that are private to the framework implementation (like when you decide to create 'internal' packages in an Eclipse bundle), without the risk of polluting the suggestions with examples that can't be applied externally.
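
To illustrate the four levels, here is a small Java sketch of a visibility check. Everything in it (PrivacyLevel, Party, SharedPattern, the sample names) is hypothetical, and "affiliates" of a company are left out to keep it short.

```java
// Demo entry point comes first so the file can be run directly.
final class PrivacyLevelDemo {
    public static void main(String[] args) {
        Party owner = new Party("alice", "ui-team", "acme");
        Party colleague = new Party("bob", "core-team", "acme");
        SharedPattern teamPattern = new SharedPattern(PrivacyLevel.TEAM, owner);
        System.out.println("owner sees it: " + teamPattern.visibleTo(owner));
        System.out.println("other team sees it: " + teamPattern.visibleTo(colleague));
    }
}

// The four levels listed above.
enum PrivacyLevel { PUBLIC, RESTRICTED, TEAM, PRIVATE }

// Hypothetical identity of whoever owns or requests a pattern.
final class Party {
    final String user;
    final String team;
    final String company;

    Party(String user, String team, String company) {
        this.user = user;
        this.team = team;
        this.company = company;
    }
}

// A mined usage pattern tagged with a privacy level and its owner.
final class SharedPattern {
    final PrivacyLevel level;
    final Party owner;

    SharedPattern(PrivacyLevel level, Party owner) {
        this.level = level;
        this.owner = owner;
    }

    boolean visibleTo(Party requester) {
        switch (level) {
            case PUBLIC:
                return true; // open to anyone
            case RESTRICTED:
                // Simplification: same company only, affiliates are not modelled.
                return owner.company.equals(requester.company);
            case TEAM:
                return owner.company.equals(requester.company)
                        && owner.team.equals(requester.team);
            case PRIVATE:
            default:
                return owner.user.equals(requester.user); // not shared
        }
    }
}
```

A recommender would then filter its result set with visibleTo before showing suggestions, which is where the different result sets per security level come from.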

Still, this kind of security framework is really not a first priority matter. I guess that in terms of benefit / effort ratio, the support for multiple servers might be the best option. A configuration that consolidates data from 1 private and 1 public server is probably a good start. Is it something that is possible?
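
What I have in mind for the 1 private + 1 public configuration is roughly the following Java sketch. The KnowledgeBase interface, its callCounts method and the numbers are invented for the example; I don't know what the real server interface looks like.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Demo entry point comes first so the file can be run directly.
final class ConsolidationDemo {
    public static void main(String[] args) {
        // Two stand-in servers returning invented counts.
        KnowledgeBase publicServer = type -> Map.of("setText", 40, "addListener", 10);
        KnowledgeBase privateServer = type -> Map.of("setText", 5, "setData", 3);

        KnowledgeBase both = new ConsolidatedKnowledgeBase(List.of(publicServer, privateServer));
        System.out.println(both.callCounts("org.eclipse.swt.widgets.Button"));
    }
}

// Hypothetical query interface of one knowledge base server.
interface KnowledgeBase {
    // How often each method was observed being called on the given type.
    Map<String, Integer> callCounts(String typeName);
}

// Client-side view that sums the observations of several servers.
final class ConsolidatedKnowledgeBase implements KnowledgeBase {
    private final List<KnowledgeBase> sources;

    ConsolidatedKnowledgeBase(List<KnowledgeBase> sources) {
        this.sources = List.copyOf(sources);
    }

    @Override
    public Map<String, Integer> callCounts(String typeName) {
        Map<String, Integer> merged = new HashMap<>();
        for (KnowledgeBase source : sources) {
            // Add this server's counts to whatever was collected so far.
            source.callCounts(typeName).forEach((method, count) -> merged.merge(method, count, Integer::sum));
        }
        return merged;
    }
}
```

The merge happens on the client, so the private server never has to send anything to the public one.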
Re: [CR] How about data privacy? [message #639675 is a reply to message #639084] Wed, 17 November 2010 12:43
Marcel Bruch
Hi Max,

I'm sorry for the delay. I'm traveling a lot in November to spread the word. Some of the events are Eclipse Demo Camps, for instance in Bonn, Kassel, Dortmund, and Karlsruhe. If anyone wants to attend these camps, there are still a few seats available, so register soon.

Back to your post: I like the idea of having profiles and multiple knowledge bases and I guess this is something we need to have for commercial use. But I'm not sure whether we should support this during incubation. My feeling is that this would cut us off from the feedback we need to learn how people use these tools, what their (the tools') limitations are, and how to improve them.

What do you think: would we exclude too many users from supporting the project if we started out with just one public profile?

Another thing I wonder (maybe the most important question) : Who/how many developers will contribute their knowledge? Will they do it just to help others or do they need some kind of reward for sharing?

What's your opinion? Would you share (some parts) of your data or contribute by writing some code snippets, discussing reasons for stack traces, or judging the quality of a piece of documentation? In short: What would you share - and what would prevent you from sharing?

Cheers,
Marcel


Re: [CR] How about data privacy? [message #639737 is a reply to message #639675] Wed, 17 November 2010 15:55
Maxime Jeanmart
Yes indeed, this doesn't need to be part of the incubation project. If you find the idea interesting, you may just put them in a list of possible future developments. Then you'll see if people think it's important or not.

Only having a public profile and public knowledge base is surely fine for the beginning. People from open source projects and individuals will probably be willing to collaborate freely. For companies, it's different. They might just use the knowledge base without contributing to it. To be more pragmatic, I think this will mostly depend on whether access to the repository is blocked by a firewall or not. If that is a problem, then until you have the capability to support different profiles and knowledge bases, maybe it's possible to run a copy of the public knowledge base inside a private network. This would probably calm security officers down...

Now to answer your questions, I think that open source projects will happily share. Small private projects might also. A key point to get people to use the tool and share knowledge is that it must be fast, easy and non-intrusive. It's mostly a matter of usability here and it's the usual stuff: have configuration pages for everything that repeats itself, use default values, reuse previous values (if relevant), offer a wizard page or dialog, require as few clicks as possible...

When it comes to companies (private projects in general), you have to deal with other factors that are not related to individuals:
- You have internal security policies to manage, on which developers have little impact.
- You're dealing with team code and not individual developer code.
This has several consequences:
- Are the requests and contributions allowed to pass through the firewalls, and will the firewalls let code through?
- It's more difficult to reward someone when the related code belongs to a team. Who is to be rewarded? What happens when mining code that is replicated for every team member through version control views? It's still possible for stack traces and doc, though. A reward can be as simple as a "best code sample of the week" or a stack trace committer chart...

And yet, company support can also be a strength. If a company (or team) realizes that the tool has a real global added value (better quality, faster development), using the tool may become a rule. It will then bring a lot of committers at once.

Best regards,
Max
Re: [CR] How about data privacy? [message #639762 is a reply to message #639737] Wed, 17 November 2010 17:17
Marcel Bruch
Quote:
you may just put them in a list of possible future developments. Then you'll see if people think it's important or not.


Sure. We will put a feature request into bugzilla - as soon as we have one at Eclipse. Developers may comment on and vote for this feature.

Quote:
[...]copy of the public knowledge base inside a private network. This would probably calm security officers down...

If these systems do not contribute back to the community, this is not much different from an in-house solution, right? However, only a small set of tools requires analysis of your code, and I tend to think that the data we collect is not business-sensitive in nature. Maybe it would be best to start first and make transparent which data is actually collected to build these systems. If people understand and see what actually gets collected, some of the worries may fade away :) But anyway, this discussion is needed, and if companies (then) still have concerns about what gets collected, we should discuss it again and adapt the mechanisms.

Quote:
Regarding rewarding contributors: "best code sample of the week" or a stack trace committer chart...

Nice idea. Providing some usage statistics as the system evolves was on our feature list - but adding some visibility for contributors could be one way to reward people, maybe in a similar way to how systems like stackoverflow.com or ohloh.net reward their users.

For now I would say: let's see what worries arise after the system is in place, constantly ask what prevents people and companies from using and contributing to the system, and stay open-minded about other solutions and about adapting the process. Finally, in-house knowledge bases (or specialized knowledge bases for special collections of frameworks etc.) might resolve many of their reservations. If even companies share their knowledge of how to use open-source platforms like Eclipse or Hibernate with the community, there is no reason to be against in-house solutions (but after incubation :) ) ...

best,
Marcel

