I think that there are several things that need discussion.
1. Which data users (explicitly or implicitly) provide
2. Under which terms of use this data is used by us and others
3. Who stores the data
4. Who can access the data and in which format (degree of anonymization)

1. The term 'data' subsumes quite a large range of information.
For SnipMatch this includes code snippets and maybe usage statistics (what has been used when, to update the ranking strategies). For Extdoc this may include information like comments, editorial actions, or user ratings. For Call Completion this includes the models that have to be delivered to the clients and information about the JARs they use (e.g., file fingerprints). For Chain Completion this may include usage statistics (as for SnipMatch, to improve ranking strategies) and code snippets. You can think of other information too.
2. This is an important topic that needs solid research. It will probably require us to get in contact with lawyers to clarify what's possible and what's required. It should be clear that everyone who shares data (code snippets etc.) must actually be allowed to share it. For me, it's basically the same as with the Eclipse Wiki: all users who contribute to it must agree to its terms of use. Is there a difference? Are these terms of use reusable for our use case? I guess I should prepare a detailed description of what gets collected and provided by whom, so that a lawyer can help here?
3. If I understood correctly, the Foundation has no bandwidth to host these services. In that case, I have to go back to my university and ask for permission to host these services somewhere close to our backbone, or raise some funding to put a server elsewhere. One question that comes to mind: if the Foundation is not hosting these services, can we deliver Code Recommenders with preconfigured URLs that point to external project servers? For instance, something like "code.recommenders.org"?
4. What is needed, and what is technically feasible? The raw data may eventually exceed terabytes (not in the first years, I guess :)). Honestly, I have no clue yet how much data will be collected and what information others may be interested in. What we have in mind is to create reference data sets for machine learners and SE researchers, to enable research that creates new tools and improves algorithms for code search, code recommendations etc. But these data sets will, for practical reasons, only include the subset of (anonymized) data needed for research purposes. Would this be satisfactory? Do you think some kind of agreement is needed?
Is there anything I'm currently not aware of?
Thanks,
Marcel

On 11.01.2012, at 00:08, Wayne Beaton wrote:
FWIW, the Eclipse Foundation has a single lawyer on staff. Though we
do retain the services of other lawyers. So I guess, "lawyers" is
generally accurate :-)
The project needs to make a case to the Eclipse Foundation for
capturing and maintaining this data. We are very concerned about
privacy, and so are many people in the community. There are actual
laws in some countries that need to be considered as well.
Since we are a transparent and open organization, there needs to be
consideration for disseminating the collected data to other parties.
With the usage data, we tried publishing filtered data (which
excluded anything that could potentially expose/identify specific
users) with limited success. We failed in this regard, which is a big
reason why we shut down the UDC.
Unfortunately, the Eclipse Foundation lacks the bandwidth to
maintain this data on your behalf.
Wayne
On 01/10/2012 05:36 PM, Marcel Bruch wrote:
Sounds good to me. But let's see what the Foundation's lawyers say about this... I'll keep you posted.
On 10.01.2012, at 23:26, Doug Wightman wrote:
Hi Marcel,
I think that's a great idea. For SnipMatch, it would probably make the
most sense to have wording to the effect that the contributor is
verifying that they own the code and is giving a royalty-free license
to use it for any purpose. This would be associated with a checkbox
that must be checked when the code is to be shared publicly. We
currently have something to this effect already built, but the wording
hasn't been run by lawyers.
Doug
On Tue, Jan 10, 2012 at 3:00 PM, Marcel Bruch <bruch@xxxxxxxxxxxxxxxxxx> wrote:
Hi PMC,
Code Recommenders is making good progress and we are confident that we'll
satisfy all major criteria for M5. The extended documentation platform, code
completion engines, and local code search engine are maturing quickly and
the SnipMatch guys will start at the end of January. The Java, RCP/RAP, and
Scout Packages expressed some interest in integrating Code Recommenders in
their package and we are working at full blast to make this happen.
One thing that hasn't been discussed in detail is how we deal with the
data users provide, for instance to SnipMatch's community code templates
store or to the extended documentation platform. Are special
wiki-like 'terms of use' needed? Where does this data go? Also, for
stacktrace search or model generation and model download some data needs to
be delivered to the client and submitted. We started this discussion a while
ago but postponed it.
I'd like to pick up the discussion again - early enough before Juno
arrives. I'm not sure whether this is a discussion for the PMC mailing list
since ultimately it's a decision of the Foundation. But Wayne will know, I
guess.
Thanks,
Marcel
_______________________________________________
recommenders-dev mailing list
recommenders-dev@xxxxxxxxxxx
http://dev.eclipse.org/mailman/listinfo/recommenders-dev