Eclipse DataEggs provides datasets related to the development of Eclipse projects, mainly for software practitioners and researchers.
The datasets include various pieces of data retrieved from the Eclipse forge: mailing lists, project development data, and AERI stacktraces, all in handy CSV and JSON formats. Each dataset comes with R Markdown documents that describe its content and give hints on how to use it. The examples provided mainly use the R statistical analysis software.
The datasets provided include:
- Mailing lists (full mboxes and CSV extracts) hosted at the Eclipse forge, with their documentation and examples.
- AERI exception stacktraces (historical data only, no longer updated), comprising two datasets: problems (see documentation) and incidents (see documentation).
- Development data from Eclipse projects. Depending on data sources available for each project, the following information is provided:
  - SCM (git).
  - ITS (Bugzilla, GitHub issues, GitLab issues).
  - CI (Jenkins).
  - PMI checks.
  - Stack Overflow statistics.
  - Scancode analysis (executed on our server).
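Although the bundled examples use R, the CSV extracts can be read with any language. The following is a minimal Python sketch, not an official example: the sample data, the list name `example-dev`, and the column names (`list`, `sent_at`, `sender`) are hypothetical placeholders, so check the documentation of each dataset for its actual schema.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV extract.
# The column names (list, sent_at, sender) are assumptions for
# illustration only, not the real schema of any DataEggs dataset.
SAMPLE_CSV = """list,sent_at,sender
example-dev,2020-01-03,a1b2c3
example-dev,2020-01-04,d4e5f6
example-dev,2020-01-09,a1b2c3
"""


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Count messages per (anonymised) sender in the sample data.
counts = count_by_column(io.StringIO(SAMPLE_CSV), "sender")
for sender, n in counts.most_common():
    print(sender, n)
```

To run the same function on a downloaded file, pass it a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.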
Privacy has been a major concern from the beginning. Once extracted, data is anonymised using data-anonymiser and published in the downloads section of the project. See our documentation for more details.
All data related to projects is retrieved from the Eclipse Alambic instance at https://eclipse.alambic.io. Alambic is an open-source framework for development data extraction and processing. For more information, see https://alambic.io.
All work on the Eclipse DataEggs project is handled transparently at https://gitlab.eclipse.org/eclipse/dataeggs/.
We’re open: if you’d like to contribute, please join us! You can:
- Get the code and propose merge requests on the DataEggs repository.
- File an issue on the Eclipse GitLab project page if you have any problem, request, or question.
All datasets are published under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.