
Eclipse DataEggs provides datasets related to the development of Eclipse projects, mainly for software practitioners and researchers.

The datasets cover several kinds of data retrieved from the Eclipse forge: mailing lists, project development data, and AERI stacktraces, all in CSV and JSON formats. Each dataset comes with R Markdown documents that describe its content and give hints on how to use it. The examples provided mainly use the R statistical analysis software.
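Since the datasets ship as CSV and JSON, they can be read with any language's standard tooling. The project's own examples use R; the Python sketch below is only an illustration of the general idea. The sample rows, the column names (`list`, `sent_at`, `sender`) and the helper functions are hypothetical and do not reflect the real schema, which is described in each dataset's documentation.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV extract.
# The column names are assumptions, not the real DataEggs schema.
SAMPLE_CSV = """list,sent_at,sender
example-dev,2020-01-03,anon-1
example-dev,2020-01-04,anon-2
example-users,2020-01-04,anon-1
"""


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


def load_json_records(json_text):
    """Parse a JSON document that is expected to hold a list of records."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


# Count messages per mailing list in the sample data.
counts = count_by_column(io.StringIO(SAMPLE_CSV), "list")
print(counts.most_common())
```

With a real download, the `io.StringIO(...)` wrapper would be replaced by `open(path, newline="", encoding="utf-8")`, and the column name by one taken from the dataset's documentation.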

Content

The datasets provided include:

  • Mailing lists (full mboxes and CSV extracts) hosted at the Eclipse forge, with their documentation and examples.
  • AERI exception stacktraces (historical data only, no longer updated), in two datasets: problems (see documentation) and incidents (see documentation).
  • Development data from Eclipse projects. Depending on data sources available for each project, the following information is provided:
    • SCM (git).
    • ITS (Bugzilla, GitHub issues, GitLab issues).
    • CI (Jenkins).
    • PMI checks.
    • Stack Overflow statistics.
    • Scancode analysis (executed on our server).

Privacy has been a major concern from the beginning. Once extracted, data is anonymised using data-anonymiser and published in the downloads section of the project. See our documentation for more details.

All data related to projects is retrieved from the Eclipse Alambic instance at https://eclipse.alambic.io. Alambic is an open-source framework for development data extraction and processing. For more information, see https://alambic.io.

Contributing

All work on the Eclipse DataEggs project is handled transparently at https://gitlab.eclipse.org/eclipse/dataeggs/.

We’re open: if you’d like to contribute, please join us! You can:

Licensing

All datasets are published under the Creative Commons Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0).