[rdf4j-dev] Proposal: merging rdf4j-storage and rdf4j back together

Back in 2017 (see https://www.eclipse.org/lists/rdf4j-dev/msg00410.html ) we made a decision to split the rdf4j project over multiple repositories. The main motivation for this was that a full build + verification of the project was taking too long, and this encouraged contributors to take shortcuts. The theory was that by splitting the project, we could get verification time down.

However, I think the expected speed gain has not really materialized. It's true that individual repo builds are quicker, but when we make a change in, for example, the rdf4j repo, we still need to run verification in rdf4j-storage and rdf4j-tools, and the change can still turn out to break things there.

A further downside is that compliance tests in the rdf4j repo often use code from rdf4j-storage (e.g. a sail impl) or from rdf4j-tools (e.g. to spin up an rdf4j server). However, due to the order of dependencies, those modules are not built yet when the rdf4j repo runs its verification. That's no big deal as long as we're on a develop branch and everything just uses "the latest SNAPSHOT", but it seriously messes things up when release time rolls around and we have to set fixed versions: suddenly the rdf4j repo build fails because its tests can't find rdf4j-sail-memory 3.0.0. I have been attempting to mitigate this by moving tests around in the project, as well as setting fixed (older) versions for these kinds of test dependencies, but it's not ideal.

Finally: the Jenkins pipeline we currently have to build, verify and deploy all of this is just incredibly convoluted. We now have 24(!) separate Jenkins jobs to coordinate all of this, and it's painful to maintain, to be honest.

At this point I feel the split between rdf4j and rdf4j-storage in particular hinders us more than it helps. So I'd like to propose the following:

1. we keep rdf4j-tools as a separate project, but rdf4j and rdf4j-storage get merged back together as a single GitHub repo.
2. we abolish rdf4j-testsuites, and move the testsuite code back into the merged rdf4j repo (this is actually already mostly done in the develop branch).

To make sure we get decent build and verification times, I want to do the following:

- culling and cleaning up our compliance and integration tests (there are quite a few tests in there that are either very slow, or redundant, or both).
- better unit testing with mocking and stubbing instead of cramming all our verification into massive compliance/integration test suites that spin up full servers every time.
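
To make the second point a bit more concrete, here is a minimal sketch of what testing against a stub could look like. All names in it (StatementSource, LabelLookup, StubSource) are hypothetical and made up for illustration; they are not actual rdf4j classes, and a mocking library could replace the hand-written stub.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class StubSketch {

    /** Hypothetical storage abstraction; NOT an actual rdf4j interface. */
    public interface StatementSource {
        List<String> objectsOf(String subject, String predicate);
    }

    /** Hypothetical code under test: it needs a store, but not a running server. */
    public static class LabelLookup {
        private final StatementSource source;

        public LabelLookup(StatementSource source) {
            this.source = source;
        }

        /** Returns the first label found, or the subject itself if there is none. */
        public String labelFor(String subject) {
            List<String> labels = source.objectsOf(subject, "rdfs:label");
            return labels.isEmpty() ? subject : labels.get(0);
        }
    }

    /** Hand-written stub: returns canned answers and records what was asked. */
    public static class StubSource implements StatementSource {
        public final List<String> requests = new ArrayList<>();

        @Override
        public List<String> objectsOf(String subject, String predicate) {
            requests.add(subject + " " + predicate);
            if ("ex:alice".equals(subject) && "rdfs:label".equals(predicate)) {
                return Collections.singletonList("Alice");
            }
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        StubSource stub = new StubSource();
        LabelLookup lookup = new LabelLookup(stub);

        System.out.println(lookup.labelFor("ex:alice")); // label comes from the stub
        System.out.println(lookup.labelFor("ex:bob"));   // no label, falls back to the subject
        System.out.println(stub.requests.size());        // number of lookups the stub saw
    }
}
```

A test like this exercises the lookup logic in isolation and runs in milliseconds, whereas the equivalent integration test would first have to start a store or a server.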

If we do this right, it will help us set things up so that pull requests can be verified very quickly by running only unit tests, while main branch stability is still guaranteed by running the full compliance/integration suites.

Your thoughts?

Jeen

PS in case you're wondering: I'm still frantically working on getting a milestone build done for 3.0, and this is part of the motivation for this post :)