Re: [rdf4j-dev] SNAPSHOT vs SNAPSHOT_READ



On Thu, Apr 30, 2020 at 5:36 PM Håvard Ottestad <hmottestad@xxxxxxxxx> wrote:
Hi,

I was wondering about the difference between SNAPSHOT and SNAPSHOT_READ and why SNAPSHOT_READ was chosen as default for the MemoryStore and NativeStore? Jeen, maybe you could help me understand :)

I'm a bit rusty on this myself, but if I remember correctly, SNAPSHOT_READ ensures that once a query is started, its result is not influenced by any subsequent changes. For example, if you query in a transaction:

      con.begin();
      statements = con.getStatements(....);
      while (statements.hasNext()) {
              Statement st = statements.next(); // etc
              // Now imagine that somewhere during this loop, after we have started
              // evaluating our query, a concurrent transaction commits its result:
              // should that result show up in this iteration?
              // This is the difference: in READ_COMMITTED, those changes may show up
              // in the query result while iterating; in SNAPSHOT_READ, this is prevented.
      }
      con.commit();

SNAPSHOT is a stricter level, which, in addition to the above, also ensures that if you query twice in the same transaction, those two separate queries will observe the same state of the data, as it was at the start of the transaction.

SNAPSHOT_READ was presumably chosen as the default over READ_COMMITTED because RDF4J uses lazy query evaluation: the query result is not "materialized" up front but produced as you iterate, so under READ_COMMITTED a concurrent commit could change the result halfway through the iteration and give unexpected results.
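
To make the three levels concrete, here is a toy model of the behaviour described above. It is only a sketch: ToyStore, ToyTransaction and run are hypothetical names invented for this illustration, the "statements" are plain strings, and none of it reflects how the MemoryStore or NativeStore are actually implemented. It assumes every commit publishes a fresh immutable list, so older snapshots stay readable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Toy model only: none of these classes are RDF4J API.
class IsolationSketch {

    enum Level { READ_COMMITTED, SNAPSHOT_READ, SNAPSHOT }

    /** Hypothetical store: every commit publishes a new immutable list. */
    static class ToyStore {
        private List<String> committed = Collections.emptyList();

        void commit(String statement) {
            List<String> next = new ArrayList<>(committed);
            next.add(statement);
            committed = Collections.unmodifiableList(next);
        }

        List<String> latest() {
            return committed;
        }
    }

    /** Hypothetical transaction that hands out lazily consumed iterators. */
    static class ToyTransaction {
        private final ToyStore store;
        private final Level level;
        private final List<String> atBegin; // committed state when the transaction began

        ToyTransaction(ToyStore store, Level level) {
            this.store = store;
            this.level = level;
            this.atBegin = store.latest();
        }

        Iterator<String> query() {
            switch (level) {
            case SNAPSHOT:
                // Every query in the transaction reads the state seen at begin.
                return atBegin.iterator();
            case SNAPSHOT_READ:
                // Each query pins the committed state at the moment it starts.
                return store.latest().iterator();
            default:
                // READ_COMMITTED: each step re-reads whatever is committed right now.
                return new Iterator<String>() {
                    private int pos = 0;
                    @Override public boolean hasNext() { return pos < store.latest().size(); }
                    @Override public String next() { return store.latest().get(pos++); }
                };
            }
        }
    }

    /** Returns { statements seen by the first query, statements seen by the second query }. */
    static int[] run(Level level) {
        ToyStore store = new ToyStore();
        store.commit("a");
        store.commit("b");

        ToyTransaction tx = new ToyTransaction(store, level);
        Iterator<String> first = tx.query();
        first.next();               // iteration has started
        int firstCount = 1;
        store.commit("c");          // a concurrent transaction commits mid-iteration
        while (first.hasNext()) {
            first.next();
            firstCount++;
        }

        int secondCount = 0;        // second query in the same transaction
        for (Iterator<String> second = tx.query(); second.hasNext(); second.next()) {
            secondCount++;
        }
        return new int[] { firstCount, secondCount };
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            int[] seen = run(level);
            System.out.println(level + ": first query saw " + seen[0]
                    + ", second query saw " + seen[1]);
        }
    }
}
```

In this toy model, READ_COMMITTED lets the concurrent commit leak into the running iteration (3 and 3 statements), SNAPSHOT_READ keeps the running iteration stable but lets the second query see the new statement (2 and 3), and SNAPSHOT shows both queries the state from the start of the transaction (2 and 2). In RDF4J itself, a non-default level can be requested per transaction by passing it to begin, e.g. con.begin(IsolationLevels.SNAPSHOT).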


The reason is that I’m wondering if maybe we should upgrade the default to SNAPSHOT, which is the highest level we have that doesn’t fail transactions for isolation violations.

I'm not sure what you mean by "isolation violations", but I think that it makes sense to keep the default isolation level as low as possible without compromising on consistent results when reading. Upping to SNAPSHOT as the default would make concurrent transactions more expensive to process.

As a note, relational databases typically default to READ_COMMITTED, or NONE, letting it be up to the user to decide if they want to trade performance for consistency. This would of course not be backward compatible for us to do, and I also don’t think it’s a good idea due to the number of questions we would then get about inconsistencies from users who don’t want to have to think about isolation.

READ_COMMITTED and SNAPSHOT_READ are really very close, and we've only chosen the slightly stricter level to make sure that you don't get unexpected things while iterating over a query result. I suspect that in many relational databases, no distinction is made between READ_COMMITTED and SNAPSHOT_READ, and they effectively do the same thing.

Jeen
 
