Re: [rdf4j-dev] Contributing a write-once/read-many triple store to RDF4j
  • From: "Bart Hanssens (BOSA)" <bart.hanssens@xxxxxxxxxxxx>
  • Date: Mon, 19 Dec 2022 14:58:34 +0000

Ow, very nice !

 

Bart

 

From: rdf4j-dev <rdf4j-dev-bounces@xxxxxxxxxxx> On Behalf Of Jerven Tjalling Bolleman
Sent: Monday 19 December 2022 15:36
To: rdf4j-dev@xxxxxxxxxxx
Subject: Re: [rdf4j-dev] Contributing a write-once/read-many triple store to RDF4j

 

Dear All,


As a first step this is available at

This has been cleared by management and is already available under the normal RDF4j license.

 

There is still a lot of work to be done and I need to document much more!

Still it is public now and open for review.

 

Regards,

Jerven Bolleman

 


Jerven Tjalling Bolleman
Principal Software Developer
SIB | Swiss Institute of Bioinformatics
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss

 


From: rdf4j-dev <rdf4j-dev-bounces@xxxxxxxxxxx> on behalf of jerven Bolleman <jerven.bolleman@sib.swiss>
Sent: 31 October 2022 15:09
To: rdf4j-dev@xxxxxxxxxxx <rdf4j-dev@xxxxxxxxxxx>
Subject: [rdf4j-dev] Contributing a write-once/read-many triple store to RDF4j

 

Dear RDF4j dev-community,

I have been distracted by writing a write-once/read-many quad store :)

This store is designed with some of the challenges of UniProt in mind.
It is based around two concepts: sort all the things, and don't mix value
types. This quad store aims to be good for datasets with up to about
4000 distinct predicates, graphs in the few 100s range, billions of
distinct values and trillions of triples, which change relatively rarely
and, when they do, can be generated/reloaded from scratch.

# Some technical snippets.

## Sorted lists for values

Like the vast majority of quad stores, the store has dictionaries for
values. The difference is that there is one dictionary for each distinct
datatype plus one for IRIs. A nuance of these dictionaries is that they
are based around sorted lists, compressed and memory mapped, so all keys
are just index positions. These keys are valid for comparison operators,
e.g. with key 1 for value "a" and key 2 for value "b", key comparison
(Long.compare) matches SPARQL value comparison.
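
A minimal sketch of the sorted-list dictionary idea, written in Python rather than the store's own code. The class name `SortedDictionary` and its methods are hypothetical, keys are 1-based only to mirror the "key 1 / key 2" example above, and Python's built-in string ordering stands in for SPARQL value comparison. Compression and memory mapping are left out.

```python
from bisect import bisect_left


class SortedDictionary:
    """Hypothetical value dictionary: a sorted list where key == position."""

    def __init__(self, values):
        # The store is write-once, so sorting and de-duplicating once is enough.
        self._values = sorted(set(values))

    def key_of(self, value):
        """Return the 1-based key for a value, or None if it is absent."""
        i = bisect_left(self._values, value)
        if i < len(self._values) and self._values[i] == value:
            return i + 1
        return None

    def value_of(self, key):
        """Return the value stored at a 1-based key."""
        return self._values[key - 1]


strings = SortedDictionary(["b", "a", "c"])
key_a = strings.key_of("a")
key_b = strings.key_of("b")

# Comparing keys gives the same answer as comparing the values themselves.
same_order = (key_a < key_b) == ("a" < "b")
```

Because a key is nothing more than a position in the sorted list, a range filter on values can be turned into a range filter on keys without decoding any value.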

## Partitioned triple tables, with graph filters

The quad table however is highly partitioned, e.g. one table per combination of
* whether the subject is a bnode or an IRI
* the unique predicate
* whether the object is a bnode, an IRI or a specific datatype.

e.g.

_:1 :pred_0 <http://example.org/iri> .
<http://example.org/iri> :pred_0 3 .
<http://example.org/iri> :pred_0 "lala" .

These will be stored in 3 distinct tables, allowing us to completely avoid
storing the predicates and the type of subject or object. For now they are
stored in separate files, e.g.

./pred_0/bnode/iris
./pred_0/iri/datatype_xsd_int
./pred_0/iri/datatype_xsd_string
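
The partitioning rule can be sketched as a function from a triple to a table path. This is an illustration only: the tuple encoding of terms, the helper names and the `bnodes` file name are assumptions, and the literal `3` is given the datatype xsd:int simply because that is what the path names above use.

```python
from collections import defaultdict

XSD = "http://www.w3.org/2001/XMLSchema#"


def object_part(obj):
    """File name for the object kind: IRIs, bnodes or one XSD datatype."""
    kind = obj[0]
    if kind == "iri":
        return "iris"
    if kind == "bnode":
        return "bnodes"  # assumed name, not taken from the email
    datatype = obj[2]
    if datatype.startswith(XSD):
        return "datatype_xsd_" + datatype[len(XSD):]
    raise ValueError("datatype outside XSD is not handled in this sketch")


def table_path(pred_dir, subject, obj):
    """Path of the table holding triples of this subject/predicate/object kind."""
    return f"./{pred_dir}/{subject[0]}/{object_part(obj)}"


triples = [
    (("bnode", "1"), "pred_0", ("iri", "http://example.org/iri")),
    (("iri", "http://example.org/iri"), "pred_0", ("literal", "3", XSD + "int")),
    (("iri", "http://example.org/iri"), "pred_0", ("literal", "lala", XSD + "string")),
]

tables = defaultdict(list)
for subject, pred_dir, obj in triples:
    # Only the subject and object values are kept; predicate and term
    # kinds are implied by the table the row lands in.
    tables[table_path(pred_dir, subject, obj)].append((subject[1], obj[1]))

paths = sorted(tables)
```

Each row carries only a subject and an object, since everything else can be read off the path.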

Which graphs a triple is in is encoded in a bitset (Roaring, for
compression) and there might be multiple graph bitsets per table.
All graphs must be identified by an IRI.
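
A rough sketch of graph membership as one bitset per graph over the row positions of a table. Plain Python integers stand in for Roaring bitmaps here, and the graph IRIs, row contents and helper names are made up for illustration.

```python
def make_bitset(positions):
    """Build an integer bitset with the given row positions set."""
    bits = 0
    for position in positions:
        bits |= 1 << position
    return bits


def rows_in_graph(rows, graph_bits):
    """Return the rows whose position is set in the graph bitset."""
    return [row for i, row in enumerate(rows) if (graph_bits >> i) & 1]


# One partitioned table: rows are (subject, object) pairs.
rows = [("s1", "o1"), ("s2", "o2"), ("s3", "o3")]

# One bitset per graph IRI; bit i is set if row i is in that graph.
graph_bitsets = {
    "http://example.org/graphA": make_bitset([0, 2]),
    "http://example.org/graphB": make_bitset([1, 2]),
}

in_a = rows_in_graph(rows, graph_bitsets["http://example.org/graphA"])
in_both = rows_in_graph(
    rows,
    graph_bitsets["http://example.org/graphA"]
    & graph_bitsets["http://example.org/graphB"],
)
```

A graph restriction then becomes a filter over row positions, and restricting to several graphs at once is a bitwise operation on the bitsets.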

## Inverted indexes using bitsets

Many values can be stored completely inline in such a representation,
and we also invert the table. This is very valuable when there is a
small set of distinct objects, e.g. for a predicate with boolean values.

We do
true -> [:iri1, :iri2, :iri4]
false -> [:iri1, :iri4, :iri8]

instead of
:iri1 true
:iri1 false
:iri2 true
:iri4 true
:iri4 false
:iri8 false

As all IRI string values are addressable by a 63 bit long value
(positive only), we can turn this into two bitsets, which gives very large
compression ratios and speed afterwards. Reduction to 2% of the input
data is possible for quite a large number of datasets. (2/3rds of the
predicate value combinations in UniProtKB are compressible this way.)
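
A minimal sketch of the inversion for a boolean-valued predicate, using the subject keys 1, 2, 4 and 8 from the `true -> [...]` / `false -> [...]` form above. Python integers again stand in for compressed bitsets, and the function names are hypothetical.

```python
def invert(pairs):
    """Turn (subject_key, value) rows into one bitset of subjects per value."""
    index = {}
    for subject_key, value in pairs:
        index[value] = index.get(value, 0) | (1 << subject_key)
    return index


def members(bits):
    """List the positions of the set bits, in ascending order."""
    found = []
    position = 0
    while bits:
        if bits & 1:
            found.append(position)
        bits >>= 1
        position += 1
    return found


# Rows as they would appear in the non-inverted table.
pairs = [(1, True), (1, False), (2, True), (4, True), (4, False), (8, False)]

index = invert(pairs)
true_subjects = members(index[True])
false_subjects = members(index[False])

# Subjects that have both values: a single bitwise AND.
both = members(index[True] & index[False])
```

With only two distinct objects the whole table collapses into two bitsets, which is where the large compression ratio comes from.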

## Join optimization candidates

Considering all triples are stored in subject, object order (or that
order is cheap to generate), we can also do a MergeJoin by default for
all patterns that are joined on a "subject variable". BitSet joins might
in some cases also be possible.
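
A sketch of a merge join over two tables that are both sorted by subject key. This is a generic textbook merge join, not the store's implementation; the table contents and the function name are invented for the example.

```python
def merge_join(left, right):
    """Join two lists of (subject_key, object_key) rows sorted by subject_key.

    Returns (subject_key, left_object, right_object) for every matching pair.
    """
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_subject, right_subject = left[i][0], right[j][0]
        if left_subject < right_subject:
            i += 1
        elif left_subject > right_subject:
            j += 1
        else:
            # Find the run of rows with this subject on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_subject:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_subject:
                j_end += 1
            for _, left_object in left[i:i_end]:
                for _, right_object in right[j:j_end]:
                    joined.append((left_subject, left_object, right_object))
            i, j = i_end, j_end
    return joined


# Two partitioned tables, e.g. for ?s :pred_0 ?a . ?s :pred_1 ?b
table_0 = [(1, 10), (2, 11), (4, 12)]
table_1 = [(2, 30), (2, 31), (3, 32), (4, 33)]

joined = merge_join(table_0, table_1)
```

Both inputs are read once, front to back, so no hash table or extra sort is needed when the tables are already in subject order.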

## Open work

There is still a lot of work to be done to make it as fast as possible
and validate that it really works as it is supposed to.
* Strings using fewer than nine UTF-8 characters are also inline value
candidates but this is not wired up yet.
* FSST compression for the IRI dictionary instead of LZ4.
* Clean up experiments
* Document more :(
* Reduce temporary file size requirements during compression stage (7TB
for UniProtKB)


## Early results

Early results are encouraging. For a UniProtKB release we need 610 GB
of disk space: 197 GB for the "quads" and the other 413 GB for the values,
i.e. roughly 16 bits per triple! This is better than the raw rdf/xml
compressed with xz --best :)

Loading time (for UniProtKB 2022_04) is currently 59 hours on a 128 core
machine (first generation EPYC): 24 hours in preparsing the rdf/xml
and merge sorting the triples, another 10 hours in sorting all IRIs, and
25 for converting all values in the triple tables down into their long
identifiers.
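
The figures above can be cross-checked with a few lines of arithmetic. The triple count below is only what the stated numbers imply, under the assumptions that "16 bits per triple" refers to the 197 GB of quad tables and that 1 GB is 10^9 bytes. It is not a measured value.

```python
GB = 10**9  # assumption: decimal gigabytes

quad_bytes = 197 * GB
value_bytes = 413 * GB
total_bytes = quad_bytes + value_bytes

# Triple count implied by 16 bits per triple in the quad tables.
bits_per_triple = 16
implied_triples = quad_bytes * 8 // bits_per_triple

# Loading time breakdown in hours: preparse and merge sort, sort IRIs,
# convert values to long identifiers.
load_hours = 24 + 10 + 25

print(total_bytes // GB, implied_triples, load_hours)
```

Under those assumptions the sizes add up to the stated 610 GB, the steps add up to the stated 59 hours, and the quad tables would hold about 98.5 billion triples.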

In principle the first and last steps are highly parallelizable, and the
last step might be much faster when moving from LZ4 to FSST[1] compression
for IRIs and long strings.

I have an agreement in principle that I am allowed to contribute this to
RDF4j, but I would like to poll whether there is a desire for this and what
kind of paperwork I need to supply.

As it is a larger than normal contribution for me, I won't make
the code available until I am sure that the paperwork will be fine, or
that making it fine requires it to be open somewhere already.

Regards,
Jerven


[1] https://github.com/cwida/fsst/









