Hi Dan,
I agree with Håvard that this would be best discussed on GitHub, because the current
emails are rather vague and therefore hard to answer.
The Rio parsers are streaming, so something else must be going on, and we need a lot more
details.
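For reference, here is a minimal sketch of the push-based streaming API (a tiny inline document stands in for a large file on disk; the class name, method name, and counter are just illustrative):

```java
import java.io.Reader;
import java.io.StringReader;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.RDFParser;
import org.eclipse.rdf4j.rio.Rio;
import org.eclipse.rdf4j.rio.helpers.AbstractRDFHandler;

public class StreamingParseSketch {

    // Parse the input and count statements via a streaming handler.
    // Each statement is handed to the handler as soon as it is parsed,
    // so memory use stays flat regardless of input size.
    static long countStatements(Reader input, RDFFormat format) throws Exception {
        RDFParser parser = Rio.createParser(format);
        long[] count = {0};
        parser.setRDFHandler(new AbstractRDFHandler() {
            @Override
            public void handleStatement(Statement st) {
                // Process the statement here, then let it be garbage collected.
                count[0]++;
            }
        });
        parser.parse(input, "http://example.org/");
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // In practice this would be a FileReader/InputStream over the big dump file.
        Reader turtle = new StringReader(
                "<http://example.org/s> <http://example.org/p> <http://example.org/o> .");
        System.out.println("statements: " + countStatements(turtle, RDFFormat.TURTLE));
    }
}
```

Nothing in this path requires the whole file in memory, which is why we would need more details to understand where your memory is actually going.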
Please open a discussion and give us some code snippets.
Regards,
Jerven
Jerven Tjalling Bolleman
Principal Software Developer
SIB | Swiss Institute of Bioinformatics
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss
From: rdf4j-dev <rdf4j-dev-bounces@xxxxxxxxxxx> on behalf of Dan S via rdf4j-dev <rdf4j-dev@xxxxxxxxxxx>
Sent: 17 January 2024 21:38
To: rdf4j developer discussions <rdf4j-dev@xxxxxxxxxxx>
Cc: Dan S <danielms853@xxxxxxxxx>
Subject: Re: [rdf4j-dev] Question about parsing large files
Hi Håvard,
To clarify, we think this email relates to internal RDF4J development. The parser takes a filename/handle and returns an iterator. A major selling point of the iterator abstraction is that it enables extraction of elements without loading all of them into memory at once. What we're asking is whether the parser can be enhanced to read files larger than available heap memory (this may be something we could help with). Presumably, given a large enough lookahead, it could parse out the next few triples, and as the iterator is advanced it could free triples that have already been read. This would be especially useful because the parser is a standalone component of RDF4J that can be imported via Maven/Gradle into other projects. It would be very useful to have a parser for very large files, and we were wondering if the RDF4J one could eventually become the solution.
Thanks,
Dan
Hi Benjamin,
Could you post this on the GitHub discussion section? We prefer to keep the dev email just focused on the internal development of RDF4J.
That being said, I would recommend trying N-Quads instead. And if you are inserting the data into a database, make sure to use isolation level NONE.
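Roughly like this (a sketch using an in-memory store and inline data for illustration; with large files you would pass a Reader/InputStream over the dump instead, and note that `IsolationLevels` lives in `org.eclipse.rdf4j.common.transaction` in RDF4J 4.x, while older versions had it at `org.eclipse.rdf4j.IsolationLevels`):

```java
import java.io.StringReader;
import org.eclipse.rdf4j.common.transaction.IsolationLevels;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.sail.memory.MemoryStore;

public class BulkLoadSketch {

    // Load N-Quads data into the repository with transaction isolation disabled.
    static long load(Repository repo, String nquads) throws Exception {
        try (RepositoryConnection conn = repo.getConnection()) {
            // IsolationLevels.NONE skips isolation bookkeeping,
            // which is much faster for one-shot bulk loads.
            conn.begin(IsolationLevels.NONE);
            conn.add(new StringReader(nquads), "http://example.org/", RDFFormat.NQUADS);
            conn.commit();
            return conn.size();
        }
    }

    public static void main(String[] args) throws Exception {
        Repository repo = new SailRepository(new MemoryStore());
        long n = load(repo,
                "<http://example.org/s> <http://example.org/p> <http://example.org/o> <http://example.org/g> .\n");
        System.out.println("loaded: " + n);
        repo.shutDown();
    }
}
```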
Cheers,
Håvard
On 17 Jan 2024, at 20:35, Benjamin Herber (BLOOMBERG/ 919 3RD A) via rdf4j-dev <rdf4j-dev@xxxxxxxxxxx> wrote:
Hi everyone!
I'm currently working on a project that involves bulk loading larger files (currently limited to N3, Turtle, and the associated family). I was trying to parse about 100 million triples (~13 GB), and it caused the parser to run out of memory even with the JVM given 32 GB of heap.
I looked quickly into the parser implementation, and it seems the parser cannot be set to parse one statement per iteration. So what I'm doing now is chunking the larger triple files into a series of smaller ones and loading each individually, but this is proving error-prone and unmaintainable in the long term across different formats.
Does anyone have any insights into how to better approach this, or how to work around the lack of streaming parsing? Also, I'm newer to the codebase, so any pointers if I missed something would be appreciated!
Thank you!
- Benjamin Herber
_______________________________________________
rdf4j-dev mailing list
rdf4j-dev@xxxxxxxxxxx
To unsubscribe from this list, visit
https://www.eclipse.org/mailman/listinfo/rdf4j-dev