Eclipse Community Forums
Partial parsing of large files [message #735310] Tue, 11 October 2011 14:07
Moritz
Messages: 22
Registered: July 2011
Junior Member
Hi,

Description:
I use Xtext to build a user interface for something like a state machine. Xtext has helped a lot so far in parsing the machine description. I now have the following use case:

The file with the simulation results of the state machine is text-based and very large (say 100k to millions of lines). The content is very simple, though. It contains a header, many entries of the same format (for different steps), and an index:

Model:
	'#Header' header = Header
	'#Data' entries += Entry*
	'#Table' index = Index
;


There are no references whatsoever. I started using Xtext for this file, too, because it makes building the parser much easier. I am currently trying to use the generated parser to read the Header and the Table. As a second step, I want to parse several items at different positions in the file. I can't hold the complete tree in memory, so I try to parse single Entries.

Question:
As stated in the FAQ (http://wiki.eclipse.org/Xtext/FAQ#How_do_I_load_my_model_in_a_standalone_Java_application.C2.A0.3F), I provided the file as a stream so that I can set the position in the file arbitrarily (and thereby start parsing at any point). I can also start the parser with a given rule (see http://www.eclipse.org/forums/index.php/mv/msg/242621/730622/#msg_730622; I understand that this is not what Xtext is intended for, but it is still helpful). When the parsing is done, I get the expected Entry model back. There is an error message "missing EOF" that I could accept, too. The problem is that the parser seems to parse an enormous amount of text after my Entry rule (more than a few thousand lines, maybe everything), even when the rule has already been parsed perfectly. I read about this problem (probably the lookahead) here: http://www.eclipse.org/forums/index.php/mv/msg/17750/59272/#msg_59272, and I wonder whether my goal can be achieved with the files generated by Xtext.

Does partial parsing help me here? I couldn't get it to work so far...
Manipulating the initial lookahead didn't help either.
Maybe I can inject an EOF into the stream, but then I would have to know when the passed rule (an Entry) was parsed successfully.
I read that there is a great speed enhancement in 2.0.1, but I don't think it will help in my case.

I would appreciate any suggestions here.
Thank you!


Re: Partial parsing of large files [message #735336 is a reply to message #735310] Tue, 11 October 2011 14:27
Sebastian Zarnekow
Messages: 3118
Registered: July 2009
Senior Member
Hi Moritz,

parsing is done in two steps. The first step is lexing, where the complete stream is read into memory and split into tokens. There is no communication channel from the parser back to the lexer, so the parser cannot tell the lexer when to stop. If it is possible to apply some heuristic to cut off the trailing parts of the stream before passing it to the parser, that would help.

Best regards,
Sebastian
--
Need professional support for Eclipse Modeling?
Go visit: http://xtext.itemis.com

Re: Partial parsing of large files [message #735364 is a reply to message #735336] Tue, 11 October 2011 15:34
Moritz
Messages: 22
Registered: July 2011
Junior Member
Sebastian Zarnekow wrote on Tue, 11 October 2011 10:27
> parsing is done in two steps. The first part is lexing where the complete stream is read into memory [...]

Hi Sebastian,
thank you for the explanation. I thought the lexer passed the tokens directly to the parser.
Yes, there is a possible and simple heuristic: one item is between 15 and 30 lines long, so I could simply cut off the file stream after a certain number of characters. The lexer would stop, the parser would return the object, and I could examine (or ignore) the remaining parsing errors.
I think this is what you suggested. Thank you for that, I will go for it.

Anyway, I thought there might be a better solution? If I had a hand-written parser, AFAIK I would be able to parse single rules. Can I use the generated model in another way to call a parser?

Best regards,
Moritz
Re: Partial parsing of large files [message #735380 is a reply to message #735364] Tue, 11 October 2011 15:50
Henrik Lindberg
Messages: 2509
Registered: July 2009
Senior Member
On 10/11/11 5:34 PM, Moritz wrote:

> Anyway, I thought there may be a better solution? If I would take a hand
> written parser, afaik I would be able to parse single rules. Can I use
> the generated model on another way to call a parser?
>
You could call the parser for individual rules - I do that when testing:

IParseResult result = parser.parse(ga.getExpressionRule(), new StringReader(s));

This tests that the expression text in s can be parsed as an Expression.
Not sure if that helps you...

Regards
- henrik
Re: Partial parsing of large files [message #735388 is a reply to message #735380] Tue, 11 October 2011 16:19
Moritz
Messages: 22
Registered: July 2011
Junior Member
Henrik Lindberg wrote on Tue, 11 October 2011 11:50
> You could call the parser for individual rules [...]

Hi Henrik,
thanks; yes, that was my first step: provide my file as a Reader and parse a specific rule.
Maybe I wasn't clear enough: the problem is that the parsing won't stop after the rule is matched if there is more input available (and my FileReader can't predict when to simulate an EOF). Or, with Sebastian's explanation, it's rather the lexer that buffers all the input first.

Best regards,
Moritz
Re: Partial parsing of large files [message #735400 is a reply to message #735388] Tue, 11 October 2011 17:00
Moritz
Messages: 22
Registered: July 2011
Junior Member
For those interested and to clarify, the code could read like this:

Model:
	'#Header' header = Header
	'#Data' entries += Entry*
	'#Table' index = Index
;


@Inject
private IParser parser;

public Entry readSingleEntry(File file, long pos) throws IOException {
	// Open the large file (> 1M rows) and set the read position
	FileInputStream fileStream = new FileInputStream(file);
	fileStream.getChannel().position(pos);
	InputStreamReader reader = new InputStreamReader(fileStream);

	// Get a sub-rule to be parsed
	SimulationResultParser srParser = (SimulationResultParser) parser;
	ParserRule rule = srParser.getGrammarAccess().getEntryRule();
	IParseResult result = srParser.parse(rule, reader);
	// At this point, the result is available, but the whole fileStream has been read :(

	// Ignore the error 'missing EOF'
	Iterable<INode> syntaxErrors = result.getSyntaxErrors();

	return (Entry) result.getRootASTElement();
}


One solution would be to subclass FileInputStream and stop the output after a certain number of characters.
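A minimal sketch of that idea (the class name and the limit are assumptions for illustration, not part of the thread): a FilterInputStream that simulates an EOF after a fixed number of bytes, so the lexer never buffers the rest of the file.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: pretends the stream ends after 'limit' bytes,
// so the Xtext lexer stops buffering. The limit would be derived from
// the known maximum entry size (entries are 15-30 lines long).
class LimitedInputStream extends FilterInputStream {
    private long remaining;

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // simulate EOF
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1; // simulate EOF
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}
```

The reader above would then wrap the positioned stream, e.g. new InputStreamReader(new LimitedInputStream(fileStream, maxEntryBytes)); the parser would still report "missing EOF", but only a bounded chunk gets lexed.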
Re: Partial parsing of large files [message #735540 is a reply to message #735400] Wed, 12 October 2011 06:59
Daniel
Messages: 101
Registered: July 2011
Senior Member
You could simply create your own MyDslInputStream which stops reading at your desired offset and combine it with the algorithm above. Simply start caching the string once you read a # sign and continue caching until you encounter a whitespace. Then you check which section you are in and stop reading from the stream by signalling an EOF. Or you simply stop reading after the second # sign.

Cheers
Daniel
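The marker-based variant Daniel describes could be sketched like this (the class name and the configurable marker count are made up for illustration; a real implementation would also scan buffered reads instead of going byte by byte):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: cuts the stream at the n-th '#' section marker, so at
// most one entry (plus the start of the next section) reaches the lexer.
class SectionBoundedInputStream extends FilterInputStream {
    private final int markerLimit;
    private int markersSeen;
    private boolean eof;

    SectionBoundedInputStream(InputStream in, int markerLimit) {
        super(in);
        this.markerLimit = markerLimit;
    }

    @Override
    public int read() throws IOException {
        if (eof) {
            return -1;
        }
        int b = super.read();
        if (b == '#' && ++markersSeen >= markerLimit) {
            eof = true; // stop before the next section starts
            return -1;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Delegate to single-byte reads to keep the marker check simple.
        int i = 0;
        for (; i < len; i++) {
            int b = read();
            if (b < 0) {
                return i == 0 ? -1 : i;
            }
            buf[off + i] = (byte) b;
        }
        return i;
    }
}
```

Whether to cut at the first or the second # depends on where in the file the stream was positioned, so the count is left as a parameter here.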