Eclipse Community Forums
Partial parsing of large files [message #735310] Tue, 11 October 2011 14:07
Moritz
Messages: 22
Registered: July 2011
Junior Member
Hi,

Description:
I use Xtext to build a user interface for something like a state machine. Xtext has helped a lot so far in parsing the machine description. I now have the following use case:

The file with the simulation results of the state machine is text-based and very large (say 100k to millions of lines). The content is very simple, though. It contains a header, many entries of the same format (for different steps), and an index:

Model:
	'#Header' header = Header
	'#Data' entries += Entry*
	'#Table' index = Index
;


There are no references whatsoever. I started using Xtext for this file, too, because it makes building the parser much easier. I am currently trying to use the generated parser to read the Header and the Table. As a second step, I want to parse several items at different positions in the file. I can't hold the complete tree in memory, so I try to parse single Entries.

Question:
As stated in the FAQ (http://wiki.eclipse.org/Xtext/FAQ#How_do_I_load_my_model_in_a_standalone_Java_application.C2.A0.3F), I provided the file as a stream so that I can set the position in the file arbitrarily (and thereby start parsing at any point). I can also start the parser with a given rule (see http://www.eclipse.org/forums/index.php/mv/msg/242621/730622/#msg_730622; I understand that this is not what Xtext is intended for, but it is still helpful). When the parsing is done, I get the expected Entry model back. There is an error message "missing EOF" that I could accept, too. The problem is that the parser seems to parse an enormous amount of text after my Entry rule (more than a few thousand lines, maybe everything), even when the rule has already been parsed perfectly. I read about this problem (probably the lookahead) here: http://www.eclipse.org/forums/index.php/mv/msg/17750/59272/#msg_59272, and I wonder whether my goal can be achieved with the files generated by Xtext.

Does partial parsing help me here? I couldn't get it to work so far...
Manipulating the initial lookahead didn't help either.
Maybe I can inject an EOF into the stream, but then I would have to know when the passed rule (an Entry) was parsed successfully.
I read that there is a great speed enhancement in 2.0.1, but I don't think it will help in my case.

I would appreciate any suggestions here.
Thank you!


Re: Partial parsing of large files [message #735336 is a reply to message #735310] Tue, 11 October 2011 14:27
Sebastian Zarnekow
Messages: 3118
Registered: July 2009
Senior Member
Hi Moritz,

parsing is done in two steps. The first step is lexing, where the complete stream is read into memory and split into tokens. There is no communication channel from the parser back to the lexer, so the parser cannot tell the lexer when to stop. If it is possible to apply some heuristic to cut off the trailing parts of the stream before passing it to the parser, that would help.

Best regards,
Sebastian
--
Need professional support for Eclipse Modeling?
Go visit: http://xtext.itemis.com

Re: Partial parsing of large files [message #735364 is a reply to message #735336] Tue, 11 October 2011 15:34
Moritz
Messages: 22
Registered: July 2011
Junior Member
Sebastian Zarnekow wrote on Tue, 11 October 2011 10:27
> parsing is done in two steps. The first part is lexing where the complete stream is read into memory [...]

Hi Sebastian,
thank you for the explanation. I thought the lexer passed the tokens directly to the parser.
Yes, there is a possible and simple heuristic: one item is between 15 and 30 lines long, so I could simply cut off the file stream after a certain number of characters. The lexer would stop, the parser would return the object, and I could examine (or ignore) the remaining parsing errors.
I think this is what you suggested. Thank you for that, I will go for it.

Anyway, I thought there might be a better solution? If I had a hand-written parser, AFAIK I would be able to parse single rules. Can I use the generated model in another way to call a parser?

Best regards,
Moritz
Re: Partial parsing of large files [message #735380 is a reply to message #735364] Tue, 11 October 2011 15:50
Henrik Lindberg
Messages: 2509
Registered: July 2009
Senior Member
On 10/11/11 5:34 PM, Moritz wrote:

> Anyway, I thought there may be a better solution? If I would take a hand
> written parser, afaik I would be able to parse single rules. Can I use
> the generated model on another way to call a parser?
>
You could call the parser for individual rules - I do that when testing:

IParseResult result = parser.parse(ga.getExpressionRule(), new StringReader(s));

This tests that the expression text in s can be parsed as an Expression.
Not sure if that helps you...

Regards
- henrik
Re: Partial parsing of large files [message #735388 is a reply to message #735380] Tue, 11 October 2011 16:19
Moritz
Messages: 22
Registered: July 2011
Junior Member
Henrik Lindberg wrote on Tue, 11 October 2011 11:50
> You could call the parser for individual rules [...]

Hi Henrik,
thanks; yes, that was my first step: provide my file as a Reader and parse a specific rule.
Maybe I wasn't clear enough: the problem is that the parsing won't stop after the rule is matched if there is more input available (and my FileReader can't predict when to simulate an EOF). Or, with Sebastian's explanation, it's rather the lexer that buffers all the input first.

Best regards,
Moritz
Re: Partial parsing of large files [message #735400 is a reply to message #735388] Tue, 11 October 2011 17:00
Moritz
Messages: 22
Registered: July 2011
Junior Member
For those interested and to clarify, the code could read like this:

Model:
	'#Header' header = Header
	'#Data' entries += Entry*
	'#Table' index = Index
;


@Inject
private IParser parser;

public Entry readSingleEntry(File file, long pos) throws IOException {
	// Open the large file (> 1M rows) and set the read position
	FileInputStream fileStream = new FileInputStream(file);
	fileStream.getChannel().position(pos);
	InputStreamReader reader = new InputStreamReader(fileStream);

	// Get a sub-rule to be parsed
	SimulationResultParser srParser = (SimulationResultParser) parser;
	ParserRule rule = srParser.getGrammarAccess().getEntryRule();
	IParseResult result = srParser.parse(rule, reader);
	// At this point, the result is available, but the whole fileStream has been read :(

	// Ignore the error 'missing EOF'
	Iterable<INode> syntaxErrors = result.getSyntaxErrors();

	return (Entry) result.getRootASTElement();
}


One solution would be to subclass FileInputStream and stop the output after a certain number of characters.
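A minimal sketch of that idea (the class name and the limit are assumptions for illustration, not part of the thread): a FilterInputStream that simulates an EOF after a fixed number of bytes, so the lexer never buffers the rest of the file.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: pretends the stream ends after 'limit' bytes,
// so the Xtext lexer stops buffering. The limit would be derived from
// the known maximum entry size (entries are 15-30 lines long).
class LimitedInputStream extends FilterInputStream {
    private long remaining;

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // simulate EOF
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1; // simulate EOF
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}
```

The reader above would then wrap the positioned stream, e.g. new InputStreamReader(new LimitedInputStream(fileStream, maxEntryBytes)); the parser would still report "missing EOF", but only a bounded chunk gets lexed.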
Re: Partial parsing of large files [message #735540 is a reply to message #735400] Wed, 12 October 2011 06:59
Daniel
Messages: 101
Registered: July 2011
Senior Member
You could simply create your own MyDslInputStream which stops reading at your desired offset and combine it with the algorithm above. Simply start caching the string once you read a # sign and continue caching until you encounter a whitespace. Then you check which section you are in and stop reading from the stream by signalling an EOF. Or you simply stop reading after the second # sign.

Cheers
Daniel
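The marker-based variant Daniel describes could be sketched like this (the class name and the configurable marker count are made up for illustration; a real implementation would also scan buffered reads instead of going byte by byte):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: cuts the stream at the n-th '#' section marker, so at
// most one entry (plus the start of the next section) reaches the lexer.
class SectionBoundedInputStream extends FilterInputStream {
    private final int markerLimit;
    private int markersSeen;
    private boolean eof;

    SectionBoundedInputStream(InputStream in, int markerLimit) {
        super(in);
        this.markerLimit = markerLimit;
    }

    @Override
    public int read() throws IOException {
        if (eof) {
            return -1;
        }
        int b = super.read();
        if (b == '#' && ++markersSeen >= markerLimit) {
            eof = true; // stop before the next section starts
            return -1;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Delegate to single-byte reads to keep the marker check simple.
        int i = 0;
        for (; i < len; i++) {
            int b = read();
            if (b < 0) {
                return i == 0 ? -1 : i;
            }
            buf[off + i] = (byte) b;
        }
        return i;
    }
}
```

Whether to cut at the first or the second # depends on where in the file the stream was positioned, so the count is left as a parameter here.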