Eclipse Community Forums
Umlauts and UTF-8 BOM [message #1690231] Wed, 25 March 2015 20:53
Hendrik Motza
Hi,

I have to write a grammar for an existing DSL which uses UTF-8 files and can contain any umlauts (also outside of quoted strings).

1. There are too many umlauts across the different languages to list them one by one. Is there no way to allow any letter, similar to \w in Java regular expressions?

2. When a DSL file is opened that begins with the UTF-8 BOM, this results in a parser exception. Is there a way to ignore the BOM preamble?

Thx in advance!
DataWorm
Re: Umlauts and UTF-8 BOM [message #1690250 is a reply to message #1690231] Thu, 26 March 2015 06:21
Christian Dietrich
For (1): no, the only way is to define ranges such as '\uFIRST'..'\uSECOND'.

For (2): you may comment on https://bugs.eclipse.org/bugs/show_bug.cgi?id=390308
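
For comparison with the Java regex behaviour mentioned in the question, here is a small sketch in plain Java (not Xtext). It suggests that \w is ASCII-only by default in java.util.regex and only covers umlauts with the UNICODE_CHARACTER_CLASS flag, while \p{L} matches any Unicode letter. The class and field names are made up for illustration, and none of this can be used directly in an Xtext terminal rule.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class LetterMatching {
    // Default \w in java.util.regex is ASCII-only: [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // With this flag, \w follows the Unicode definition of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    // \p{L} matches any character in the Unicode "Letter" category.
    static final Pattern ANY_LETTER = Pattern.compile("\\p{L}+");

    // True if the whole input is matched by the pattern.
    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "Gr\u00F6\u00DFe"; // "Größe", written with escapes
        System.out.println("ASCII \\w:   " + matches(ASCII_WORD, word));
        System.out.println("Unicode \\w: " + matches(UNICODE_WORD, word));
        System.out.println("\\p{L}:      " + matches(ANY_LETTER, word));
    }
}
```

So even in Java, accepting arbitrary letters needs either the flag or a Unicode category, which is roughly what the explicit ranges emulate in the grammar.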


Twitter : @chrdietrich
Blog : https://www.dietrich-it.de
Day Job: https://www.everest-systems.com
Re: Umlauts and UTF-8 BOM [message #1690251 is a reply to message #1690250] Thu, 26 March 2015 06:24
Christian Dietrich
P.S.: Maybe you can additionally add U+FEFF to the grammar.

Re: Umlauts and UTF-8 BOM [message #1690278 is a reply to message #1690250] Thu, 26 March 2015 09:53
Hendrik Motza
'\\u...' is recognized as a string and not as a character. But the idea of a range was simple and helpful. I solved it this way:
terminal fragment LETTER: ('a'..'z' | 'A'..'Z' | 'À'..'ÿ');
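
A side note on that range, as a sketch rather than a recommendation: 'À'..'ÿ' is U+00C0..U+00FF, which also contains the two symbols × (U+00D7) and ÷ (U+00F7), and it does not reach letters outside Latin-1 such as Ł (U+0141). The hypothetical Java helper below lists the characters in a range that Character.isLetter rejects, which can help when choosing the ranges for the terminal rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class LetterRangeCheck {
    /** Returns the characters in [first, last] that Character.isLetter rejects. */
    static List<Character> nonLetters(char first, char last) {
        List<Character> rejected = new ArrayList<>();
        for (int c = first; c <= last; c++) {
            if (!Character.isLetter(c)) {
                rejected.add((char) c);
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Same bounds as the 'À'..'ÿ' range in the grammar fragment.
        for (char c : nonLetters('\u00C0', '\u00FF')) {
            System.out.printf("U+%04X is not a letter%n", (int) c);
        }
    }
}
```

If those two symbols matter for the DSL, the range could be split around them, for example into U+00C0..U+00D6, U+00D8..U+00F6 and U+00F8..U+00FF.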

I have seen that bug report, including the date it was opened and that its state is still marked as NEW. Is Xtext still under development, or why do they ignore such old unhandled bugs?

I thought someone might have found another working solution for this, like the one Christian suggested. I will give it another try!

[Updated on: Thu, 26 March 2015 10:48]

Re: Umlauts and UTF-8 BOM [message #1738620 is a reply to message #1690278] Thu, 21 July 2016 13:26
Nils B.
I also need to ignore the BOM. Is there a way to do this now?
Re: Umlauts and UTF-8 BOM [message #1738657 is a reply to message #1738620] Thu, 21 July 2016 19:19
Hendrik Motza
If there is one, I haven't found it yet! :-(
Re: Umlauts and UTF-8 BOM [message #1738892 is a reply to message #1738657] Mon, 25 July 2016 18:37
Jan Koehnlein
What do you mean by "ignoring the BOM"?

The BOM is usually handled by Eclipse-classes, such as org.eclipse.ui.editors.text.FileDocumentProvider.setDocumentContent(IDocument, IEditorInput, String).


---
Get professional support from the Xtext committers at www.typefox.io
Re: Umlauts and UTF-8 BOM [message #1738894 is a reply to message #1738892] Mon, 25 July 2016 18:57
Hendrik Motza
When using Xtext on a UTF-8 file with a BOM, the BOM is treated as content that my Xtext grammar has to match. So when I try to load such a file starting with a BOM, the editor tells me the DSL starts with invalid content.

To avoid such error messages, we would like to ignore the BOM if it exists at the start of the document. I tried to add the BOM to my grammar (and then simply ignore that part of the grammar), but so far, for some reason, I have not managed to match that character sequence in the grammar...

Thanks for the hint regarding the corresponding Eclipse class, but for now I have no clue how to modify or influence it, because all these calls to Eclipse classes are made by the Xtext SDK... :-/
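
For readers looking for the general idea of "ignore the BOM if it exists at the start", here is a minimal sketch in plain Java. It only shows dropping a leading U+FEFF from text that has already been decoded. The class and method names are hypothetical, and where such a step would be hooked into Xtext or Eclipse is not shown and is not something this thread establishes.

```java
// Hypothetical helper, for illustration only; not part of Xtext or Eclipse.
final class BomStripper {
    // U+FEFF is the character a UTF-8 BOM decodes to.
    static final char BOM = '\uFEFF';

    /** Returns the text without a single leading BOM character, if present. */
    static String stripLeadingBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String raw = BOM + "model Demo";
        String clean = stripLeadingBom(raw);
        System.out.println(raw.length() + " -> " + clean.length());
    }
}
```

Only a BOM at position 0 is removed; a U+FEFF later in the text is left alone, since there it is ordinary content.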
Re: Umlauts and UTF-8 BOM [message #1738932 is a reply to message #1738894] Tue, 26 July 2016 08:26
Jan Koehnlein
You must have customized something else, as by default you never see the BOM inside the document, i.e. you never have to and you never can deal with it in the grammar.

Re: Umlauts and UTF-8 BOM [message #1738933 is a reply to message #1738932] Tue, 26 July 2016 08:27
Jan Koehnlein
The hint could give you a good start for debugging. Usually Xtext classes inherit from Eclipse classes and you have to find out which ones and where they are created and then hook in your changes by dependency injection.

Re: Umlauts and UTF-8 BOM [message #1739031 is a reply to message #1738932] Wed, 27 July 2016 05:32
Ed Merks
I didn't test it, but perhaps the workspace doesn't think/know the
encoding is UTF-8 and isn't expecting the BOM at the start.
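
To make the encoding point concrete, here is a small sketch of what plain Java charset decoding does with BOM-prefixed bytes. It does not show what Eclipse or Xtext do internally, and the class name is made up. Decoded as UTF-8, the three BOM bytes become one U+FEFF character. Decoded as ISO-8859-1, they become three separate characters. Either way, something extra sits in front of the real content unless a layer above removes it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class BomDecodingDemo {
    // The UTF-8 encoding of U+FEFF, i.e. the UTF-8 BOM.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Encodes the text as UTF-8 and prepends the BOM bytes. */
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] all = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, all, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, all, UTF8_BOM.length, body.length);
        return all;
    }

    /** Plain charset decoding, with no BOM handling added on top. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] file = withBom("model Demo");
        String asUtf8 = decode(file, StandardCharsets.UTF_8);
        String asLatin1 = decode(file, StandardCharsets.ISO_8859_1);
        System.out.printf("UTF-8: first char U+%04X%n", (int) asUtf8.charAt(0));
        System.out.printf("ISO-8859-1: first chars U+%04X U+%04X U+%04X%n",
                (int) asLatin1.charAt(0), (int) asLatin1.charAt(1),
                (int) asLatin1.charAt(2));
    }
}
```

Checking the first characters of the document text in a debugger against these two patterns may help tell which encoding was actually used to read the file.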


On 26.07.2016 10:26, Jan Koehnlein wrote:
> You must have customized something else, as by default you never see
> the BOM inside the document, i.e. you never have to and you never can
> deal with it in the grammar.


Ed Merks
Professional Support: https://www.macromodeling.com/