Re: [asciidoc-lang-dev] Whitespace handling



On Sun, 28 Feb 2021 at 22:10, Sylvain Leroux <sylvain@xxxxxxxxxxx> wrote:


On 28/02/2021 11:37, Dan Allen wrote:
>> Lex:
>>   "Asciidoc spacing characters" is fine by me, if a bit long, but
>>   {asc} can fix that :-)
>
> Excellent. We will certainly define it.

FWIW, it suits me well too ;)

>> Dan wrote:
>> One thing we will need to be careful about, though, is that AsciiDoc
>> doesn't support *all* spacing characters. So we'll just need to
>> emphasize that in our definition / usage.
>>
>> Lex wrote:
>> I wonder if the standard should consider changing that?
>
> Perhaps. Though I think one of the values of AsciiDoc is that it's not
> cryptic. So limiting spacing characters to space, tab, and line feed has
> merit. When we start to allow other spacing characters, then I think it
> introduces inconsistency from one document to the next. I think a case
> has to be made for each individual character we permit. That way, we
> know why we permitted it.

I already discussed that idea with Dan elsewhere. I assume it's not a
problem if I quote him here: "A valid space character is a space, a tab,
or a line feed (aka newline). It's questionable whether a non-breaking
space should be allowed. But it definitely shouldn't extend beyond that."

My understanding at that time was that a parser only has to deal with ASCII
characters (formally, the C0 Controls and Basic Latin block [1]). So the
only spacing characters allowed between markup and the user's content were
\u0009 and \u0020. That won't prevent any other spacing character from
appearing inside the text, at least up to the DOM.

I agree that spacing which is part of the markup (such as paragraph separators, or the spacing after section and list markers) is ASCII only.

Spacing that is recognised for output purposes, such as PDF justification, is different, and it is within that processor's purview to define what it supports, IMO.
 

Dan also reminded me at that time `asciidoctor` "normalizes" lines early
in the document processing to remove trailing spaces ([2]). Once again,
I understood "to remove trailing \u0009 and \u0020." That also means
trailing spaces will not make their way into the DOM.

Since trailing spacing has no semantic meaning in standard AsciiDoc, removing it has no effect on AsciiDoc itself. It would, however, impact extensions that assign semantics to it, for example an extension that processes Markdown (which uses trailing spaces as a line break, IIUC). So at least in literal blocks it should be left in place, so that included content can be handled by extensions unmodified. (What would be a use case? Programming projects are now documenting their functions with Markdown, so a book about such a library could include the function descriptions directly from the source and process them with an extension, e.g. for the Julia language.)
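
To make the concern concrete, here is a minimal Python sketch of the normalization as described above. It is not Asciidoctor's actual code: it assumes that only trailing \u0020 and \u0009 are removed, and the function name strip_trailing_ascii_spacing is made up for illustration.

```python
def strip_trailing_ascii_spacing(line: str) -> str:
    """Remove only trailing space (U+0020) and tab (U+0009) characters.

    Assumption: this mirrors the normalization described in the thread,
    not any real processor's implementation.
    """
    return line.rstrip(" \t")


# A Markdown hard line break is written as two trailing spaces,
# so normalization silently drops it.
markdown_line = "first line  "
normalized = strip_trailing_ascii_spacing(markdown_line)

# Other spacing characters, such as a no-break space (U+00A0),
# are left untouched because they are not in the stripped set.
nbsp_line = "price\u00a0"
nbsp_normalized = strip_trailing_ascii_spacing(nbsp_line)
```

After this step an extension can no longer tell whether the Markdown line ended with a break, which is why leaving literal blocks unmodified matters.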
 

We didn't discuss the case of the `\r`. AFAIK, asciidoctor only
recognizes `\n` as the line terminator. So, we might extend the
normalization process to remove `\r` at the end of a line.


I would phrase it as: line separators are \n and \r\n. I'm not sure how much bare \r from old Macs still exists. That way the \r is simply part of the line ending. And the concept of multibyte line endings allows NEL, LS, or PS to be added in future if it becomes necessary.
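
As a rough Python sketch of that phrasing (an illustration only, not a proposal for spec wording; the names LINE_SEPARATOR and split_lines are hypothetical), the separator can be treated as a single token that is either \r\n or \n:

```python
import re

# Assumption: only LF and CRLF count as line separators, as suggested above.
# CRLF is listed first so the pair is consumed as one separator.
LINE_SEPARATOR = re.compile(r"\r\n|\n")


def split_lines(text: str) -> list:
    """Split text on LF or CRLF; a lone CR stays inside the line content."""
    return LINE_SEPARATOR.split(text)


mixed = split_lines("one\r\ntwo\nthree")
lone_cr = split_lines("old\rmac")
```

With this shape, no separate "strip trailing \r" step is needed, and admitting NEL (U+0085), LS (U+2028), or PS (U+2029) later would only mean adding alternatives to the pattern.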
 
Cheers
Lex


[1]: https://unicode.org/charts/PDF/U0000.pdf
[2]: https://discuss.asciidoctor.org/Section-without-a-title-tp8466p8468.html

- Sylvain



