Re: [asciidoc-lang-dev] Text Markup, syntax and parsing thereof

From: Lex Trotman <exciidoc@xxxxxxxxx>
Date: Mon, 8 Mar 2021 23:23:37 +1000

...

>
> 1.b. As proposed in another thread initial "letter like" could be
> defined as a Unicode class L code point and final "letter like" could be
> defined as Unicode class L code point followed by any number of Unicode
> class M (combining characters such as accents) code points.

_Or_ as any Unicode class L code point or Unicode class M code point.
The current Asciidoctor documentation explicitly states that: "An
AsciiDoc processor always assumes the content is UTF-8 encoded" [1]. Are
stray Unicode class M code points forbidden by Unicode? Is this something
we should enforce?

The only place a combining character is "stray" is at the start of a line (i.e. first in the file or after a line separator). Anywhere else it applies to the preceding base code point, which can be something other than category L (letters), and that of course breaks the option of detecting only a combining character. So it is safer to specify the base code point category, as you initially suggested.
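
To make the difference between the two rules concrete, here is a minimal Python sketch. The function names are made up for illustration and are not part of any AsciiDoc implementation; the only real API used is the standard library's unicodedata.category(). It compares "class L followed by any number of class M" with the looser "class L or class M" test.

```python
import unicodedata


def ends_letter_like(text: str) -> bool:
    """Hypothetical rule 1.b: a class L base followed by zero or more class M."""
    i = len(text)
    # Skip back over any trailing combining marks (Mn, Mc, Me).
    while i > 0 and unicodedata.category(text[i - 1]).startswith("M"):
        i -= 1
    # What remains must end in a letter; a mark with no base is "stray".
    return i > 0 and unicodedata.category(text[i - 1]).startswith("L")


def ends_letter_or_mark(text: str) -> bool:
    """Hypothetical looser rule: the last code point is class L or class M."""
    return bool(text) and unicodedata.category(text[-1])[0] in "LM"


accented = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT: base is a letter
keycap = "1\u20e3"     # '1' + COMBINING ENCLOSING KEYCAP: base is a digit
stray = "\u0301"       # a combining mark with no base at all
```

Both rules accept the accented letter, but only the looser one accepts the keycap sequence and the stray mark, which is the case where looking at the combining character alone goes wrong.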

[1]: https://docs.asciidoctor.org/asciidoc/latest/normalization/
>
> 1.c. currently the punctuation allowed is (,;".?!) which are common
> English punctuations but do not include any non-English punctuation.
>
> 1.d. So should common non-English punctuation be allowed and which?
>
> 1.e. Should all Unicode category P punctuation be allowed?
>
> 1.e. Should punctuation be allowed before the initial constrained
> markup?
>
> 1.f. Should only Unicode category Pi and Ps be allowed before and Pf and
> Pe and Po after?
>
> 1.g. What is "space" (here I'm talking in the context of constrained
> markup, there is another thread that addresses it more generally), eg
> Unicode category Zs (https://www.compart.com/en/unicode/category/Zs) and
> AsciiDoc line separators?
>
From an i18n perspective, allowing "Unicode category Zs | Unicode category
Ps" ([2],[3]) before an opening constrained markup seems reasonable. We
can hope this wouldn't dramatically raise the number of false positives
in markup detection. I have no opinion on whether the Unicode category
Pi ([4]) should be included in that set.

Not even for the French double angle bracket quotation marks?

I have the same reasoning regarding the closing markups, using the
complementary Unicode categories, of course.

[2]: https://www.compart.com/en/unicode/category/Zs
[3]: https://www.compart.com/en/unicode/category/Ps
[4]: https://www.compart.com/en/unicode/category/Pi
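
As a rough sketch of the category-based boundary test discussed above (hypothetical names, not the current Asciidoctor behaviour), the check can be written with Python's unicodedata module. The default sets assume "Zs | Ps" before an opening mark and the complementary "Zs | Pe" after a closing mark; whether Pi, Pf or Po belong in those sets is exactly the open question, so they are left as parameters.

```python
import unicodedata

# Assumed defaults taken from the proposal above; not a settled rule.
OPEN_BEFORE = frozenset({"Zs", "Ps"})
CLOSE_AFTER = frozenset({"Zs", "Pe"})


def may_open(text: str, pos: int, allowed=OPEN_BEFORE) -> bool:
    """Can the mark at text[pos] open constrained markup?"""
    # Start of input or start of a line always counts as a boundary.
    if pos == 0 or text[pos - 1] == "\n":
        return True
    return unicodedata.category(text[pos - 1]) in allowed


def may_close(text: str, pos: int, allowed=CLOSE_AFTER) -> bool:
    """Can the mark at text[pos] close constrained markup?"""
    nxt = pos + 1
    # End of input or end of a line always counts as a boundary.
    if nxt >= len(text) or text[nxt] == "\n":
        return True
    return unicodedata.category(text[nxt]) in allowed


parens = "(*bold*)"                # '(' is Ps, ')' is Pe
guillemets = "\u00ab*bold*\u00bb"  # French quotes: Pi and Pf
spanish = "\u00bf*qu\u00e9*?"      # inverted question mark is Po
```

With the defaults, the parenthesised example is recognised while the guillemet example is not, unless Pi and Pf are added to the sets. The inverted question mark is in category Po, so it is not covered by Ps or Pi at all and would need Po (or an explicit list) before the opening mark.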

> 1.h. Or since unconstrained markup is available should the specification
> be conservative on what is allowed bounding unconstrained markup, the
> markups (*_`#~^) are uncommon in general English text, but tend to occur
> when talking about programming code and math, and I don't know how
> common they are in other languages. The rules are intended to minimise
> nuisance recognition of such use-cases as markup, so the more situations
> that markup is allowed the more nuisance occurrences are likely.

This is the conservative approach, aiming to maximize compatibility
with the existing implementation. But, even for Latin scripts, the
current implementation is not satisfactory. Think, for example, of the
inverted question and exclamation marks used in Spanish [5]. So in the
end, this has to evolve.

Should we consider that (1.h) option for the v1.x of the specs, knowing
we would evolve toward a more intl-aware solution based on the Unicode
categories in the v2.x? Or should we make the leap right now, assuming a
"good enough" compatibility with the existing document base?

[5]: https://en.wikipedia.org/wiki/Inverted_question_and_exclamation_marks

Well, if the formatting markup is going to be parsed into the DOM in version 1.x, then the current sequential evaluation of markups is likely to make things difficult, since the text of a node has to be re-evaluated for embedded markups. Also, I can't find anything that defines the order in which quotes are substituted, but maybe it is somewhere. Anything is of course possible to program, but that particular backward compatibility is with a problematic solution, one that even we have trouble describing. And keeping compatibility leaves writers struggling with the current issues.

Dan's REGEX shows how clear as mud it currently is :-)

Perhaps this is the chance to simplify and clarify it, but it does risk some backward incompatibility.

Cheers

Lex
