Re: [asciidoc-lang-dev] Text Markup, syntax and parsing thereof
> 1.b. As proposed in another thread initial "letter like" could be
> defined as a Unicode class L code point and final "letter like" could be
> defined as Unicode class L code point followed by any number of Unicode
> class M (combining characters such as accents) code points.
_Or_ as any Unicode class L code point or Unicode class M code point.
The current Asciidoctor documentation explicitly states: "An
AsciiDoc processor always assumes the content is UTF-8 encoded". Are
stray Unicode class M code points forbidden by Unicode? Is this something
we should enforce?
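To make the difference between the two definitions concrete, here is a
minimal Python sketch using the standard unicodedata module. The
function names are hypothetical and nothing here comes from an existing
AsciiDoc processor. It only compares the "class L followed by any
number of class M" rule from 1.b with the looser "class L or class M"
rule.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # Unicode general categories Lu, Ll, Lt, Lm, Lo all start with "L".
    return unicodedata.category(ch).startswith("L")


def is_mark(ch: str) -> bool:
    # Unicode general categories Mn, Mc, Me all start with "M".
    return unicodedata.category(ch).startswith("M")


def ends_letter_like_strict(text: str) -> bool:
    """Rule from 1.b: a class L code point followed by zero or more class M."""
    i = len(text)
    while i > 0 and is_mark(text[i - 1]):
        i -= 1  # skip trailing combining marks
    return i > 0 and is_letter(text[i - 1])


def ends_letter_like_loose(text: str) -> bool:
    """Alternative rule: the last code point is class L or class M."""
    return bool(text) and (is_letter(text[-1]) or is_mark(text[-1]))


# "e" + COMBINING ACUTE ACCENT: both rules accept it.
print(ends_letter_like_strict("e\u0301"), ends_letter_like_loose("e\u0301"))
# A combining mark after a space (a "stray" mark): only the loose rule accepts it.
print(ends_letter_like_strict(" \u0301"), ends_letter_like_loose(" \u0301"))
# A lone combining mark still encodes to UTF-8 without error.
print("\u0301".encode("utf-8"))
```

The two rules differ only for a mark that has no base letter in front
of it, which is the case the question above is about. The last line
only shows that the UTF-8 encoder does not reject such a mark.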
> 1.c. currently the punctuation allowed is (,;".?!) which are common
> English punctuations but do not include any non-English punctuation.
> 1.d. So should common non-English punctuation be allowed and which?
> 1.e. Should all Unicode category P punctuation be allowed?
> 1.e. Should punctuation be allowed before the initial constrained
> 1.f. Should only Unicode category Pi and Ps be allowed before and Pf and
> Pe and Po after?
> 1.g. What is "space" (here I'm talking in the context of constrained
> markup, there is another thread that addresses it more generally), eg
> Unicode category Zs (https://www.compart.com/en/unicode/category/Zs) and
> AsciiDoc line separators?
From an i18n perspective, allowing "Unicode category Zs | Unicode category
Ps" before an opening constrained markup seems reasonable. We
can hope this would not dramatically raise the number of false positives
in markup detection. I have no opinion on whether to include
Unicode category Pi in that set.
The same reasoning applies to closing markup, using the
complementary Unicode categories, of course.
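As an illustration only, the boundary test could be sketched like this
in Python. The names OPEN_BEFORE, CLOSE_AFTER, may_open_after and
may_close_before are hypothetical, the handling of start/end of input
is an assumption, and reading "complementary" as Zs | Pe is my
interpretation. Pi and Pf are left out by default but can be passed in
to see their effect.

```python
import unicodedata
from typing import FrozenSet, Optional

# Assumed category sets for the sketch, not a decided rule.
OPEN_BEFORE: FrozenSet[str] = frozenset({"Zs", "Ps"})
CLOSE_AFTER: FrozenSet[str] = frozenset({"Zs", "Pe"})


def may_open_after(prev: Optional[str],
                   allowed: FrozenSet[str] = OPEN_BEFORE) -> bool:
    """True if an opening constrained mark may follow `prev`.

    `prev` is the code point before the mark; None or a newline stands
    for the start of a line (an assumption of this sketch).
    """
    if prev is None or prev == "\n":
        return True
    return unicodedata.category(prev) in allowed


def may_close_before(nxt: Optional[str],
                     allowed: FrozenSet[str] = CLOSE_AFTER) -> bool:
    """True if a closing constrained mark may be followed by `nxt`."""
    if nxt is None or nxt == "\n":
        return True
    return unicodedata.category(nxt) in allowed


print(may_open_after("("))       # Ps
print(may_open_after("\u00a0"))  # NO-BREAK SPACE, Zs
print(may_open_after("\u00ab"))  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK, Pi
print(may_open_after("\u00ab", OPEN_BEFORE | {"Pi"}))
print(may_close_before(")"))     # Pe
```

Keeping the category sets as parameters makes it easy to try the Pi/Pf
variant from 1.f against real documents before settling on a rule.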
> 1.h. Or since unconstrained markup is available should the specification
> be conservative on what is allowed bounding unconstrained markup, the
> markups (*_`#~^) are uncommon in general English text, but tend to occur
> when talking about programming code and math, and I don't know how
> common they are in other languages. The rules are intended to minimise
> nuisance recognition of such use-cases as markup, so the more situations
> that markup is allowed the more nuisance occurrences are likely.
This is the conservative approach, aiming to maximize
compatibility with the existing implementation. But, even for Latin
scripts, the current implementation is not satisfactory. Consider, for
example, the inverted question and exclamation marks used in Spanish
(¿ and ¡). So in the end, this has to evolve.
Should we adopt option (1.h) for v1.x of the spec, knowing
we would evolve toward a more i18n-aware solution based on Unicode
categories in v2.x? Or should we make the leap right now, accepting a
"good enough" compatibility with the existing document base?