<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi Michael,<br>
<br>
On 01.05.2015 02:55, Michael Scharf wrote:<br>
</div>
<blockquote cite="mid:5542CF04.4070703@scharf.gr" type="cite">Hi
Sebastian,
<br>
<br>
> While UTF-8 has a variable length of one to four bytes per
code point,
<br>
> UTF-16 and 32 have the advantage of a fixed length.
<br>
<br>
UTF-16 is **not** a fixed length encoding. Like UTF-8 it can use
up to 4 bytes.
<br>
Only UTF-32 encodes with a fixed length.
<br>
</blockquote>
<br>
Ah, yes, thanks, I got mixed up. <br>
<br>
<blockquote cite="mid:5542CF04.4070703@scharf.gr" type="cite">Here I
show it in Python 2 (note: u'xxx' is UTF-16 on a narrow build):
<br>
<br>
>>> len('𐐂')
<br>
4
<br>
>>> len(u'𐐂')
<br>
2
<br>
>>> len('ä')
<br>
2
<br>
>>> len(u'ä')
<br>
1
<br>
<br>
The same is true in JavaScript (Node):
<br>
<br>
<br>
> u4='𐐂'
<br>
'𐐂'
<br>
> u4.length
<br>
2
<br>
> u4.charCodeAt(1)
<br>
56322
<br>
> u4.charCodeAt(0)
<br>
55297
<br>
<br>
> u2='ä'
<br>
'ä'
<br>
> u2.length
<br>
1
<br>
> u2.charCodeAt(0)
<br>
228
<br>
> u2.charCodeAt(1)
<br>
NaN
<br>
<br>
</blockquote>
<br>
Do you think it is feasible to require implementations of the Web
Annotation Data Model to count in code points?<br>
JavaScript also seems to have methods for "character-based" string
counting:<br>
<br>
punycode.ucs2.decode(string).length;<br>
or <br>
Array.from(string).length;<br>
or<br>
[...string].length;<br>
<br>
I am not a JS expert; I am just copying from
<a class="moz-txt-link-freetext" href="https://mathiasbynens.be/notes/javascript-unicode">https://mathiasbynens.be/notes/javascript-unicode</a>.<br>
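For illustration, a minimal Node.js sketch of counting code points (the helper name countCodePoints is mine, not from any spec):

```javascript
// Counting code points instead of UTF-16 code units.
// `countCodePoints` is an illustrative name, not part of any standard API.
function countCodePoints(str) {
  // ES2015 string iteration walks code points, pairing surrogates.
  let n = 0;
  for (const cp of str) n += 1;
  return n;
}

console.log('\u{1F600}'.length);           // 2 -- UTF-16 code units
console.log(countCodePoints('\u{1F600}')); // 1 -- code points
```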
<br>
All the best,<br>
Sebastian<br>
<br>
<blockquote cite="mid:5542CF04.4070703@scharf.gr" type="cite">
<br>
Michael
<br>
<br>
On 2015-04-30 17:55, Sebastian Hellmann wrote:
<br>
<blockquote type="cite">Hi all,
<br>
<br>
I am a bit puzzled as to why
<a class="moz-txt-link-freetext" href="http://www.w3.org/TR/charmod/#sec-stringIndexing">http://www.w3.org/TR/charmod/#sec-stringIndexing</a> renames
Unicode code points (a clearly defined concept) to "Character
String".
<br>
In my understanding, the example in
<a class="moz-txt-link-freetext" href="http://www.w3.org/TR/charmod/#C052">http://www.w3.org/TR/charmod/#C052</a> is not a good one:
<br>
"(Example: the use of UTF-16 in [DOM Level 1])."
<br>
<br>
UTF-16 is the encoding of the string and is independent of code
points, code units and graphemes, i.e. you can encode the same code
point in UTF-8, UTF-16 or UTF-32, which changes
the number of code units and bytes needed.
<br>
<br>
While UTF-8 has a variable length of one to four bytes per code
point, UTF-16 and 32 have the advantage of a fixed length. This
means that you can use byte offsets easily to jump to certain
positions in the text. However, this is mostly
<br>
used internally, i.e. C/C++ has a datatype widechar using 16 bits,
as it is easier to allocate memory for variables. Maybe some DOM
parsers rely on UTF-16 internally too, but still count code
points.
<br>
<br>
On the (serialized) web, UTF-8 is predominant, which is really
not the question here, as the choice between graphemes, code
points and code units is orthogonal to encoding.
<br>
<br>
Regarding annotation, using code points or Character Strings is
definitely the best practice. Any deviation will lead to side
effects such as "ä" having the length 2:
<br>
<br>
Using code points:
<br>
Java, length(): "ä".length() == 1
<br>
PHP, utf8_decode(): strlen(utf8_decode("ä")) === 1
<br>
Python 2, len() in combination with decode():
len("ä".decode("UTF-8")) == 1
<br>
<br>
Using code units (here: UTF-8 bytes):
<br>
Unix wc: echo -n "ä" | wc -c is 2
<br>
PHP: strlen("ä") === 2
<br>
Python 2: len("ä") == 2
<br>
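For comparison, the different counts can be sketched side by side in Node.js (Buffer is Node-specific; this is an illustration, not an entry for a normative list):

```javascript
// One string, three counts (Node.js; Buffer is Node-specific).
const s = '\u00e4'; // "ä", a single code point

console.log([...s].length);                // 1 -- code points
console.log(s.length);                     // 1 -- UTF-16 code units (BMP char)
console.log(Buffer.byteLength(s, 'utf8')); // 2 -- UTF-8 bytes

const t = '\u{10402}'; // an astral (non-BMP) code point
console.log([...t].length);                // 1 -- code points
console.log(t.length);                     // 2 -- UTF-16 code units
console.log(Buffer.byteLength(t, 'utf8')); // 4 -- UTF-8 bytes
```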
<br>
For the NLP2RDF project we converted these 30 million
annotations to RDF: <a class="moz-txt-link-freetext" href="http://wiki-link.nlp2rdf.org/">http://wiki-link.nlp2rdf.org/</a>
<br>
It was quite difficult to work with byte offsets, given that
the original formats were HTML, txt, PDF and docx.
<br>
<br>
Anyhow, I don't know of a single use case for using code units
for annotation. I am unsure about graphemes. Personally, I think
byte offsets for text are unnecessary, simply because code points
are better, i.e. stable regarding encoding and
<br>
charset.
<br>
<br>
There is a problem with Unicode Normal Form (NF). Generally,
Normal Form C is fine. However, if people wish to annotate
diacritics independently, NFD is needed.
<br>
NFC: è (U+00E8)
<br>
NFD: e + U+0300 (combining grave accent)
<br>
In NFD you can annotate the code point for the diacritic
separately. However, NFD is not in wide use, and the annotation
of diacritics is probably out of scope.
<br>
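A minimal JavaScript sketch of the difference, using String.prototype.normalize():

```javascript
// The same visible character has different code-point counts per normal form.
const nfc = '\u00e8';              // "è" precomposed (NFC)
const nfd = nfc.normalize('NFD');  // "e" + U+0300 COMBINING GRAVE ACCENT

console.log([...nfc].length); // 1 -- the diacritic is not separately addressable
console.log([...nfd].length); // 2 -- the combining accent is its own code point
```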
<br>
There is some info in
<br>
- the "definition of string" section in the NIF spec:
<a class="moz-txt-link-freetext" href="http://persistence.uni-leipzig.org/nlp2rdf/specification/core.html">http://persistence.uni-leipzig.org/nlp2rdf/specification/core.html</a>
<br>
(yes, we consider moving to a W3C community group for further
improvement)
<br>
- Unicode Norm Forms:
<a class="moz-txt-link-freetext" href="http://unicode.org/reports/tr15/#Norm_Forms">http://unicode.org/reports/tr15/#Norm_Forms</a>
<br>
- <a class="moz-txt-link-freetext" href="http://tinyurl.com/sh-thesis">http://tinyurl.com/sh-thesis</a> , page 76
<br>
<br>
On my wishlist: I would hope that the new Annotation standard
includes a normative list (SHOULD, not MUST) of string-counting
functions for all major programming languages and other
standards like SPARQL, to tackle interoperability.
<br>
When transferring data, it is important that the receiving
implementation counts offsets the same way. Listing the
functions would help a lot.
<br>
<br>
All the best,
<br>
Sebastian
<br>
<br>
On 30.04.2015 13:01, Nick Stenning wrote:
<br>
<blockquote type="cite">Thanks for this reference, Martin, and
thanks for passing this to TAG,
<br>
Frederick.
<br>
<br>
The character model lays out the problems more clearly than I
have. It's
<br>
clear that the recommendation is to use character strings (i.e.
codepoint
<br>
sequences) unless:
<br>
<br>
a) there are performance considerations that would motivate
the use of
<br>
"code unit strings" (I presume interop with existing DOM APIs
would also
<br>
be a strong motivator)
<br>
b) "user interaction is a primary concern" -- in which case
grapheme
<br>
clusters may be considered
<br>
<br>
Unfortunately for us, both considerations apply in the
annotation use
<br>
case.
<br>
<br>
I'd suggest we schedule a discussion of this issue in an
upcoming call.
<br>
<br>
N
<br>
<br>
On Thu, Apr 30, 2015, at 02:58, Martin J. Dürst wrote:
<br>
<blockquote type="cite">Hello Frederick,
<br>
<br>
This is an old, well-known issue. As a starter, please have
a look at
<br>
what the Character Model has to say about this:
<br>
<br>
<a class="moz-txt-link-freetext" href="http://www.w3.org/TR/charmod/#sec-stringIndexing">http://www.w3.org/TR/charmod/#sec-stringIndexing</a>
<br>
<br>
Please feel free to come back again here or contact the I18N
WG.
<br>
<br>
Regards, Martin.
<br>
<br>
On 2015/04/29 21:45, Frederick Hirsch wrote:
<br>
<blockquote type="cite">TAG members - has the issue of
dealing with symbols vs characters/codepoints come up in
TAG discussion?
<br>
<br>
Any comment/suggestion welcome (I've cross-posted
intentionally, please remove recipients if not
appropriate.)
<br>
<br>
Thanks
<br>
<br>
regards, Frederick
<br>
<br>
Frederick Hirsch
<br>
Co-Chair, W3C Web Annotation WG
<br>
<br>
<a class="moz-txt-link-abbreviated" href="http://www.fjhirsch.com">www.fjhirsch.com</a> <a class="moz-txt-link-rfc2396E" href="http://www.fjhirsch.com/"><http://www.fjhirsch.com/></a>
<br>
@fjhirsch
<br>
<br>
<blockquote type="cite">Begin forwarded message:
<br>
<br>
From: "Nick Stenning"<a class="moz-txt-link-rfc2396E" href="mailto:nick@whiteink.com"><nick@whiteink.com></a>
<br>
Subject: Unicode offset calculations
<br>
Date: April 29, 2015 at 4:38:34 AM EDT
<br>
<a class="moz-txt-link-abbreviated" href="mailto:To:public-annotation@w3.org">To:public-annotation@w3.org</a>
<br>
<a class="moz-txt-link-abbreviated" href="mailto:Resent-From:public-annotation@w3.org">Resent-From:public-annotation@w3.org</a>
<br>
<br>
One of the most useful discussions at our working group
F2F last week was the result of a question from Takeshi
Kanai about how we calculate character offsets such as
those used by Text Position
Selector<a class="moz-txt-link-rfc2396E" href="http://www.w3.org/TR/annotation-model/#text-position-selector"><http://www.w3.org/TR/annotation-model/#text-position-selector></a>
in the draft model. Specifically, if I have a selector
such as
<br>
<br>
{
<br>
"@id": "urn:uuid:...",
<br>
"@type": "oa:TextPositionSelector",
<br>
"start": 478,
<br>
"end": 512
<br>
}
<br>
to what do the numbers 478 and 512 refer? These numbers
will likely be interpreted by other components specified
by this WG (such as the RangeFinder API), not to mention
external systems, and we need to make sure we are
consistent in our definitions across these
specifications.
<br>
<br>
I've reviewed what the model spec currently says and I'm
not sure it's particularly precise on this point. Even
if I'm misreading it and it is clear, I'm not sure it
makes a recommendation that is practical. In order to
review this, I'm going to first lay out the possible
points of ambiguity, and then review what the spec seems
to say on these issues.
<br>
<br>
1. A symbol is not (necessarily) a codepoint
<br>
<br>
The atom of selection in the browser is the symbol, or
grapheme. For example, "ą́" is composed of three
codepoints, but is rendered as a single selectable
symbol. It can only be unselected or selected: there is
no way to only select some of the codepoints that
comprise the symbol.
<br>
<br>
Because user selections start and end at symbols, it
would be reasonable for TextPositionSelector offsets to
be defined as symbol counts. Unfortunately, most extant
DOM APIs don't deal in symbols:
<br>
<br>
<blockquote type="cite">var p =
document.createElement('p')
<br>
p.innerText = 'ą́'
<br>
p.innerText.length
<br>
</blockquote>
3
<br>
<blockquote type="cite">p.firstChild.splitText(1)
<br>
p.firstChild.nodeValue
<br>
</blockquote>
'\u0061'
<br>
<blockquote type="cite">p.firstChild.nextSibling.nodeValue
<br>
</blockquote>
'\u0328\u0301'
<br>
Calculating how a sequence of codepoints maps to
rendered symbols is in principle
complicated<a class="moz-txt-link-rfc2396E" href="http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries"><http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries></a>
and in practice not completely standardised across
rendering engines. It's also (as demonstrated with
splitText above) possible for the DOM to end up in a
state in which the mapping between textual content and
rendered symbols has become decoupled.
<br>
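Newer JavaScript engines do expose grapheme segmentation via Intl.Segmenter (ECMA-402); a sketch, with countGraphemes being an illustrative name:

```javascript
// Grapheme ("symbol") counting with Intl.Segmenter (ECMA-402).
// This API is not part of the DOM APIs discussed above.
function countGraphemes(str) {
  const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...seg.segment(str)].length;
}

const s = 'a\u0328\u0301'; // three code points, one rendered symbol
console.log(s.length);          // 3 -- UTF-16 code units
console.log(countGraphemes(s)); // 1 -- grapheme clusters
```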
<br>
2. Combining characters
<br>
<br>
Some sequences of codepoints can render identically to
other sequences of codepoints. For example:
<br>
<br>
ñ (U+00F1 LATIN SMALL LETTER N WITH TILDE)
<br>
renders identically to
<br>
<br>
ñ (U+006E LATIN SMALL LETTER N + U+0303 COMBINING TILDE)
<br>
This is the "combining characters" problem. Some
codepoints are used to modify the appearance of
preceding codepoints. Selections made on a document
containing one of these would behave identically to
selections made on a document containing the other, but:
<br>
<br>
<blockquote type="cite">'ñ'.length
<br>
</blockquote>
1
<br>
<blockquote type="cite">'ñ'.length
<br>
</blockquote>
2
<br>
This is not an insoluble problem, as the Unicode
specification itself defines a process by which
sequences of codepoints can be canonicalised into fully
decomposed (aka "NFD") or fully composed (aka "NFC")
form. But it's not that simple, because if we specify a
canonicalisation requirement for annotation selector
offsets, then there may be undesirable performance
implications (consider making an annotation at the end
of a 100KB web page of unknown canonicalisation status).
<br>
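A small sketch of that canonical equivalence, using String.prototype.normalize():

```javascript
// Two canonically equivalent encodings of the same rendered character.
const composed = '\u00f1';    // U+00F1 LATIN SMALL LETTER N WITH TILDE
const decomposed = 'n\u0303'; // U+006E + U+0303 COMBINING TILDE

console.log(composed === decomposed);     // false -- different code points
console.log(composed.normalize('NFC') ===
            decomposed.normalize('NFC')); // true -- equal after NFC
console.log(composed.length);             // 1
console.log(decomposed.length);           // 2
```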
<br>
3. Astral codepoints and JavaScript
<br>
<br>
JavaScript's internal encoding of Unicode strings is
based on
UCS-2<a class="moz-txt-link-rfc2396E" href="https://en.wikipedia.org/wiki/UTF-16#History"><https://en.wikipedia.org/wiki/UTF-16#History></a>,
which means that it represents codepoints from the
so-called "astral planes" (i.e. codepoints above 0xFFFF)
as a surrogate pair of two 16-bit code units. This leads to the principal
problem that Takeshi identified, which is that different
environments will calculate offsets differently. For
example, in Python 3:
<br>
<br>
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">len('😀') # U+1F600 GRINNING
FACE
<br>
</blockquote>
</blockquote>
</blockquote>
1
<br>
Whereas in JavaScript:
<br>
<br>
<blockquote type="cite">'😀'.length
<br>
</blockquote>
2
<br>
There are ways of addressing this problem in JavaScript,
but to my knowledge none of them are particularly
elegant, and none of them will allow us to calculate
offsets at the bottom of a long document without
scanning the entire preceding text for astral
codepoints.
<br>
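The scanning cost can be made concrete with a small sketch (the helper name is illustrative, not a proposed API):

```javascript
// Converting a UTF-16 code-unit offset to a code-point offset requires
// scanning the preceding text for surrogate pairs -- the linear cost
// described above. The function name is illustrative only.
function codeUnitToCodePointOffset(str, unitOffset) {
  let points = 0;
  for (let i = 0; i < unitOffset; i++) {
    const hi = str.charCodeAt(i);
    const lo = str.charCodeAt(i + 1); // NaN past the end; comparisons fail safely
    // A high surrogate followed by a low surrogate is one code point.
    if (hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff) i += 1;
    points += 1;
  }
  return points;
}

console.log(codeUnitToCodePointOffset('\u{1F600}x', 3)); // 2
console.log(codeUnitToCodePointOffset('abc', 2));        // 2
```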
<br>
So what does our spec currently say?
<br>
<br>
The text must be normalized before counting characters.
HTML/XML tags should be removed, character entities
should be replaced with the character that they encode,
unnecessary whitespace should be normalized, and so
forth. The normalization routine may be performed
automatically by a browser, and other clients should
implement the DOM String Comparisons [DOM-Level-3-Core]
method.
<br>
<br>
It's not immediately clear what this means in terms of
Unicode normalisation. Following the chain of
specifications leads to §1.3.1 of the DOM Level 3 Core
specification<a class="moz-txt-link-rfc2396E" href="http://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMString"><http://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMString></a>.
The only thing this says about Unicode normalisation is:
<br>
<br>
The character normalization, i.e. transforming into
their fully normalized form as defined in [XML 1.1],
is assumed to happen at serialization time.
<br>
<br>
This doesn't appear to be relevant, as the meaning of
"serialization" in this context appears to refer to the
mechanisms described in the DOM Load and
Save<a class="moz-txt-link-rfc2396E" href="http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/"><http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/></a>
spec, and does not refer to the process of parsing an
HTML document and presenting its content through the DOM
APIs. (I'd be very happy if someone more familiar with
the DOM Level 3 Spec could confirm this interpretation.)
<br>
<br>
For completeness, "fully normalized form" in the XML 1.1
sense<a class="moz-txt-link-rfc2396E" href="http://www.w3.org/TR/2004/REC-xml11-20040204/#dt-fullnorm"><http://www.w3.org/TR/2004/REC-xml11-20040204/#dt-fullnorm></a>
would appear to imply full "NFC" normalisation of the
document. It is apparent even from the simple examples
above that browsers do not apply NFC normalisation to
documents they receive from the server.
<br>
<br>
What should we do about all this?
<br>
<br>
I've listed above three sources of confusion in talking
about offsets. In each case there is tension between
what would make the most sense to a user and a pragmatic
engineering recommendation that takes into account
contingent factors.
<br>
<br>
Symbols: users can only select symbols in browsers, but
as far as I'm aware all current internal DOM APIs ignore
this fact. Further, given that determining the symbol
sequence for a given codepoint sequence is non-trivial,
we probably should not attempt to define offsets in
terms of symbols.
<br>
<br>
Combining characters: user selections make no
distinction between combinatoric variants such as "ñ"
and "n + ˜", so it would seem logical to define offsets
in terms of the "NFC" canonicalised form. In practice,
such a recommendation would likely be ignored by
implementers (for reasons of complexity or performance
impact), and so for the same reasons as in 1) I'd be
inclined to suggest we define offsets in terms of the
delivered document codepoint sequence rather than any
canonical form.
<br>
<br>
Astral codepoints + surrogate pairs: this is the tricky
one. As demonstrated by this
page<a class="moz-txt-link-rfc2396E" href="http://bl.ocks.org/nickstenning/bf09f4538878b97ebe6f"><http://bl.ocks.org/nickstenning/bf09f4538878b97ebe6f></a>,
this poses serious problems for interoperability, as
JavaScript counts a single Unicode astral codepoint as
having length 2, due to the internal representation of
the codepoint as a surrogate pair. As far as I'm
concerned we're stuck between a rock and a hard place:
<br>
<br>
a. calculating offsets in terms of codepoints (i.e.
accounting for surrogate pairs in JavaScript) makes
interoperability more likely, but could impose a
substantial cost on client-side algorithms, both in
terms of implementation complexity and performance
impact.
<br>
<br>
b. calculating offsets using native calculations on
JavaScript strings is preferable from an implementation
complexity standpoint, but as far as I'm aware no other
mainstream programming environment has the same
idiosyncrasy, thus almost guaranteeing problems of
interoperability when offsets are used in both the DOM
environment and outside.
<br>
<br>
In summary, Takeshi raised an important question at the
F2F. What do we do about JavaScript's rather unfortunate
implementation of Unicode strings? I'd be interested to
hear from anyone with thoughts on this subject. I
imagine there are people in the I18N activity at W3C who
would be able to weigh in here too.
<br>
<br>
-N
<br>
<br>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
</blockquote>
<br>
<br>
<div class="moz-signature">-- <br>
<small>Sebastian Hellmann<br>
AKSW/NLP2RDF research group<br>
Institute for Applied Informatics (InfAI) and DBpedia Association<br>
Events: <br>
* <b>Feb 9th, 2015</b> <a
href="http://wiki.dbpedia.org/meetings/Dublin2015">3rd DBpedia
Community Meeting in Dublin</a><br>
* <b>May 29th, 2015</b> Submission deadline SEMANTiCS 2015 <br>
* <b>Sept 15th-17th, 2015</b> <a href="http://semantics.cc/">SEMANTiCS
2015 (formerly i-SEMANTICS), Vienna </a><br>
Venha para a Alemanha como PhD: <a
href="http://bis.informatik.uni-leipzig.de/csf">http://bis.informatik.uni-leipzig.de/csf</a><br>
Projects: <a href="http://dbpedia.org">http://dbpedia.org</a>,
<a href="http://nlp2rdf.org">http://nlp2rdf.org</a>, <a
href="http://linguistics.okfn.org">http://linguistics.okfn.org</a>,
<a href="http://www.w3.org/community/ld4lt">https://www.w3.org/community/ld4lt</a><br>
Homepage: <a href="http://aksw.org/SebastianHellmann">http://aksw.org/SebastianHellmann</a><br>
Research Group: <a href="http://aksw.org">http://aksw.org</a><br>
Thesis:<br>
<a href="http://tinyurl.com/sh-thesis-summary">http://tinyurl.com/sh-thesis-summary</a><br>
<a href="http://tinyurl.com/sh-thesis">http://tinyurl.com/sh-thesis</a><br>
</small></div>
</body>
</html>