[NLP2RDF] use of underscore vs. classical query scheme

Fri Jun 22 20:48:18 CEST 2012

Dear Maxime,
Actually the offset scheme of NIF is built on IETF RFC 5147 
(http://tools.ietf.org/html/rfc5147).
So the following would hold:
@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ld:offset_717_729  owl:sameAs ld:char=717,12 .

We might change the syntax and reuse the RFC/MediaFrag syntax. The only 
problem is that I don't see any advantages.
Are there, e.g., libraries that specialize in reading fragments? Request 
parsing, yes, but fragment parameters? I think they are not widely 
supported or implemented.
Here are the five reasons that led us to use our own syntax 
(#offset_717_729) instead of #char=717,12:

1.  The optional part is not easy to handle, because you would need to 
add owl:sameAs statements:

ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 .
ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 .
ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 .

So theoretically ok, but annoying to implement and check.
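
To illustrate the checking part: a consumer comparing such identifiers 
would first have to strip the optional part. A minimal Python sketch, 
assuming the optional part is simply everything after the first ';' 
(the function name is made up for illustration and is not part of NIF 
or of any library):

```python
def strip_optional_part(fragment: str) -> str:
    """Return the fragment without anything after the first ';'."""
    return fragment.split(";", 1)[0]


variants = [
    "char=717,12;length=12,UTF-8",
    "char=717,12;length=12",
    "char=717,12;UTF-8",
    "char=717,12",
]

# All four spellings collapse to one key, so they can be compared as the
# same string position without materializing owl:sameAs statements.
canonical = {strip_optional_part(v) for v in variants}
```

Every tool that consumes such URIs would need a step like this, whereas 
the offset_ identifiers can be compared as plain strings.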

2. When implementing web services, NIF allows the client to choose the 
prefix:
http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=text&nif=true&prefix=http%3A%2F%2Fthis.is%2Fa%2Fslash%2Fprefix%2F&urirecipe=offset&input=President+Obama+is+president. 

returning URIs like <http://this.is/a/slash/prefix/offset_10_15>
With the RFC 5147 syntax, the URIs would look like:
<http://this.is/a/slash/prefix/char=717,12>
<http://this.is/a/slash/prefix/char=717,12;UTF-8>
or
<http://this.is/a/slash/prefix?char=717,12>
<http://this.is/a/slash/prefix?char=717,12;UTF-8>
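
How a service might build such URIs from the client-chosen prefix can be 
sketched in Python as follows. This only decodes the request URL from 
above and concatenates strings; it does not call the demo service, and 
the helper name offset_uri is hypothetical:

```python
from urllib.parse import parse_qs, urlsplit

REQUEST = (
    "http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=text&nif=true"
    "&prefix=http%3A%2F%2Fthis.is%2Fa%2Fslash%2Fprefix%2F"
    "&urirecipe=offset&input=President+Obama+is+president."
)


def offset_uri(prefix: str, begin: int, end: int) -> str:
    """Append an offset-based identifier to a client-chosen prefix."""
    return f"{prefix}offset_{begin}_{end}"


# Decode the query parameters of the request URL.
params = parse_qs(urlsplit(REQUEST).query)
prefix = params["prefix"][0]
text = params["input"][0]

# Locate one token in the input text and mint its URI.
begin = text.index("Obama")
end = begin + len("Obama")
uri = offset_uri(prefix, begin, end)
```

Because the identifier contains only letters, digits and underscores, it 
can be appended to a prefix ending in '/' or '#' without any escaping.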

3. Characters like '=' and ',' prevent the use of prefixes, e.g. in Turtle:
echo "@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ld:offset_717_729  owl:sameAs ld:char=717,12 .
" > test.ttl ; rapper -i turtle  test.ttl

Correct Turtle:
@prefix owl: <http://www.w3.org/2002/07/owl#> .
<http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729> 
owl:sameAs <http://www.w3.org/DesignIssues/LinkedData.html#char=717,12> .
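
A serializer therefore has to decide per identifier whether the short 
form is allowed. A minimal Python sketch, assuming a deliberately 
conservative check (letters, digits and underscores only, which is 
stricter than the actual Turtle grammar); the function name is made up 
for illustration:

```python
import re

# Conservative on purpose: real Turtle local names allow more characters,
# so this check may fall back to a full IRI more often than necessary.
SAFE_LOCAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def turtle_term(namespace: str, prefix: str, local: str) -> str:
    """Use prefix:local when the local part is safe, else a full <IRI>."""
    if SAFE_LOCAL_NAME.fullmatch(local):
        return f"{prefix}:{local}"
    return f"<{namespace}{local}>"


LD = "http://www.w3.org/DesignIssues/LinkedData.html#"
short = turtle_term(LD, "ld", "offset_717_729")
full = turtle_term(LD, "ld", "char=717,12")
```

Here short stays a prefixed name, while full has to be written out as a 
complete IRI in angle brackets.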

4. Implementation becomes a little more difficult with the RFC syntax, 
whereas for the current scheme the following PHP code extracts the 
values correctly:

$identifier = "offset_717_729";
// explode() instead of split(), which is deprecated since PHP 5.3
$arr = explode("_", $identifier);
switch ($arr[0]) {
     case 'offset':
         $begin = $arr[1];
         $end = $arr[2];
         // new URI scheme might not have the first 20 chars any more
         break;
     case 'hash':
         $clength = $arr[1];
         $slength = $arr[2];
         $hash = $arr[3];
         // everything after "hash_<clength>_<slength>_<hash>_"
         $first20chars = urldecode(substr($identifier,
             strlen("hash_" . $clength . "_" . $slength . "_" . $hash . "_")));
         // calculate begin and end here if necessary
         break;
}
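
The same extraction can be sketched in Python, limiting the number of 
splits so that underscores inside the trailing string do no harm. The 
function name, the dictionary keys and the sample hash identifier (built 
from the values in the quoted mail below) are assumptions for 
illustration; reading $clength and $slength as context length and string 
length is my interpretation of the variable names:

```python
from urllib.parse import unquote


def parse_nif_identifier(identifier: str) -> dict:
    """Split an offset_ or hash_ identifier into its parts."""
    recipe, _, rest = identifier.partition("_")
    if recipe == "offset":
        # A trailing string after begin/end, if present, is ignored.
        begin, end = rest.split("_", 2)[:2]
        return {"recipe": recipe, "begin": int(begin), "end": int(end)}
    if recipe == "hash":
        # Split only three times: underscores in the tail are preserved.
        context_length, string_length, digest, tail = rest.split("_", 3)
        return {
            "recipe": recipe,
            "context_length": int(context_length),
            "string_length": int(string_length),
            "hash": digest,
            "first_chars": unquote(tail),
        }
    raise ValueError(f"unknown recipe: {recipe!r}")


offset = parse_nif_identifier("offset_717_729")
hashed = parse_nif_identifier(
    "hash_4_12_79edde636fac847c006605f82d4c5c4d_Semantic%20Web"
)
```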

5. The RFC and Media Fragments each assume a certain MIME type (plain 
text in the case of RFC 5147). NIF makes a broader assumption, i.e. 
strings in general.

All the best,
Sebastian

On 06/18/2012 05:53 PM, Maxime Lefrançois wrote:
> (same mail with subject and recipients corrected)
>
> Hi,
>
> Maybe there has already been some discussion about this: the pros and cons of using an underscore '_' to separate the different parts of the URI fragment.
>
> For the two existing recipes (offset-based and context-hash-based URIs), if there is an underscore '_' among the first 20 characters of the anchored string, this leads to a NIF URI that looks as if it had five parts. OK, that's not a big issue. But if you were to introduce a new NIF recipe with one of the inner parts being a string, you would need to percent-encode the underscores (%5F). This is unconventional, as the underscore is an RFC 3986 Unreserved Character and classical urlencode methods don't replace it.
>
>
> I suggest that NIF 2.0 use a classical query scheme for the URI fragments: #nifRecipe=identifier(&param=val)*
> This is also the direction taken in the W3C Media Fragment Proposed Recommendation [1]
>
> For instance:
> Offset-based URIs: #nif=offset&begin=14406&end=14418&text=Semantic%20Web
> Context-Hash-based URIs: #nif=hash&context=4&length=12&md5=79edde636fac847c006605f82d4c5c4d&text=Semantic%20Web
>
> This leads to slightly more verbose fragments, but then you don't need to escape RFC 3986 Unreserved Characters, as the '&' and '=' are always escaped (%26 and %3D).
>
> [1] http://www.w3.org/TR/media-frags/
>
> Cheers,
> Maxime Lefrançois
> _______________________________________________
> NLP2RDF mailing list
> NLP2RDF at lists.informatik.uni-leipzig.de
> http://lists.informatik.uni-leipzig.de/mailman/listinfo/nlp2rdf
>

-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org