[NLP2RDF] document and corpus level aggregates

Felix Sasaki fsasaki at w3.org
Thu May 30 08:29:17 CEST 2013


On 30.05.13 08:24, Steve Cassidy wrote:
> Thanks Felix, is there a difference though between making an assertion 
> about the document and making one about the string that results from 
> pre-processing the document?

The difference will be in the subject URIs: different tools might do 
different preprocessing, leading to different subject URIs in the 
assertions. For example, in

http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/nif/EX-nif-conversion-output.xml
you have as reference context
http://example.com/exampledoc.html#char=0,29
but you might have
http://example.com/exampledoc.html#char=0,30
When you process NIF representations produced by different extraction 
chains, e.g. in SPARQL queries, the difference between 29 and 30 matters: 
the two URIs identify different resources, so assertions made about one 
will not match the other.
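
To make the offset point concrete, here is a minimal Python sketch. It is 
only an illustration: the sample text, the two extraction chains and the 
context_uri helper are assumptions of mine, not part of NIF or ITS 2.0 
tooling. It shows how a single collapsed white space character changes the 
reference context URI, so a lookup keyed on one URI finds nothing under 
the other.

```python
DOC_URI = "http://example.com/exampledoc.html"


def context_uri(doc_uri: str, text: str) -> str:
    """Return a char-offset fragment URI spanning the whole extracted string.

    Hypothetical helper for illustration; not an API of any NIF library.
    """
    return f"{doc_uri}#char=0,{len(text)}"


# Hypothetical document text containing a double space (30 characters).
raw_text = "Welcome to the  example pages."

# Chain A (assumed): collapses runs of white space before building the context.
text_a = " ".join(raw_text.split())
# Chain B (assumed): keeps the text exactly as extracted.
text_b = raw_text

uri_a = context_uri(DOC_URI, text_a)
uri_b = context_uri(DOC_URI, text_b)

# Annotations stored per reference context, as chain A would emit them.
annotations = {uri_a: ["annotation from chain A"]}

print(uri_a)                        # context URI from chain A
print(uri_b)                        # context URI from chain B
print(annotations.get(uri_b, []))   # lookup with chain B's URI
```

The same mismatch would show up in a SPARQL query that matches on 
nif:referenceContext with one of the two URIs.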

Best,

Felix

>
> It's probably not an important point but it seems odd to me to qualify 
> it in this way.
>
> Steve
>
>
> On 30 May 2013 16:14, Felix Sasaki <fsasaki at w3.org> wrote:
>
>     On 30.05.13 08:07, Steve Cassidy wrote:
>>
>>
>>         The basic unit in NIF is the nif:Context, so the
>>         document-level is covered, when the string in a nif:Context
>>         equals the content of a document. 
>>
>>         ...
>>         <Alcoholism.txt#char=37028,37043>
>>                 a  nif:RFC5147String ;
>>                 nif:beginIndex "37028" ;
>>                 nif:endIndex "37043" ;
>>                 itsrdf:taIdentRef
>>         <http://dbpedia.org/resource/Benzodiazepine> ;
>>                 nif:referenceContext <Alcoholism.txt#char=0,91429>  .
>>
>>
>>     Just wondering why you don't use <Alcoholism.txt> when making
>>     assertions about the document as a whole rather than giving the
>>     entire character range as a qualifier.
>
>     Hi Steve,
>
>     Sebastian may have a different answer, but here is my view from
>     how this is used in ITS 2.0: when you convert a document like
>     http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#EX-HTML-whitespace-normalization
>     to NIF, you will make a lot of decisions about what to drop (white
>     space nodes, content of HTML "head" or "script" inside "body") and
>     how to segment (e.g. not extracting the content of "span" separately
>     but rather as part of "p"). Together with nif:isString,
>     nif:referenceContext tells you clearly what the complete extracted
>     string is.
>
>     Best,
>
>     Felix
>
>>      Presumably the same assertion would be true of
>>     <Alcoholism.txt#char=0,91427>  too but if you are trying to
>>     encode document level meta-data and you have an identifier for
>>     the document, why not use it?
>>
>>     Steve
>>     -- 
>>     Department of Computing, Macquarie University
>>     http://web.science.mq.edu.au/~cassidy/
>>     <http://web.science.mq.edu.au/%7Ecassidy/>
>>
>>
>
>
>
>
