[NLP2RDF] document and corpus level aggregates
Felix Sasaki
fsasaki at w3.org
Thu May 30 08:29:17 CEST 2013
On 30.05.13 08:24, Steve Cassidy wrote:
> Thanks Felix, is there a difference though between making an assertion
> about the document and making one about the string that results from
> pre-processing the document?
The difference will be in the subject URIs: different tools might do
different preprocessing, leading to different subject URIs in the
assertions. For example, in
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/nif/EX-nif-conversion-output.xml
you have as reference context
http://example.com/exampledoc.html#char=0,29
but you might have
http://example.com/exampledoc.html#char=0,30
When you process NIF representations produced by different extraction
chains, e.g. in SPARQL queries, the difference between 29 and 30 matters.
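
To make this concrete, here is a minimal sketch of two hypothetical
extraction chains producing different reference context URIs for the
same document. The raw input string and the helper names
(context_uri, strip_only, normalize_whitespace) are assumptions made
for illustration; they are not part of NIF or of any tool.

```python
import re

# Document URI taken from the example above.
BASE = "http://example.com/exampledoc.html"


def context_uri(base, text):
    """Build a char-range URI covering the whole extracted string."""
    return f"{base}#char=0,{len(text)}"


def strip_only(raw):
    """Hypothetical chain A: only trims leading and trailing whitespace."""
    return raw.strip()


def normalize_whitespace(raw):
    """Hypothetical chain B: also collapses internal whitespace runs."""
    return re.sub(r"\s+", " ", raw).strip()


# Assumed raw text content; note the double space after "to".
raw = "  Welcome to  Dublin in Ireland! \n"

uri_a = context_uri(BASE, strip_only(raw))
uri_b = context_uri(BASE, normalize_whitespace(raw))

print(uri_a)  # http://example.com/exampledoc.html#char=0,30
print(uri_b)  # http://example.com/exampledoc.html#char=0,29
print(uri_a == uri_b)  # False
```

Both chains describe the same document, but a query that matches on
one context URI will not find the assertions made against the other.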
Best,
Felix
>
> It's probably not an important point but it seems odd to me to qualify
> it in this way.
>
> Steve
>
>
> On 30 May 2013 16:14, Felix Sasaki <fsasaki at w3.org> wrote:
>
> On 30.05.13 08:07, Steve Cassidy wrote:
>>
>>
>> The basic unit in NIF is the nif:Context, so the
>> document-level is covered, when the string in a nif:Context
>> equals the content of a document.
>>
>> ...
>> <Alcoholism.txt#char=37028,37043>
>> a nif:RFC5147String ;
>> nif:beginIndex "37028" ;
>> nif:endIndex "37043" ;
>> itsrdf:taIdentRef
>> <http://dbpedia.org/resource/Benzodiazepine> ;
>> nif:referenceContext <Alcoholism.txt#char=0,91429> .
>>
>>
>> Just wondering why you don't use <Alcoholism.txt> when making
>> assertions about the document as a whole rather than giving the
>> entire character range as a qualifier.
>
> Hi Steve,
>
> Sebastian may have a different answer, but here is my view from
> how this is used in ITS 2.0: when you convert a document like
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#EX-HTML-whitespace-normalization
> to NIF, you make many decisions about what to drop (white space
> nodes, the content of HTML "head", or "script" inside "body") and
> how to segment (e.g. extracting the content of "span" not
> separately but as part of "p"). Together with nif:isString,
> nif:referenceContext tells you clearly what the complete
> extracted string is.
>
> Best,
>
> Felix
>
>> Presumably the same assertion would be true of
>> <Alcoholism.txt#char=0,91427> too but if you are trying to
>> encode document level meta-data and you have an identifier for
>> the document, why not use it?
>>
>> Steve
>> --
>> Department of Computing, Macquarie University
>> http://web.science.mq.edu.au/~cassidy/
>>
>>
>> _______________________________________________
>> NLP2RDF mailing list
>> NLP2RDF at lists.informatik.uni-leipzig.de
>> http://lists.informatik.uni-leipzig.de/mailman/listinfo/nlp2rdf
>
>
>
>
> --
> Department of Computing, Macquarie University
> http://web.science.mq.edu.au/~cassidy/