[NLP2RDF] document and corpus level aggregates

Thu May 30 08:34:06 CEST 2013

> The difference will be in the subject URIs: different tools might do
> different preprocessing, leading to different subject URIs in the
> assertions: e.g. in
>
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/nif/EX-nif-conversion-output.xml
> you have as reference context
> http://example.com/exampledoc.html#char=0,29
> but you might have
> http://example.com/exampledoc.html#char=0,30
> When working with NIF representations produced by different extraction
> chains, e.g. in SPARQL queries, the difference between 29 and 30 matters.
>

Exactly, so if the _intention_ is to make an assertion about the document,
then http://example.com/exampledoc.html would be a more appropriate subject
URI. If the intention is to make an assertion about the result of
processing that document then the char range is appropriate.
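
To make the two choices of subject concrete, here is a rough Python sketch
that separates a char-range URI into the document URI and its offsets. It
is only an illustration, not anything defined by NIF: the function name
split_subject is made up, and it assumes the fragment always has the
char=begin,end form used in the example above.

```python
import re
from urllib.parse import urldefrag


def split_subject(uri):
    """Split a subject URI into (document_uri, offsets).

    offsets is a (begin, end) tuple when the fragment looks like
    "char=0,29" (an assumption based on the example URI), else None.
    """
    # urldefrag separates "http://...html" from the part after "#".
    doc, frag = urldefrag(uri)
    match = re.fullmatch(r"char=(\d+),(\d+)", frag)
    offsets = (int(match.group(1)), int(match.group(2))) if match else None
    return doc, offsets


doc, offsets = split_subject("http://example.com/exampledoc.html#char=0,29")
print(doc)      # subject for an assertion about the document itself
print(offsets)  # offsets that depend on how the document was processed
```

The first value is what I would use for "this document has 300 words", the
full URI for "when I process it like this it has 300 words".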

It's perhaps the difference between "this document has 300 words" and "when
I process this document like this it has 300 words".

The problem might come, as you say, when we try to aggregate results from
different chains, each of which intended to make assertions about the
document as a whole but used different pre-processing, giving different
offsets.
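
One possible way around it, sketched in Python under my own assumptions:
drop the fragment before aggregating, so that results from both chains land
on the same document URI. The chain names and word counts below are
invented for illustration, and group_by_document is a hypothetical helper,
not part of any NIF tooling.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_document(assertions):
    """Group (subject_uri, chain, value) tuples by fragment-free URI."""
    grouped = defaultdict(list)
    for subject, chain, value in assertions:
        # Ignore the #char=... part so differing offsets no longer matter.
        doc = urldefrag(subject).url
        grouped[doc].append((chain, value))
    return dict(grouped)


# Invented data: two chains whose pre-processing gave different offsets.
assertions = [
    ("http://example.com/exampledoc.html#char=0,29", "chain-a", 300),
    ("http://example.com/exampledoc.html#char=0,30", "chain-b", 301),
]

for doc, results in group_by_document(assertions).items():
    print(doc, results)
```

This only helps when the assertion really is about the document as a whole;
anything that depends on the offsets still needs the char range.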

Steve
-- 
Department of Computing, Macquarie University
http://web.science.mq.edu.au/~cassidy/