[NLP2RDF] document and corpus level aggregates

Sebastian Hellmann hellmann at informatik.uni-leipzig.de
Thu May 30 12:24:38 CEST 2013


Hi Steve,
> Thanks Felix, is there a difference though between making an assertion 
> about the document and making one about the string that results from 
> pre-processing the document? 

Documents are really tricky, technically as well as philosophically 
(abstract identity, ship of Theseus). Off the top of my head I couldn't 
even define what "document" means exactly.

Basically you can never be certain what hides behind a document URL. 
Here are some examples:
1. non-information resources: http://dbpedia.org/resource/London
2. A multilingual CMS normally implements a fallback mechanism to 
English if a translated page is missing in, e.g., German. So while the 
language of the document would be German, the content would be English.
3. http://www.w3.org/DesignIssues/LinkedData.html has been edited 
several times, most recently in 2009. What does the URI refer to in this 
case? All versions or the latest?

For NLP we should only use the document URL for info not concerning the 
content. This makes everything much easier and interoperable.

> It's perhaps the difference between "this document has 300 words" and 
> "when I process this document like this it has 300 words". 
That is one major difficulty for interoperability. The latter one is 
reproducible. nif:Context is a more granular model, as it points to 
the text. It doesn't really matter whether the text is a whole document 
or not (e.g. a sentence or paragraph), so you can actually model 
paragraphs the same way.
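
To make the "when I process this document like this" point concrete, here is a minimal Python sketch. The document URL, the sample text and the helper names are assumptions for illustration only, and offsets are simply counted with Python's len() on the string. Two components that normalise whitespace differently end up with different whole-document character ranges, and therefore different subjects for their assertions:

```python
import re

def context_uri(document_url: str, text: str) -> str:
    """Build a '#char=begin,end' identifier spanning the whole string."""
    return f"{document_url}#char=0,{len(text)}"

def word_count(text: str) -> int:
    """Naive whitespace tokenisation; a stand-in for a real tokenizer."""
    return len(text.split())

# Hypothetical document URL and content (not taken from any real corpus).
DOC = "http://example.com/exampledoc.html"
raw = "My  favourite\nactress is Natalie Portman."

# Component A keeps the text as extracted; component B collapses whitespace.
normalised = re.sub(r"\s+", " ", raw).strip()

raw_uri = context_uri(DOC, raw)
norm_uri = context_uri(DOC, normalised)

# Both components count the same words, but attach the count to
# different subjects, so the two statements do not merge by themselves.
statements = {
    (raw_uri, "xx:wordcount", word_count(raw)),
    (norm_uri, "xx:wordcount", word_count(normalised)),
}
```

Each statement is reproducible given its pre-processing, which is what the context identifier records and the bare document URL cannot.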


> I guess the question is for a processing component that wants to make 
> an assertion in its output about the document as a whole so that a 
> subsequent step can use it.  Should it use the input document URI or 
> make an assertion about the character range that it used to represent 
> the document internally.  Given that the character range might be 
> different between different components, it would seem useful to have a 
> way of making assertions about the whole document that didn't depend 
> on how it was pre-processed.
>
>     Can you give a triple and a sparql query that only works if we
>     drop #char=0,29 from the URI?
>
> Well, it would be the result of two components making assertions about 
> different character ranges each believing that it is making an 
> assertion about the whole document.

Wouldn't this be a client issue, i.e. a question of how to merge the 
two? How is this handled traditionally? nif:sourceUrl is currently still 
unstable and underspecified.
Maybe we can find a use case for this issue and then decide.
For the ITS use case this is completely irrelevant, because NIF is only 
used in the Web service conversion scenario. That is: ITS in HTML -> 
Text (or NIF) -> NLP webservice -> NIF output -> merge with ITS in HTML.

There are several options (maybe more):
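1. Use a special identifier such as #char=0, to denote the whole 
character range. Everything then merges automatically.
2. The client can merge annotations by copying them:
# prefix assumed to be the NIF core namespace;
# <newUri#char=x,x> and <document> are placeholders
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
CONSTRUCT {
     <newUri#char=x,x> ?p ?o .
} WHERE {
     ?context ?p ?o .
     ?context nif:sourceUrl <document> .
}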
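
For clients without a SPARQL engine at hand, the copy-merge of option 2 can be sketched in plain Python over (subject, predicate, object) tuples. This is only an illustration under assumptions: the nif: namespace IRI, the xx: property IRI, the target URI and the function name merge_onto are all chosen for the example and are not fixed by anything above.

```python
# Assumed namespace for nif:sourceUrl (the property is still underspecified).
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
SOURCE_URL = NIF + "sourceUrl"
WORDCOUNT = "http://example.org/xx#wordcount"  # hypothetical property

def merge_onto(triples, document, target):
    """Copy every statement about a context of `document` onto `target`.

    Mirrors the CONSTRUCT pattern: find each ?context with
    nif:sourceUrl <document>, then re-emit all its ?p ?o on the new URI.
    """
    contexts = {s for (s, p, o) in triples if p == SOURCE_URL and o == document}
    return {(target, p, o) for (s, p, o) in triples if s in contexts}

DOC = "http://example.com/exampledoc.html"
CTX_A = DOC + "#char=0,41"   # whole document as seen by component A
CTX_B = DOC + "#char=0,40"   # whole document as seen by component B
TARGET = DOC + "#char=0,40"  # URI the client decides to merge onto

triples = {
    (CTX_A, SOURCE_URL, DOC),
    (CTX_A, WORDCOUNT, 5),
    (CTX_B, SOURCE_URL, DOC),
    (CTX_B, WORDCOUNT, 6),
}

merged = merge_onto(triples, DOC, TARGET)
```

Note that copying does not reconcile anything: if the components disagree, as in this made-up data, the target simply carries both values and the client still has to decide what to do with them.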

Could you elaborate on what kind of annotations you are referring to?
Using the document URI makes sense for certain annotations (e.g. 
dc:publisher), but not for others (e.g. nif:count).
All the best,
Sebastian

On 30.05.2013 09:01, Steve Cassidy wrote:
> On 30 May 2013 16:39, Felix Sasaki <fsasaki at w3.org> wrote:
>
>     Well, to avoid the problem you need two pieces of information:
>     - document URI independent of complete character range
>     - document URI + complete character range
>     http://example.com/exampledoc.html#=char=0,29 gives you both, and
>     the ability to distinguish between different calculations of
>     complete character ranges.
>
>
> <http://example.com/exampledoc.html#=char=0,29> xx:wordcount 5 .
> <http://example.com/exampledoc.html> xx:wordcount 5 .
>
> These are two separate statements and not related unless we say
>
> <http://example.com/exampledoc.html>
>         xx:full_character_range 
> <http://example.com/exampledoc.html#=char=0,29> .
>
> which of course you could assert.
>
> I guess the question is for a processing component that wants to make 
> an assertion in its output about the document as a whole so that a 
> subsequent step can use it.  Should it use the input document URI or 
> make an assertion about the character range that it used to represent 
> the document internally.  Given that the character range might be 
> different between different components, it would seem useful to have a 
> way of making assertions about the whole document that didn't depend 
> on how it was pre-processed.
>
>     Can you give a triple and a sparql query that only works if we
>     drop #=char=0,29 from the URI?
>
> Well, it would be the result of two components making assertions about 
> different character ranges each believing that it is making an 
> assertion about the whole document.
>
> Steve
>
> -- 
> Department of Computing, Macquarie University
> http://web.science.mq.edu.au/~cassidy/


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, 
Deadline: *July 8th*)
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org