[NLP2RDF] document and corpus level aggregates

David Lewis dave.lewis at cs.tcd.ie
Thu May 30 12:40:34 CEST 2013


I think some of these issues can only be addressed by understanding where any particular NIF model sits within a processing chain. The best way to record this, in my view, is using the provenance ontology. In MLW-LT we've been looking at how to integrate NIF and PROV-O, in particular for localisation processing chains, see:

http://www.w3.org/International/its/wiki/Provenance_Best_Practice

This is just a rough example, but we will be updating it and providing better documentation shortly. We already have an implementation, named CMS-LION, running successfully.
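
To make this concrete, here is a minimal hand-written sketch (in Turtle) of the kind of combination I have in mind; the example.org URIs and the activity are invented for illustration and do not reflect the wiki example exactly:

@prefix nif:  <http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# the NIF context covering the text that was annotated
<http://example.org/doc.html#char=0,29>
    a nif:Context ;
    nif:sourceUrl <http://example.org/doc.html> ;
    # provenance: which processing step produced this context
    prov:wasGeneratedBy <http://example.org/activity/ner-run-1> .

# the processing step itself, described with PROV-O
<http://example.org/activity/ner-run-1>
    a prov:Activity ;
    prov:wasAssociatedWith <http://example.org/agent/ner-service> ;
    prov:endedAtTime "2013-05-30T10:00:00Z"^^xsd:dateTime .

<http://example.org/agent/ner-service>
    a prov:SoftwareAgent .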

I'd be keen to collaborate on working out in more detail how NIF and PROV-O could be combined.

Cheers,
Dave

On 30 May 2013, at 11:24, Sebastian Hellmann <hellmann at informatik.uni-leipzig.de> wrote:

> Hi Steve,
>> Thanks Felix, is there a difference though between making an assertion about the document and making one about the string that results from pre-processing the document? 
> 
> documents are really tricky, technically as well as philosophically (abstract identity, ship of Theseus). Off the top of my head I couldn't even define what "document" means exactly. 
> 
> Basically you can never be certain what hides behind a document URL. Here are some examples:
> 1. Non-information resources: http://dbpedia.org/resource/London
> 2. A multilingual CMS normally implements a fallback to English if a translated page is missing in, e.g., German. So while the language of the document would be German, the content would be English. 
> 3. http://www.w3.org/DesignIssues/LinkedData.html has been edited several times, most recently in 2009. What does the URI refer to in this case? All versions or only the latest?
> 
> For NLP we should only use the document URL for information that does not concern the content. This makes everything much easier and more interoperable. 
> 
>> It's perhaps the difference between "this document has 300 words" and "when I process this document like this it has 300 words". 
> That is one major difficulty for interoperability. The latter is reproducible. nif:Context is a more granular model, as it points to the text itself. It doesn't really matter whether that text is a whole document or not (it could just as well be a sentence or a paragraph), so you can model paragraphs the same way. 
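>
> As a minimal sketch (Turtle, with invented example URIs): a context for the whole pre-processed text and a context for a single paragraph look exactly the same, only the offsets differ:
>
> @prefix nif: <http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#> .
> @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
>
> # context for the whole pre-processed text
> <http://example.org/doc.html#char=0,300>
>     a nif:Context ;
>     nif:beginIndex "0"^^xsd:nonNegativeInteger ;
>     nif:endIndex "300"^^xsd:nonNegativeInteger ;
>     nif:sourceUrl <http://example.org/doc.html> .
>
> # context for one paragraph of the same text, modelled identically
> <http://example.org/doc.html#char=120,300>
>     a nif:Context ;
>     nif:beginIndex "120"^^xsd:nonNegativeInteger ;
>     nif:endIndex "300"^^xsd:nonNegativeInteger ;
>     nif:sourceUrl <http://example.org/doc.html> .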
> 
> 
>> I guess the question is for a processing component that wants to make an assertion in its output about the document as a whole so that a subsequent step can use it. Should it use the input document URI, or make an assertion about the character range that it used to represent the document internally? Given that the character range might be different between different components, it would seem useful to have a way of making assertions about the whole document that didn't depend on how it was pre-processed.
>> 
>>> Can you give a triple and a sparql query that only works if we drop #char=0,29 from the URI?
>> Well, it would be the result of two components making assertions about different character ranges, each believing that it is making an assertion about the whole document.
> 
> Wouldn't this be a client issue, i.e. how to merge the statements? How is this handled traditionally? nif:sourceUrl is currently still unstable and underspecified. 
> Maybe we can find a use case for this issue and then decide. 
> For the ITS use case this is completely irrelevant, because NIF is only used in the web service conversion scenario, i.e. ITS in HTML -> text (or NIF) -> NLP web service -> NIF output -> merge with ITS in HTML. 
> 
> There are several options (there may be more):
> 1. Use a special identifier such as #char=0, to denote the whole character range. Then everything merges automatically (see the sketch after the query under option 2 below). 
> 2. The client can merge annotations by copying them, e.g. with a CONSTRUCT query (the nif prefix is assumed to be the NIF Core namespace):
>
> PREFIX nif: <http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#>
>
> # copy every statement about the context onto a new context URI
> CONSTRUCT {
>     <newUri#char=x,x> ?p ?o
> } WHERE {
>     ?context ?p ?o .
>     ?context nif:sourceUrl <document> .
> }
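>
> To illustrate option 1 (invented document URI, xx: as a placeholder namespace, as in Steve's example): both components then write to the same subject, so the statements merge without any extra step:
>
> <http://example.com/exampledoc.html#char=0,>  xx:wordcount 5 .
> <http://example.com/exampledoc.html#char=0,>  xx:language "en" .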
> 
> Could you elaborate on what kind of annotations you are referring to?
> Using the document URI makes sense for certain annotations (e.g. dc:publisher); for others (e.g. nif:count) it does not. 
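>
> For example (invented URIs, xx: again just a placeholder): publisher metadata is independent of how the text was extracted, so it can live on the document URI, while a word count only makes sense relative to the concrete string, i.e. the context:
>
> <http://example.com/exampledoc.html>            dc:publisher "Example Corp" .
> <http://example.com/exampledoc.html#char=0,29>  xx:wordcount 5 .
>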
> All the best,
> Sebastian
> 
> Am 30.05.2013 09:01, schrieb Steve Cassidy:
>> On 30 May 2013 16:39, Felix Sasaki <fsasaki at w3.org> wrote:
>> 
>>> Well, to avoid the problem you need two pieces of information:
>>> - document URI independent of complete character range
>>> - document URI + complete character range 
>>> http://example.com/exampledoc.html#=char=0,29 gives you both, and the ability to distinguish between different calculations of complete character ranges.
>> 
>> <http://example.com/exampledoc.html#=char=0,29> xx:wordcount 5 .
>> <http://example.com/exampledoc.html> xx:wordcount 5 .
>> 
>> These are two separate statements and not related unless we say
>> 
>> <http://example.com/exampledoc.html> 
>>         xx:full_character_range <http://example.com/exampledoc.html#=char=0,29> .
>> 
>> which of course you could assert.  
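>> 
>> Given such a link, a client could then collect document-level assertions regardless of which of the two URIs a component used, e.g. (sketch only, xx: is the same made-up namespace as above):
>> 
>> PREFIX xx: <http://example.com/xx#>   # placeholder, not a real vocabulary
>> 
>> SELECT ?p ?o WHERE {
>>   { <http://example.com/exampledoc.html> ?p ?o }
>>   UNION
>>   { <http://example.com/exampledoc.html> xx:full_character_range ?range .
>>     ?range ?p ?o }
>> }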
>> 
>> I guess the question is for a processing component that wants to make an assertion in its output about the document as a whole so that a subsequent step can use it. Should it use the input document URI, or make an assertion about the character range that it used to represent the document internally? Given that the character range might be different between different components, it would seem useful to have a way of making assertions about the whole document that didn't depend on how it was pre-processed.
>> 
>>> Can you give a triple and a sparql query that only works if we drop #=char=0,29 from the URI?
>> Well, it would be the result of two components making assertions about different character ranges, each believing that it is making an assertion about the whole document.
>> 
>> Steve
>> 
>> -- 
>> Department of Computing, Macquarie University
>> http://web.science.mq.edu.au/~cassidy/
> 
> 
> -- 
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig 
> Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, Deadline: *July 8th*)
> Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
> Projects: http://nlp2rdf.org , http://linguistics.okfn.org , http://dbpedia.org/Wiktionary , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org
> _______________________________________________
> NLP2RDF mailing list
> NLP2RDF at lists.informatik.uni-leipzig.de
> http://lists.informatik.uni-leipzig.de/mailman/listinfo/nlp2rdf