[NLP2RDF] NIF and provenance

Thu Jan 8 12:38:33 CET 2015

Hey folks,

tracking provenance of NIF annotations is a problem that recently came
up in one of our projects. OpenAnnotation was one of the proposed
solutions, and we were thinking about reification and named graphs as
well. To further the discussion, I compiled a small presentation that
compares the different approaches. The main use case is tracking the
provenance of a specific NIF property (nif:stem, but it could also be
POS tags or similar) annotated by different tools. You can take a look
at it here:

https://docs.google.com/presentation/d/1TIuUiBCC95j9T2ZwYX-r27-lxx-LIRrE9HJS22mJT9w/edit?usp=sharing
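
As a rough illustration of the two approaches named above (reification
and named graphs), here is a minimal plain-Python sketch. It is not taken
from the presentation. Triples and quads are modelled as tuples, and the
example.org URIs, the tool names and the stem values "run" and "runn" are
made up for illustration. prov:wasAttributedTo is only one possible PROV
property for linking an annotation to a tool.

```python
# Plain-Python sketch: triples and quads are tuples, no RDF library needed.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
PROV = "http://www.w3.org/ns/prov#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace

TOKEN = EX + "doc.txt#char=0,7"              # hypothetical NIF string URI
TOOL_A, TOOL_B = EX + "toolA", EX + "toolB"  # hypothetical stemming tools


def reify(stmt, s, p, o, agent):
    """Reification: describe one triple as an rdf:Statement and attribute it."""
    return {
        (stmt, RDF + "type", RDF + "Statement"),
        (stmt, RDF + "subject", s),
        (stmt, RDF + "predicate", p),
        (stmt, RDF + "object", o),
        (stmt, PROV + "wasAttributedTo", agent),
    }


def stems_reified(triples, agent):
    """Return stem values whose reified statement is attributed to agent."""
    by_agent = {s for (s, p, o) in triples
                if p == PROV + "wasAttributedTo" and o == agent}
    about_stem = {s for (s, p, o) in triples
                  if p == RDF + "predicate" and o == NIF + "stem"}
    wanted = by_agent & about_stem
    return {o for (s, p, o) in triples
            if p == RDF + "object" and s in wanted}


def stems_named_graphs(quads, agent):
    """Named graphs: return stem values from graphs attributed to agent."""
    graphs = {s for (s, p, o, g) in quads
              if p == PROV + "wasAttributedTo" and o == agent}
    return {o for (s, p, o, g) in quads
            if p == NIF + "stem" and g in graphs}


# Two tools annotate the same string with different stems.
reified = (reify(EX + "stmt1", TOKEN, NIF + "stem", "run", TOOL_A)
           | reify(EX + "stmt2", TOKEN, NIF + "stem", "runn", TOOL_B))

quads = {
    (TOKEN, NIF + "stem", "run", EX + "graphA"),
    (TOKEN, NIF + "stem", "runn", EX + "graphB"),
    # Graph-level provenance, kept in a separate (hypothetical) metadata graph.
    (EX + "graphA", PROV + "wasAttributedTo", TOOL_A, EX + "meta"),
    (EX + "graphB", PROV + "wasAttributedTo", TOOL_B, EX + "meta"),
}

print(stems_reified(reified, TOOL_A))     # {'run'}
print(stems_named_graphs(quads, TOOL_B))  # {'runn'}
```

In this sketch reification costs five triples per annotation, while the
named-graph variant needs one quad per annotation plus one provenance
statement per graph. The named-graph variant only works if the graph
component survives serialization and transport.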

I'm open to comments, further use cases, best-practice proposals and
further approaches.

regards,
Martin

On 27.12.2014 at 20:16, Rob H Warren wrote:
> Dear all,
>
> The use cases that Peter has put forward all have merit. I have combined PROV with NIF in a previous project on transcription to record the specific processes that generated the fragments as well as different evaluations of transcriptions.
>
> One issue that I would like to bring up (and it is a problem elsewhere too): proper provenance tracking requires a series of instances and properties that link the result to the source through some process. It is not always practical to incur that complexity within a closed system, and people's self-interest means they will do the minimum, even if that leaves the data inconsistent for everyone else.
>
> For this reason, I would suggest that NIF take a dual provenance approach: simple NIF-based properties for straightforward cases and the full PROV spec for more complex problems. I'm not completely comfortable using graphs, for two reasons: in any distributed context it is ambiguous how to communicate to the query engine (or query writer) that the graph holding the triple has significance, and graph information is not encoded in some serializations.
>
> Perhaps a document with examples for tracking provenance with nif fragments might be useful for discussion and education?
>
> best,
> rhw
>
> On Dec 19, 2014, at 7:00 AM, nlp2rdf-request at lists.informatik.uni-leipzig.de wrote:
>
>> Message: 4
>> Date: Fri, 19 Dec 2014 09:05:12 +0100
>> From: Peter Menke <pmenke at techfak.uni-bielefeld.de>
>> To: Philipp Cimiano <cimiano at cit-ec.uni-bielefeld.de>, Rob H Warren
>> 	<warren at muninn-project.org>
>> Cc: swalter at techfak.uni-bielefeld.de, cunger at techfak.uni-bielefeld.de,
>> 	benjamin.siemoneit at yahoo.de, nlp2rdf at lists.informatik.uni-leipzig.de
>> Subject: Re: [NLP2RDF] NIF and provenance
>> Message-ID: <etPan.5493dc38.327b23c6.1051 at niphredil.fritz.box>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Dear all,
>>
>> let me give a little more detail in addition to the information provided by Philipp.
>>
>> Our goal is to add provenance information in such a way that it can be exploited to efficiently retrieve subsets of NIF corpora. Our original approach came from a querying perspective: we wanted to retrieve the correct triples for requests such as
>>
>> - Give me all texts in the corpus annotated by person X.
>> - Give me those layers of PoS information generated by service Y.
>> - Give me all annotations from the corpus that have been validated by service Z.
>>
>> However, some properties of NIF make it challenging for us to model this kind of information. One of them is the fact that NIF annotations use the "#char=n,m" notation as subjects for many related annotations. This makes it difficult to identify and address different kinds of annotations resulting from different activities. However, the identification of single layers of annotations seems very important for our problems.
>>
>> For instance, one of our use cases involves multiple PoS taggers that use the same tag set. They produce different results, but we cannot express the provenance information in a way that identifies the origin of a particular PoS tag token (e.g., when we want to answer questions such as "Which tagger can be blamed for this erroneous tag? Tagger A or tagger B?").
>>
>> Also, we looked at annotations that correct a previous layer of data, such as a manual correction of an automated tagging service. As soon as the automated tagger result is published in NIF there is no easy way of adding information about single corrections of its results.
>>
>> This, in a nutshell, is the background of what we attempt to achieve with provenance metadata in our current project. I will add further information as soon as possible, but I hope that this gives you a better impression and avoids some misunderstandings.
>>
>> Best regards,
>> Peter Menke
>>
>> --  
>> Peter Menke
>> SFB 673 "Alignment in Communication"
>> Project X1 "Multimodal Alignment Corpora"
>> Universität Bielefeld
>> Postfach 10 01 31
>> 33501 Bielefeld
>>
>> CITEC-Gebäude, Raum 2.309
>> Telefon (+49 521) 106-67328
>>
>> On 18 December 2014 at 23:32:40, Philipp Cimiano (cimiano at cit-ec.uni-bielefeld.de) wrote:
>>> Dear Rob, all,
>>>
>>> thanks for your answer. I should have been more precise in my
>>> question. We are actually building on PROV as well. So we are not
>>> looking into extending NIF by a provenance layer, but really working out
>>> best practices for representing provenance information in NIF building
>>> on PROV.
>>>
>>> Our actual question is where to attach the provenance information.
>>> We see two options: i) reifying all annotation triples and adding
>>> provenance information to the reified object representing the annotation,
>>> or ii) using named graphs to attach provenance information to the graph.
>>>
>>> Are there any experiences with these two options that you can share with us?
>>>
>>> Best regards,
>>>
>>> Philipp.