[NLP2RDF] NIF ontology modifications for dependency relations and ambiguous OLiA properties

Thu Aug 21 11:25:41 CEST 2014

Dear Martin, dear all,

> while converting the TIGER corpus and its dependency trees to NIF, I bumped
> against some limitations of the NIF ontology:
>
> At the moment, the only dedicated property for dependency structures is
> "nif:dependency", pointing from the head to the dependant. However, I think
> an inverse property to that would also be nice to have. So I propose
> "nif:phraseHead", pointing in the other direction.
>

And actually, that would be more conformant to quasi-standardized formats
such as CoNLL, which point from dependent to head. In CoNLL, this is a
technical artifact, though: in a tree, the head is unique for each
dependent, but not vice versa, and thus, the tabular CoNLL format requires
exactly two columns rather than an indefinite number.
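
To illustrate the point about the tabular format, here is a minimal Python
sketch. The toy sentence, the token ids and the labels are made up for
illustration and are not taken from TIGER. Each row stores one head per
token, and the head-to-dependents direction (what "nif:dependency"
expresses) is derived by inverting that column:

```python
from collections import defaultdict

# Hypothetical CoNLL-like rows: (token id, form, head id, relation label).
# Head id 0 stands for the artificial root. Labels are illustrative only.
rows = [
    (1, "Peter", 2, "SB"),
    (2, "sleeps", 0, "ROOT"),
    (3, "soundly", 2, "MO"),
]

def dependents_by_head(rows):
    """Invert the dependent -> head column into a head -> [dependents] map."""
    children = defaultdict(list)
    for tok_id, _form, head_id, _label in rows:
        children[head_id].append(tok_id)
    return dict(children)

# Dependent -> head: exactly one value per token, so one column suffices.
heads = {tok_id: head_id for tok_id, _form, head_id, _label in rows}

# Head -> dependents: a list of arbitrary length per token.
children = dependents_by_head(rows)
```

Both directions carry the same information for a tree; only the
dependent-to-head direction fits into a fixed number of columns.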

> A completely missing property is "nif:dependencyRelationType", annotating
> the type of the dependency relation as a literal, just like "nif:posTag"
> does for POS tags.
>

Can you provide a more elaborate example? Do you mean to annotate
dependency relations at dependents? This may not be unambiguous. For
schemes that allow multiple heads/parents for the same dependent/child
(e.g., TIGER, SALSA), the label needs to be attached to the relation (the
property) itself. To provide a link with OLiA and stay within OWL2/DL,
POWLA used reification to represent labelled edges. As I understand NIF,
however, reification is currently not recommended, because it creates too
much overhead for NIF's original use case as an interchange format in NLP
pipelines.
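
A small sketch of the ambiguity, in plain Python with triples as tuples.
All "ex:" identifiers and the labels are hypothetical, and
"nif:dependencyRelationType" is only the property proposed above, not an
existing NIF property. The second variant models each edge as its own
resource, roughly in the spirit of the reification mentioned above; it is
not POWLA's actual vocabulary.

```python
# Hypothetical data: token w3 has two heads (w1 and w2).
DEP = "nif:dependency"                # head -> dependent
REL = "nif:dependencyRelationType"    # proposed in this thread, not part of NIF

# Variant A: the relation type is a literal on the dependent.
flat = {
    ("ex:w1", DEP, "ex:w3"),
    ("ex:w2", DEP, "ex:w3"),
    ("ex:w3", REL, "SB"),
    ("ex:w3", REL, "OA"),
}

# Variant B: every edge is a resource of its own (reification-style).
reified = [
    {"edge": "ex:e1", "head": "ex:w1", "dependent": "ex:w3", "label": "SB"},
    {"edge": "ex:e2", "head": "ex:w2", "dependent": "ex:w3", "label": "OA"},
]

def labels_flat(head, dependent, triples):
    """All labels variant A can offer for one head/dependent pair."""
    if (head, DEP, dependent) not in triples:
        return set()
    return {o for s, p, o in triples if s == dependent and p == REL}

def labels_reified(head, dependent, edges):
    """Labels of the edges that connect exactly this head and dependent."""
    return {e["label"] for e in edges
            if e["head"] == head and e["dependent"] == dependent}

ambiguous = labels_flat("ex:w1", "ex:w3", flat)      # both labels come back
exact = labels_reified("ex:w1", "ex:w3", reified)    # only the w1 edge's label
```

Variant A cannot tell which label belongs to which head, while variant B
can, at the price of one extra resource and several statements per edge.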

> It would also be nice to have some property that annotates the root node
> of a sentence that could be used to traverse the dependency tree of the
> sentence just like "nif:firstWord" and "nif:nextWord" enable traversing
> surface structure of the sentence.
>

In POWLA, we had such a root feature, as it was intended for corpus
querying rather than merely representing NLP output. For querying, this is
quite essential. But again, it increases the overhead. The original
idea we had with Sebastian when he started his work on NIF was to develop a
division of labour between POWLA and NIF, with NIF being an interchange
format as minimal (and as compact) as possible, and POWLA a formalism ready
for OWL2/DL-supported corpus querying. For different reasons (new
positions, new obligations, etc.), working out the mapping between both
stalled at some point, but if there is interest from the community to work
more into the corpus direction with NIF, I would support taking this
endeavour up again.

In any case, one should carefully distinguish two different use cases:
annotation exchange and corpus querying, as they have diametrically
opposed requirements in terms of expressivity and formality (not so much on
the annotation exchange, more on the querying part) and compactness (vice
versa). Pushing both into the same formalism would be a compromise optimal
for neither application. Unless a use case emerges in which both
requirements meet (and I don't see any), developing two, bidirectionally
mappable, dialects of the same formalism would be preferable. A
reasonable mapping should be possible in linear time, so that from a
computational perspective, it would not add substantial overhead to a
pipeline in which on-the-fly syntax annotation (hence, polynomial, at
least) is involved.
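
As a rough sketch of what such a linear-time mapping could look like (plain
Python, hypothetical token ids, and not the actual POWLA or NIF vocabulary):
the compact side carries only head-to-dependent edges, and the
query-oriented side adds the inverse direction and the root, both derived
in a single pass.

```python
def expand(tokens, dependency_edges):
    """Derive inverse edges and root tokens from head -> dependent edges.

    One pass over the edges plus one pass over the tokens, i.e. linear
    in the size of the input.
    """
    head_of = {}
    for head, dependent in dependency_edges:
        head_of.setdefault(dependent, []).append(head)
    roots = [tok for tok in tokens if tok not in head_of]
    return head_of, roots

def compact(head_of):
    """Map back: keep only the head -> dependent edges."""
    return sorted((head, dep) for dep, heads in head_of.items() for head in heads)

# Hypothetical three-token sentence with w2 as the head of w1 and w3.
tokens = ["w1", "w2", "w3"]
edges = [("w2", "w1"), ("w2", "w3")]

head_of, roots = expand(tokens, edges)
round_trip = compact(head_of)
```

The round trip returns the original edges, so nothing is lost in either
direction; the expanded form is merely redundant.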

> On a different note, the OLiA tags may need some changes:
>
> At the moment, there is nif:oliaLink and nif:oliaCategory used to link
> annotated words to respective OLiA resources. However, these resources can
> either be morphological or syntactic annotations. The properties themselves
> don't make it sufficiently clear if the oliaLink is used to link to a POS
> tag category or a syntactic category, like "NounPhrase". I think this is
> semantically ambiguous. If OLiA is used for different classes of
> annotation, the properties should reflect this. So the tags should rather
> be "nif:oliaPosLink" and "nif:oliaSyntaxLink" or something like that.
>

From the OLiA perspective, this is a non-issue, as the Reference Model
remains agnostic about the annotation layer (pos, syntax, whatever) an
annotation comes from. This is because different schemes follow quite
different strategies to distribute morphological, syntactic or semantic
information across different annotation layers (e.g., *semantic* properties
such as being a locative adverb may be encoded on the POS level [Susanne
corpus], on the dependency level [Stanford deps], edge labels in a
constituency tree [TIGER], node labels in a constituency tree [Penn
Historical Corpora] and of course on NER or SRL levels).

Concepts in an OLiA Annotation Model are intrinsically tied to an
annotation layer, though. This is what the hasTier property is intended
for. It is not widely used, though, and may require reassessment, because
real-life tier (annotation level) identifiers are variable rather than
constants. In layer-based annotation tools such as ELAN or EXMARaLDA, tier
ids can be freely defined, and this is used, for example, for dialog
annotation. Then, there would be multiple pos layers, for example, and from
the perspective of the annotation model, we have no idea whether these are
called "STTS1..n" (for the German standard POS tagset), "POS1..n" or
"Wortarten1..n" (for German POS) or whatever.

> Another point in question is that NIF is rather dependent on OLiA
> categories. Now some tagsets used to annotate corpora are not mapped by
> OLiA. Users might also not agree with the OLiA categories themselves and
> might like to define own categories. There is no way to support such
> additions. Of course we could speak to Christian Chiarcos about additions
> to OLiA, but I don't know how open he will be to collaborative additions
> and changes to his model. My vague proposal (just an idea at this point)
> would be:
>

I'm open to additions and discussions about modifications. Modifications in
OLiA should be monotonic (until a version change there should be no
deletions, merely deprecation), but they are possible, and they don't
depend on me as a person, but on the collaborators on sourceforge (please
don't hesitate to contact me if you want to contribute).

In any case, OLiA encourages linking with "external reference models"
precisely in the way that Martin suggests. However, unless there is a
strong call for it from the community, I would advise against building yet
another ontology at the moment, as there are plenty around (ISOcat, GOLD,
TDS, quite a few project-specific ones). Instead, one may register novel
categories in ISOcat (http://www.isocat.org/) and include them in the
OLiA-ISOcat linking via rdfs:subPropertyOf and rdfs:subClassOf.
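
A toy sketch of such a linking, with triples as Python tuples. Every
identifier below is a placeholder (no real ISOcat data category id is
used), and the direction of the subClassOf links is an assumption made for
the example only: a project-specific concept is linked to a newly
registered ISOcat category, and an OLiA class to a more general one.

```python
SUBCLASS = "rdfs:subClassOf"

# Hypothetical linking triples; all names are placeholders.
links = {
    ("my:LocativeAdverb", SUBCLASS, "isocat:locativeAdverb"),
    ("isocat:locativeAdverb", SUBCLASS, "isocat:adverb"),
    ("olia:Adverb", SUBCLASS, "isocat:adverb"),
}

def superclasses(cls, triples):
    """Collect all classes reachable from cls via rdfs:subClassOf."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for subj, pred, obj in triples:
            if subj == current and pred == SUBCLASS and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen

mine = superclasses("my:LocativeAdverb", links)
shared = mine & superclasses("olia:Adverb", links)
```

Under these assumptions, the user-defined concept and the OLiA class become
comparable through their shared ISOcat superclass without any new ontology
being introduced.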

ISOcat provides definitions, URIs, and (optionally) hierarchical relations
between concepts, all of which can be exported to RDF. It is, however, not
an ontology, but a semistructured, and extendable, list of data categories,
so an ontology defining relations between (established or newly created)
ISOcat categories may be provided in addition. As I understood, this was
the idea of the RELcat addition to ISOcat. In this way, user-specific
semantics can be added to ISOcat URIs. A sample ISOcat ontology for the
morphosyntactic profile can be found in the "experimental" branch of the
OLiA sourceforge repository together with the (experimental) linking. If
there is demand from the community, I can polish them up and integrate them
into the stable dump.

All the best,
Christian