[NLP2RDF] [Corpora-List] Announcement: NLP Interchange Format (NIF)

John F. Sowa sowa at bestweb.net
Sat Dec 17 16:49:06 CET 2011


Dear Sebastian,

> I was actually wondering why you got stuck on the serialization
> of NIF. It is one of the tiniest and uninteresting aspects,
> in my opinion.

I certainly agree with that point.

Unfortunately, the RDF/XML serialization was the source of the
problem.  Everybody with a background in AI and knowledge
representation wanted to use LISP.  Ora Lassila started with a LISP
version in 1997.  Guha wanted to use LISP, but the W3C made the
decision to use XML.

If they had defined the semantics in LISP, they would have had clean
triples (A B C) that could be mapped to XML, JSON, or anything else
very easily.  But since they defined RDF in XML, the angle brackets
had lots of nooks and crannies where people could insert other
information -- and they did.  Please look at the article by Tim Bray:

    http://www.tbray.org/ongoing/When/200x/2003/05/21/RDFNet

A quotation from Tim Bray, who defined the original RDF:
> Speaking only for myself, I have never actually managed to write down
> a chunk of RDF/XML correctly, even when I had the triples laid out
> quite clearly in my head. Furthermore—once again speaking for myself
> —I find most existing RDF/XML entirely unreadable. And I think I
> understand the theory reasonably well.

When Guha and Bray, the two original designers of RDF/XML, admit
that they made a mistake, that is a clue that something is wrong.

SH
> There is a plethora of open parser and serializer implementations
> available, so developers are relieved of a lot of boilerplate.

But every one of those serializations is designed to be compatible
with RDF/XML.  If Guha had defined the semantics in LISP notation,
the translation to JSON would be trivial and compatible.  But the
definition in terms of RDF/XML means that any other notation must
be compatible with the underlying complexity.
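
For example, here is a minimal sketch in Python (the names and data
are invented for illustration) of how directly a LISP-style triple
(A B C) would go to JSON:

    import json

    # A LISP triple (A B C) is just an ordered list of three symbols.
    # The subject, predicate, and object here are made-up examples.
    triple = ["JohnSowa", "authorOf", "KnowledgeRepresentation"]

    # (A B C) -> ["A", "B", "C"]: the translation is a one-liner,
    # and it round-trips without loss.
    text = json.dumps(triple)
    assert json.loads(text) == triple
    print(text)   # ["JohnSowa", "authorOf", "KnowledgeRepresentation"]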

> Personally, I would recommend to use technologies like UIMA and RDBMS
> for performance critical tasks.

When you're talking about the WWW, NLP, or the combination WWW+NLP,
every application is performance critical.  At one point, IBM was
planning to use RDF, and they hired Guha at their Almaden Research
Center.  But the NLP group rejected the idea, and they invented UIMA.
For the Watson Jeopardy challenge, they needed so much performance
that they used a supercomputer with 2,880 processor cores.

Computers are getting faster all the time, but the amount of data
keeps growing even faster.  And NLP has never had the luxury
of surplus computing power.

> It would be interesting to have some measurement if the parsing
> speed has any relevance compared to network speed and latency
> in the real world.

I'm sure that Google, Microsoft, and Yahoo! have lots of data about
the real world.  And their answer was to collaborate on schema.org.
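
Still, the measurement SH asks for is easy to sketch.  The following
Python fragment times the parsing of a synthetic document of 10,000
triples and compares it with an assumed 50 ms network round trip
(both the document and the 50 ms figure are stand-ins, not
real-world data):

    import json, time

    # Synthetic document: 10,000 made-up triples serialized as JSON.
    doc = json.dumps([["s%d" % i, "p", "o%d" % i]
                      for i in range(10000)])

    # Average the cost of parsing the document over 100 runs.
    start = time.perf_counter()
    for _ in range(100):
        json.loads(doc)
    parse_ms = (time.perf_counter() - start) / 100 * 1000.0

    print("parsing: %.2f ms per document" % parse_ms)
    print("assumed network round trip: 50 ms")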

> One use case of NIF is to be able to easily replace one web service
> (e.g. Zemanta or OpenCalais) with another one as the interface is
> standardized.

Does anybody on the NIF project seriously think that IBM, Google,
Microsoft, Yahoo!, Amazon -- all of whom abandoned or never adopted
RDF -- will adopt any proposed standard based on RDF?  As I said in
an email note to Ontolog Forum, "If you can't beat 'em, join 'em."

> JSON and XML are not really data models, I would rather count
> them as serialization formats for data models.

I agree.  Pat Hayes worked with Guha to define the logic base (LBase)
for RDF.  Then he and Chris Menzel worked with the ISO committee to
define an upwardly compatible model theory for ISO/IEC 24707, the
standard for Common Logic (CL).

All you have to do is take the CL standard, define a mapping
from JSON to CL, and you have a semantics.  You don't have to
implement the full CL standard.  You can just implement whatever
subset you find useful -- and the full standard is always available
if and when you need more.
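
As a sketch of what such a mapping might look like (this particular
reading of a triple is an illustration, not the mapping Hayes
defined), a JSON triple [A, B, C] can be read as the CLIF atomic
sentence (B A C):

    def triple_to_clif(triple):
        # Read the middle element as the predicate:
        # [A, B, C] -> (B A C).  This choice is illustrative,
        # not a standard mapping.
        a, b, c = triple
        return "(%s %s %s)" % (b, a, c)

    print(triple_to_clif(["Socrates", "instanceOf", "Human"]))
    # -> (instanceOf Socrates Human)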

> I think the main difficulty for the adoption of RDF was that people
> tried to use it for tasks that it is not suited for (e.g. replacing
> relational databases).

That is another advantage of JSON.  You can write triples
["A", "B", "C"], or you can write n-tuples that map to or from a
relational DB.  You can also have typed records, such as
{"Type1": "A", "Type2": "B", "Type3": "C", "Type4": "D"}.
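
As a sketch of that round trip (using Python's sqlite3 module; the
table and field names are invented for the example):

    import json, sqlite3

    # A typed JSON record; the field names are made up.
    record = {"subj": "Socrates", "pred": "instanceOf",
              "obj": "Human", "source": "example"}

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE triples (subj, pred, obj, source)")

    # JSON record -> relational row
    db.execute("INSERT INTO triples VALUES (?, ?, ?, ?)",
               (record["subj"], record["pred"],
                record["obj"], record["source"]))

    # Relational row -> JSON record
    cur = db.execute("SELECT * FROM triples")
    cols = [c[0] for c in cur.description]
    print(json.dumps(dict(zip(cols, cur.fetchone()))))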

> Annotations are a form of linking, right?

Right. That's why Google, Microsoft, and Yahoo! designed schema.org.

My recommendation: Don't abandon RDF and OWL, but develop a migration
plan for the future.  Having a mapping to and from UIMA is good.
Define a semantics for JSON by mapping it to some version of logic,
and emphasize those features of NIF that are easy to represent and
map among RDF, UIMA, and JSON.

If you emphasize features in the intersection of all three,
quirky features of any one will fall into disuse.  That's good.

As for OWL, please note that the overwhelming majority of OWL
ontologies on the WWW use a tiny subset of OWL that doesn't go
beyond Aristotle.  That subset is also easy to map to and from
schema.org.  And keep in mind that rules and even larger subsets
of logic are widely used in NLP systems.

Good luck,

John
