[NLP2RDF] [Corpora-List] Announcement: NLP Interchange Format (NIF) 1.0 Spec, Demo and Reference Implementation

William Waites wwaites at tardis.ed.ac.uk
Sun Dec 4 13:52:19 CET 2011


Dear John,

On Sat, 03 Dec 2011 20:58:47 -0500, "John F. Sowa" <sowa at bestweb.net> said:

    jfs> Unfortunately, RDF is wrong in so many ways that it is hard
    jfs> to summarize them.  There is nothing wrong with having a
    jfs> readable human notation that compiles into an unreadable but
    jfs> efficient computer version.  But the RDF/XML notation is so
    jfs> bloated that it is horribly inefficient for computer
    jfs> processing, network transmission, and storage.

I don't think you will find many who disagree that RDF/XML is not very
pretty and not very convenient to use or process. But that's just the
surface encoding. Really it's just an obtuse way to represent 3-tuples,
and there are other, more convenient ways.

In fact I would suspect that the most common practice for exchanging
RDF data, certainly for large datasets, is to use N-Triples or N-Quads
where each tuple is simply written out on a line and ended with a
'.'. A bit verbose, but it compresses well and is very easy to
process.
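
As an illustration of how little machinery the line-per-tuple style
needs, here is a minimal sketch in Python. It handles only a restricted
subset (IRIs and plain string literals; no blank nodes, datatypes or
language tags), and the example data and the names term, to_ntriples
and from_ntriples are hypothetical, not part of any library.

```python
import re

# Hypothetical example data. An object is either an IRI string or a
# ("literal", text) pair.
TRIPLES = [
    ("http://example.org/doc1", "http://purl.org/dc/terms/title", ("literal", "Hello")),
    ("http://example.org/doc1", "http://purl.org/dc/terms/creator", "http://example.org/alice"),
]

def term(t):
    """Render one term: IRIs in angle brackets, literals in double quotes."""
    if isinstance(t, tuple):
        text = t[1].replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        return '"' + text + '"'
    return "<" + t + ">"

def to_ntriples(triples):
    """Write one statement per line, each ended with ' .'."""
    return "".join(f"{term(s)} {term(p)} {term(o)} .\n" for s, p, o in triples)

# Matches only the restricted subset produced by to_ntriples above.
LINE = re.compile(r'^<([^>]*)> <([^>]*)> (?:<([^>]*)>|"((?:[^"\\]|\\.)*)") \.$')

def unescape(text):
    """Undo the escaping applied in term()."""
    return re.sub(r"\\(.)", lambda m: "\n" if m.group(1) == "n" else m.group(1), text)

def from_ntriples(text):
    """Parse the lines back into the same tuple representation."""
    triples = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError("unsupported line: " + line)
        s, p, iri, literal = match.groups()
        o = iri if iri is not None else ("literal", unescape(literal))
        triples.append((s, p, o))
    return triples

print(to_ntriples(TRIPLES), end="")
```

Because every statement stands alone on its line, the output can be
split, concatenated, sorted or filtered with ordinary text tools.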

    jfs> At the semantic level, a serious flaw of RDF is the complete
    jfs> lack of typing.  There is no way to indicate that a URI is
    jfs> intended to represent a literal (the URI itself), the
    jfs> document identified by the URI, the content of that document,
    jfs> or the result of evaluating that content (if it happens to
    jfs> contain some executable or interpretable language).

An interesting point. Obviously literals can have types. There are
some corner cases where it doesn't work well (a string can't have both
a datatype and a language tag, yet it doesn't really make sense to
have an untyped string).
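
The corner case above can be made concrete with a small sketch in
Python. The Literal class, its field names and its render method are
hypothetical and only model the constraint as described here (a
datatype or a language tag, never both); escaping of the lexical form
is left out.

```python
from dataclasses import dataclass
from typing import Optional

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

@dataclass(frozen=True)
class Literal:
    lexical: str
    datatype: Optional[str] = None   # IRI of the datatype, for typed literals
    language: Optional[str] = None   # language tag, for plain literals

    def __post_init__(self):
        # The constraint described above: one or the other, not both.
        if self.datatype is not None and self.language is not None:
            raise ValueError("a literal may carry a datatype or a language tag, not both")

    def render(self):
        """Write the literal in N-Triples-like form (no escaping, for brevity)."""
        quoted = '"' + self.lexical + '"'
        if self.datatype is not None:
            return quoted + "^^<" + self.datatype + ">"
        if self.language is not None:
            return quoted + "@" + self.language
        return quoted

print(Literal("42", datatype=XSD_INTEGER).render())
print(Literal("chat", language="fr").render())
```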

You can, of course, make statements about a URI, and with a little bit
of indirection can even talk about the URI itself rather than what it
denotes. You can talk about a process, e.g. running a program that
comes from <u1> with input from <u2> and output at <u3> and if you're
careful about keeping the identifiers vs. documents straight (cf. the
http-range-14 bugbear) you can even do it pretty unambiguously.

I tend to think that the "indirection" that I snuck into the last
paragraph may be the bigger problem with RDF-the-data-model since
you're forced into doing that sort of thing (reification) as soon as
you want to talk about N-ary relations where N > 2. This has come out
again in some recent discussions on the RDF-WG list about overloading
the fourth column (graph) to have some other kind of meaning like
temporal scope. This is not a problem that Lisp or Prolog have.
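
To show the indirection in question, here is a minimal sketch in
Python of how a relation with three participants has to be broken up
into triples hanging off an intermediate node. The function name
nary_to_triples, the ex: terms and the node label are all hypothetical.

```python
def nary_to_triples(relation, roles, node):
    """Express an N-ary relation as triples about an intermediate node.

    relation: name of the relation type
    roles:    mapping from role name to participant
    node:     identifier for the intermediate node (e.g. a blank node label)
    """
    triples = [(node, "rdf:type", relation)]
    triples += [(node, role, value) for role, value in roles.items()]
    return triples

# In Lisp or Prolog this would be a single term, e.g. gave(alice, book, bob).
giving = nary_to_triples(
    "ex:Giving",
    {"ex:giver": "ex:alice", "ex:gift": "ex:book", "ex:recipient": "ex:bob"},
    node="_:g1",
)

for triple in giving:
    print(triple)
```

One three-place statement becomes four triples, and every consumer has
to know the convention to put them back together.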

    jfs> They do not use RDF.  They use RDFa, which is a notation for
    jfs> tagging HTML (or XML) documents.  But RDFa has nothing in
    jfs> common with RDF/XML other than the three letters R, D, and F.

Here the confusion behind your arguments is quite clear. RDF/XML and
RDFa are just two ways of writing down exactly the same
thing. Personally I don't think RDFa is a terribly good idea since it
mixes up the data with the presentation but there are some use cases
for which it makes sense (semantically rich cut-and-paste for
example).

The point is, you can take exactly the same 3-tuples and write them
down using RDF/XML or RDFa or Turtle or N-Triples or RDF/JSON or
JSON-LD and they will remain *exactly* the same. If you use one of the
many libraries that are available for parsing this stuff you won't
even notice the difference, modulo implementation quirks.
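
A small sketch in Python of that claim, using only the standard
library. The two encodings below are toy formats invented for this
example (a nested JSON object and tab-separated lines), not the real
RDF/JSON or N-Triples syntaxes, and the terms are treated as opaque
strings.

```python
import json

# Hypothetical example data: a set of 3-tuples.
TRIPLES = {
    ("ex:doc1", "dc:title", "Hello"),
    ("ex:doc1", "dc:creator", "ex:alice"),
    ("ex:alice", "foaf:name", "Alice"),
}

def to_nested_json(triples):
    """Encoding A: subject -> predicate -> list of objects, as JSON."""
    nested = {}
    for s, p, o in triples:
        nested.setdefault(s, {}).setdefault(p, []).append(o)
    return json.dumps(nested)

def from_nested_json(text):
    return {
        (s, p, o)
        for s, predicates in json.loads(text).items()
        for p, objects in predicates.items()
        for o in objects
    }

def to_lines(triples):
    """Encoding B: one tab-separated tuple per line."""
    return "\n".join("\t".join(t) for t in sorted(triples))

def from_lines(text):
    return {tuple(line.split("\t")) for line in text.splitlines() if line}

# Two very different surface forms, one and the same set of tuples.
assert from_nested_json(to_nested_json(TRIPLES)) == TRIPLES
assert from_lines(to_lines(TRIPLES)) == TRIPLES
```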

It is a mistake for you to focus on RDF/XML as a significant feature
or attribute of the NLP2RDF work.

    jfs> Furthermore, Google is one of the founding members of
    jfs> schema.org, which has developed their own vocabulary and
    jfs> methods of processing.  See their hierarchy of terms:
    jfs> http://schema.org/docs/full.html

    jfs> Look at the way they use those terms:
    jfs> http://schema.org/docs/gs.html You won't see any RDF or OWL
    jfs> there.

As a matter of fact you will:

   The data model used is very generic and derived from RDF Schema
   ... and an OWL version is here...

   http://schema.org/docs/datamodel.html
   http://schema.org/docs/schemaorg.owl

In addition, you will find that there has been some work to represent
the schema.org stuff directly in RDF, see http://schema.rdfs.org/

In other words, it's basically RDF with a simplified surface syntax
and some limits on what you can use it to express, which makes sense
for their use case of search engine optimisation.

    jfs> As I said, that model can be expressed more easily in LISP or
    jfs> JSON.  For a linguist, calling those triples a "subj-pred-obj
    jfs> model" is so hopelessly naive that there is no way they could
    jfs> take it seriously.

Just to reiterate one more time. You are quite right. It is perfectly
possible to represent triples in LISP or JSON and many people do
that. It is perfectly possible to automatically translate between
these and RDF/XML. There are programs and libraries that are good for
doing this. The surface syntax really doesn't matter. It's like the
difference between writing Yiddish text in Hebrew or Latin
letters. Apart from some marginal convenience there is no difference
at all.

Your point about S-P-O as a data model possibly being naive because it
is cumbersome to represent more complex relations is, I believe, quite
correct. That it is equally difficult to do annotation well
(statements about statements) is likewise a very big
problem. RDF/XML's ugliness compared to LISP or JSON is quite beside
the point.
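
For the statements-about-statements case, a minimal sketch in Python
of the standard reification vocabulary (rdf:Statement, rdf:subject,
rdf:predicate, rdf:object). The reify function, the ex: terms and the
annotation property ex:validDuring are hypothetical.

```python
def reify(triple, node):
    """Describe a triple with four triples about a statement node."""
    s, p, o = triple
    return [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    ]

claim = ("ex:alice", "ex:worksFor", "ex:acme")

# Four triples to describe the claim, plus one to annotate it.
graph = reify(claim, "_:s1") + [("_:s1", "ex:validDuring", "2011")]

for triple in graph:
    print(triple)
```

Note that the resulting graph describes the claim without containing
it: the original triple has to be added separately if it is also meant
to be asserted.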

Cheers,
-w

--
William Waites <wwaites at tardis.ed.ac.uk>
Visiting Researcher
Laboratory for Foundations of Computer Science
School of Informatics, University of Edinburgh