[NLP2RDF] Salient words extraction (e.g. ontology, Open Source projects)

Jean-Marc Vanel jeanmarc.vanel at gmail.com
Sat Jan 19 18:43:24 CET 2013


2013/1/16 Sebastian Hellmann <hellmann at informatik.uni-leipzig.de>:
> Hi Jean-Marc,
>
> to begin with, let's look at previous work in this area:

All this is interesting, but my original question was in fact "how to
extract keywords from a small text".
This small text happens to be the rdfs:comment of an owl:Ontology,
but it could be any text.

What was implied, and I should have made explicit, is that the
extraction could be done by rules (N3, Jena, or other) based on the
NLP2RDF output, or otherwise use this "triplified" syntactic tree.
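As a sketch of what a rule over the triplified tree could look like (the tiny in-memory triple representation below is purely illustrative, not an actual NLP2RDF API; the olia:Noun typing follows the OLiA vocabulary that annotated output could use):

```python
# Minimal sketch: apply a "keep the nouns" rule over a triplified
# syntax tree, represented here as plain (subject, predicate, object)
# tuples instead of real RDF. Tokens and tags are invented sample data.

triples = [
    ("tok1", "nif:anchorOf", "an"),       ("tok1", "rdf:type", "olia:Determiner"),
    ("tok2", "nif:anchorOf", "ontology"), ("tok2", "rdf:type", "olia:Noun"),
    ("tok3", "nif:anchorOf", "for"),      ("tok3", "rdf:type", "olia:Preposition"),
    ("tok4", "nif:anchorOf", "tourism"),  ("tok4", "rdf:type", "olia:Noun"),
]

def salient_words(triples):
    """Rule: keep the anchor text of every token typed olia:Noun."""
    nouns = {s for (s, p, o) in triples
             if p == "rdf:type" and o == "olia:Noun"}
    return [o for (s, p, o) in triples
            if p == "nif:anchorOf" and s in nouns]

print(salient_words(triples))  # ['ontology', 'tourism']
```

A real rule engine (N3 or Jena) would express the same pattern declaratively over the NLP2RDF graph.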
>>> further embedded comments .....

> First of all, there is the BioPortal: http://bioportal.bioontology.org/  for
> finding bio ontologies. This one won the SW challenge prize in 2010.
...
I added this detail in my comprehensive paragraph "Finding stuff on the Web":
http://eulergui.svn.sourceforge.net/viewvc/eulergui/trunk/eulergui/html/documentation.html#Finding

> On the other hand, all the community lists are full with questions asking
> the existence of ontologies for certain domains. I really would conclude
> from this, that there is a real need for an ontology repository, that works.

It does not have to be centralized.
It is the duty of ontology authors, or else other people, to provide
triples of the form
<http://my.com/ontologyXYZ#> skos:subject dbpedia:Tourism .

If ontology authors want their ontology to be found, in any
special-purpose repository or in Google, it is in their own interest
to add these triples.
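To illustrate how such skos:subject triples make ontologies findable, here is a minimal subject index (the URIs other than the e-tourism one are invented for illustration):

```python
# Sketch of a minimal subject index over skos:subject declarations.
# Most URIs below are invented; only the e-tourism one comes from the thread.

subject_triples = [
    ("http://my.com/ontologyXYZ#", "dbpedia:Tourism"),
    ("http://e-tourism.deri.at/ont/e-tourism.owl", "dbpedia:Tourism"),
    ("http://example.org/musicOnt#", "dbpedia:Music"),
]

def ontologies_about(subject, triples=subject_triples):
    """Return every ontology URI declared with the given skos:subject."""
    return [ont for (ont, subj) in triples if subj == subject]

print(ontologies_about("dbpedia:Tourism"))
```

A SPARQL endpoint or even a Google-indexed Turtle file would play the same role at Web scale.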

> So, if we are planning to do this, I would start the brainstorming right now
> ;)
>
> 1. Use google to find ontologies in the first place (bootstrapping)
>
> 2 (a). Use DBpedia to link to the ontologies, e.g.
> http://dbpedia.org/resource/Tourism could link to
> http://e-tourism.deri.at/ont/e-tourism.owl

Again, the best place to put the SKOS subject is the ontology itself.
DBpedia links today have to come from Wikipedia infoboxes and special links.
I don't think every author on every subject will have the time,
and the ontology knowledge, to add these links in Wikipedia.

> 2 (b) alternatively, we could make an extra endpoint that keeps metadata

I was thinking of this, with the additional idea that this endpoint is
populated by a simple web user page with these fields :

- 1. prefix or URI of the ontology
- 2. a "propose keywords" button that does keyword extraction +
DBpedia disambiguation according to 1., and then updates 3.
- 3. keywords
- 4. an OK button to update the underlying endpoint
- 5. a SEARCH button to find ontologies according to 3.
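The workflow behind fields 1-5 could be sketched roughly like this (every function is a stub standing in for a real service call; the extraction heuristic and the disambiguation table are invented):

```python
from collections import Counter

def propose_keywords(comment_text, top_k=5):
    """Step 2, extraction part: crude stand-in that ranks long words
    by frequency; a real version would call an NLP service."""
    words = [w.strip(".,:").lower() for w in comment_text.split()]
    words = [w for w in words if len(w) > 4]
    return [w for w, _ in Counter(words).most_common(top_k)]

def disambiguate(keyword):
    """Step 2, disambiguation part: stand-in for a DBpedia lookup."""
    known = {"tourism": "dbpedia:Tourism", "music": "dbpedia:Music"}
    return known.get(keyword)

def confirm(ontology_uri, keywords, endpoint):
    """Step 4: the OK button appends skos:subject triples to the endpoint."""
    for kw in keywords:
        concept = disambiguate(kw)
        if concept:
            endpoint.append((ontology_uri, "skos:subject", concept))

endpoint = []
kws = propose_keywords("An ontology about tourism: hotels, travel, tourism.")
confirm("http://my.com/ontologyXYZ#", kws, endpoint)
print(endpoint)
```

Step 5 (SEARCH) would then just be a SPARQL query over the endpoint's skos:subject triples.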

An example of a simple collaborative application, simpler but very
useful, is prefix.cc .

.....

> 3. create useful tips, how ontology creators could improve their ontology

There should indeed be discussion on semantic-web at w3.org , and on the Protégé list.

...

> Using syntax trees and part of speech tagging is, however, not the right
> tool for the job, I am afraid.

I read AIMA chapter 22.4, "information extraction", and indeed, that's
not how people do extraction. But relying on large corpora
and probabilities will maybe not help to catch ontologies for rare,
specialized subjects. Also, "information extraction" there is not the
same as "salient words extraction".

> Although sometimes it is completely enough to discard everything except the
> nouns. So the simple formula:
> "discard everything that is not of type olia:Noun " might yield salient
> words with sufficient quality ;)

... and simply sort by local frequency, maybe.
I see that it's doable.
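Combining the two ideas, a sketch (the (word, tag) pairs are invented sample data standing in for a POS tagger's output):

```python
from collections import Counter

# Sample (word, tag) pairs as a POS tagger might emit; data invented.
tagged = [("the", "Det"), ("ontology", "Noun"), ("describes", "Verb"),
          ("tourism", "Noun"), ("and", "Conj"), ("tourism", "Noun"),
          ("services", "Noun")]

def salient(tagged, top_k=3):
    """Discard everything that is not a noun, then rank by local frequency."""
    nouns = [w for (w, tag) in tagged if tag == "Noun"]
    return [w for w, _ in Counter(nouns).most_common(top_k)]

print(salient(tagged))  # ['tourism', 'ontology', 'services']
```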

And as further processing, rely on some triples in DBpedia or maybe
WordNet to merge concepts: e.g. sum up "travel hotel reservation
holidays" into just "Tourism".
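That merging step could be sketched with a hand-made hypernym table (the table itself is invented; a real version would walk DBpedia categories or WordNet hypernym chains):

```python
# Invented hypernym table standing in for DBpedia / WordNet lookups.
HYPERNYM = {"travel": "Tourism", "hotel": "Tourism",
            "reservation": "Tourism", "holidays": "Tourism",
            "guitar": "Music"}

def merge_concepts(words):
    """Map each word to its broader concept and deduplicate."""
    return sorted({HYPERNYM.get(w, w) for w in words})

print(merge_concepts(["travel", "hotel", "reservation", "holidays"]))
# ['Tourism']
```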

> All the best,
> Sebastian
>
>
>
> Am 13.01.2013 13:03, schrieb Jean-Marc Vanel:
>
>> In fact it is not hard to understand an ontology, but it is hard to
>> know which ontology to use.
>>
>> There is no "directory" of ontologies. It's like the menu of ice
>> creams, there are many. There are rather search engines but
>> traditional ones, not conceptual, such as Swoogle, Falcons [1] ... It
>> is so open, that it's hard even for knowledge experts to choose good
>> ontologies.
>>
>> To remedy this, what I have planned is to create tools to help
>> authors, users or developers to annotate ontologies with concepts from
>> DBpedia or WordNet, using NLP analyzers.
>>
>> So what it would be is a tool for extracting salient words from
>> English, which outputs 5 to 10 relevant words, typically from a
>> rdfs:comment. These words are then (if necessary) disambiguated , for
>> example using a Wikipedia Web Service (the one you use when typing in
>> the Wikipedia search field).
>>
>> Salient words (here music), will be put in triples such as:
>>
>> <myOntology> skos:subject DBpedia:Music.
>>
>> which can then be used in the ontology itself (the best), or added in
>> Turtle or RDF documents online or SPARQL databases and / or
>> collaborative sites such prefix.cc [2].
>>
>> Thus a human or an agent program could find a software component more
>> accurately. The issue about ontologies is similar to Open Source
>> programs, and many other types of resources.
>>
>> Ideally, the software component for the NLP extraction would in Java
>> and Open Source, which would facilitate the addition in the EulerGUI
>> environment [3]. I feel that nlp2rdf could help. It already has a web
>> service for parsing. What is missing is processing the syntax tree in
>> RDF for the salient words, or directly using an NLP tool.
>>
>> [1] Finding ontologies on the Web:
>>
>>
>> http://eulergui.svn.sourceforge.net/viewvc/eulergui/trunk/eulergui/html/documentation.html#Finding2
>>
>> [2] collaborative website for ontologies and their prefixes:
>> http://prefix.cc
>>
>> [3] EulerGUI , GUI environment and framework for Semantic Web and rules
>>
>>
>> http://eulergui.svn.sourceforge.net/viewvc/eulergui/trunk/eulergui/html/documentation.html
>>
>>
>>
>
>
> --
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig
> Projects: http://nlp2rdf.org , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org



-- 
Jean-Marc Vanel
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
http://deductions-software.com/
+33 (0)6 89 16 29 52
Twitter: @jmvanel ; chat: irc://irc.freenode.net#eulergui


More information about the NLP2RDF mailing list