[NLP2RDF] Salient words extraction (e.g. ontology, Open Source projects)

Sebastian Hellmann hellmann at informatik.uni-leipzig.de
Wed Jan 16 12:57:41 CET 2013


Hi Jean-Marc,

to begin with, let's look at previous work in this area:

First of all, there is BioPortal (http://bioportal.bioontology.org/)
for finding biomedical ontologies. It won the Semantic Web Challenge prize in 2010.
Secondly, Google already indexes all ontologies anyway:
http://www.google.com/search?btnG=1&pws=0&q=filetype%3Aowl+tourism

Here is another list: http://www.w3.org/wiki/Ontology_repositories
(SchemaWeb seems to be down: 
http://answers.semanticweb.com/questions/785/schemaweb-status )


On the other hand, all the community lists are full of questions 
asking whether an ontology exists for a certain domain. I would really 
conclude from this that there is a real need for an ontology 
repository that works.

So, if we are planning to do this, I would start the brainstorming right 
now ;)

1. Use Google to find ontologies in the first place (bootstrapping)

2 (a). Use DBpedia to link to the ontologies, e.g. 
http://dbpedia.org/resource/Tourism could link to 
http://e-tourism.deri.at/ont/e-tourism.owl via a triple like

<http://dbpedia.org/resource/Tourism> ontfinder:relatedOntology <http://e-tourism.deri.at/ont/e-tourism.owl> .
2 (b). Alternatively, we could set up an extra endpoint that keeps metadata 
about these ontologies, e.g. at
http://ontologies.nlp2rdf.org/resource/e-tourism.deri.at/ont/e-tourism.owl
The metadata could be a ranking (e.g. number of classes, availability of 
labels, vocabulary usage), tips on how to improve the ontology, and some 
linguistic information (e.g. keywords).
Then let DBpedia link there.
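
Just to make 2 (b) concrete, here is a minimal sketch with Jena of what such 
a metadata record could look like. The ontfinder namespace and the property 
names (classCount, labelCoverage, keyword) are made up for the example and 
not an agreed vocabulary, and the numbers are placeholders:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class MetadataRecordSketch {
    // Hypothetical namespace for the ontology-finder metadata vocabulary.
    static final String ONTFINDER = "http://ontologies.nlp2rdf.org/ontology/";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("ontfinder", ONTFINDER);

        // One record per indexed ontology, mirrored under our own namespace.
        Resource record = m.createResource(
            "http://ontologies.nlp2rdf.org/resource/e-tourism.deri.at/ont/e-tourism.owl");

        // Made-up ranking/metadata properties.
        Property classCount    = m.createProperty(ONTFINDER, "classCount");
        Property labelCoverage = m.createProperty(ONTFINDER, "labelCoverage");
        Property keyword       = m.createProperty(ONTFINDER, "keyword");

        record.addLiteral(classCount, 123);       // placeholder: number of classes
        record.addLiteral(labelCoverage, 0.85);   // placeholder: share of classes with rdfs:label
        record.addProperty(keyword, "tourism");   // salient word extracted from comments/labels

        m.write(System.out, "TURTLE");            // publish this, then let DBpedia link here
    }
}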

3. Indexing should be done with a mixture of information retrieval and 
NLP. We could use frequency classes of words or stopword lists to 
rank/filter the terms (a rough sketch follows below the links).
These are also promising for expanding the terms:
http://wiki.dbpedia.org/Datasets/NLP
http://dbpedia.org/Wiktionary
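
As a rough sketch of the stopword/frequency idea (the stopword list, the 
minimum token length and the cut-off are arbitrary choices for the example; 
a real version would rather use frequency classes per language):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SalientWordSketch {
    // Tiny illustrative stopword list; a real one would be much larger and per language.
    static final Set<String> STOPWORDS = new HashSet<String>(Arrays.asList(
        "the", "a", "an", "of", "and", "or", "is", "are", "to", "for", "in", "this", "that"));

    // Count non-stopword tokens and return the most frequent ones.
    public static List<String> salientWords(String text, int max) {
        final Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (token.length() < 3 || STOPWORDS.contains(token)) continue;
            Integer c = freq.get(token);
            freq.put(token, c == null ? 1 : c + 1);
        }
        List<String> words = new ArrayList<String>(freq.keySet());
        Collections.sort(words, new Comparator<String>() {
            public int compare(String a, String b) {
                return freq.get(b) - freq.get(a); // highest frequency first
            }
        });
        return words.subList(0, Math.min(max, words.size()));
    }

    public static void main(String[] args) {
        // In practice the input would be the rdfs:comment / rdfs:label values of an ontology.
        String comment = "An ontology for the tourism domain, describing hotels, "
                       + "tourism destinations and tourism packages.";
        System.out.println(salientWords(comment, 10));
    }
}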


Furthermore, I would throw in these:
1. Care about multilinguality a little bit, e.g. provide translations of 
the keywords in some way (turismo, toerisme); see the small sketch below.
2. Try to use the lemon vocabulary (if applicable).
3. Create useful tips on how ontology creators could improve their ontologies.
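
For point 1, the cheapest option I can think of would be language-tagged 
literals on the metadata record, for instance like this (again using the 
made-up ontfinder:keyword property from the sketch above, without lemon yet):

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class MultilingualKeywordSketch {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        // Hypothetical keyword property, same made-up namespace as above.
        Property keyword = m.createProperty("http://ontologies.nlp2rdf.org/ontology/", "keyword");
        Resource record = m.createResource(
            "http://ontologies.nlp2rdf.org/resource/e-tourism.deri.at/ont/e-tourism.owl");

        // One language-tagged literal per translation of the salient word.
        record.addProperty(keyword, m.createLiteral("tourism", "en"));
        record.addProperty(keyword, m.createLiteral("turismo", "es"));
        record.addProperty(keyword, m.createLiteral("toerisme", "nl"));

        m.write(System.out, "TURTLE");
    }
}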


Organisationally, we can use a cronjob and publish everything on GitHub. 
Other people could add their ontology URLs via pull request. Querying 
Google is difficult and should be limited. Maybe I can get another list 
of all available OWL files from somewhere.

Using syntax trees and part-of-speech tagging is, however, not the right 
tool for the job, I am afraid.
Sometimes, though, it is completely enough to discard everything except 
the nouns. So the simple formula
"discard everything that is not of type olia:Noun" might yield salient 
words with sufficient quality ;)
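
Spelled out in code, that formula could look roughly like the following. 
I am assuming POS-tagged NIF output where each word resource carries its 
surface string via str:anchorOf and is typed with an OLiA class; depending 
on the NIF version the link to OLiA may instead go through an 
oliaLink/oliaCategory property, so the namespaces and property names may 
need adjusting:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.ResIterator;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;

public class NounFilterSketch {
    public static void main(String[] args) {
        // Assumption: "annotated.ttl" is the NIF output of the POS-tagging web service,
        // with every word resource typed by an OLiA class.
        Model m = ModelFactory.createDefaultModel();
        m.read("file:annotated.ttl", "TURTLE");

        Resource noun = m.createResource("http://purl.org/olia/olia.owl#Noun");
        // Property holding the surface string of a word; the exact namespace may
        // differ depending on the NIF version.
        Property anchorOf = m.createProperty("http://nlp2rdf.lod2.eu/schema/string/", "anchorOf");

        // "discard everything that is not of type olia:Noun"
        ResIterator it = m.listSubjectsWithProperty(RDF.type, noun);
        while (it.hasNext()) {
            Resource word = it.nextResource();
            if (word.hasProperty(anchorOf)) {
                System.out.println(word.getProperty(anchorOf).getString());
            }
        }
    }
}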

All the best,
Sebastian



On 13.01.2013 at 13:03, Jean-Marc Vanel wrote:
> In fact it is not hard to understand an ontology, but it is hard to
> know which ontology to use.
>
> There is no "directory" of ontologies. It's like a menu of ice
> creams: there are many. There are search engines, but traditional
> ones, not conceptual, such as Swoogle or Falcons [1]. The field is so
> open that it is hard even for knowledge experts to choose good
> ontologies.
>
> To remedy this, what I have planned is to create tools to help
> authors, users or developers to annotate ontologies with concepts from
> DBpedia or WordNet, using NLP analyzers.
>
> So what it would be is a tool for extracting salient words from
> English text, which outputs 5 to 10 relevant words, typically from an
> rdfs:comment. These words are then (if necessary) disambiguated, for
> example using a Wikipedia web service (the one used when typing in
> the Wikipedia search field).
>
> Salient words (here "music") will be put into triples such as:
>
> <myOntology> skos:subject DBpedia:Music.
>
> which can then be used in the ontology itself (the best option), or added
> to Turtle or RDF documents online, to SPARQL databases, and/or to
> collaborative sites such as prefix.cc [2].
>
> Thus a human or a software agent could find a software component more
> accurately. The issue with ontologies is similar to that of Open Source
> programs, and many other types of resources.
>
> Ideally, the software component for the NLP extraction would be in Java
> and Open Source, which would facilitate its integration into the EulerGUI
> environment [3]. I feel that NLP2RDF could help. It already has a web
> service for parsing. What is missing is processing the syntax tree in
> RDF to obtain the salient words, or directly using an NLP tool.
>
> [1] Finding ontologies on the Web:
>
> http://eulergui.svn.sourceforge.net/viewvc/eulergui/trunk/eulergui/html/documentation.html#Finding2
>
> [2] collaborative website for ontologies and their prefixes: http://prefix.cc
>
> [3] EulerGUI , GUI environment and framework for Semantic Web and rules
>
> http://eulergui.svn.sourceforge.net/viewvc/eulergui/trunk/eulergui/html/documentation.html
>
>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org

