Tuesday, February 1, 2011

Semantic Data Acquisition from Wikipedia

We formed a group of three developers, Anna Cudzich, me (Bartosz Kosarzycki), and Sławomir Wałkowski, and started working on the Semantic Data Acquisition project at PUT. The idea was simple: get data from Wikipedia and convert it to an ontology (an RDF file). We wanted to benefit from the structure of Wikipedia, such as tables with preformatted data. We focused on just a few basic facts about American inventors:
- name
- date of birth/death
- place of birth/death
- native-born American or immigrant
- inventions

The first step was to download wiki page content as described here:

Then we:
- parsed the HTML files to get *.out files (text info on inventors) and *.table files (structured HTML table data)
- got rid of all the junk along the way (empty spaces, whitespace characters, etc.)
- wrote a parser to extract the desired info from the text (*.out) and *.table files and merge it together. We differentiated word types (nouns, verbs, pronouns, etc.) to get better-quality information.
- created an ontology (wrote an XML file in RDF with the appropriate info)
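
The steps above can be sketched roughly as follows. This is a minimal Python sketch using only the standard library, not the project's actual code. The sample HTML, the `http://example.org/inventors#` namespace, and the helper names (`TableExtractor`, `clean`, `to_rdf_xml`) are all assumptions made for illustration; the property names `firstname`, `surname` and `wasbornin` are borrowed from the sample queries in this post.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
INV_NS = "http://example.org/inventors#"  # hypothetical namespace, not the project's


def clean(text):
    """Collapse runs of whitespace (including non-breaking spaces) and trim."""
    return re.sub(r"\s+", " ", text.replace("\xa0", " ")).strip()


class TableExtractor(HTMLParser):
    """Collect the cleaned text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(clean("".join(self._cell)))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def to_rdf_xml(records):
    """Serialize a list of {property: value} dicts as RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("inv", INV_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for i, rec in enumerate(records):
        node = ET.SubElement(
            root,
            f"{{{RDF_NS}}}Description",
            {f"{{{RDF_NS}}}about": f"{INV_NS}inventor{i}"},
        )
        for prop, value in rec.items():
            ET.SubElement(node, f"{{{INV_NS}}}{prop}").text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative input only; the real pages are much messier.
SAMPLE_HTML = """
<table>
  <tr><th>First name</th><th>Surname</th><th>Born in</th></tr>
  <tr><td> Thomas </td><td>Edison</td><td>Milan,&nbsp;  Ohio</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)
parser.close()
header, *data = parser.rows
records = [dict(zip(("firstname", "surname", "wasbornin"), row)) for row in data]
rdf_xml = to_rdf_xml(records)
print(rdf_xml)
```

The word-type analysis of the free text (*.out) is not covered here; the sketch only shows the table path from HTML to RDF.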

The file can be loaded with the TWINKLE tool:

We used SPARQL to retrieve information from the RDF file. SPARQL is a query language for RDF defined by the W3C. More info here:

Sample queries in SPARQL:

Just get the surnames of inventors:
prefix inv: <>
SELECT ?surname
{ ?x inv:surname ?surname }

Get some more info (pairs of inventors born in the same place):
prefix inv: <>
SELECT ?place ?firstname1 ?surname1 ?firstname2 ?surname2
WHERE {
  ?x inv:firstname ?firstname1 .
  ?x inv:surname ?surname1 .
  ?x inv:wasbornin ?place .
  ?y inv:surname ?surname2 .
  ?y inv:firstname ?firstname2 .
  ?y inv:wasbornin ?place
}
ORDER BY ASC(?surname1)
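
To make clear what this query matches, here is a small Python sketch of the same join over an in-memory list. The sample records and the function name `same_birthplace_pairs` are made up for illustration and are not taken from the project's RDF file. Because `?x` and `?y` are independent variables, each inventor is also paired with themselves, and every real pair shows up in both orders.

```python
from itertools import product

# Hypothetical sample data: (firstname, surname, birthplace)
inventors = [
    ("Ann", "Baker", "Boston"),
    ("Carl", "Adams", "Boston"),
    ("Dana", "Clark", "Chicago"),
]


def same_birthplace_pairs(people, skip_self=False):
    """Mimic the join: every (x, y) combination that shares a birthplace."""
    rows = [
        (x[2], x[0], x[1], y[0], y[1])  # place, firstname1, surname1, firstname2, surname2
        for x, y in product(people, repeat=2)
        if x[2] == y[2] and not (skip_self and x is y)
    ]
    # Equivalent of ORDER BY ASC(?surname1)
    return sorted(rows, key=lambda row: row[2])


all_rows = same_birthplace_pairs(inventors)
distinct_rows = same_birthplace_pairs(inventors, skip_self=True)
for row in all_rows:
    print(row)
```

In SPARQL itself, adding `FILTER (?x != ?y)` inside the braces should drop the self-pairs in the same way `skip_self=True` does here.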

To count professions:
prefix sparql: <>
prefix inv: <>
SELECT (COUNT(sparql:lower-case(?profession)) AS ?total) ?profession
WHERE {
  ?x inv:profession ?profession
}
GROUP BY ?profession
ORDER BY ASC(?total)
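
As a cross-check on what this aggregate computes, the same count can be sketched in Python with `collections.Counter`. The sample values and the function name `count_professions` are made up. Note one difference: the query groups by the raw `?profession` value, while this sketch lower-cases before grouping, so `Engineer` and `engineer` fall into one bucket. That is an assumption about what was intended, not a description of the query's behavior.

```python
from collections import Counter

# Hypothetical sample values, as they might appear in the extracted data.
professions = ["Engineer", "engineer", "Chemist", "Physicist", "chemist", "engineer"]


def count_professions(values):
    """Count professions case-insensitively, ascending by total then by name."""
    counts = Counter(v.strip().lower() for v in values)
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))


totals = count_professions(professions)
for profession, total in totals:
    print(total, profession)
```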

The project site at PUT can be found here:

Technologies, IDEs and programs used:

