Bartosz Kosarzycki's blog: Semantic Data Acquisition from Wikipedia

Tuesday, February 1, 2011

Semantic Data Acquisition from Wikipedia

We formed a group of three developers (Anna Cudzich, Sławomir Wałkowski and me, Bartosz Kosarzycki) and started working on the Semantic Data Acquisition project at PUT. The idea was simple: get data from Wikipedia and convert it to an ontology (an RDF file). We wanted to benefit from the structure of Wikipedia, such as tables with preformatted data. We focused on just a few basic facts about American inventors:
- name
- date of birth/death
- place of birth/death
- native-born American or immigrant
- inventions

The first step was to download wiki page content as described here:
http://kosiara87.blogspot.com/2011/01/wget-downloading-web-page-content.html

Then we:
- parsed the HTML files to get *.out files (text info on inventors) and *.table files (structured HTML table data)
- got rid of all the junk along the way (empty spaces, whitespace characters etc.)
- wrote a parser to extract the desired info from the strings of text (*.out) and from *.table, and merged the results. We differentiated word types (nouns, verbs, pronouns etc.) to get better-quality information.
- created an ontology (wrote an XML file in RDF with the extracted info)
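
The first two steps above (splitting a page into free text and table data, and cleaning up whitespace) can be sketched roughly as follows. The project itself was written in C# with HTML Agility Pack; this is only a minimal Python illustration using the standard library, and the names `TableAndTextExtractor`, `clean` and `split_page` are hypothetical, not taken from the project's source code.

```python
from html.parser import HTMLParser
import re


def clean(text):
    """Collapse runs of whitespace to a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


class TableAndTextExtractor(HTMLParser):
    """Collects table rows and free text separately (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []         # structured data, comparable to a *.table file
        self.text_chunks = []  # free text, comparable to a *.out file
        self._table_depth = 0
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1
        elif tag == "tr" and self._table_depth:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(clean(" ".join(self._cell)))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._table_depth:
            self._table_depth -= 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)          # text inside a table cell
        elif not self._table_depth and data.strip():
            self.text_chunks.append(clean(data))  # text outside any table


def split_page(html):
    """Return (free_text, table_rows) for one HTML page."""
    parser = TableAndTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_chunks), parser.rows
```

Calling `split_page(html)` on a downloaded page returns the cleaned text and a list of table rows, each row being a list of cell strings. The later steps (word-type analysis and merging) are not covered by this sketch.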

The file can be loaded with the Twinkle tool:
http://www.ldodds.com/projects/twinkle/
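
For illustration, an RDF/XML file of the kind described above could be produced roughly like this. It is a Python sketch, not the project's C# code. The `inv` namespace and the property names (`firstname`, `surname`, `wasbornin`) are the ones used in the queries in this post, while the `rdf:Description` layout and the `inventorN` identifiers are assumptions made up for the example; the real file may be structured differently.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
INV_NS = "http://www.cs.put.poznan.pl/inventors/#"  # prefix used in the queries

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("inv", INV_NS)


def inventors_to_rdf(inventors):
    """Serialize a list of dicts as RDF/XML, one rdf:Description per inventor."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for number, inventor in enumerate(inventors, start=1):
        node = ET.SubElement(
            root,
            f"{{{RDF_NS}}}Description",
            # Resource identifiers are invented for this sketch.
            {f"{{{RDF_NS}}}about": f"{INV_NS}inventor{number}"},
        )
        # Each dict key becomes one property element in the inv namespace.
        for prop, value in inventor.items():
            ET.SubElement(node, f"{{{INV_NS}}}{prop}").text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `inventors_to_rdf([{"firstname": "Thomas", "surname": "Edison", "wasbornin": "Milan"}])` returns a string containing one `rdf:Description` element with `inv:firstname`, `inv:surname` and `inv:wasbornin` children.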

We used SPARQL to retrieve information from the RDF file. SPARQL is a query language for RDF data defined by the W3C. More info here:
http://www.w3.org/RDF/

Sample queries in SPARQL:

Just get the surnames of inventors:

------------------------------------

prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT ?surname
WHERE
{ ?x inv:surname ?surname }

Get some more info (pairs of inventors who were born in the same place):

----------------------------
prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT ?place ?firstname1 ?surname1 ?firstname2 ?surname2
WHERE
{
?x inv:firstname ?firstname1 .
?x inv:surname ?surname1 .
?x inv:wasbornin ?place .
?y inv:surname ?surname2 .
?y inv:firstname ?firstname2 .
?y inv:wasbornin ?place
}
ORDER BY ASC (?surname1)
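
To make the join in the query above easier to follow, here is a minimal Python sketch of the same logic over in-memory records. It is an illustration only: the function name `born_in_same_place` and the dict layout are hypothetical. Because `?x` and `?y` are matched independently, every inventor is also paired with themselves; adding `FILTER (?x != ?y)` to the query would drop those rows.

```python
from collections import defaultdict
from itertools import product


def born_in_same_place(inventors):
    """Mimic the query: every (x, y) pair sharing a wasbornin value.

    `inventors` is a list of dicts with firstname, surname and wasbornin keys.
    """
    by_place = defaultdict(list)
    for inventor in inventors:
        by_place[inventor["wasbornin"]].append(inventor)

    rows = []
    for place, group in by_place.items():
        # product() includes (x, x), just as the query does without a FILTER.
        for x, y in product(group, repeat=2):
            rows.append((place, x["firstname"], x["surname"],
                         y["firstname"], y["surname"]))
    # Corresponds to ORDER BY ASC (?surname1)
    return sorted(rows, key=lambda row: row[2])
```

With two inventors born in one place and a third born elsewhere, the result has five rows: four for the shared place (including the two self-pairs) and one self-pair for the third inventor.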

Count professions:

-------------------------

prefix sparql: <http://www.w3.org/2005/xpath-functions#>
prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT (COUNT(sparql:lower-case(?profession)) AS ?total) ?profession
WHERE
{
?x inv:profession ?profession
}
GROUP BY (?profession)
ORDER BY ASC (?total)
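
The counting done by this query can be sketched in Python as below. One thing to note, as far as the query text shows: the grouping is on the original `?profession` value, so `lower-case` inside `COUNT` does not merge differently capitalized spellings. The hypothetical `ignore_case` flag in the sketch shows what a case-insensitive count would look like instead; it is an assumption for illustration, not something the query does.

```python
from collections import Counter


def count_professions(professions, ignore_case=False):
    """Return (profession, total) pairs ordered by ascending total."""
    if ignore_case:
        # Not what the query does: merge spellings that differ only in case.
        professions = [p.lower() for p in professions]
    counts = Counter(professions)
    # Corresponds to ORDER BY ASC (?total)
    return sorted(counts.items(), key=lambda item: item[1])
```

With `ignore_case=False` the function counts each exact spelling separately, which matches the `GROUP BY (?profession)` behaviour.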

The project site at PUT can be found here:
http://semantic.cs.put.poznan.pl/dokuwiki/doku.php

Technologies, IDEs and programs used:

RDF, C#, .NET, Twinkle (SPARQL), HTML Agility Pack, NLPLib, PowerShell

PowerPoint presentations concerning the project (in Polish):
http://sites.google.com/site/bkosarzyckiaboutme/prezentacjaTSiSS2.pdf
http://sites.google.com/site/bkosarzyckiaboutme/prezentacjaTSiSS1.pdf

Results in an RDF file:
http://sites.google.com/site/bkosarzyckiaboutme/inventors_20110124_wiki.xml

The source code can be found here:
http://sites.google.com/site/bkosarzyckiaboutme/ProjectTsiss_SemanticDataAcquisitionFromWikipedia.7z
