- name
- date of birth/death
- place of birth/death
- native-born American or immigrant
- inventions
The first step was to download wiki page content as described here:
http://kosiara87.blogspot.com/2011/01/wget-downloading-web-page-content.html
Then we:
- parsed the HTML files into *.out files (free text about the inventors) and *.table files (structured HTML table data)
- stripped out the junk along the way (empty spaces, whitespace characters, etc.)
- wrote a parser that extracts the desired information from the text (*.out) and the tables (*.table) and merges it. We distinguished parts of speech (nouns, verbs, pronouns, etc.) to improve the quality of the extracted information.
- created an ontology (an RDF/XML file holding the extracted information)
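The first two steps above can be sketched roughly as follows. The project itself used C# with HTML Agility Pack; this is only a Python standard-library stand-in, and the names PageSplitter, clean and split_page, as well as the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser
import re


def clean(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


class PageSplitter(HTMLParser):
    """Hypothetical helper: collects free text and table rows separately."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # free text, analogous to the *.out files
        self.rows = []        # table rows, analogous to the *.table files
        self._row = None      # cells of the row currently being read
        self._cell = None     # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(clean("".join(self._cell)))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)       # text inside a table cell
        elif self._row is None:
            self.text_parts.append(data)  # text outside any table row


def split_page(html):
    """Return (cleaned free text, list of table rows) for one HTML page."""
    parser = PageSplitter()
    parser.feed(html)
    parser.close()
    return clean(" ".join(parser.text_parts)), parser.rows


# Illustrative input, not taken from the real downloaded pages.
sample = """
<p>Thomas   Edison was an
 American inventor.</p>
<table>
  <tr><th>Born</th><td> Milan,  Ohio </td></tr>
</table>
"""
text, rows = split_page(sample)
```

Keeping the free text and the table rows apart mirrors the *.out / *.table split, so each can be fed to its own extraction step before the results are merged.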
The file can be loaded with the Twinkle tool:
http://www.ldodds.com/projects/twinkle/
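To give an idea of what such an RDF/XML file might look like, here is a minimal Python sketch that writes a few inventor properties. The property names (firstname, surname, wasbornin, profession) and the inv namespace are taken from the queries in this post; the record layout, the "id" field and the function name records_to_rdf are assumptions, and the real file may be structured differently.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
INV_NS = "http://www.cs.put.poznan.pl/inventors/#"  # namespace used in the queries

# Use readable prefixes instead of the auto-generated ns0, ns1, ...
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("inv", INV_NS)

PROPERTIES = ("firstname", "surname", "wasbornin", "profession")


def records_to_rdf(records):
    """Serialize a list of dicts as RDF/XML (hypothetical layout)."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for rec in records:
        # One rdf:Description per inventor; the "id" key is an assumption.
        node = ET.SubElement(
            root,
            f"{{{RDF_NS}}}Description",
            {f"{{{RDF_NS}}}about": INV_NS + rec["id"]},
        )
        for prop in PROPERTIES:
            if prop in rec:
                ET.SubElement(node, f"{{{INV_NS}}}{prop}").text = rec[prop]
    return ET.tostring(root, encoding="unicode")


# Illustrative record only.
xml_text = records_to_rdf([
    {"id": "edison", "firstname": "Thomas", "surname": "Edison",
     "wasbornin": "Milan, Ohio", "profession": "Inventor"},
])
```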
We used SPARQL to retrieve information from the RDF file. SPARQL is the query language for RDF data defined by the W3C. More info here:
http://www.w3.org/RDF/
Sample queries in SPARQL:
Just get the surnames of inventors:
------------------------------------
prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT ?surname
WHERE
{ ?x inv:surname ?surname }
Get some more info:
----------------------------
prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT ?place ?firstname1 ?surname1 ?firstname2 ?surname2
WHERE
{
?x inv:firstname ?firstname1 .
?x inv:surname ?surname1 .
?x inv:wasbornin ?place .
?y inv:surname ?surname2 .
?y inv:firstname ?firstname2 .
?y inv:wasbornin ?place
}
ORDER BY ASC (?surname1)
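The query above joins two inventors, ?x and ?y, on a shared ?place. As written it also pairs every inventor with themselves and returns each pair in both orders, which is easy to overlook. The following Python sketch imitates the join on an in-memory list to make that visible; the records and the function name same_birthplace_pairs are invented for illustration.

```python
from itertools import product

# Invented sample records; keys mirror the properties used in the query.
inventors = [
    {"firstname": "Anna", "surname": "Nowak", "wasbornin": "Poznan"},
    {"firstname": "Jan", "surname": "Kowalski", "wasbornin": "Poznan"},
    {"firstname": "Ewa", "surname": "Zielinska", "wasbornin": "Warsaw"},
]


def same_birthplace_pairs(people, skip_self=False):
    """Return (place, firstname1, surname1, firstname2, surname2) rows."""
    rows = []
    for x, y in product(people, repeat=2):  # every ordered pair, like ?x and ?y
        if x["wasbornin"] != y["wasbornin"]:
            continue                        # the shared ?place acts as the join key
        if skip_self and x is y:
            continue                        # drop rows pairing a person with themselves
        rows.append((x["wasbornin"], x["firstname"], x["surname"],
                     y["firstname"], y["surname"]))
    return sorted(rows, key=lambda r: r[2])  # ORDER BY ASC (?surname1)


all_rows = same_birthplace_pairs(inventors)
distinct_rows = same_birthplace_pairs(inventors, skip_self=True)
```

In SPARQL the self-pairs can be dropped in a similar way by adding FILTER (?x != ?y) inside the WHERE block.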
Count professions:
-------------------------
prefix sparql: <http://www.w3.org/2005/xpath-functions#>
prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT (COUNT(sparql:lower-case(?profession )) AS ?total) ?profession
WHERE
{
?x inv:profession ?profession
}
GROUP BY (?profession)
ORDER BY ASC (?total)
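Note that the query groups by the raw ?profession value, so spellings that differ only in letter case end up in separate groups. As a rough cross-check outside SPARQL, the same count can be done case-insensitively in a few lines of Python; the function name count_professions and the sample values are made up for illustration.

```python
from collections import Counter


def count_professions(professions):
    """Count professions case-insensitively, sorted by ascending total."""
    counts = Counter(p.strip().lower() for p in professions)
    # Sort by count first, then by name so ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Invented sample values.
totals = count_professions(["Engineer", "engineer", "Chemist"])
```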
The project site at PUT can be found here:
http://semantic.cs.put.poznan.pl/dokuwiki/doku.php
Technologies, IDEs and programs used:
RDF, C#, .NET, Twinkle (SPARQL), HTML Agility Pack, NLPLib, PowerShell
PowerPoint presentations concerning the project (in Polish):
http://sites.google.com/site/bkosarzyckiaboutme/prezentacjaTSiSS2.pdf
http://sites.google.com/site/bkosarzyckiaboutme/prezentacjaTSiSS1.pdf
Results as an RDF file:
http://sites.google.com/site/bkosarzyckiaboutme/inventors_20110124_wiki.xml
Source code can be found here:
http://sites.google.com/site/bkosarzyckiaboutme/ProjectTsiss_SemanticDataAcquisitionFromWikipedia.7z