In my last post, I talked about taking some first tentative steps into the semantic web. Two of the commenters suggested that I check out the book "Programming the Semantic Web", published by O'Reilly. The full reference is below:
"Programming the Semantic Web by Toby Segaran, Colin Evans, and Jamie Taylor. Copyright 2009 Toby Segaran, Colin Evans, and Jamie Taylor, 978-0-596-15381-6."
In this post I review what I have read so far, making notes as I go along.
Chapter 1 is called "Why Semantics?" The book explains the core idea of the semantic web:
"it's about using semantics to represent, combine, and share knowledge between communities of machines, and how to write systems that act on that knowledge." (page 2)
In particular:
"With a little work you can make the semantic relationships in your data explicit, and program in a way that allows the behavior of your systems to change based on the meaning of the data. With the semantics made explicit, other programs, even those not written by you, can seamlessly use your data. Similarly, when you write programs that understand semantic data, your programs can operate on datasets that you didn't anticipate when you designed your system." (page 3)
Chapter 2 talks about the 'triple', which is "the fundamental building block of semantic representations" (page 19). Here are two triples:
Zeth writes 'Command Line Warriors'
Zeth's address is 'Buckingham Palace' (not yet at least)
The first part of the triple is the subject (e.g. Zeth). The second part is the 'predicate', which is "a property of the entity to which they are attached" (page 19), so my birthday or my address would be a predicate. The last part is the object, which can be another entity "that can be the subject in other triples" (page 19) or a literal value "such as strings or numbers." (page 19).
Different triples are linked together by sharing objects or subjects. The book starts off with an example spreadsheet of restaurant listings which it turns into a relational database, which it then turns into these triples. So the links are analogous to links between tables in a relational database.
Next the book moves on to building graphs of triples by using shared ids:
zeth first_name "Zeth"
zeth address royal_residence
royal_residence street_address "Buckingham Palace"
royal_residence post_code "SW1A 1AA"
commandline_warriors written_by zeth
commandline_warriors name "Command Line Warriors"
So here there are three entities: the person Zeth represented by the ID 'zeth', a house represented by the ID 'royal_residence' and a website represented by the ID 'commandline_warriors'. The post code "SW1A 1AA" is just a literal value at this point, but it could later be turned into an entity as well. The first two triples have a shared ID, meaning both statements are about the entity 'zeth'. Note that 'zeth' is not the name of the entity; it is an arbitrarily chosen ID, and the name is provided by the first_name predicate. The ID could just as well have been a hash value or a sequential number.
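To make the idea concrete, here is a small sketch of my own (it is not code from the book) that writes the statements above as plain Python tuples, using 'royal_residence' as the ID of the house. The variable names are just made up for illustration.

```python
# Each triple is a (subject, predicate, object) tuple.
triples = [
    ("zeth", "first_name", "Zeth"),
    ("zeth", "address", "royal_residence"),
    ("royal_residence", "street_address", "Buckingham Palace"),
    ("royal_residence", "post_code", "SW1A 1AA"),
    ("commandline_warriors", "written_by", "zeth"),
    ("commandline_warriors", "name", "Command Line Warriors"),
]

# All statements whose subject is the entity 'zeth'.
about_zeth = [t for t in triples if t[0] == "zeth"]

# Follow the shared ID from zeth's address to the house entity,
# then read a literal value off that entity.
house_id = next(o for s, p, o in triples if s == "zeth" and p == "address")
post_code = next(o for s, p, o in triples if s == house_id and p == "post_code")

print(about_zeth)
print(post_code)
```

The point is that the object of one triple ('royal_residence') is the subject of others, which is what turns a flat list of statements into a graph.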
The book then works through its first code example, which is available online here: simpletriple.py. I recommend downloading it now and having a look through it.
The code sample has a class called SimpleGraph, which is a simple example of a 'triplestore'. In the __init__ method, 's' stands for subject, 'p' for predicate, and 'o' for object. So each triple is stored three times, in three different orderings of those parts. The book then explains the various methods, which may be evident to you from the code and docstrings.
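If the three orderings sound abstract, the following is a stripped-down sketch of the idea. It is not the book's class: TinyGraph, its method names and the index names _spo, _pos and _osp are my assumptions for this illustration, so check simpletriple.py for the real thing.

```python
class TinyGraph:
    """Hypothetical stand-in for a triplestore (add and one lookup only)."""

    def __init__(self):
        # The same triples, indexed three ways so that a lookup can start
        # from a subject, a predicate or an object.
        self._spo = {}  # subject -> predicate -> set of objects
        self._pos = {}  # predicate -> object -> set of subjects
        self._osp = {}  # object -> subject -> set of predicates

    @staticmethod
    def _index(index, a, b, c):
        # Create the nested dict and set on demand, then record the value.
        index.setdefault(a, {}).setdefault(b, set()).add(c)

    def add(self, s, p, o):
        self._index(self._spo, s, p, o)
        self._index(self._pos, p, o, s)
        self._index(self._osp, o, s, p)

    def subjects(self, p, o):
        """Return every subject that has predicate p with object o."""
        return set(self._pos.get(p, {}).get(o, set()))


g = TinyGraph()
g.add("zeth", "first_name", "Zeth")
g.add("commandline_warriors", "written_by", "zeth")
print(g.subjects("written_by", "zeth"))
```

Storing everything three times costs memory, but it means a question like "who has this predicate with this object?" is a couple of dictionary lookups rather than a scan of every triple.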
Next we are shown in pictures and code that if there are two graphs with consistent identifiers, they can be merged together. Then we download a CSV file of triples and load them using the load method of the SimpleGraph object. Then we perform queries upon this data.
from simpletriple import SimpleGraph
# Make an instance of the class
film_graph = SimpleGraph()
# Load the CSV from the book's website
film_graph.load("movies.csv")
# Now let's find Julie Walters' id
julie_id = film_graph.value(None, "name", "Julie Walters")
print(julie_id)
# Now let's find out all the films Julie has been in:
julie_films = film_graph.triples((None, "starring", julie_id))
for film in julie_films:
    print(film_graph.value(film[0], "name", None))
One of the results is the classic film 'Educating Rita'.
educating_rita = film_graph.value(None, "name", "Educating Rita")
Now let's find another actor in Educating Rita:
actor = next(film_graph.triples((educating_rita, "starring", None)))[2]
print(film_graph.value(actor, 'name', None))
Sadly, there are no film dates in the CSV file; if there were, we could sort an actor's films by year and thus generate a filmography.
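Just to show what that might look like, here is a sketch over a few hand-written triples. To be clear, the 'year' predicate, the IDs and the helper functions are all hypothetical: none of this comes from movies.csv or from the book's code.

```python
# Hand-written triples for illustration only; movies.csv has no 'year' predicate.
triples = [
    ("film_c", "name", "Calendar Girls"),
    ("film_c", "starring", "actor_1"),
    ("film_c", "year", 2003),
    ("film_a", "name", "Educating Rita"),
    ("film_a", "starring", "actor_1"),
    ("film_a", "year", 1983),
    ("film_b", "name", "Billy Elliot"),
    ("film_b", "starring", "actor_1"),
    ("film_b", "year", 2000),
]


def value(triples, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def filmography(triples, actor_id):
    """Return (year, name) pairs for the actor's films, oldest first."""
    films = [s for s, p, o in triples if p == "starring" and o == actor_id]
    rows = [(value(triples, f, "year"), value(triples, f, "name")) for f in films]
    return sorted(rows)


for year, name in filmography(triples, "actor_1"):
    print(year, name)
```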
Let's instead find the director:
director = film_graph.value(educating_rita, 'directed_by', None)
print(film_graph.value(director, 'name', None))
What other films has he directed?
directed_films = film_graph.triples((None, "directed_by", director))
for film in directed_films:
    print(film_graph.value(film[0], "name", None))
If you want to play along, use simpletriple.py to find out what other film this director has made that also stars the actor we found above. The answer is in the comments.
The rest of chapter two gives a few more examples that can be played with.
Chapter three gives a new query syntax that works by defining various constraints and binding the results to set references. This is most easily demonstrated by an example. Start by downloading an upgraded version of the triples module called simplegraph.py.
We load the data in the same way as before:
from simplegraph import SimpleGraph
film_graph = SimpleGraph()
film_graph.load('movies.csv')
You might still have 'actor', 'director' and so on in memory from the earlier session. If you do not, repeat what we did above:
julie_id = film_graph.value(None, "name", "Julie Walters")
educating_rita = film_graph.value(None, "name", "Educating Rita")
actor = next(film_graph.triples((educating_rita, "starring", None)))[2]
director = film_graph.value(educating_rita, 'directed_by', None)
Now we can answer the above quiz question in a far more efficient manner. The question was to find out what other film the 'director' made that also starred the 'actor'.
film_graph.query([('?film', 'starring', actor),
                  ('?film', 'directed_by', director)])
You can see instantly that this query is far shorter than the previous attempt which involved manually iterating our way to the correct result. How the query method is implemented can be seen by reading the Python file linked to above. What happens in this case is that for each possible result matching these constraints, a dictionary is returned binding the key 'film' to the ID of the film that has been found.
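For a feel of how such a query can work without opening simplegraph.py, here is a minimal sketch of my own. It is not the book's implementation: it runs over a plain list of tuples, the function names match and query are mine, and the toy IDs are invented. It assumes, as described above, that the leading '?' is dropped from the dictionary keys.

```python
def is_var(term):
    """A term is a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match(triples, pattern):
    """Yield the triples that fit a pattern; None is a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


def query(triples, clauses):
    """Return a list of dicts, each binding every variable to a value."""
    bindings = [{}]  # one empty binding, compatible with anything
    for clause in clauses:
        extended = []
        for binding in bindings:
            # Fill in variables that are already bound; unbound ones become None.
            pattern = tuple(binding.get(t[1:]) if is_var(t) else t for t in clause)
            for triple in match(triples, pattern):
                new_binding = dict(binding)
                consistent = True
                for term, got in zip(clause, triple):
                    # setdefault binds a new variable or returns the old value.
                    if is_var(term) and new_binding.setdefault(term[1:], got) != got:
                        consistent = False
                if consistent:
                    extended.append(new_binding)
        bindings = extended
    return bindings


# Invented toy data: three films, two actors, two directors.
triples = [
    ("film_1", "starring", "actor_1"),
    ("film_1", "directed_by", "director_1"),
    ("film_2", "starring", "actor_1"),
    ("film_2", "directed_by", "director_2"),
    ("film_3", "starring", "actor_2"),
    ("film_3", "directed_by", "director_1"),
]
result = query(triples, [("?film", "starring", "actor_1"),
                         ("?film", "directed_by", "director_1")])
print(result)  # [{'film': 'film_1'}]
```

Each clause narrows the list of candidate bindings: the first clause finds every film starring the actor, and the second keeps only those that also have the right director.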
So far we are part way through chapter 3. Join us next time when we continue working through the book.
Answer to quiz question:
You may have used more concise syntax, but the following is made longer to be clearer:
director_films = set([film[0] for film in film_graph.triples((None, "directed_by", director))])
actor_films = set([film[0] for film in film_graph.triples((None, "starring", actor))])
common_films = director_films.intersection(actor_films)
for common_film in common_films:
    common_film_name = film_graph.value(common_film, "name", None)
    if common_film_name != "Educating Rita":
        print(common_film_name)