Plans

by Zef Hemel

I don’t talk about my personal life on this website often. Most of that gets written at Zef.Nu, my Dutch website. However, now I will also give those who are unfortunate enough not to speak Dutch, and those who are not willing to learn it just to read about my life, a bit of insight into what has been happening and will be happening with me personally. OK, that was a very complex sentence, but I don’t feel like putting much effort into making it more understandable.

As most of you know, last September I moved to Dublin to study Networks and Distributed Systems at Trinity College. In two months I will be done. At this moment I’m working on my dissertation. I already hinted at this a couple of times, but once I’m done I’ll move back to my city of birth (and where I lived the first 22 years of my life): Groningen in the Netherlands. There I’ll take a very different route: I’ll be studying “Engelse Taal en Cultuur” (translated: English Language and Culture), which is basically English Philology as it is known in many other countries.

That’s generally perceived as quite a change after studying computer science for 5 years and soon ending it with a master’s degree from one of the better universities in Europe. But I feel I have to make this change. I have to figure out what I want to do, and over the past years I found that a regular job in the IT industry might not be it.

A year ago I already considered this step when I got bored with more and more things related to programming and software engineering. I got more interested in language, particularly English. The English language had always been one of my interests, but as I had a Latvian girlfriend at the time and learned a bit of Latvian I got interested in language in general. How grammar works, how language evolves, different language groups. Linguistics basically.

A week or two ago there was a professor from some US university at Trinity to give a presentation. He talked about large-scale networked applications. However, the interesting thing to me was not that. The guy was clearly from some Spanish-speaking country originally. And clearly he had lived and worked in the US for quite some time. His accent was very interesting: sometimes he sounded very American, sometimes he sounded very Spanish. I was intrigued. It’s interesting to hear how people sound who have been speaking a language other than their mother tongue for a couple of years.

You get my point.

It’s not that I’m totally uninterested in my current field. I’m not. To be honest it’s not entirely clear to me what I’m still attracted to and what it is that I hate. I really enjoy reading about new technology. I’m very interested to read about big shifts in the computer industry and in software engineering in particular. Yesterday I was reading about the new features in Python 2.5 and still had that “wow, cool!” feeling. In the past days I’ve been looking at TurboGears and I’m impressed and got enthusiastic. I actually spent a couple of hours yesterday working on making a prototype of the forum 2.0 software I described a day or two ago using TurboGears (SQLObject actually).

But a couple of hours is all it was. After that I had something that fetched RSS/Atom feeds, checked for new posts, extracted all links and stored it in a database. All the data that was needed was essentially there, what remained was figuring out clever queries and algorithms and creating a front-end. And I really don’t like doing that.

Yeah, now that I’m writing about it, I think that’s it. I like the vision of things. The high-level design. Ideas behind things. I don’t care about the details and really, really don’t see myself programming it all. I could do it though, and I probably wouldn’t be bad at it either (I’ve done it for quite some years), but I won’t enjoy it.

So I’m going to study English. I’ll also be doing a journalism minor; I see some potential in there. Maybe I’ll become some kind of journalist who writes about IT-related things. I like to explain things in language that normal people understand. I have quite a few friends now who don’t know much about the computer world (my girlfriend doesn’t know the difference between a bug and a virus), so I quite often try to explain what I do in terms lay(wo)men understand. And I’m having a good time doing it.

But that’s all in two months, although I’m already looking forward to it. For now I’m working on my dissertation and it’s going well. And I enjoy doing it. I’ve been working on the design of the middleware/framework and I enjoyed that. It’s all design and vision work, which is what I like. I also implemented most of it and that wasn’t too bad either (it’s all in Python). Next Wednesday my girlfriend is flying in (from Poland; she’s Polish) and staying for a couple of days. I’m really excited about that. Sunday she flies to England (close to London) where she’ll work for the next few months. I don’t like this long-distance relationship thing, but what can you do? At least she’s closer now and can come and visit me more easily (thanks, Ryanair, for excellent London Stansted-Dublin connections).

Forum 2.0

by Zef Hemel

With blogging going mainstream, do we still really need forums in their current form? I think it might be time for a shift.

But before I begin talking about that, let’s define what forums and blogs are to make the differences clear.

A forum is a place, a website, where people with similar interests gather. People start topics to talk about, others can reply to those. Topics are usually categorized into boards on certain subjects.

A blog is a website written by usually one person, sometimes a group. Sometimes an idea is brought up, sometimes a question is asked, sometimes a question is answered, sometimes visitors are just pointed to interesting ideas or discussions. Blog posts can be categorized. Some bloggers blog about one subject, some blog about a wide variety of subjects.

Both forums and blogs allow for discussion. Forums centralize the discussion on one site, blogs have distributed discussions which are made traceable through links and trackbacks.

When I participate in a discussion on a forum and I think I have something notable to say (which is longer than a couple of lines), I have to decide whether I post it on the forum itself or on my blog, where a broader audience will read it. Sometimes I double post it or link to my blog post on the forum. I feel that if I put a lot of effort into a forum post it will quickly disappear among the loads of other posts and the effort seems lost. That’s why I like posting it on my blog: it’s easier to find things again and to keep a record of the things I write.

Sometimes I find out about somebody who often has interesting things to say, I would like to know what this person writes, not just on one forum but everywhere.

What if we would post everything we had to say on our own blogs? Would it be possible to recreate the forum experience (easy to follow discussions, only one place to go for topical discussions) with the content coming from blogs? Could we get the advantages of both blogs and forums at the same time?

Well, partly this is already happening. If you look at ‘Planet’ sites, sites that use the Planet software, such as Planet Java and PlanetRDF, these take a first step in that direction. They aggregate feeds of blogs that talk about one subject. The blog itself doesn’t have to be purely about one subject; only the posts in the blog category related to the planet site are aggregated. Many pieces of blog software, like WordPress, support this. If you were only interested in posts about my personal life, you could subscribe to this feed for example, which only lists posts in my “Personal” category.

So what you have on a planet site is basically a big group blog of people talking about the same topic. Great, but still not nearly as convenient as a forum. What we want is grouped discussions, a view on how the discussion started and evolved from there. On these planet sites discussions between the bloggers take place, but they’re not easy to find as everything is simply presented as a long stream of posts with no easy way to see links between them.

A site like TechMeme takes a more clever approach. What TechMeme does is aggregate a number of blogs and see what they link to. If they all seem to link to one particular page, that’s apparently something important that’s being talked about. This popular page (or one post linking to it) is promoted to being a main article and the other blog posts linking to it are grouped with it as discussion of the main article. The more people link to the article, the higher it ends up at TechMeme.com. The result of this is actually very interesting. It is very easy to see what’s hot on technology blogs right now. And it feels a lot more like a forum already.
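To make the link-counting idea a bit more concrete, here is a minimal sketch in Python. It is only my reading of the approach described above, not how TechMeme actually works; the `posts` data and the `rank_by_links` function are made up for illustration.

```python
from collections import defaultdict

# Hypothetical aggregated posts: each post URL mapped to the URLs it links to.
posts = {
    "http://blog-a.example/1": ["http://news.example/story"],
    "http://blog-b.example/7": ["http://news.example/story", "http://other.example/x"],
    "http://blog-c.example/3": ["http://news.example/story"],
    "http://blog-d.example/9": ["http://other.example/x"],
}


def rank_by_links(posts):
    """Group posts under the page they link to, most-linked page first."""
    discussion = defaultdict(list)
    for post_url, targets in posts.items():
        for target in set(targets):  # count each post once per target
            discussion[target].append(post_url)
    return sorted(discussion.items(), key=lambda item: len(item[1]), reverse=True)


for main_article, linking_posts in rank_by_links(posts):
    print(main_article, len(linking_posts))
    for post_url in sorted(linking_posts):
        print("   ", post_url)
```

The page with the most inbound links comes out on top, with the posts that link to it listed underneath as its discussion.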

I wonder, wouldn’t it be possible to generalize and improve this idea a bit?

Say you’re very interested in poodles. You have friends who share your obsession and you want to set up some kind of place to discuss them. Instead of starting a regular forum, each of you starts a blog. This is easy and free; everybody can start one at Blogger, for example. Then you create a website for your poodle group and install this new kind of forum software, which I’ll call Forum 2.0 software for now. In there you can add all your friends’ blogs and it will automatically poll them from time to time to see if there are new posts. The new posts are republished on this central forum and if links between the posts are found, a thread-like structure is created from them (a bit like TechMeme). The more people in this small blogging club link to a post, the higher it is ranked. It now becomes very easy to track discussions about poodles. As new people find out about this forum they want to join in. They can easily add their blogs too.
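The thread-building step could look roughly like the sketch below. It assumes the polling has already happened and simply treats a post that links to another member’s post as a reply to it. The `polled_posts` data, the function names and the "attach to the first linked post" rule are all my own assumptions, not an existing piece of software.

```python
# Hypothetical posts already polled from the members' blogs, oldest first.
# Each entry: (post URL, list of URLs the post links to).
polled_posts = [
    ("http://anna.example/poodle-grooming", []),
    ("http://bert.example/re-grooming", ["http://anna.example/poodle-grooming"]),
    ("http://cara.example/grooming-tips", ["http://bert.example/re-grooming"]),
    ("http://bert.example/poodle-food", ["http://shop.example/kibble"]),
]


def build_threads(posts):
    """Return (roots, replies); replies maps a post URL to the posts linking to it."""
    known = {url for url, _ in posts}
    replies = {url: [] for url in known}
    roots = []
    for url, links in posts:
        # Only links to other known posts create a thread relation.
        parents = [link for link in links if link in known and link != url]
        if parents:
            replies[parents[0]].append(url)  # attach to the first linked post
        else:
            roots.append(url)
    return roots, replies


def print_thread(url, replies, depth=0):
    """Print a post and, indented below it, every post that replies to it."""
    print("  " * depth + url)
    for child in replies[url]:
        print_thread(child, replies, depth + 1)


roots, replies = build_threads(polled_posts)
for root in roots:
    print_thread(root, replies)
```

A post that only links outside the club (like the one about poodle food) starts its own topic. Posts that link to each other in a circle would need extra handling that this sketch leaves out.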

In the future it would even be possible to query sites like Technorati to find blogs outside the list that are linking to posts of listed bloggers. Additional features can be imagined here, such as Digg-like features that are also present in forums, like rating topics. Maybe even allowing users to post on their blogs from inside the Forum 2.0 application (this is possible with the different weblog APIs available). This way people don’t even have to leave the application to respond, and the forum experience can be exactly the same as in a “Forum 1.0” application. Still, the actual posts are stored on each of the people’s blogs.

In this way you no longer have to cross-post on forums either. You just add your blog to each of the forums you’re interested in and your contributions will appear there automatically.

Exit WinFS

by Zef Hemel

Remember WinFS? Windows Future Storage or Windows File System or whatever it meant? Yes one of the three pillars of Longhorn (now Windows Vista). Look, here it is:

[Image: Longhorn pillars] (Source)

WinFS was supposed to be the big change in how you managed your data. It would be super easy to search any kind of data. It would be possible to link files to contacts, contacts to images and so on and so forth. A slimmed down SQL server would be powering this on every desktop. It was going to be great.

Then, almost two years ago, Microsoft announced WinFS was not going to make it into Longhorn because it was more work than expected. It would be in beta around the release of Longhorn (now Vista). A shame, as WinFS was the most interesting feature of Longhorn for me. Vista will still have better search capabilities, but it’s not WinFS.

Two days ago the WinFS team announced the following:

There are many great technical innovations the WinFS project has created – innovations that go beyond just the WinFS vision but are part of a broader Data Platform Vision the company is pursuing. The most visible example of this today is the work we are now doing in the next version of ADO.NET for Orcas. The Entities features we are now building in ADO.NET started as things we were building for the WinFS API. We got far enough along and were pushed on the general applicability of the work that we made the choice to not have it be just about WinFS but make it more general purpose (as an aside – this stuff is really coming together – super cool).

Other technical work in the WinFS project is at a similar point – specifically the integration of unstructured data into the relational database, and automation innovations that make the database “just work” with no DBAs – “richer store” work. It’s these storage innovations that have matured to the point where we are ready to start working on including them in our broader database product. We are choosing now to take the unstructured data support and auto-admin work and deliver it in the next release of MS SQL Server, codenamed Katmai. This really is a big deal – productizing these innovations into the mainline data products makes a big contribution toward the Data Platform Vision we have been talking about. Doing this also gives us the right data platform for further innovations.

These changes do mean that we are not pursuing a separate delivery of WinFS, including the previously planned Beta 2 release. With most of our effort now working towards productizing mature aspects of the WinFS project into SQL and ADO.NET, we do not need to deliver a separate WinFS offering.

They make it seem like WinFS was about large-scale data access for enterprises. For me it wasn’t that at all. It was about managing personal data. Pictures, contacts, e-mail, music, video. Adding metadata to them, linking them. But all of that seems to be forgotten now. Anyway, it’s not going to happen. That WinFS is over. Dead. This is just spinning it in a way to make it look like a super-exciting thing. A bit pathetic.

23

by Zef Hemel

When somebody gave me the idea for this I was like, yeah, I’m not going to do a geeky thing like that. But I figured, by this time next year I’ll be studying English and probably quoting something Shakespeare said, so what the heck.


import pyrdf
from pyrdf import RdfStore, RdfResource, RdfType
from rdflib.Namespace import Namespace

NS_P = Namespace('http://www.zefhemel.com/ont/person#')
NS_J = Namespace('http://www.zefhemel.com/ont/job#')

store = RdfStore(NS_P)
store.prefix_mapping('p', NS_P)
store.prefix_mapping('j', NS_J)
pyrdf.setDefaultStore(store)
store.load('peopledata.rdf')

Person = RdfType(NS_P['Person'])

today = '2006-06-22'

# List every Person in the store whose birthday property equals today's date
print('People celebrating their birthdays today:')
for p in store.findResources(rdf_type = Person, birthday = today):
    print('* %s (%d)' % (p.name, int(p.age)))


Output:


* Zef Hemel (23)

There are two important things to note about this. First of all it shows how easy it is to query RDF data using PyRDF. Second of all:

It’s my 23rd birthday today!

Hooray for me!

It’s interesting to see how different people respond differently to a post like The Store. A post that doesn’t seem to make much sense. Long Zheng sent me something quite interesting in response: a link to this video about marketing by Seth Godin (whose last name, incidentally, is the Dutch word for goddess).

[Video: Seth Godin]

It’s a presentation of about an hour that Seth Godin gave at Google and I was very much impressed by what he said. He’s a clever guy and knows what he’s talking about. It put a new light on marketing for me. If you have an hour I suggest you watch it. He has a blog too. And you can download one of his books for free: Unleashing the IdeaVirus.

It’s all about telling stories.

Watch the video.

PyRDF

by Zef Hemel

Yesterday’s post and comments got me thinking. It still is fairly hard to manipulate and generate RDF data and I don’t think it really has to be. ActiveRDF (a Ruby RDF API) takes an interesting approach and I thought I’d build something similarish in Python, so I started that and after a couple of hours I already have something quite neat. I’ve called it PyRDF for now and here’s a sample piece of code for you to get a feel for how it works.


import pyrdf
from pyrdf import RdfStore, RdfResource, RdfType
from rdflib.Namespace import Namespace

NS_P = Namespace('http://www.zefhemel.com/ont/person#')
NS_J = Namespace('http://www.zefhemel.com/ont/job#')

# One store holds all the triples; the prefixes are used in property names
store = RdfStore(defaultNS = NS_P)
store.prefix_mapping('p', NS_P)
store.prefix_mapping('j', NS_J)
pyrdf.setDefaultStore(store)

Person = RdfType(NS_P['Person'])
Website = RdfType(NS_P['Website'])
Job = RdfType(NS_J['Job'])

zef = RdfResource(NS_P['zef'], rdf_type = Person)
zef.name = 'Zef Hemel'
zef.age = 22
zef.country = 'Ireland'
zef.city = 'Dublin'

# With defaultNS set, plain property names end up in the job namespace
job1 = RdfResource(NS_J['job1'], defaultNS=NS_J, rdf_type = Job)
job1.name = 'Student System Administrator'
job1.description = 'Fiddling around with Linux servers'
job1.startYear = 2003
job1.endYear = 2005

# And without the defaultNS set, using the j_ prefix instead:
job2 = RdfResource(NS_J['job2'], rdf_type = Job)
job2.j_name = 'Writing website'
job2.j_description = 'Writing own weblogs, not that well paid.'
job2.j_startYear = 2003

zef.hadJob = [job1, job2]

# Start with an empty list and append resources to it
zef.website = []

zefhemelcom = RdfResource(NS_P['zefhemelcom'], rdf_type = Website)
zefhemelcom.title = 'ZefHemel.com'
zefhemelcom.url = 'http://www.zefhemel.com'
zef.website.append(zefhemelcom)
zefnu = RdfResource(NS_P['zefnu'], rdf_type = Website)
zefnu.title = 'Zef.Nu'
zefnu.url = 'http://zef.nu'
zef.website.append(zefnu)

print(store.serialize(format="pretty-xml"))


Here is the output of that. Saves quite some typing, eh? OK, you probably need an understanding of XML and XML namespaces to fully understand this, but even if you don’t, it should be pretty obvious. PyRDF right now has three classes:

  1. RdfStore, which stores RDF triples as described before. You don’t have to do much with this except registering some prefixes. Later on you can also use this class to serialize your data into RDF/XML and to save it and load it from files, but that doesn’t work yet.
  2. RdfResource, which represents a resource; you can simply see this as an object. When instantiating an RdfResource you have to give it at least a URI. Additionally you can pass it:
    • store, a place to store the resource’s data, by default it’s all stored in the defaultStore and usually that’s fine.
    • defaultNS, the default namespace that’s used for the property names. More on this later.
    • A number of initial properties and values. This is the same as writing resourcename.property = value, but is just added for convenience
  3. RdfType, this is a direct subclass of RdfResource, it doesn’t do much, hardly anything at the moment. Later it could potentially be used to enforce correct typing and property use and stuff.

RdfResources have properties, just like objects. Properties can have other resources, literals (strings, integers etc.) or lists (of resources or literals) as values. PyRDF tries to automatically guess what kind of type a property is. If you start using it as a list, it will function as a list; if you put literals or RdfResources in it, it will (hopefully) act as expected.
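To give a feel for how setting a property can turn into triples, here is a toy sketch. It is not PyRDF’s actual implementation: the `Resource` class is a made-up stand-in, it stores plain tuples in a Python list, it keeps property names as they are instead of expanding them to URIs, and it does not handle appending to a list afterwards.

```python
class Resource:
    """Toy stand-in for an RDF resource: attribute writes become triples."""

    def __init__(self, uri, store):
        # Bypass our own __setattr__ for the two internal fields.
        object.__setattr__(self, "uri", uri)
        object.__setattr__(self, "store", store)

    def __setattr__(self, name, value):
        # Drop old values of this property, then add one triple per value.
        self.store[:] = [t for t in self.store if t[:2] != (self.uri, name)]
        values = value if isinstance(value, list) else [value]
        for v in values:
            obj = v.uri if isinstance(v, Resource) else v
            self.store.append((self.uri, name, obj))

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for RDF properties.
        found = [o for s, p, o in self.store if (s, p) == (self.uri, name)]
        if not found:
            raise AttributeError(name)
        return found[0] if len(found) == 1 else found


triples = []
zef = Resource("http://www.zefhemel.com/ont/person#zef", triples)
job = Resource("http://www.zefhemel.com/ont/job#job1", triples)
zef.name = "Zef Hemel"
zef.hadJob = [job]

for triple in triples:
    print(triple)
```

A literal value becomes the object of the triple directly, while another resource is stored by its URI.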

By default the property name is combined with the default namespace of the resource (or store), so for example if your default namespace is http://www.zefhemel.com/ont/person# and your property name is age, then the URI of the property will be http://www.zefhemel.com/ont/person#age. If you use a prefix followed by an underscore in the property name, like j_description, the default namespace will be overridden by the namespace associated with the j prefix. So in this case the URI will be http://www.zefhemel.com/ont/job#description.
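The naming rule can be sketched in a few lines. The `property_uri` function below is a hypothetical helper written for this illustration, not PyRDF’s own code, and it works on plain strings rather than rdflib namespaces.

```python
NS_P = "http://www.zefhemel.com/ont/person#"
NS_J = "http://www.zefhemel.com/ont/job#"
PREFIXES = {"p": NS_P, "j": NS_J}


def property_uri(name, default_ns, prefixes):
    """Expand a property name such as 'age' or 'j_description' to a full URI."""
    prefix, sep, local = name.partition("_")
    if sep and prefix in prefixes:
        # A registered prefix before the underscore overrides the default.
        return prefixes[prefix] + local
    return default_ns + name


print(property_uri("age", NS_P, PREFIXES))
print(property_uri("j_description", NS_P, PREFIXES))
```

In this sketch a name with an underscore whose first part is not a registered prefix simply falls back to the default namespace.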

That’s it, that’s all that there’s to it and I think it’s pretty neat. I will now work on the querying capabilities, but I think it’s already quite nice like this.

If you want to play around with it you can do a Subversion check-out from http://svn.zefhemel.com/pyrdf or you can just visit that address and download it with your browser. You need rdflib to run it. Note that rdflib is a separate package and not part of Python’s standard library, so you will probably have to install it yourself.

Me, Myself and RDF

by Zef Hemel

For the past three months or so I have been working on my dissertation full time. I think I mentioned before that it was about context-aware semantic service matching on ad-hoc networks, and more specifically, developing middleware (APIs) to allow developers to do this more easily. I’m not going to talk about this project much (I don’t even know what I am allowed to say as we might be publishing on this). But I feel that after working with one particular technology for a few months now I should be able to define the significance of it. That technology is RDF and understanding what it does and why it matters has been the biggest challenge I encountered so far during this research. And still I cannot say that I fully appreciate its power. But I’ll try to give you at least a feel for why it matters.

The web is great. The web works. The web gives us loads and loads of information we’re interested in through tables, images and plain text. Pages are interlinked which allows us to easily jump from one page to the other. Fantastic.

But now let’s say you were recently hired by a software company that runs a big recruitment site. They list jobs and try to find good people for the jobs. Your assignment is to write a piece of software that spiders the web to find people who fit a particular job and create profiles of them. The things we are interested in are personal information like name, address and country, but also work experience and other stuff you usually put on your resume.

How would you do this?

As a normal web surfer this already is a challenge. I mean, how do you find a random person who fits a profile? Your best bet is to do a Google search on the job area and hope you’ll find some individuals. After that it’s not so hard anymore. Personal websites and blogs usually list some personal information and a resume (in HTML or PDF format) and so on.

But how would you let a computer do this?

To be honest, beats me. The computer can only retrieve web pages and look at HTML code, which doesn’t say that much. You can make some good guesses, but the information you can get from a free-form HTML page is always limited.

Why is it so hard?

The answer is the lack of semantics.

se·man·tics (sĭ-măn′tĭks)

n. (used with a sing. or pl. verb)

  1. Linguistics The study or science of meaning in language.
  2. Linguistics The study of relationships between signs and symbols and what they represent. Also called semasiology.
  3. The meaning or the interpretation of a word, sentence, or other language form: We’re basically agreed; let’s not quibble over semantics.

(X)HTML’s semantic power is very limited. There are some tags like h1, h2, …, address, strong, em, that add a little bit of semantic information, but it’s not nearly enough. Not even close.

Let’s have a look at my very own about page. There is quite some information on that page that may be of interest to the application you were asked to develop. My full name is there, gender, date of birth, occupation and some contact details. There’s also a link to a (bit outdated) CV. But can a computer understand this? Maybe a bit, as I structured this information pretty clearly. It’s very possible to construct a parser that extracts the interesting information from this particular page. But we don’t care about me in particular; it has to be a generic solution. We’re not going to construct parsers for each way of writing a personal website. It would be more efficient to manually input all the data.

No, looking at the HTML code is pretty much hopeless. As mentioned we need semantic information. Statements about a person. For example, information like this would be much more helpful:

http://www.zefhemel.com hasName "Zef Hemel"
http://www.zefhemel.com hasGender http://someuri.org/genders#Male
http://www.zefhemel.com hasDateOfBirth "1983-06-22"
http://www.zefhemel.com hasOccupation http://www.cs.tcd.ie/courses/mscnds
http://www.zefhemel.com hasEmail "zef@zefhemel.com"

You get the idea. If instead of HTML code we would get a string of statements like this, that would be much more helpful. Essentially this kind of information is really simple, it is just a bunch of triples in the form:

subject predicate object

Or less formal:

subject property value

Which is very much like writing object-oriented code:

subject.property = value

If only we had information like this, that would be great. And guess what? This is pretty much what RDF is. There are some small technicalities, which I’ll quickly explain, but essentially this is it. In RDF, subjects and predicates are all URIs (Uniform Resource Identifiers). URIs are different from URLs in the sense that they don’t necessarily identify Locations but are simply Identifiers, i.e. the “address” you supply with the URI does not really have to exist as long as it’s identifying (unique). The object of each triple can either be a URI (like in the hasOccupation triple) or a literal value (like a number, string, date and so on). So in RDF a triple really looks like this:

http://www.zefhemel.com http://someuri.org/concepts#hasName "Zef Hemel"

or even

http://www.zefhemel.com http://someuri.org/concepts#hasGender http://someuri.org/genders#Male
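In code, such triples can be held in something as simple as a list of tuples. The sketch below is just an illustration under my own assumptions: the `Triple` type is made up, the hasDateOfBirth predicate is assumed to live in the same made-up concepts vocabulary, and URIs and literals are both kept as plain strings (a real RDF library keeps them apart).

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

ZEF = "http://www.zefhemel.com"
CONCEPTS = "http://someuri.org/concepts#"  # made-up vocabulary from the text

statements = [
    Triple(ZEF, CONCEPTS + "hasName", "Zef Hemel"),
    Triple(ZEF, CONCEPTS + "hasGender", "http://someuri.org/genders#Male"),
    Triple(ZEF, CONCEPTS + "hasDateOfBirth", "1983-06-22"),
]

for t in statements:
    print(t.subject, t.predicate, repr(t.object))
```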

RDF has different so-called serializations, ways of writing it down. The most common one is RDF/XML. Wordpress keeps messing up any HTML I insert here so I’ll link to a brief example instead.
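For a rough idea of what RDF/XML looks like, the two triples above can be written out with Python’s standard XML library. This is a hand-rolled sketch, not the output of an RDF library; the `c` prefix is an arbitrary choice of mine and the concepts vocabulary is the made-up one from the text.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONCEPTS_NS = "http://someuri.org/concepts#"  # made-up vocabulary from the text

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("c", CONCEPTS_NS)

root = ET.Element("{%s}RDF" % RDF_NS)

# The subject of the triples goes into rdf:about.
person = ET.SubElement(
    root,
    "{%s}Description" % RDF_NS,
    {"{%s}about" % RDF_NS: "http://www.zefhemel.com"},
)

# A literal value is written as element text.
name = ET.SubElement(person, "{%s}hasName" % CONCEPTS_NS)
name.text = "Zef Hemel"

# A URI value is written as an rdf:resource attribute.
ET.SubElement(
    person,
    "{%s}hasGender" % CONCEPTS_NS,
    {"{%s}resource" % RDF_NS: "http://someuri.org/genders#Male"},
)

rdf_xml = ET.tostring(root, encoding="unicode")
print(rdf_xml)
```

Each child element of rdf:Description is one triple: the element name is the predicate and its text or rdf:resource attribute is the object.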

The question is who defines the predicates/properties you can use (like hasName, hasGender etc.). The answer is you, you have complete freedom in this. In one way that’s very nice, in another way it causes some trouble.

If an application retrieves the above RDF file from the web somewhere, it is possible to query it. One can ask “give me all objects where the subject is http://www.zefhemel.com and the predicate is http://someuri.org/concepts#hasName” and it would return “Zef Hemel”, so that’s handy. However, who says that somebody else on another website used the same set of predicates? Maybe they didn’t use http://someuri.org/concepts#hasName but http://myuri.org#name. How can a computer know they mean the same thing? Well, that is a problem, but it can be solved with inference rules. Somewhere on the web the fact should be stated that http://someuri.org/concepts#hasName and http://myuri.org#name are the same thing and therefore give you the same information.
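A query like that, including the "these two predicates mean the same" trick, could be sketched as follows. Everything here is an assumption for illustration: the second person, the `SAME_AS` table and the `query` function are made up, and a real system would use a proper RDF store and query language.

```python
triples = [
    ("http://www.zefhemel.com", "http://someuri.org/concepts#hasName", "Zef Hemel"),
    ("http://www.example.org/jane", "http://myuri.org#name", "Jane Doe"),
]

# Hypothetical statement, published somewhere, that two predicates are equivalent.
SAME_AS = {"http://myuri.org#name": "http://someuri.org/concepts#hasName"}


def canonical(predicate):
    """Map a predicate to the one it is declared equivalent to, if any."""
    return SAME_AS.get(predicate, predicate)


def query(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    results = []
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and canonical(p) != canonical(predicate):
            continue
        if obj is not None and o != obj:
            continue
        results.append((s, p, o))
    return results


names = [o for _, _, o in query(triples, predicate="http://someuri.org/concepts#hasName")]
print(names)
```

Asking for the hasName predicate now also finds the name that was published with the other predicate.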

Inference rules can be used for many other things, they can be used to infer new statements that weren’t obvious before. For example let’s say that in the semantic version of my resume it says that:

http://www.zefhemel.com hadJob #RuGJob1
http://www.zefhemel.com hadJob #OtherJob
#RuGJob1 hasName "System Administration"
#RuGJob1 startYear 2002
#RuGJob1 endYear 2005
#OtherJob hasName "Writing stuff"
#OtherJob startYear 2003

The fact that no endYear is specified means the job hasn’t ended yet; this person is still doing this job. So one could construct a rule like this:

?p hadJob ?job, ?job endYear ?ey, !bound(?ey) -> ?p hasJob ?job

(you can read the commas as logical ANDs here).

This rule says that if somebody has a job where the endYear is not specified, this person still has this job. We extracted new information by using a rule.
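Applied by hand in Python, the rule could look like the sketch below. It uses the job statements from above with shortened predicate names and is only an illustration of the idea, not a general rule engine; the `infer_current_jobs` function is made up.

```python
triples = {
    ("http://www.zefhemel.com", "hadJob", "#RuGJob1"),
    ("http://www.zefhemel.com", "hadJob", "#OtherJob"),
    ("#RuGJob1", "hasName", "System Administration"),
    ("#RuGJob1", "startYear", 2002),
    ("#RuGJob1", "endYear", 2005),
    ("#OtherJob", "hasName", "Writing stuff"),
    ("#OtherJob", "startYear", 2003),
}


def infer_current_jobs(triples):
    """Derive (person, 'hasJob', job) for every job without an endYear."""
    ended = {s for s, p, _ in triples if p == "endYear"}
    return {
        (person, "hasJob", job)
        for person, p, job in triples
        if p == "hadJob" and job not in ended
    }


inferred = infer_current_jobs(triples)
print(inferred)
```

Only #OtherJob has no endYear, so that is the only job for which a new hasJob statement is derived.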

One can imagine that when a degree is mentioned (like the http://www.cs.tcd.ie/courses/mscnds one), this URI is retrieved and it is checked if it contains any RDF information. It could contain information about the skills somebody has that completed this degree for example. By fetching related RDF resources, a lot of information could be extracted, useful information.

This is called the semantic web.

This looks like a utopian idea, doesn’t it? Well, it is. There are some problems with the semantic web, the biggest one being the small amount of semantic data that is available today. For this to work a lot more data should be published in RDF, and so far it’s not catching on that much. That’s a big issue. Yesterday I found a website called rdfdata.org that links to sources of RDF data, some quite interesting, like a semantic Wikipedia, but we need much, much more.

Tim Berners-Lee, who invented the web and also invented the semantic web, has been fighting for the semantic web to catch on for many years, but with not that much visible result (you really have to look for places where it’s applied).

Maybe you can think about how RDF could be applied in your applications.
