The inside scoop on Wikipedia & DBpedia.org
June 30th, 2008
Since the launch of Wikipedia back in 2001, people from all over the world have been busy collaborating, adding and updating content on this very popular wiki.
The wiki concept dates back to the mid-1990s. Wikipedia hasn’t been around nearly as long, yet it is by far the biggest and most active wiki on the web.
Wikipedia has already accumulated millions of articles’ worth of valuable data. The biggest challenge now is making all of that data relevant and meaningful to users exploring and searching for information.
Most people arrive at Wikipedia via Google. For example, if you do a quick Google search on pretty much anything, you’ll most likely get Wikipedia at the very top of the search results. That’s due to Wikipedia’s huge link popularity and page ranking (a different topic).
Although Wikipedia is public and its data is rich and can be used or referenced anywhere on the web, the simple nature of a wiki means that the data inside Wikipedia articles is a total mess, at least in terms of database schema.
This creates a real challenge when you need to extract the data and use it somewhere else. Well, a few projects out there are trying to solve that problem by creating a data structure on top of Wikipedia. One of these is DBpedia.org.
In their own words:
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.
With projects like DBpedia providing this data structure, many companies are now tackling the “relevancy” and “meaningful” challenge with Wikipedia’s vast amount of information. Two examples are Powerset.com and Freebase.com.
Well, I decided to do my own experiment. I downloaded the title and short abstract datasets from DBpedia.org and loaded them into MySQL. I think I used up about 5 GB of space on my server during this process. Anyway, I also added a search / typeahead / auto-complete feature.
Initially, running wildcard queries against 2+ million records took a LONG time, so I decided to load 200k records into a MEMORY table in MySQL, and that worked really nicely… although it took about 3 hours to load the heap!
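To give a rough idea of how a typeahead lookup against an in-memory table can work, here is a minimal sketch in Python. It is not the code behind my example: it swaps in SQLite's in-memory database as a stand-in for the MySQL MEMORY table, and the table name `titles`, the column `title_lc`, and the functions `build_title_index` and `suggest` are all made up for illustration. The idea is to turn the prefix search into a range query on an indexed lowercase column instead of a general wildcard match.

```python
import sqlite3

# Highest Unicode code point, used as an upper bound for the prefix range.
MAX_CHAR = chr(0x10FFFF)


def build_title_index(titles):
    """Load titles into an in-memory SQLite table (stand-in for a MEMORY table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE titles (title TEXT NOT NULL, title_lc TEXT NOT NULL)"
    )
    # Store a lowercase copy so lookups are case-insensitive.
    conn.executemany(
        "INSERT INTO titles (title, title_lc) VALUES (?, ?)",
        [(t, t.lower()) for t in titles],
    )
    # Index the lowercase column so the range query below can use it.
    conn.execute("CREATE INDEX idx_title_lc ON titles (title_lc)")
    conn.commit()
    return conn


def suggest(conn, prefix, limit=10):
    """Return up to `limit` titles that start with `prefix`, ignoring case."""
    p = prefix.lower()
    if not p:
        return []
    # A prefix match expressed as a range: p <= title_lc < p + MAX_CHAR.
    rows = conn.execute(
        "SELECT title FROM titles "
        "WHERE title_lc >= ? AND title_lc < ? "
        "ORDER BY title_lc LIMIT ?",
        (p, p + MAX_CHAR, limit),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_title_index(
        ["Wikipedia", "Wiki", "DBpedia", "Wikimedia Foundation", "MySQL"]
    )
    print(suggest(conn, "wiki"))
```

Typing "wiki" into a search box backed by `suggest` would return the matching titles in alphabetical order. A prefix-only match like this is narrower than a full wildcard search, which is part of why it can stay fast on a large table.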
Obviously, if you were doing this for commercial purposes you would build something a lot more robust!
Without further ado, here is the example:
http://www.chrisdevbox.com/lab/dbpedia/index.html
And here is a quick screenshot:
If you would like to know the recipe for how I created the above example, send me a comment/note… I might post a tutorial!



