Archive for September 7th, 2008

Wikipedia data and statistics

(I actually wanted to start a new blog for technical content, but dropped the idea and decided to use my Telugu blog until this becomes a habit. I will move these posts to a separate blog once there is enough content.)

I’ve been working with Wikipedia data for a while, and I thought I would share some of the tools and download points here.

Wikimedia has been kind enough to make the entire Wikipedia data set available for research purposes. Because the content of Wikipedia keeps changing, snapshots are saved and made available here:

http://download.wikipedia.org/ or

http://download.wikipedia.org/enwiki/ (for English Wikipedia dumps)

Of all the available downloads, I would suggest the XML dumps if you want to work with the wiki data. The static HTML dumps are huge and are simply the output of the MediaWiki rendering engine for each page, so they are useful only if you can’t use the PHP rendering engine provided by MediaWiki. The XML dumps are nothing but pages in MediaWiki markup, so they are much smaller. When I downloaded it, the compressed XML dump without images was around 3.7 GB (I didn’t need images for my work). This can grow to about 11 GB once you decompress it and load the data into a database (I used MySQL). I read in some research work that the total size after decompression, with page history included, can reach 700 GB.
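
If you only need to look at the pages rather than load them into a database, the XML dump can be read as a stream. Below is a minimal Python sketch of that idea. It assumes the dump wraps each article in a <page> element containing a <title> and a <text> element; the function names and the dump path are made up for illustration, and this is not the way the import tools mentioned in this post work.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element in a file-like object."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                # If a page carries several revisions, the last one wins.
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page so memory use stays low


def iter_dump(path):
    """Stream pages straight from a .bz2 dump (the path is hypothetical)."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Because the file is decompressed and parsed on the fly, nothing close to the full 11 GB has to sit on disk or in memory at once.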

Once you have downloaded the dump, the next step is to load it into your favorite database. I used a LAMP (Linux, Apache, MySQL, PHP) stack for this. I first downloaded MediaWiki (from http://www.mediawiki.org/wiki/Download) and installed it on my system. The compressed dump contains one single, very large file, so you need to be careful about the file system you use for this (Windows FAT32 doesn’t support files larger than 4 GB). You can use tools like mwdumper or mwimport to do the import. Detailed documentation is available here: http://meta.wikimedia.org/wiki/Importing_a_Wikipedia_database_dump_into_MediaWiki .
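
To find out in advance whether the decompressed file will hit the FAT32 limit, you can count the decompressed bytes without writing them anywhere. This is a rough Python sketch under the assumption that the dump is a single bzip2-compressed file; the function names are my own and the limit constant is the usual FAT32 maximum of 4 GiB minus one byte.

```python
import bz2

# Largest single file FAT32 can store: 4 GiB minus one byte.
FAT32_MAX_FILE_BYTES = 2**32 - 1


def decompressed_size(path, chunk_bytes=1 << 20):
    """Count the bytes a .bz2 file expands to, without writing them to disk."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    return total


def fits_on_fat32(size_bytes):
    """Return True if a file of this size can be stored on FAT32."""
    return size_bytes <= FAT32_MAX_FILE_BYTES
```

Counting this way still has to decompress the whole file once, so expect it to take a while on a multi-gigabyte dump.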

However, mwdumper did not work for me initially with the English Wikipedia because of its huge size, so I had to use a different technique: I first split the bzip2 file into small chunks (using bzip2recover) and then parsed them in sequence with a simple Perl script. The script needs to identify the start and end of each page. A page can also span multiple bzip2 chunks (though this is less likely), so the script has to be clever enough to handle such cases. Another approach is to tweak the mwdumper options (described at the URL above) to get the file loaded. You will most likely run into issues the first time, but everything is well documented on the MediaWiki site above.
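
To illustrate the page-boundary problem, here is a small Python sketch (not the Perl script I used). It assumes each page is delimited by literal <page> and </page> tags, that the chunks are processed in their original order, and that the chunk file names passed in are whatever bzip2recover produced on your system. Any unfinished page is carried over in a buffer until the next chunk completes it.

```python
import bz2


def iter_page_blocks(chunks, start=b"<page>", end=b"</page>"):
    """Yield each complete <page>...</page> block from ordered byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            begin = buffer.find(start)
            if begin == -1:
                # No page starts here; keep only a tail that might be the
                # first part of a start tag split across two chunks.
                keep = len(start) - 1
                buffer = buffer[max(0, len(buffer) - keep):]
                break
            finish = buffer.find(end, begin)
            if finish == -1:
                # The page continues in the next chunk; keep it buffered.
                buffer = buffer[begin:]
                break
            finish += len(end)
            yield buffer[begin:finish]
            buffer = buffer[finish:]


def read_chunks(paths):
    """Decompress each chunk file in the order given (paths are hypothetical)."""
    for path in paths:
        with bz2.open(path, "rb") as handle:
            yield handle.read()
```

The sketch works on bytes rather than text because a chunk boundary can fall in the middle of a multi-byte character as easily as in the middle of a tag.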

Once the database is loaded, you should install MediaWiki if you have not done so already. This should be a simple step (see http://www.mediawiki.org/wiki/Installation). If your web server is Apache, copying the unzipped MediaWiki folder into the www folder and opening http://<yourhost>:<port>/<Mediawiki_Folder_Name> should start a wizard that takes you through the installation procedure.

The next step is to build the indices. Go to the maintenance folder in the MediaWiki installation and run rebuildall.php. This can take a very long time, depending on your processor capacity. On a computer with 8 GB of RAM, I left the process running for around a week.

Wikipedia access data:

We were looking for access statistics for Wikipedia for our research purposes and found this:

http://dammit.lt/wikistats/

This is accurate hourly snapshot data on accesses to Wikipedia. The data is in the format <Language, Page, AccessCount>. However, we were looking for data in the format <Page, Date&Time, IPAddress>, because we wanted to identify user sessions on Wikipedia pages. That data, however, is not being given out right now.
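
As an illustration of working with the <Language, Page, AccessCount> records, here is a minimal Python sketch that adds up the counts per page for one language. It assumes each record is a single line with whitespace-separated fields in that order and ignores anything after the third field; the exact file layout is an assumption on my part, so check it against a real file first. The function names are made up.

```python
from collections import Counter


def parse_line(line):
    """Split one record into (language, page, count); None if malformed."""
    fields = line.split()
    if len(fields) < 3:
        return None
    try:
        count = int(fields[2])
    except ValueError:
        return None
    return fields[0], fields[1], count


def total_by_page(lines, language="en"):
    """Sum access counts per page for one language across the given lines."""
    totals = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and record[0] == language:
            totals[record[1]] += record[2]
    return totals
```

Feeding it the lines of several hourly files in a row gives totals over a longer period, for example a full day.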

While searching for the access data, we also found a site that shows Wikipedia traffic statistics: http://stats.grok.se/. This, however, is just a different view of what is shown on the site above.

I will keep you posted on our progress in this research area and try to keep my blog active. :)

September 7, 2008

