Data Wrangling Blog: Wikipedia Page Traffic Statistics Dataset

  • treffer · 6 months ago

    Hi,


    I'm currently trying out Hadoop for fun and learning. I'm parsing the full .xml.bz2 dumps (simple.wikipedia.org / en.wikipedia.org).


    You might be interested in the reader at http://github.com/rtreffer/hadoop-wikimedia-fun... - it seems to work (it is 12% through en.wikipedia.org at the moment) but requires bzip2 split support (trunk + Hadoop-4012-version9.patch).
    It should be easy to adapt (for example, to drop the article text). Computing edit counts would be quite easy with it.


    It would be interesting to see how edit counts compare to page views.


    Please note that I'm quite new to Hadoop :)


    Regards,
    Rene
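
For readers without a Hadoop cluster, the edit-count idea can be tried on a single machine. The following is a minimal sketch, not the reader from the repository above: it assumes the dump follows the usual MediaWiki export layout (`<page>` elements containing a `<title>` and one or more `<revision>` elements), and the function names `count_revisions` and `count_revisions_in_dump` are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(stream):
    """Return a Counter mapping page title -> number of <revision> elements.

    Assumes the MediaWiki export layout: <page><title/><revision/>...</page>.
    """
    counts = Counter()
    title = None
    revisions = 0
    # iterparse streams the document, so the whole dump is never held in memory.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()  # drop the revision text once it has been counted
        elif name == "page":
            counts[title] += revisions
            title, revisions = None, 0
            elem.clear()
    return counts


def count_revisions_in_dump(path):
    """Count revisions per page in a bzip2-compressed XML dump file."""
    with bz2.open(path, "rb") as stream:
        return count_revisions(stream)
```

This runs sequentially, so it is only practical for the smaller dumps (such as simple.wikipedia.org); the split support mentioned above is what lets Hadoop parallelise the same work over en.wikipedia.org.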

  • Peter Skomoroch · 6 months ago

    Rene,


    I should have mentioned in the blog post that the Wikidump directory of the dataset has the raw Wikipedia content already parsed into Hadoop-readable, tab-delimited files:


    692M page.txt
    115M redirect.txt
    987M revision.txt
    17G text.txt


    I'll edit the post to reflect this...


    -Pete
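
Because these files are tab-delimited, a quick per-key tally can be sketched in a few lines of Python. The column layout of the files is not given in this thread, so the column index below is an assumption to be checked against the actual data, and `count_by_column` is a hypothetical helper name.

```python
from collections import Counter


def count_by_column(lines, column):
    """Count how often each value appears in one column of tab-delimited lines.

    `column` is a zero-based index; lines with too few fields are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if column < len(fields):
            counts[fields[column]] += 1
    return counts


# Hypothetical usage: tally rows of revision.txt by the value in column 1.
# Which column holds the page identifier is an assumption, not documented here.
#
# with open("revision.txt", encoding="utf-8", errors="replace") as f:
#     edits_per_page = count_by_column(f, 1)
```

The function reads one line at a time, so it also works on the larger files without loading them into memory.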

  • malar · 5 months ago
    Hi,
    I am now working on temporal data mining. I need temporal (time series) datasets related to the medical field or the share market. Can you provide some datasets for me?
  • pskomoroch · 5 months ago
    Malar,

    Lots of medical time series data is available; check my del.icio.us links:

    http://delicious.com/pskomoroch/dataset
    http://delicious.com/pskomoroch/dataset+timeseries

    Cardiac, EKG, neuron spike trains: there are a bunch of datasets out there, depending on what you are trying to do. Some good neural time series data is here: http://www.crcns.org/data-sets

    Also check out:

    http://www.neural-forecasting-competition.com/

    -Pete
  • avi · 1 week ago
    Hi, I am trying to access the wiki page traffic data but cannot get to it. I wanted to access the data through the Amazon Management Console from Windows 7 but could not. I first created an instance and then attached the snapshot to that instance. When I logged into the instance via Remote Desktop, I did not see the datasets anywhere.

    I am very new to this realm and do not know much about how AWS works. I have coded heavily in MATLAB and WinBUGS for statistics, but this seems like a totally different ball game.

    Your advice on how to access the data via the Amazon Management Console would really help me, as I want to use the page traffic data for an academic project.
  • pskomoroch · 1 week ago
    The easiest way would be to access the data from a Linux instance such as Ubuntu. With some legwork, you should be able to use Samba to access the volume from Windows. I try to stay away from Windows these days, too many headaches: http://polishlinux.org/linux/ext3-reiserfs-xfs-...