Website: http://www.datawrangling.com/
Original page: http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset
Hi,
I'm currently trying out Hadoop for fun and learning. I'm parsing the full .xml.bz2 datasets (simple.wikipedia.org / en.wikipedia.org).
You might be interested in this reader: http://github.com/rtreffer/hadoop-wikimedia-fun... It seems to work (12% on en.wikipedia.org at the moment) but requires bzip2 split support (trunk + Hadoop-4012-version9.patch).
It should be easy to alter (drop any text). Edit counts would be quite easy with this.
It would be interesting to see how edit counts compare to page views.
Please note that I'm quite new to Hadoop :)
Regards,
Rene
Rene,
I should have mentioned this in the blog post: the Wikidump directory of the dataset has the raw Wikipedia content already parsed into Hadoop-readable, tab-delimited files:
692M page.txt
115M redirect.txt
987M revision.txt
17G text.txt
I'll edit the post to reflect this...
-Pete
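For the edit counts mentioned above, here is a minimal sketch of tallying rows per page from a tab-delimited file such as revision.txt. The column layout is an assumption: the thread does not say which field holds the page id, so `page_col` is a hypothetical parameter to adjust after inspecting the file, and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def count_edits_per_page(lines, page_col=1):
    """Count rows per page id in tab-delimited revision records.

    page_col is an assumed column position, not confirmed by the dataset.
    """
    counts = Counter()
    # QUOTE_NONE treats quote characters as ordinary text in each field.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= page_col:
            continue  # skip short or malformed rows
        counts[row[page_col]] += 1
    return counts


# Tiny made-up sample standing in for revision.txt (hypothetical values).
sample = io.StringIO("1\t10\tfirst\n2\t10\tsecond\n3\t11\tthird\n")
edit_counts = count_edits_per_page(sample)
print(edit_counts.most_common(2))
```

On the real file you would pass an open file handle instead of the in-memory sample. At this size a single machine is enough for revision.txt; the same per-key count maps naturally onto a Hadoop job if you want to join it against the page view data.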
I am now working on temporal data mining. I need a temporal (time series) data set related to the medical field or the share market. Can you provide some data sets for me?
Lots of medical time series data is available, check my del.icio.us links:
http://delicious.com/pskomoroch/dataset
http://delicious.com/pskomoroch/dataset+timeseries
Cardiac, EKG, neuron spike trains: there are a bunch of datasets out there, depending on what you are trying to do. Some good neural time series data is here: http://www.crcns.org/data-sets
Also check out:
http://www.neural-forecasting-competition.com/
-Pete
I am very new to this realm and do not know much about how AWS works. I have coded heavily in MATLAB and WinBUGS for statistics, but this seems like a totally different ball game.
Your advice on how to access the data via the Amazon Management Console would really help me, as I want to use the page traffic data for an academic project.