Website: http://www.datawrangling.com/
Original page: http://www.datawrangling.com/some-datasets-available-on-the-web
The LDC link does not work. As a taxpayer, I am forced to wonder why their data is not open source instead of proprietary and subscription-based.
Are there any datasets of chat logs? Chat conversations (from IRC or otherwise)?
civilian,
The LDC site was up yesterday. Maybe it was hammered by reddit/del.icio.us users? I think some of the datasets they have are extremely large (for example, the Google N-grams), so there is a handling fee for non-commercial researchers. As for commercial-use fees, many data providers restrict commercial use entirely. Open access to more data would be great ... except where privacy issues are involved. Sometimes there are also competitive reasons for restrictive licenses.
See more on the issues here:
http://en.wikipedia.org/wiki/Open_data
http://en.wikipedia.org/wiki/Data_privacy
related discussion:
http://news.ycombinator.com/item?id=100197
skj,
The WebBase Project link includes some chat data. It would be pretty easy to crawl for that data, provided terms of use for the chat sites are followed. Here is a recent list of hosts Stanford WebBase crawled, which includes chat sites (this link might not be permanent):
http://dbpubs.stanford.edu:8090/~testbed/doc2/WebBase/crawl_lists/crawled_hosts.0403
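As a rough illustration of picking chat-related hosts out of a list like that one, here is a minimal sketch. It assumes a plain-text list with one hostname per line, which may not match the actual WebBase file format. The function name `find_chat_hosts`, the keywords, and the sample hosts are all made up for the example. Any crawl of the resulting hosts would still need to respect each site's terms of use.

```python
from typing import Iterable, List, Sequence


def find_chat_hosts(
    lines: Iterable[str],
    keywords: Sequence[str] = ("chat", "irc"),
) -> List[str]:
    """Return hostnames that contain any keyword (case-insensitive).

    Assumes one hostname per line; blank lines and lines starting
    with '#' are skipped. This is a plain substring match, so it can
    produce false positives (e.g. "irc" inside "circuit").
    """
    matches = []
    for line in lines:
        host = line.strip().lower()
        if not host or host.startswith("#"):
            continue
        if any(keyword in host for keyword in keywords):
            matches.append(host)
    return matches


# Hypothetical sample data, not taken from the real crawl list.
sample_hosts = [
    "www.example.com",
    "chat.example.org",
    "irc.example.net",
    "news.example.edu",
]

chat_hosts = find_chat_hosts(sample_hosts)
print(chat_hosts)
```

For a real list, the same function could be fed an open file object instead of the sample list, since it only needs an iterable of lines.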
Looks like Google is going to start providing access to loads of open source data sets (http://blog.wired.com/wiredscience/2008/01/goog...)
The omission of cogmap makes me sad! Cogmap provides organization chart data for thousands of companies and exposes it all through a variety of web services.
Brent,
Just added Cogmap to my dataset bookmarks... any chance of releasing a raw dataset or REST API to fetch raw org chart data?
-Pete
It's in there! http://www.cogmap.com/blog/2008/03/04/cogmap-apis/
-- brent
Brent, sorry I missed that. This data will be useful for some identity matching projects I'm testing. I just found the ProgrammableWeb description of your API as well: http://www.programmableweb.com/api/cogmap
What about including the wiki http://www.numberzoom.com/, which is a user-contributed phone number database? It's mostly reverse caller ID for looking up which telemarketers or collection agencies have called, but there is no reason why other numbers wouldn't be on the site.
I don't know whether you've seen CKAN (Comprehensive Knowledge Archive Network). It is a project started by the Open Knowledge Foundation (of which I'm a part) that launched about a year ago and seeks to perform exactly the type of registry task you've started on here (though limited to open material only). As the blurb on the front page says:
CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own – be that a set of Shakespeare's works, a global population density database, the voting records of MPs, or 30 years of US patents.
Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.
Rufus,
I had bookmarked the project here in July: http://project.knowledgeforge.net/ckan/wiki/package
Looks like you have made a lot of progress since then. I've just subscribed to the Open Knowledge blog: http://blog.okfn.org/
Your message is right on target: "Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge."
Installing/discovering data should be as easy as installing Linux software using repository mirrors...
port install library-of-congress
-Pete
What do you think of http://data.un.org ?
Tim,
I like the site and the ability to download in various formats. A REST or SOAP API would be nice, or at least an index page for each format with direct paths to the individual downloads.
-Pete
A personal favorite: ITRDB: ftp://ftp.ncdc.noaa.gov/pub/data/paleo/treering/
http://www.fda.gov/Drugs/GuidanceComplianceRegu...
Some analogous time series data might be worth looking at as well:
http://delicious.com/pskomoroch/timeseries+dataset
Where can I find a dataset for manufacturing, such as one for 'defect' or 'not defect' prediction? Please help me, I need it urgently. Thanks in advance.
I am doing a project on association rule mining (data mining) for time-related data, so I need a dataset with time-related, i.e. temporal, data. Please send me one; it would be really helpful.
Thank you.