Data Wrangling Blog: Some Datasets Available on the Web

  • civilian · 1 year ago

    The LDC link does not work. As a taxpayer, I am forced to wonder why their data is not open source instead of proprietary and subscription-based.

  • skj · 1 year ago

    Are there any datasets of chat logs? Chat conversations (from IRC or otherwise)?

  • Peter Skomoroch · 1 year ago

    civilian,


    The LDC site was up yesterday; it may have been hammered by reddit/del.icio.us users. I think some of their datasets are extremely large (for example, the Google N-grams), so there is a handling fee for non-commercial researchers. As for commercial use fees, many data providers restrict use entirely. Open access to more data would be great ... except where privacy issues are involved. Sometimes there are also competitive reasons for restrictive licenses.


    See more on the issues here:


    http://en.wikipedia.org/wiki/Open_data
    http://en.wikipedia.org/wiki/Data_privacy


    related discussion:


    http://news.ycombinator.com/item?id=100197

  • Peter Skomoroch · 1 year ago

    skj,


    The WebBase Project link includes some chat data. It would be pretty easy to crawl for that data, provided terms of use for the chat sites are followed. Here is a recent list of hosts Stanford WebBase crawled, which includes chat sites (this link might not be permanent):

    http://dbpubs.stanford.edu:8090/~testbed/doc2/WebBase/crawl_lists/crawled_hosts.0403
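
    One rough way to respect a site's crawling rules is to check its robots.txt before fetching anything. The sketch below uses Python's standard urllib.robotparser; the robots.txt content, the host chat.example.org, and the user agent name are all hypothetical, and robots.txt is only a partial proxy for a site's terms of use, which still have to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch it from
# http://<host>/robots.txt, e.g. via RobotFileParser.set_url() and read().
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the robots.txt rules allow for this agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS_TXT)
    candidates = [
        "http://chat.example.org/public/room1",   # hypothetical URL
        "http://chat.example.org/private/logs",   # hypothetical URL
    ]
    for url in allowed_urls(parser, "example-research-bot", candidates):
        print(url)
```

    Filtering the candidate list before any request is made keeps the crawl logic separate from the politeness check, so the same filter can be reused for every host in a crawl list.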

  • Marc Chung · 1 year ago

    Looks like Google is going to start providing access to loads of open sourced data sets (http://blog.wired.com/wiredscience/2008/01/goog...)

  • Brent · 1 year ago

    The omission of cogmap makes me sad! Cogmap provides organization chart data for thousands of companies and exposes it all through a variety of web services.

  • Peter Skomoroch · 1 year ago

    Brent,


    Just added Cogmap to my dataset bookmarks... any chance of releasing a raw dataset or a REST API to fetch raw orgchart data?


    -Pete

  • Brent · 1 year ago
  • Peter Skomoroch · 1 year ago

    Brent, sorry I missed that. This data will be useful for some identity matching projects I'm testing. I just found the ProgrammableWeb description of your API as well: http://www.programmableweb.com/api/cogmap

  • Lili · 1 year ago

    What about including the wiki http://www.numberzoom.com/, which is a user-contributed phone number database? It's mostly reverse caller ID for looking up which telemarketers or collection agencies have called, but there is no reason why other numbers wouldn't be on the site.

  • Rufus Pollock · 1 year ago

    I don't know whether you've seen CKAN (Comprehensive Knowledge Archive Network). This is a project started by the Open Knowledge Foundation (of which I'm a part) and was launched about a year ago and seeks to perform exactly the type of registry task you've started upon here (though limited to open material only). As the blurb on the front-page says:


    CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own – be that a set of Shakespeare's works, a global population density database, the voting records of MPs, or 30 years of US patents.


    Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.

  • Peter Skomoroch · 1 year ago

    Rufus,


    I had bookmarked the project here in July: http://project.knowledgeforge.net/ckan/wiki/package


    Looks like you have made a lot of progress since then. I've just subscribed to the Open Knowledge blog: http://blog.okfn.org/


    Your message is right on target: "Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge."


    Installing/discovering data should be as easy as installing Linux software using repository mirrors...


    port install library-of-congress


    -Pete

  • Tim · 1 year ago

    What do you think of http://data.un.org ?

  • Peter Skomoroch · 1 year ago

    Tim,


    I like the site and the ability to download in various formats. A REST or SOAP API would be nice, or at least an index page for each format with direct paths to the individual downloads.


    -Pete

  • Ken · 10 months ago

    A personal favorite: ITRDB: ftp://ftp.ncdc.noaa.gov/pub/data/paleo/treering/

  • uma · 5 months ago
    Where can I get a dataset for mining unexpected temporal association rules (e.g., for an application in adverse drug reactions)?
  • pskomoroch · 5 months ago
    The FDA has some data like that:

    http://www.fda.gov/Drugs/GuidanceComplianceRegu...

    Some analogous time series data might be worth looking at as well:

    http://delicious.com/pskomoroch/timeseries+dataset
  • uma · 5 months ago
    The site you mentioned contains the data required for my project. Thank you for this great help.
  • halim · 5 months ago
    Dear all,

    Where can I find a dataset for manufacturing, such as one for 'defect' or 'not defect' prediction? Please help me, this is urgent. Thanks in advance.
  • rajeswari · 5 months ago
    This is Rajeswari.
    I am doing a data mining project on association rule mining for time-related data, so I need a dataset with time-related (i.e., temporal) data. Please send me such data; it would be really helpful.
  • Mandy · 3 months ago
    I need RFID supply chain data for my research. Where can I find it?
  • endah · 2 weeks ago
    I need a free fingerprint image dataset. Where can I find one?
    Thank you.
  • manoj · 5 hours ago
    I need a Boolean dataset for association mining for my MTech project. Please point me to where I can find one. It would be a great help.