AWS Data Exchange (amazon.com)
173 points by jeffbarr on Nov 13, 2019 | 62 comments


Wow we were just talking about selling shovels in a (ML) gold rush.

Incidentally, what is the open source alternative to this? Data is so cheap that it should be actually free, unlike counterfeit nike shoes.

(Does a bittorrent tracker specifically for research data exist? Edit: there's http://academictorrents.com/)


> What is the open source alternative to this?

There are many public-domain datasets:

* https://aws.amazon.com/opendata/public-datasets/

* https://cloud.google.com/public-datasets/

* https://github.com/awesomedata/awesome-public-datasets

> Data is so cheap that it should be actually free

I feel like you haven't had the opportunity to work with data that has value.


if a dataset is really valuable, it won't be available for public consumption/copying, and, like nike, won't be selling itself on amazon


Sometimes it's not that the data itself doesn't have value, it's that getting the value out of that data requires a lot of blood, toil and treasure in human capital.

That's how data companies make money, by doing that work for you.


> if a dataset is really valuable, it won't be available for public consumption

Data vendors sell valuable data all the time to essentially anyone who is willing to pay their asking price. I don't see how this is any different. Just because AWS is trying to run a marketplace doesn't suddenly make the data "public" in an open/free sense; there's still a price tag attached to it.

> and, like nike, won't be selling itself on amazon

Not every data vendor (or retailer since you keep bringing up Nike) has the necessary brand recognition, marketing budget, or technical proficiency to only sell direct to customers: that's why centralized marketplaces and alternate distribution channels exist.


> valuable data all the time to essentially anyone

if it's truly valuable it isn't sold to 'essentially anyone'. like high-value financial info or security.


Sorry, I don't agree. There have been a number of successful start-ups that monetized publicly available data in their products. Real estate and finance come to mind, but there's a ton of data; you just have to grab it and make sense of it for someone who's willing to pay for that service.


Thomson Reuters, Bloomberg, FTSE, LSE, NYSE, Nasdaq, etc. all sell data that is very valuable.

I feel like you haven't had the opportunity to work with data that has value.


i doubt TR or bloomberg will be selling on amazon, considering what they do to other lines of business. unless they're foolish, of course


Bloomberg usually prevents customers from using their other services and platforms if the customers pay for data from a competing provider. The settlements from lawsuits that follow are only a fraction of what Bloomberg gains by compelling customers to use the Bloomberg suite, so they continue to conduct business this way.

It will be a very busy, but lucrative time for corporate law firms as they battle things out.


Emmmm. Wikipedia is pretty valuable yet it is free.


sorry, i meant valuable in the monetary sense. Wikipedia is very valuable but its market price is $0


Thinking that there are no useful freely available datasets is beyond naive.


Data by itself is cheap.

Data that has a useful business or predictive purpose, and that is clean and constantly updated, is not cheap at all.


why not sell them directly through an API instead of paying an amazon tax?


Most people would host that API using AWS, and there you are.


Most people probably already are. The marketplace is just another distribution channel.


> Data is so cheap that it should be actually free, unlike counterfeit nike shoes.

Getting data is not cheap, and maintaining a dataset is certainly not cheap either.


> Data is so cheap that it should be actually free, unlike counterfeit nike shoes.

This really depends on the data. In pharma the right 50 bytes of data can be worth billions. Not all data is personal product preferences for ad targeting.


DBnomics (https://db.nomics.world/) is an example of a free & open API to source large volumes of data, in this case economic data. It's also completely open source.



This is nice, but I don't see the pricing after the free trials. The Pitney Bowes data [1] they used as an example in the linked article only shows $0 for the free trial, not what it's going to cost you afterwards. It'd be nice to know the long-term cost before tying this data into your business.

[1] https://aws.amazon.com/marketplace/pp/prodview-bwf7mapyyjzom...


Their screenshot shows an "autorenew" enabled on a $0 free trial with zero information on how much it will cost after the free trial is up.

You shouldn't allow people to sign up to "autorenew" at the end of their free trial period without clearly displaying the price that will apply when the trial ends.


It looks like the trial dataset in this case is simply a truncated subset of the full dataset (only San Francisco zip code boundaries). It would likely remain free forever under this model of "trial".


AWS Data Exchange product manager here. The auto-renew here only applies to the free trial product that you're subscribing to, and would therefore auto-renew at $0. You can view and subscribe to the full product separately (it's technically a separate product), which will show its price. Hope this helps.


The other datasets show prices for me when I log in. A few financial datasets I checked seemed over-priced by as much as 10x.


I had the same thought! I submitted a request for one of the 'free' one-month buckets of non-profit consumer geo-data, and clicked no auto-renew. It could potentially have some value (though I don't think much). I don't know how much due diligence they will do or if they will just accept our request automatically.


> This is nice, but I don't see the pricing after the free trials.

AWS charges are really obscure. I think they make an effort at making it hard to estimate/determine how much you'll be charged (i.e., you're charged for EC2 time, bandwidth, storage, etc.). That makes it very hard to see where the money is going.


At Scale (scale.com), we strongly believe that the “open-source” alternative to this is pretty critical.

We’ve built this index for autonomous driving datasets (https://scale.com/open-datasets) and are building that out for other domains right now.

Open source data has been a pillar of progress in ML (starting with ImageNet). It should continue to be the case that data that enables researchers is sufficiently democratized.


It looks like this is targeted at ML/AI but I have a tangentially related question: does anyone know of open source or other publicly available lists of US businesses? Just business name and address?

I’m building out an app and we receive documents from all kinds of vendors from all over the country. The app is for our business to manage our client data. I was hoping to find a list of businesses I could throw in the db rather than piecemeal add the addresses in one by one as the documents come in.

I looked at some of the data service providers (infoUsa was one, I think, and D&B another), but they were asking $50,000 for one dataset of just business names and addresses. I think my use case is unique in that these companies typically sell this data as sales lead data, which it definitely is not in my case (we don’t even sell b2b).

Anyone know of anything like this? I suppose I could just scrape phone books but I think if I can’t find the data we will just resort to one by one entry.


It’s been a minute since I looked at the docs, but could you use the Google Maps API for this (or perhaps OpenStreetMap)? You could query all of the POI categories for a city/state/country and save the addresses and names. Might be something, unless you need the legal name of the business rather than whatever they make public.
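
A rough sketch of the OpenStreetMap route, assuming the public Overpass API and the standard OSM `addr:*` tags. The function names and the sample response are made up for illustration, and the query has not been run against a live server, so treat it as a starting point rather than a working scraper:

```python
def build_overpass_query(city: str, key: str = "shop") -> str:
    """Build an Overpass QL query for named elements tagged with `key` inside a named area."""
    return (
        "[out:json][timeout:60];\n"
        f'area["name"="{city}"]->.searchArea;\n'
        f'nwr["{key}"]["name"](area.searchArea);\n'
        "out center;"
    )


def extract_businesses(response: dict) -> list[dict]:
    """Pull name and address fields out of an Overpass-style JSON response."""
    rows = []
    for element in response.get("elements", []):
        tags = element.get("tags", {})
        name = tags.get("name")
        if not name:
            continue  # skip anything without a public-facing name
        street_parts = (tags.get("addr:housenumber"), tags.get("addr:street"))
        rows.append({
            "name": name,
            "street": " ".join(p for p in street_parts if p),
            "city": tags.get("addr:city", ""),
            "postcode": tags.get("addr:postcode", ""),
        })
    return rows


# Hypothetical response shaped like Overpass JSON output (not real data).
sample_response = {
    "elements": [
        {"type": "node", "id": 1, "tags": {
            "name": "Example Hardware", "shop": "hardware",
            "addr:housenumber": "123", "addr:street": "Main Street",
            "addr:city": "San Francisco", "addr:postcode": "94110"}},
        {"type": "node", "id": 2},  # no tags, so it is skipped
    ]
}

print(build_overpass_query("San Francisco"))
print(extract_businesses(sample_response))
```

Many OSM entries have no address tags at all, so expect gaps; the name is whatever the business shows publicly, not the legal name.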


Legal names aren't critical but would be nice. I think that if the business posts the name that would be close enough.

Very interesting idea - many thanks!


OpenCorporates is a good source for that, and it's open data (unlike Google Maps): https://api.opencorporates.com/documentation/API-Reference


You might be better off just using Google Maps Places API. IDK the traffic levels for your app, but depending on your use case it might be much easier to just use Google's API instead of trying to maintain a list yourself.


As someone who deals with HIPAA every day, it really bothers me that Change Healthcare is there even if the data is "anonymized".


I want to know how the hell they get it to begin with. CMS releases procedures/prescription data. But it is always totals by provider and it is provided 3 years in arrears. And that is only Medicare data. How are they getting private pay information?


I'm not sure why it would bother you. I can generate random data in the same format and you can't figure out which is real or whose data it is. Thinking about data as if it has intrinsic value by being recorded is foolish, as many find out during acquisitions.


With another dataset that is non-anonymous, you can cross-verify with the anonymous data to increase your confidence of who they are. This happened when Netflix released anonymized user ratings. Researchers were able to deanonymize some users by using IMDB ratings.

https://www.wired.com/2007/12/why-anonymous-data-sometimes-i...
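
To make the cross-referencing concrete, here is a minimal sketch of a linkage on shared quasi-identifiers. It is a simplified exact-match join, not the fuzzy rating-and-date matching used on the Netflix data, and every record, field name and the `link_records` helper are invented for illustration:

```python
from collections import defaultdict


def link_records(anonymized, public, keys):
    """Map anon_id -> name where the quasi-identifiers match exactly one public record."""
    index = defaultdict(list)
    for row in public:
        index[tuple(row[k] for k in keys)].append(row["name"])

    matches = {}
    for row in anonymized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match is a confident re-identification
            matches[row["anon_id"]] = candidates[0]
    return matches


# Invented "anonymized" release: names removed, quasi-identifiers kept.
anonymized = [
    {"anon_id": "a1", "zip": "94110", "birth_year": 1980, "sex": "F"},
    {"anon_id": "a2", "zip": "94110", "birth_year": 1975, "sex": "M"},
    {"anon_id": "a3", "zip": "10001", "birth_year": 1990, "sex": "F"},
]

# Invented non-anonymous dataset that shares the same fields.
public = [
    {"name": "Alice Example", "zip": "94110", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "zip": "94110", "birth_year": 1975, "sex": "M"},
    {"name": "Carl Example", "zip": "94110", "birth_year": 1975, "sex": "M"},
]

print(link_records(anonymized, public, keys=("zip", "birth_year", "sex")))
```

Only `a1` is re-identified here: `a2` matches two people and `a3` matches nobody. The more fields the two datasets share, the more records end up with a single candidate.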


Similar thing happened a while back when AOL released 'anonymized' search data - https://en.wikipedia.org/wiki/AOL_search_data_leak

I dug through it just out of morbid curiosity and unfortunately stumbled upon a few searches indicating that a friend privately suffered a miscarriage. This was only possible because I recognized her as the only person that I knew at the intersection of a few other search terms that were correlated to the same 'anonymized' id.

GP is assuming a naive starting point when that is rarely the case.


You're making up a mythical insecure map of useful info to worthless information. This is not a compelling reason to treat all anonymized data as useful or vulnerable.


There are many free public datasets available on the web.

I have an open source project for crawling public datasets and making them searchable in one place: https://github.com/findopendata/findopendata.


They're crowd-sourcing valuable information services from third parties to become a market data provider.

What could go wrong for information providers where Amazon controls their market and infrastructure? They become commoditized "data providers". They are coerced into profit sharing with Amazon. They are eventually replaced by Amazon-provided data.

I won't buy from this market because I see where this is heading. It's the same reasoning I apply to not buying many other services and products from Amazon. It offers no additional value other than minor convenience to a customer at a much greater cost to the economy and providers.

Buying local isn't just for produce.


I've been thinking about your comment for nearly an hour now, wondering about all of the potential paths this could take.

If your company's goal is to sell commodity data and make money on volume rather than on high-margin, less frequent sales, then maybe this is not so bad.

To your point about Amazon eventually taking this over, I do believe that to be the case when there is near ubiquitous demand for the data type. And just looking over what is in there now, these are some pretty specific datasets.

I'm not saying you're wrong, I will be considering your comments for quite a while.

I have said this on here before and I think it still holds true. Amazon is like Jay Leno.

Conan O'Brien once said:

“Hosting The Tonight Show has been the fulfillment of a lifelong dream to me. And I want to say to the kids out there watching, you can do anything you want in life unless Jay Leno wants to do it, too.” Apr 17, 2014


exactly. hundreds of companies and professionals who work cleaning data just became outmarketed


As a consumer of a paid dataset, how would I trust that the vendor is publishing accurate and complete data?



How do you ever know?


How does a data provider prevent someone from copying the data from their S3 bucket into a new one, then cancelling the subscription and owning the data forever?


At least in the case of subscriptions I looked at:

1) It's a continually updated file, so if you didn't subscribe you wouldn't get new data weekly/monthly. Likely the subscription aligns with the data refreshes for most products.

2) It's a one-time fee with a hefty cost attached (I saw some healthcare data sets that were $100K+). You are paying for that in its entirety and just have data rights to it.


In one of the examples, the update is « quarterly »...


Usually the 'backhistory' of data is only of limited value. For use in real world applications in finance, consulting or business decisions, the last data point is usually pretty important.

Think of the price for data on the prices of stocks. If you want to know what AAPL traded at one month ago, that's free; if you want to know what it traded at 15 minutes ago, that's $25k; and if you want a feed which shows the actual bids and offers, that will run you $2-15mm + infrastructure.


Probably money to be made grabbing hard to get hold of open datasets and listing them here until someone complains?


There is probably nothing wrong with doing that.

Some people would rather pay and have it all in one place. Some people would pay for you to check its sanity and for having someone to blame if it's wrong.

Sometimes it's corporate paying, and the dev team does not care.


Some of these providers technically are already doing that. socialGist for instance crawls articles from many sources and resells them in this marketplace.


Like stolen or leaked data?


It's easy to find datasets, but the data collected by the government is hard to use - for example, tons of gov data is released in PDF format (and other obscure XML formats). Even when the data is available in machine-readable formats, you still need to read through 50-page readme/data dictionary files to understand the meaning of the data.

There is tremendous value in cleaning and repackaging this data in easy formats (offering an API would be awesome too). Even better if the provider can offer human support.

Obviously it would be ideal if the government releases all this data in easy formats in the first place, which a lot of governments do, but not all. The second best thing is for some private companies to help, even if they charge for it.

Not your parent commenter, but speaking from experience.

Here is a trivial example: I was looking for a list of all government websites (and their social media accounts), starting with federal all the way to small towns. There are lists, but none complete - at least none that I could find, and definitely not with their social media handles.


Will this affect Bloomberg's business?


Did anyone else find Jeff’s first sentence terribly unoriginal and somewhat wimpy?

“We live in a data-intensive, data-driven world!”

I know these blog posts are turned out fast, but especially for such a sensitive issue as a world awash in data that no one understands and no one - yet - controls...it seemed like it was “whistling past the graveyard”.


Sorry to let you down.


Totally agreed.


This used to be called AWS Juntos lol


Seriously! That's what it was called last year.



