Numerai – A hedge fund built by a global community of anonymous data scientists (numer.ai)
279 points by maxt on Oct 16, 2016 | 113 comments


Hedge fund guy here.

- You need to know something about the domain in order to make sensible predictions. Is the data daily? Is it per second? Is it ticks? You can't build a sensible model if you don't know that, even if you have good predictions. Relative cost will vary a lot between timescales.

- It matters what the features are. Maybe there's some clever reason why it doesn't, but until I hear why I'm going to take the ordinary view that some features are different in nature to others. For instance maybe one feature is volatility, a thing we typically model with GARCH, while another is some fundamental like P/E, which we'd incorporate some other way.

- How are you executing the trades? It matters a lot whether you're click-trading through some broker API, automating via Excel, or running your own network of colo servers. Some things just aren't possible if you're too slow.

- If you make the data encrypted, you'd better know very well what it represents. For instance, you might take all the closing prices of the LSE stocks on a given day as inputs. You can make analyses that are valid with that, and ones that aren't, because the data you've collected do not represent a snapshot of the market at a specific time. It might sound like it does, but it doesn't on deeper inspection (market opens and closes are not simultaneous).
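To make the GARCH point concrete: here's a minimal toy simulation (nothing Numerai-specific, all parameters made up). A GARCH(1,1) process produces returns that are individually uncorrelated but whose squared values cluster, exactly the kind of structure that gets lost if a volatility feature is treated like any other anonymous column.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_garch11(n, omega=0.05, alpha=0.15, beta=0.8):
    """Simulate r_t = sigma_t * z_t with GARCH(1,1) variance:
    sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2."""
    r = np.zeros(n)
    var = omega / (1 - alpha - beta)  # start at the unconditional variance
    for t in range(n):
        r[t] = np.sqrt(var) * rng.standard_normal()
        var = omega + alpha * r[t] ** 2 + beta * var
    return r

returns = simulate_garch11(5000)
# Volatility clustering: squared returns are positively autocorrelated
# even though the returns themselves are (close to) uncorrelated.
sq = returns ** 2
acf1 = float(np.corrcoef(sq[:-1], sq[1:])[0, 1])
```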

Does anyone know how it's going for them?


Although numerai could control for some of the factors that you mention above, I'm pretty sure they are falling into the data-mining fallacy that most people do when they approach quantitative trading from the outside world. I completely agree with your skepticism of their approach.

While prediction based on data may be valuable in some cases, it isn't robust enough to scale up in any meaningful way. Context matters, like you state above, and most quantitative traders start by taking their contextual knowledge of the markets, and then collecting data on features, and THEN they fit a model to it.

Skipping these steps is only going to lead to a bunch of blowups. I doubt they have any meaningful Sharpe that they could scale up or publicly defend with the approach they have taken so far. I'd guess they are paying people VC dollars, not actual market profits, right now.

I do think it's cool that they have been able to use homomorphic encryption to solve the problem of wanting to anonymize data, but I'm not sure it actually helps in this case.


Knowing what features represent is an advantage but isn't necessary. At least that's what NN research shows. But currently a lot of data is required for good representation learning. Ensembling also works, or can work, and it seems that's what they're doing. Although it doesn't look like the datasets are large enough for NNs, or that the ensembles are large enough, for good prediction.


You don't really need to know what the features mean to make a good model. Even if you were the only one who knew the meaning of the features, you'd still get beaten by someone using a black-box approach. This is exactly what happened to the owner: he worked as a quant making predictions on this expensive dataset, and in the first tournament he was beaten by around 50 other modelers using anonymized features.


Based on the contest (weekly submission) and data (binary classification), it is my guess that they are running a long-short equity portfolio with weekly rebalancing. The features are known ahead of time and the dataset is not BIG, so I think some sort of fundamental approach.

I'm not optimistic...


I think the hedge fund founders' motivation behind this idea could be something like this:

Let's give many people some [encrypted] data to play with. Maybe someone will find a curious and unexpected pattern we didn't look at before. Maybe the idea won't be workable straight away, but we [the hedge fund] are going to investigate it further and probably use it in our trading strategy.

In short, they are probably mining for ideas using the power of the crowd.


From the competition rules: "You retain all intellectual property rights to your model. You never have to tell anyone how you built it and you never have to tell us who you are. You only upload your predictions."


I don't know how it's going for Numerai, but I'd be interested in picking your brain a bit. If you don't want to put contact info in your profile, do you mind shooting me an email?


Your constraints are rather simplistic given the people behind this. It is possible that all of the factors you've just described are encoded in the data set.


"It's just a pure math problem. It's like a math competition. You don't need to know anything about finance, you don't know anything about hedge funds... you don't even have to speak English..."

This is not a pure math problem. Eventually the outcome of all these models and predictions affects the stock prices and - if it becomes as successful as you hope - the economy as a whole. And the physical world: people, animals, plants, pollution, CO2 and so on.

I would much rather see work in that area - than this data juggling deep learning bullshit which results in profits being paid out to a bunch of intelligent, greedy and unwise people.

I just hope that when the intelligent people finally become wise, it won't be too late.


> than this data juggling deep learning bullshit which results in profits being paid out to a bunch of intelligent, greedy and unwise people.

Holy anti-intellectualism, Batman!

If you describe machine learning as "data juggling bullshit", this is very strong evidence that you simply don't understand what it is. This is an indictment of you, not of machine learning. Machine learning would more accurately be called "applied computational statistics" in 99% of cases.

What makes you think that the people using applied statistics to make money are "unwise"? Based on your tone, I would guess that it's because what they're doing doesn't agree with your folk definition of "an honest day's work" or something like that. This isn't really a good criticism; it just means you don't see the utility of what they're doing, which requires some degree of abstract thinking about the market.


I read it mainly as a criticism of this site and this application of deep learning. Predicting stock prices slightly better doesn't really make the world a better place or improve anyone's lives. But it does consume the time and resources of very intelligent people who might be able to do incredible things elsewhere.

The particular problem with this website is its huge focus on anonymity. You don't need to expose your code or methods, or even your name. It's an ongoing competition, so it encourages secrecy so you can make more money next week.

This means even if someone does invent a new super ML method, they have every incentive to not share it and keep it a secret. So this website could actually have a net negative impact on the world.


Don't you think that it says something (not terribly good) about society that these very intelligent people must spend their time this way if they want to be rewarded?

Why spend 80 hours a week chasing grant money when you can spend 60 moving numbers around?


No. First off, "reward" is not just about the money[0]. The job presented on this site, on the other hand, due to the encryption, is motivated purely and entirely by monetary reward. Everything else is shielded.

In fact I'd say this job is quite uniquely, extremely, only about the money. Even compared to other sweet/lucrative, perhaps "stupid lucky" jobs you may find that require 60 hours of mindlessly pushing papers yet pay extremely well. I'd be hard-pressed to come up with any hypothetical kind of job that could be more sharply, singularly focused on "do extremely-intelligent-monkey-dance, receive ample bio-survival-tickets", to the exclusion of any other meaning or fulfilment, than this Encrypted Machine Lottery/Learning/Sudoku.

Second, I could be wrong (or too optimistic) but I'd hope that most "very intelligent" people demand more nourishing reward than just money to regularly put in more than 40h/wk. You can do this for research and advancement of science, or maybe because one considers the work itself a net positive to society, or maybe you choose to sacrifice those hours of your life to support loved ones, maybe there's some deep personal reward in it, or maybe it's temporary and you're saving for something rewarding. But to pretend just money is enough to waste your life on, hey it's a choice, but I'm going to have to see you turn in your "very intelligent"-card.

On the other hand, there will be exceptions. Some people don't care, they just want to maximize $$$ for the least amount of effort/hours of one's life, regardless of side-effects. This website does not fill a niche for these people. If you are extremely intelligent, just want money and don't care about moral aspects, there have always been plenty of "business opportunities" to fulfil these particular needs.

Finally, if one would argue that this job isn't quite as explicitly "badwrong" as those others, I'd suggest thinking it through: on the one hand, a (possibly) well-paying job that is designed from the bottom up to have no way of determining whether it has a net positive or negative external effect besides the money it pays you. On the other hand, literally anything else you could do with your time.

I can see people trying this for a short time, as a funny puzzle, at most.

[0] I don't mean this one, but because I still think it's funny, I'll point out the way-too-easy retort here: this says something (not terribly good) about you. (j/k)


> Predicting stock prices slightly better doesn't really make the world a better place or improve anyone's lives.

Actually, it does. Increasing liquidity, communicating price information, more accurately accounting for predictable future changes in price, and absorbing risk allow for increased agricultural and industrial production. Like I said, you have to think through a few layers of abstraction.

A nice rule of thumb is that if someone is getting paid to do something you personally think is useless, it's either a government job or you just don't get what they're doing, but the market does.


I don't believe that increasing liquidity produces much, if any, value for the world. The stock market itself is pretty disconnected from any real value, and at best any benefits just benefit rich investors buying stocks. It certainly doesn't increase production.

But the opportunity cost is what I'm getting at. So many very intelligent people could be producing immense value in other areas, but instead drained away doing this garbage. And they are incentivized to keep anything useful they invent a secret.


Wrong, wrong, wrong. Avoid speculating on such things when you clearly have no background in them. You will accomplish nothing but spreading misinformation.

http://www.investmentreview.com/files/2010/07/The-value-of-l...

https://fp7.portals.mbs.ac.uk/Portals/59/docs/Finger,%20Mark...

http://people.stern.nyu.edu/adamodar/pdfiles/papers/liquidit...

https://www.macquarie.com.au/dafiles/Internet/mgl/au/mfg/mim...

https://en.m.wikipedia.org/wiki/Liquidity_crisis

The stock market is absolutely connected to "real value" (which doesn't make sense as a concept in the way you're trying to use it). Besides allowing companies to raise capital for future endeavors, it also allows the market to communicate price information, which is critical for reducing the risk associated with a given transaction.

> at best any benefits just benefit rich investors buying stocks.

It benefits everyone; investors risk their capital by selling it to companies in exchange for partial ownership, which allows the company to grow, and if the investment pays off the person who risked their money is rewarded. Without the stock market, companies would have to get loans for growth, which is risky and too expensive to be practical.

> It certainly doesn't increase production.

Yes, it very much does, for the reasons I outlined above. This is like Econ 101; why are you commenting so strongly when you clearly do not have any background here?

> but instead drained away doing this garbage.

A small efficiency increase expressed over 100 trillion dollars of trades every year is a huge productivity increase for society.


I think you are the one spreading misinformation. I'm well aware of the arguments and find them extremely unconvincing.

Look, for instance, at high frequency trading. Traders spend millions of dollars trying to get signals across the earth a millisecond faster than the competitors. Does it actually benefit anyone that information travels a millisecond faster? Who benefits from the high frequency traders at all? Yet they eat up millions, maybe even billions, of the economy's resources doing this nonsense.

Look at these traders spending millions of dollars to fly drones over oil tankers or parking lots, to get slightly more accurate information faster than anyone else. Look at these hedge funds spending tons of money on bribes to people inside companies to get insider information before the public does. These things clearly provide zero benefit to the world; it's just a massive waste of resources.

This hedge fund is a bit different, in that they seem to be actually doing fundamentals and longer-term bets. That's not so bad, but it's still pretty disconnected from any real-world benefits. At best they predict a company will increase in value and buy its stock. How does that benefit anyone? It's almost a zero-sum game. The people that buy the stock that week lose, because the price is higher than it otherwise would have been. The people that sold the stock lose, because it was really worth more than they actually sold it for. Any money the hedge fund makes necessarily comes from someone else losing that money.

The stock market itself is not really connected to the real world. Sure sometimes companies sell stock to raise capital. But most stocks being sold and bought are not by the companies themselves. It's a huge indirect chain of traders and investors and speculators, trading these stocks many, many years after the company sold them. And the company itself would still be able to sell stocks if there were fewer traders. At worst the prices would be slightly less accurate.

Price information of stocks is also a horribly inefficient way of getting this information. These companies are often involved in many separate projects. So even if someone has a model that can exactly predict the success of products, it's very difficult to use that information. You also need to determine how much the company is worth given all of its projects and businesses and property it owns. It doesn't directly give that information to the company itself - they see their stock went down, but they may have no idea why. Not to mention Keynesian beauty contest problems, where you have to predict not what the company is worth, but what other people think it will be worth, and how much they know currently, and what the interest rate is, etc.

But you didn't address my main concern, which is the bad incentives of secrecy, and the opportunity cost of employing smart people doing this, instead of doing something else. Even if it does produce some value, it doesn't mean it's worth the cost.


It's also plausible that they are intelligent and hard working because of their willingness to earn money.


Good grief, you're right, I never realised: the market is always right about everything!


Nice straw man, but the market doesn't have to be right about everything; it just has to not be completely wrong about a market worth tens of trillions.

The stock market is one of the mechanisms that allows the market to be right by communicating price information.


> What makes you think that the people using applied statistics to make money are "unwise"?

Because it is an incredible waste of talent.

I know this is in contradiction with all the 'values' that have been drilled into our minds since we were born, but it's about time we wake up and reorder our priorities.

>Based on your tone, I would guess that it's because what they're doing doesn't agree with your folk definition of "an honest day's work" or something like that.

If you look at the state of the biosphere / atmosphere / oceans - the data - and if you have children - then it should be quite obvious.

Also there are the socio-economic challenges that the world faces right now - in fact it is unclear if we're going to make it to the next century as a species.

It really doesn't matter how much "money" you have in your account when your city sinks under the ocean...


The curse of Plato strikes again. You (or other elites) know better than the people actually doing the work that they're wasting their time and could use it in more meaningful/fruitful ways.

The fact that these people chose to spend their time doing numerai probably means that they think that's the best use of their time.


> in fact it is unclear if we're going to make it to the next century as a species.

The only way we're going to wipe ourselves out is with nukes, nanotech, or engineered viruses. Global warming is not an insurmountable concern. It will be expensive for coastal cities, though.

There's no reason we can't focus on multiple things. Global warming isn't so pressing that we need to dedicate all intellectual output to it. Every 10 or 20 years, the environmental alarmists come up with something new to obsess over (I think it was landfills previously), and they're doing it now with global warming.


> It really doesn't matter how much "money" you have in your account when your city sinks under the ocean...

You just move. That's what money enables. No city is going to sink so rapidly people just drown.


So you're literally arguing that we just need to focus on making money, and ignore things like the environment? Yep, sounds about like a textbook example of "unwise" to me.


Nope, just pointing out how stupid the conclusion is that the poster made.


What of the poor whose villages are (in effect) being sunk by the rich, or destroyed in various other creative ways?


So the 2012 (the movie) way of solving environmental problems: the cities can drown as long as we aren't among the people who drown with them.


I didn't read it as a criticism of machine learning or intellectuals, but rather of the many bright minds that are trying to make money off of the stock market by writing software to make better trades. This kind of activity doesn't create value or even, to my knowledge, liquidity.

I believe he's suggesting that there are a lot of other areas where their knowledge and skills could be applied which would create some societal benefit as a byproduct.


> This kind of activity doesn't create value or even, to my knowledge, liquidity.

It does both, the latter by more efficiently communicating price information and incorporating future events into asset prices. More accurate AI traders absorb risk, which is good for the production economy.


I'm currently watching George Soros's lectures on his reflexivity theory applied to financial markets [1], where he explains why their evolution is not easy to model due to the "human uncertainty principle", which postulates that our actions, based on incomplete and distorted data, make things unpredictable. It makes a lot of sense.

[1] - https://www.youtube.com/watch?v=RHSEEJDKJho


The data is encrypted. There are a set of factors in sequential order. You know little more. So for instance it's not obvious that the predictions are of stock prices at all.

They are in effect crowd sourcing curve fitting because the number of possible models for this maths problem is so large.

Think of crowd sourcing here as a pruning heuristic in a search problem.


Agreed - it is a terrible waste of talent.

Ironically, it's also not entirely a math problem - the underlying system doesn't exactly follow the expected rules of math.


What expected "rules of math" does it not follow?


We didn't start the fire / It was always burning / Since the world's been turning


Spent some time experimenting with Numerai. Really fun competition, clean (encrypted) dataset, and Bitcoin payouts. I wrote about my experience and open-sourced all of the models here[0] if you're looking to get started.

[0] https://github.com/jimfleming/numerai


Awesome. Thanks for the write up


Their dataset reeks of startup hustling.

I just downloaded the training set [1] and plotted some of its descriptive statistics [2]. It looks like all the features are uniformly distributed and the response variable is Bernoulli coin-flipping. In layman's terms, you can't really come up with a good predictive model with this training set.

I give them the benefit of the doubt that they wanted to have something in place to push the website live, but I cannot imagine any serious data scientist not noticing this.

[1] http://datasets.numer.ai/57feb95/numerai_training_data.csv

[2] https://cl.ly/051G3Y2Z2O0W


The features are all encrypted using homomorphic encryption, which ends up mapping the overall distribution uniformly onto [0, 1]. If you play with the dataset, though, you'll find there are some significant correlations between the different features and you can make a model that does substantially better than chance.
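A quick sketch of why uniform marginals don't rule out learnable structure. The rank transform below is just a stand-in for whatever the actual encryption does (that's an assumption, not a claim about Numerai's scheme): it maps any feature onto an evenly spaced [0, 1] grid while preserving the dependence between features.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two correlated latent features, standing in for raw pre-encryption data.
n = 2000
x = rng.standard_normal(n)
y = 0.7 * x + 0.3 * rng.standard_normal(n)

def to_uniform(v):
    """Rank transform: maps any marginal onto an evenly spaced [0, 1]."""
    return np.argsort(np.argsort(v)) / (len(v) - 1)

u, w = to_uniform(x), to_uniform(y)

# Each column now looks uniform on [0, 1] (mean 0.5, min 0, max 1)...
# ...but the dependence between the columns survives the transform:
corr = float(np.corrcoef(u, w)[0, 1])
```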


I would like to see a comparison between randomly generated uniform features and these features. Can you fit the noise and come up with "statistically significant" predictions? If so, is anyone doing any better than can be done on random data? If no one is doing better than that, what is the likelihood that these aren't any better than a monkey with a dartboard?

I would love to see peer reviewed articles from numerai with some of their behind the scenes results.


Whenever I make a model I always do cross-validation to make sure that I'm not overfitting. If we were just fitting random noise, I would see performance on my test set that is no better than chance (or worse).
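For what it's worth, the failure mode under discussion is easy to demonstrate with a toy example (numpy only, illustrative sizes): fit a linear model to pure noise and compare in-sample accuracy against a held-out half.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure noise: 500 rows, 50 uniform features, a coin-flip binary target.
X = rng.uniform(size=(500, 50))
y = rng.integers(0, 2, size=500).astype(float)

# Least-squares fit on one half, evaluated on the held-out half.
Xtr = np.column_stack([np.ones(250), X[:250]])
Xte = np.column_stack([np.ones(250), X[250:]])
beta, *_ = np.linalg.lstsq(Xtr, y[:250], rcond=None)

acc_train = float(np.mean((Xtr @ beta > 0.5) == y[:250]))
acc_test = float(np.mean((Xte @ beta > 0.5) == y[250:]))
# acc_train comfortably beats 50% (the model memorizes noise), while
# acc_test hovers around chance -- which is what a hold-out set catches.
```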


The key property that you should be testing for is dependence. It would be surprising if you could identify anything from the descriptive statistics alone; that would make this all child's play.

I can't imagine any serious data scientist not knowing that.


> to push the website live

The website has been live since like December 2015. There's a new tournament weekly (I think it used to be every second week). New tournament, new data -- always encrypted. Just look at the leaderboard; look at some of the participants' submission history.

> any serious data scientist

I mean, any human, data scientist or not, would see that the website wasn't just pushed live, or at least would have noticed the film is from like August... Are you a serious data scientist? If you are (which I assume you are, just also an eager beaver who downloaded the datasets without learning about the company sufficiently), then you'd do well to join the tournament and have a real go before you attempt to knock it as you have here, and then come back with some real feedback. I'm sure everyone would appreciate the 'after' report from you, since you're so kind as to already dish out benefits of the doubt :) Looking forward to catching you above, controlling capital!


I don't see how Numerai can avoid the multiple comparisons problem [0]. If people submit thousands of random models, then some subset of them will do a fantastic job in predicting prices in historical simulations but do poorly under real market conditions. As long as the models are black boxes, there's likely no good way to distinguish them from noise.

[0] https://en.wikipedia.org/wiki/Multiple_comparisons_problem
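The effect is easy to reproduce with a toy simulation (illustrative sizes, no relation to Numerai's actual numbers): generate a pile of skill-free coin-flip "models", and the best one's backtest looks impressive while its live performance reverts to chance.

```python
import numpy as np

rng = np.random.default_rng(7)

n_models, n_days = 2000, 250
# Every "model" is a coin flip: zero skill by construction.
backtest_preds = rng.integers(0, 2, size=(n_models, n_days))
live_preds = rng.integers(0, 2, size=(n_models, n_days))
backtest_truth = rng.integers(0, 2, size=n_days)
live_truth = rng.integers(0, 2, size=n_days)

backtest_acc = (backtest_preds == backtest_truth).mean(axis=1)
best = int(backtest_acc.argmax())

best_backtest_acc = float(backtest_acc[best])  # well above 50% by luck alone
best_live_acc = float((live_preds[best] == live_truth).mean())  # back to ~50%
```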


This can be mitigated by evaluating all the models on a hold-out test set (similar to what Kaggle does and what was done in the Netflix Prize). The multiple comparisons problem is also mitigated by the fact that the models won't be completely random; there will likely be some positive correlation between them.

edit: Also, by Hoeffding's inequality, the number of training examples needed for a given level of confidence is only logarithmic in the number of models (even assuming they are independent). See page 6 here: http://cs229.stanford.edu/notes/cs229-notes4.pdf
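To put a number on that: a rough sketch of the Hoeffding-plus-union-bound calculation (the parameter choices here are arbitrary, just to show the scaling).

```python
import math

def holdout_size(k, gamma, delta):
    """Smallest m such that, by Hoeffding's inequality plus a union bound
    over k models, every model's hold-out accuracy is within gamma of its
    true accuracy with probability at least 1 - delta:
        m >= log(2 * k / delta) / (2 * gamma ** 2)
    """
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

# Going from 100 candidate models to 1,000,000 only roughly doubles the
# hold-out set required for the same guarantee (logarithmic growth in k).
m_small = holdout_size(100, gamma=0.02, delta=0.05)
m_large = holdout_size(1_000_000, gamma=0.02, delta=0.05)
```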


They would need a model built on top of the submitted models to weight or select which specific trading signals to act on. A plausible model might be walkforward testing on out of sample data, or as simple as trailing performance in live trading, or both. Still, with a large set of models, you won't be able to get around multiple comparisons problem entirely.

That doesn't matter, though, because you only need a small amount of good signal to be profitable in the money management biz. The larger issue is whether the aggregate signal quality is high enough to pay for development and trading costs.


That's true only if stock market data is completely random and that there's no signal to predict on. That doesn't seem to be the case, considering hedge funds successfully use ML on historical data to capture alpha. You don't need a model to work forever to be successful.


It looks like you missed the point of my comment. I'm saying that Numerai won't be able to distinguish between zero alpha and positive alpha models, if all they're doing is running historical simulations on black boxes.


"In December 2015, we created the world’s first encrypted data science tournament for stock market predictions. Since then, Numerai data scientists have submitted 13,350,675,598 equity price predictions. The most accurate and original machine learning models from the world’s best data scientists are synthesized into a collective artificial intelligence that controls the capital in Numerai’s hedge fund."

What does this mean? What do they do?


Leaving out the marketing hyperbole, it means that Numerai's core software is an automated trading system derived from, and ostensibly trained on, the most accurate pricing models and insights put forward by participants (the data scientists).

Participants with the most accurate [1] insights are paid a sum in exchange for the insights being incorporated into the hedge fund's trading system, which in turn manages the fund's capital.

_________________________________

1. "Accurate" is a tricky word and I can't comment here on how much effort Numerai puts into ensuring that insights are accurate over meaningful time scales instead of being the result of e.g. overfitting. The insights could presumably be accurate on ranges of anywhere between "once" and "months."


The "predictions" are submitted only knowing a portion of the dataset, but the winners are chosen based on the entire dataset, which is supposed to prevent overfitting.


Winners are chosen based on a part of the test dataset which the user never gets feedback on/never is able to see the true values of during the competition round - so it's very difficult to overfit to that private set.


Monkeys throwing darts at the board.

This brings to mind the Buffett hedge fund wager, where he invested in a Vanguard S&P 500 tracking fund (VFIAX), a hedge fund actively managed an equal amount, and Mr. Buffett ended up winning handily.


It's easy to be dismissive of this, especially by working from the popular Warren-Buffett-index-funds-beat-hedge-funds story that makes the rounds. It's true that most hedge funds and active traders lose money (or at least, underperform the market). But as I am fond of pointing out, there are a non-negligible number of funds and traders who consistently and demonstrably earn significantly stronger returns than the market benchmark, even net of fees.

I am skeptical of Numerai for different reasons. If someone can consistently churn out profitable and novel equity pricing insights, it would be more rational for them to work for a more well established hedge fund in quantitative research. Perhaps more importantly, I'm skeptical of how they judge accuracy in their participant-volunteered insights.


Check their board of directors -- all very smart and established people.

A former tournament winner did well on both the public and private leaderboard. It is very difficult to do this by luck. He was also a student from Bangladesh who got the opportunity to play with hedge fund data with just the cost of an internet connection and zero risk for messing it up. Should he start working for a hedge fund now, without any finance experience? Could be a good bet. Numerai would still beat him, because they can aggregate all the top models into an ensemble. It is hard to beat 50 individuals, but near impossible to beat a team of 50 competitors. Compare the variance of a single decision tree with a Random Forest.


They are betting that there exist people who are capable of performing that work who cannot work at a hedge fund, for whatever reason.

Perhaps they live in the wrong location -- it's hard/impossible to get a quant job if you don't live in a major market center. Not everyone is 23 years old and unattached and prepared to move around the world for a job.

Perhaps they can do the predictions but they didn't go to a high end university and have no track record -- just try even getting an interview at a hedge fund without one of those.


Or people who do work at hedge funds, but want to moonlight on the side and get paid via bitcoin anonymously


Really smart monkeys. And monkeys which will (hopefully) be self-correcting to converge on the bulls-eye.

Most people don't realize that the "markets" are 49% random, 48% sentiment driven, and 3% fundamentals. If you approach the problem with that assumption held true, monkeys throwing darts isn't such a horrible mechanism for investing.

See: "Monkeys Are Better Stockpickers Than You'd Think: Why dart-throwing primates demolish S&P 500 returns and most active fund managers don't even come close." http://www.barrons.com/articles/SB50001424053111903927604579...


> Most people don't realize that the "markets" are 49% random [...] If you approach the problem with that assumption held true, monkeys throwing darts isn't such a horrible mechanism for investing.

Maybe I haven't understood that part of data science but I never got why throwing more unpredictability on an already unpredictable data source would somehow make it more predictable.


It doesn't make it "more predictable". You're simply achieving a better fit to the sample data distribution.

I guess another way of saying it is that your mess is starting to look more like their mess.

Anything that deals with the future is inherently non-predictable. Using chaos theory as a framework, we say it is unpredictable because we are unable (and will always be unable) to model the current system completely. There will always be data that was not captured hiding between the data that was captured. Follow the arrow-of-time far enough out into the future and that non-captured data will manifest itself in the captured data, thereby (usually) creating a deviation from the modeled future.

To get around this we use statistics and probability. We rely on the law of large numbers and regression to the mean. In other words, we hope that the future won't get too weird and will be similar enough to the past, within some confidence interval.

So, we're not really predicting a specific outcome, we're predicting that the outcome will be some point within some confidence interval.

The reason we can get better at this, is as we capture more data we can better guess the inputs and assumptions we use to create the model. We throw out stuff that didn't happen to be relevant. We discover stuff that we should have considered relevant. If we're lucky the model closely follows the physical laws of our reality and we can apply the frameworks so arduously worked out by chemists, physicists, biologists, etc. If we're not so lucky we're dealing with sentiment, conjecture, or any of the other human inputs of the financial markets, and we are forced to make up formulas that work until they fail spectacularly.
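The "predicting an interval, not a point" idea above is just the standard error shrinking with sample size. A minimal sketch (the normal data and the parameters are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)

def mean_ci_halfwidth(samples):
    """Half-width of an approximate 95% confidence interval for the mean."""
    return 1.96 * samples.std(ddof=1) / np.sqrt(len(samples))

data = rng.normal(loc=0.05, scale=1.0, size=10_000)
ci_100 = float(mean_ci_halfwidth(data[:100]))
ci_10000 = float(mean_ci_halfwidth(data))
# 100x more data narrows the interval by about 10x: the 1/sqrt(n) law.
```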


Investing in the S&P 500 is a strategy, which can be optimised. You can ask whether it is better to invest in the top 100 or the top 1000 instead of a top-500 index. Why even go after the top index? There are all sorts of inactive strategies available.

So in a sense, the "active" manager is just as much a monkey throwing darts as the "inactive", yet one of them is better than the other.

Why can't a third monkey be even better?


Buffett's wager is a trick that works on simple people who don't understand statistics.

When you take a group of hedge funds, they end up being a very good proxy for the market. At about 20 funds, the aggregate is almost indistinguishable from it.

So - a passive fund, buying the market, with lower fees, should outperform.


The Protégé basket of funds contains five hedge funds, not twenty.

The returns are far from indistinguishable from the S&P 500's. At the end of 2015, the S&P 500 portfolio had seen 3x the profits of the Protégé portfolio.


In his case, these were 5 funds of hedge funds, which underneath could represent many more than 20 funds.

You're also forgetting how much the compounding of the fees can cost, and you have two layers of them. That alone could easily put the S&P at least 3x ahead in returns after 10 years.


Aha, I had skimmed past the fund of fund parts.

In CFA Institute's estimate, the difference in fees only made up about half of the underperformance at the end of 2014, especially considering that 2008 was an easy win for hedge funds:

https://blogs.cfainstitute.org/investor/2015/02/12/betting-w...

Note that at the end of 2015, eight years had elapsed, not ten.


Fair enough, there is a bit more to the Buffett bet than I trivialized.

The short portions of the hedge funds' books will have been losing money, since we are in a bull market. The full length of a cycle should see this effect reverse (and indeed 2008 shows this).

So Buffett did have to pick a bull market in making this bet, but he still had better than 50/50 odds regardless.

You also have other costs specific to hedge funds over passive funds, such as higher brokerage fees from more frequent trades, and short interest. These create the same drag as the management fees, and I should have included them in the description as well.


> Mr. Buffett ended up winning handily.

The bet isn't over until December 31, 2017: http://longbets.org/362/


I don't understand how this works. What exactly is being predicted? The data isn't a time series. The outputs are only binary. There are only 27 features. I don't understand how this represents market data at all. In fact they probably destroyed most of the information trying to convert the data to this format.


The training set just has anonymized features. Data scientists generally would like to know the nature of the data they are working with. Does the site at any point give access to the labeled features?


You don't necessarily have to know the meaning of the features to build a successful model. That has been done on Kaggle a lot.


No, at least thus far you know nothing about what the features represent. They are just labeled feature1 through feature21.


This is fishy. The entire point of encrypted data is that one message cannot be distinguished from another without decryption (which would require the private key). In other words, the entire premise shouldn't work. This means that either bad encryption is being used (i.e. statistical information about the data is leaked) or the good results we see are just noise. Or, the whole thing is a scam to get funding: the best algorithms are planted, and the company just shuffles BTC between accounts it controls.


The data is homomorphically encrypted, meaning you can do operations (such as add and subtract) on the ciphertext and they will also be performed on the underlying data.


Yes, I realize that. The issue is that the result of any operation is also encrypted, which means that there should be no way to connect the target of the training data (encrypted or not) to the output of a function of encrypted data. Suppose the unencrypted data is (a,b,c) where a+b=c, and (x,y,z)=encrypt((a,b,c)). We have an addition function plus on encrypted data such that decrypt(plus(x,y))=a+b=c=decrypt(z), but it is not the case that plus(x,y)=z (at least, not if plus is computable in polynomial time, and assuming the encryption scheme is sound). If it were, we could statistically distinguish encrypt((a,b,c)) from encrypt((rand(),rand(),rand())) which would mean the encryption is not sound.
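A toy version of this argument (textbook Paillier with tiny primes: insecure by construction, purely illustrative) shows both halves at once. The ciphertext "plus" decrypts to the right sum, yet each encryption uses fresh randomness, so plus(x, y) almost never equals z and ciphertexts of equal plaintexts can't be linked:

```python
import math
import random

# Textbook Paillier with tiny primes -- insecure, illustration only.
p, q = 11, 13
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
# mu = (L(g^lam mod n^2))^-1 mod n, where L(x) = (x - 1) // n
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def enc(m):
    # Fresh randomness r on every call: equal plaintexts get
    # different ciphertexts.
    r = random.choice([r for r in range(1, n) if math.gcd(r, n) == 1])
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

a, b = 5, 7
x, y, z = enc(a), enc(b), enc(a + b)
s = (x * y) % n2          # "plus" on ciphertexts
assert dec(s) == a + b    # decrypts to the right sum...
# ...but s equals z only by coincidence of the random r values, so a
# function of ciphertexts cannot be matched to another ciphertext.
```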


They could just be normalizing every data point to be between 0 and 1 by dividing by the range. That's a homomorphic encryption; it passes your weird assumptions.

I don't know why you're harping on about sound encryption, the point of this is to keep the statistical information intact in the cipher, without giving away the underlying market data.


It's very weird to call data normalization "encryption". This is just a standard procedure done on most datasets. Encryption implies they've done extra processing to make sure you can't figure out what the variables mean.

I think it's an abuse of the word 'encryption'. That, or they really have done something weird to this dataset, which will probably make it useless for statistical algorithms. Even normalization destroys a lot of useful information.


They only need to go as far as to obfuscate the market data this was derived from, so they don't have to pay exchange licensing fees.

It doesn't have to be exponentially hard to break to qualify as 'encryption'. Multiplying by 2 could be considered homomorphic encryption.

You might think encryption means something else and that it's an abuse of the word but unlike the spy novella that you derive this impression from, these guys actually are ex-spies.


> They only need to go as far as to obfuscate the market data this was derived from, so they don't have to pay exchange licensing fees.

Thanks for explaining this, I was struggling to figure out the difference between this and https://www.quantopian.com/

That's actually pretty clever.


What I'm talking about is indistinguishability [0], [1] which shows up in definitions of homomorphic encryption (e.g. [2], [3]). If the data is only normalized, or even encrypted in an order-preserving way, it seems possible to figure out information about the underlying data (e.g. if the target is whether a symbol moves up or down, and if you can figure out what even one of the features refer to, there's enough information to turn your model on the data into predictions you can just trade on).

[0] https://en.wikipedia.org/wiki/Computational_indistinguishabi... [1] https://en.wikipedia.org/wiki/Ciphertext_indistinguishabilit... [2] http://cs.au.dk/~stm/local-cache/gentry-thesis.pdf [3] https://arxiv.org/ftp/arxiv/papers/1305/1305.5886.pdf


I read through the blog posts, and it seems like the encryption is order-preserving. It's designed to leak enough information to be useful in prediction, but not enough for users to trade on it independently.


If I had a winning strategy why would I feed it to Numerai instead of instavest.com?


You don't know if you have a winning strategy. You'd have to put up your own money and take a risk to find out. Plus you'd have to take care of data collection and feature engineering yourself.


Because you don't have the capital, the risk appetite, or access to the (expensive) dataset. You only have encrypted predictions, which are worthless to you.


Why should I think that being anonymous gives some advantage? I'd think it would be more of a disadvantage: why should I trust you? Also, how do I know these people are truly anonymous?


Anonymous in the sense that numerai does not ask for any personally identifying information. You could easily prove you are who you say you are on numerai.

Also, I think the big benefit from a data scientist working on this is you can test methods with generic features, submit the results and get paid if they are good, but not submit any part of the methodology to any third party.

If you can kill it on numerai then maybe you would consider buying data sources and applying your methods to your own data, although you still wouldn't know what the features are.

It's the polar opposite of open source.

The owners don't have to trust the data scientists. They evaluate their results against additional data.


That's what I thought - if you are smashing it on numerai, wouldn't you be better off raising some capital for your own fund to scale it?


But how do you trust the owners not to make fake accounts and only pay out to them?


Couldn't I make thousands of fake accounts and submit thousands of slightly different models? Then after a while I could artificially push one or two of my stocks higher in my models. If you represented a large enough % of the "data scientists" in this hedge fund, you could make it look like your stock is "definitely" worth investing in. After they invest in your company, you could take off to the hills.

Hell, you don't even need to be the owner of the company. This would be a great way to obtain large amounts of political sway/power. Like a company and want it to succeed for some agenda? Make it look better as an investment opportunity. Dislike a company? Well, that stock is going to do horribly next quarter. It's also a self-fulfilling prophecy.

For all of you finance people out there, is what I am saying impossible or stupid? I hope I'm wrong otherwise this is a horrible idea.


You don't even need to push one or two stocks higher. With thousands of fake accounts you could just pick a different stock for each account at random. Certainly a few of those stocks will turn in huge returns. Collect your Bitcoin reward from Numerai. Repeat.

Maybe they have some way of preventing people from gaming the system that way?


The thing is, with this data no one knows what each row represents, or even what the features are or what they're predicting. Each submission has 30,000 predictions, so you would need an unreasonably good random guess to get anywhere near the top of the leaderboard.
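A rough simulation backs this up. Everything here is hypothetical (uniform random targets, logloss scoring, predictions drawn uniformly), but it shows the concentration effect: over 30,000 rows, even the luckiest of many random submissions stays far behind the trivial predict-0.5 baseline.

```python
import math
import random

random.seed(1)
N_ROWS, N_ACCOUNTS = 30_000, 100
targets = [random.random() < 0.5 for _ in range(N_ROWS)]

def logloss(preds):
    # Mean binary cross-entropy against the hypothetical targets.
    return -sum(math.log(p if t else 1 - p)
                for t, p in zip(targets, preds)) / N_ROWS

baseline = logloss([0.5] * N_ROWS)  # exactly ln 2, about 0.693
best = min(
    logloss([random.uniform(0.01, 0.99) for _ in range(N_ROWS)])
    for _ in range(N_ACCOUNTS)
)
# Even the luckiest of 100 random submissions scores far worse than
# predicting 0.5 for every row: sheer volume of rows kills the scheme.
```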


I didn't think of that, great idea!


This won't work. As far as I can tell, you don't get to "pick stocks". You just submit predictions for unlabelled variables that could represent anything. It's no different from a standard Kaggle competition.


Then couldn't the operators do this? Just claim that they sent it out to N people and they all suggested the stock. How would you prove they didn't?


Well the hedge fund manager can invest in whatever stocks he wants already, so I don't see the point of this scheme.


The manager could take bribes to invest in a few failing companies and cover his ass while doing so by saying "anonymous statisticians" told him to.


It sounds a bit like pump and dump, which is considered fraud. However, instead of going against hundreds or thousands of individual investors, you're going against one (ostensibly) accredited investor, so you might be on firmer legal ground, as they should know better.

It's probably against their TOS though.


Aren't the "data scientists" anonymous? Log in over Tor, exercise some basic opsec, and live in a country with no extradition treaty. Do it as a service for billionaires who want to play the never-ending game of chess that is the economy.

For 100k per "tweak" you could make a lot of money and still give people rather significant influence over the world (especially if a LOT of people used this service).

> pump and dump, which is considered fraud

Yeah, it is definitely a pump and dump, but if it makes you rich and you're in a country with no extradition treaty, who cares? Remember the golden rule of the "elite class": laws are for poor people. As long as you get rich before anyone sees what you are doing, and you "can't" be found, you're all good.


You don't know anything about the dataset they give you, so this wouldn't really work. It's not like they tell you which stocks each row in the dataset corresponds to. On top of that, you know nothing about the features.


Shhh! Don't give my strategy away!


If they aren't managing correlation between different models, there is an even bigger issue...


They are: your rank on the leaderboard determines the importance of your model in their 'meta-model', so submitting the same model as someone else would just net you 0.


In my limited experience in London at the trading level, they do not want collective intelligence at all: they do the bulk with algorithms at very low latencies, and they want outliers (unpredictable singletons are less reproducible than the average of millions of heads) to run high-risk, high-reward books.


Is this an execution of "algorithm(model) as a service"?


I suppose this could be, uh, useful for insider trading.



How ominous.

In today's news anonymous scientists form human genetics laboratory to improve the human species.


Hope they will have more luck than LTCM


Awesome!


This is darn cool


Things I do not understand:

* How does logloss relate to earnings?

* They only receive predictions based on old data (by definition) and not the models, so how do they use the predictions to make trading decisions?

* How can I invest in this hedge fund?
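On the first question: logloss itself doesn't map to earnings; it just measures how well calibrated your predicted probabilities are (I'm assuming the leaderboard scores standard binary cross-entropy). A minimal sketch:

```python
import math

def logloss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Predicting 0.5 everywhere scores ln 2 ~ 0.693; confident and
# correct scores lower, confident and wrong much higher.
print(round(logloss([1, 0, 1], [0.9, 0.2, 0.8]), 4))  # 0.1839
```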


tl;dr: a novel encryption method allows sharing of data sets and better machine learning models. Aggregation of models into a portfolio, offered as a hedge fund.

Found this company interesting, read a bunch of blogs from them and tweetstormed: https://twitter.com/Royal_Arse/status/787725301908242432


Someone got financing for crowdsourced data mining of a commercial data set. Very clever.

It seems that only a very few nerds take the theoretical impossibility of predicting the future seriously, and we are missing the opportunity to get funding for some crappy models from the greater fools.



