You are right regarding many small files. Interestingly, reading from many small files didn't turn out to be as much of a problem with CephFS as keeping a single large file open while reading from and writing to it from thousands of processes (the legacy authorized_keys file).
Clearly CephFS has weak spots, but from what I've seen those are spots that we can work out, rough edges here and there. The good thing is that we are much more aware of these edges.
We are already working on what the next step will be to smooth out these weaknesses so we are not impacted again. And of course to ship this to all our customers, whether they run on CephFS, NFS appliances, local disks, or whatever makes sense for them.
We started using Ceph because we wanted to be able to grow our storage and compute independently. While it worked well for us, we ended up with much higher latencies as a result. So we developed FSCache support.
Even better, if your data is inherently shardable (or has some kind of locational affinity) you can end up always serving data from the local cache with a Ceph backend, except when a server goes down or for the occasional request. I'm guessing in your case it is (repo / account).
On your API machines serving the git content out of the DFS you can set up a local SSD drive for read-only caching. Depending on your workload you can end up significantly reducing the IOPS on the OSDs and also lowering network bandwidth.
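FSCache does this transparently in the kernel once the CephFS mount has caching enabled, but the access pattern it exploits is easy to picture. Here is a minimal userspace sketch of the same read-through idea, with made-up paths (/mnt/cephfs for the DFS mount, /var/cache/git-ssd for the local SSD); FSCache itself works page by page below the filesystem, this is just to illustrate why hot data stops hitting the OSDs:

    import os
    import shutil

    # Hypothetical paths: the shared CephFS mount and a local SSD cache directory.
    DFS_ROOT = "/mnt/cephfs"
    SSD_CACHE_ROOT = "/var/cache/git-ssd"

    def read_object(rel_path: str) -> bytes:
        """Read-through cache: serve from the local SSD when possible,
        otherwise read once from the DFS and keep a local copy."""
        cached = os.path.join(SSD_CACHE_ROOT, rel_path)
        if not os.path.exists(cached):
            # Cache miss: one read from Ceph, then every later read is local.
            os.makedirs(os.path.dirname(cached), exist_ok=True)
            shutil.copyfile(os.path.join(DFS_ROOT, rel_path), cached)
        with open(cached, "rb") as f:
            return f.read()  # local hit: no OSD IOPS, no extra network traffic

    # With requests routed by repo/account affinity, the same API machine keeps
    # seeing the same repos, so almost every read after the first one is local.

The part you really don't want to reimplement by hand is coherency (dropping cached data when the file changes on the backend), which is what the kernel FSCache integration handles for you in a read-mostly workload like this.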
With the network / IOPS savings we've decided to run our CephFS backed by an erasure-coded pool. Now we have a lower cost of storage (1.7x vs 3x replication) and better reliability, because with our EC profile we can lose 5 chunks before data loss instead of 2 like before. That works because more than 90% of requests are handled with local data and there's a long tail of old data that is rarely accessed.
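For context on those numbers: with an erasure-coded pool of k data chunks plus m coding chunks, the raw-storage overhead is (k+m)/k and you can lose up to m chunks before losing data. The exact profile isn't stated here, but something like k=7, m=5 lines up with the figures above; a quick sanity check (hypothetical profile, not necessarily the one they run):

    def ec_overhead(k: int, m: int) -> float:
        """Raw storage used per byte of data in a k+m erasure-coded pool."""
        return (k + m) / k

    # Hypothetical EC profile consistent with the comment: 7 data + 5 coding chunks.
    print(ec_overhead(7, 5))      # ~1.71x raw storage, survives losing 5 chunks
    print(ec_overhead(7, 5) / 3)  # ~0.57 -> a bit over half the raw storage of
                                  # 3x replication, which only survives 2 losses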
If you're going to give it a try, make sure you're using a recentish kernel such as a late 3.x series (or 4+). That has all the CephFS FSCache and upstream FSCache kinks worked out.
If you're using a relatively recent kernel such as a late 3.x series or 4+ (as in Ubuntu 16.04).
We are running a recent kernel, as in Ubuntu 16.04.
The reason I'm not framing the caching so much at the CephFS level is that we are shipping a product, and I don't think all our customers will be running CephFS on their infra. Therefore we will need to optimize for that use case too, and not only focus on what we do at GitLab.com.
Thanks for sharing! Will surely take a look at this.