> Most of the metadata activity is contained within a single shard:
>
> - File creation, same-directory renames, and deletion.
> - Listing directory contents.
> - Getting attributes of files or directories.
I guess this is a trade-off between a file system and an object store? In S3, for example, ListObjects() is a heavy hitter, and there can potentially be billions of objects under any prefix, so scanning on only a single instance wouldn't be sufficient.
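To make the sharding point concrete, here is a minimal sketch of a directory-to-shard mapping; the modulo scheme, the `numShards` value and the `shardForDirectory` name are made up for illustration and are not necessarily how TernFS actually assigns shards:

```go
package main

import "fmt"

// Hypothetical shard count; TernFS's real layout may differ.
const numShards = 256

// shardForDirectory maps a directory ID to the shard that owns its metadata.
// Because the mapping depends only on the directory, file creation,
// same-directory renames, deletion, listing and getattr for entries in that
// directory all talk to a single shard.
func shardForDirectory(dirID uint64) uint64 {
	return dirID % numShards
}

func main() {
	dirID := uint64(0xdeadbeef)
	fmt.Printf("directory %#x -> shard %d\n", dirID, shardForDirectory(dirID))
	// Cross-directory operations (e.g. a rename between two directories)
	// would touch two shards and need extra coordination.
}
```

The point is that as long as an operation only touches entries of one directory, the lookup pins it to one metadata server, which is how listing can stay cheap without a global index.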
> The firm started out with a couple of desktops and an NFS server, and 10 years later ended up with tens of thousands of high-end GPUs, hundreds of thousands of CPUs, and hundreds of petabytes of storage.
So many resources for producing nothing of real value. What a waste.
Great project though, appreciate open sourcing it.
Your comment contradicts itself. They produced this project at least.
If price-action trading is horoscopes for adults, they're a modern-day oracle.
In theory, the value they provide is that at any time you can go to an exchange and say "I want to buy x" or "I want to sell y" and someone will buy it from you or sell it to you... at a price that's likely to be the accurate price.
At the extreme, if nobody was providing this service, investors (e.g. pension funds) wouldn't be confident that they can buy/sell their assets as needed, in size and at the right price... and because of that, in aggregate, stocks would be worth less and companies wouldn't be able to raise as much capital.
The theoretical model is:
- You want efficient primary markets that allow companies to raise a lot of capital at the best possible prices.
- To enable efficient primary markets, investors want efficient secondary markets (so they don't need to buy and hold forever, but feel they can sell).
- To enable efficient secondary markets, you need many folks that are in the business XTX is in.

... it just so happens that XTX is quite good at it, and so they do a lot of this work.
> In theory
> At the extreme
> The theoretical model
These qualifiers would seem to belie the whole argument. Surely the volume of HFT arbitrage is some large multiple of what would be necessary to provide commercial liquidity with an acceptable spread?
Does the HFT volume actually matter? Is it a real problem that the HFT volume exceeds the theoretical minimum amount of volume needed to maintain liquid markets?
Over 500PB of data, wow. Would love to know how and why "statistical models that produce price forecasts for over 50,000 financial instruments worldwide" require that much storage.
If you keep all order book changes for a large number of financial instruments, the volume adds up quickly.
Me too. It is really hard for me to understand what XTX is actually doing. Trading? VC? AI/ML?
Have you seen their portfolio?
PS: Company seems legit. Impressive growth. But I still don't understand what they are doing. Provide "electronic liquidity". Well....
Computing correlations between 50,000 financial instruments (X^T X) and doing linear regression ;).
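For what it's worth, the joke describes a real computation: if each column of X is centered, X^T X / (n-1) is the sample covariance matrix (the correlation matrix if the columns are also scaled to unit variance), and solving the normal equations (X^T X) β = X^T y gives a least-squares fit. A toy sketch of the Gram matrix X^T X with made-up data, nothing to do with XTX's actual models:

```go
package main

import "fmt"

// gram returns Xᵀ·X for an n×p matrix X stored row-major.
func gram(X [][]float64) [][]float64 {
	n, p := len(X), len(X[0])
	G := make([][]float64, p)
	for i := range G {
		G[i] = make([]float64, p)
		for j := 0; j < p; j++ {
			for k := 0; k < n; k++ {
				G[i][j] += X[k][i] * X[k][j]
			}
		}
	}
	return G
}

func main() {
	// Toy 4×2 matrix with zero-mean columns (illustration only).
	X := [][]float64{{1, -1}, {-1, 1}, {1, 1}, {-1, -1}}
	G := gram(X) // divide by n-1 to get the sample covariance matrix
	fmt.Println(G)
}
```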
Cool project and kudos for open sourcing it. Noteworthy limitation:
> TernFS should not be used for tiny files — our median file size is 2MB.
I have worked on exabyte-scale storage engines. There is a good engineering reason for this type of limitation.
If you have a 1 KiB average file size then you have quadrillions of metadata objects to search and manage quickly and at fine granularity. The kinds of operations and coordination you need to do with metadata are difficult to achieve reliably when the metadata structure itself is many PB in size. There are interesting edge cases that show up when you have to do deep paging of this metadata off of storage. Making this not slow requires unorthodox design choices that introduce a lot of complexity. Almost none of the metadata fits in memory, including many parts of conventional architectures we assume will always fit in memory.
A mere trillion objects is right around the limit of where the allocators, metadata, etc. can be made to scale with heroic effort before conventional architectures break down and things start to become deeply weird on the software design side. Storage engines need to be reliable, so staying away from that design frontier makes a lot of sense if you can.
It is possible to break this barrier but it introduces myriad interesting design and computer science problems for which there is little literature.
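To put rough numbers on the 1 KiB scenario above (every per-object figure here is an assumption for illustration, not a measurement from any real system):

```go
package main

import "fmt"

func main() {
	const (
		totalData      = 1e18 // 1 EB of file data (assumed, decimal units)
		avgFileSize    = 1024 // 1 KiB average file size, as in the comment above
		objectsPerFile = 3    // e.g. directory entry + inode + extent map (assumed)
		bytesPerObject = 64   // compact on-disk metadata record (assumed)
	)
	files := totalData / avgFileSize
	objects := files * objectsPerFile
	metaBytes := objects * bytesPerObject
	fmt.Printf("files:            %.1e\n", files)            // ~1e15, a quadrillion
	fmt.Printf("metadata objects: %.1e\n", objects)          // several quadrillion
	fmt.Printf("metadata size:    %.0f PB\n", metaBytes/1e15) // well into the PB range
}
```

Even with these deliberately conservative per-object sizes, the metadata alone lands in the hundreds of PB, which is the regime the parent is describing.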
This sounds like a fascinating niche piece of technical expertise I would love to hear more about.
What are the biggest challenges in scaling metadata from a trillion to a quadrillion objects?
It depends on the intended workload, but there are a few common design problems. Keep in mind that you can't just deal with the average case; you have to design for the worst possible cases of extremely skewed or pathologically biased distributions. A lot of the design work is proving worst-case resource bounds under various scenarios and then proving the worst-case behavior of the designs intended to mitigate them.
An obvious one is bulk deletion, which is rarely fast at any scale. It may involve trillions of updates to search-indexing structures, which in naive implementations can look like pointer-chasing across disk. Releasing storage back to the allocators has no locality, because the allocations to be released stream off that storage in semi-random order. It is unhelpfully resistant to most scheduling-based locality optimization techniques. You also want to parallelize this as much as possible, and some of these allocators will be global-ish.
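One generic mitigation for the lack of locality (a sketch of a common batching technique, not necessarily what the parent has in mind, with a made-up `extent` type) is to buffer pending frees and sort them by device and offset before handing them to the allocators, so allocator updates become mostly-sequential sweeps rather than random pointer-chasing:

```go
package main

import (
	"fmt"
	"sort"
)

// extent identifies a region of storage to hand back to an allocator.
// The fields and the batching scheme are illustrative, not a real API.
type extent struct {
	device uint32
	offset uint64
	length uint64
}

// releaseBatch sorts pending frees by (device, offset) so that updates to
// per-device allocator structures are applied in mostly-sequential order
// instead of in semi-random arrival order.
func releaseBatch(pending []extent, free func(extent)) {
	sort.Slice(pending, func(i, j int) bool {
		if pending[i].device != pending[j].device {
			return pending[i].device < pending[j].device
		}
		return pending[i].offset < pending[j].offset
	})
	for _, e := range pending {
		free(e)
	}
}

func main() {
	pending := []extent{
		{device: 2, offset: 4096, length: 1 << 20},
		{device: 1, offset: 0, length: 1 << 20},
		{device: 1, offset: 1 << 30, length: 1 << 20},
	}
	releaseBatch(pending, func(e extent) {
		fmt.Printf("free dev=%d off=%d len=%d\n", e.device, e.offset, e.length)
	})
}
```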
The most interesting challenge to me is meta-scheduling. Cache replacement algorithms usually don't provide I/O locality at this scale so standard mitigations for cache-resistant workloads like dynamic schedule rewriting and latency-hiding are used instead. Schedulers are in the center of the hot path so you really want these to be memory-resident and fast. Their state size is loosely correlated with the number of objects, so in some extreme cases these can easily exceed available memory on large servers. You can address this by designing a "meta-scheduler" that adaptively optimizes the scheduling of scheduler state, so that the right bits are memory-resident at the right time so that the scheduler can optimally schedule its workload. It is difficult to overstate how much of a breaking change to conventional architecture this turns out to be. These add some value even if the state is memory resident but they greatly increase design complexity and make tail latencies more difficult to manage.
A more basic challenge is that you start dealing with numbers that may not be representable in 64 bits. Similarly, many popular probabilistic algorithms may not offer what you need when the number of entities is this large.
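As one concrete instance of the 64-bit point: a signed 64-bit byte counter tops out at about 9.2 EB and an unsigned one at about 18.4 EB, so aggregate byte or operation counters in a 10+ EB system already need wider arithmetic. A minimal 128-bit accumulator sketch using Go's math/bits (illustrative only):

```go
package main

import (
	"fmt"
	"math/bits"
)

// counter128 is a minimal 128-bit accumulator, stored as hi:lo.
type counter128 struct {
	hi, lo uint64
}

// add accumulates n, carrying into the high word on overflow of the low word.
func (c *counter128) add(n uint64) {
	var carry uint64
	c.lo, carry = bits.Add64(c.lo, n, 0)
	c.hi, _ = bits.Add64(c.hi, 0, carry)
}

func main() {
	var total counter128
	// Account for 20 EB in 1 EB increments: past ~18.4 EB a plain uint64
	// would wrap around, but the 128-bit counter keeps going.
	const oneEB = 1_000_000_000_000_000_000
	for i := 0; i < 20; i++ {
		total.add(oneEB)
	}
	fmt.Printf("hi=%d lo=%d\n", total.hi, total.lo)
}
```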
I aggressively skirted these issues for a long time before relenting. I deal more with database storage engines than filesystems, but to a first approximation "files" and "shards" are equivalent for these purposes.
Yeah, that was the first thing I checked as well. Being suited for small / tiny files is a great property of the SeaweedFS system.
Shameless plug: https://github.com/Barre/ZeroFS
I initially developed it for a use case where I needed to store billions of tiny files, and it only requires a single S3 bucket as infrastructure.
What happens if you put a tiny file on it then? Bad perf, possible file corruption, ... ?
It's just not optimised for tiny files. It would absolutely work: you could use it to store 100 billion 1 kB files with zero problems (and that is 100 terabytes of data, probably on flash, so no joke). However, you can't use it to store 1 exabyte of 1-kilobyte files (at least not yet).
Probably wasted space and lower performance.
...which places it firmly in the "just like every other so-called exascale file system" category. We already had GPFS...
That was a good read. Compliments to the chefs.
It'd be helpful to have a couple of usage examples that illustrate common operations, like creating a file or finding and reading one, right after the high-level overview section. Just to get an idea what happens at the service level in these cases.
Yes, that would be very useful; we just didn't get to it, and we didn't want perfect to be the enemy of good, since otherwise we would never have open sourced :).
But if we have the time it would definitely be a good addition to the docs.
Thanks a lot. Regarding your company, it is really hard for me to understand, what XTX is actually doing. Trading? VC? AI/ML?
Trading using ML.
How does TernFS compare to CephFS, and why not CephFS, since it is also tested in the multi-petabyte range?
(Disclaimer: I'm one of the authors of TernFS and while we evaluated Ceph I am not intimately familiar with it)
Main factors:
* Ceph stores both metadata and file contents using the same object store (RADOS). TernFS uses a specialized database for metadata which takes advantage of various properties of our datasets (immutable files, few moves between directories, etc.).
* While Ceph is capable of storing PBs, we currently store ~600 PB on a single TernFS deployment. Last time we checked, this would be an order of magnitude more than even very large Ceph deployments.
* More generally, we wanted a system that we knew we could easily adapt to our needs and more importantly quickly fix when something went wrong, and we estimated that building out something new rather than adapting Ceph (or some other open source solution) would be less costly overall.
There are definitely insanely large Ceph deployments; I have seen hundreds of PBs in production myself. Also, your use case sounds like something that should be quite manageable for Ceph due to the limited metadata activity, which tends to be the main pain point with CephFS.
I'm not fully up to date since we looked into this a few years ago, but at the time the CERN deployments of Ceph were cited as particularly large examples, and they topped out at ~30 PB.
Also note that when I say "single deployment" I mean that the full storage capacity is not subdivided in any way (i.e. there are no "zones" or "realms" or similar concepts). We wanted this to be the case after experiencing situations where we had significant overhead due to having to rebalance different storage buckets (albeit with a different piece of software, not Ceph).
If there are EB-scale Ceph deployments I'd love to hear more about them.
There are much larger Ceph clusters, but they are enterprise owned and not really publicly talked about. Sadly I can’t share what I personally worked on.
The question is whether there are single Ceph deployments that large. I believe Hetzner uses Ceph for its cloud offering, and that's probably very large, but I'd imagine that no single tenant is storing hundreds of PBs in it, so it's very easy to shard across many Ceph instances. In our use case we have a single tenant which stores 100s of PBs (and soon EBs).
Digital Ocean is also using Ceph[1]. I think these cloud providers could easily have clusters holding 100s of PBs at their size, but it's not public information.
Even smaller companies (< 500 employees) in today's age of big data collection often have more than 1 PB of total data in their enterprise pool, and hosters like Digital Ocean host thousands of these companies.
I do think that Ceph will hit performance issues at that size, and going into the EB range will likely require code changes.
My best guess would be that Hetzner, Digital Ocean and similar maintain their own internal forks of Ceph with customizations that tightly address their particular needs.
[1]: https://www.digitalocean.com/blog/why-we-chose-ceph-to-build...
Ceph is more of a "here's a raw block of data, do whatever the hell you want with it" system, not really geared towards immutable data.
Well, sure, but you would have to enforce immutability on the client side.
It's more that it has all the machinery to allow mutability, which adds a lot of overhead when it's used as an immutable system.
The last point is an extremely important advantage that is often overlooked and denigrated. Having a complex system that you know inside out, because you built it from scratch, pays off in gold in the long term.
Any compression at the filesystem level?
No, we have our custom compressor as well but it's outside the filesystem.
CephFS implements a (fully?) POSIX filesystem, while TernFS seems to trade away permissions and mutability for further scale.
Their docs mention they have a custom kernel module, which I suppose is (today) shipped out of tree. Ceph is in-tree and also has a FUSE implementation.
The docs mention that TernFS also has its own S3 gateway, while RADOSGW is fully separate from CephFS.
The seamless realtime intercontinental replication is a key feature for us, maybe the most important single feature, and AFAIK you can’t do that with Ceph (even if Ceph could scale to our original 10 exabyte target in one instance).
Wow, great project.
GPLv2-or-later, in case you were wondering https://github.com/XTXMarkets/ternfs/blob/7a4e466ac655117d24...
Licensing
TernFS is Free Software. The default license for TernFS is GPL-2.0-or-later.
The protocol definitions (go/msgs/), protocol generator (go/bincodegen/) and client library (go/client/, go/core/) are licensed under Apache-2.0 with the LLVM-exception. This license combination is both permissive (similar to MIT or BSD licenses) as well as compatible with all GPL licenses. We have done this to allow people to build their own proprietary client libraries while ensuring we can also freely incorporate them into the GPL v2 licensed Linux kernel.
Thanks for sharing.
seems like a colossusly nice design.
Sadly lacking a nice big table to lay out the metadata on.
could be a tectonic shift in the open source filesystem landscape?
I see what you did there.
Ha ha, I forecast, SPY goes up, and I’ve already made more money than XTX or any of its clients…
Look, I like technology as much as anyone. Improbable spent $500 million on product development, and its most popular product is its grpc-web client. It didn't release any of its exotic technology. You could also go and spend that money on making $500m worth of games without any exotic technology, and also make them open source.
This sounds like it would be a good underpinning for a decentralized blockchain file storage system with its focus on immutability and redundancy.
But a blockchain is already immutable. It becomes decentralised and redundant if you have multiple nodes sharing blocks.
No need for an underpinning, it is the underpinning.
And yet no one needed a blockchain to implement this.