This post just generated a tremendous amount of discussion in my company. Not just for the SQLite wrapper, but also about the Berkeley XML DB. Anyone have experience using Berkeley for medium-to-large scale data stores (1-100GB?)? How's the concurrency? What's the licensing model? Is there a way we can use this without having to give away our source?
Project Voldemort uses BerkeleyDB Java Edition as one of its storage engines (the most frequently used one). The Java Edition differs from the C version in being log-structured (very fast writes, at the cost of somewhat slower reads). In one deployment, we're able to store ~130 GB of data per node (each node having 32 GB of memory, 18 GB of it going to the JVM heap, 10 GB of that heap going to the BDB cache), while keeping average read latencies (which include the overhead of decoding a request packet) under 1 ms (with some outliers due to large value sizes, etc.).
I should add, this is also while using a single BDB environment for all partitions that belong to a node -- not the most efficient way to store this data (BDB JE 4.0 does support sharing a cache between multiple environments, so we might move to a separate-environment-per-partition model). We've also built the distribution layer ourselves, rather than use the built-in HA features, but if I recall correctly, the first version of Amazon SimpleDB (prior to the Erlang re-write) was built with BDB-JE HA (not to be confused with Amazon Dynamo, which is built on regular BDB-JE and provides its own distribution mechanisms).
Overall, we've found that BDB-JE performs best when there's a 1:5 ratio between memory and data set size. With JE (as opposed to classic BDB) the issues are: Java implies a certain vertical scalability limit (due to GC pauses with very large heap sizes, though the GC does get better with every release), and a log-structured B+Tree is great for writes but might mean more seeks for reads (especially if you use the range query functionality).
Licensing might be an issue if you're shipping a commercial product to a customer site. It doesn't matter if BDB is used on your own servers or in an open source project (and unlike the GPL, the BerkeleyDB license is compatible with Apache-licensed projects). The licensing also doesn't apply to the Perl/Ruby/Python/other dynamic language bindings for BerkeleyDB: you're free to package those (and then ask the customer to install BerkeleyDB themselves, much like you would with MySQL).
In short, it's a great solution if you fit its use case. I should also add that interacting with the Sleepycat people online (e.g., on the forums) makes you forget they're part of Oracle.
I don't have experience with the Berkeley XML DB. I have used the BerkeleyDB 'Java Edition' (a different codebase and on-disk format) for multi-GB spill-to-disk queues and maps/sets.
The BerkeleyDB license is roughly: free to include in open-source projects, pay if you want to redistribute it as part of a closed-source product. So, as with the GPL, using it strictly inside your company might not be considered 'distribution'. But also, if you are selling software licenses, their bundling rates might not be bad compared to your unit price. Of course, do your own license analysis before business use.
What would be the licensing model if the product is open source, but is NOT free (i.e., customers are charged for an open-source product in which BDB is used)?
I'm not sure I get the point of this. Two of SQLite's strong points (among many others) are:
(a) Short dependency list
(b) Platform independent filesystem storage
...doesn't this negate both of those for what seems like little gain?
BerkeleyDB doesn't have many dependencies, uses only plain files as storage, and is itself highly portable open source (with a quasi-copyleft condition).
So you don't give up much to get the claimed benefits of this combination -- only the ability to use public-domain SQLite in proprietary distributed software.
I've used Berkeley for hundreds of concurrent queries -- it is quite good in those situations; SQLite is not -- it's just not designed for them.
Combining the two -- the easy interface that SQLite provides with the concurrent performance that BDB has -- definitely provides value.
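For context, as I understand the announcement, the BDB SQL interface is a drop-in replacement for SQLite's C API, so ordinary SQLite code runs unchanged against the BDB-backed build. A minimal sketch using the standard sqlite3 calls (the file and table names are just illustrative):

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void) {
        sqlite3 *db;
        char *err = NULL;

        /* With the BDB SQL build you link against its libsqlite3
           replacement; this same code then runs on top of BDB. */
        if (sqlite3_open("test.db", &db) != SQLITE_OK) {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)",
                     NULL, NULL, &err);
        sqlite3_exec(db, "INSERT OR REPLACE INTO kv VALUES ('hello', 'world')",
                     NULL, NULL, &err);

        sqlite3_close(db);
        return 0;
    }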
I do. I wrote a tool to help me understand what I could do on various storage tiers while running as safely as possible. Here's one of my results from Linode:
This is an awesome shootout! Be careful with BDB: its default tuning is geared more towards an embedded environment than a modern web app. That's not to say it can't go fast!
I took a look at your BDB demo and saw that you weren't creating the DBs in an environment, which meant every operation went straight to disk, and also that you weren't running with write-ahead logs (which give you durability). I didn't look too closely to see whether the other databases had caching enabled or not (or what their defaults were).
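For reference, here's roughly what that setup looks like with the BDB C API -- a minimal sketch, with error checking omitted and the cache size and environment path purely illustrative (the directory must already exist):

    #include <db.h>

    int main(void) {
        DB_ENV *env;
        DB *db;

        /* Create an environment with a page cache (memory pool),
           plus transactions, write-ahead logging, and locking. */
        db_env_create(&env, 0);
        env->set_cachesize(env, 0, 64 * 1024 * 1024, 1);  /* 64 MB cache */
        env->open(env, "/tmp/bdb-env",
                  DB_CREATE | DB_INIT_MPOOL | DB_INIT_TXN |
                  DB_INIT_LOG | DB_INIT_LOCK, 0);

        /* Opening the DB inside the environment means writes go through
           the cache and the log instead of straight to disk. */
        db_create(&db, env, 0);
        db->open(db, NULL, "test.db", NULL, DB_BTREE,
                 DB_CREATE | DB_AUTO_COMMIT, 0);

        /* ... benchmark operations would go here ... */

        db->close(db, 0);
        env->close(env, 0);
        return 0;
    }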
On my MacBook Air, configuring with caching (and no logging) yielded this from your benchmark:
Air:kvtest jamie$ ./bdb-test
Running test ``test test'' PASS
Running test ``write test'' Ran 284669 operations in 5s (56933 ops/s)
PASS
Of course, page caching makes all the difference :-)
I have nothing but good things to say about SQLite. Small. No dependencies. Lightning fast. Public domain. I use it exclusively for all things databases where the customer for whatever reason does not demand one of the other players. I'm glad to see a major actor in the DB industry picking up on it.
"As of version 3.6.23 (all statistics in the report are against that release of SQLite), the SQLite library consists of approximately 67.2 KSLOC of C code. (KSLOC means thousands of "Source Lines Of Code" or, in other words, lines of code excluding blank lines and comments.) By comparison, the project has 679 times as much test code and test scripts - 45678.3 KSLOC."