Noms – A versioned, forkable, syncable database (github.com/attic-labs)
399 points by jaytaylor on Oct 16, 2016 | 97 comments


Open-source tech like this is nice. This could be used to build a distributed document editing application, for example. Or any application where you want to spin off multiple instances and reconcile the data later.

EDIT: At least one team is investigating layering Noms on top of IPFS [1]. I guess the idea would be to construct something similar to GitTorrent [2]; layering various version-controlled datastores on various p2p protocols could result in several viable architectures.

[1] https://github.com/attic-labs/noms/issues/2123#issuecomment-...

[2] https://github.com/cjb/GitTorrent



Wow, I missed that. That violates the URI spec, assuming the author(s) were intending to use a URI.

https://tools.ietf.org/html/rfc3986 (see "3.3. Path").


Right, it intentionally violates the URI spec by appending something to the end of it. The data structure they're storing has a natural pair-of-structures at its top level:

    import Data.Text (Text)

    -- Path and URL were left undefined in the original sketch; Text stands in for both here.
    type Path = Text
    type URL  = Text

    data Database = InMemory | LevelDB Path | ViaHTTP URL

    newtype DataSet = DS Text
    newtype Hash    = Hash Text
    data Accessor   = AccessDS DataSet | AccessValue (Either DataSet Hash) Path

    type DBAccessor = (Database, Accessor)
They elected to encode a DBAccessor like the one above as a single string that you split on "::", with the database URL stored on the left-hand side in the ViaHTTP case.
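
For concreteness, here is a rough Go sketch of that encoding (Noms itself is written in Go). The exact spec syntax is my assumption, not the project's parser; real Noms specs handle more cases than a naive split on "::".

    package main

    import (
        "fmt"
        "strings"
    )

    // spec is a hypothetical stand-in for a Noms "database::dataset" string.
    type spec struct {
        database string // e.g. an HTTP URL, a LevelDB directory, or "mem"
        dataset  string // the dataset (or value path) to the right of "::"
    }

    func parseSpec(s string) (spec, error) {
        // Split on the last "::" so anything odd inside the database part
        // doesn't confuse the parse.
        i := strings.LastIndex(s, "::")
        if i < 0 {
            return spec{}, fmt.Errorf("spec %q has no \"::\" separator", s)
        }
        return spec{database: s[:i], dataset: s[i+2:]}, nil
    }

    func main() {
        p, err := parseSpec("http://localhost:8000::my-dataset")
        if err != nil {
            panic(err)
        }
        fmt.Printf("database=%q dataset=%q\n", p.database, p.dataset)
    }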


The string wasn't originally intended to be a URI, but I've been subsequently convinced that it would be useful for it to be one. We'll change it eventually.



Adam Leventhal (DTrace, OpenZFS) took a look at building a FUSE filesystem on Noms using Go.

http://dtrace.org/blogs/ahl/2016/08/09/nomsfs/

https://news.ycombinator.com/item?id=12255450


Sounds like what you'd want for a self-hosted Dropbox clone. I wonder what Syncthing, for instance, uses for reconciling differences on different clients.


Syncthing doesn't reconcile differences. Instead, a copy of the file is created.

  Syncthing does recognize conflicts. When a file has been modified on two devices simultaneously, one of the files will be renamed to <filename>.sync-conflict-<date>-<time>.<ext>. The device which has the larger value of the first 63 bits for his device ID will have his file marked as the conflicting file. Note that we only create sync-conflict files when the actual content differs.
https://docs.syncthing.net/users/faq.html


Hi Hacker News. I'm one of the founders of the Noms project and Attic Labs, the company behind it. Happy to answer any questions.

In the meantime, as long as I've got your attention, here are a few new things we've been working on since the last time Noms was discussed here in August:

- A prototype query language, and a demo of how to create indexes in Noms: https://www.youtube.com/watch?v=fv6_T5yaWns

- Support for merging concurrent (and potentially conflicting) changes: https://www.youtube.com/watch?v=--7dgoJBdjU


Does/will Noms have proper support for multiuser sync? That’s an issue with CouchDB.

I mean: instead of syncing the whole database, only syncing the parts that a user has access to, and being able to define those access rules. The standard use case for consumer apps.


Noms syncs at the level of a "value", not the entire database. Since a Noms database is a tree of values, you can sync the entire db, but you don't have to. You can also sync a single list.

So yes, in principle, we can definitely do this. In practice we are missing some conveniences that would make it a really easy drop-in feature.
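
To make "sync a single value" concrete, here is a hedged, Noms-agnostic Go sketch of the general idea (the chunkStore interface and pull function are hypothetical, not the actual Noms API): in a content-addressed store you can copy just the chunks reachable from one value's hash, and skip any subtree the destination already has.

    package main

    import "fmt"

    type hash string

    // chunk is a content-addressed blob plus the hashes of the chunks it references.
    type chunk struct {
        h    hash
        data []byte
        refs []hash
    }

    // chunkStore is a hypothetical stand-in for an underlying key/value store.
    type chunkStore interface {
        Has(h hash) bool
        Get(h hash) chunk
        Put(c chunk)
    }

    // pull copies everything reachable from root that dst is missing. Because
    // chunks are immutable and content-addressed, anything dst already holds
    // (including whole subtrees) can be skipped with a single Has check.
    func pull(src, dst chunkStore, root hash) {
        if dst.Has(root) {
            return // entire subtree already synced
        }
        c := src.Get(root)
        for _, ref := range c.refs {
            pull(src, dst, ref)
        }
        dst.Put(c)
    }

    // memStore is a toy in-memory chunkStore for the example below.
    type memStore map[hash]chunk

    func (m memStore) Has(h hash) bool  { _, ok := m[h]; return ok }
    func (m memStore) Get(h hash) chunk { return m[h] }
    func (m memStore) Put(c chunk)      { m[c.h] = c }

    func main() {
        src := memStore{
            "leaf": {h: "leaf", data: []byte("42")},
            "list": {h: "list", data: []byte("[...]"), refs: []hash{"leaf"}},
            "root": {h: "root", data: []byte("{...}"), refs: []hash{"list"}},
        }
        dst := memStore{}
        pull(src, dst, "list") // sync just the list value, not the whole database
        fmt.Println(len(dst), "chunks copied")
    }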


There's quite a dramatic claim on the website, "merge [...] changes efficiently and correctly days, weeks, or years later." How does that work? For example if you have two records saying userid 3's name is "ann" and userid 3's name is "jane", I don't see how you could merge those without extra information or human input.


The claim on the website is not meant to suggest that any two changes can be automatically merged. I will try to clarify that.

The world contains logical conflicts because physical constraints mean that processes can operate disconnected from each other. No database can wave that away.

Noms will automatically, efficiently, and correctly merge changes that don't logically conflict. Which is a pretty cool and unique property in a database.

If any conflicts are found, there is a callback to user software to perform a resolution.

More info in the documentation:

https://godoc.org/github.com/attic-labs/noms/go/merge
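
To ground what "callback to user software" could look like, here is a minimal Go sketch of a three-way merge over string maps with a caller-supplied resolver. This is only a conceptual illustration under my own assumptions, not the noms/go/merge API linked above.

    package main

    import "fmt"

    // conflict describes a key that both sides changed in incompatible ways.
    type conflict struct {
        key        string
        a, b, base string
    }

    // resolver is the caller-supplied callback invoked for genuine conflicts.
    type resolver func(c conflict) (string, error)

    // threeWayMerge merges maps a and b against their common ancestor, base.
    // Changes that don't logically conflict (different keys, or identical new
    // values) merge automatically; everything else goes through resolve.
    func threeWayMerge(base, a, b map[string]string, resolve resolver) (map[string]string, error) {
        out := map[string]string{}
        keys := map[string]bool{}
        for _, m := range []map[string]string{base, a, b} {
            for k := range m {
                keys[k] = true
            }
        }
        for k := range keys {
            av, aok := a[k]
            bv, bok := b[k]
            basev := base[k]
            switch {
            case aok && bok && av == bv:
                out[k] = av // both sides ended up with the same value
            case (!aok || av == basev) && bok:
                out[k] = bv // a matches base (or is absent): take b's value
            case (!bok || bv == basev) && aok:
                out[k] = av // b matches base (or is absent): take a's value
            case !aok && !bok:
                // deleted on both sides: leave it out
            default:
                v, err := resolve(conflict{key: k, a: av, b: bv, base: basev})
                if err != nil {
                    return nil, err
                }
                out[k] = v
            }
        }
        return out, nil
    }

    func main() {
        base := map[string]string{"name": "ann", "city": "SF"}
        a := map[string]string{"name": "ann", "city": "Oakland"} // changed city
        b := map[string]string{"name": "jane", "city": "SF"}     // changed name
        merged, _ := threeWayMerge(base, a, b, func(c conflict) (string, error) {
            return c.a, nil // trivial policy: prefer side a
        })
        fmt.Println(merged) // map[city:Oakland name:jane]; no callback needed here
    }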


IBM Domino (aka Lotus Notes) has been automatically, efficiently, and correctly merging changes that don't logically conflict since 1989. How is the functionality in Noms unique?


It's a very hard problem to solve. The best you can really manage is to automatically merge non-conflicting changes, and defining what is a conflicting change will require some extra knowledge if you have multiple tables with relations between them.

For extra fun try doing this with geographic data and try merging geometry changes correctly.


Well, hopefully you're not just using an incrementing user id field. Twitter's Snowflake format is much closer to "definitely unique". Then you have a simple set of IDs, which you can merge with other sets trivially (just take the union). That might help you get started with merging those pieces of data.

But I agree, there are certain kinds of data that you can't so efficiently merge. Your best bet is to try to adapt it into some sort of known, proven CRDT. (Dunno how you do that with this database, haven't really read up on it.)
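
On the "adapt it into a known, proven CRDT" point: for the set-of-unique-IDs case above, the simplest CRDT is a grow-only set (G-Set), whose merge is plain union. A minimal sketch, not tied to Noms in any way:

    package main

    import "fmt"

    // gset is a grow-only set CRDT: elements are only ever added, so merging
    // is plain set union, which is commutative, associative, and idempotent.
    // Replicas therefore converge no matter how or how often they merge.
    type gset map[string]struct{}

    func (s gset) add(id string) { s[id] = struct{}{} }

    func (s gset) merge(other gset) gset {
        out := gset{}
        for id := range s {
            out[id] = struct{}{}
        }
        for id := range other {
            out[id] = struct{}{}
        }
        return out
    }

    func main() {
        a := gset{}
        a.add("user-915273712398417921") // Snowflake-style globally unique IDs
        b := gset{}
        b.add("user-915273712398417922")
        fmt.Println(len(a.merge(b))) // 2, and the same result in either merge order
    }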


Any pointers to design docs or papers that inspired Noms?


What about ACID? How far do you plan to diverge from it?


More specifically, what isolation guarantee do you make? Are writes linearizable? What consistency guarantees do you claim? How well do you think you will do when reviewed by aphyr (https://aphyr.com)?


You should understand that Noms is doing something quite a lot simpler than the systems that aphyr usually reviews.

Noms doesn't manage its own storage - it relies on an underlying key/value store that must provide strongly consistent reads for at least one key. In other words, we delegate most of the hard part to somebody else.

With that all said...

Currently our intent is that:

- Transactions that read and write from a single dataset have strong serializability

- Transactions that read from multiple datasets and write to a single dataset have snapshot isolation

- Transactions that write to multiple datasets aren't possible

In the future, we will probably allow additional configuration, such that, e.g., one could choose snapshot isolation within a dataset for additional concurrency, or strong serializability for transactions that span datasets.
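
To sketch how "strongly consistent reads for at least one key" buys you this (my reading, not an official description of Noms internals): a commit is effectively an optimistic compare-and-swap on a dataset's head hash, and a failed swap means you merge against the new head and retry.

    package main

    import (
        "fmt"
        "sync"
    )

    // headStore stands in for the one strongly consistent key the underlying
    // store must provide per dataset: the hash of its current head commit.
    type headStore struct {
        mu    sync.Mutex
        heads map[string]string // dataset name -> head commit hash
    }

    // compareAndSwap advances a dataset's head only if nobody else committed
    // first. This is the conceptual core of an optimistic commit.
    func (s *headStore) compareAndSwap(dataset, expected, next string) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.heads[dataset] != expected {
            return false // lost the race: caller must merge against the new head and retry
        }
        s.heads[dataset] = next
        return true
    }

    func main() {
        s := &headStore{heads: map[string]string{"sales": "h1"}}

        // Two writers race to commit on top of h1; exactly one CAS succeeds.
        fmt.Println(s.compareAndSwap("sales", "h1", "h2")) // true
        fmt.Println(s.compareAndSwap("sales", "h1", "h3")) // false
    }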


Curious: how is the irregular black shape in the title bar created?


This is the most important question in my mind.

If it isn't ACID, it needs to make a very strong case for itself to even be played with by most DBAs, including myself.


Depending on details, content addressing could make it very hard to not be ACID. Here's to hoping.


Yes, it kind of falls out of content-addressing. See: https://news.ycombinator.com/item?id=12722023


How do you deal with access control? I can imagine that it would be necessary in many applications to have parts of the database only accessible to some people, and that access rules could be complicated and also depend on values in the database itself.


We haven't implemented access control yet. See https://github.com/attic-labs/noms/issues/1183 for one idea.


Looks really interesting! This may sound like a stupid question, but do you have any publication I can cite by any chance?


What was the process like for raising funding for an OSS project? Especially one that was pre-launch.


Do you have any plans to provide C/C++ bindings?


We want to. It would be nice if we could just use cgo but it's not complete. So we need to build some kind of simplified API to Noms that can be exported via cgo.

I created a bug for this just now: https://github.com/attic-labs/noms/issues/2718

Please feel free to get involved there.


Wow, congrats - this looks really interesting!


I've been wanting to use something like this.

But...

* It's a big jump from relational or NoSQL DBs, so there aren't (m)any adapters that I can see for it for JPA, ActiveRecord, etc.

* I'd really like to see a benchmark for each Noms implementation compared to Postgres, MySQL, Oracle, and MS SQL Server, if there is a way to do apples-to-apples.

* "noms" is unfortunately is really bad for SEO because noms is a common word in French. If it could be nomsdb or nomnomnoms or something less exactly French, that'd be better. It's going to be tough to find support online easily otherwise.

* SQL compatibility.

* Fault tolerance (how easily does it corrupt), HA, mirroring, full/partial replication, sharding, archival, partial history truncation, etc.

It seems a little like a dolphin jumping into a pool of hungry sharks. It might be more evolved and more capable in some ways, but it's going to get its ass handed to it on speed and lack of features.

Still- I can't wait to try it.


>It seems a little like a dolphin jumping into a pool of hungry sharks. It might be more evolved and more capable in some ways, but it's going to get its ass handed to it on speed and lack of features.

I'm inclined to agree for large, centralized databases, but I wonder if this would be a good fit for places where sqlite is used? This seems like it could be a good foundation for situations where you want to sync information without a central server, like between devices. An Access/Filemaker clone built on top of this would be cool, too.


With content addressing, you may lose data but never get corrupted data.


I think there is a very similar library in Haskell called Project:M36. Here's its GitHub page on transactions: https://github.com/agentm/project-m36/blob/master/docs/trans...


Previous discussion from back in August: https://news.ycombinator.com/item?id=12211754


Noms is a great example of the power of decentralized database technology, the interesting research that goes into such systems, and wonderful documentation to browse.

I do want to note some tradeoffs with Content-Addressed and Append-Only systems, as my work on a similar project (an Open Source Firebase, https://github.com/amark/gun) made me move away from those ideas (even though they are great ideas).

- Content-Addressed stores are going to revolutionize data integrity and efficiency. But they do have a tradeoff: it becomes a lot harder to read the data if you do not already know the data you are trying to read! The bottom of the repo mentions, for instance, that a query system has not yet been built. From my experience, the reason why is that it is difficult to build query systems on Content-Addressed stores, which is a tradeoff against all the gains you get from them. (See the sketch after these two points.)

- Append-Only gives you rich features like offline-first support and (if implemented) lovely things like rewind/fast-forward data time travel. All very cool. However, do not forget that this also makes it harder to retrieve the latest whole snapshot of your data, so you are not going to get the read performance that you otherwise could.
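
Here is the sketch referred to above: a toy content-addressed store in Go (nothing to do with Noms' or gun's actual code), showing why reads are trivial when you already hold a hash, and why anything query-shaped needs an extra index layered on top.

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // caStore is a toy content-addressed store: the key *is* the hash of the value.
    type caStore map[string][]byte

    func (s caStore) put(value []byte) string {
        sum := sha256.Sum256(value)
        key := hex.EncodeToString(sum[:])
        s[key] = value // writing the same value twice lands in the same slot
        return key
    }

    func (s caStore) get(hash string) ([]byte, bool) {
        v, ok := s[hash]
        return v, ok
    }

    func main() {
        s := caStore{}
        h := s.put([]byte(`{"userid": 3, "name": "ann"}`))

        // Reading is trivial if you already hold the hash...
        if v, ok := s.get(h); ok {
            fmt.Println(string(v))
        }

        // ...but "find all records where name == ann" has no direct answer:
        // you either scan everything or maintain a separate index whose own
        // root hash you track, which is exactly what a query layer adds.
    }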

But the only way for us as a community of people playing around with databases to figure out what the best system is, is for people to build and experiment. Which is part of the reason why Noms is so cool. It is an invitation to others to actually join, play, and experiment with database technology in an open and encouraging environment. That is incredibly valuable and needed!


1. We have prototyped basic query functionality already (https://www.youtube.com/watch?v=fv6_T5yaWns) and Noms was designed from the beginning to support efficient indexes and range scans. So I'm not sure why it would be harder for us to support a query language than any other database.

2. It's true that content addressing can hurt data locality, which can hurt read performance. However, there are things you can do to get a lot of that back.


Looks farther along than http://dat-data.com/, another commendable distributed VCS for data. One distinction is that dat provides additional utilities for querying and compositing the data structures represented in any CSV, JSON, and YAML files that it stores.


One of the other design goals of Dat is to support continually-divergent forking, which they perceive as being useful for communities of analysts processing common datasets but to ultimately different ends. Of course, you never have to merge forks in git, but in its current form they (the dat devs) say that it's not really ideal.


I really like this. I've always thought that git needed to support diff modes other than text-line-based, because even if that fits most programming languages, what you really want is to see differences between ASTs (think of those absurd change counts when just changing the indentation, or imagine a normal diff of LISP source). Maybe there's some way of replacing git with Noms to get there (even if it may be killing flies with cannonballs).


For what it is worth, in my experiments most ASTs (the rare exception being something like Roslyn's C#/VB ASTs) don't do well in "degenerate states" such as partially finished files. (A good source control system should let you commit unfinished work.) I did have great success using syntax-highlighting tokenizers: I was able to create really nice-looking character-based diffs that were relatively semantic, quite quickly. I've not tried to use that as the basis for diffs in something like git, though I've suggested trying it before.

Python code, if interested: https://github.com/WorldMaker/tokdiff


If you take a language parser and pipe the resulting AST into Noms, you basically get something like codeq (http://blog.datomic.com/2012/10/codeq.html).


Git lets you define custom "merge drivers."


Very interesting; I think we need a git for data. What is the performance of diffs and merges? At what data size does it become too slow?


I definitely agree here. As a data scientist, sometimes it seems like we are in the wild west as far as reproducibility and versioning of our analyses.

This seems like an interesting project that tackles some of the data versioning stuff. However, I believe that, at least in data science, we need data versioning closely tied to the analyses themselves for complete reproducibility.

That is, we need the versioning tied to the inputs/outputs of data pipeline stages, such that we can reproduce pipeline runs at any time and incrementally improve and run pipelines based on diffs in data.

As mentioned elsewhere in the comments, Pachyderm (http://pachyderm.io/) does exactly this: it works as git for data, but also enables data pipelining and analyses on top of the data versioning.


Noms performs diff and merge in time proportional to the size of the diff. The size of the source data is not really relevant.

One way to think of Noms is that it is an index optimized for computing diffs.

Here's a screencast that shows off Noms diffing things fast: https://www.youtube.com/watch?v=Zeg9CY3BMes
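
A hedged sketch of why the cost tracks the diff rather than the data: in a content-addressed tree, equal subtrees have equal hashes, so the diff can skip them without descending. The node type below is hypothetical and assumes both trees share the same shape; it is not Noms' actual chunked-tree implementation.

    package main

    import "fmt"

    // node is a hypothetical content-addressed tree node: leaves carry values,
    // internal nodes carry children, and hash covers the whole subtree.
    type node struct {
        hash     string
        value    string  // set on leaves
        children []*node // set on internal nodes
    }

    // diff appends the leaf-level changes between a and b. A subtree with an
    // identical hash is skipped in O(1), however large it is, so total work is
    // proportional to the amount of change rather than the size of the data.
    func diff(a, b *node, out *[]string) {
        if a != nil && b != nil && a.hash == b.hash {
            return // identical subtree: skip it wholesale
        }
        if a == nil || b == nil || len(a.children)+len(b.children) == 0 {
            *out = append(*out, fmt.Sprintf("%s -> %s", leafValue(a), leafValue(b)))
            return
        }
        // Simplification: assume corresponding children line up; a real
        // implementation keeps the tree in a canonical, balanced form so that
        // unrelated edits don't shift chunk boundaries around.
        for i := range a.children {
            diff(a.children[i], b.children[i], out)
        }
    }

    func leafValue(n *node) string {
        if n == nil {
            return "<absent>"
        }
        return n.value
    }

    func main() {
        shared := &node{hash: "s1", value: "a huge unchanged subtree"}
        a := &node{hash: "r1", children: []*node{shared, {hash: "x1", value: "ann"}}}
        b := &node{hash: "r2", children: []*node{shared, {hash: "x2", value: "jane"}}}

        var changes []string
        diff(a, b, &changes)
        fmt.Println(changes) // [ann -> jane]; the shared subtree was never visited
    }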


So.. CRDTs?



Those reasons are a bit outdated; you can certainly have add-and-remove CRDTs with OR-Sets, and you can do so with almost no garbage. One of the TomTom engineers explained one approach at Strange Loop last month: https://www.youtube.com/watch?v=veeWamWy8dk


If by "Git for data" you mean accumulate-only (or append-only), immutable data stores... There are already many existing solutions. It's always good to see alternatives, though!


Can you point me at some? Because I've tried a few immutable data stores and been disappointed every time. Given about 10 GB of JSON structures, I keep finding things that can't outperform the boring combo of:

* Convert the versioned data to tab-separated values

* COPY it into Postgres every time

* Hope Postgres can act immutable enough even though it wasn't designed to be

The closest I've come to improving this situation was Kyoto Cabinet (unusable license) and rolling my own damn hashtable (it worked okay, but adding new kinds of indexes was just unmaintainable; there's a reason databases should be made by experts).


(Disclaimer - I work at pachyderm)

http://pachyderm.io

Pachyderm is git for data. We work hard to make sure we can store data of different types (binary, text, json) efficiently. We also work hard to give you good mechanisms to read the data in a distributed way. I'd be curious how this suits your purposes.


Just started looking at Pachyderm.

While I can see how a git-based filesystem can help with some use cases, does it do any kind of indexing at all? I see that the FAQ recommends exporting the data from Pachyderm into PostgreSQL, which leaves me where I am now.


I was curious about Camlistore and Tahoe-LAFS at one point. I didn't investigate them very thoroughly, however. How do they compare?


Which ones are you thinking of?

The nice thing about git is that it doesn't really require much of the server. I run my own "git hosting" with linode, apt-get, and ssh.

Most of the solutions for immutable big data don't have that level of convenience, as far as I know. Anything which requires a lot of sysadmin and ops work will eventually become a commercial cloud service... git on the other hand is useful by itself, and good companies can obviously be built on top of it as well.


Most binary formats don't diff well.


No, not really. CVS and git are basically the same thing, but git is a lot better in many ways. I don't know if Noms is really the git for data, but those other tools are more like CVS in this analogy: clunky and slow.


They are so different it's hard to know what you mean by "basically the same thing".


Yes, there are significant differences in terms of features, implementation, etc., but they have the same goal: manage source code versions.


That's how Git is used, but I think that by design it is foremost a user-level, append-only, content-addressable file system.

The difference between this and a version control system is a major part of why the UI is so awful.


I hope there is a prune option to delete very old commits.


There isn't yet, but there could be (ala shallow clone in Git).


Nice. I'm super excited about this!

I've hand-rolled something a lot like this already for the Shaxpir backend, but it would be really nice to have a well-engineered database that already supports this kind of model, out of the box.


I've been working on some syncing address book, calendar, password manager, and notes applications. My idea was to use mDNS to announce presence and git to sync, but this might be (more) useful.


Noms should definitely be more useful in that scenario. We have some customers who were using Git the way you describe and replaced it with Noms and have been very happy with the results.


Have you had a look at the finance world? Git for data seems to be something we in finance really need, especially the possibility of seeing all the changes and of reconciling things.


Diff for financial data, with the attendant workflow for breaks, is already an entire existing market segment. Duco, the startup I work for, is tackling it as a service.


How does this compare to gun.js.org?


Their logo is a squirrel giving an invisible blowjob


It's an otter, floating on its back.


You say that like it's a bad thing.


How is it that you have 2 reference implementations, written in 2 different cross platform environments, yet there is no support for Windows?

Why would I use this if I can't use it everywhere?


... well, from the link... "Noms is supported on Mac OS X and Linux. You can compile a Windows build from source, and it usually works, but isn't officially supported."

Also, supporting Windows is often a pain in the ass compared to Linux/Mac. I don't fault them for not supporting it officially, especially this early in the project's life.


To be fair, Mac support is also a huge pain, unless you happen to have a Mac.

Just getting a machine to test on is expensive. If you look for Mac OS VMs in the cloud, you find they start at $1 an hour (https://www.macincloud.com/) or around $80 / month (http://xcloud.me/pricing-signup/). Compare that to around $5 / month for a Linux VM.

And then you have to go through the whole dance of getting Xcode and Homebrew just to have a development environment. It's been a few years since I used Mac OS X to develop stuff, but it wasn't intuitive at all back then.


Mac hosting doesn't tend to be VMs largely because of Apple's licensing rules for macOS (which is a big hassle, no question) so they'll naturally be more expensive than a $5 virtualized Linux VM.

But Macincloud has ~$20/month plans (with 8GB of RAM which isn't a $5 option for most Linux or Windows VMs) and there are several others going for between $30-50, so you hardly have to go with $80/month.


Yeah, Mac devs forget that mac->linux is immensely easier than linux->mac. At least Windows VMs are cheap.


Not everyone uses Windows. I will use it because it supports macOS and Linux, the two platforms that I use every day. Not everyone has the same needs.


Most devs don't use Windows these days.


Good devs use all three platforms. Or at least two.


Use, because they have to. That being said, they usually develop more in one than the others. I have not yet met anyone who was equally proficient at developing software across all platforms/environments.


> Use, because they have to.

...in your very uninformed opinion...


Incorrect.


Last time I checked it's close to 50/50, with Windows progressively losing share among devs:

http://www.geekwire.com/2016/mac-overtakes-linux-as-develope...

So my comment should be correct in a year or two, hopefully.


> "So my comment should be correct in a year or two, hopefully."

Look again at that graph:

http://stackoverflow.com/research/developer-survey-2016#tech...

OS X is treated as a single category, but Windows is split over multiple versions. When you add up all the Windows versions, OS X isn't close to 50% market share.

Also, I'd question why it should be 'hopefully' correct. Other than better support for Unix command line tools, what gives OS X the edge over Windows 10 as a dev environment?


> OS X is treated as a single category, but Windows is split over multiple versions. When you add up all the Windows versions, OS X isn't close to 50% market share.

What was claimed is that most developers don't use Windows, not that they use macOS. If over 50% of developers use either macOS or Linux, then the claim is true.


26.2% + 21.7% = 47.9%. So based on that survey the statement that most devs don't use Windows anymore is incorrect.

Also, that doesn't answer my other question. Other than better support for Unix command line tools, what makes OS X (or Linux) a better platform for devs than Windows 10?


> Other than better support for Unix command line tools, what makes OS X (or Linux) a better platform for devs than Windows 10?

Some that come to mind: Granular packaging systems with everything developer-related under the sun in them (including binary and source packages). Better support and easier install of a vast array of developer tools and languages (just one example: git). Much more automatable (eg not every install on windows can be automated. Many require GUI interaction). Containerizable. More powerful filesystems like layered filesystems or content-addressed filesystems like ZFS. Cloud orchestration tools work better (puppet, chef, ansible). Tiling window managers to streamline screen work. Much wider choice of code editing environments and code manipulation tools (Windows is much more centric around the offerings of Microsoft). Better interoperation with other tools and filesystems (Linux plays much nicer with windows than windows plays with linux). Fewer bugs in the APIs and development systems themselves (a result of open source enabling bugfixing independent of a vendor). Better system debug and development tools (eg. strace/dtrace/ktrace). Almost every dev tool included in the distributions (no need to go download some dodgy .exe off Tucows or wherever). More example open source code to reference and work with makes coding similar ideas less error prone.


> "Granular packaging systems with everything developer under the sun in them (including binary and source packages)."

Yes, that's true.

> "Better support and easier install of a vast array of developer tools and languages (just one example: git)."

This falls under Unix command line tools for me, but okay.

> "Much more automatable (eg not every install on windows can be automated. Many require GUI interaction)."

This is really just the same point as the package management one you already mentioned.

> "Containerizable."

Windows Server now has native support for Docker.

> "More powerful filesystems like layered filesystems or content addresses filesystems like ZFS."

Linux's support for ZFS isn't exactly a strong point. Perhaps you had OS X in mind? In any case NTFS is a fairly decent file system, I don't really see it as a weak point for Windows.

> "Cloud orchestration tools work better (puppet, chef, ansible)."

Automating Windows configuration is easily done through PowerShell. I know that Chef and Ansible both use PowerShell to get their Windows support. I'd suggest taking a look at Desired State Configuration if you're unfamiliar with how these tools utilise the existing infrastructure on Windows.

https://msdn.microsoft.com/en-us/powershell/dsc/overview

> "Tiling window managers to streamline screen work."

In my experience, tiling window managers are nice if you've got a keyboard-heavy workflow, but not that much more efficient when you're switching around GUI apps. Windows has some basic window-tiling shortcuts built in, plus it now has virtual desktops built in and shortcuts to switch between them, so I don't feel like I'm missing out on much.

> "Much wider choice of code editing environments and code manipulation tools (Windows is much more centric around the offerings of Microsoft)."

Which code editing environments are you thinking of that you like that aren't also available on Windows? As for the MS tools, if you can show me a better IDE than VS on any platform then I'll be impressed.

> "Better interoperation with other tools and filesystems (Linux plays much nicer with windows than windows plays with linux)."

The upcoming Linux subsystem for Windows 10 should go a long way in addressing that.

> "Less bugs in the APIs and development systems themselves (a result of open source enabling bugfixing independent of a vendor)."

Hmm, I don't think you can back that up. Let's put it like this, I've tried Linux multiple times, but I always come back to Windows, and generally speaking that's because of bugs I've found in Linux or Linux software. I have far fewer problems with Windows. Can you share some of the problems you've had with Windows?

> "Better system debug and development tools (eg. strace/dtrace/ktrace)."

Sure, I'll admit these tools are probably better than the Windows equivalents.

> "Almost every dev tool included in the distributions (no need to go download some dodgy .exe of tucows or where ever)."

You don't need to get dodgy dev tools, there are plenty of useful dev tools from Microsoft and other well known software companies.

> "More example open source code to reference and work with makes coding similar ideas less error prone."

Are you familiar with MSDN? If you knew how easy Microsoft makes it to become a proficient Windows coder, you wouldn't be saying that.

To be fair to you, there's one advantage of Unix OSes I think you missed, and that's better networking tools (such as firewall software).


> So based on that survey the statement that most devs don't use Windows anymore is incorrect.

Yes, but the post you replied to said "in a year or two".

> Also, that doesn't answer my other question.

That's because I don't have an answer, it wasn't me who said "hopefully" :) As long as Linux is considered a first-class platform, I don't really care who's on top.


> 26.2% + 21.7% = 47.9%. So based on that survey the statement that most devs don't use Windows anymore is incorrect.

I said close to 50/50, so it's not false either, if you take into account the fact that there is probably some margin of error anyway.

> Other than better support for Unix command line tools, what makes OS X (or Linux) a better platform for devs than Windows 10

Maybe it depends what kind of developer you are/who you talk to, but most devs I interact with tend to live in the command line (and need proper package management as well) - things that Win10 does not do too well yet.


Pffft. "probably some margin of error anyway" is a gross understatement. It was 47.9% of the 40k developers who responded. Not even a half percent of the entire software development community :)

Enterprise devs make up the largest segment of pro developers and Windows rules the enterprise.

Win 10 also has the Ubuntu command-line now. I've been using it since beta and it's glorious. Macs can't compete with this - they don't even ship with new GNU utils and you'll have to fight with Apple if you want them because updates will break your setup.

Meanwhile it takes 5 minutes to get a modern Unix command-line in Windows.

I'm glad to say that I'm quite certain that your hopes of a non-Windows world will never be realized.


Nah, it's not going to happen. Especially because that "50/50" number isn't even close to being a true measurement.

The survey covered a whole ~40,000 people out of the ~11 million pro developers in the world. That's nothing.

The numbers also don't jibe at all with the empirical evidence. Walk into any small, medium or large business IT shop that employs programmers and you'll find Windows more than any other OS. If they're running Linux it's in a VM on Windows.

Macs are still extremely rare for anyone outside of Mobile developers.

What do you think Enterprise devs use? Not Macs...


You shouldn't assume the SO developer survey is a good sample of the whole profession.


Ok, but do you have other sources then?


Nope :)



