The exact way in which git handles commits is very muddied - it's snapshots on the surface, a bit of diffs when packed, and a lot of operations on commits are actually 3-way merges (including merges, rebases, cherrypicks and reverts). Keeping track of all of these matters (esp the operations that use diffs), but it can also get overwhelming for a tool.

In my opinion, it's probably good enough to understand the model git is trying to emulate. Commits are stored more or less like snapshot copies of the working tree directory with commit information attached. The fact that there is de-duplication and packing behind the scenes is more a matter of trying to increase storage efficiency than of any practical difference from the directory snapshot model. Meanwhile, the more complex git operations (merges, rebases, reverts, etc) use actual diff algorithms and 3-way merges (way more often than you'd imagine) to propagate changes between these snapshots. This is especially apparent in the case of rebases, where the snapshot model falls completely on its side (modifying a commit will cause the same change in all subsequent commits).

This actually makes sense if you consider the development workflow of the Linux kernel before git. Versions were kept as directories and tarballs (and, from 2002 until git, in BitKeeper), and a lot of development was based on quilt, diffutils and patchutils. Git covers all these use cases, though it may not be immediately apparent.

Added later: It's also interesting to look at Mercurial's model. Like Git, Mercurial uses both snapshots and diffs for storage. But unlike the Git way of layering these two, Mercurial interleaves them - as diffs with occasional full snapshots. This is more like the video codec concept of keyframes (I think that's what inspired it). This means that Mercurial, unlike Git, doesn't need repacking. And while Git exposes its internal model in its full glory, Mercurial manages to more or less abstract it away.



Well-said, although I disagree that it's "muddled".

The data model is that commits are snapshots, and diffs between snapshots are computed as needed. The whole system is designed around this.
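To make "diffs are computed as needed" concrete, here is a minimal Python sketch. It is not Git's implementation: the snapshots are hypothetical dictionaries mapping a path to file content, and the diff comes from Python's `difflib` rather than Git's own diff machinery.

```python
import difflib

# Two hypothetical snapshots: each maps a path to the full file content.
snapshot_a = {
    "README": "hello\n",
    "main.py": "print('v1')\n",
}
snapshot_b = {
    "README": "hello\n",
    "main.py": "print('v2')\n",
    "LICENSE": "MIT\n",
}


def diff_snapshots(old, new):
    """Compute a unified diff between two snapshots on demand.

    Nothing diff-like is stored; a missing file is treated as empty.
    """
    lines = []
    for path in sorted(set(old) | set(new)):
        before = old.get(path, "").splitlines(keepends=True)
        after = new.get(path, "").splitlines(keepends=True)
        if before != after:
            lines.extend(
                difflib.unified_diff(before, after, "a/" + path, "b/" + path)
            )
    return "".join(lines)


patch = diff_snapshots(snapshot_a, snapshot_b)
print(patch)
```

The stored data is only ever the two full snapshots; the patch exists just for as long as someone asks to see it.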

Packing is an implementation detail.

The fact that internally it can store snapshots as diffs is more or less unrelated to the user-facing diffs. IMO it's confusing to even mention it in an educational context, except in response to the question of "how does Git prevent repo size from exploding?".


> Packing is an implementation detail.

It's so much of an implementation detail that even if the pack has a diff/delta between the two objects to diff, that WON'T be used to produce the output from git diff.


It's so much of an implementation detail, that git didn't have packing at first! All it had was loose objects ("disk space is cheap"). Packing was later added as an optimization, but the object model is still the same. It doesn't matter whether an object is in a pack file or not, it's treated the same.
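As a rough illustration of what a loose object is, the Python sketch below builds a blob the way Git's default SHA-1 object format does: a `blob <size>\0` header followed by the content, hashed with SHA-1 and stored zlib-compressed. The helper names are made up for this example, and repositories using the SHA-256 object format would produce different IDs.

```python
import hashlib
import zlib


def blob_object(content: bytes) -> bytes:
    """Build the raw bytes Git hashes for a blob: a header plus the content."""
    return b"blob " + str(len(content)).encode() + b"\0" + content


def object_id(raw: bytes) -> str:
    """SHA-1 object ID (the default object format)."""
    return hashlib.sha1(raw).hexdigest()


def loose_object_path(oid: str) -> str:
    """Loose objects live under .git/objects/<first 2 hex digits>/<the rest>."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


raw = blob_object(b"")         # the empty blob
oid = object_id(raw)
on_disk = zlib.compress(raw)   # a loose object file holds the zlib-compressed raw bytes

print(oid)
print(loose_object_path(oid))
```

The object ID depends only on the content, which is why it makes no difference to the rest of Git whether the object later ends up inside a pack file.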


> although I disagree that it's "muddled"

I understand. I meant muddied (not muddled) in the sense that it can be confusing for beginners. For some reason, many long-time git users also don't seem to progress beyond the initial image they have. (That includes me too - I struggled with rebases for a long time.) That description wasn't a criticism of the git model. The Git model is clear if you take some time to study it.

> Packing is an implementation detail.

> The fact that internally it can store snapshots as diffs is more or less unrelated to the user-facing diffs.

My point exactly! To summarize, a git user needs to remember only two things:

1. Git commits are modeled as snapshots of work tree.

2. Many operations are (user-facing) diff-based.

Every other detail is a finer implementation detail that's good to know but not essential to get started.


What muddies things for beginners is bad YouTube tutorials (which the internet is full of), not Git or the actual documentation. People should really read the Git documentation; it's very well-written and explains the correct mental model.

Also, people really shouldn't teach implementation details to beginners. Or intermediates. Perhaps anyone who casually mentions that Git stores diffs to anyone not currently opening the source code for Git itself should be disqualified from ever giving explanations for technical stuff ever again.


I agree with your point about the official Git documentation. It is the only one I learned from and it's easy and comprehensive, partly due to the involvement of actual git developers. But there is one area I wish they stressed a bit more. Git documentation talks about the snapshot model so many times - you're never left in doubt how it's stored (including packing). But they don't particularly stress the fact that rebases, merges, cherrypicks and reverts are based on diffs (3-way merges). For example, I was expecting the 'drop' operation in interactive rebases to just delete that commit and leave all the subsequent commits intact (except for the DAG linkage). But to my surprise, the change introduced by that commit disappeared from all subsequent commits - leading me to suspect that they were using diffs in this stage. I eventually found a single confirmation of this in the official documents. But it's obscure. In fact, I tried and failed to find it for reference in this reply.


That's a great point, and I think we all agree that the documentation does a poor job of distinguishing between when the "snapshots" and the "diff" models are in use. But what is never exposed in the docs or user interface is the internal implementation details of how snapshots are stored. And that's what I was arguing is not one of Git's many documentation and UX problems.


> But to my surprise, the change introduced by that commit disappeared from all subsequent commits - leading me to suspect that they were using diffs in this stage.

A good way to think about rebase is that it's nothing more than an automated reset and cherry-pick. You can rebase by hand without using `git rebase`; it's a convenience tool just like `git bisect`. `drop` does nothing, and removing the line does the same thing - it's just not cherry-picking that particular commit, skipping right to the next line. You can even add new lines to an interactive rebase and cherry-pick completely unrelated commits this way.

Once you know how cherry-pick works, rebase (and revert) becomes clear too.
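Here is a toy Python model of that idea, under loud assumptions: commits are a hypothetical `Commit` class holding a path-to-content dictionary, and the 3-way merge works on whole files rather than on lines as Git's does. It only shows why a dropped commit's change vanishes from everything replayed after it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    tree: dict                       # snapshot: path -> file content
    parent: Optional["Commit"] = None


class Conflict(Exception):
    pass


def merge3(base, ours, theirs):
    """Whole-file 3-way merge (a simplification; Git merges within files too)."""
    merged = {}
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if t == b:
            result = o               # theirs did not touch this path: keep ours
        elif o == b or o == t:
            result = t               # only theirs changed it, or both agree
        else:
            raise Conflict(path)
        if result is not None:       # None means the path is deleted
            merged[path] = result
    return merged


def cherry_pick(head, commit):
    """Replay `commit` on `head`, using the commit's own parent as merge base."""
    tree = merge3(commit.parent.tree, head.tree, commit.tree)
    return Commit(tree, parent=head)


def rebase(onto, todo):
    """Reset to `onto`, then cherry-pick every line that is not dropped."""
    head = onto
    for action, commit in todo:
        if action == "drop":
            continue                 # same effect as deleting the todo line
        head = cherry_pick(head, commit)
    return head


root = Commit({})
a = Commit({"a.txt": "A\n"}, root)
b = Commit({"a.txt": "A\n", "b.txt": "B\n"}, a)
c = Commit({"a.txt": "A\n", "b.txt": "B\n", "c.txt": "C\n"}, b)

new_tip = rebase(root, [("pick", a), ("drop", b), ("pick", c)])
print(sorted(new_tip.tree))
```

The original snapshot of `c` still contains `b.txt`, but the replayed one does not, because only the change between `b` and `c` was carried over.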


It’s confusing right up until it isn’t. If you rename a file and then edit it, it helps to understand that whether this is shown as a creation and a deletion or as a rename plus an edit is immaterial to some parts of git, yet very important to others, and can change if you squash or rebase.

Git’s abstractions are pretty leaky.


I agree -- so many of the advertised advantages of git depend on operations on diffs, especially the confusing ones that people find difficult to learn, which makes it very confusing for beginners and casual users when they hear "commits are snapshots" said in a tone that seems to imply that thinking of them as diffs is an abhorrent error. Yes, understanding that they are conceptually snapshots is useful, but git wouldn't be git if it didn't do a ton of work on diffs day to day.


> This is especially apparent in the case of rebases, where the snapshot model falls completely on its side (modifying a commit will cause the same change in all subsequent commits).

I disagree. A rebase is precisely when the diff model is problematic. A modified commit does not cause changes in subsequent commits.

Modifying commit A means modifying that commit's snapshot into A~.

Now the subsequent commits will be cherry-picked on top of A~.

If there are subsequent commits with changes that depend on A, you have a merge conflict.

A~ does not cause changes in B~, C~. The changes of B are applied on top of A~ becoming B~, the changes of C are applied on B~ etc.

Thinking of commits as diffs during rebase is a recipe for confusion.


What you've described is a bunch of operations that apply diffs.


You missed the point I was making about snapshots and diffs. In git, the identity of a commit isn't a diff/change. It's a snapshot. Many operations like commit, push, fetch, etc require you to think so too. Based on that definition, the commits are essentially changed if snapshots change - even if the change introduced by them remains the same.

It's clear by your own definition that B~ and C~ are not the same snapshots/commits as B and C. They have absorbed the changes from A to A~ (or the delta from A to A~ is now reflected in snapshots B~ and C~). The fact that diffs on B and C remained the same in B~ and C~ is irrelevant to the commits' identity.

> Now the subsequent commits will be cherry-picked on top of A~

Here is the important point. Cherry picking is implemented as a 3-way merge. It involves an actual diffing algorithm.

> Thinking of commits as diffs during rebase is a recipe for confusion

Here again, there are two issues. I didn't say that commits have to be thought of as diffs. I said many operations (incl rebase and cherrypicking) use diffs to propagate changes between snapshots. This reasoning is necessary to understand why snapshots B~ and C~ are different from B and C.

The second part is that thinking of rebases in terms of diffs is far from a recipe for confusion (3-way merges actually, but diff is an easier approximation). It actually helped me understand the operations and allowed me to predict the results of different operations in advance. That single realization made Git far more approachable for me and gave me the confidence that I can solve most Git issues without having to delete the copy and clone it again.


I think it's great if thinking of commits as diffs in the context of a rebase works for you. I only caution against it because there are many situations during a rebase where the results can be very confusing with such a perspective. Precisely because a 3-way merge can make things much more complicated.

I think you're muddling the concepts of tree (a snapshot) and commit somewhat. A commit is not merely a snapshot, it's a tree as well as metadata.

> the commits are essentially changed if snapshots change - even if the change introduced by them remains the same.

If by commit you mean tree, then yes. One can think of B and B~ "introducing" the same changes if the diff between A and B is the same as A~ and B~.

For example, say you add a new file in A~ and then cherry-pick B on it, the tree of B~ will not be the same as B, but the diffs of A and B will be the same as A~ and B~.
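That example can be checked with a small Python sketch. The assumptions are loud ones: trees are plain path-to-content dictionaries, and the cherry-pick is approximated by applying a path-level diff, which only matches a real 3-way merge when nothing conflicts. The function names are invented for illustration.

```python
def changes(old, new):
    """Path-level diff: maps each changed path to (old content, new content)."""
    return {
        path: (old.get(path), new.get(path))
        for path in set(old) | set(new)
        if old.get(path) != new.get(path)
    }


def apply(tree, delta):
    """Apply a path-level diff; assumes no conflicting edits exist."""
    result = dict(tree)
    for path, (_, after) in delta.items():
        if after is None:
            result.pop(path, None)   # the path was deleted
        else:
            result[path] = after
    return result


a = {"main.py": "v1\n"}
b = {"main.py": "v2\n"}

a_new = {**a, "NOTES": "added while editing A\n"}   # A~: A plus a new file
b_new = apply(a_new, changes(a, b))                 # B~: B replayed on A~

print(b_new != b)                                   # the trees differ
print(changes(a, b) == changes(a_new, b_new))       # the diffs match
```

Both comparisons come out true here: B~ is a different tree from B, while the change each one introduces relative to its parent is identical.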

The main reason I caution against this perspective is that you can easily end up "introducing" other changes when you reorder commits.

Change A-B-C to A-C~-B~ and very often you'll find yourself "introducing" changes from B in C~.

That's not to say that running `git show REBASE_HEAD` to view the diff of B-C is a bad idea, just that thinking of commits as diffs during a rebase is, imo, often a false friend.


> I think you're muddling the concepts of tree (a snapshot) and commit somewhat. A commit is not merely a snapshot, it's a tree as well as metadata.

My intention was to approximate definitions to the bare essentials without losing too much fidelity. This criticism feels like a nitpick (apologies if that wasn't your intention) because the metadata was implied as it's well understood.

> If by commit you mean tree, then yes. One can think of B and B~ "introducing" the same changes if the diff between A and B is the same as A~ and B~.

That is the diff model. You are cautioning against treating commits as diffs during rebasing, and yet insist on using that definition to oppose my notion. My stand is a bit more consistent here. Treat commits as snapshots. But rebase and similar operations use diffs on those snapshots.

> Change A-B-C to A-C~-B~ and very often you'll find yourself "introducing" changes from B in C~

I find this claim bizarre. The change works exactly as expected when viewed as diff (3-way merge) operations. The diffs introduced by C and (B -> B~) end up in commit B~ (tree snapshot + metadata and whatever else is necessary - just to be pedantic).



