Thursday, May 7, 2009

Fun with Git

I met a bunch of smart people at the Montreal Ubuntu 9.04 release party, and I talked to one person a lot about the distributed SCM Git. I had heard of distributed SCMs but had never had any reason to try anything other than SVN. Still, I was intrigued, and so decided to look into it. Here are some useful links:

For an overview of what Git is, see Linus Torvalds's tech talk at Google. For an overview of how to use Git, see Randal Schwartz's talk at Google. For an overview of how I have so far successfully used it in my GSoC project, read on.

My goal was to use Git to solve a particular problem, which was as follows:

There is an SVN repository A, and a CVS repository B. The code in B was probably forked at some point from the code in A. B was iterated on in a private sandbox before being put into CVS, so B has no history, just HEAD. A has also changed since the time it was forked. I need to take the code in B and compare it to each of the major release tags in A, so as to determine which tag is the most similar to B, and therefore which tag B was most likely to have been forked from. This basically means diffing two different filesystem trees. Because Git is supposed to be very good at merging different branches based on content, I assumed it would be able to give me some very intelligent diff output.

In this context, A is the GWT SVN repo, and B is a project in Eclipse's CVS repo. The Eclipse project org.eclipse.swt.e4.jcl appears to be forked from GWT, so I needed to determine when it was forked, what changes were made after, and (hopefully) why.

The person I met at the Ubuntu party, Derek, wrote me excellent instructions on how to do this, which I have gratuitously copied and pasted here.


Here are the general steps:

- Use "git svn clone" to clone the Subversion repository A.
- Checkout the CVS HEAD of repository B into a separate workspace.
- Create a new branch in the git-svn repository. You can branch
starting from any tag or commit, but the closer that the branch origin
is to the original fork tag, the fewer differences you will likely
encounter. So, try to correctly guess the fork tag and branch from
that guess.
- Overwrite the workspace associated with the new branch with all the
files in the CVS HEAD workspace. Stage and then commit the changes
into the Git repository.
- For every Subversion "tag" branch in Git (Git stores every
Subversion tag as a Git branch, not as a Git tag), display the
differences between the branch that contains CVS HEAD and the Git
Subversion "tag" branch. Note the one that gives you the fewest
differences.

Specific commands (untested):

# Clone the Subversion repository A. Option "--stdlayout" tells Git
# that the repository has the standard branches, tags, trunk layout
git svn clone --stdlayout http://domain.com/repositoryA

# Create branch "cvsbranch" starting from a Subversion tag branch
# (the new branch name comes first, then the starting point)
git branch cvsbranch subversiontag

# Switch to branch "cvsbranch"
git checkout cvsbranch

# Overwrite files in Git "cvsbranch" workspace with those from CVS
# HEAD workspace. Make sure that you don't copy the CVS metadata
# directory in each workspace directory. Consider using rsync which can
# ignore CVS and Subversion metadata directories.

# Stage all CVS changes
git add --all

# Commit CVS changes
git commit --message "Added changes from CVS repository B"

# List all Git Subversion tag branches (git-svn stores them as remote
# branches, so "-r" is needed to see them)
git branch -r | grep tags

# Compare each Git Subversion tag branch with branch "cvsbranch". Option
# "--name-only" displays only the names of the files that changed:

for tag in `git branch -r | grep tags`; do git diff --name-only "$tag" cvsbranch; done

# Note the tag that produced the fewest differing files.
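
The overwrite step above is described but has no command attached. Here is a minimal sketch of one way to do it, assuming rsync is installed (with a tar fallback if it is not). The function name copy_without_cvs and its two directory arguments are hypothetical, not part of Derek's instructions. rsync also has a --cvs-exclude option, but it skips a broader set of patterns (object files, backups, and so on), so this sketch excludes only the CVS directories.

```shell
# Copy a CVS checkout into a Git working tree, skipping CVS metadata.
# $1 = CVS checkout directory, $2 = Git working tree (hypothetical paths).
copy_without_cvs() {
    mkdir -p "$2" || return 1
    if command -v rsync >/dev/null 2>&1; then
        # --exclude=CVS/ skips every directory named CVS at any depth.
        rsync -a --exclude=CVS/ "$1"/ "$2"/
    else
        # Fallback when rsync is not installed: stream the tree through tar.
        (cd "$1" && tar --exclude=CVS -cf - .) | (cd "$2" && tar -xf -)
    fi
}
```

It would be run with "cvsbranch" checked out, for example: copy_without_cvs ~/cvs-checkout ~/git-clone (both paths are placeholders). Note that a plain overwrite like this leaves in place any file that exists in the branch but not in CVS, so files deleted on the CVS side will not show up as removed in the later diffs.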



While this mostly worked great, I ran into one gotcha using Git from the Ubuntu 8.04 repo (git version 1.5.4.3), which was that git-svn did not capture all of the tags from svn. When I tried it with git from the 8.10 repo (git version 1.5.6.something), it worked fine. Really, I should be git-cloning the latest git, but I haven't quite figured out how to do that successfully. Nevertheless, beware.

I also tried using svn2git, which is supposed to basically be git-svn with some manipulations to make the imported svn repo more git-like. I spent an hour trying to get this to work, and met with marginal success. It turned out not to be necessary.

In the end, I used the following commands to gauge similarity.

The first one just counts the number of files that changed between each tag and the cvs branch.


olpc@OLPC:~/workspace-gsoc/svn/user/super/com/google/gwt/emul/java$ for tag in `git-branch -a | grep tag`; do echo $tag; git-diff --relative --name-only $tag cvsbranch | wc; done;

tags/1.3.1
151 151 3767
tags/1.3.3
151 151 3767
tags/1.3.3@288
151 151 3767
tags/1.4.10
150 150 3746
tags/1.4.59
150 150 3746
tags/1.4.60
150 150 3746
tags/1.4.60@1399
150 150 3746
tags/1.4.61
150 150 3746
tags/1.4.61@1504
150 150 3746
tags/1.4.62
150 150 3746
tags/1.4.62@2104
150 150 3746
tags/1.5.0
78 78 1874
tags/1.5.0@2941
78 78 1874
tags/1.5.1
70 70 1709
tags/1.5.1@3391
70 70 1709
tags/1.5.2
70 70 1709
tags/1.5.2@3587
70 70 1709
tags/1.5.3
73 73 1756
tags/1.5.3@3776
73 73 1756
tags/1.6.0
78 78 1863
tags/1.6.0@4621
78 78 1863
tags/1.6.1
78 78 1863
tags/1.6.1@4846
78 78 1863
tags/1.6.2
78 78 1863
tags/1.6.2@5035
78 78 1863
tags/1.6.3
78 78 1863
tags/1.6.3@5110
78 78 1863
tags/1.6.4
78 78 1863
tags/1.6.4@5112
78 78 1863
tags/1.6.4@5189
78 78 1863


This second analysis actually counts the number of lines in each diff
between each tag and the cvs branch.


olpc@OLPC:~/workspace-gsoc/svn/user/super/com/google/gwt/emul/java$ for tag in `git-branch -a | grep tag`; do echo $tag; git-diff --relative $tag cvsbranch | wc; done;

tags/1.3.1
18257 77427 515421
tags/1.3.3
18257 77427 515421
tags/1.3.3@288
18257 77427 515421
tags/1.4.10
18261 78017 524249
tags/1.4.59
18158 77486 520272
tags/1.4.60
18158 77486 520272
tags/1.4.60@1399
18158 77486 520272
tags/1.4.61
18158 77486 520272
tags/1.4.61@1504
18158 77486 520272
tags/1.4.62
18158 77486 520272
tags/1.4.62@2104
18158 77486 520272
tags/1.5.0
5667 22620 165036
tags/1.5.0@2941
5667 22620 165036
tags/1.5.1
4357 17759 127102
tags/1.5.1@3391
4357 17759 127102
tags/1.5.2
3920 15766 113373
tags/1.5.2@3587
3920 15766 113373
tags/1.5.3
4045 16322 117623
tags/1.5.3@3776
4045 16322 117623
tags/1.6.0
4974 19823 144457
tags/1.6.0@4621
4974 19823 144457
tags/1.6.1
4986 19881 144836
tags/1.6.1@4846
4986 19881 144836
tags/1.6.2
4986 19881 144836
tags/1.6.2@5035
4986 19881 144836
tags/1.6.3
4986 19881 144836
tags/1.6.3@5110
4986 19881 144836
tags/1.6.4
4986 19881 144836
tags/1.6.4@5112
4986 19881 144836
tags/1.6.4@5189
4974 19823 144457
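
Scanning that list by eye for the smallest number is error-prone. As a rough sketch, the same output could be piped through a small helper that pairs each tag with its line count and sorts by it. The function name rank_tags is made up for this example; it assumes the input alternates between a tag line and a wc line, exactly as the loop above prints it.

```shell
# Read alternating lines from stdin: a tag name, then the `wc` output
# for that tag. Print "<line count> <tag>", smallest count first.
rank_tags() {
    awk 'NR % 2 == 1 { tag = $1; next }   # odd lines: remember the tag
         { print $1, tag }                # even lines: first wc column
        ' | sort -n
}
```

Appending "| rank_tags | head" after the "done" of the loop above would then list the closest tags first.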


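So, it looks like 1.5.2 has fewer changed lines than any other tag (3920 lines of diff, against 4045 for the next closest, 1.5.3), and it ties with 1.5.1 for the fewest changed files (70). The diff to 1.5.2 is attached to this email. It basically looked like 48 new files were added, 0 were removed and many were modified.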
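
One way to get an added/removed/modified tally like that is git diff's --name-status option, which prefixes each path with a status letter (A for added, D for deleted, M for modified). The sketch below only counts those letters; the function name tally_status is invented for illustration, and this is not necessarily how the numbers above were produced.

```shell
# Tally the status letters from `git diff --name-status` output on stdin.
# Prints one "<letter> <count>" line per status, sorted by letter.
tally_status() {
    awk '{ n[substr($1, 1, 1)]++ }          # first character of the status field
         END { for (s in n) print s, n[s] }' | sort
}
```

For example: git diff --relative --name-status tags/1.5.2 cvsbranch | tally_status. With the tag given first, "A" counts files that exist in cvsbranch but not in the tag.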

Once I had established which GWT SVN tag was most similar, it was trivial to diff the branches and get meaningful output:


olpc@OLPC:~/workspace-gsoc/svn/user/super/com/google/gwt/emul/java$ git-diff --relative cvsbranch tags/1.5.2 > 1-5-2.diff


I won't post the resulting diff here, but suffice it to say that it was exactly what I was looking for, and I am now moving on to the task of analyzing the diff to figure out exactly what was changed, and why.

My final impressions of Git: I'm very happy with it. The ability to import an entire SVN repo so that it can be used offline, the ease with which one may branch and merge, and the ability to push changes back up into the SVN repo are all things that will be very useful to me. I think I'm actually going to get to use Git a lot on my GSoC project.
