Support .gitattributes export-ignore in distribution archives#515
Support .gitattributes export-ignore in distribution archives#515jberdine wants to merge 9 commits into
Conversation
Parse .gitattributes files and exclude paths marked with the export-ignore
attribute from distribution tarballs. This is the mechanism used by
git-archive to exclude files, allowing projects to exclude dev-only files
like dune-workspace from releases.
Supported patterns:
- Exact matches: dune-workspace
- Directory patterns: .github/**
- Glob patterns: *.log, test_*, file?.txt
- Double star: **/build, src/**/test.ml
Paths are normalized before matching (handles ./ and ../).
Not supported:
- Escaped patterns (\\! for literal \!)
- Quoted patterns ("a b" for patterns with spaces)
- Case insensitivity (core.ignorecase)
- Negation patterns (\!pattern)
- Subdirectory .gitattributes files
Testing:
- Unit tests comparing behavior against git check-attr
- Tests cover pattern matching and parsing edge cases
- .gitattributes content is generated from test cases to ensure sync
- Archive integration test verifies end-to-end exclusion without requiring git
Signed-off-by: Josh Berdine <josh@berdine.net>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Leonidas-from-XIV
left a comment
There was a problem hiding this comment.
Very nice PR, the extensive test coverage is definitely a highlight and good to see as the glob to regex code is somewhat hairy.
I have some suggestions for improvements here.
| let path_set_of_dir dir ~exclude_paths = | ||
| let not_excluded p = Ok (not (Fpath.Set.mem (Fpath.base p) exclude_paths)) in | ||
| let path_set_of_dir dir ~exclude_paths ~export_ignore = | ||
| let not_excluded p = |
There was a problem hiding this comment.
To avoid the negation, maybe rename it to included. It's kind of hard to wrap one's head around the double negated name and then the negation in the code.
There was a problem hiding this comment.
Renamed. My thinking was that conceptually what is being done is "excluding" and so kept that base term, but fair point about double-negation.
| (* Skip directories - they will be created implicitly when their | ||
| contents are added. This ensures that directories whose contents | ||
| are excluded via export-ignore patterns don't appear as empty | ||
| directories in the archive. *) |
There was a problem hiding this comment.
But isn't this what one would expect to do? If I exclude a directory, I would expect the directory to show up, but not its contents.
Git does this but its more to do with the fact that Git doesn't track directories at all, so an empty directory is unrepresentable. But dune-release doesn't have this issue.
There was a problem hiding this comment.
My thinking is that the context here is reading a .gitattributes file, and so the expectation is to interpret it the way that git archive does (as much as possible). As a user, if some workflows used git archive and some dune release, and I had to adjust the .gitattributes file because they interpreted it differently, I would assume that dune was wrong since the name .gitattributes indicates that git "owns" the format.
| is preserved. | ||
| (** [tar dir ~exclude_paths ~export_ignore ~root ~mtime] is a (us)tar archive | ||
| that contains the file hierarchy [dir] except: | ||
| - relative hierarchies present in [exclude_paths] (basename matching) |
There was a problem hiding this comment.
I wonder if it wouldn't make more sense to just translate exclude_paths to a Gitattributes.pattern and use the more generic mechanism. That way there's only one way to ignore paths.
There was a problem hiding this comment.
That could make sense. My worry is that exclude_paths excludes based on basename, so at any depth, while Gitattributes.patterns are path-anchored. Translating exclude_paths could have a semantic change if not done exactly right. So I think such a change ought to be in a separate PR, and I would want to have a much better idea for a good way to test it than I have at the moment.
| (any chars except /), [?] (single char except /), and [**] (any path | ||
| segments, but only when adjacent to /). *) | ||
| let glob_to_re pattern = | ||
| let buf = Buffer.create (String.length pattern * 2) in |
There was a problem hiding this comment.
Why is the initial size pattern * 2?
There was a problem hiding this comment.
Dropped the * 2. It is a guess, based on the observation that each char of the pattern might expand into 2 regex chars, but the buffer grows fine so it is simpler to just skip the * 2.
| let glob_to_re pattern = | ||
| let buf = Buffer.create (String.length pattern * 2) in | ||
| Buffer.add_char buf '^'; | ||
| let len = String.length pattern in |
There was a problem hiding this comment.
Nitpick but the length of the pattern is determined twice, so could be pulled up in the function an reused.
| (* Strip UTF-8 BOM if present at start of file *) | ||
| let content = | ||
| if String.is_prefix ~affix:"\xef\xbb\xbf" content then | ||
| String.Sub.to_string (String.sub ~start:3 content) |
There was a problem hiding this comment.
And use String.length utf8_bom here.
|
|
||
| (** {1 Patterns} *) | ||
|
|
||
| type pattern |
There was a problem hiding this comment.
| type pattern | |
| type t |
| let files = | ||
| [ | ||
| ("CHANGES.md", "changes"); | ||
| ("foo.opam", "opam"); |
There was a problem hiding this comment.
| ("foo.opam", "opam"); | |
| ("foo.opam", {|opam-version: "2.0"|}); |
| in | ||
| List.fold_left | ||
| (fun acc file -> acc >>= fun () -> create_file file) | ||
| (Ok ()) files |
There was a problem hiding this comment.
I think List.iter followed by some kind of Result.List.all_unit (or the like) would be a bit easier to understand but it's of minor importance.
There was a problem hiding this comment.
Perhaps. I like the way the current formulation uses >>=/Result.bind to chain the results just like the rest of the code.
| - Double star in path: `**/build`, `src/**/test.ml` | ||
| - Path normalization: handles `./` and `../` in paths | ||
|
|
||
| **Not supported:** |
There was a problem hiding this comment.
I think it would be good to state what "not supported" means in practice. Will it error? Will it ignore the pattern?
There was a problem hiding this comment.
Good point. I have pushed a commit that detects the unsupported pattern syntax, and updated the README to distinguish such cases (that are skipped with a warning) and other limitations (that are silent).
Avoid the double negation in the predicate name used by OS.Path.fold's `Sat filter. Signed-off-by: Josh Berdine <josh@berdine.net>
The Buffer grows on demand; the 2x factor was a guess with no measurement behind it. Signed-off-by: Josh Berdine <josh@berdine.net>
Compute the pattern length once and reuse for both the Buffer size and the loop bound. Signed-off-by: Josh Berdine <josh@berdine.net>
Bind `utf8_bom` so the byte sequence appears once and the skip length is derived from it rather than a magic 3. Signed-off-by: Josh Berdine <josh@berdine.net>
Follow the standard OCaml convention of naming the primary type of a module `t`. Signed-off-by: Josh Berdine <josh@berdine.net>
Signed-off-by: Josh Berdine <josh@berdine.net>
Detect negation (`\!pattern`), escape (`\\`), quoting (`"`), and character class (`[...]`) syntax in parse_pattern and skip the pattern with a Logs.warn rather than compiling to a regex that silently matches no files. Change parse_pattern to return a t option to reflect this. Split the README "not supported" list: syntactic cases are warned-and- skipped uniformly, while case-insensitivity and subdirectory .gitattributes are documented as separate limitations. Signed-off-by: Josh Berdine <josh@berdine.net>
Parse .gitattributes files and exclude paths marked with the export-ignore
attribute from distribution tarballs. This is the mechanism used by
git-archive to exclude files, allowing projects to exclude dev-only files
like dune-workspace from releases.
Supported patterns:
Paths are normalized before matching (handles ./ and ../).
Not supported:
Testing:
Signed-off-by: Josh Berdine josh@berdine.net