Adding beman.infra as a dependency of beman.exemplar using git subtree or submodule

Hi,

I wanted to start a new thread about beman.infra; the existing discussion is here, but in
this thread I wanted to provide a brief summary of the problem, talk about some
work I’ve done to address it, and focus the discussion specifically on the git workflow
that we want to use.

The background is that the beman.exemplar repository is simultaneously:

  • A template that we want users to base new projects on
  • A collection of tooling and best practices that we are continually updating as we evolve
    our recommendations for Beman libraries

We have an issue with the fact that there’s no good way to propagate updates to our
template into projects that have already been generated from that template.

To address this, we want to try moving some parts of exemplar into a separate repository
that other projects can pull in as a dependency.

So far, we’ve built up a consensus that we don’t want to use git for dependency management
in Beman. However, in the specific case of beman.infra, I think that there is an important
technical reason to make an exception. The reason is that beman.infra is the only
dependency that we need to already be available in order to even launch a CMake configure,
because we want beman.infra to contain the CMake toolchain files. Every other dependency
can be configured by CMake itself, but with beman.infra, there’s a chicken-and-egg problem
that makes git dependency management the only solution.

I’ve put up two pull requests representing potential approaches for adding beman.infra to
beman.exemplar via git:

  • PR 157 adds beman.infra to beman.exemplar as a git subtree
  • PR 158 adds beman.infra to beman.exemplar as a git submodule

Here is my brief summary of the conceptual differences between git submodule and git
subtree.

Adding a git submodule to a repository is essentially equivalent to storing a tuple of
{URL, commit hash, subdirectory path} inside the repository. Then, various git submodule
commands are available which can reify the tuple by checking out the repository into the
specified path at the specified commit, or can update the tuple’s properties to reflect
changes to the URL, commit hash, or subdirectory path.

The principal objection to submodules is that the submodule’s subdirectory is not
automatically kept in sync with the tuple; doing so requires manually running
git submodule commands. If the user forgets to run
git submodule update --init after a git clone, the submodule’s subdirectory will be
empty; similarly, without running git submodule update after a git pull, the
submodule’s subdirectory can be outdated.
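The tuple and the empty-directory pitfall can be reproduced locally. The following is only a sketch under stated assumptions: the repositories named parent and dependency are throwaway stand-ins created in a temp dir (not the real Beman repos), and protocol.file.allow=always is passed only because recent git versions refuse local-path submodule URLs by default.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Throwaway identity so commits succeed on any machine; the protocol setting
# is only needed because this demo uses local paths as submodule URLs.
g() { git -c user.name=demo -c user.email=demo@example.com -c protocol.file.allow=always "$@"; }

# Hypothetical stand-in for the dependency (think beman.infra).
g init -q dependency
( cd dependency && echo 'toolchain' > toolchain.cmake && g add . && g commit -q -m 'Initial commit' )

# Hypothetical stand-in for the parent (think beman.exemplar).
g init -q parent
( cd parent && echo 'readme' > README.md && g add . && g commit -q -m 'Initial commit' \
  && g submodule --quiet add "$work/dependency" dependency \
  && g commit -q -m "Add 'dependency' as submodule" )

# URL and path are stored in .gitmodules; the commit hash is stored in the tree.
cat parent/.gitmodules
pinned=$(cd parent && g rev-parse HEAD:dependency)

# A plain clone leaves the submodule directory empty ...
g clone -q parent clone
before=$(ls -A clone/dependency | wc -l | tr -d ' ')

# ... until the tuple is reified with git submodule update --init.
( cd clone && g submodule --quiet update --init )
after=$(cd clone/dependency && g rev-parse HEAD)
echo "files before update: $before; checked-out commit matches pin: $after"
```

After the update, the commit checked out in clone/dependency is the one recorded in the parent's tree.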

On the other hand, git subtrees take advantage of git’s support for merging together
unrelated commit histories in order to incorporate the git commits of the dependency into
the commit graph of the parent. For example, imagine we have a git repo called parent
with the following history:

o Frobnicate widgets
|
o Reticulate splines
|
o Initial commit

And a repo called dependency with the history:

o Reinitialize enigmas
|
o Calibrate flux capacitors
|
o Initial commit

When dependency is added as a subtree, the repository’s history will look like:

o Add 'dependency/' from commit '12345abcdef'
|\
| \
|  o Reinitialize enigmas
|  |
|  o Calibrate flux capacitors
|  |
|  o Initial commit
|
o Frobnicate widgets
|
o Reticulate splines
|
o Initial commit

When a further change is made to dependency that needs to be brought in to parent,
it’s incorporated as a merge commit:

o Merge commit '54321fedcba'
|\
| \
|  o Ameliorate checksums
|  |
o  | Add 'dependency/' from commit '12345abcdef'
|\ |
| \|
|  o Reinitialize enigmas
|  |
|  o Calibrate flux capacitors
|  |
|  o Initial commit
|
o Frobnicate widgets
|
o Reticulate splines
|
o Initial commit
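A history of this shape can be reproduced in a scratch directory. This sketch deliberately uses core git commands (merge -s ours plus read-tree --prefix, the classic subtree-merge recipe) rather than the git subtree command itself, since git subtree is a contrib script that is not installed everywhere; git subtree add --prefix=dependency <repository> <ref> should produce a comparable two-parent commit. The repository names and file contents are hypothetical.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Throwaway identity so the commits below succeed on any machine.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# Hypothetical dependency with two commits of its own.
g init -q dependency
( cd dependency
  echo 'v1' > toolchain.cmake && g add . && g commit -q -m 'Initial commit'
  echo 'v2' > toolchain.cmake && g commit -q -a -m 'Calibrate flux capacitors' )

# Hypothetical parent with one commit.
g init -q parent
cd parent
echo 'readme' > README.md && g add . && g commit -q -m 'Initial commit'

# Graft the dependency's history under dependency/ as a two-parent merge.
g fetch -q ../dependency HEAD
g merge -q -s ours --no-commit --allow-unrelated-histories FETCH_HEAD
g read-tree --prefix=dependency/ -u FETCH_HEAD
g commit -q -m "Add 'dependency/' from commit $(g rev-parse --short FETCH_HEAD)"

# The dependency's commits are now part of the parent's commit graph.
g --no-pager log --graph --oneline
second_parent=$(g rev-parse 'HEAD^2')
dep_tip=$(g rev-parse FETCH_HEAD)
commits=$(g rev-list --count HEAD)
```

The second parent of the new commit is the tip of the dependency's history, which is why a rebase-only workflow cannot absorb it without rewriting.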

On the other hand, the corresponding git history of parent with submodules would look
like:

o Update 'dependency' submodule pointer to '54321fedcba'
|
o Add 'dependency' as submodule
|
o Frobnicate widgets
|
o Reticulate splines
|
o Initial commit

Although either option is an improvement on the status quo, I would prefer that we add
beman.infra to beman.exemplar as a git submodule rather than a git subtree, for the
following reasons.

Git submodules are a more natural way of maintaining a single source of truth for the
contents of beman.infra. The structure of git submodules forces every change to
beman.infra’s contents to actually live inside of the beman.infra repository, because the
submodule is always fetched from its upstream. These changes can either be on the main
branch or on a feature branch, but they must be encapsulated inside of beman.infra. By
contrast, since a git subtree would not force encapsulation of the dependency, it makes it
easier for developers to make local changes to files in their own copy of beman.infra
without contributing them back upstream, leading to potentially messy git merges every
time they update beman.infra.

Git submodules are more flexible with respect to the commit workflows they support. There
is a longstanding holy war between users of merge-based workflows and users of
rebase-based workflows; I’m on the rebase workflow side, and beman.utf_view enforces a ban
on merge commits by unticking GitHub’s “Allow merge commits” checkbox in its settings. I
personally think that merge commits make the git history much more difficult to
interpret. Using a git subtree would require me to add merge commits to my git history,
which breaks my preferred commit workflow; on the other hand, using git submodules
wouldn’t break the commit workflow of developers that prefer git merges.

The main objection to git submodules is the need to keep the submodule in sync with
git submodule update commands. I mainly see this as an education problem; it just means
that users who want to consume Beman’s git repositories need to learn how git submodules
work. I think that’s a reasonable requirement, since git is the de facto standard for
version control, with a 94% adoption rate, and almost all git users will eventually
encounter git submodules; most users are already familiar with them, and any user who
isn’t is probably going to need to learn them at some point anyway.

We would mainly just need to convey:

Run git submodule update --init after you clone, pull, or checkout.
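One possible way to reduce the manual step, offered as a sketch rather than as part of the proposal: git has a submodule.recurse setting that makes commands such as pull, checkout, and switch behave as if --recurse-submodules had been passed. It does not affect git clone and does not initialise a submodule that was never initialised, so the first git submodule update --init is still needed. The scratch repository below is hypothetical.

```shell
set -eu
repo=$(mktemp -d)
git init -q "$repo"

# Per-repository opt-in; use --global instead to apply it to every repository.
git -C "$repo" config submodule.recurse true

recurse=$(git -C "$repo" config --get submodule.recurse)
echo "submodule.recurse = $recurse"
```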

That’s sufficient to consume the library, although it gets more complicated if you want to
contribute your own update to the submodule; but not more complicated than the git subtree
contribution workflow.

Finally, the git subtree workflow is dissimilar to the more typical dependency workflows we use elsewhere. If my library depends on gtest, I specify in CMake that I want gtest to be made available to my library, or do something like adding it to a package manager lockfile, but I don’t incorporate gtest’s entire git history into my repo.

Thanks for reading, and let me know what you think.

However, in the specific case of beman.infra, I think that there is an important
technical reason to make an exception.

That’s a good point, but the feature is specifically for FetchContent workflows. In packaging workflows, you want to use the provided toolchain file, which is often generated and provided for you already tuned to the selected profile, etc.

As to the question at hand, I am flexible other than wanting to see a decision be made reasonably quickly.

That’s because I expect all of this needs to go at some point since implementing dependency management in git is, like implementing dependency management in CMake, a local maximum. I want to see some of us develop some tool that can see a build requirement on beman.infra at a particular version, download that released tarball, and wire it up to the current environment using standard hooks like PATH, CMAKE_TOOLCHAIN_FILE, CMAKE_PREFIX_PATH, etc. That is, something in the vein of pipenv or uv run or something like that. That approach would allow solutions to all of the different downsides Eddie mentions.

Some day. For now, let’s be productive, but let’s not get too dogmatic and trap ourselves in a local maximum.
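To make the "wire it up to the current environment" idea concrete, here is a minimal sketch of the environment-wrapping half of such a tool. Everything in it is an assumption for illustration: the function name run_with_infra, the bin/ and cmake/toolchain.cmake layout of the unpacked release, and the use of ':' as the path separator (Unix only). The download and version-resolution steps are left out entirely. Reading CMAKE_TOOLCHAIN_FILE from the environment needs a reasonably recent CMake (3.21 or later).

```shell
set -eu

# Hypothetical layout of an already-unpacked infra release:
#   <root>/bin                      helper executables
#   <root>/cmake/toolchain.cmake    default toolchain file
run_with_infra() {
    root=$1
    shift
    # env modifies the environment of the child command only; the caller's
    # own environment is left untouched.
    env \
        PATH="$root/bin:$PATH" \
        CMAKE_TOOLCHAIN_FILE="$root/cmake/toolchain.cmake" \
        CMAKE_PREFIX_PATH="$root${CMAKE_PREFIX_PATH:+:$CMAKE_PREFIX_PATH}" \
        "$@"
}

# Demonstration with a scratch directory standing in for the unpacked tarball.
infra_root=$(mktemp -d)
mkdir -p "$infra_root/bin" "$infra_root/cmake"
seen=$(run_with_infra "$infra_root" sh -c 'printf %s "$CMAKE_TOOLCHAIN_FILE"')
echo "child saw CMAKE_TOOLCHAIN_FILE=$seen"
```

In real use the wrapped command would be something like cmake -B build -S . rather than the sh -c probe used here.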

I wanted to capture the discussion that we had on this subject over the lunch break at
C++Now.

The question of whether to use git subtree vs git submodule received relatively little
discussion, and git submodule got consensus.

The main topic of discussion was whether we should use git submodule or the approach
that Bret suggested above in this Discourse thread: adding a tool, potentially in the
form of a bash script or even a CMake script, that implements a more comprehensive
dependency management solution and acquires the dependency on infra using the same
packaging machinery that acquires all the other dependencies.

This would require changes to the existing CMake workflow. Setting aside presets for now,
the existing git-ops way to manually clone and build exemplar, as documented by README.md,
is:

git clone https://github.com/bemanproject/exemplar.git
cd exemplar
cmake -B build -S . -DCMAKE_CXX_STANDARD=20
cmake --build build

The git submodule approach preserves this workflow with the only difference being that the
git clone can be changed to git clone --recursive to check out the infra submodule.

However, the script approach would require executing that script before performing the
main CMake invocation, which I objected to on the basis that Beman novices would have more
preexisting familiarity with the CMake workflow that we use currently.

Bret raised concerns about the way that packaging tarballs would work under the submodule
approach. The status quo is that the release tarball is an exact copy of the contents of
the exemplar git repository, but under the submodule approach, we would need to augment
the tarball with the contents of the infra submodule as well. This discrepancy raised
concerns that it could result in potential supply chain vulnerabilities.

For example, when a malicious version of xz was released, the release tarball was the
result of a build process that was itself compromised, so even though the vulnerability
was not visible in the git repository’s C source files, malicious code ended up in the
tarball anyway because the compromised build process was able to inject malware disguised
as unit test data. Under a model where the release tarball is a simple snapshot of the
repository, this kind of transformation wouldn’t be possible.

I didn’t have an immediate answer to this concern but I agreed to look into potential
solutions.

Ultimately, the group that met at lunch agreed that we had consensus to go ahead with the
submodule approach as long as I updated my pull request on Wednesday to ensure that
release tarballs of beman.exemplar also include infra, which my original pull request
hadn’t accounted for.
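The tarball discrepancy is easy to see locally: git archive records a submodule path as an empty directory. The sketch below shows one possible way to build a release tarball that also contains the submodule at its pinned commit. It is not the release workflow from the pull request; the repository names, the staging approach, and the release.tar.gz name are all assumptions for illustration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
g() { git -c user.name=demo -c user.email=demo@example.com -c protocol.file.allow=always "$@"; }

# Hypothetical stand-ins for infra (dependency) and exemplar (parent).
g init -q infra
( cd infra && echo 'toolchain' > toolchain.cmake && g add . && g commit -q -m 'Initial commit' )
g init -q exemplar
( cd exemplar && echo 'readme' > README.md && g add . && g commit -q -m 'Initial commit' \
  && g submodule --quiet add "$work/infra" infra \
  && g commit -q -m "Add 'infra' as submodule" )

# A plain snapshot of the parent contains no submodule files.
plain=$(g -C exemplar archive HEAD | tar -tf - | grep -c 'toolchain.cmake' || true)

# Stage the parent snapshot, then overlay the submodule at its pinned commit.
stage=$(mktemp -d)
pinned=$(g -C exemplar rev-parse HEAD:infra)
g -C exemplar archive --prefix=exemplar/ HEAD | tar -xf - -C "$stage"
g -C exemplar/infra archive --prefix=exemplar/infra/ "$pinned" | tar -xf - -C "$stage"
tar -czf "$work/release.tar.gz" -C "$stage" exemplar
tar -tzf "$work/release.tar.gz"
```

Because both inputs come from git archive at recorded commits, the tarball can in principle be regenerated and compared against the repositories, which is the property the snapshot model is meant to preserve.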

If this is too hard, I think we need to quit and become farmers:

./bx cmake -B build -S . -DCMAKE_CXX_STANDARD=20

We also have the upside of designing a bx.json that sets the C++ standard for you, etc. If we want that.

Thanks everyone for all the engagement in the discussion at C++Now today. I wanted to be sure to share the links for the content I shared:

Slides:

Changes to infra in my fork:
Comparing bemanproject:main…bretbrownjr:main · bemanproject/infra

You can copy bx from that repo to wherever you want; your git work area, a temp dir added to PATH, or ~/.local/bin could be interesting locations.

I also renamed lockfile.json to bx.lockfile.json, though we could always undo that change. I’ll get a fork up of exemplar that includes a copy of bx and the newly named lockfile for you all to look at, but if you don’t want to wait, it’s not hard to do that rename for yourself.

And to caveat, this was just a strong proof-of-concept; there are a lot of cool things we can do from here, so teamups are warranted and appreciated.

Thanks! Here’s my content from this morning as well:

Slides: http://www.ednolan.com/toolchains_slides.pdf

beman.exemplar fork feature branch for git subtrees (option 1): GitHub - ednolan/exemplar at enolan_infrasubtree3

beman.exemplar fork feature branch for git submodules (option 2): GitHub - ednolan/exemplar at enolan_infrasubmodule1

beman.exemplar fork feature branch for bemanmodules (option 3): GitHub - ednolan/exemplar at enolan_bemanmodule1

Experimental release workflow: exemplar/.github/workflows/release.yml at enolan_releaseaction1 · ednolan/exemplar · GitHub

I am sorry if this comes across as rude, but in my opinion the whole discussion about infra is fundamentally flawed.

Sorry, but, no. The only things that you need to build any CMake project are CMake and that project. Packagers will use the existing toolchain that they package for. Clients that want to use a Beman project will want to use the toolchain of the project that they want to integrate it into. Developers of Beman projects will want to use the local development environment that they already use to build any other CMake project. There is simply no need at all for Beman projects to contain any kind of additional infrastructure, neither directly nor indirectly. The whole discussion of how to pull in the infrastructure, whether via a submodule, a subtree, or a script, is moot.

Sure, but sometimes, CMake code is shared by several repositories. The question is where to place it.

Some developers, yes, will have something like this set up in their, e.g., CMakeUserPresets.json file.

In my experience, a significant number of developers don’t have something like this set up and would be happy to get decent presets that simplify building and testing the project. That was the intent of CMakePresets.json. Developers don’t have to use this, but it’s there for folks who don’t want to think about what set of sanitizer and C++ revision flags are best for a dev build.

Ideally, all code that is useful for multiple CMake projects should be upstreamed to CMake.

The CMakeUserPresets.json feature is not really well thought out. It cannot be used to share settings across projects, as it has to be placed in the project’s root directory. By “local development environment”, I meant things like an editor or IDE with a set of plugins. I assume that most if not all developers have a highly opinionated setup like that in place. Some share their settings as dotfiles.

Wait, the discussion here is about moving “stuff” to a beman.infra repository and whether that “stuff” should be managed as a submodule, subtree, or provisioned with a wrapper tool bx. A single file that has to be placed in the root directory cannot be managed like that. We really need to come to an agreement on what we actually mean by that “stuff”.

Duplication I’m observing:

  • CMake toolchain files for local development workflows
    • Note: Conan, vcpkg, and other situations come with their own toolchain files, so they’re not all of the toolchain files
  • [To be created] CMake modules to reduce boilerplate in CMakeLists.txt
    • Creation of these modules is blocked on a solution to this problem
    • Example: Trivial specification of tests that validate correct compile failures
    • Example: Reducing boilerplate and future-proofing Beman libraries against future packaging expectations (*-config.cmake files now, CPS files later)
    • Example: Assuming a consistent enough project structure, we could have around three or four statements in a top-level CMakeLists.txt including a project_is_beman() declaration and that’s it.
  • CI configuration details
  • Documentation
    • Build instructions shouldn’t and really don’t vary across projects
  • CMake preset settings
  • clang-tidy, clang-format, pre-commit, etc. settings

Basically, if it’s not C++ code, tests for that code, docs unique to that code, and other project-specific details, I’m hoping for ways to get it out of each individual repo or (for clang-format and pre-commit files) at least have tools that automatically update relevant files on demand. Other than the obvious reduction in complexity and sharing of work across repos, this will make it easier to make exemplar into a project template, and it will probably make it easier for non-Beman projects to use Beman best practices if they wish.

I’ll take what I can get in the meantime I suppose, but the toolchain files are a small part of the complexity for what it’s worth.

Been there, done that: ryppl-cmake/Modules at master · ryppl/ryppl-cmake · GitHub

This had cmake commands for nearly all of the things listed under “to be created” and even more.
Note that this was before CMake had builtin support for usage requirements, precompiled headers, export sets, co-compiling, and presets.

After maintaining those workarounds in CMake code for two years, I realized that functionality like this is better upstreamed. Five years later, I presented how Boost can use CMake without any custom commands (see “Effective CMake” at C++Now 2017).

If the desire for custom beman_ commands comes up, I get that the next generation of developers needs to go through the same pain and eventually will reach the same conclusion.

I presented “Modern CMake Modules” at CppCon 2019. It demonstrated how to package and depend on CMake modules like these. We’ve had wild success maintaining tens of thousands of projects with modules like these. Though to make this work, Beman will want to ship anything mentioned from CMakeLists.txt via all interesting packaging systems.

And, yes, I would like to upstream everything and eliminate the need for all of this. But like everything else, it’s best to standardize existing and widespread practice whenever possible.

It doesn’t have to be one or the other. Bloomberg has successfully upstreamed internal practice via CMAKE_COMPILE_WARNING_AS_ERROR, CPS development, among other things. Beman should do likewise.

My pull requests for beman_module.py are up.

For infra: Add toolchain files and beman_module.py by ednolan · Pull Request #13 · bemanproject/infra · GitHub

For exemplar: Add infra to exemplar as a beman_module by ednolan · Pull Request #162 · bemanproject/exemplar · GitHub

Documentation for the new script can be found here: infra/beman_module at enolan_bemanmodule4 · bemanproject/infra · GitHub

I don’t know why I didn’t bring this up at today’s sync.

But I don’t think we should use beman.infra for this; instead, we should create another repo that holds the common files that we want to copy to other repos, maybe something like beman.commons. Common files (which are what we want to ship) are not necessarily the same as all “infra” files.

Currently beman.infra hosts all the scripts for container generation, and I don’t think we will need these files in all repos. We will likely need to set up infra for other scripts that don’t need to be shipped to all repos but are needed org-wide for Beman.

I can also move my container generation files out. But this distinction still persists.

I don’t love the name “beman.common” since it could have many interpretations, but I agree that it makes sense to split these things out.

A couple more name ideas: beman.cpp_library_boilerplate, beman.cpp_build_harness

My take is that “infra” stays “infra” but all the Docker machinery gets moved into a separate “images” repository.

That sounds good to me.

Another idea would be a sparse checkout.
But idk how portable sparse checkout is.

Update: These have now been merged. The script has been renamed beman-submodule to avoid confusion with C++ modules.