Leniency of beman-tidy documentation checks

During the review of the pull request to the Beman standard resulting from the earlier discussion of the formatting of the license file, there was a discussion about the wording of a new rule that the type of license must be specified in the README.md.

The original wording I proposed was the following:

REQUIREMENT: The README.md must contain a section called “License” which specifies the license or licenses used by the library. A non-exhaustive list of examples follows.

## License

beman.exemplar is licensed under the Apache License v2.0 with LLVM Exceptions.

/* further examples… */

After code review, this was modified to:

REQUIREMENT: Following the library status line and a new line, the README.md must have a LICENSE section.

Use exactly the following format:

## License

beman.exemplar is licensed under the Apache License v2.0 with LLVM Exceptions.

<!-- Other optional mentions go on this line ... >

My reasoning for the less restrictive rule is this:

I find this situation analogous to the strategy I use to unit test log messages. Say I have this function:

// Returns a string containing the file contents or, if the file cannot be accessed,
// logs the error and returns std::nullopt
std::optional<std::string> slurp(std::filesystem::path const&, logger&);

I can unit test the error logging in one of two ways: either,

EXPECT_EQ(logger.messages[0], "failed to slurp file with: No such file or directory");

or:

EXPECT_THAT(logger.messages[0], testing::HasSubstr("No such file or directory"));

The second approach is better because the only thing we specifically care about is that the log message propagates the strerror message from the operating system, not the phrasing of the text around it. If someone reimplements the function such that the error message is “file slurp failed” instead of “failed to slurp file,” they shouldn’t need to update the unit test.

I think this README rule should be treated similarly. I don’t think we should be failing beman-tidy if someone’s README file says, “The license of beman.foo is the Apache License v2.0 with LLVM Exceptions,” instead of, “beman.foo is licensed under the Apache License v2.0 with LLVM Exceptions.” The important thing is just that the “License” section exists and the string “Apache License v2.0 with LLVM Exceptions” is present-- that’s what beman-tidy should check for here.
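As a rough sketch of what that relaxed check could look like (this is not beman-tidy's actual implementation; `has_license_section` is a name made up for illustration, and it is written in C++ only to match the snippets above):

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not beman-tidy's real code): true when the README has a
// "## License" heading and some line of that section mentions license_name.
// Any later heading line is treated as the end of the section.
bool has_license_section(const std::string& readme, const std::string& license_name) {
    static const std::regex license_heading(R"(##\s+License\s*)");
    std::istringstream lines(readme);
    std::string line;
    bool in_license_section = false;
    while (std::getline(lines, line)) {
        if (std::regex_match(line, license_heading)) {
            in_license_section = true;
        } else if (!line.empty() && line.front() == '#') {
            in_license_section = false;
        } else if (in_license_section && line.find(license_name) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

With a check of this shape, both phrasings quoted above would pass, while a README that only mentions the license outside a “License” section would still fail.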

I think a couple of the other rules like README.IMPLEMENTS and README.LIBRARY_STATUS could possibly be similarly relaxed-- and I think determining what principle we should use here might become more important as we develop our documentation system further, so I wanted to open a discussion here.

Thanks for the great topic @ednolan! I’m trying to understand the tradeoffs here. Am I missing anything?

Strict Consistency Benefits:

  • Reduced cognitive cost for a developer trying to figure out how to word something.
  • Simplified and more accurate tooling.
  • Less variability in wording quality.
  • Easier to make cross-Beman changes.

Lax Consistency Benefits:

  • Authors enjoy more freedom of self-expression
  • Exceptional cases are better handled. (e.g. in the case where someone wants to call out different parts of the library falling under different licenses)
  • More opportunities to innovate on a smaller scale, which can then gain organic adoption elsewhere.

That sounds exactly right to me.

During the review of the pull request to the Beman standard resulting from the earlier discussion of the formatting of the license file, there was a discussion about the wording of a new rule that the type of license must be specified in the README.md.

I hope everything is clear now: my review was strictly about applying the principles currently used when writing BEMAN_STANDARD.md and its rules. IMO, beman-tidy should not be relevant here. My understanding and expectation is that I would check the same things if somebody asked me to manually check whether beman.$new_library is Beman Standard compliant (e.g., I expect the exact format for the README file). So manual and automated checking should give the same result.

I think determining what principle we should use here might become more important as we develop our documentation system further,

If we want to change the principles, that’s another story.

Strict Consistency Benefits:

  • Reduced cognitive cost for a developer trying to figure out how to word something.
  • Simplified and more accurate tooling.
  • Less variability in wording quality.
  • Easier to make cross-Beman changes.

Lax Consistency Benefits:

  • Authors enjoy more freedom of self-expression
  • Exceptional cases are better handled. (e.g. in the case where someone wants to call out different parts of the library falling under different licenses)
  • More opportunities to innovate on a smaller scale, which can then gain organic adoption elsewhere.

I agree with this comparison, BUT do note that we have some special cases IMO.
We have 3 types of docs in Beman:

  1. Library docs: what an author puts inside docs/ in the library’s repo. Here we only mandate that they live in docs/; there is NO other rule at the moment.
  2. Root README.md inside the library’s repo.
    2.1. There are some sections where we currently mandate the content (e.g., SPDX identifier, followed by title, followed by short library description, followed by implements line, followed by status line, and now followed by the License section, of which only the first line is mandated).
    2.2. The rest of the sections in the README are free-form.
  3. Docs in beman/repo - e.g. BEMAN_STANDARD.md

IMO, 2.1. and 3. must fall into the Strict Consistency category, because we must not allow basic and common data, like the implements line, to have a divergent format across libraries. Think about how bad it would be when the project scales.

A second argument:

  • 2.1. does not change too often! Most likely it is only written at bootstrap, and later someone maybe adds a new badge or changes the library status.
  • 3. does not change too often, and not by everyone. If people don’t know the strict rules for these docs, review will solve the issue.

TLDR: Docs which do not affect lots of library devs OR do not change very often → Strict consistency. Other docs can be in the relaxed category, depending on their purpose.

Can you elaborate more on the problems you foresee that would stem from that divergence? From my perspective, the point of the “implements” line is to communicate to human readers which paper the repository implements, and as such, as long as people are able to understand what the line is saying, it shouldn’t matter if the format diverges between libraries.

I don’t think we currently have any use case for the “implements” line of the readme being intended for machine consumption rather than human consumption, other than beman-tidy itself, so I don’t think it qualifies as “data,” just documentation.

Can you elaborate more on the problems you foresee that would stem from that divergence? From my perspective, the point of the “implements” line is to communicate to human readers which paper the repository implements, and as such, as long as people are able to understand what the line is saying, it shouldn’t matter if the format diverges between libraries.

Well, I can give you an example from outside the README file, where people actually don’t care about repo rules. Over the last few months I was fixing repo settings myself, and I saw lots of strange configurations, repos without an admin, etc. This is how I came to propose this: Repos inside Beman org don't require code review - let's enforce that · Issue #157 · bemanproject/beman · GitHub. Now it’s also easier for library owners and org admins to track this kind of issue. Strict consistency makes it possible to write a bot.

I don’t think we currently have any use case for the “implements” line of the readme being intended for machine consumption rather than human consumption, other than beman-tidy itself, so I don’t think it qualifies as “data,” just documentation.

I think we should have even more automation and bots. For example, a CI workflow that runs weekly (or on whatever schedule) and generates a PR for the website, adding new libraries to the library table (Beman Libraries | The Beman Project). That would use the status line. In the same way, we may add paper names to that table, also automatically pulled from root README.md files. Strict consistency makes it possible to write a bot/workflow.

TLDR: Just because something is not automated today, we should not assume we won’t need it to be. I propose that all relevant sections (e.g., 2.1) be fed directly into tooling/bots.
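To illustrate why a fixed format makes such a bot easy to write, here is a minimal sketch. It assumes, purely for illustration, that each README carries lines of the form `**Implements**: ...` and `**Status**: ...`; the helper name `readme_field` is hypothetical and not part of any existing Beman tooling.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: returns the text that follows a fixed "**Label**: "
// prefix on the first README line that starts with it, or std::nullopt when
// no line uses exactly that format.
std::optional<std::string> readme_field(const std::string& readme, const std::string& label) {
    const std::string prefix = "**" + label + "**: ";
    std::istringstream lines(readme);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            return line.substr(prefix.size());
        }
    }
    return std::nullopt;
}
```

A workflow could call `readme_field(readme, "Status")` for every library and fill in the table. A free-form sentence such as “This library implements ...” would return `std::nullopt`, which is exactly the case a strict rule rules out.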

we may add paper names to that table, also automatically pulled from root README.md files. Strict consistency makes it possible to write a bot/workflow.

I’m skeptical of this idea because:

  • I don’t think a README.md file is a good place to store machine-readable information;
  • I don’t think that updating the library table on the website is necessarily a good use case for automation (how much time will we spend maintaining the bot vs just manually updating the table?);
  • I don’t think we should be restricting our README.md files for the sake of a use case that’s currently hypothetical.

I don’t think a README.md file is a good place to store machine-readable information;

We had a similar discussion at some point. We wanted to avoid duplicate info, and the info was kept inside README.md. I don’t have a strong opinion; I think README.md is a decent place to store this info. Creating another config file would be a bigger effort for library dev, IMO.

I don’t think that updating the library table on the website is necessarily a good use case for automation (how much time will we spend maintaining the bot vs just manually updating the table?);

Nobody will actually remember or know to update that table. A PR review, on the other hand, you will do, because you get a notification from the bot. With strict consistency, the bot will not need to change at all after v1. I am volunteering to write the bot and maintain it. I won’t go and manually check whether or not we have a new library in the repo. (Again, let’s think 1-2 years into the future.)

I don’t think we should be restricting our README.md files for the sake of a use case that’s currently hypothetical.

That’s false, we have a WIP issue - Add automatization for importing the docs · Issue #51 · bemanproject/website · GitHub.
This PR was 100% generated with automation: Beman docs sync 2025-07-19 by neatudarius · Pull Request #96 · bemanproject/website · GitHub. It’s just the first step, I know, but generating the library tables is a must-have before running this as a job on CI. Again, the end result is a PR generated by CI automation, which will in the end be reviewed by the website owners.

@ednolan, I would not like to focus the discussion only on automation. It’s a thing I really want, yes.

But, as a library user, I expect consistency inside the project. Why should the implements line ever be different, or in a different position, between beman.library1 and beman.library2? It’s not rocket science, but it should be trivially readable also for the human user. IMO, we should require a minimal effort from library owners to make reading easier for all other members and external users; otherwise I don’t know what the purpose of the Beman Standard document is :smiley:

Creating another config file would be a bigger effort for library dev, IMO.

We have a third option: if we don’t have the bot, we don’t need the config file or the machine-readable README.

but it should be trivially readable also for the user

There are many trivially readable alternative spellings of the license section; “beman.foo is licensed under the MIT License” and “The license used by beman.foo is the MIT License” are both trivially readable, but the second one is rejected currently. Why do we care? Neither of those takes more effort to read; the effort comes in when I write the second one in my README file and then get an email that CI failed and have to figure out why.

Consistency can be a useful property but it carries downsides in this case. I would prefer to have a justification other than just consistency itself for why we should enforce this.

I would not like to focus the discussion only on automation.

I’ll leave it alone then, but in summary I just think that all this automation is missing the point if it’s aimed at making life easier for the maintainers of the beman project generally (like making it easier to update the website) as opposed to the maintainers of the actual libraries themselves and their users.

We have a third option: if we don’t have the bot, we don’t need the config file or the machine-readable README.

That’s not an argument, IMO. We still need a viable solution for automatic docs updates, etc. Not having automation at all will lead to a lack of control and to docs going out of sync. Call that software whatever you like; in the long run there must be a script that updates the docs!

There are many trivially readable alternative spellings of the license section; “beman.foo is licensed under the MIT License” and “The license used by beman.foo is the MIT License” are both trivially readable, but the second one is rejected currently

README.LICENSE can be changed to “contain one of a few possible strings”. That’s still consistency IMO, because if you don’t match one of those few substrings exactly, you still fail the check. I’m not sure whether you care mainly about this specific example, or about the entire BEMAN_STANDARD.md / root README.md format.
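A minimal sketch of such a “few possible strings” check follows. The list of accepted names is only an assumption built from the licenses mentioned in this thread, and `mentions_accepted_license` is a hypothetical name, not an existing beman-tidy function:

```cpp
#include <algorithm>
#include <array>
#include <string_view>

// Assumed allowlist for illustration only; the real set of accepted license
// names would have to be fixed in BEMAN_STANDARD.md.
constexpr std::array<std::string_view, 2> accepted_licenses = {
    "Apache License v2.0 with LLVM Exceptions",
    "MIT License",
};

// True when the text of the License section contains at least one accepted
// license name verbatim, regardless of the wording around it.
bool mentions_accepted_license(std::string_view license_section) {
    return std::any_of(accepted_licenses.begin(), accepted_licenses.end(),
                       [&](std::string_view name) {
                           return license_section.find(name) != std::string_view::npos;
                       });
}
```

The check stays strict about the license name itself while leaving the surrounding sentence to the author.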

I’ll leave it alone then, but in summary I just think that all this automation is missing the point if it’s aimed at making life easier for the maintainers of the beman project generally (like making it easier to update the website) as opposed to the maintainers of the actual libraries themselves and their users.

I disagree with you. We already have cookiecutter and exemplar in an amazing state! (Thanks Eddie).

  • Because of this, the problems you mention don’t exist for new libraries, IMO, as you get everything from scratch for free!
  • Existing libraries will be updated to be Beman Standard compliant before enabling beman-tidy on CI.

beman-tidy running on CI will make sure you always get small and incremental reports afterwards, for both cases.

Summary: Most of the time the actual library maintainers have no work to do and lose no time figuring out what to do about these conventions, because strict consistency is already applied in the exemplar template and beman-tidy! This is the only reason we have exemplar as a template and the BEMAN_STANDARD.md doc trying to be relevant.

My perspective is that we already help this category of maintainers (and it’s an amazing state, not only a good enough one). We should not block others. And again, we are talking about targeted strict consistency (not all docs, not the entire root README.md).


On a separate topic, not only for docs, we should do automation for anything that can be done IMO. There may be exceptions, but I think it should be a strong principle for us in trying to achieve both control and quality. Of course, any flow should have a review or some human control from time to time. We won’t create tens of tools, but we will probably create tens of libraries (I hope) - lacking automation is a no-go for scaling, IMO. And lacking automation from an early stage is also a big mistake.

I would also like to point out that quality should be included as a property in this discussion. A lack of automation for some flows will lead to poor quality in docs, build systems, site updates, and libraries. The quality of the docs and the website should be important to the Beman org, and understood by library owners. At least, my hope is that it is a requirement at the org level ([CORE.QUALITY] Highest quality. Standards track libraries impact countless engineers and, consequently, should be of the highest quality. beman/docs/BEMAN_STANDARD.md at main · bemanproject/beman · GitHub), which library authors must understand and ensure. Docs are as important as code - maybe we should state that explicitly.

IMO we should aim for the highest quality code + docs + whatever we have around them, with a high level of control over every aspect and process in this org, while trying to generate minimal overhead for maintainers: NOT zero, but close to zero. Putting only a few fixed things in the README, or in CMakeLists.txt, is the same topic for me, because either way we restrict the library authors! The details do not matter; they will be unhappy if that’s a real problem. We provide tools and automation to make this problem disappear.

I shouldn’t have said “all this automation” earlier-- I want to be clearer that I think that what beman-tidy does is an excellent way of enforcing almost all the existing rules in the beman standard, and my issues are pretty narrowly focused.

README.LICENSE can be changed to “contain one of a few possible strings”. That’s still consistency IMO, because if you don’t match one of those few substrings exactly, you still fail the check.

This would satisfy my complaint about this rule. I care much less about changing the status quo for the other [README.*] rules.

On a separate topic, not only for docs, we should do automation for anything that can be done IMO.

I agree that automation is usually a good thing, and I have very few complaints about the automation that’s been added so far. The only things I’ve found questionable so far were the README.LICENSE rule and the bot to update the website table. My general worry is that by saying we always want automation, we’re going to wind up exceeding the breakeven timespans in the table from the XKCD comic “Is It Worth the Time?”.

Now that I’m actually looking at the table, though, if adding a Beman library to the table on the website takes 5 minutes of time, and we add a new Beman library every week, then the entry says this is worth 21 hours of time across 5 years, so maybe that does make sense to automate by that metric. On the other hand, that may turn out to be an optimistic pace of library addition-- these are early days, so we don’t have a good sense yet.
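Spelling out that estimate as a quick sanity check, with the 5 minutes per update and one new library per week taken as the assumptions stated above:

```latex
5\ \text{min/week} \times 52\ \text{weeks/year} \times 5\ \text{years}
  = 1300\ \text{min} \approx 21.7\ \text{hours}
```

That is in line with the 21 hours given in the table.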

We still need a viable solution for automatic docs updates, etc.

I agree with that-- this should be a high priority. What I’ve seen so far on this front is the Docusaurus system that @tzlaine started working on here.

I shouldn’t have said “all this automation” earlier-- I want to be clearer that I think that what beman-tidy does is an excellent way of enforcing almost all the existing rules in the beman standard, and my issues are pretty narrowly focused.

I am actually talking about all automation. We should have lots of flows automated.

This would satisfy my complaint about this rule. I care much less about changing the status quo for the other [README.*] rules.

I understand. I do care a lot about keeping the status quo for README.*. As a person who actually checks all repos manually from time to time, I’m saying that not having trivial stuff like README.* in a standard form would be a pain.

Now that I’m actually looking at the table, though, if adding a Beman library to the table on the website takes 5 minutes of time, and we add a new Beman library every week, then the entry says this is worth 21 hours of time across 5 years, so maybe that does make sense to automate by that metric. On the other hand, that may turn out to be an optimistic pace of library addition-- these are early days, so we don’t have a good sense yet.

It’s not all about the actual duration of tasks. It’s also about someone remembering at precise points in time that we need to do them. Again, my job as an org admin is to manually do such tasks from time to time. And it’s not predictable, because I’m not always aware that a new library was added. Having automation that creates an issue or a PR and tags me/leads on GitHub is gold! If this automation is in place and working, it will ping 1-5 people, who can easily understand the reported issue, review the proposed PR and merge it! (without specific skills or knowledge about the infrastructure or automation implementation details)

I agree with that-- this should be a high priority. What I’ve seen so far on this front is the Docusaurus system that @tzlaine started working on here.

I think that’s unrelated. My understanding is that Zach will only handle C++ code to API docs generation. I’m talking about docs and website subsets generated from non-C++ code/flows - e.g., library table, repository settings, etc.

TLDR: @ednolan, I think you are seeing things only from a library author’s point of view. We also have the need for manual or automated joint flows, where IMO it is impossible to try to stay up to date with more than tens of libraries. I have been testing the manual approach for the last year, and it does not work - we need automation for quality and control at the org level!

We had a quick look at this topic in the Beman Weekly Sync and proposed to resume it when both Eddie and I are present. We’ll try to resume next week. Meanwhile, we welcome more replies here.

FYI @ednolan @project-leads

Hey everyone, a quick and friendly reminder to keep our discussions constructive. It’s fantastic that we have people with different perspectives here, like library authors and users. Let’s make sure we explore these viewpoints collaboratively.

Instead of labeling a person’s perspective (e.g., ‘You’re only seeing it this way’), let’s try to focus on the ideas. For example: ‘That’s a valid point from an author’s view. How would it look from a user’s perspective?’

This helps keep the conversation welcoming and productive for everyone. Thanks!

Sorry, the labeling wasn’t intended. @ednolan, I apologize. I just got carried away by the discussion and wanted to present some different points of view.

No offense taken, I understood that your intended meaning was more along the lines of how David phrased it. Sorry I couldn’t make the last weekly sync, looking forward to continuing the discussion at the next one.