Most grant scoring rubrics are theatre.
That is not a slight against the people who designed them. It is an observation about what rubrics actually do in practice versus what they are supposed to do. A rubric is supposed to produce consistent, defensible, differentiated scores across a pool of applications. Most do not. They produce a trail of numbers that arrives at a conclusion the assessors had already reached — and then file that trail as evidence the process was rigorous.
The uncomfortable question for any grants director is not "do we have a rubric?" Almost everyone has a rubric. The question is: does your rubric actually drive decisions, or does it document them after the fact?
A scoring rubric creates the appearance of systematic decision-making. Criteria are listed. Weights are assigned. Numbers are entered. A total emerges. The highest-scoring applications are funded.
Except — look closer at how assessors use these criteria in practice. When criteria are defined too loosely, assessors don't score what the criterion describes. They form an overall impression of the application, then distribute scores across the criteria so the total matches that impression. The rubric is reverse-engineered from the conclusion, not used to reach it.
This happens because most rubric criteria are written at the level of concepts rather than evidence. "Community benefit" is a concept. "Evidence of how the programme will reach people who do not currently access this type of service" is an evaluable criterion. The first invites holistic scoring. The second requires the assessor to look for something specific — and if it isn't there, the score should reflect that.
The illusion is reinforced by the fact that rubrics rarely fail visibly. When two assessors both give an application 7.2 out of 10 on "impact potential," the rubric appears to be working. Nobody asks whether they were looking at the same evidence or whether they would have scored it the same way if the applicant's organisation had a different reputation.
Inter-rater reliability is a measure of how consistently different assessors score the same application using the same criteria. In research and clinical settings, it is calculated routinely. In grants assessment, it is almost never measured.
This is remarkable given what it means. If two experienced assessors scoring the same application produce scores that differ by more than 20–25%, the rubric is not producing reliable assessments — it is producing two different opinions that happen to be expressed as numbers. Aggregating those numbers produces a score with false precision. Ranking applications by that score and funding the top three produces a decision that is not meaningfully more defensible than selecting by committee discussion.
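The comparison itself is not complicated. Here is a minimal sketch in Python, using hypothetical scores and a hypothetical threshold of 20% of the scale, that flags every assessor pair whose scores on the same criterion of the same application diverge by more than that threshold:

```python
from itertools import combinations

# Hypothetical scores: {application: {criterion: {assessor: score out of 10}}}
scores = {
    "APP-014": {
        "impact": {"assessor_a": 7.0, "assessor_b": 7.5, "assessor_c": 4.5},
        "innovation": {"assessor_a": 6.0, "assessor_b": 6.5, "assessor_c": 6.0},
    },
}

SCALE_MAX = 10
THRESHOLD = 0.20 * SCALE_MAX  # flag pairs differing by more than 20% of the scale

def disagreement_report(scores, threshold=THRESHOLD):
    """List assessor pairs whose scores on the same criterion differ by more than the threshold."""
    flags = []
    for app, criteria in scores.items():
        for criterion, by_assessor in criteria.items():
            for (a1, s1), (a2, s2) in combinations(by_assessor.items(), 2):
                if abs(s1 - s2) > threshold:
                    flags.append((app, criterion, a1, a2, abs(s1 - s2)))
    return flags

for app, criterion, a1, a2, gap in disagreement_report(scores):
    print(f"{app} / {criterion}: {a1} vs {a2} differ by {gap:.1f} points")
```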
The reason inter-rater reliability is rarely measured is that measuring it would require using the same application as a calibration instrument across all assessors before the round opens. Few organisations do this. It takes time. It requires a willingness to find out that the rubric is not working. And it requires doing something about it when it isn't.
The consequence is that most organisations are making substantial funding decisions without ever having tested whether their assessment instrument is measuring anything consistent.
Weighted criteria are better than unweighted criteria, but only if the weights correspond to something real in how the assessment is conducted.
Consider a rubric where innovation is weighted at 40% and methodology at 30%. What does that mean for assessors? It should mean that if two applications are equal on every other dimension, the one with stronger innovation should rank higher even with weaker methodology. Assessors should be aware, when they are scoring, that their innovation score matters more.
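To make the arithmetic concrete, here is a minimal sketch with hypothetical weights and scores, showing that under a 40/30 weighting the application with stronger innovation outranks the one with stronger methodology when everything else is equal:

```python
# Hypothetical weights and scores (out of 10) illustrating the point above.
weights = {"innovation": 0.40, "methodology": 0.30, "community_benefit": 0.30}

applications = {
    "APP-A": {"innovation": 8, "methodology": 6, "community_benefit": 7},
    "APP-B": {"innovation": 6, "methodology": 8, "community_benefit": 7},
}

def weighted_total(criterion_scores, weights):
    """Weighted sum of criterion scores; assumes the weights sum to 1."""
    return sum(weights[c] * score for c, score in criterion_scores.items())

for app, criterion_scores in applications.items():
    print(app, round(weighted_total(criterion_scores, weights), 2))
# APP-A 7.1, APP-B 6.9: the stronger innovation score wins despite weaker methodology.
```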
In practice, this only works if the word "innovation" is defined precisely enough for assessors to agree on what it looks like in a high-scoring application versus a low-scoring one. If the criterion guidance says "demonstrates innovative thinking or approach," that definition is functionally useless. One assessor reads "innovative" as "uses evidence-based approaches that are new to this context." Another reads it as "proposes something no one else has tried." These are not compatible definitions. The 40% weight doesn't make the rubric more rigorous — it just multiplies the disagreement.
Weights are not the problem. Vague criteria are the problem. The weight tells assessors what matters most. The criterion definition tells them what to look for. Without the second, the first is decoration.
The test is simple: can an assessor, looking at a specific application, point to the evidence that justified their score? If they can't — if the score is a gestalt impression they then assign to the criterion — the criterion is not doing its job, regardless of its weight.
1. Your scores cluster in a narrow band. If 80% of applications score between 6 and 8 out of 10 on most criteria, either your applicant pool is unusually uniform or your criteria are not discriminating. Rubrics that don't produce a meaningful spread are not selecting — they are validating. (A quick way to run this check against a previous round's scores is sketched after this list.)
2. Assessors skip to the summary before completing individual criteria. When experienced assessors form a holistic judgement quickly and then complete the scoring form, the form is post-hoc documentation. Watch how your assessors work, not just what they submit.
3. Appeals and challenges consistently target the process, not the scores. When unsuccessful applicants argue that the process was unfair rather than that their application was scored incorrectly, it often signals that the scoring criteria were not clear or specific enough to produce a response. If scores are genuinely derived from defined criteria, an applicant can look at their feedback, look at the scoring rubric, and understand where they fell short. When that isn't possible, the rubric has failed its transparency function.
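On the first sign, the clustering check takes minutes if you still have last round's scores. A minimal sketch, with hypothetical data, reporting the share of scores in the 6–8 band and the overall range for each criterion:

```python
# Hypothetical criterion scores from a previous round (out of 10).
round_scores = {
    "impact": [6.5, 7.0, 7.0, 7.5, 6.0, 8.0, 7.0, 6.5, 7.5, 7.0],
    "budget_justification": [3.0, 8.5, 6.0, 9.0, 4.5, 7.0, 5.0, 8.0, 2.5, 6.5],
}

def share_in_band(scores, low=6.0, high=8.0):
    """Fraction of scores falling inside the given band."""
    return sum(low <= s <= high for s in scores) / len(scores)

for criterion, scores in round_scores.items():
    share = share_in_band(scores)
    spread = max(scores) - min(scores)
    flag = "  <- not discriminating?" if share >= 0.8 else ""
    print(f"{criterion}: {share:.0%} of scores in 6-8, range {spread:.1f}{flag}")
```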
Start by writing assessor guidance at the level of evidence. For each criterion, describe — in concrete terms — what a score of 8–10 looks like, what a score of 5–7 looks like, and what a score of 1–4 looks like. If you cannot write those descriptions without using the criterion word itself (e.g., "a high-innovation application demonstrates high levels of innovation"), the criterion is not defined — it is circular.
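As an illustration only, here is what the criterion from earlier ("evidence of how the programme will reach people who do not currently access this type of service") might look like with evidence-level band descriptors. The wording is hypothetical, not a template to copy:

```python
# Hypothetical criterion definition written at the level of evidence, not concepts.
reach_criterion = {
    "name": "reach_beyond_current_users",
    "bands": {
        "8-10": "Names the groups not currently accessing the service and describes "
                "specific, resourced activities to reach them, with evidence they work.",
        "5-7":  "Identifies under-served groups and proposes plausible outreach, "
                "but without evidence or a budget line to support it.",
        "1-4":  "Asserts broad community benefit without identifying who is not "
                "currently reached or how the programme would reach them.",
    },
}
```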
Run a calibration exercise before every round. Use one real application from a previous round (redacted if necessary) and have all assessors score it independently. Compare scores, surface disagreements, and discuss what the criteria were actually capturing. This single exercise does more for scoring consistency than any amount of rubric refinement on paper.
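The output of that exercise can be as simple as a per-criterion range. A minimal sketch, again with hypothetical scores, that sorts criteria by the spread between the lowest and highest assessor so the discussion starts where the disagreement is widest:

```python
# Hypothetical calibration scores: one application, all assessors, per criterion (out of 10).
calibration = {
    "community_benefit": {"assessor_a": 8.0, "assessor_b": 5.5, "assessor_c": 7.0},
    "methodology":       {"assessor_a": 6.0, "assessor_b": 6.5, "assessor_c": 6.0},
}

# Summarise each criterion so the discussion starts from the widest disagreements.
rows = []
for criterion, by_assessor in calibration.items():
    values = list(by_assessor.values())
    rows.append((max(values) - min(values), criterion, min(values), max(values)))

for spread, criterion, lo, hi in sorted(rows, reverse=True):
    note = "  <- discuss first" if spread >= 2.0 else ""
    print(f"{criterion}: scores ranged {lo}-{hi} (spread {spread:.1f}){note}")
```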
Consider criterion sequencing. Some criteria are binary (does this applicant meet the eligibility requirement?), some are graduated (how well does the proposed approach address the stated need?), and some are comparative (how does this application's budget justification compare to similarly scoped applications?). Mixing these types without distinguishing them creates assessor confusion that shows up as inconsistency.
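One way to keep the distinction visible is to record the type against each criterion in the assessment template rather than leaving it implicit in the wording. A minimal sketch, with hypothetical criterion names and weights:

```python
from dataclasses import dataclass
from enum import Enum

class CriterionType(Enum):
    BINARY = "binary"            # pass/fail, e.g. eligibility
    GRADUATED = "graduated"      # scored on a scale against defined bands
    COMPARATIVE = "comparative"  # scored relative to similarly scoped applications

@dataclass
class Criterion:
    name: str
    type: CriterionType
    weight: float  # share of the total score; 0 for binary gate criteria

# Hypothetical assessment template distinguishing the three types.
template = [
    Criterion("eligibility", CriterionType.BINARY, 0.0),
    Criterion("approach_addresses_need", CriterionType.GRADUATED, 0.40),
    Criterion("budget_justification", CriterionType.COMPARATIVE, 0.30),
]
```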
Review your criteria against the decision patterns from previous rounds. If your highest-weighted criteria are not the dimensions that most strongly predict which applications were funded, either the weights are wrong or assessors are overriding the rubric with other considerations. Both are worth knowing.
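A rough version of that review, assuming you have per-criterion scores and funded/declined outcomes from a previous round, is to compare the average score of funded and declined applications on each criterion and check whether the biggest gaps sit under the biggest weights. A sketch with hypothetical data:

```python
# Hypothetical data from a previous round: per-criterion scores plus outcome.
previous_round = [
    {"innovation": 8.0, "methodology": 6.0, "community_benefit": 7.5, "funded": True},
    {"innovation": 7.5, "methodology": 7.0, "community_benefit": 6.0, "funded": True},
    {"innovation": 6.0, "methodology": 8.0, "community_benefit": 7.0, "funded": False},
    {"innovation": 5.5, "methodology": 7.5, "community_benefit": 8.0, "funded": False},
]
weights = {"innovation": 0.40, "methodology": 0.30, "community_benefit": 0.30}

def mean(values):
    return sum(values) / len(values)

for criterion, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    funded = mean([a[criterion] for a in previous_round if a["funded"]])
    declined = mean([a[criterion] for a in previous_round if not a["funded"]])
    print(f"{criterion} (weight {weight:.0%}): funded avg {funded:.1f}, "
          f"declined avg {declined:.1f}, gap {funded - declined:+.1f}")
# If the biggest funded/declined gaps do not sit under the biggest weights,
# the weights and the decisions are out of step.
```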
A scoring tool is designed to rank applications consistently and independently. It should produce reliable scores across assessors. Its purpose is discrimination.
A deliberation tool is designed to structure a committee discussion. It prompts assessors to consider multiple dimensions, surfaces disagreements, and supports a defensible narrative. Its purpose is synthesis.
Most organisations need both, and they are different instruments. The mistake is using a deliberation tool as if it were a scoring tool — taking the numbers that emerge from a committee discussion and treating them as objective scores. Or using a scoring tool as if it were a deliberation tool — conducting a by-the-numbers assessment and treating the ranked list as a final decision without any committee review.
Well-designed assessment processes separate these stages. Individual scoring happens first, using criteria that are specific enough to produce consistent results. Deliberation happens second, using the individual scores as input to a structured discussion — not a rubber stamp, but a genuine review of borderline cases, range discrepancies, and portfolio-level considerations.
When both tools are designed with care, the rubric does its job. Decisions become defensible not because they followed a process, but because the process actually produced the decision.
If you are redesigning your assessment process, we're worth talking to. Tahua supports weighted scoring rubrics with criterion-level guidance, blind review to remove institutional bias, and COI auto-recusal — so assessors score what they should be scoring, not what they already know about the applicant.