Two-stage selection pipelines—where a cheap screener S decides who moves on to an expensive,
high-quality assessor H—have become a default design in hiring, credit, medical triage, and
content moderation. Intuition says that a higher level of agreement between the screener and
the assessor must be helpful, or at least not harmful. We show the opposite: once the screener
becomes too similar to the assessor, adding it to the pipeline can flip from helpful to harmful.
To formalize this, we consider a two-stage selection pipeline in which each candidate is
characterized by a latent true quality Q, a screener score S, and an assessor score H. We model
(Q, S, H) as a trivariate Gaussian with unit variances and correlations Corr(S, Q) = θs,
Corr(H, Q) = θh, and Corr(S, H) = θ. For fixed acceptance thresholds τs and τh, we derive a
closed-form expression for the expected quality of selected cases, E[Q | S > τs, H > τh], and
prove a simple rule:
S is helpful ⇐⇒ θ < θs/θh.
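Under these assumptions, one explicit form of the selected-case quality follows from standard
truncated bivariate-normal moment identities (the paper's own derivation may arrange it
differently):

E[Q | S > τs, H > τh] = [ θs φ(τs) Φ̄((τh − θτs)/√(1 − θ²)) + θh φ(τh) Φ̄((τs − θτh)/√(1 − θ²)) ] / P(S > τs, H > τh),

where φ is the standard normal density and Φ̄ = 1 − Φ its survival function.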
The ratio θ⋆ := θs/θh is the screener's normalized predictive power for Q. If the screener's
agreement with H exceeds this benchmark (θ > θ⋆), it merely duplicates H's judgment, gatekeeps
on the same mistakes, and lowers the final expected quality. Random shortlisting, skipping the
screener entirely, or reversing S's decisions then typically improves the expected quality of the
selected cases, even when S is positively predictive of the true quality Q—a counter-intuitive
"good screener gone bad" effect.
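To make the effect concrete, the following Monte Carlo sketch (ours, not the paper's code)
simulates the trivariate Gaussian model and compares the two-stage pipeline against skipping the
screener and against reversing its decision. The thresholds and correlations are illustrative
values chosen so that θ falls on either side of θ⋆ = 0.25, and the two baselines are one plausible
reading of the "skipping" and "reversing" comparisons above.

import numpy as np

def selected_quality(theta_s, theta_h, theta, tau_s=1.0, tau_h=1.0,
                     n=2_000_000, seed=0):
    """Monte Carlo estimates of E[Q | selected] under the trivariate
    Gaussian model (Q, S, H) with unit variances; parameters illustrative."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0,     theta_s, theta_h],
                    [theta_s, 1.0,     theta  ],
                    [theta_h, theta,   1.0    ]])
    Q, S, H = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    pipeline = (S > tau_s) & (H > tau_h)   # screener, then assessor
    skip     = (H > tau_h)                 # skip the screener entirely
    reverse  = (S < tau_s) & (H > tau_h)   # reverse the screener's decision
    return Q[pipeline].mean(), Q[skip].mean(), Q[reverse].mean()

# theta_s = 0.2, theta_h = 0.8, so theta_star = theta_s / theta_h = 0.25.
print(selected_quality(0.2, 0.8, theta=0.6))  # theta > theta_star: redundant screener
print(selected_quality(0.2, 0.8, theta=0.1))  # theta < theta_star: complementary screener

With these illustrative values, the pipeline's selected quality should come out lowest of the
three options at θ = 0.6 and highest at θ = 0.1, matching the qualitative picture above.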
We prove that an early-stage screener improves the expected quality of the final selection if
and only if its agreement with the assessor is below a simple correlation threshold. When the
screener predicts the assessor too well, it duplicates the assessor's errors and the two-stage
pipeline degrades. This correlation-driven effect is distinct from capacity-based comparisons in
dynamic screening—where one- versus two-stage performance hinges on acceptance rates and noise
aggregation—and from strategic "zig-zag" manipulation in sequential pipelines. We translate the
condition into a practical diagnostic based on partial correlations and show how to incorporate
the (often nontrivial) per-case cost of second-stage assessment.
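One concrete way such a partial-correlation diagnostic could look (our reading; the paper's exact
construction may differ) uses the fact that, under the model and for θh > 0, the condition
θ < θs/θh is equivalent to a positive partial correlation Corr(S, Q | H). A minimal numpy sketch,
assuming a calibration sample in which the quality outcome Q is actually observed:

import numpy as np

def partial_corr_SQ_given_H(q, s, h):
    """Sample partial correlation Corr(S, Q | H) via the precision matrix
    of the 3x3 sample correlation matrix (variable order: Q, S, H)."""
    R = np.corrcoef(np.vstack([q, s, h]))
    P = np.linalg.inv(R)
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

# Illustrative check on simulated data (hypothetical values, not the paper's):
rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.2, 0.8],
                [0.2, 1.0, 0.6],
                [0.8, 0.6, 1.0]])   # theta_s = 0.2, theta_h = 0.8, theta = 0.6
q, s, h = rng.multivariate_normal(np.zeros(3), cov, size=200_000).T
print(partial_corr_SQ_given_H(q, s, h))   # negative, since theta > theta_s/theta_h

A negative partial correlation indicates that, given H, the screener carries no additional
positive quality signal, which is exactly the regime the rule flags as harmful.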
We consider the algorithmic and managerial implications for human-AI collaboration, discuss how
proxy targets, feature overlap, and imprinting can silently push θ above θ⋆, and provide design
levers for keeping θ below θ⋆. We also connect this phenomenon to the fidelity paradox in
knowledge distillation in machine learning: when student models match their teacher models too
closely, they lose generalization. The main takeaway is that the screener's predictive power for
Q (θs) is, on its own, an insufficient diagnostic of overall pipeline performance. The inter-stage
correlation θ is a key diagnostic that should be monitored as carefully as accuracy.