TeamStation AI

Evidence-based evaluation of technical interviews

Where Semantic RAG and cognitive-science methods intersect with expert review—reproducible, calibrated, and explainable for LATAM engineering teams.

8-year proprietary corpus • 12,000+ technical interviews • Expert-in-the-loop • Anomaly detection • 44 Formulas & Algorithms

Psychometric Calculus on NLP for LATAM Hiring

We call it Axiom Cortex™

What we measure

Architectural instinct

Designs that scale and fail gracefully; solid mental models.

Problem-solving agility

Decomposes fuzzy problems into shippable steps and trade-offs.

Learning orientation

Absorbs new APIs/tools fast; adapts under changing constraints.

Collaborative mindset

Communicates rationale, invites feedback, and unblocks teammates.

How we hire the right person 

1

Role goals & competency blueprint

We align with your business objectives and define core and secondary competencies (plus nice-to-have skills) with explicit weights and ideal-answer examples.

2

Recorded technical interview (~60 min)

A senior evaluator runs a structured, evidence-based interview to surface spoken reasoning, architectural choices, trade-offs, and concrete examples.

3

Transcript & Evidence Locker

We transcribe the recording, apply human QA, restore punctuation and terminology, and produce a time-coded transcript with highlights tied to each competency.

4

Semantic chunking in RAG and staged/multi-step prompting

We analyze the candidate’s transcript using semantic chunking in RAG and staged/multi-step prompting that govern our proprietary LLM. Each chunk carries a competency tag, weight, difficulty, and blueprint ID—ensuring coverage, a difficulty ramp, and verification.
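
To make the mechanics concrete, here is a minimal sketch of this step under stated assumptions: a placeholder embedding (a hashed bag-of-words standing in for a real sentence encoder) and two hypothetical blueprint entries. The production chunker, embeddings, tags, and weights are proprietary and are not shown here.

```python
# Minimal sketch: meaning-based chunking plus blueprint retrieval by cosine similarity.
# Everything below is illustrative; the real system uses proprietary models and blueprints.
import hashlib
import math
import re

def embed(text: str, dim: int = 256) -> list[float]:
    """Placeholder embedding: hashed bag-of-words standing in for a real sentence encoder."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z']+", text.lower()):
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def chunk_transcript(transcript: str, sentences_per_chunk: int = 3) -> list[str]:
    """Naive meaning-based chunking: group consecutive sentences into small units."""
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", transcript) if s.strip()]
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

# Hypothetical competency blueprint entries (ID, competency tag, weight, difficulty, ideal answer).
blueprints = [
    {"id": "BP-ARCH-01", "competency": "architectural_instinct", "weight": 0.30, "difficulty": 3,
     "ideal": "Partition the system by failure domain and scale stateless services horizontally."},
    {"id": "BP-PS-02", "competency": "problem_solving", "weight": 0.25, "difficulty": 2,
     "ideal": "Decompose the fuzzy requirement into shippable steps and state the trade-offs."},
]

def tag_chunks(transcript: str) -> list[dict]:
    """Attach the best-matching blueprint (competency tag, weight, difficulty, ID) to each chunk."""
    tagged = []
    for chunk in chunk_transcript(transcript):
        cvec = embed(chunk)
        best = max(blueprints, key=lambda bp: cosine(cvec, embed(bp["ideal"])))
        tagged.append({"chunk": chunk, "blueprint_id": best["id"],
                       "competency": best["competency"], "weight": best["weight"],
                       "difficulty": best["difficulty"]})
    return tagged
```

The point of the sketch is the shape of the data: every chunk leaves this step carrying a competency tag, weight, difficulty, and blueprint ID.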

5

Per-question scoring (five checks)

For every chunk we score technical correctness, sound mental model, practical method, communication clarity, and effort & fluency.
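
As an illustration only, here is a minimal roll-up of the five checks into a single chunk score, assuming a 0–5 scale per check and equal weights by default; the actual scales, anchors, and weights are defined per role blueprint.

```python
# Illustrative five-check scoring for one chunk. The 0-5 scale and default equal weights
# are assumptions, not the production rubric.
from typing import Dict, Optional

CHECKS = ("correctness", "mental_model", "method", "clarity", "effort_fluency")

def score_chunk(check_scores: Dict[str, float],
                check_weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of the five checks, normalized to a 0-1 chunk score."""
    weights = check_weights or {c: 1.0 for c in CHECKS}
    total_w = sum(weights[c] for c in CHECKS)
    return sum(check_scores[c] / 5.0 * weights[c] for c in CHECKS) / total_w

# Example: one chunk scored 0-5 on each check.
example = {"correctness": 4, "mental_model": 5, "method": 3, "clarity": 4, "effort_fluency": 4}
print(round(score_chunk(example), 3))  # 0.8
```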

6

Ideal-answer alignment & language fairness

We compare responses to the role’s ideal answers and apply language fairness calibration so L2/ESL phrasing isn’t penalized—we judge the shape of thinking.
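
A minimal sketch of the alignment signal, assuming any unit-normalized sentence encoder (the `embed_fn` parameter is a placeholder, not the production similarity model); the fairness adjustment itself is sketched further down under Language fairness calibration.

```python
# Illustrative ideal-answer alignment: best cosine similarity between the candidate's chunk
# and the role's ideal answers. `embed_fn` is a placeholder for any unit-normalized encoder.
def alignment_score(chunk_text: str, ideal_answers: list[str], embed_fn) -> float:
    """Return a 0-1 alignment signal: the best cosine similarity to any ideal answer."""
    c = embed_fn(chunk_text)
    sims = (sum(x * y for x, y in zip(c, embed_fn(ideal))) for ideal in ideal_answers)
    return max(0.0, max(sims))
```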

7

Expert review & decision package

Human experts inspect integrity flags (including possible AI/lookup patterns), override/rescore where needed, and assemble a shortlist with rationale, risks/opportunities, targeted follow-ups, and L1–L4 leveling.

8

Offer & onboarding

When you’re ready to hire, we handle EOR & payroll, background checks, device provisioning/MDM/security, and onboarding—one accountable SLA.

Bias is reduced by focusing on reasoning patterns; integrity checks flag near-verbatim web phrasing and sudden speech-pattern shifts.

Lonnie McRorey | CEO and Co-Founder of TeamStation AI

Our system automatically detects top talent in Latin America

Axiom Cortex™: Psychometrics and advanced mathematics

    Semantic chunking in RAG: splitting the interview into meaning-based units and retrieving the most relevant blueprints/examples for each.
    Staged/multi-step prompting: guiding the model through a controlled sequence of steps instead of one brittle mega-prompt.
    Five checks: correctness • mental model • method • clarity • effort/fluency.
    Trait profile: rolled-up view of the four traits that predict on-the-job performance.
    Confidence alignment: whether confidence matches demonstrated knowledge.

“Wherever there is judgment, there is noise.”

Daniel Kahneman
Daniel Kahneman is the Eugene Higgins Professor of Psychology at Princeton University and Professor of Public Affairs at the Princeton School of Public and International Affairs; he won the 2002 Nobel Prize in Economic Sciences and received the Presidential Medal of Freedom in 2013.

Semantic chunking in RAG 

We split the interview into meaning-based chunks, retrieve the right blueprints/examples, and guide the model through a controlled, stepwise sequence—the LLM is governed, not free-running.

Per-question “five checks”

Every response is scored on five independent checks—correctness, mental model, method, clarity, effort/fluency—then rolled into a role-specific trait profile.

Language fairness calibration

Adjusts for L2/ESL discourse markers and phrasing so candidates are judged on conceptual fidelity—the quality of the idea—not accent or word choice.
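
One common way to implement this kind of normalization is residualization: regress raw scores on surface-fluency covariates and keep only the part of the score those covariates cannot explain. The sketch below is an assumption about the approach, with made-up covariates (filler-word rate, grammar-error rate); it is not the production calibration.

```python
# Illustrative residualization: remove the portion of a raw score linearly explained by
# surface-fluency covariates, returning the intercept plus residual (the conceptual signal).
import numpy as np

def residualize(raw_scores: np.ndarray, fluency_covariates: np.ndarray) -> np.ndarray:
    """raw_scores: shape (n,); fluency_covariates: shape (n, k)."""
    X = np.column_stack([np.ones(len(raw_scores)), fluency_covariates])
    beta, *_ = np.linalg.lstsq(X, raw_scores, rcond=None)
    predicted_from_fluency = X[:, 1:] @ beta[1:]   # variance attributable to fluency features
    return raw_scores - predicted_from_fluency     # intercept + residual: fluency effect removed

# Example: five candidates, covariates = [filler-word rate, grammar-error rate] (made up).
scores = np.array([0.72, 0.65, 0.80, 0.58, 0.75])
covs = np.array([[0.08, 0.03], [0.15, 0.06], [0.04, 0.01], [0.18, 0.08], [0.06, 0.02]])
print(residualize(scores, covs).round(3))
```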

AI/lookup integrity checks

Detectors surface near-verbatim phrasing from public sources and pattern shifts vs baseline; flagged segments go to expert review.
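
For illustration, one simple detector of this kind: Jaccard overlap of word shingles between an answer and a retrieved public passage. The shingle length and the 0.35 flag threshold are assumptions; flagged segments go to expert review rather than being auto-penalized.

```python
# Illustrative near-verbatim detector: Jaccard overlap of word 5-gram shingles between a
# candidate answer and a public source passage. Thresholds and sources are assumptions.
import re

def shingles(text: str, n: int = 5) -> set:
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def flag_near_verbatim(answer: str, source: str, threshold: float = 0.35) -> bool:
    """True if the answer shares enough shingles with the source to warrant expert review."""
    return jaccard(shingles(answer), shingles(source)) >= threshold
```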

Expert-in-the-loop governance

Senior reviewers gate decisions, can override scores, and provide plain-English rationale; the pipeline is auditable end-to-end.

From scores to decisions

Weighted check scores, calibration results, and integrity findings are synthesized into the decision package: a shortlist with rationale, risks and opportunities, targeted follow-ups, and L1–L4 leveling.


Redacted sample evaluation. Format and fields vary by role/level.

Sample Evaluation Synthesis

We turn interview evidence into an explainable hiring recommendation—governed by semantic chunking in RAG and staged/multi-step prompting, with expert review and language-fair calibration. We score each answer on five checks and assemble a decision package you can act on.

Dan Diachenko interviewing Sr Software Engineers

Nearshore LATAM Technical Interview Evaluation — 44 Methods & Metrics (CTO Appendix)

Semantic RAG, staged/multi-step prompting, calibration, fairness, and reliability—built for nearshore LATAM hiring.


Scoring & aggregation

    Five-Checks Per-Chunk Scoring (correctness, mental model, method, clarity, effort/fluency)
    Weighted Composite Score (core/secondary competency weighting)
    Role/Level Normalization & Cut-Score Mapping (L1–L4)
    Semantic Alignment Scoring (embedding/cross-encoder similarity to blueprints)
    Retrieval Scoring in Semantic RAG (BM25, dense ANN; diversity via MMR, sketched after this list)
    Trait Synthesis via Hierarchical/Bayesian Fusion (AI, PSA, LO, CM)
    Confidence Alignment Index (Metacognitive Conviction Index, MCI)
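
As an example of the retrieval-scoring entry above, here is a sketch of Maximal Marginal Relevance (MMR) re-ranking, which trades relevance against redundancy when selecting blueprints for a chunk; unit-normalized embeddings and lambda = 0.7 are assumptions.

```python
# Illustrative MMR re-ranking over retrieved blueprint vectors: balance relevance to the
# query chunk against redundancy among already-selected blueprints.
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3, lambda_: float = 0.7) -> list[int]:
    """Return indices of k documents selected by MMR (cosine similarity on unit vectors)."""
    relevance = doc_vecs @ query_vec
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```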

Drift & stability monitoring

    Population Stability Index (PSI), sketched after this list
    Kolmogorov–Smirnov (KS) & Anderson–Darling
    KL / Jensen–Shannon Divergence for feature distributions
    Page–Hinkley / ADWIN for streaming drift
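
A sketch of the Population Stability Index from the list above, comparing a baseline score distribution with a recent window; the 0.1/0.25 interpretation bands are the usual rule of thumb, not TeamStation-specific thresholds.

```python
# Illustrative PSI between a baseline distribution and a recent one, using shared bins.
import numpy as np

def psi(baseline: np.ndarray, recent: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(baseline, bins=bins)
    eps = 1e-6  # avoid log(0) on empty bins
    p = np.histogram(baseline, bins=edges)[0] / len(baseline) + eps
    q = np.histogram(recent, bins=edges)[0] / len(recent) + eps
    return float(np.sum((p - q) * np.log(p / q)))

# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
```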

Item calibration & measurement

    Item Response Theory (2-PL / 3-PL; 2-PL curve sketched after this list)
    Many-Facet Rasch Modeling (candidate × item × rater × modality)
    Generalizability Theory (G-studies / D-studies)
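
For the IRT entry above, the 2-PL item characteristic curve in a few lines: the probability that a candidate of ability theta answers an item correctly, given its discrimination a and difficulty b. The example parameters are illustrative.

```python
# Illustrative 2-PL item response curve.
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """P(correct) for ability theta, item discrimination a, item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Example: a well-discriminating item (a = 1.5) of above-average difficulty (b = 0.8).
print(round(p_correct_2pl(theta=1.0, a=1.5, b=0.8), 3))  # ~0.574
```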

Calibration & reliability

    Probability Calibration (Platt scaling / Isotonic regression)
    Calibration Metrics (Brier Score, ECE/MCE/ACE)
    Internal Consistency (Cronbach’s α, McDonald’s ω; α and ECE sketched after this list)
    Split-Half / Spearman–Brown Reliability
    Test–Retest / Intraclass Correlation (ICC)
    Inter-Rater Reliability (Cohen’s κ, Fleiss’ κ, Krippendorff’s α)
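
Two of the metrics above, sketched with plain NumPy: Cronbach's alpha over the five checks and Expected Calibration Error (ECE) over predicted pass probabilities. The array shapes, bin count, and the choice to treat the five checks as "items" are assumptions made for the sketch.

```python
# Illustrative reliability and calibration metrics.
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal consistency; item_scores has shape (n_candidates, n_items)."""
    k = item_scores.shape[1]
    item_var = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """ECE: bin-weighted gap between predicted probability and observed outcome rate."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs >= lo) & (probs <= hi) if hi == 1.0 else (probs >= lo) & (probs < hi)
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return float(ece)
```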

Fairness & bias control

    Language-Fairness Normalization (residualization / domain adaptation)
    Group Fairness: Demographic Parity, Equal Opportunity, Equalized Odds (sketched after this list)
    Predictive Parity & Calibration-Within-Groups
    Differential Item Functioning (Mantel–Haenszel, Logistic-DIF)
    Counterfactual Fairness Probes (textual perturbations)
    Threshold Optimization under Fairness Constraints
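
A sketch of two group-fairness audits from the list above: the demographic parity difference (selection-rate gap) and the equal opportunity difference (true-positive-rate gap). The group labels here are audit-only inputs and purely illustrative.

```python
# Illustrative group-fairness checks on pass/fail recommendations.
import numpy as np

def demographic_parity_diff(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Largest gap in selection rate across groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return float(max(rates) - min(rates))

def equal_opportunity_diff(y_true: np.ndarray, y_pred: np.ndarray, group: np.ndarray) -> float:
    """Largest gap in true-positive rate across groups."""
    tprs = []
    for g in np.unique(group):
        mask = (group == g) & (y_true == 1)
        if mask.any():
            tprs.append(y_pred[mask].mean())
    return float(max(tprs) - min(tprs))
```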

Integrity & anomaly detection

    Near-Verbatim Match & Source Overlap (n-gram/cosine, MinHash)
    Stylometric/Baseline Shift Detection (KL / Jensen–Shannon)
    Latency & Fluency Pattern Shifts (answer-tempo anomalies)
    CUSUM/Robust Z-Score Outliering for answer series (sketched after this list)
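
For the outliering entry above, a sketch of a robust (median/MAD) z-score plus a one-sided CUSUM over an answer-level series such as per-question latency or score; the 3.5 and 5.0 thresholds are conventional defaults, not the production settings.

```python
# Illustrative outlier and shift detection over an answer-level series.
import numpy as np

def robust_z(series: np.ndarray) -> np.ndarray:
    """Median/MAD-based z-score, resistant to a few extreme answers."""
    med = np.median(series)
    mad = np.median(np.abs(series - med)) or 1e-9
    return 0.6745 * (series - med) / mad

def cusum(series: np.ndarray, k: float = 0.5) -> np.ndarray:
    """One-sided upper CUSUM of standardized values; large values signal a sustained shift."""
    z = (series - series.mean()) / (series.std() or 1e-9)
    s = np.zeros(len(series))
    for i in range(1, len(series)):
        s[i] = max(0.0, s[i - 1] + z[i] - k)
    return s

def flags(series: np.ndarray, z_thresh: float = 3.5, cusum_thresh: float = 5.0) -> np.ndarray:
    """Boolean mask of answers routed to expert review."""
    return (np.abs(robust_z(series)) > z_thresh) | (cusum(series) > cusum_thresh)
```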

Decision & gates

    Probabilistic Core-Competency Gates (chance-of-meeting-target)
    Utility-Optimized Recommendation (constrained Bayesian decision)
    Cost-Sensitive Thresholding (Youden’s J, custom cost curves; Youden’s J sketched after this list)
    Multi-Objective Trade-offs (fairness–utility Pareto checks)
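
A sketch of the thresholding entry above: choosing a cut score that maximizes Youden's J (TPR minus FPR) on held-out outcomes. The scores and labels are illustrative inputs; in practice the chosen threshold also respects the fairness constraints listed earlier.

```python
# Illustrative cut-score selection via Youden's J.
import numpy as np

def youden_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return the score threshold maximizing TPR - FPR on labeled outcomes (labels in {0,1})."""
    best_t, best_j = 0.0, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tpr = pred[labels == 1].mean() if (labels == 1).any() else 0.0
        fpr = pred[labels == 0].mean() if (labels == 0).any() else 0.0
        j = tpr - fpr
        if j > best_j:
            best_t, best_j = float(t), j
    return best_t
```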

Uncertainty reporting

    Nonparametric Bootstrap / Jackknife CIs (bootstrap sketched after this list)
    Bayesian Credible Intervals
    Delta-Method Approximation for composites
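
A sketch of the bootstrap entry above: a percentile confidence interval for a composite score built from per-chunk scores. The 2,000 resamples and 95% level are assumptions.

```python
# Illustrative percentile bootstrap CI for the mean of per-chunk scores.
import numpy as np

def bootstrap_ci(values: np.ndarray, n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> tuple:
    rng = np.random.default_rng(seed)
    stats = [rng.choice(values, size=len(values), replace=True).mean() for _ in range(n_boot)]
    return (float(np.quantile(stats, alpha / 2)), float(np.quantile(stats, 1 - alpha / 2)))
```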

RAG & prompting governance

    Semantic Chunking in RAG (meaning-based units, blueprint retrieval)
    Staged / Multi-Step Prompting (schema-constrained outputs; sketched after this list)
    Cross-Step Verification & Self-Consistency Checks
    Prompt/Model Versioning & Provenance
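
A sketch of staged/multi-step prompting with a schema-constrained output. Everything here is hypothetical: `call_llm` is a stand-in for a governed model call, and the three stages and JSON schema are assumptions meant to show the control pattern (decompose, verify, then emit only rubric-shaped output), not the production prompt chain.

```python
# Illustrative staged prompting with schema validation; not the production pipeline.
import json

SCORE_SCHEMA = {"correctness", "mental_model", "method", "clarity", "effort_fluency"}

def call_llm(prompt: str) -> str:
    """Placeholder for a governed LLM call; in production this returns a JSON string."""
    raise NotImplementedError

def validate(raw: str) -> dict:
    """Reject any output that does not match the fixed five-check rubric schema."""
    data = json.loads(raw)
    if set(data) != SCORE_SCHEMA or not all(0 <= v <= 5 for v in data.values()):
        raise ValueError("schema violation: output rejected and routed to expert review")
    return data

def score_chunk_staged(chunk: str, blueprint_ideal: str) -> dict:
    # Stage 1: extract the candidate's technical claims from the chunk.
    claims = call_llm(f"List the technical claims in this answer:\n{chunk}")
    # Stage 2: compare the claims to the blueprint's ideal answer.
    comparison = call_llm(f"Compare these claims to the ideal answer:\n{claims}\n---\n{blueprint_ideal}")
    # Stage 3: emit scores as strict JSON matching the five-check schema, then validate.
    raw = call_llm(f"Score the comparison 0-5 on {sorted(SCORE_SCHEMA)} as JSON:\n{comparison}")
    return validate(raw)
```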

Governance & auditability

    ICAL Consistency Checkpoints (self-validation/re-processing)
    Reviewer Override Protocols (explanations required)
    Full Audit Trail (rubric versions, evidence links, decision rationale)

Advanced algorithms for evaluating LATAM engineers

What to expect, every time

Governed by semantic chunking in RAG and staged/multi-step prompting; the LLM is constrained, never free-running.

≈ 60 min

Structured technical interview (recorded)

2–5 business days

Screen-to-shortlist (role-dependent)

100% of flags reviewed

Expert-in-the-loop governance

Language-fairness calibration

Applied on every evaluation

Data handling, privacy & integrity

We never publish candidate identities. Integrity checks flag near-verbatim web phrasing and sudden speech-pattern shifts for expert review.

Consent-based recording

Candidates are informed; recordings are used only for evaluation.

Encryption in transit & at rest

Standard modern TLS for transport; encrypted storage for media/transcripts.

Access control & audit

Least-privilege reviewer access; activity logs retained for audits.

Redaction & sharing

Public samples are redacted; customers receive secure links, not e-mail attachments.

FAQ

  • How are interviews scored? We run the transcript through semantic chunking in RAG and staged/multi-step prompting. Each chunk is scored on technical correctness, sound mental model, practical method, communication clarity, and effort & fluency. Scores are weighted by core vs secondary competencies.

  • What is language-fairness calibration? Normalization that prevents non-native English phrasing (L2/ESL) from depressing scores. We evaluate conceptual fidelity—the quality of the idea—over accent or word choice.

  • Does the LLM make decisions on its own? No. The LLM is governed: semantic chunking in RAG picks only the relevant blueprints/examples, and staged/multi-step prompts constrain outputs into a fixed rubric. Experts review any flags before they influence the roll-up.

  • Is the evaluation tailored to each role? Yes. Every role gets its own competency blueprint and difficulty ramp. Core competencies are weighted more heavily than nice-to-haves.

  • Can we use our own rubric? Yes. We can map your rubric to our five checks and trait profile, preserving your language and thresholds.

  • What do we receive at the end? A shortlist with rationale, a trait profile, targeted follow-up questions, and onboarding notes—plus a secure link to the time-coded transcript highlights.

Ready to see this on your next role?

We’ll map your role blueprint, run a recorded interview, and deliver a language-fair, expert-reviewed evaluation.