How to Improve LLM Evaluation with Domain Experts

Discover how involving domain experts improves LLM evaluation, prevents harmful AI errors, and ensures ethical AI deployment.

Artificial intelligence (AI) is rapidly transforming industries, from social services to healthcare and beyond. Yet when AI systems fail to account for real-world complexities, they can cause catastrophic harm to the most vulnerable populations. Daina Bouquin, Senior Developer Relations Engineer at Anaconda, explores the critical role of domain experts in enhancing LLM evaluation and mitigating these risks. Drawing on her experience as a lead data scientist for the U.S. Administration for Children and Families (ACF), Bouquin highlights the pitfalls of excluding domain expertise during AI development and provides actionable strategies for teams building AI-powered products.

This article distills key insights from her talk into practical guidance for product managers, AI engineers, and cross-functional teams tasked with ensuring AI reliability, fairness, and ethical use in production.

The Consequences of AI Failures: Real-World Case Studies

AI technologies hold immense potential for automating complex tasks, but when designed without domain expertise, they can perpetuate harm. Bouquin illustrates this with three impactful case studies:

1. Undercounting Vulnerable Children in Poverty

The U.S. Head Start program relies on block grant funding calculated based on the number of children under age five living in poverty. However, these children are notoriously undercounted for reasons such as homelessness, undocumented status, and outdated data collection methods. Hypothetically, if an LLM were tasked with estimating this figure, it would simply produce an answer without addressing these systemic biases or noting the outdated nature of the data. This could jeopardize funding for critical social programs and misrepresent the scale of the issue.

2. Michigan’s Unemployment Fraud Detection System

Between 2013 and 2015, Michigan deployed an automated system to identify unemployment fraud, initially celebrated for its success in flagging five times more cases than before. However, this system lacked human oversight and failed to distinguish between genuine fraud and clerical errors, with 93% of its determinations proven wrong. The fallout was devastating: wrongful wage garnishments, evictions, bankruptcies, and untold human suffering. This failure underscores the need for nuanced evaluation and domain expertise to prevent such errors.

3. Predictive Models in Child Welfare

In 2023, Allegheny County, Pennsylvania, used AI to predict whether children reported for neglect might end up in foster care. While intended to prioritize high-risk cases, the system aggregated data from sources like the criminal justice system and behavioral health records, disproportionately penalizing marginalized families. Features such as past psychiatric treatment or incarceration history - factors beyond a family’s immediate control - unduly influenced risk scores. Without domain experts, these biases went unchecked, perpetuating systemic inequities.

Why Domain Experts Are Essential in AI Evaluation

These failures highlight recurring themes: AI systems often prioritize numerical accuracy over contextual understanding, ignore systemic inequities, and fail to account for real-world complexities. Domain experts bridge this gap by:

  • Surfacing Contextual Nuances: Experts understand the environments in which AI operates and can anticipate unintended consequences.
  • Identifying Ethical Implications: They can spot bias, systemic inequalities, and distributional justice issues (i.e., who benefits from the system versus who bears the costs of its errors).
  • Avoiding Misleading Metrics: Aggregate performance metrics (e.g., "90% accuracy") can mask harm when the remaining 10% of failures fall disproportionately on vulnerable groups.
  • Challenging Proxy Variables: Experts can identify when a proxy (e.g., criminal history) encodes historical biases rather than meaningful insights.
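The masking effect behind that "90% accuracy" point can be made concrete with a short sketch. The data and group labels below are hypothetical, purely for illustration:

```python
from collections import defaultdict

def disaggregated_accuracy(records):
    """Compute overall accuracy plus per-group accuracy.

    Each record is a (group_label, prediction_correct) pair.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, count]
    for group, correct in records:
        totals[group][0] += int(correct)
        totals[group][1] += 1
    overall = sum(c for c, _ in totals.values()) / sum(n for _, n in totals.values())
    per_group = {g: c / n for g, (c, n) in totals.items()}
    return overall, per_group

# Hypothetical results: 90 correct majority-group cases,
# 10 incorrect minority-group cases.
records = [("majority", True)] * 90 + [("minority", False)] * 10
overall, per_group = disaggregated_accuracy(records)
print(overall)                 # 0.9 -- "90% accuracy" overall
print(per_group["minority"])   # 0.0 -- every error lands on one group
```

The aggregate number looks acceptable; only disaggregating by group reveals that the entire error burden falls on one population, which is exactly the pattern a domain expert would ask about.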

Building Better AI with Domain Experts: Strategies and Approaches

The good news? Integrating domain experts into AI development doesn’t require starting from scratch. By adopting thoughtful processes, teams can leverage their expertise effectively. Here are actionable steps:

1. Involve Experts Early and Continuously

Most teams bring domain experts in only during the validation phase - far too late to mitigate design flaws. Shift this approach to involve them during:

  • Problem Definition: Collaborate with experts to understand real-world challenges and frame the problem appropriately.
  • Data Collection and Validation: Ensure data reflects the context and avoids perpetuating systemic biases.
  • Algorithm Design: Collaborate on key design decisions, such as threshold settings or feature selection.

2. Establish Two-Way Communication

Domain experts often lack technical knowledge, while AI developers may not grasp the nuances of the domain. To bridge this gap:

  • Use clear, shared language to define key terms and goals.
  • Develop documentation (e.g., model cards) collaboratively to foster transparency.
  • Encourage structured interviews or shadowing to understand workflows and challenges.
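One way to make a collaboratively developed model card concrete is to keep it as structured data that both engineers and domain experts edit and review. The field names and example values below are a hypothetical minimal subset, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Minimal model card filled in jointly by engineers and domain experts."""
    model_name: str
    intended_use: str
    out_of_scope_uses: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)
    expert_reviewers: list = field(default_factory=list)  # who vetted the card

# Hypothetical example for a social-services triage model.
card = ModelCard(
    model_name="eligibility-triage-v1",
    intended_use="Prioritize case reviews; a human makes the final decision.",
    out_of_scope_uses=["Automated benefit denial without human review"],
    known_limitations=["Undercounts children in unstable housing"],
    expert_reviewers=["caseworker panel", "policy analyst"],
)
print(card.intended_use)
```

Keeping the card as a typed artifact in version control means changes to limitations or scope go through the same review process as code changes, making the expert contribution visible rather than informal.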

3. Use Participatory AI Methods

As outlined in a 2022 Nesta report on humanitarian AI, participatory methods involve affected communities in co-creating solutions. Examples include:

  • Red Teaming Exercises: Structured brainstorming sessions where domain experts identify failure modes specific to their field.
  • Ethnographic Observation: Shadowing professionals to understand their workflows and identify pain points.
  • Structured Interviews: Gathering insights from multiple experts to create a well-rounded understanding.

4. Document Procedures

Bouquin emphasizes the need for formal documentation. Surprisingly, many machine learning studies fail to clearly describe how domain expertise was elicited. By documenting this process, teams can standardize practices, improve communication, and create reusable frameworks for future projects.

5. Rethink Metrics

Benchmarking alone cannot capture real-world performance. Teams need to develop evaluation methods that reflect the complexities of their use cases. For example:

  • Test AI systems under real-world conditions, not just academic benchmarks.
  • Incorporate ethics and safety considerations into evaluation metrics.
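One way to fold those considerations into an evaluation harness is to gate on the worst-affected group rather than the average. The threshold and group names below are illustrative assumptions, not figures from the talk:

```python
def passes_evaluation(per_group_error, max_group_error=0.1):
    """Pass only if no single group's error rate exceeds the ceiling.

    per_group_error: dict mapping group label -> error rate in [0, 1].
    A model with a good average can still fail this check when its
    errors concentrate on one vulnerable group.
    """
    worst_group = max(per_group_error, key=per_group_error.get)
    return per_group_error[worst_group] <= max_group_error, worst_group

# Hypothetical results: low error on two groups, 40% on a third.
errors = {"group_a": 0.02, "group_b": 0.03, "group_c": 0.40}
ok, worst = passes_evaluation(errors)
print(ok, worst)  # False group_c
```

The acceptable ceiling itself is a judgment call that domain experts, not engineers alone, should set, since they understand what an error costs the people on the receiving end.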

Realigning Power Dynamics in AI Development

Beyond technical challenges, Bouquin highlights systemic issues in AI development culture:

  • Power Asymmetries: Domain experts are often excluded from high-level decision-making. Shift their role from "reviewers" to "co-designers."
  • Invisible Labor: Cross-functional collaboration, translating concepts, and building relationships are time-intensive and emotionally taxing but essential for success. Recognize and reward this labor.
  • De-skilling of Experts: Avoid dismissing non-technical expertise as irrelevant. Instead, treat domain knowledge as an integral part of the AI development process.

Key Takeaways

  • Domain experts are invaluable in identifying contextual nuances, ethical issues, and potential biases in AI systems.
  • Involve experts early in problem framing, data collection, and algorithm design to prevent harmful outcomes.
  • Participatory methods, such as red teaming and ethnographic observation, enhance AI evaluation and foster trust.
  • Document your processes for eliciting domain expertise to improve transparency and reproducibility.
  • Move beyond benchmarks by developing evaluation metrics that account for real-world complexities and ethical considerations.
  • Recognize cross-functional labor, such as translating domain insights into technical requirements, as essential to successful AI projects.
  • Address power asymmetries by empowering domain experts to influence key design decisions, not just validate results.

Conclusion

AI systems are only as effective as the perspectives and expertise they incorporate. By actively involving domain experts throughout the development lifecycle, teams can create systems that not only perform well technically but also align with ethical, cultural, and social considerations. As Daina Bouquin aptly put it, "Domain experts foster dignity, trust, procedural fairness, and community relationships." These values are not optional - they are foundational to building AI that enhances society rather than harming it.

Source: "Daina Bouquin-Is Your LLM Evaluation Missing the Point---PyData Boston 2025" - PyData, YouTube, Dec 15, 2025 - https://www.youtube.com/watch?v=_vwml2sueIU
