ChatIBD v2: Six Months In

When we launched ChatIBD, we weren't sure how clinicians would respond. Six months and more than 1,000 users later, uptake has continued to grow, which suggests that ChatIBD is meeting a real-world need. Real clinical questions helped us understand what the tool was being used for. What they didn't do was find its flaws.

Here's what changed, and why.

What Our Testing Revealed

Most of the clinical queries we receive are direct enough that the system handles them reasonably well, which means the subtle failures stay hidden. That's what pushed us to build our own benchmarks, with questions specifically designed to stress-test the edge cases in IBD guideline interpretation.

A good example from that process: we asked "How common is IBD in patients with HIV?" ChatIBD pulled a passage from the 2025 BSG guideline and returned a confident answer. The problem is that the passage answers the inverse question: how common is HIV among IBD patients. The two are not interchangeable, because each is measured against a different patient population.

There was no hallucination. The right passage was retrieved. However, the system reasoned in the wrong direction without noticing.

General AI benchmarks don't tend to catch this kind of error. Spotting that the answer inverted the question requires knowing the clinical literature well enough to notice the difference.

Building Our Own Tests

We now run 68 benchmark questions in full, with a 15-question core discriminative subset for faster iteration, and we plan to keep expanding the set. At present, two frontier AI models evaluate each response independently, and I carry out the final review as a practising IBD gastroenterologist. This is not formal validation, but it is already helping us surface real issues that we would otherwise miss.
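To make the review flow above concrete, here is a minimal sketch of how two independent judge verdicts could be combined and ordered for the final clinician review. It is an illustration under stated assumptions, not our actual pipeline: the names JudgeVerdict, triage and review_queue, the judge labels and the question IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    judge: str         # hypothetical label for one of the two evaluating models
    acceptable: bool   # that judge's independent pass/fail call on a response
    note: str = ""     # optional free-text reason


def triage(first: JudgeVerdict, second: JudgeVerdict) -> str:
    """Combine two independent verdicts into one status for the human reviewer."""
    if first.acceptable and second.acceptable:
        return "both_pass"
    if not first.acceptable and not second.acceptable:
        return "both_fail"
    return "judges_disagree"


def review_queue(results: dict) -> list:
    """Order question IDs for final review: disagreements, then fails, then passes.

    Every response stays in the queue; the ordering only decides what the
    clinician looks at first (an assumed policy, chosen for illustration).
    """
    priority = {"judges_disagree": 0, "both_fail": 1, "both_pass": 2}
    return sorted(results, key=lambda qid: priority[triage(*results[qid])])


# Hypothetical verdicts for three benchmark questions.
results = {
    "q01": (JudgeVerdict("judge_a", True), JudgeVerdict("judge_b", True)),
    "q02": (JudgeVerdict("judge_a", True),
            JudgeVerdict("judge_b", False, "answers the inverse question")),
    "q03": (JudgeVerdict("judge_a", False), JudgeVerdict("judge_b", False)),
}

print(review_queue(results))
```

Putting disagreements first is a design choice in this sketch: a response that one judge accepts and the other rejects is the kind of subtle case where a specialist's reading is most likely to matter.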

That process helps us identify mitigation strategies. Sometimes model upgrades and system prompt improvements are enough. Sometimes, as we have found, the product itself needs to be rebuilt from the ground up.

v2: Sentence-Level Citations

In v2, citations work at sentence level. Instead of pointing to a guideline document, ChatIBD now surfaces the specific sentence the answer came from.

For the HIV/IBD case, this would have made the problem visible immediately. The citation would have shown a sentence about HIV prevalence in IBD patients, and the mismatch with the original question would have been obvious. We keep stretching the system to find its flaws, with the goal of steadily improving the safety and reliability of ChatIBD.
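As a rough illustration of why sentence-level citations help, the sketch below attaches a single source sentence to an answer and displays it next to the question. The class names, the render function and the bracketed placeholder text are assumptions made for this example; they are not ChatIBD's internal data model, and no guideline wording is quoted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceCitation:
    guideline: str   # which guideline the sentence comes from
    sentence: str    # the single sentence the answer relies on


@dataclass(frozen=True)
class CitedAnswer:
    question: str
    answer: str
    citation: SentenceCitation


def render(cited: CitedAnswer) -> str:
    """Show the question, the answer and the exact source sentence together."""
    return "\n".join([
        f"Question: {cited.question}",
        f"Answer:   {cited.answer}",
        f'Source:   "{cited.citation.sentence}" ({cited.citation.guideline})',
    ])


# Placeholder text in square brackets stands in for real content.
example = CitedAnswer(
    question="How common is IBD in patients with HIV?",
    answer="[confident answer returned by the system]",
    citation=SentenceCitation(
        guideline="2025 BSG guideline",
        sentence="[sentence about HIV prevalence among patients with IBD]",
    ),
)

print(render(example))
```

With the question and the source sentence on adjacent lines, a reader can see at a glance that the sentence is about HIV in IBD patients while the question asked about IBD in patients with HIV.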

What This Tool Is

ChatIBD makes IBD guidelines easier to access. It doesn't have your patient's history, their comorbidities, or the full clinical picture. It's for looking up guideline evidence quickly. The clinical judgment stays with you.

Being able to see exactly where an answer comes from is what makes that workable. When the source sentence is visible, you can decide more quickly and easily whether it actually applies to your patient.

What's Next

We're expanding the benchmarks as we find new edge cases, and we are working towards formal validation, given ChatIBD's real-world uptake. We hope this will keep building the solid foundation of safety that real-world clinical AI applications need. As always, if you notice any errors, please report them directly in the system. If you have feedback or suggestions, let us know at contact@chatibd.com.

We hope you continue to find ChatIBD useful.