In the validation of medical expert systems, agreement among human specialists on a random sample of cases may serve as a substitute for a missing gold standard. Distance measures between pairs of experts, extensively described in previous studies, do not account for chance-expected agreement. A weighted kappa index, with three different weighting schemes, is proposed as an alternative for the general situation of N cases assessed by E experts on K possible diagnoses, each rated in one of G ordinal categories. A hierarchical cluster analysis, applied to the resulting kappa matrices, classifies the expert system among the clinical specialists, providing a relative assessment of its diagnostic ability. This methodology is applied to the validation of two medical expert systems, PNEUMON-IA and RENOIR.
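The core of the proposal is a chance-corrected, weighted agreement index between pairs of raters. As a minimal sketch, the following computes a weighted Cohen's kappa for two raters over G ordinal categories, using the common unweighted, linear, and quadratic disagreement weights as stand-ins for the three weighting schemes (the paper's actual schemes, and its extension to E experts and K diagnoses, may differ):

```python
import numpy as np

def weighted_kappa(r1, r2, G, scheme="quadratic"):
    """Weighted Cohen's kappa between two raters on G ordinal categories.

    r1, r2 : sequences of integer ratings in 0..G-1
    scheme : 'unweighted', 'linear', or 'quadratic' disagreement weights
             (illustrative choices, not necessarily the paper's schemes)
    """
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = len(r1)
    # Observed joint rating matrix, as proportions
    O = np.zeros((G, G))
    for a, b in zip(r1, r2):
        O[a, b] += 1.0 / n
    # Chance-expected matrix from the two raters' marginal distributions
    E = np.outer(O.sum(axis=1), O.sum(axis=0))
    # Disagreement weight matrix over category pairs (i, j)
    i, j = np.indices((G, G))
    if scheme == "linear":
        w = np.abs(i - j) / (G - 1)
    elif scheme == "quadratic":
        w = ((i - j) / (G - 1)) ** 2
    else:  # unweighted: any disagreement counts fully
        w = (i != j).astype(float)
    # kappa = 1 - observed weighted disagreement / chance-expected one
    return 1.0 - (w * O).sum() / (w * E).sum()
```

Computing this index for every pair of raters (the expert system included) yields the kappa matrix to which the hierarchical cluster analysis is then applied.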