Validation of Expert Systems

Validating a medical expert system is generally a complex problem. It requires a rigorous validation methodology, and the system must have demonstrated its practical competence before it can be used routinely.

Validation concerns the quality of conclusions provided by the system, the quality of the deductive process leading to these conclusions as well as the validity of its utilization.

The validation of a knowledge based system (KBS) in psychiatry, as in any other area of scientific investigation, consists in putting into practice all types of action that aim:

  • to extend scientific interest,
  • to increase contributions to theoretical and/or applied research,
  • to extend clinical interest, and
  • to otherwise further the validation of the system, as well as to expose possible defects, errors, shortcomings, imprecision or maladjustments in it.

Therefore, the validation of such a tool results in its evaluation according to a series of major and minor criteria whose definitions determine the quality and exhaustiveness of the evaluation. In particular, it must be stressed that the validation of a KBS is a phase totally distinct from that of its realization.

The realization phase does include periods devoted to the meticulous verification of the knowledge base. However, that does not remove the need for a clear distinction between these two phases. Conversely, the validation phase can reveal possible improvements to the knowledge base without this phase being confused with the realization.

In the validation phase we generally find many problems that, for various reasons, were not solved in the conception and development phases. Among others, these problems concern the precise definition of the system's objectives and of the contingencies of its application, such as medical objectives, category of users, conditions of use, etc.

The problem of a medical expert system validation is generally complex [1-4]. Validation raises no shortage of difficulties. They arise in the methodological and objective areas as well as in the subjective area.

To guarantee the scientific rigor and the impartiality required for the distribution of a tool concerning health, the validation phase calls for the formation of a team of experts that is different from the realization team. Indeed, the realization team cannot be both judge and judged in the validation.

As a first step, two applied examples of validation are presented. Subsequently, some requirements for the validation of a KBS are discussed.


During the last ten years we developed a KBS, Adinfer, and used it in several studies [5-6]. As the classifications in the psychiatric field changed, we had to adjust the knowledge base and to revalidate the system on many occasions.


Sleep-EVAL was developed with clear objectives in mind:


Adinfer is a KBS devoted to psychiatry. The newest version is a level 2 nonmonotonic expert system endowed with a causal reasoning mode. Thus, this system is able to track a set of tentative beliefs and revise them when new information is derived or provided by the user. The whole program is written in Object Pascal for the Macintosh. The first knowledge base contained the DSM-III (Diagnostic and Statistical Manual of Mental Disorders) classification [7]; the second, the DSM-III-R classification [8]. The newest knowledge base includes the DSM-IV classification [9]. In all cases, the knowledge base covered the entire field of psychiatry.

These classifications were chosen because the selected criteria are the result of quantitative studies. Retained criteria possess a satisfactory inter-rater reliability. With regard to criteria that define the different diagnoses, since nothing can ensure that each judge evaluated them in the same manner, the system has the latitude to define them freely if it respects the causality of the reasoning. The vocabulary employed to define objects borrows its terms from psychiatry.

With the DSM-III version, the intuitive diagnosis of a clinician was compared to that provided by the expert system [10-13]. Users were general practitioners (n = 20) and psychiatrists (n = 27). The patients were seen by the clinicians in their offices. In all, data were collected for 1141 patients.

Each clinician had to specify ten important symptoms before providing his diagnosis. The number of patients per clinician had to be sufficient to permit statistical analysis.

The overall agreement between the expert system's conclusions and the clinicians' conclusions was 83%. In 16% of the cases there was complete disagreement, essentially involving the “not otherwise specified psychosis” category and the “organic mental disorders” category.

With the DSM-III-R version, we used a prospective model to perform the study [14]. Twenty general practitioners collected data on 114 patients with somatic complaints seen in their practice. Each file contained socio-demographic information (age, sex, marital status and profession), the motive of the consultation, the psychiatric and medical history of the patient, and a list of presenting symptoms. These files were then simultaneously submitted to four expert psychiatrists chosen for their skills in working with the American psychiatric nosography (DSM-III-R). Prior to the study, a preliminary analysis had been carried out on the inter-rater reliability of these four experts to assess the probability of agreement or disagreement. The experts gave one or more DSM-III-R diagnoses for each patient and a list of the symptoms underlying their diagnostic decision. Each expert was to do this independently.

The information was exactly the same for the clinician using the expert system. Adinfer had to produce a written report about its diagnosis or diagnoses as well as a list of the symptoms that it associated with each diagnosis.

To verify the exactness of the system's conclusion, the following strategies were adopted:

  • A diagnosis produced by the system was judged to be adequate when at least three of the four experts reached the same diagnostic conclusion;
  • When at least three experts agreed on the same diagnosis but their diagnosis was different from that of the system, the case was re-analyzed and the underlying diagnostic rules were reviewed and corrected as needed;
  • When the diagnostic conclusions were different within the group of experts, a qualitative analysis of the case at hand was carried out.
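The three-way rule in the list above can be illustrated with a minimal sketch. It is not the procedure actually coded for the study: the function name `adjudicate`, the returned labels, and the simplification that each expert gives exactly one diagnosis per case are assumptions made for illustration.

```python
from collections import Counter

def adjudicate(system_dx, expert_dxs, quorum=3):
    """Classify one case under the three-way rule (illustrative only).

    system_dx  -- diagnosis produced by the system (hypothetical label)
    expert_dxs -- one diagnosis per expert (simplifying assumption)
    quorum     -- number of concordant experts required (3 of 4 in the text)
    """
    top_dx, votes = Counter(expert_dxs).most_common(1)[0]
    if votes < quorum:
        # Experts disagree among themselves: qualitative case analysis.
        return "qualitative_analysis"
    if top_dx == system_dx:
        # Expert consensus matches the system: diagnosis judged adequate.
        return "adequate"
    # Expert consensus differs from the system: review the diagnostic rules.
    return "review_rules"

experts = ["major depression", "major depression", "major depression", "dysthymia"]
print(adjudicate("major depression", experts))
```

A case falls into the qualitative analysis branch whenever fewer than three experts share a diagnosis, whatever the system concluded.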

In 26 cases, the judges were unable to agree on the diagnosis. The kappa coefficient was 0.97 between Adinfer's conclusions and the experts' conclusions when there was agreement among the judges.


Sleep-EVAL, built around the Adinfer expert system, is a KBS devoted to the assessment of sleep disorders. It was designed specifically for use in epidemiological studies of the general population.

The knowledge base contained a series of questions about socio-demographic information, sleeping habits, medications, illnesses, and DSM-III-R [8], DSM-IV [9] and ICD-10 [15] classifications of sleep disorders.

Special care was given to word definitions, samples of symptoms, and instructions on how to answer the questions. This procedure is essential because it reduces ambiguity in the interpretation of symptoms, which is more likely to occur when the users have no training in the psychiatric field.

The users were 26 lay interviewers with no knowledge of psychiatry or sleep disorders.

The 7350 interviews were conducted by telephone during a two-month period in two countries (France and the metropolitan area of Montreal, Quebec, Canada).

The average training time was 3 hours [16-17].

The results show very good agreement with those reported in the epidemiological literature. For example, the French study found that 10.9% of the population currently suffers from an insomnia disorder according to the DSM-IV classification, which is comparable to the figure from another epidemiological study [18].


A difficulty we met during the validation of a KBS concerns the problem of the knowledge itself.

How indeed do we conduct the validation without evaluating at the same time the foundations and theoretical biases of the expert who is the author? This question does not belong solely to artificial intelligence, but the answer still conditions the credibility of AI scientific methodology. Therefore, it is advisable to distinguish as clearly as possible between the evaluation of the quality and capacity of the system and the evaluation of its principles and theoretical biases.

A rigorous validation demands an equally rigorous methodology. Such a methodology has to be based on the definition of criteria and the establishment of precise evaluation protocols. Rigor and impartiality are imperative considering the number of factors that affect the functioning of the global system, the number of evaluation criteria, the variability of the criteria according to the type of application, the medical context and the type of users, and especially the various biases that can be introduced by methodological choices.

The general script of a validation process is the following:

  • Definition of validation objectives and criteria: there are many evaluation criteria for a knowledge base, which stem from the evaluator's point of view. However, we can identify some major axes of characterization: the medical axis, which includes the totality of medical characteristics; the cognitive axis, which covers artificial intelligence and the cognitive sciences; and the socio-economic axis, which involves the various costs and benefits.
  • Definition of validation methods: each evaluation objective has to be reached in a predetermined way that constitutes the evaluation protocol of the objective. For the sake of clarity, we need to define as many protocols as there are evaluation objectives rather than a single evaluation protocol covering all objectives.
  • Drawing up the evaluation plan: the evaluation of a knowledge based system takes place longitudinally and has to be performed according to a schedule. Evaluation planning has to be differentiated from utilization assessment. The latter is linked to medium or long term utilization and supposes, therefore, that the tool has been declared usable following a conclusive evaluation phase.

If we consider the use of the computerized system as a classical medical examination, the main principles of medical evaluation can be applied to the computer system. For the evaluation to be methodologically sound:

  1. The technical evaluation has to be entirely satisfactory in terms of reproducibility and of minimal inter- and intra-observer variability; the number of doubtful answers has to be as low as possible.
  2. The illness “I” has to be perfectly defined, as well as what we call the non-“I” state.
  3. The physician-evaluator must be unaware of the diagnosis given to the subjects he examines.

The medical criteria are the first to consider in the evaluation of a knowledge based system. The main medical criteria are the confidence placed in the knowledge modeling contained in the knowledge base, and the validity of the system's answers. These categories are not totally separate. They have some points in common, and the distinction is made mainly for clarity.


In medical decision making, total certainty about the accuracy of a diagnosis or a therapeutic choice is rare.

The physician acts as best he can, according to his experience and his knowledge, but cannot guarantee the infallibility of his judgments. Nevertheless, even a physician with little experience knows how to evaluate approximately the chances of success and to note the alternatives that are left aside. The physician also knows how to evaluate risks and to judge the consequences of erroneous choices.

Can we ask a knowledge based system to evaluate the consequences of its own choices? Such a question enables us to discriminate between mediocre systems and systems of real medical interest.

The following features may strengthen the user's confidence in the system:

  • Diagnostic warnings: every system answer (diagnosis, therapeutic advice, etc.) obtained through reasoning has to be accompanied by an evaluation of the degree of confidence that we can grant to this answer. This evaluation remains important in spite of the theoretical difficulties underlying its elaboration.
  • Prognostic warnings: all advice or indications given by the system have to be accompanied by an evaluation and a description of the risks and negative consequences linked to applying them.

A system based on expert knowledge whose application involves the health or the life of an individual cannot disregard this type of information. It is indeed difficult to anticipate the system's effect on each of its users. Moreover, information about the negative effects of applying the advice also serves to limit uncontrolled suggestion phenomena, which can influence the user.

  • Open answers: a diagnosis is rarely unique or exclusive.

Consequently, the system has to give all possible diagnoses with an evaluation of their respective chance of occurrence, or at least to recall diagnoses that are left aside because of their low chance of occurrence.

An important characteristic of the expert's behavior is the evocation of hypotheses from signs, followed by a series of attempts to confirm some of these hypotheses at the expense of others, until, in the best of cases, a unique answer is obtained.

An “intelligent” system has to reproduce this generation of hypotheses and to keep them at the user's disposal.
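The “open answers” requirement can be pictured with a small sketch. It only illustrates the idea of keeping every hypothesis available to the user and is not Adinfer's internal representation: the function name `open_answer`, the numeric scores, and the `retain_threshold` cut-off are hypothetical.

```python
def open_answer(hypotheses, retain_threshold=0.2):
    """Rank diagnostic hypotheses without discarding any of them.

    hypotheses       -- dict mapping a diagnosis to an estimated chance of
                        occurrence between 0 and 1 (hypothetical scores)
    retain_threshold -- hypothetical cut-off below which a hypothesis is
                        reported as "set aside" rather than retained
    """
    ranked = sorted(hypotheses.items(), key=lambda item: item[1], reverse=True)
    return {
        # Diagnoses presented as the main answer, most likely first.
        "retained": [(dx, p) for dx, p in ranked if p >= retain_threshold],
        # Low-likelihood diagnoses are still recalled to the user.
        "set_aside": [(dx, p) for dx, p in ranked if p < retain_threshold],
    }

report = open_answer({"insomnia disorder": 0.7,
                      "dysthymia": 0.1,
                      "generalized anxiety": 0.25})
print(report["retained"])
print(report["set_aside"])
```

The point of returning both lists is that no hypothesis generated during the reasoning is silently dropped from the report.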


Validity characterizes, on the one hand, the quality of the conclusions provided by the system (validation of the decision trees), and on the other hand, the quality of the deductive process leading to these conclusions (validation of the inference engine).

It can be methodically evaluated by submitting the system to a testing period or to a series of case files whose assessment has been verified beforehand by several physicians.

It is also important to validate the utilization of the system. This process may guarantee the viability of the expert system. It consists in employing lay users, who are not experts in the field, to carry out the evaluation of the expert system. We can then compare the expert system's conclusions to the experts' conclusions.

The questions we will try to answer are:

  • Is a lay user able to achieve diagnoses with the help of the expert system?
  • Are these conclusions discerning enough?


The statistical evaluation generally compares, by means of statistical tools, the expert system's conclusions to the experts' conclusions under the same conditions. To do so, we compare groups of subjects corresponding to various cases.

The study can be retrospective, i.e., undertaken from archived patient cases, or conducted directly from data collected in the care services or in the physician's office, for example. In the latter case, a file grouping all data for each case (observations, analysis results, system's conclusions, expert's conclusions, etc.) has to be kept up to date during the data acquisition phase; the validation study (statistical study or data analysis) is then done using this file. The question that we try to answer is: Does the KBS provide conclusions that are comparable to, as good as, or worse than those of the expert, considering the totality of submitted cases? The system will have to function in a non-interactive mode and use only the information in the file, which must be meticulously verified beforehand. We must avoid submitting to the KBS files that are incomplete or contain errors.


There are several ways to assess the degree of agreement between the expert system's conclusions and experts' conclusions.

The choice of the appropriate measure depends on the nature of the variables. When the values being compared are qualitative (as is the case for diagnoses or therapeutic advice), we use the kappa coefficient [19] to compare the two groups of values (judgment of the expert, judgment of the system).

This type of kappa is suitable when the data are dichotomous (e.g., presence or absence of a diagnosis) or polychotomous (e.g., many diagnoses) and when there are only two judges (the expert system and one expert).
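As an illustration of this two-judge measure, here is a minimal sketch of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from each judge's marginal frequencies. The function name `cohen_kappa` and the example labels are hypothetical; the data are invented and are not the study's data.

```python
from collections import Counter

def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two judges rating the same cases (nominal data)."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must rate the same non-empty set of cases")
    n = len(judge_a)
    # Observed agreement: share of cases where both judges give the same label.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: product of the two judges' marginal proportions.
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_o - p_e) / (1 - p_e)

# Invented example: system versus one expert on four cases.
system = ["depression", "depression", "anxiety", "anxiety"]
expert = ["depression", "depression", "anxiety", "depression"]
print(cohen_kappa(system, expert))  # 0.5
```

In this invented example the judges agree on 3 of 4 cases (p_o = 0.75) while chance alone would give p_e = 0.5, hence a kappa of 0.5.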

An intraclass correlation coefficient [20] could also be used with dichotomous data and two judges, but only in some instances. This measure is more suitable when there are dichotomous or quantitative data and several judges.

Weighted kappa can be used when there are two judges and dichotomous or polychotomous data. This measure is calculated on the basis of weights attributed to disagreements [21].
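A minimal sketch of the weighted variant follows, using disagreement weights (0 for identical categories, larger values for more serious disagreements) so that kappa_w = 1 - sum(w * observed) / sum(w * expected). The function name `weighted_kappa`, the ordered severity categories, and the default linear weights |i - j| are assumptions chosen for illustration; the appropriate weights depend on the application.

```python
def weighted_kappa(judge_a, judge_b, categories, weight=None):
    """Weighted kappa for two judges, with disagreement weights.

    categories -- list of all categories, in the order used by the weights
    weight     -- function (i, j) -> penalty for category indices i and j;
                  defaults to linear weights |i - j| (an assumption)
    """
    if weight is None:
        weight = lambda i, j: abs(i - j)
    n = len(judge_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    # Observed joint proportions of (judge A category, judge B category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(judge_a, judge_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    pairs = [(i, j) for i in range(k) for j in range(k)]
    # Weighted observed and chance-expected disagreement.
    d_obs = sum(weight(i, j) * observed[i][j] for i, j in pairs)
    d_exp = sum(weight(i, j) * row[i] * col[j] for i, j in pairs)
    if d_exp == 0:
        raise ValueError("weighted kappa is undefined for these ratings")
    return 1 - d_obs / d_exp

levels = ["none", "mild", "severe"]
system = ["none", "mild", "severe", "mild"]
expert = ["none", "mild", "severe", "severe"]
print(round(weighted_kappa(system, expert, levels), 3))  # 0.714
```

With 0/1 weights (`weight=lambda i, j: int(i != j)`) the same function reduces to the unweighted kappa.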

Finally, the generalized kappa [22] can be used with dichotomous or polychotomous data and three or more judges.
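For three or more judges, a minimal sketch of the generalized (Fleiss) kappa is given here. It assumes that every case is rated by the same number of judges and that each judge gives one category per case; the function name `fleiss_kappa` and the example ratings are hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Generalized kappa for several judges (nominal data).

    ratings -- one list per case, holding the category given by each judge;
               every case must have the same number of judges (assumption)
    """
    n_cases = len(ratings)
    n_judges = len(ratings[0])
    if n_judges < 2 or any(len(case) != n_judges for case in ratings):
        raise ValueError("each case needs the same number (>= 2) of judges")
    totals = Counter()
    p_bar = 0.0
    for case in ratings:
        counts = Counter(case)
        totals.update(counts)
        # Proportion of agreeing judge pairs for this case.
        agree = sum(c * c for c in counts.values()) - n_judges
        p_bar += agree / (n_judges * (n_judges - 1))
    p_bar /= n_cases
    # Chance agreement from the overall category proportions.
    p_e = sum((t / (n_cases * n_judges)) ** 2 for t in totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)

# Invented example: three judges, two cases.
print(fleiss_kappa([["a", "a", "b"], ["b", "b", "b"]]))
```

Because the measure is built on pairs of judges within each case, it does not require the same individuals to rate every case, only the same number of ratings per case.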

However, in many situations the conclusions of only one expert (judge) are not sufficient. The human expert proceeds in an intuitive fashion, directly “jumping” to the diagnostic category he judges to apply best, even if it means checking later whether all the necessary criteria are present. Thus, it is also important to compare the experts' conclusions with one another.


The majority of KBS developers are expert system specialists whose major objective is the development of new conceptual tools and software. For their needs, validation of the system as a tool for routine use is not absolutely indispensable from the strict point of view of artificial intelligence. It falls to the user to establish through utilization the validity of this new tool.

The credibility of artificial intelligence among scientists is built far more on criteria of theoretical and conceptual innovation than on objective validation criteria. As a result, the validation phase, given its methodological demands and the human and financial resources it requires, falls outside the realization field. That is why many artificial intelligence systems never leave the laboratory and have only given rise to publications in scientific journals or communications at congresses.

With the advance of technology, AI now has the possibility of developing very powerful tools. However, the expectations of users are also increasing.

They expect from an expert system good reliability, suitable validation, and ease of use. They expect that the knowledge based system will take into account the constraints of the reference classifications. At the same time, it is expected that users can follow the progression of the system's reasoning to judge and sanction it when necessary. Consequently, special care must be taken to validate the expert system if we want the system to be used routinely, and considerable reflection and effort have to be devoted to developing standardized methods for the validation of KBS.


  1. J. Gaschnig, P. Klahr, H. Pople, E. H. Shortliffe and A. Terry, Evaluation of expert systems: Issues and case studies. In: Building Expert Systems, Addison-Wesley Publishers, Massachusetts, 1983.
  2. C. Whitbeck, Criteria for evaluating a computer aid to clinical reasoning. J. Med. Philo., 8 (1983) 51.
  3. M. Fieschi and M. Joubert, Some reflections on the evaluation of expert systems in medicine. Methods Inform. Med., 25 (1985) 15.
  4. P. L. Miller, The evaluation of artificial intelligence systems in medicine. Comput. Methods Programs Biomed., 22 (1986) 5.
  5. M. Ohayon, Intelligence artificielle et psychiatrie. Masson, Paris, 1990.
  6. M. Ohayon and M. Caulet, Adinfer: experience of an expert system in psychiatry. In: K. C. Lun et al. (eds), MEDINFO 92 (pp. 615-619). Elsevier Science Publishers, Amsterdam, 1992.
  7. American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (DSM-III), The American Psychiatric Association, Washington, 1980.
  8. American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (DSM-III-R), The American Psychiatric Association, Washington, 1987.
  9. American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (DSM-IV), The American Psychiatric Association, Washington, 1994.
  10. M. Ohayon and J. Fondaraï, DSM-III et clinique psychiatrique: Perspectives méthodologiques en psychiatrie. In: Comptes-rendus du LXXXIV congrès de psychiatrie et de neurologie de langue française, Le Mans, 23-27 juin 1986. Masson, Paris, (1986) 135.
  11. M. Ohayon and J. Fondaraï, Convergences et divergences entre DSM-III et pratique psychiatrique française. Ann. Médico-Psychol., 144 (1986) 515-530.
  12. M. Ohayon and Y. Poinso, Compatibilité du DSM-III avec la nosographie psychiatrique française. Psychol. Med., 19 (1987) 367.
  13. M. Ohayon and Y. Poinso, Prédictivité et évaluation clinique de la réponse antidépressive à la Maprotiline. Utilisation d'un système d'aide au diagnostic. Actualités Psychiatriques, 18 (1988) 120.
  14. M. Ohayon, Validation of a Knowledge Based System (Adinfer) versus human experts. Proceedings of the Twelfth International Congress on Medical Informatics, MIE-94, 1994.
  15. WHO (World Health Organisation), The ICD-10 Classification of Mental and Behavioural Disorders: Clinical Descriptions and Diagnostic Guidelines. World Health Organisation, 1992.
  16. M. Ohayon, Epidemiological study on insomnia in a general population. Sleep, in press.
  17. M. Ohayon and M. Caulet, Insomnia and psychotropic drug consumption. Prog. Neuro-Psych. Bio. Psych., 19 (1995, in press).
  18. D. E. Ford and D. B. Kamerow, Epidemiologic study of sleep disturbances and psychiatric disorders. An opportunity for prevention? J. Am. Med. Assoc., 262 (1989) 1479.
  19. J. Cohen, A coefficient of agreement for nominal scales. Educ. Psychol. Meas., 20 (1960) 37.
  20. J. L. Fleiss, Statistical methods for rates and proportions, John Wiley & Sons, New York, 1973.
  21. J. Cohen, Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull., 70 (1968) 213.
  22. J. L. Fleiss, Measuring nominal scale agreement among many raters. Psychol. Bull., 76 (1971) 378.