Clinical & PSG Validation of the Sleep-EVAL Diagnosis Expert System

Sleep-EVAL, an artificial intelligence computer program, is an expert system for the evaluation and diagnosis of sleep and mental disorders in general and clinical populations.

Sleep and psychiatric epidemiological studies conducted in the general population have benefited greatly from the use of computerized tools.

These tools make it possible to conduct surveys of large samples of the general population. One benefit is greater accuracy in estimating the prevalence of infrequent symptoms and diagnoses. Many types of computerized tools exist in psychiatry, but only a few exist in the field of sleep medicine. In psychiatry, computer-based tools range from:

  • the simple computerization of paper assessment tools (1-4), such as the SCAN (1), the Interactive Self-Assessment Scale (ISAS) (2) or the ChronoRecord (3), to
  • computer-assisted diagnoses, such as the Computer Assisted Diagnostic Interview (CADI) (5) or the Computer-Based Structured Clinical Interview for DSM-IV (CB-SCID1) (6), and
  • computer medical decision support systems (CMDSS), such as the Computerized Suicide Risk Scale based on back-propagation neural networks (CSRS-BP) (7).

CMDSS, which include expert systems, should be differentiated from computerized assessment tools. Both share many characteristics but possess different architectures.

Some of the advantages of using computerized tools and CMDSS include:

  • minimal training time,
  • absence of omitted answers,
  • uniformity of administration,
  • elimination of data coding and transcription errors and
  • coverage of diagnoses that are less often considered because of their rarity.

Computerized tools possess limited reasoning capacities and operate only according to diagnostic algorithms or predetermined decisional trees implemented in the software.

Expert systems, on the other hand, have all the features of computerized tools and, in addition, carry out a reasoning process throughout the interview. This allows decisional trees to be built during the interview rather than fixed in advance. Furthermore, multiple classifications can be formalized in the knowledge base.

Classifications in psychiatry and sleep medicine are based on the grouping of symptoms into criteria and of criteria into syndromes and diagnoses. In these classifications, diagnostic pathways depend on the presence or absence of criteria. This works when the patient can answer the criteria quantitatively, but many criteria in sleep medicine are qualitative. In such cases, the classifications leave the final decision to the clinician. This is appropriate in clinical practice but unacceptable in epidemiological studies: many interviewers are needed, and because each has a different level of education and makes subjective judgments, the risk of diagnostic error is multiplied.

Therefore, a computerized diagnostic system intended for general populations and for different types of users needs to be designed so that the “judgment” on a symptom is made only by the computer, based on the patient’s answers. It must also take into account the interviewees’ ambivalence. One possible strategy for dealing with this type of uncertainty is fuzzy modeling. The advantage of this approach is that it is based on linguistic variables, which are closer to human language. In this study, we present Sleep-EVAL, an expert system specializing in the assessment of sleep and psychiatric disorders in the general population that integrates fuzzy reasoning, and we assess its validity in a clinical sample from a sleep disorders clinic.



Sleep-EVAL (8,9) is a non-monotonic, level-2 expert system endowed with a causal reasoning mode. It comprises a knowledge base, an inference engine, two neural networks, a mathematical preprocessor and a user interface.

The version used in epidemiological studies possesses a sampling manager and a performance manager.

The knowledge base contains:

  • a questionnaire on sleep habits, sleep hygiene and medical history, and
  • the DSM-IV (10) and International Classification of Sleep Disorders (ICSD-97) (11) classifications.

Each question is paired with a fuzzy set of answers. The fuzzy sets are used to qualify:

  • frequency,
  • quantity,
  • intensity and
  • various degrees of “yes/no” (from uncertainty to certainty: 5 levels).
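The idea of pairing each question with a fuzzy set of answers can be illustrated with a small sketch. It is not Sleep-EVAL's knowledge base: the nights-per-week scale, the triangular membership functions, the linguistic terms and the five yes/no labels are all hypothetical choices made for illustration. It shows how one numeric answer can belong partially to two neighboring linguistic terms at once.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Triangle:
    """Triangular membership function: 0 outside (a, c), 1 at the peak b."""
    a: float
    b: float
    c: float

    def membership(self, x: float) -> float:
        if x == self.b:
            return 1.0
        if x <= self.a or x >= self.c:
            return 0.0
        if x < self.b:
            return (x - self.a) / (self.b - self.a)  # rising edge
        return (self.c - x) / (self.c - self.b)      # falling edge


# Hypothetical linguistic terms for "How many nights per week?" (0 to 7).
FREQUENCY_TERMS = {
    "never": Triangle(0, 0, 1),
    "rarely": Triangle(0, 1, 2),
    "sometimes": Triangle(1, 2, 4),
    "often": Triangle(2, 4, 6),
    "most nights": Triangle(4, 7, 7),
}

# Hypothetical five-level yes/no scale mapped to a signed certainty in [-1, 1].
YES_NO_LEVELS = {
    "certainly not": -1.0,
    "probably not": -0.5,
    "uncertain": 0.0,
    "probably": 0.5,
    "certainly": 1.0,
}


def fuzzify(nights_per_week: float) -> dict[str, float]:
    """Degree of membership of a reported frequency in each linguistic term."""
    return {term: tri.membership(nights_per_week) for term, tri in FREQUENCY_TERMS.items()}


if __name__ == "__main__":
    print(fuzzify(3))  # 3 nights/week is half "sometimes" and half "often"
```

Overlapping triangles are one common way to let adjacent terms share a boundary answer instead of forcing a crisp cut-off.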

The inference engine (knowledge processor) is the part of the expert system that finds solutions to problems, with the help of the knowledge base. It has the task of:

  • posing questions,
  • inferring hypotheses based on the answers and
  • then making diagnostic conclusions.

A typical interview begins with a standard questionnaire administered to all interviewees. From the responses, the system draws a series of diagnostic hypotheses that it attempts to confirm or reject through further questioning or deductions (non-monotonic, level-2 feature). The system pursues its diagnostic exploration until all diagnostic possibilities are exhausted.
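The hypothesize-and-test cycle described above can be sketched as an agenda of diagnostic hypotheses that grows as answers arrive. This is a simplified illustration under stated assumptions, not Sleep-EVAL's inference engine: answers are plain booleans rather than fuzzy sets, a hypothetical `triggers` table decides which answers raise which hypotheses, and a diagnosis is confirmed when a hypothetical minimum number of its criteria are answered "yes". It does not model the retraction of earlier conclusions that a non-monotonic system allows.

```python
from __future__ import annotations

from collections import deque
from typing import Callable


def run_interview(
    screening: dict[str, bool],
    triggers: dict[str, list[str]],
    criteria: dict[str, tuple[list[str], int]],
    ask: Callable[[str], bool],
) -> dict[str, bool]:
    """Explore every hypothesis raised by the answers until none is left."""
    answers = dict(screening)      # standard questionnaire given to everyone
    agenda: deque = deque()        # hypotheses still to confirm or reject
    raised: set = set()

    def raise_hypotheses(question: str) -> None:
        if answers[question]:
            for dx in triggers.get(question, []):
                if dx not in raised:
                    raised.add(dx)
                    agenda.append(dx)

    for question in screening:
        raise_hypotheses(question)

    conclusions: dict = {}
    while agenda:
        dx = agenda.popleft()
        questions, needed = criteria[dx]
        for question in questions:
            if question not in answers:     # reuse earlier answers, ask otherwise
                answers[question] = ask(question)
                raise_hypotheses(question)  # a new answer may open a new hypothesis
        conclusions[dx] = sum(answers[q] for q in questions) >= needed
    return conclusions


# Hypothetical example: snoring raises sleep apnea; sleepiness raises narcolepsy.
TRIGGERS = {"snoring": ["sleep apnea"], "daytime sleepiness": ["narcolepsy"]}
CRITERIA = {
    "sleep apnea": (["snoring", "breathing pauses", "daytime sleepiness"], 2),
    "narcolepsy": (["daytime sleepiness", "cataplexy"], 2),
}
PATIENT = {"breathing pauses": True, "daytime sleepiness": True, "cataplexy": False}

if __name__ == "__main__":
    print(run_interview({"snoring": True}, TRIGGERS, CRITERIA, PATIENT.__getitem__))
```

Because "daytime sleepiness" is already known when narcolepsy is explored, it is not asked a second time; exploration stops only when the agenda is empty.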

The inference engine uses the neural networks to manage uncertainty in the subject’s answers as well as in criteria and diagnoses. One neural network is devoted strictly to the management of fuzzy sets of answers; the other calculates the relative weight of nosological objects (symptoms, criteria, syndromes, diagnoses). The cumulative weights are used to determine the presence or absence of a criterion or a diagnosis. In the end, each explored nosological object (including diagnoses) has a degree of certainty (or weight) ranging from 0.4 (completely present) to -0.4 (completely absent).

The inference engine also calls on the mathematical preprocessor when it encounters special instructions in the knowledge base. The preprocessor can perform simple operations such as addition or subtraction as well as more complex ones such as converting age into months or hours into minutes. It can also calculate the body mass index or compare the durations of symptoms.
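The two quantitative steps just described, combining weighted evidence into a bounded degree of certainty and the preprocessor's arithmetic, can be sketched as follows. The text does not specify how Sleep-EVAL's neural network computes weights, so the weighted mean below is a deliberately simple stand-in (an assumption); only the ±0.4 bound comes from the description. All function names are hypothetical.

```python
from __future__ import annotations

MAX_CERTAINTY = 0.4  # from the text: +0.4 = completely present, -0.4 = completely absent


def object_certainty(evidence: list[tuple[float, float]]) -> float:
    """Combine (relative weight, signed certainty in [-1, 1]) pairs.

    Assumption: a weighted mean rescaled to [-0.4, 0.4]. Sleep-EVAL itself
    derives the relative weights with a neural network, which is not modeled here.
    """
    if any(weight < 0 for weight, _ in evidence):
        raise ValueError("relative weights must be non-negative")
    total = sum(weight for weight, _ in evidence)
    if total == 0:
        return 0.0  # nothing explored: neither present nor absent
    mean = sum(weight * certainty for weight, certainty in evidence) / total
    return MAX_CERTAINTY * mean


# Examples of the preprocessor arithmetic mentioned in the text.
def years_to_months(years: float) -> float:
    return years * 12


def hours_to_minutes(hours: float) -> float:
    return hours * 60


def body_mass_index(weight_kg: float, height_m: float) -> float:
    return weight_kg / height_m ** 2


if __name__ == "__main__":
    # Two strongly endorsed symptoms (weight 3 each) against one denied symptom (weight 1).
    print(object_certainty([(3, 1.0), (3, 1.0), (1, -1.0)]))
    print(round(body_mass_index(80, 1.8), 1))
```

Because a weighted mean of values in [-1, 1] with non-negative weights stays in [-1, 1], the rescaled result can never leave the ±0.4 range.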

The procedure is dynamic and remains in continuous flux throughout the interview. However, some control is exercised over the structure of the interview: while exploring a diagnosis, the system is authorized to ask additional questions only when the sum of the weights is not equal to 1, in order to adjust the weights.

In the version used in epidemiological surveys, a sampling manager and a performance manager are included:

  • The sampling manager is responsible for dialing phone numbers, managing the creation of new phone numbers and selecting the individual to interview in a household. It has an appointment organizer that displays a memo when an appointment is due. It also loops through the numbers to be called back before allowing the use of a telephone number that has never been called.
  • The performance manager gives detailed accounts of the daily and weekly performance of the interviewer team, such as the number of completed interviews, the number of dialed phone numbers and the number of refusals.
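The calling order described for the sampling manager (appointments whose time has come, then numbers waiting to be called back, and only then never-called numbers) can be expressed as a small priority rule. The class and method names are hypothetical, the phone numbers are fictitious, and household respondent selection and the generation of new numbers are not modeled.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SamplingManager:
    fresh: deque = field(default_factory=deque)       # numbers never called
    callbacks: deque = field(default_factory=deque)   # called before, to be called back
    appointments: dict = field(default_factory=dict)  # number -> (time, memo)

    def schedule(self, number: str, when: datetime, memo: str) -> None:
        self.appointments[number] = (when, memo)

    def next_call(self, now: datetime) -> tuple[str, str | None] | None:
        """Return (number, memo to display), or None when nothing is left."""
        due = sorted((when, number) for number, (when, _) in self.appointments.items() if when <= now)
        if due:                                 # 1. appointments whose time has come
            number = due[0][1]
            _, memo = self.appointments.pop(number)
            return number, memo
        if self.callbacks:                      # 2. call-backs before any new number
            return self.callbacks.popleft(), None
        if self.fresh:                          # 3. only then a never-called number
            return self.fresh.popleft(), None
        return None


if __name__ == "__main__":
    manager = SamplingManager(fresh=deque(["555-0100", "555-0101"]), callbacks=deque(["555-0199"]))
    manager.schedule("555-0142", datetime(2005, 3, 1, 19, 0), "Respondent asked to be called after dinner")
    print(manager.next_call(datetime(2005, 3, 1, 18, 0)))   # call-back: appointment not yet due
    print(manager.next_call(datetime(2005, 3, 1, 19, 30)))  # appointment, with its memo
```

Keeping the three pools separate makes the "call-backs before fresh numbers" rule explicit rather than relying on the order in which numbers were loaded.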


Seventy-two (72) patients from the Sleep and Alertness Clinic of the Toronto Western Hospital (Canada) participated in this study. New patients aged 18 years or over consulting at the sleep clinic were randomly selected over a five-month period.


Each patient was interviewed twice, once over the telephone by a physician using Sleep-EVAL and again by a sleep specialist.

The physician using Sleep-EVAL was not a sleep specialist and remained blind to the diagnoses made both by Sleep-EVAL and by the sleep specialist.

For most patients (n=41), both interviews took place on the same day; the longest interval between the sleep specialist interview and the Sleep-EVAL interview was 15 days.

At the end of the usual clinical interview, the sleep specialist was asked to provide his or her diagnoses, up to a maximum of three. A list of symptoms underlying each diagnosis was also requested.

The sleep specialist then revised his or her diagnoses once the polysomnographic results were available. These revised diagnoses were those used for comparison with the Sleep-EVAL system.

Sleep specialists and polysomnographic assessments

Diagnoses were given by six different sleep specialists.

All patients underwent at least one night of polysomnography; 42 of them were recorded over two nights.

The following physiological measures were recorded:

  • electromyogram,
  • electroencephalogram (nighttime recording),
  • snoring sound,
  • electro-oculogram,
  • electrocardiogram,
  • oximetry and airflow (oral and nasal thermistors).

The recordings were scored according to Rechtschaffen and Kales (12) and the standard ASDA criteria for EEG arousals (13).

Statistical analyses

Agreement between the two sources of diagnoses was calculated using Cohen’s kappa (K) (14,15), with the diagnoses of the sleep specialists serving as the gold standard. Agreement was rated as:

  • poor when K < .40;
  • fair to good when K was between .40 and .75; and
  • excellent when K ≥ .75.

Sensitivity (the proportion of patients with the diagnosis who have a positive result for this diagnosis with Sleep-EVAL), and specificity (the proportion of patients without the diagnosis who have a negative result for this diagnosis with Sleep-EVAL) were also calculated.
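The three agreement statistics can be computed from a 2 × 2 table that crosses the sleep specialists' diagnosis (the gold standard) with Sleep-EVAL's. The sketch below is ours, not the study's analysis code. As a check, the counts in the example were back-calculated from the Obstructive Sleep Apnea row of Table 3 (36 Sleep-EVAL and 37 specialist diagnoses among 72 patients, sensitivity 94.6%, specificity 97.1%); they reproduce the reported kappa of .92.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # diagnosis given by both the sleep specialist and Sleep-EVAL
    fn: int  # given by the specialist only (missed by Sleep-EVAL)
    fp: int  # given by Sleep-EVAL only
    tn: int  # given by neither

    @property
    def n(self) -> int:
        return self.tp + self.fn + self.fp + self.tn

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def kappa(self) -> float:
        """Cohen's kappa: observed agreement corrected for chance agreement."""
        n = self.n
        observed = (self.tp + self.tn) / n
        eval_yes, spec_yes = (self.tp + self.fp) / n, (self.tp + self.fn) / n
        expected = eval_yes * spec_yes + (1 - eval_yes) * (1 - spec_yes)
        return (observed - expected) / (1 - expected)


def interpret_kappa(k: float) -> str:
    """Thresholds used in this study."""
    if k < 0.40:
        return "poor"
    if k < 0.75:
        return "fair to good"
    return "excellent"


# Obstructive Sleep Apnea, counts back-calculated from Table 3.
osa = TwoByTwo(tp=35, fn=2, fp=1, tn=34)

if __name__ == "__main__":
    print(f"n={osa.n} sens={osa.sensitivity():.1%} spec={osa.specificity():.1%} "
          f"kappa={osa.kappa():.2f} ({interpret_kappa(osa.kappa())})")
    # n=72 sens=94.6% spec=97.1% kappa=0.92 (excellent)
```

Chance agreement here is exactly 0.5 because Sleep-EVAL diagnosed half the sample, so an observed agreement of 69/72 yields K ≈ .92.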


The sample was composed of 37 women, aged 45.6 (±12.9) years on average, and 35 men, aged 49.3 (±10.7) years. The patients were mostly married (74.6%) and white (86.1%). Most of the patients had at least a college education (66.6%). The most frequent reasons for consultation were:

  • sleep problems (47.6%),
  • apnea (27.4%),
  • insomnia (16.7%),
  • fatigue (11.9%) and
  • snoring (11.9%).


Table 1 shows the most frequent symptoms recorded by the sleep specialists during the clinical evaluation and their frequency in the Sleep-EVAL telephone interview conducted by the assistant physician.

Table 1. Most frequent symptoms reported by the sleep specialists
Symptom                         Sleep specialists, % (n)   Sleep-EVAL, % (n)
Snoring                         62.0 (44)                  67.6 (48)
Sleep apnea (breathing pauses)  36.6 (22)                  42.3 (30)
Difficulty maintaining sleep    49.3 (35)                  84.5 (61)
Difficulty initiating sleep     25.4 (18)                  36.6 (26)
Morning headaches               26.8 (14)                  39.4 (28)
Excessive daytime sleepiness    31.0 (22)                  40.8 (29)
Daytime fatigue                 63.4 (45)                  84.5 (60)
Choking during sleep            21.1 (15)                  19.7 (14)
Dry mouth upon awakening        32.4 (23)                  60.6 (44)

Table 2 presents the measures of agreement for the symptoms.

As can be seen, agreement was good for:

  • breathing pauses during sleep,
  • difficulty initiating sleep and
  • excessive daytime sleepiness (K > .60).

Agreement was acceptable for:

  • snoring,
  • choking during sleep,
  • morning headaches and
  • dry mouth upon awakening (K between .42 and .52).

Agreement was poor for difficulty maintaining sleep and daytime fatigue (K ≤ .30).

In contrast, sensitivity exceeded 80% for seven of the nine symptoms, which means that most of the positive cases identified by the sleep specialists were also identified by Sleep-EVAL. Specificity was lower, especially for daytime fatigue, difficulty maintaining sleep and dry mouth upon awakening.

These results were expected: Sleep-EVAL systematically collects information on the key symptoms associated with all sleep disorders. This increases the likelihood that a patient scores positive on symptoms that are not essential clinical features of his or her disorder and that are therefore less likely to be recorded by the sleep specialists.

Table 2. Measures of agreement between Sleep-EVAL and sleep specialists for the most frequent symptoms
Symptom                         Sensitivity   Specificity   Kappa
Snoring                         86.4%         63.0%         .51
Breathing pauses during sleep   84.6%         82.2%         .65
Excessive daytime sleepiness    90.6%         81.6%         .67
Daytime fatigue                 91.1%         26.9%         .21
Difficulty initiating sleep     88.9%         81.1%         .61
Difficulty maintaining sleep    100.0%        30.6%         .30
Morning headaches               78.9%         75.0%         .45
Dry mouth upon awakening        95.7%         56.3%         .42
Choking during sleep            60.0%         91.1%         .52


Sleep specialists gave an average of 1.65 (±0.61) diagnoses per patient, whereas Sleep-EVAL gave an average of 2.9 (±2.1) diagnoses per patient (p<.01). Table 3 presents the agreement between the sleep specialists and the Sleep-EVAL system for the most frequent diagnoses. As can be seen, kappa was excellent for Obstructive Sleep Apnea Syndrome, a sleep disorder characterized by breathing pauses during sleep (sleep apnea), loud snoring and excessive daytime sleepiness.

Agreement was also high for:

  • insomnia (K = .77) and
  • periodic limb movement disorder (K = .73).

Sensitivity and specificity were high for all the diagnoses.

These results were also expected: diagnoses are complex entities that require the presence of several symptoms. Therefore, even when patients reported a given symptom, they were less likely to report the two or more symptoms required for the same diagnostic entity, which limited the impact of false-positive symptoms on diagnostic agreement.

Table 3. Agreement on diagnosis between Sleep-EVAL and sleep specialists
Diagnosis                 Sleep-EVAL, % (n)   Sleep specialists, % (n)   Sensitivity   Specificity   Kappa
Obstructive Sleep Apnea   50.0 (36)           51.4 (37)                  94.6%         97.1%         .92
Insomnia                  26.4 (19)           20.1 (15)                  93.3%         90.4%         .77
Periodic limb movement    30.5 (22)           30.4 (22)                  81.8%         91.1%         .73
Narcolepsy                4.2 (3)             4.2 (3)                    100.0%        100.0%        1.0


The validity of Sleep-EVAL was tested in a clinical setting against the routine clinical assessment of sleep specialists and polysomnographic data. Overall, diagnostic agreement was good between Sleep-EVAL and the sleep specialists.

Sleep-EVAL formulated more diagnoses than the sleep specialists did. This was expected, since the system explored all potential diagnoses, covering all inclusion and exclusion criteria. This is one of the main characteristics of computerized diagnostic tools: they systematically investigate all possible diagnoses, even after a diagnosis has been reached. This is a definite advantage in epidemiological studies aimed at determining the prevalence of all sleep disorders.

As our results illustrate, agreement on individual symptoms was acceptable at best. As already stated, this is due to the systematic investigation of all key symptoms related to the different sleep and psychiatric pathologies. For example, difficulty maintaining sleep, which is considered the most prevalent insomnia symptom in the general population, is not specific to insomnia. It is also frequently seen in several other sleep disorders, such as periodic limb movement disorder or obstructive sleep apnea syndrome, and in psychiatric disorders such as depressive disorders, anxiety disorders, eating disorders and drug or medication withdrawal. Consequently, an individual complaining of difficulty maintaining sleep can have a variety of disorders that need to be explored before an insomnia diagnosis is considered. This also indicates that epidemiologists should be cautious when presenting results on insomnia based solely on the presence of difficulty initiating or maintaining sleep.

The fact that Sleep-EVAL diagnoses were compared with those made by sleep specialists during a routine clinical assessment, rather than a structured interview, may have placed the expert system at an unfair disadvantage. Several studies have shown that the validity of a diagnostic tool increases when it is tested against structured clinical interviews. However, since polysomnographic assessments were performed to confirm the sleep specialists’ diagnoses, a structured clinical interview would have made a difference for only a small number of patients. Furthermore, no structured clinical interview exists for the assessment of sleep disorders.

This study supports the validity of the Sleep-EVAL system as a useful research tool. The study also highlights the “blind spots” that sleep clinicians may have. For example, fatigue does not appear to be systematically evaluated by sleep specialists, although recent clinical research (16) has shown that it is a prominent complaint among sleep clinic attendees. Likewise, the complaint of difficulty maintaining sleep is under-appreciated by sleep clinicians.


  1. Wing JK, Babor T, Brugha T, Burke J, Cooper JE, Giel R, Jablensky A, Regier D, Sartorius N. SCAN. Schedules for Clinical Assessment in Neuropsychiatry. Arch Gen Psychiatry. 1990;47:589-593.
  2. Weber B, Fritze J, Schneider B, Simminger D, Maurer K. Computerized self-assessment in psychiatric in-patients: acceptability, feasibility and influence of computer attitude. Acta Psychiatr Scand. 1998;98:140-145.
  3. Whybrow PC, Grof P, Gyulai L, Rasgon N, Glenn T, Bauer M. The electronic assessment of the longitudinal course of bipolar disorder: the ChronoRecord software. Pharmacopsychiatry. 2003;36 Suppl 3:S244-9.
  4. Harel TZ, Smith DW, Rowles JM. A comparison of psychiatrists' clinical-impression-based and social workers' computer-generated GAF scores. Psychiatr Serv. 2002;53:340-342.
  5. Miller PR, Dasher R, Collins R, Griffiths P, Brown F. Inpatient diagnostic assessments: 1. Accuracy of structured vs. unstructured interviews. Psychiatry Res. 2001;105:255-264.
  6. Bergman LG, Fors UG. Computer-aided DSM-IV-diagnostics - acceptance, use and perceived usefulness in relation to users' learning styles. BMC Med Inform Decis Mak. 2005;5:1.
  7. Modai I, Ritsner M, Kurs R, Mendel S, Ponizovsky A. Validation of the Computerized Suicide Risk Scale--a backpropagation neural network instrument (CSRS-BP). Eur Psychiatry. 2002;17:75-81.
  8. Ohayon M. Knowledge Based System Sleep-EVAL: Decisional Trees and Questionnaires. Ottawa: National Library of Canada; 1995.
  9. Ohayon M. Improving decision-making processes with the fuzzy logic approach in the epidemiology of sleep disorders. J Psychosom Res. 1999;47:297-311.
  10. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV). Washington, DC: American Psychiatric Association; 1994.
  11. American Academy of Sleep Medicine. The International Classification of Sleep Disorders, Revised: Diagnostic and Coding Manual. Chicago, Illinois: American Academy of Sleep Medicine; 1997.
  12. Rechtschaffen A, Kales A, eds. A manual of standardized terminology, techniques and scoring systems for sleep stages of human subjects. Los Angeles: Brain Information Service/Brain Research Institute, UCLA; 1968.
  13. American Sleep Disorders Association (ASDA). EEG arousals: scoring rules and examples. Sleep. 1992;15:173-184.
  14. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37-46.
  15. Byrt T. How good is that agreement? Epidemiology. 1996;7:161.
  16. Hossain JL, Ahmad P, Reinish LW, Kayumov L, Hossain NK, Shapiro CM. Subjective fatigue and subjective sleepiness: two independent consequences of sleep disorders? J Sleep Res. 2005;14:245-253.