Sleep-EVAL against clinical assessments of sleep disorders and polysomnographic data

A number of computerized tools have been developed over the years with the aim of simulating human reasoning. Expert systems in particular are well on their way to fulfilling the great expectations of Artificial Intelligence engineers. Today, this technology is no longer mere fodder for the popular imagination. It has stepped out of the realm of fiction and found myriad applications in various areas of science (1-6). The first such application in the domain of medicine came with the development of EMYCIN (7), an expert system specialized in diagnostic recommendations for general practice. More than one hundred expert systems have since followed in its wake (3-6,8), some in the field of sleep medicine (9). Until recently, no expert system had ever been designed specifically as a diagnostic instrument for use in clinical or epidemiological studies of sleep disorders. Sleep-EVAL (10) is the first such tool.

Aside from exposing any faults, errors, or inaccuracies in investigative instruments, the practice of validation provides an opportunity to extend scientific interest in a particular area by re-examining theoretical and applied research and encouraging clinical advancements.

Diagnostic tools in medicine, however, present a unique challenge. The validity of such tools is measured by their ability to reach diagnoses comparable to those of a recognized ‘gold standard’, such as an already validated questionnaire or experts in the field. In sleep medicine, where no definitive benchmark exists, sleep specialists are designated the gold standard by default, even though diagnostic agreement between specialists remains low, especially when DSM-IV criteria are applied (11).

Certain sleep disorder diagnoses, however, are formulated on the basis of objective polysomnographic measures. This is the case for obstructive sleep apnea syndrome (OSAS). The respiratory disturbance index (RDI) or apnea/hypopnea index (AHI) is used to determine whether breathing patterns are abnormal. Usually, an RDI greater than or equal to 5 is considered to represent an inordinate number of sleep respiratory disturbances. This criterion, however, was recently called into question by Guilleminault et al., who reported that it was clinically impossible to distinguish OSAS from upper airway resistance syndrome (UARS) in children (12) and that there was no difference between complaints of OSAS and UARS in adults (13). Yet, most sleep centers still use an RDI ≥ 5 as a diagnostic indicator.
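To illustrate the cutoff just described, the index can be computed directly from event counts and sleep time. This is a minimal sketch; the function name and the example numbers are hypothetical and are not part of Sleep-EVAL or of any center's scoring software:

```python
def respiratory_disturbance_index(apneas, hypopneas, total_sleep_hours):
    """Respiratory events per hour of sleep (RDI, or AHI when only
    apneas and hypopneas are counted)."""
    return (apneas + hypopneas) / total_sleep_hours

# Hypothetical night: 28 apneas and 14 hypopneas over 7 hours of sleep
rdi = respiratory_disturbance_index(28, 14, 7.0)  # 6.0 events/hour
meets_criterion = rdi >= 5  # True under the conventional RDI >= 5 cutoff
```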

As flow limitation and abnormal increase in respiratory effort are still rarely investigated today, we focused in the present study on the more widely accepted RDI criterion of ≥ 5 events/hour and examined just how reliable OSAS diagnoses are when based solely on the self-report of patients. Sleep-EVAL is an expert system that attempts to reproduce the reasoning of a physician specialized in sleep medicine.

The following presents the results of a validation study carried out through routine assessments of patients at two sleep disorder centers.

Sixty-one patients from the Sleep Disorders Center of Stanford University (United States) and 44 from the Regensburg University (Germany) sleep clinic consented to participate in this validation study. Both sites were requested to randomly recruit patients 18 years of age or over from within their clientele.

Subjects were excluded if polysomnographic recordings were requested for complaints other than a sleep disorder. At the Regensburg clinic, 26 healthy volunteer subjects were also included to determine rates of false positives. Each patient was interviewed twice, once by a physician using Sleep-EVAL and again by a sleep specialist.

At both sites, the physician administering Sleep-EVAL was a non-sleep specialist who remained blind to the diagnoses reached by both Sleep-EVAL and the sleep specialist.

At the end of the usual clinical interview, the sleep specialist was requested to provide his or her diagnoses, up to a maximum of three, using the ICSD-90 (16).

A list of symptoms underlying each diagnosis was also requested.

The sleep specialist then revised his or her diagnoses once the polysomnographic results were available. These revised diagnoses were the only ones used for comparison with the Sleep-EVAL system. Diagnoses were given by four different sleep specialists at the Sleep Disorders Center of Stanford University and by five different sleep specialists at the Regensburg University sleep clinic.

All the patients were polysomnographically assessed. For the purposes of this study, a series of questions allowed the physician subsequently to enter the following polysomnographic results into the system:

  • EEG nighttime or daytime recording;
  • electro-oculography (EOG);
  • electromyography;
  • electrocardiogram;
  • oximetry;
  • CO2 monitoring;
  • airflow (nasal, oral);
  • esophageal manometry (PES);
  • body temperature monitoring;
  • MSLT;
  • HLA typing;
  • nocturnal penile tumescence (NPT);
  • esophageal pH;
  • snoring sound;
  • video monitoring.

These data were then used by Sleep-EVAL to produce a revised diagnosis.

Consequently, two kinds of diagnoses were available:

  • one based solely on self-report and
  • another based on both self-report and polysomnographic findings.

The diagnoses used in this study were those based solely on self-report.

Depending on the suspected sleep disorder, other measures were also used:

  • HLA typing,
  • esophageal manometry and pH,
  • video monitoring and
  • nocturnal penile tumescence.

The records were scored according to Rechtschaffen and Kales’ and ASDA’s standard criteria. Agreement between the two sources of diagnoses was calculated via Cohen’s Kappa (18, 19), with the diagnoses of the sleep specialists serving as the gold standard.

McNemar’s change test (20) was used to assess the level of significance in the rates of diagnoses between the sleep specialists and Sleep-EVAL, that is, whether Sleep-EVAL produced biased results.
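McNemar's test operates only on the discordant cells of a 2 × 2 diagnosis table (cases rated ‘present’ by one source and ‘absent’ by the other). The following is a minimal sketch using an exact binomial tail; the counts shown are illustrative, not the study's raw cells:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant counts b and c
    (cases where one rater said 'present' and the other 'absent').
    Under the null of no systematic bias, discordances split 50/50."""
    n = b + c
    k = min(b, c)
    # Two-sided binomial tail probability with p = 0.5
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)

# Illustrative: 3 discordant cases, all in the same direction
p_value = mcnemar_exact(3, 0)  # 0.25, not significant
```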

Agreement percentage was also calculated.
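Both statistics can be sketched for a single 2 × 2 diagnosis table. The cell counts below are hypothetical values chosen only to be consistent with the marginals of the OSAS row in Table 3; they illustrate the computation, not the study's raw data:

```python
def agreement_and_kappa(a, b, c, d):
    """Percent agreement and Cohen's kappa for a 2 x 2 rating table.
    a: both rate absent; b: Sleep-EVAL absent, specialist present;
    c: Sleep-EVAL present, specialist absent; d: both rate present."""
    n = a + b + c + d
    po = (a + d) / n                                       # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return 100 * po, (po - pe) / (1 - pe)

# Hypothetical cells consistent with the OSAS marginals in Table 3
pct, kappa = agreement_and_kappa(56, 3, 0, 37)  # ~96.9% agreement, kappa ~0.94
```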



The mean age of the 105 patients was 44.11 years (SD=15.64). Patients from the Stanford center were older (48.75 ± 13.95 years) than those from the Regensburg clinic (37.68 ± 13.72 years) (F=14.463, df=1, 103, p<0.005). Nearly two-thirds of the patients were men (63.8%) and over half were married (54.4%). Also, 63.8% were currently employed and 45.6% had at least 13 years of schooling. The two sites did not differ in terms of gender but did differ in terms of marital status, employment and education (Table 1). Patients were most often referred to the centers by a general practitioner (32.4%).

Table 1: Sociodemographic characteristics of the sample by sleep disorder center

                                       Sleep Disorder Center
                                Stanford     Regensburg    p value of
                                n=61         n=44          chi-square
Gender
  Woman                         31.3%        43.2%         n.s.
  Man                           68.9%        56.8%
Education
  < 9 yrs                       0            31.8%         < 0.001
  9-13 yrs                      39.0%        43.2%
  > 13 yrs                      61.0%        25.0%
Marital status
  Single                        20.3%        52.3%         < 0.01
  Married/living with someone   66.1%        38.6%
  Separated/divorced            11.7%        6.8%
  Widowed                       1.7%         6.8%
Working status
  Not working                   26.2%        50.0%         < 0.05
  Working                       73.8%        50.0%

All but nine of the patients underwent their two interviews and polysomnographic recordings within a six-month period. The nine exceptions visited the centers for a routine follow-up. These cases were dropped from the analyses. The distribution of diagnoses rendered by Sleep-EVAL and the sleep specialists is given in Table 2.

Table 2: Distribution of principal and secondary diagnoses

                                          Source of diagnoses
                                     Sleep-EVAL       Sleep specialists
Diagnoses (ICSD-90)                  % of patients    % of patients
                                     (n=96)           (n=96)
Obstructive sleep apnea syndrome     38.5             41.7
Sleep-disordered breathing NOS       13.5             15.6
Insomnia                             9.2              7.1
Narcolepsy                           3.4              3.4
Primary snoring                      3.1              2.1
REM-behavior disorder                2.1              2.1
Other diagnoses                      13.3             5.2
No diagnosis                         30.2             32.2

NOS = not otherwise specified.

Overall, sleep specialists formulated 17 different diagnoses. Sleep-EVAL reached a parasomnia diagnosis (e.g., sleep talking) more often than the sleep specialists.

As already mentioned, multiple diagnoses were possible. Sleep-EVAL formulated an average of 1.32 (SD=1.1) diagnoses per patient, compared with an average of 0.93 (SD=0.72) for the sleep specialists (t=3.91, df=94, p=0.001).

Patterns of agreement between Sleep-EVAL and the sleep specialists are presented in Table 3.

Table 3: Agreement between Sleep-EVAL and sleep specialists’ diagnoses

                                     Absence rated by             Presence rated by
ICSD Diagnoses                       Sleep-EVAL   Specialists     Sleep-EVAL   Specialists     % Agreement   Kappa
Any sleep-breathing disorder         41           40              55           56              96.9          0.94
- OSAS                               59           56              37           40              96.7          0.94
- Sleep-disordered breathing NOS     81           81              15           15              95.8          0.84
- Primary snoring                    93           94              3            2               96.8          0.38
Insomnia disorders                   87           90              9            6               96.9          0.78
Narcolepsy                           93           93              3            3               97.7          0.66
REM-behavior disorder                94           94              2            2               100.0         1.00
Any dyssomnia diagnosis              33           35              63           61              87.3          0.73

Agreement was excellent for sleep-breathing disorders (96.9%), with no significant difference between the two sites (Stanford: 94.2%, k=0.83; Regensburg: 100%, k=1.00). Agreement was also excellent for overall recognition of dyssomnias, OSAS (Stanford: 100%, k=1.00; Regensburg: 93.2%, k=0.84) and insomnia disorders. No significant difference emerged on McNemar’s test between the rates of diagnoses made by Sleep-EVAL and the sleep specialists.

Overall, Sleep-EVAL and the sleep specialists disagreed on only three cases of sleep-disordered breathing not otherwise specified. In two of these, the sleep specialist formulated a diagnosis of sleep-breathing disorder; the patients in question, however, reported only excessive daytime sleepiness and insomnia symptoms, which was insufficient for Sleep-EVAL to reach such a diagnosis. In the third case, Sleep-EVAL diagnosed primary snoring on the basis of loud snoring occurring at least two nights per week. The system also indicated that this patient might have narcolepsy, but one criterion was missing for a definite diagnosis; the sleep specialist, instead, gave only a diagnosis of narcolepsy.
Agreement on REM-behavior disorder was perfect. For one case of narcolepsy, the system did not find all the necessary criteria and thus classified the patient as “possibility of narcolepsy”. However, the number of cases for these two diagnoses was low (two cases of REM-behavior disorder and three of narcolepsy, according to the sleep specialists).

Of the 26 healthy volunteers, two received a diagnosis from the sleep specialists: one was diagnosed with OSAS and the other with sleep bruxism. The Sleep-EVAL system diagnosed the same two subjects with OSAS and sleep bruxism, respectively, and a third with stimulant-dependent (caffeine) sleep disorder.

The validity of Sleep-EVAL was tested in a clinical setting against the routine clinical assessment of sleep specialists and polysomnographic data in two sleep disorder centers. Overall, diagnostic agreement was good between Sleep-EVAL and the sleep specialists.

Sleep-EVAL formulated more diagnoses than did the sleep specialists. This is because the system explores all potential diagnoses, covering all the inclusion and exclusion criteria. This is one of the main characteristics of computerized diagnostic tools: They inexorably investigate all possible diagnoses even if a diagnosis has already been reached. This affords a definite advantage in epidemiological studies aimed at determining the prevalence of all possible sleep disorders.

The fact that Sleep-EVAL diagnoses were compared against those made by sleep specialists during a routine clinical assessment rather than a structured interview may have placed the expert system at an unfair disadvantage. In this regard, several studies have shown that the validity of an assessment tool increases when it is tested against structured clinical interviews (21). However, as polysomnographic assessments were done for the majority of patients in order to confirm the sleep specialists’ diagnoses, the use of a structured clinical interview would have made a difference for only a small number of patients.

Another limitation stems from the classification used. The ICSD-90 has often been criticized for the lack of validity of some of its diagnoses and for its broad range of possible diagnoses (84 in all). Schramm et al. (21) suggested that a more general classification was more likely to achieve a higher inter-rater agreement than was a detailed one. Buysse et al. (11) reported that diagnostic agreement between sleep specialists and non-specialists on the basis of the DSM-IV was low to moderate (0.30 to 0.57) in a context of unstructured clinical interviews.

In the present study, the use of the Sleep-EVAL system may have favored better agreement, given that the system served as a sleep specialist, with the non-sleep specialist physician questioning the patient according to the decisions made by Sleep-EVAL.

Finally, the agreement between two sleep specialists or between sleep specialists and non-specialists on the basis of the ICSD-90 is yet to be documented.

Based on the results of this study, the Sleep-EVAL system appears to be a valid instrument in the recognition of sleep disorders, particularly insomnia disorders and OSAS. Further studies are necessary to determine its validity in the diagnosis of less common sleep disorders such as hypersomnia, narcolepsy and REM-behavior sleep disorder.


  1. Patterson DW. Introduction to Artificial Intelligence & Expert Systems. Englewood Cliffs: Prentice Hall; 1990.
  2. Stefik M. Introduction to Knowledge Systems. San Francisco: Morgan Kaufmann Publishers, Inc; 1995.
  3. Bernelot Moens HJ. Validation of the AI/RHEUM Knowledge base with data from consecutive rheumatological outpatients. Meth Inform Med 1992; 31:175-181.
  4. Kahn CE Jr. Validation, clinical trial, and evaluation of a radiology expert system. Meth Inform Med 1991; 30:268-274.
  5. Lucas PJF, Janssens AR. Development and validation of HEPAR, an expert system for the diagnosis of disorders of the liver and biliary tract. Med Inform 1991; 16:259-270.
  6. Verdaguer A, Patak A, Sancho JJ, Sierra C, Sanz F. Validation of the medical expert system PNEUMON-IA. Comput Biomed Res 1992; 25:511-526.
  7. Shortliffe EH. EMYCIN: computer-based medical consultations. New York: Elsevier; 1976.
  8. Ohayon M, Caulet M. Adinfer: Experience with an expert system for psychiatry. Medinfo 1992; 7:615-619.
  9. Korpinen L, Frey H. Sleep Expert: an intelligent medical decision support system for sleep disorders. Medical Informatics 1993; 18:163-170.
  10. Ohayon MM. Sleep-EVAL, Knowledge base system for the diagnosis of sleep disorders. Registration #437699, Copyright Office, Ottawa: Industry Canada, Canadian Intellectual Property Office. 1994.
  11. Buysse DJ, Reynolds CF, Hauri P, Roth T, Stepanski E, Thorpy MJ, Bixler E, Kales A, Manfredi R, Vgontzas A, Stapf BS, Houck PR, Kupfer DJ. Diagnostic concordance for DSM-IV disorders: a report from the APA/NIMH DSM-IV field trial. Am J Psychiatry 1994; 151:1351-1360.
  12. Guilleminault C, Pelayo R, Leger D, Clerk A, Bocian RC. Recognition of sleep-disordered breathing in children. Pediatrics 1996; 98:871-882.
  13. Lee T, Oh C, Solnick AJ, Guilleminault C, Black JE. Upper airway resistance syndrome and obstructive sleep apnea: a comparison of clinical features. Sleep Res 1997; 26:4-11.
  14. Ohayon MM, Guilleminault C, Paiva T, Priest RG, Rapoport DM, Sagales T, Smirne S, Zulley J. An International Study on Sleep Disorders in the General Population: Methodological Aspects. Sleep 1997; 20:1086-1092.
  15. APA (American Psychiatric Association). Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV). Washington: The American Psychiatric Association, 1994.
  16. ICSD - Diagnostic Classification Steering Committee, Thorpy MJ., Chairman. International Classification Of Sleep Disorders: Diagnostic And Coding Manual (ICSD). Rochester, Minnesota: American Sleep Disorders Association, 1990.
  17. Zadeh LA. A theory of approximate reasoning. Machine Intelligence 1979; 9:149-194.
  18. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20:37‑46.
  19. Byrt T. How good is that agreement? Epidemiology, 1996; 7:161
  20. McNemar Q. Psychological Statistics. Fourth edition. New York: J. Wiley, 1969.
  21. Schramm E, Hohagen F, Grasshoff U, Rieman D, Hajak G, Weeß HG, Berger M. Test-retest reliability and validity of the structured interview for sleep disorders according to DSM-III-R. Am J Psychiatry 1993; 150: 867-872.
  22. Rolston DW. Principles of Artificial Intelligence and Expert Systems Development. San Francisco: McGraw-Hill Book Company; 1988.
  23. Rao VB, Rao HV. C++ Neural Networks & Fuzzy Logic. Second Edition. New York: MIS Press; 1995.