ABSTRACT (TSD'98 Submission)

Title: Learning and Classifying Medical Text Documents Using the Naive Bayes Algorithm

Authors: Jan Zizka (1) and Ales Bourek (2)

  1. Faculty of Informatics
    Masaryk University
    Botanicka 68a, 602 00 Brno, Czech Republic
    Phone: (+420 5) 41512337 Fax: (+420 5) 41212568
    E-mail: zizka@informatics.muni.cz
  2. Dept. of Obstetrics and Gynaecology
    Masaryk University, Teaching Hospital Brno-Bohunice
    Jihlavska 20, 639 01 Brno, Czech Republic
    Contact: http://www.med.muni.cz/~bourek/


The overwhelming number of easily accessible text documents in electronic form provides a valuable source of up-to-date information in medicine as well as in other areas. On the other hand, such large volumes of documents pose a major obstacle for users looking for relevant information. For example, a typical keyword search on the Internet may return thousands of text documents that are not structured in a uniform way, or are not structured at all. Reading and evaluating these documents is a tedious and time-consuming process that keeps physicians away from their other tasks.


In this paper, we describe an application of Bayesian learning methods to filtering large volumes of online accessible text documents. The aim is to present the user with only the most relevant documents from medical online text sources (e.g., MEDLINE, MEDLARS, CCOD, WWW, and other Internet sources). Among the thousands of documents retrieved, usually only a small fraction has a chance of being relevant, because the primary search criterion (a small set of key words) is too coarse. Due to the absence of any particular document structure, the computerized selection of relevant information is unusually difficult, as the user is not able to define clear selection criteria (e.g., in the form of database queries and index files). Also, any transformation of the large number of heterogeneous documents from various sources into a database form would be very cumbersome, slow, and ineffective.

The naive Bayesian learning/classifying method used in our experiments is based on the following: a collection of text documents is used to train the classifier to adapt to the particular user's needs. The user marks the documents of interest in the training set, dividing it into a part called "relevant" and a part called "irrelevant." Each document is then represented as a sequence of words, and the Bayesian classifier assigns to each document the class that maximizes the probability of observing the words actually found in that document. The "naive" assumption is that the positions of words in a document do not matter and that word occurrences are independent of each other given the class (the independence assumption). Even though this assumption is not strictly correct, in practice it works surprisingly well.
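As an illustration only (not the exact implementation used in our experiments), the following Python sketch shows how such a classifier can be trained on user-marked documents and then applied: for a new document with words w1, ..., wn, it picks the class c maximizing P(c) * P(w1|c) * ... * P(wn|c). The toy documents, the tokenization into word lists, and the add-one (Laplace) smoothing are assumptions made for the example.

    import math
    from collections import Counter

    def train_naive_bayes(docs, labels):
        """docs: list of word lists; labels: 'relevant' or 'irrelevant' per document."""
        classes = set(labels)
        vocab = set(w for d in docs for w in d)
        priors, word_counts, totals = {}, {}, {}
        for c in classes:
            class_docs = [d for d, lab in zip(docs, labels) if lab == c]
            priors[c] = len(class_docs) / len(docs)          # P(class)
            counts = Counter(w for d in class_docs for w in d)
            word_counts[c] = counts                           # word frequencies per class
            totals[c] = sum(counts.values())                  # total words in the class
        return priors, word_counts, totals, vocab

    def classify(doc, priors, word_counts, totals, vocab):
        """Return the class maximizing P(class) * prod P(word | class)."""
        best_class, best_logp = None, float("-inf")
        for c in priors:
            # work in log space to avoid underflow; add-one smoothing for unseen words
            logp = math.log(priors[c])
            for w in doc:
                logp += math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
            if logp > best_logp:
                best_class, best_logp = c, logp
        return best_class

    # Toy usage: two training documents marked by the user
    docs = [["ovarian", "cyst", "ultrasound"], ["car", "sale", "cheap"]]
    labels = ["relevant", "irrelevant"]
    model = train_naive_bayes(docs, labels)
    print(classify(["ovarian", "ultrasound"], *model))        # -> 'relevant'

Because only word counts per class are stored, training and classification are linear in the number of words, which is what makes the approach practical for thousands of retrieved documents.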

Although the work described in the paper is still in progress, the preliminary experiments have given very good and promising results. A collection of 708 medical text documents from various sources on the Internet contained 110,386 words, of which 11,328 were distinct (including numbers such as years, ISBNs, etc.). Without excluding any words, the classification accuracy measured on a completely different set of medical documents was around 81%. After removing the common words that are unnecessary for classification (such as "of," "the," "and," "for," "in," "to," "a," "on"), together with words of very high or very low occurrence and numbers, the accuracy increased to 88%. At present, we are experimenting with further improving the classification accuracy by defining a set of the most important words from the particular user's point of view.
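The vocabulary pruning described above can be sketched, for example, as follows; the stop-word list and the frequency thresholds (here 3 and 1,000 occurrences) are illustrative assumptions rather than the exact values used in our experiments.

    from collections import Counter

    STOP_WORDS = {"of", "the", "and", "for", "in", "to", "a", "on"}   # common words

    def prune_vocabulary(docs, min_count=3, max_count=1000):
        """Drop stop words, numbers, and very rare or very frequent words."""
        counts = Counter(w for d in docs for w in d)
        keep = set()
        for word, count in counts.items():
            if word.lower() in STOP_WORDS:
                continue
            if word.isdigit():                    # years, ISBN numbers, etc.
                continue
            if count < min_count or count > max_count:
                continue
            keep.add(word)
        return keep

    def filter_docs(docs, keep):
        """Keep only the selected vocabulary in every document."""
        return [[w for w in d if w in keep] for d in docs]

The pruned documents are then passed to the same training and classification routines as before; shrinking the vocabulary in this way both reduces noise from uninformative words and speeds up classification.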
