Extracting clinical information from chest x-ray reports: A case study for Russian language
Published in 2020 NIR Innopolis, 2020
In this paper, we analyze possible approaches to identifying diagnoses in Russian medical reports. First, we introduce the main problems in preprocessing raw Russian medical reports. Second, focusing on the embedding extraction method, we analyze several publicly available models and find that the BERT model is a promising instrument for this task. In a first attempt to build an NLP system for Russian medical report classification based on embedding extraction, we formulate the main weaknesses that limit the use of existing publicly available Russian NLP models in the medical-text domain. Having no labeled data available, we evaluate each model visually, analyzing the embeddings in a 2D representation obtained by dimensionality reduction with t-SNE. We assume that a good model will place reports that describe the same diagnosis close to each other and reports with distinct diagnoses far apart, forming clusters. Finally, we propose several directions for future research that we believe will improve on the results achieved in this field so far.
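The clustering assumption in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's procedure: the paper had no labels and judged t-SNE plots visually, whereas this toy example uses made-up token vectors and hypothetical group labels purely to show what "same diagnosis close, distinct diagnoses far" means numerically. Mean pooling of token vectors into a report embedding is also an assumption here, as are all function names.

```python
import math
from itertools import combinations


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size report embedding (assumed pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def separation(embeddings, groups):
    """Return (mean within-group distance, mean between-group distance)."""
    within, between = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        dist = cosine_distance(embeddings[i], embeddings[j])
        (within if groups[i] == groups[j] else between).append(dist)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical token vectors standing in for the output of a language model.
reports = [
    [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],  # toy report, diagnosis A
    [[0.8, 0.2, 0.0], [1.0, 0.1, 0.1]],  # toy report, diagnosis A
    [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],  # toy report, diagnosis B
    [[0.1, 0.2, 0.8], [0.0, 0.1, 1.0]],  # toy report, diagnosis B
]
groups = ["A", "A", "B", "B"]  # hypothetical labels, used only for illustration

embeddings = [mean_pool(report) for report in reports]
within, between = separation(embeddings, groups)
print(f"within={within:.3f} between={between:.3f}")
```

Under the abstract's assumption, a good embedding model would yield a within-group distance that is clearly smaller than the between-group distance. In the unlabeled setting described above, the same property would instead show up as visually distinct clusters in the 2D t-SNE projection.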