Privacy-Preserving Generation of Textual Healthcare Data
Abstract
Technological advancements in data science have offered us affordable storage and
efficient algorithms to query a large volume of data. Our health records are a significant
part of this data, which is pivotal for healthcare providers and can be used
to improve our well-being. Clinical notes in Electronic Health Records (EHRs) are one
such category: they record a patient's complete medical information at different
stages of patient care in the form of free text. Thus, these unstructured
textual notes contain events from a patient's admission to discharge, which can prove
to be significant for future medical decisions. However, since these texts also contain
sensitive information about the patient and the attending medical professionals, such
notes cannot be shared publicly. This privacy issue has hindered timely discoveries from
this wealth of untapped information. Therefore, in this work, we intend to generate
synthetic medical texts from a private or sanitized (de-identified) clinical text corpus
and rigorously analyze their utility across different metrics and levels. Experimental results
confirm the applicability of our generated data as it achieves more than 80%
accuracy in various practical classification problems and matches (or outperforms)
the original text data.