Privacy-Preserving Generation of Textual Healthcare Data
Advances in data science have given us affordable storage and efficient algorithms for querying large volumes of data. Health records are a significant part of this data; they are pivotal for healthcare providers and can be used to improve our well-being. Clinical notes in Electronic Health Records (EHRs) are one such category: free-text documents that record a patient's complete medical information at different timesteps of patient care. These unstructured notes thus cover events from a patient's admission to discharge, which can prove significant for future medical decisions. However, because these texts also contain sensitive information about the patient and the attending medical professionals, they cannot be shared publicly. This privacy issue has thwarted timely discoveries from this wealth of untapped information. Therefore, in this work, we aim to generate synthetic medical texts from a private or sanitized (de-identified) clinical text corpus and rigorously analyze their utility across different metrics and levels. Experimental results confirm the applicability of our generated data: it achieves more than 80% accuracy in various practical classification problems and matches (or outperforms) the original text data.