A Systematic Approach to Cleaning Routine Health Surveillance Datasets: An Illustration Using National Vector Borne Disease Control Programme Data of Punjab, India

Gurpreet Singh, Biju Soman, Arun Mitra
DOI: https://doi.org/10.48550/arXiv.2108.09963
2021-08-23
Abstract: Advances in ICT4D and data science facilitate systematic, reproducible, and scalable data cleaning for strengthening routine health information systems. A logic model for data cleaning was used; it included an algorithm for screening, diagnosing, and editing datasets in a rule-based, interactive, and semi-automated manner. A priori computational workflows and operational definitions were prepared. Model performance was illustrated using the dengue line-list of the National Vector Borne Disease Control Programme, Punjab, India from 01 January 2015 to 31 December 2019. Cleaning and imputation for an estimated date were successful for 96.1% and 98.9% of records for the years 2015 and 2016, respectively, and for all cases in the years 2017, 2018, and 2019. Information for age and sex was cleaned and extracted for more than 98.4% and 99.4% of records, respectively. Applying the logic model produced an analysis-ready dataset that can be used to understand spatiotemporal epidemiology and facilitate data-based public health decision making.
Computers and Society
What problem does this paper attempt to address?
This paper addresses how to clean routine health surveillance datasets in India systematically, reproducibly, and at scale, so that data quality improves and the data can support public health decision-making. Specifically, the researchers developed a logic model for cleaning the dengue line-list of the National Vector Borne Disease Control Programme (NVBDCP) in Punjab, India, covering cases from January 1, 2015 to December 31, 2019. With this model, they aimed to address several key challenges in the data cleaning process:
1. **Inconsistent data format**: In routine health information systems, dates and other important fields are often recorded in inconsistent formats, which complicates analysis. The researchers improved consistency and usability by developing algorithms that automatically identify date information in different formats and convert it to a single format.
2. **Missing and incorrect data**: The original data contain many missing values and errors, such as missing or incorrectly entered basic information like age and sex. Through automated and semi-automated cleaning processes, the researchers filled in most of the missing values and corrected the incorrect entries.
3. **Data standardization**: To make the data more useful for analysis and decision-making, the researchers standardized them, for example by converting address information into a standard format for geocoding and spatial analysis.
4. **Data privacy protection**: During cleaning, the researchers also took measures to protect personal privacy, such as removing sensitive information (names and contact details) and anonymizing the data using encryption algorithms.
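The date-format challenge in the first item can be illustrated with a minimal rule-based sketch. It assumes a hypothetical set of candidate formats and a hypothetical `parse_date` helper; the paper's actual rules and operational definitions are not reproduced here. Only the study window (1 January 2015 to 31 December 2019) comes from the source.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical candidate formats, tried in order; the paper's real rule set may differ.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y", "%Y-%m-%d", "%d/%m/%y")

# Study window stated in the source.
STUDY_START = date(2015, 1, 1)
STUDY_END = date(2019, 12, 31)


def parse_date(raw: str) -> Optional[date]:
    """Return the first in-window interpretation of `raw`, or None.

    A None result marks the record for manual review rather than guessing.
    """
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format does not match; try the next one
        if STUDY_START <= parsed <= STUDY_END:
            return parsed
    return None
```

Returning `None` instead of raising keeps the step semi-automated: records that no rule can resolve are set aside for a person to inspect, which mirrors the screening, diagnosis, and editing loop described in the abstract.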
Through these methods, the researchers converted the original data into a high-quality, analysis-ready dataset, providing a basis for understanding the spatiotemporal epidemiology of dengue and supporting data-based public health decision-making.
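The privacy step described above (dropping names and contact details and replacing them with an anonymized identifier) might look like the following sketch. It swaps in keyed hashing (HMAC-SHA256) as an assumed technique; the summary only says "encryption algorithms", so this is not necessarily the authors' method. The field names `name` and `contact`, the output field `case_id`, and both function names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable 64-character hex token using a secret key."""
    # Normalize case and whitespace so trivially different spellings collide.
    normalized = " ".join(identifier.lower().split())
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identifiers(record: dict, secret_key: bytes,
                      sensitive_fields=("name", "contact")) -> dict:
    """Return a copy of `record` without sensitive fields, plus a pseudonymous id."""
    cleaned = {k: v for k, v in record.items() if k not in sensitive_fields}
    # Build the token from the removed fields so repeat entries stay linkable.
    key_source = "|".join(str(record.get(f, "")) for f in sensitive_fields)
    cleaned["case_id"] = pseudonymize(key_source, secret_key)
    return cleaned
```

A keyed hash lets the same person be linked across records without storing the name, provided the key is kept separate from the released dataset.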