A qualitative assessment of machine learning support for detecting data completeness and accuracy issues to improve data analytics in big data for the healthcare industry

Juddoo, Suraj and George, Carlisle ORCID: https://orcid.org/0000-0002-8600-6264 (2020) A qualitative assessment of machine learning support for detecting data completeness and accuracy issues to improve data analytics in big data for the healthcare industry. 2020 3rd International Conference on Emerging Trends in Electrical, Electronic and Communications Engineering (ELECOM). In: ELECOM 2020 - 3rd International Conference on Emerging Trends in Electrical, Electronic and Communications Engineering (ELECOM), 25-27 Nov 2020, Mauritius, Mauritius. e-ISBN 9781728157078, e-ISBN 9781728157061, pbk-ISBN 9781728157085. [Conference or Workshop Item] (doi:10.1109/ELECOM49001.2020.9297009)

PDF - Final accepted version (with author's formatting)

Abstract

Tackling data quality issues in Big Data can be challenging. Manual data cleansing is inefficient because the volume of data is potentially very large. This paper qualitatively assesses the possibilities for using machine learning to detect data incompleteness and inaccuracy, the two data quality dimensions found to be the most significant in a previous research study by the authors. A review of existing literature concludes that no single machine learning algorithm is best suited to deal with both incompleteness and inaccuracy of data.

Various algorithms were selected from existing studies and applied to a representative big (healthcare) dataset. The experiments also showed that implementing machine learning algorithms for Big Data quality activities encounters several challenges. These challenges relate to the amount of data that particular machine learning algorithms can scale to and to data type restrictions imposed by some algorithms. The study concludes that 1) data imputation works better with linear regression models, and 2) clustering models are more efficient at detecting outliers, but fully automated systems may not be realistic in this context. Therefore, a certain level of human judgement is still needed.
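
The two conclusions above can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not name the specific algorithms or tools used. The sketch assumes a single numeric predictor for regression-based imputation and a tiny one-dimensional k-means as the clustering model; all function names (`fit_line`, `impute_missing`, `kmeans_1d`, `flag_outliers`) and the `max_share` threshold are hypothetical choices made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x on complete pairs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def impute_missing(rows):
    """Fill missing y values (None) in (x, y) rows.

    The line is fitted only on rows where y is present, then used to
    predict y for the incomplete rows (completeness dimension).
    """
    complete = [(x, y) for x, y in rows if y is not None]
    a, b = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [(x, y if y is not None else a + b * x) for x, y in rows]


def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means (k >= 2); centroids start at evenly spaced sorted values."""
    s = sorted(values)
    centroids = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters


def flag_outliers(values, k=2, max_share=0.2):
    """Return values in sparsely populated clusters as outlier candidates.

    Assumption: a cluster holding at most max_share of the data is
    suspicious. The result is a list for human review, not a verdict
    (accuracy dimension).
    """
    clusters = kmeans_1d(values, k)
    limit = max_share * len(values)
    return sorted(v for c in clusters if len(c) <= limit for v in c)
```

Calling `impute_missing` on rows such as `(age, measurement)` pairs returns the same rows with each `None` replaced by the fitted prediction, and `flag_outliers` returns only candidates. Keeping the second step as a review list reflects the abstract's point that some human judgement is still needed.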

Item Type: Conference or Workshop Item (Paper)
Research Areas: A. > School of Science and Technology > Computer Science > Aspects of Law and Ethics Related to Technology group
Item ID: 31441
Notes on copyright: © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Depositing User: Carlisle George
Date Deposited: 27 Nov 2020 11:42
Last Modified: 09 Jun 2021 14:25
URI: https://eprints.mdx.ac.uk/id/eprint/31441
