DOI

https://doi.org/10.25772/3K9C-3356

Author ORCID Identifier

0000-0002-8431-8483

Defense Date

2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Computer Science

First Advisor

Dr. Bartosz Krawczyk

Abstract

With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when with multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the number of examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results as learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or parallelism, and so their effectiveness has been hindered.

This dissertation addresses some of these challenges with in-depth experimentation using novel implementations of machine learning algorithms using Apache Spark, a distributed computing framework based on the MapReduce model designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary class datasets, have been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study on how instance level difficulty affects the learning from large datasets was also performed.

Rights

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Date of Submission

12-1-2021

Download

Included in

Artificial Intelligence and Robotics Commons, Data Science Commons, Theory and Algorithms Commons

COinS

Theses and Dissertations

Learning from Multi-Class Imbalanced Big Data with Apache Spark

DOI

Author ORCID Identifier

Defense Date

Document Type

Degree Name

Department

First Advisor

Abstract

Rights

Is Part Of

Is Part Of

Date of Submission

Included in

Browse

Search

Author Corner

Links

Theses and Dissertations

Learning from Multi-Class Imbalanced Big Data with Apache Spark

Author

DOI

Author ORCID Identifier

Defense Date

Document Type

Degree Name

Department

First Advisor

Abstract

Rights

Is Part Of

Is Part Of

Date of Submission

Included in

Share

Browse

Search

Author Corner

Links