Statistical analysis of the performance of four Apache Spark ML Algorithms

Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, due to the increasing size of currently available databases, distributed processing has become a necessity for many tasks. In this co...


Bibliographic Details
Main authors: Camele, Genaro; Hasperué, Waldo; Ronchetti, Franco; Quiroga, Facundo Manuel
Format: Article
Language: English
Published: 2022
Subjects:
Online access: http://sedici.unlp.edu.ar/handle/10915/146934
Contributed by:
id I19-R120-10915-146934
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Ciencias Informáticas
Big Data
Machine Learning
Classification Models
Apache Spark
Spark ML
Wilcoxon Test
Student’s T Test
Big Data
Aprendizaje automático
Modelos de clasificación
Test de Wilcoxon
Test T-Student
description Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, due to the increasing size of currently available databases, distributed processing has become a necessity for many tasks. In this context, the Apache Spark ML library is one of the most widely used libraries for performing classification and other tasks with large datasets. Therefore, knowing both the predictive performance and the efficiency of its main algorithms before applying an FS technique is crucial to planning computations and saving time. In this work, a comparative study of four Spark ML classification algorithms is carried out, measuring execution times and predictive power as a function of the number of attributes drawn from a colon cancer database. The results were statistically analyzed and show that, although Random Forest and Naive Bayes have the shortest execution times, Support Vector Machine produces the models with the best predictive power. Studying the performance of these algorithms is of interest because they are applied to many different problems, such as classification of pathologies from epigenomic data, image classification, and prediction of computer attacks in network security.
format Article
author Camele, Genaro
Hasperué, Waldo
Ronchetti, Franco
Quiroga, Facundo Manuel
title Statistical analysis of the performance of four Apache Spark ML Algorithms
publishDate 2022
url http://sedici.unlp.edu.ar/handle/10915/146934
bdutipo_str Repositorios
_version_ 1764820460609994754