Statistical analysis of the performance of four Apache Spark ML Algorithms
Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, due to the increasing size of currently available databases, distributed processing has become a necessity for many tasks. In this co...
Main authors: | Camele, Genaro; Hasperué, Waldo; Ronchetti, Franco; Quiroga, Facundo Manuel |
---|---|
Format: | Article |
Language: | English |
Published: | 2022 |
Subjects: | |
Online access: | http://sedici.unlp.edu.ar/handle/10915/146934 |
Contributed by: |
id |
I19-R120-10915-146934 |
---|---|
record_format |
dspace |
institution |
Universidad Nacional de La Plata |
institution_str |
I-19 |
repository_str |
R-120 |
collection |
SEDICI (UNLP) |
language |
English |
topic |
Computer Science; Big Data; Machine Learning; Classification Models; Apache Spark; Spark ML; Wilcoxon Test; Student’s T Test |
description |
Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, as available databases keep growing, distributed processing has become a necessity for many tasks. In this context, the Apache Spark ML library is one of the most widely used libraries for classification and other tasks on large datasets. Knowing both the predictive performance and the efficiency of its main algorithms before applying an FS technique is therefore crucial for planning computations and saving time. In this work, a comparative study of four Spark ML classification algorithms is carried out, statistically measuring execution times and predictive power as a function of the number of attributes in a colon cancer database. The statistical analysis shows that, although Random Forest and Naive Bayes have the shortest execution times, Support Vector Machine produces the models with the best predictive power. The performance of these algorithms is of interest because they are applied to many different problems, such as the classification of pathologies from epigenomic data, image classification, and the prediction of computer attacks in network security. |
format |
Article |
author |
Camele, Genaro; Hasperué, Waldo; Ronchetti, Franco; Quiroga, Facundo Manuel |
title |
Statistical analysis of the performance of four Apache Spark ML Algorithms |
publishDate |
2022 |
url |
http://sedici.unlp.edu.ar/handle/10915/146934 |
_version_ |
1764820460609994754 |
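
The description above says execution times and predictive power were compared statistically, and the subject terms name Student's t test and the Wilcoxon test, but this record does not state how the tests were applied. The following is a minimal sketch, assuming paired measurements (the same runs timed for two algorithms), of how the two test statistics could be computed. The function names and the timing values are hypothetical placeholders invented for illustration; they are not the authors' code or results.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Student's t statistic for paired samples, plus its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


def wilcoxon_signed_rank(a, b):
    """Wilcoxon signed-rank statistic W = min(W+, W-); zero differences are dropped."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus), len(ordered)


# Placeholder timings in seconds, invented for illustration only.
times_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
times_b = [14.0, 13.6, 14.2, 13.9, 14.1, 13.8]

t_stat, dof = paired_t_statistic(times_a, times_b)
w_stat, n_used = wilcoxon_signed_rank(times_a, times_b)
print(f"paired t = {t_stat:.3f} (dof = {dof})")
print(f"Wilcoxon W = {w_stat} (n = {n_used})")
```

The sketch stops at the test statistics; turning them into p-values requires the t distribution and the signed-rank distribution, which a statistics library would normally supply. The t test assumes roughly normal differences, while the Wilcoxon test only uses their ranks, which is one plausible reason a study would report both.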