Comparative analysis of preprocessing tasks over social media texts in Spanish

One of the key aspects of texts coming from social media is that they tend to be very noisy. This is mainly because of the use of informal language and non-standard grammatical structures. So, in order to use these contents as input for a text analysis process, it is highly recommended to pre...

Bibliographic Details
Main Authors: Tessore, Juan Pablo; Esnaola, Leonardo Martín; Russo, Claudia Cecilia; Baldassarri, Sandra
Other Authors: 0000-0002-2111-0976
Format: Conference paper (acceptedVersion)
Language: English
Published: Association for Computing Machinery (ACM) 2021
Subjects: Text mining; Text preprocessing; Text classification; Sentiment Analysis
Online Access: https://repositorio.unnoba.edu.ar/xmlui/handle/23601/143
id I103-R405-23601-143
record_format dspace
spelling I103-R405-23601-143 2021-07-26T15:24:35Z
Comparative analysis of preprocessing tasks over social media texts in Spanish
Tessore, Juan Pablo
Esnaola, Leonardo Martín
Russo, Claudia Cecilia
Baldassarri, Sandra
0000-0002-2111-0976
0000-0001-6298-9019
0000-0002-9315-6391
Text mining
Text preprocessing
Text classification
Sentiment Analysis
Affiliation: Tessore, Juan Pablo. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina
Affiliation: Tessore, Juan Pablo. Comisión de Investigaciones Científicas de la Provincia de Buenos Aires.
Affiliation: Esnaola, Leonardo Martín. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina
Affiliation: Russo, Claudia Cecilia. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina
Affiliation: Baldassarri, Sandra. Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Aragon, Zaragoza, España
Affiliation: Baldassarri, Sandra. Instituto de Investigación en Ingeniería (I3A), Universidad de Zaragoza, Zaragoza, Aragon, España
Peer reviewed
2021-07-26T15:24:35Z
2019-06-25
info:eu-repo/semantics/conferenceObject
info:ar-repo/semantics/documento de conferencia
info:eu-repo/semantics/acceptedVersion
Juan Pablo Tessore, Leonardo Martín Esnaola, Claudia Cecilia Russo, and Sandra Baldassarri. 2019. Comparative analysis of preprocessing tasks over social media texts in Spanish. In Proceedings of the XX International Conference on Human Computer Interaction (Interacción '19). Association for Computing Machinery, New York, NY, USA, Article 27, 1–8. DOI: https://doi.org/10.1145/3335595.3335632
978-1-4503-7176-6/19/06
https://repositorio.unnoba.edu.ar/xmlui/handle/23601/143
eng
info:eu-repo/grantAgreement/UNNOBA/SIB2019/EXP 536/2019/AR. Buenos Aires/Tecnología y Aplicaciones de Sistemas de Software: Calidad e Innovación en procesos, productos y servicios
https://doi.org/10.1145/3335595.3335632
info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
application/pdf
text/plain
Association for Computing Machinery (ACM)
Interacción 2019: XX International Conference on Human Computer Interaction
institution Universidad Nacional del Noroeste de la Provincia de Buenos Aires
institution_str I-103
repository_str R-405
collection Re DI Repositorio Digital UNNOBA
language English
topic Text mining
Text preprocessing
Text classification
Sentiment Analysis
description Texts coming from social media tend to be very noisy, mainly because of the use of informal language and non-standard grammatical structures. Before such content is used as input for a text analysis process, it is therefore highly recommended to clean the data and reduce its noise. This work measures the effectiveness of diverse cleaning and repairing tasks on the data. The results indicate that removing tokens that contain no letters and correcting stressed words are the most effective tasks. In addition, some tasks, such as hashtag or username processing, which perform very well on other datasets, are not that relevant for this one. This research is part of a broader project that aims to build an automatic emotion classifier that takes the preprocessed comments as input.
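The cleaning tasks named in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation, which this record does not describe. The function names, the `USUARIO` placeholder, and the tiny `ACCENT_FIXES` lookup table are all hypothetical and chosen only for illustration, and "stressed words correction" is assumed here to mean restoring missing accent marks on Spanish words.

```python
import re

# Hypothetical lookup of unaccented spellings and their accented forms.
# A real system would need a full dictionary; these entries are illustrative.
ACCENT_FIXES = {
    "tambien": "también",
    "facil": "fácil",
    "cancion": "canción",
}


def replace_mentions_and_hashtags(text):
    """Replace @usernames with a placeholder and strip '#' from hashtags."""
    text = re.sub(r"@\w+", "USUARIO", text)  # placeholder name is an assumption
    text = re.sub(r"#(\w+)", r"\1", text)    # keep the hashtag word itself
    return text


def drop_tokens_without_letters(tokens):
    """Remove tokens that contain no alphabetic character (numbers, emoticons)."""
    return [t for t in tokens if any(ch.isalpha() for ch in t)]


def restore_accents(tokens, fixes=ACCENT_FIXES):
    """Swap known unaccented spellings for their accented forms.

    Matching is case-insensitive; a replaced token comes back in lower case.
    """
    return [fixes.get(t.lower(), t) for t in tokens]


def preprocess(text):
    """Apply the three illustrative tasks in sequence to one comment."""
    text = replace_mentions_and_hashtags(text)
    tokens = text.split()  # naive whitespace tokenization
    tokens = drop_tokens_without_letters(tokens)
    return restore_accents(tokens)


if __name__ == "__main__":
    print(preprocess("@juan esta cancion es facil !!! 100% #genial :)"))
```

Each task is a separate function so that it can be switched on or off independently, which is the kind of setup a comparison of individual preprocessing tasks would need.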
format Conference paper
acceptedVersion
author Tessore, Juan Pablo
Esnaola, Leonardo Martín
Russo, Claudia Cecilia
Baldassarri, Sandra
author_sort Tessore, Juan Pablo
title Comparative analysis of preprocessing tasks over social media texts in Spanish
publisher Association for Computing Machinery (ACM)
publishDate 2021
url https://repositorio.unnoba.edu.ar/xmlui/handle/23601/143
_version_ 1850060759467294720