Comparative analysis of preprocessing tasks over social media texts in Spanish
One of the key aspects of texts coming from social media is that they tend to be very noisy. This is mainly because of the use of informal language and non-standard grammatical structures. Therefore, to use this content as input for a text analysis process, it is highly recommended to pre...
| Main authors: | Tessore, Juan Pablo; Esnaola, Leonardo Martín; Russo, Claudia Cecilia; Baldassarri, Sandra |
|---|---|
| Other authors: | |
| Format: | Conference document acceptedVersion |
| Language: | English |
| Published: | Association for Computing Machinery (ACM), 2021 |
| Subjects: | |
| Online access: | https://repositorio.unnoba.edu.ar/xmlui/handle/23601/143 |
| Contributed by: | |

| id | I103-R405-23601-143 |
|---|---|
| record_format | dspace |
| spelling | I103-R405-23601-143 2021-07-26T15:24:35Z Comparative analysis of preprocessing tasks over social media texts in Spanish Tessore, Juan Pablo Esnaola, Leonardo Martín Russo, Claudia Cecilia Baldassarri, Sandra 0000-0002-2111-0976 0000-0001-6298-9019 0000-0002-9315-6391 Text mining Text preprocessing Text classification, Sentiment Analysis One of the key aspects of texts coming from social media is that they tend to be very noisy. This is mainly because of the use of informal language and non-standard grammatical structures. Therefore, to use this content as input for a text analysis process, it is highly recommended to first clean the data and reduce its noise. This work focuses on measuring the effectiveness of diverse cleaning and repairing tasks on the data. The results obtained indicate that removing tokens with no letters and correcting stressed words are the most effective tasks. In addition, some tasks, such as hashtag or username processing, which perform very well on other datasets, are not as relevant for this one. This research is part of a broader project that aims to build an automatic emotion classifier that takes the preprocessed comments as input. Affiliation: Tessore, Juan Pablo. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina Affiliation: Tessore, Juan Pablo. Comisión de Investigaciones Científicas de la Provincia de Buenos Aires. Affiliation: Esnaola, Leonardo Martín. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina Affiliation: Russo, Claudia Cecilia. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina Affiliation: Baldassarri, Sandra. Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Aragon, Zaragoza, España Affiliation: Baldassarri, Sandra. Instituto de Investigación en Ingeniería (I3A), Universidad de Zaragoza, Zaragoza, Aragon, España Peer reviewed 2021-07-26T15:24:35Z 2021-07-26T15:24:35Z 2019-06-25 info:eu-repo/semantics/conferenceObject info:ar-repo/semantics/documento de conferencia info:eu-repo/semantics/acceptedVersion Juan Pablo Tessore, Leonardo Martín Esnaola, Claudia Cecilia Russo, and Sandra Baldassarri. 2019. Comparative analysis of preprocessing tasks over social media texts in Spanish. In Proceedings of the XX International Conference on Human Computer Interaction (Interacción '19). Association for Computing Machinery, New York, NY, USA, Article 27, 1–8. DOI:https://doi.org/10.1145/3335595.3335632 978-1-4503-7176-6/19/06 https://repositorio.unnoba.edu.ar/xmlui/handle/23601/143 eng info:eu-repo/grantAgreement/UNNOBA/SIB2019/EXP 536/2019/AR. Buenos Aires/Tecnología y Aplicaciones de Sistemas de Software: Calidad e Innovación en procesos, productos y servicios https://doi.org/10.1145/3335595.3335632 info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-nd/2.5/ar/ application/pdf application/pdf text/plain Association for Computing Machinery (ACM) Interacción 2019: XX International Conference on Human Computer Interaction |
| institution | Universidad Nacional del Noroeste de la Provincia de Buenos Aires |
| institution_str | I-103 |
| repository_str | R-405 |
| collection | Re DI Repositorio Digital UNNOBA |
| language | English |
| topic | Text mining Text preprocessing Text classification, Sentiment Analysis |
| spellingShingle | Text mining Text preprocessing Text classification, Sentiment Analysis Tessore, Juan Pablo Esnaola, Leonardo Martín Russo, Claudia Cecilia Baldassarri, Sandra Comparative analysis of preprocessing tasks over social media texts in Spanish |
| topic_facet | Text mining Text preprocessing Text classification, Sentiment Analysis |
| description | One of the key aspects of texts coming from social media is that they tend to be very noisy. This is mainly because of the use of informal language and non-standard grammatical structures. Therefore, to use this content as input for a text analysis process, it is highly recommended to first clean the data and reduce its noise. This work focuses on measuring the effectiveness of diverse cleaning and repairing tasks on the data. The results obtained indicate that removing tokens with no letters and correcting stressed words are the most effective tasks. In addition, some tasks, such as hashtag or username processing, which perform very well on other datasets, are not as relevant for this one. This research is part of a broader project that aims to build an automatic emotion classifier that takes the preprocessed comments as input. |
| author2 | 0000-0002-2111-0976 |
| author_facet | 0000-0002-2111-0976 Tessore, Juan Pablo Esnaola, Leonardo Martín Russo, Claudia Cecilia Baldassarri, Sandra |
| format | Conference document; Conference document acceptedVersion |
| author | Tessore, Juan Pablo Esnaola, Leonardo Martín Russo, Claudia Cecilia Baldassarri, Sandra |
| author_sort | Tessore, Juan Pablo |
| title | Comparative analysis of preprocessing tasks over social media texts in Spanish |
| title_short | Comparative analysis of preprocessing tasks over social media texts in Spanish |
| title_full | Comparative analysis of preprocessing tasks over social media texts in Spanish |
| title_fullStr | Comparative analysis of preprocessing tasks over social media texts in Spanish |
| title_full_unstemmed | Comparative analysis of preprocessing tasks over social media texts in Spanish |
| title_sort | comparative analysis of preprocessing tasks over social media texts in spanish |
| publisher | Association for Computing Machinery (ACM) |
| publishDate | 2021 |
| url | https://repositorio.unnoba.edu.ar/xmlui/handle/23601/143 |
| work_keys_str_mv | AT tessorejuanpablo comparativeanalysisofpreprocessingtasksoversocialmediatextsinspanish AT esnaolaleonardomartin comparativeanalysisofpreprocessingtasksoversocialmediatextsinspanish AT russoclaudiacecilia comparativeanalysisofpreprocessingtasksoversocialmediatextsinspanish AT baldassarrisandra comparativeanalysisofpreprocessingtasksoversocialmediatextsinspanish |
| _version_ | 1850060759467294720 |
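
The abstract names the removal of tokens with no letters as one of the most effective cleaning tasks. This record does not describe how the authors implemented it, so the following is only a minimal sketch of one plausible reading: split a comment on whitespace and drop every token that contains no alphabetic character. The function name `remove_tokens_without_letters` and the whitespace tokenization are assumptions made for illustration, not details taken from the paper.

```python
def remove_tokens_without_letters(text: str) -> str:
    """Drop whitespace-separated tokens that contain no letter.

    Hypothetical helper: the tokenization rule (plain whitespace split)
    is an assumption, not the method described by the authors.
    """
    tokens = text.split()
    # str.isalpha() is Unicode-aware, so accented Spanish letters
    # such as "é" or "ñ" count as letters.
    kept = [tok for tok in tokens if any(ch.isalpha() for ch in tok)]
    return " ".join(kept)


if __name__ == "__main__":
    print(remove_tokens_without_letters("Hola !!! 123 :) qué tal??"))
```

Under these assumptions, tokens made only of digits, punctuation, or emoticons are discarded, while a token that mixes letters with other characters (for example `tal??`) is kept unchanged.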