Exploring modulated detection transformer as a tool for action recognition in videos

Bibliographic Details
Main authors: Crisol, Tomás; Ermantraut, Joel; Rostagno, Adrián; Aggio, Santiago L.; Iparraguirre, Javier
Format: Conference object
Language: English
Published: 2022
Subjects: Computer Science (Ciencias Informáticas); Multi-modal transformers; Action detection; Model generalization
Online access: http://sedici.unlp.edu.ar/handle/10915/151735
https://publicaciones.sadio.org.ar/index.php/JAIIO/article/download/388/326
Contributed by: SEDICI (UNLP), Universidad Nacional de La Plata
Publisher: Sociedad Argentina de Informática e Investigación Operativa (SADIO)
Issue date: 2022-10
ISSN: 2451-7496
Pages: 6-10
File format: PDF (application/pdf)
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In recent years, transformer architectures have grown in popularity. Modulated Detection Transformer (MDETR) is an end-to-end multi-modal understanding model that performs tasks such as phrase grounding, referring expression comprehension, referring expression segmentation, and visual question answering. A remarkable aspect of the model is its ability to make inferences about classes it was not trained on. In this work, we explore the use of MDETR for a new task, action detection, without training it for this task. We obtain quantitative results on the Atomic Visual Actions (AVA) dataset. Although the model does not achieve the best performance on this task, we believe the result is an interesting finding. We show that a multi-modal model can be used to tackle a task it was not designed for. Finally, we believe that this line of research may lead to the generalization of MDETR to additional downstream tasks.