I am a Research Fellow at Mayo Clinic (Department of AI & Informatics), where I work on Medical Image Analysis and machine learning.
Previously, I've worked in Video Understanding, Human Activity Recognition, Video-to-Video Summarization, Multimodal Action Recognition, and 3D Video Synthesis.
In this work, we introduce a fast patch selection method (FPS) for efficient selection of representative patches while preserving spatial distribution. HistoRotate, is a 360∘ rotation augmentation for training histopathology models, enhancing learning without compromising contextual information. PathDino, is a compact histopathology Transformer with five small vision transformer blocks and ≈9 million parameters.
This paper investigates the efficacy of the foundation models in the domain of histopathology by conducting a detailed comparison between these models, specifically CLIP derivatives (PLIP and BiomedCLIP), and traditional, domain-specific histology models that leverage well-curated datasets. Through a rigorous evaluation process on eight diverse datasets, including four internal from Mayo Clinic and four well-known public datasets (PANDA, BRACS, CAMELYON16, DigestPath). The findings show that domain-specific models, such as DinoSSLPath and KimiaNet, provide better performance across various metrics, underlining the significance of clean large datasets for histopathological analyses.
We propose SDM, a novel method for selecting diverse WSI patches, minimizing patch count while capturing all morphological variations. SDM outperforms the state-of-the-art, achieving high representativeness without needing parameter tuning.
We leveraged the power of 3D graphics and computer vision techniques to tackle a real-world problem, that we propose object-to-spot rotation estimation which is of particular significance for intelligent surveillance systems, bike-sharing systems, and smart cities. We introduced a rotation estimator (OSRE) that estimates a parked bike rotation with respect to its parking area.
We propose a new spatiotemporal attention scheme, termed synchronized spatiotemporal and spatial attention (SSTSA), which derives the spatiotemporal features with temporal and spatial multiheaded self-attention (MSA) modules.
Multimodal learning for video understanding (Text, Audio, RGB, Motion). We present a multimodal learning approach that leverage several modalities and several on-the-shelf models for both audio and language understanding. We proposed Irrelevant Modality Dropout (IMD) that drops the irrelevant audio from further processing while fusing the relevant audio-visual data for better video understanding.
This research study addresses the following question: To what extent can a fast independent adaptive algorithm select the most discriminative and representative frames to downsize huge video datasets while improving action recognition performance?
We propose variational Representation Learning for object Re-Identification. The proposed method has been evaluated on vehicle re-identification and person re-identification and face recognition.
For small objects detection like pedestrians in the outdoor surveillance, we propose a fast, lightweight, and auto-zooming-based framework for small pedestrian detection.