Abhay Kumar

I graduated with an M.S. in Computer Sciences from the University of Wisconsin-Madison (UW-Madison). My research interests lie primarily in Deep Learning and its applications to Computer Vision, Natural Language Processing, and recommendation systems. Over the years, I have built a strong academic background, relevant work experience, and research aptitude (12+ publications, 350+ citations, h-index: 8, i10-index: 5).

Previously, I worked at Samsung R&D Institute India, Bangalore on Bixby, an artificial-intelligence-based smart assistant, leveraging deep learning technologies. I was awarded the Samsung Citizen Award in the Technology Excellence category for outstanding contribution in 2017-18. For my contributions to research, publications, and patents, I was again awarded the Samsung Citizen Award, this time in the "Innovator" category, in 2019.

Prior to that, I completed my undergraduate degree at the Indian Institute of Technology Kanpur (IIT Kanpur) with a major in Electrical Engineering. I was a recipient of the Academic Excellence Award for outstanding academic performance in two consecutive years, 2013-14 and 2014-15.

Research Interests: Multimodal (Image, Video, Speech) Signal Processing, Computer Vision, NLP, Machine Learning, Deep Learning & Optimization

Google Scholar  /  LinkedIn  /  Wisc email  /  Gmail  /  Social media links

Citations according to Google Scholar: 12+ publications, 350+ citations, h-index: 8, i10-index: 5.
Updates
Competitions

Microsoft AI Challenge India 2018
Phase-1 Rank: 2nd | Phase-2 Rank: 6th (over 2,000 teams participated)

Problem Statement: "Given a user query and candidate passages corresponding to each, the task is to mark the most relevant passage which contains the answer to the user query. As search engines evolve to respond to speech inputs and as usage of ambient devices like speakers grow in the society etc. returning 10 blue links to a search query is not always desirable. At Bing.com, our aim is to serve answer to questions directly without users having to search through the 10 blue links."

leaderboard (22/01/2018) | certificate | problem statement | live leaderboard

Publications

Speech Emotion Recognition Using Spectrogram & Phoneme Embedding
INTERSPEECH 2018

This paper proposes a speech emotion recognition method based on phoneme sequences and spectrograms. Both phoneme sequences and spectrograms retain emotional content of speech that is lost when the speech is converted to text. We performed various experiments with different kinds of deep neural networks taking phonemes and spectrograms as inputs. Three of those network architectures are presented here; they achieved better accuracy than state-of-the-art methods on a benchmark dataset. A combined phoneme-and-spectrogram CNN model proved to be the most accurate at recognizing emotions on IEMOCAP data. We achieved more than a 4% increase in overall accuracy and average class accuracy compared to existing state-of-the-art methods.
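
A minimal PyTorch sketch of the two-branch idea, under stated assumptions: the layer sizes, phoneme vocabulary size, and late-fusion strategy below are illustrative placeholders, not the published architecture.

import torch
import torch.nn as nn

class SpectrogramPhonemeNet(nn.Module):
    """Toy two-branch model: a 2-D CNN over spectrograms plus a 1-D CNN over
    embedded phoneme sequences, fused before the emotion classifier."""
    def __init__(self, n_phonemes=50, n_emotions=4):
        super().__init__()
        # Spectrogram branch: input shape (batch, 1, freq, time)
        self.spec_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Phoneme branch: embedded phoneme IDs -> 1-D convolution over time
        self.phone_emb = nn.Embedding(n_phonemes, 64)
        self.phone_branch = nn.Sequential(
            nn.Conv1d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.classifier = nn.Linear(32 + 32, n_emotions)

    def forward(self, spec, phonemes):
        s = self.spec_branch(spec).flatten(1)                                    # (batch, 32)
        p = self.phone_branch(self.phone_emb(phonemes).transpose(1, 2)).flatten(1)
        return self.classifier(torch.cat([s, p], dim=1))                         # emotion logits

# Example: 8 spectrograms (128 mel bins x 200 frames) and 8 phoneme sequences of length 120
model = SpectrogramPhonemeNet()
logits = model(torch.randn(8, 1, 128, 200), torch.randint(0, 50, (8, 120)))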

paper link | pdf


Learning Discriminative features using Center Loss and Reconstruction as Regularizer for Speech Emotion Recognition
In IJCAI Workshop on Artificial Intelligence in Affective Computing [ACCEPTED]

This paper proposes a Convolutional Neural Network (CNN) inspired by Multitask Learning (MTL) and based on speech features, trained under the joint supervision of softmax loss and center loss, a powerful metric learning strategy, for the recognition of emotion in speech. Speech features such as spectrograms and Mel-frequency Cepstral Coefficients (MFCCs) help retain emotion-related low-level characteristics in speech. We experimented with several Deep Neural Network (DNN) architectures that take speech features as input and trained them under both softmax and center loss, which resulted in highly discriminative features ideal for Speech Emotion Recognition (SER). Our networks also employ a regularizing effect by simultaneously performing the auxiliary task of reconstructing the input speech features. This sharing of representations among related tasks enables our network to better generalize on the original task of SER. Some of our proposed networks contain far fewer parameters than state-of-the-art architectures.
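
A minimal sketch of the joint supervision described above (softmax loss + center loss + reconstruction as an auxiliary task); the encoder, decoder, feature dimensions, and loss weights are illustrative assumptions, not the paper's exact networks.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Center loss: pulls each sample's embedding toward its class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

# Hypothetical encoder/decoder/classifier over flattened spectrogram/MFCC patches.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(128 * 100, 256), nn.ReLU(), nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 128 * 100))
classifier = nn.Linear(64, 4)
center_loss = CenterLoss(num_classes=4, feat_dim=64)

x = torch.randn(8, 1, 128, 100)                          # speech feature patches
y = torch.randint(0, 4, (8,))                            # emotion labels
feats = encoder(x)
loss = (F.cross_entropy(classifier(feats), y)            # softmax loss
        + 0.01 * center_loss(feats, y)                   # metric-learning term
        + 0.1 * F.mse_loss(decoder(feats), x.flatten(1)))  # reconstruction regularizer
loss.backward()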

paper link


Hybrid Maximum Depth-kNN Method for Real-Time Node Tracking using Multi-Sensor Data
IEEE International Conference on Communications (ICC) 2015, London, UK

In this paper, a hybrid MD-kNN method for real-time sensor node tracking is proposed. The method combines two individual location hypothesis functions obtained from generalized maximum depth and generalized kNN methods. The individual location hypothesis functions are themselves obtained from multiple sensors measuring visible light, humidity, temperature, acoustics, and link quality. The hybrid MD-kNN method therefore combines the lower computational cost of the maximum depth method with the outlier rejection ability of the kNN method to realize a robust real-time localization method. Additionally, the method does not require assuming an underlying distribution under non-line-of-sight (NLOS) conditions. A further novelty is the utilization of multivariate data obtained from multiple sensors, which has hitherto not been used. The affine invariance property of the hybrid MD-kNN method is proved and its robustness is illustrated in the context of node localization. Experimental results on the Intel Berkeley research dataset indicate reasonable improvements over conventional methods available in the literature.
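
A minimal NumPy sketch of the hybrid idea: score each candidate grid point by combining a Mahalanobis-depth score with a kNN affinity computed against that grid point's offline observations. The specific hypothesis functions and the combination rule here are illustrative simplifications, not the paper's exact formulation.

import numpy as np

def mahalanobis_depth(x, samples):
    """Mahalanobis data depth of x within a sample cloud: 1 / (1 + d_M^2)."""
    mu = samples.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(samples, rowvar=False))
    d2 = (x - mu) @ cov_inv @ (x - mu)
    return 1.0 / (1.0 + d2)

def knn_score(x, samples, k=5):
    """kNN affinity: inverse of the mean distance to the k nearest offline samples."""
    dists = np.sort(np.linalg.norm(samples - x, axis=1))[:k]
    return 1.0 / (1.0 + dists.mean())

def hybrid_localize(x, grid_data):
    """Return the grid point with the highest combined depth * kNN score."""
    scores = {g: mahalanobis_depth(x, s) * knn_score(x, s) for g, s in grid_data.items()}
    return max(scores, key=scores.get)

# Toy example: 3 grid points, each with 50 offline 5-sensor observation vectors
# (visible light, humidity, temperature, acoustics, link quality).
rng = np.random.default_rng(0)
grid_data = {(i, 0): rng.normal(loc=i, size=(50, 5)) for i in range(3)}
print(hybrid_localize(rng.normal(loc=1, size=5), grid_data))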

paper link | pdf | presentation


Emoception: An Inception Inspired Efficient Speech Emotion Recognition Network
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore

This research proposes a Deep Neural Network architecture for Speech Emotion Recognition called Emoception, which takes inspiration from Inception modules. The network takes speech features like Mel-Frequency Spectral Coefficients (MFSC) or Mel-Frequency Cepstral Coefficients (MFCC) as input and recognizes the relevant emotion in the speech. We use the USC-IEMOCAP dataset for training, but the limited amount of training data and the large depth of the network make it prone to overfitting, reducing validation accuracy. The Emoception network overcomes this problem by extending in width without an increase in computational cost. We also employ a powerful regularization technique, Multi-Task Learning (MTL), to make the network robust. The model using MFSC input with MTL increases accuracy by 1.6% vis-à-vis Emoception without MTL. We report an overall accuracy improvement of around 4.6% compared to existing state-of-the-art methods for four emotion classes on the IEMOCAP dataset.
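
A minimal sketch of an Inception-style block of the kind Emoception builds on: parallel convolution paths with different receptive fields are concatenated channel-wise, widening the network instead of deepening it. The branch filter counts here are illustrative, not Emoception's actual configuration.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolution paths plus a pooling path,
    concatenated along the channel dimension."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(),
                                nn.Conv2d(16, 24, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 8, kernel_size=1), nn.ReLU(),
                                nn.Conv2d(8, 8, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 8, kernel_size=1))

    def forward(self, x):
        return torch.relu(torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1))

block = InceptionBlock(in_ch=32)
out = block(torch.randn(4, 32, 64, 64))   # -> (4, 56, 64, 64): width grows, spatial size kept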

paper link | pdf


Bidirectional Transformer based Multi-Task Learning for Natural Language Understanding
24th International Conference on Applications of Natural Language to Information Systems, Salford, United Kingdom

We propose a multi-task learning-based framework for natural language understanding tasks such as sentiment and topic classification. We use a bidirectional transformer based architecture to generate encoded representations from the given input, followed by task-specific layers for classification. The Multi-Task Learning (MTL) framework uses a different set of tasks in parallel, as a kind of additional regularization, to improve the generalizability of the trained model over individual tasks. We introduced a task-specific auxiliary problem using the k-means clustering algorithm, trained in parallel with the main tasks, to reduce the model's generalization error on the main task. POS tagging was also used as one of the auxiliary tasks. We also trained multiple benchmark classification datasets in parallel to improve the effectiveness of our bidirectional transformer based network across all the datasets. Our proposed MTL-based transformer network improved the state-of-the-art overall accuracy on the Movie Review (MR), AG News, and Stanford Sentiment Treebank (SST-2) corpora by 6%, 1.4%, and 3.3% respectively.
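
A minimal sketch of the multi-task setup, assuming a generic bidirectional transformer encoder (no causal mask) shared across tasks with one task-specific head each; the vocabulary size, model width, pooling, and task list are illustrative placeholders.

import torch
import torch.nn as nn

class MultiTaskTextModel(nn.Module):
    """Shared bidirectional transformer encoder with one linear head per task."""
    def __init__(self, vocab_size=30000, d_model=256, task_classes=(2, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)   # no causal mask = bidirectional
        self.heads = nn.ModuleList([nn.Linear(d_model, c) for c in task_classes])

    def forward(self, token_ids, task_id):
        h = self.encoder(self.embed(token_ids))      # (batch, seq, d_model)
        pooled = h.mean(dim=1)                       # simple mean pooling
        return self.heads[task_id](pooled)           # logits for the selected task

# Alternating batches from different tasks lets the shared encoder act as a regularizer.
model = MultiTaskTextModel()
sentiment_logits = model(torch.randint(0, 30000, (8, 64)), task_id=0)
topic_logits = model(torch.randint(0, 30000, (8, 64)), task_id=1)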

paper link


Deceptive Reviews Detection using Deep Learning Techniques
24th International Conference on Applications of Natural Language to Information Systems, Salford, United Kingdom

With the increasing influence of online reviews in shaping customer decision-making and purchasing behavior, many unscrupulous businesses have a vested interest in generating and posting deceptive reviews. Deceptive reviews are fictitious reviews written deliberately to sound authentic and deceive consumers. Traditional deceptive review detection methods are based on various handcrafted features, including linguistic and psychological ones, which characterize deceptive reviews. However, the proposed deep learning methods have better self-adaptability to extract the desired features implicitly and outperform all traditional methods. We have proposed multiple Deep Neural Network (DNN) based approaches for deceptive review detection and have compared the performance of these models on multiple benchmark datasets. Additionally, we have identified a common problem of handling the variable lengths of these reviews. We propose two different methods, Multi-Instance Learning and a hierarchical architecture, to handle variable-length review texts. Experimental results on multiple benchmark datasets of deceptive reviews outperform the existing state-of-the-art. We also evaluated the performance of the proposed method on another review-related task, review sentiment detection, and achieved state-of-the-art accuracies on two benchmark datasets.
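
A minimal sketch of one way to handle variable-length reviews hierarchically, as described above: a word-level encoder summarizes each sentence and a sentence-level encoder summarizes the review. The GRU layers, sizes, and mean pooling are illustrative, not the paper's exact architectures.

import torch
import torch.nn as nn

class HierarchicalReviewEncoder(nn.Module):
    """Toy hierarchical model: word-level GRU per sentence, sentence-level GRU per review."""
    def __init__(self, vocab_size=20000, emb=100, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.word_rnn = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.sent_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)            # deceptive vs. genuine

    def forward(self, tokens):                         # tokens: (batch, n_sents, n_words)
        b, s, w = tokens.shape
        word_out, _ = self.word_rnn(self.embed(tokens.view(b * s, w)))
        sent_vecs = word_out.mean(dim=1).view(b, s, -1)   # one vector per sentence
        doc_out, _ = self.sent_rnn(sent_vecs)
        return self.out(doc_out.mean(dim=1))               # review-level logits

model = HierarchicalReviewEncoder()
logits = model(torch.randint(0, 20000, (4, 6, 20)))    # 4 reviews, 6 sentences, 20 words each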

paper link


MTCNet: Multi-Task Learning Paradigm for Crowd Count Estimation
16th IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), Taipei, Taiwan [ACCEPTED] [update: WITHDRAWN]

We propose a Multi-Task Learning (MTL) paradigm based deep neural network architecture, called MTCNet (Multi-Task Crowd Network), for crowd density and count estimation. Crowd count estimation is challenging due to non-uniform scale variations and the arbitrary perspective of each individual image. The proposed model has two related tasks, with crowd density estimation as the main task and crowd-count group classification as the auxiliary task. The auxiliary task helps capture the relevant scale-related information to improve the performance of the main task. The main task model comprises two blocks: a VGG-16 front end for feature extraction and a dilated Convolutional Neural Network for density map generation. The auxiliary task model shares the same front end as the main task, followed by a CNN classifier. Our proposed network achieves 5.8% and 14.9% lower Mean Absolute Error (MAE) than state-of-the-art methods on the ShanghaiTech dataset without using any data augmentation. Our model also outperforms with 10.5% lower MAE on the UCF_CC_50 dataset.
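
A minimal sketch of the two-headed layout described above: a shared VGG-16 front end, a dilated-convolution head for the density map (main task), and a count-group classifier head (auxiliary task). Layer counts, dilation rates, and the number of groups are illustrative, not MTCNet's exact configuration.

import torch
import torch.nn as nn
from torchvision import models

class CrowdMTLNet(nn.Module):
    """Shared VGG-16 front end with a density-map head and an auxiliary group classifier."""
    def __init__(self, n_groups=4):
        super().__init__()
        vgg = models.vgg16()                           # ImageNet weights would be loaded in practice
        self.frontend = vgg.features[:23]              # up to conv4_3: 512 channels, stride 8
        self.density_head = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(256, 128, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(128, 1, 1))                      # density map
        self.group_head = nn.Sequential(               # auxiliary crowd-count group classifier
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, n_groups))

    def forward(self, x):
        feats = self.frontend(x)
        return self.density_head(feats), self.group_head(feats)

model = CrowdMTLNet()
density, group_logits = model(torch.randn(2, 3, 384, 384))
count = density.sum(dim=(1, 2, 3))                     # crowd count = integral of the density map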

paper link


Exploiting SIFT Descriptor for Rotation Invariant Convolutional Neural Network
15th IEEE India Council International Conference (INDICON 2018)

This paper presents a novel approach to exploit distinctive invariant features in a convolutional neural network. The proposed CNN model uses a Scale Invariant Feature Transform (SIFT) descriptor instead of the max-pooling layer. The max-pooling layer discards pose, i.e., the translational and rotational relationships between low-level features, and is hence unable to capture the spatial hierarchies between low- and high-level features. The SIFT descriptor layer captures the orientation and the spatial relationship of the features extracted by the convolutional layer. The proposed SIFT Descriptor CNN therefore combines the feature extraction capabilities of the CNN model with the rotation invariance of the SIFT descriptor. Experimental results on the MNIST and Fashion-MNIST datasets indicate reasonable improvements over conventional methods available in the literature.

arXiv link | IEEE link | presentation | certificate


Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition
20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France [IN PRESS]

This paper proposes a Residual Convolutional Neural Network (ResNet) based on speech features and trained under Focal Loss to recognize emotion in speech. Speech features such as spectrograms and Mel-frequency Cepstral Coefficients (MFCCs) have shown the ability to characterize emotion better than plain text alone. Focal Loss, first used in one-stage object detectors, focuses the training process on hard examples and down-weights the loss assigned to well-classified examples, thus preventing the model from being overwhelmed by easily classifiable examples. After experimenting with several Deep Neural Network (DNN) architectures, we propose a ResNet that takes a spectrogram or MFCCs as input and is supervised by Focal Loss, which is well suited to speech inputs with a large class imbalance. Maintaining continuity with previous work in this area, we use the Improvised Topics of the University of Southern California's Interactive Emotional Motion Capture (USC-IEMOCAP) database, which is ideal for this work because a significant class imbalance exists among the various emotions. Our best model achieved a 3.4% improvement in overall accuracy and a 2.8% improvement in class accuracy when compared to existing state-of-the-art methods.
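
A minimal sketch of the focal loss used for supervision (the class count and logits below are placeholders); with gamma = 0 it reduces to ordinary cross-entropy.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: down-weights well-classified examples so training
    focuses on hard ones; gamma controls the strength of the effect."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log-prob of the true class
    pt = log_pt.exp()
    return (-((1.0 - pt) ** gamma) * log_pt).mean()

logits = torch.randn(8, 4, requires_grad=True)     # e.g. 4 emotion classes
targets = torch.randint(0, 4, (8,))
loss = focal_loss(logits, targets)
loss.backward()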

paper link | presentation | poster


From Fully Supervised to Zero Shot Settings for Twitter Hashtag Recommendation
20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France [IN PRESS]

We propose a comprehensive end-to-end pipeline for a Twitter hashtag recommendation system, including data collection, a supervised training setting, and a zero-shot training setting. In the supervised training setting, we propose and compare the performance of various deep learning architectures, namely Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Transformer networks. However, it is not feasible to collect data for all possible hashtag labels and train a classifier model on them. To overcome this limitation, we propose a Zero Shot Learning (ZSL) paradigm for predicting unseen hashtag labels by learning the relationship between the semantic space of tweets and the embedding space of hashtag labels. We evaluated various state-of-the-art ZSL methods, namely Convex combination of Semantic Embeddings (ConSE), Embarrassingly Simple Zero Shot Learning (ESZSL), and the Deep Embedding Model for Zero Shot Learning (DEM-ZSL), for the hashtag recommendation task. We demonstrate the effectiveness and scalability of ZSL methods for recommending unseen hashtags. To the best of our knowledge, this is the first quantitative evaluation of ZSL methods for unseen hashtag recommendation from tweet text.
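
A minimal sketch of ConSE-style zero-shot prediction, assuming a classifier already trained over seen hashtags and pre-computed label embeddings for both seen and unseen hashtags; the array shapes and cosine ranking below are illustrative.

import numpy as np

def conse_predict(class_probs, seen_emb, unseen_emb, top_t=5):
    """Project a tweet into label-embedding space as a probability-weighted convex
    combination of seen-hashtag embeddings, then rank unseen hashtags by cosine similarity."""
    top = np.argsort(class_probs)[::-1][:top_t]
    weights = class_probs[top] / class_probs[top].sum()
    query = (weights[:, None] * seen_emb[top]).sum(axis=0)           # point in semantic space
    query /= np.linalg.norm(query)
    sims = unseen_emb @ query / np.linalg.norm(unseen_emb, axis=1)   # cosine similarities
    return np.argsort(sims)[::-1]                                     # ranked unseen hashtags

# Toy example: 10 seen hashtags, 50 unseen hashtags, 300-d label embeddings.
probs = np.random.dirichlet(np.ones(10))        # softmax output of a seen-hashtag classifier
seen = np.random.randn(10, 300)
unseen = np.random.randn(50, 300)
ranking = conse_predict(probs, seen, unseen)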

paper link | presentation | certificate | poster


Deep Learning Based Emotion Recognition System Using Speech Features and Transcriptions
20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France [IN PRESS]

This paper proposes a speech emotion recognition method based on speech features and speech transcriptions (text). Speech features such as Spectrogram and Mel-frequency Cepstral Coefficients (MFCC) help retain emotion-related low-level characteristics in speech whereas text helps capture semantic meaning, both of which help in different aspects of emotion detection. We experimented with several Deep Neural Network (DNN) architectures, which take in different combinations of speech features and text as inputs. The proposed network architectures achieve higher accuracies when compared to state-of-the-art methods on a benchmark dataset. The combined MFCC-Text Convolutional Neural Network (CNN) model proved to be the most accurate in recognizing emotions in IEMOCAP data. We achieved an almost 7% increase in overall accuracy as well as an improvement of 5.6% in average class accuracy when compared to existing state-of-the-art methods.

paper link | presentation | certificate | poster


Visual Context-aware Convolution Filters for Transformation-invariant Neural Network

We propose a novel visual context-aware filter generation module which incorporates contextual information present in images into Convolutional Neural Networks (CNNs). In contrast to traditional CNNs, we do not employ the same set of learned convolution filters for all input image instances. Our proposed input-conditioned convolution filters, when combined with techniques inspired by multi-instance learning and max-pooling, result in a transformation-invariant neural network. We investigated the performance of our proposed framework on three MNIST variations, covering both rotation and scaling variance, and achieved 1.13% error on MNIST-rot-12k, 1.12% error on Half-rotated MNIST, and 0.68% error on Scaling MNIST, which is significantly better than the state-of-the-art results. We use visualization to further demonstrate the effectiveness of our visual context-aware convolution filters. The proposed filter generation framework can also serve as a plugin for any CNN based architecture and enhance its modeling capacity.
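
A minimal sketch of input-conditioned (context-aware) filter generation: a small network predicts the convolution filters from each input image, so different images are filtered with different kernels. The pooling size, channel counts, and per-sample loop are illustrative simplifications of the proposed module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareConv(nn.Module):
    """Convolution whose filters are generated per input by a filter-generation network,
    instead of being shared across all input instances."""
    def __init__(self, in_ch=1, out_ch=8, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.filter_gen = nn.Sequential(               # summarizes the image context
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(in_ch * 8 * 8, out_ch * in_ch * k * k))

    def forward(self, x):
        filters = self.filter_gen(x).view(-1, self.out_ch, self.in_ch, self.k, self.k)
        # Apply each sample's own filters (a loop keeps the sketch readable).
        outs = [F.conv2d(xi.unsqueeze(0), wi, padding=self.k // 2)
                for xi, wi in zip(x, filters)]
        return torch.cat(outs, dim=0)

layer = ContextAwareConv()
y = layer(torch.randn(4, 1, 28, 28))   # -> (4, 8, 28, 28)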

paper link

Undergraduate Research Projects

Classification of objects from the stream of Surveillance videos
Supervisor: Dr. Harish Karnick, Indian Institute of Technology Kanpur

The project aimed at building a system for detecting and classifying objects in a video stream into three classes: Pedestrian, Two-Wheeler, and Four-Wheeler.

• Implemented various state-of-the-art background subtraction algorithms for detecting objects by performing connected component analysis (a minimal sketch follows this list).
• Extracted SIFT features from image patches obtained using the bounding boxes of the annotated frames.
• Trained different classifiers, namely SVM, random forest, and decision tree, to predict the labels of detected objects.
• Extracted features from images using convolutional neural networks (CNNs) pre-trained on the ILSVRC 2012 dataset, using Caffe for feature extraction.
• For detection, implemented Selective Search, which generates all possible object locations in a given image; it is a data-driven approach that combines the strengths of segmentation and exhaustive search.
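
A minimal OpenCV sketch of the detection step, assuming MOG2 background subtraction followed by connected-component analysis; the video file name, area threshold, and morphology step are illustrative, and the downstream classifier is omitted.

import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.avi")        # hypothetical input stream
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                                    # foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):                         # label 0 is the background
        x, y, w, h, area = stats[i]
        if area > 200:                            # ignore tiny blobs / noise
            patch = frame[y:y + h, x:x + w]       # candidate object for the SVM / CNN classifier
cap.release()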

report


Direct Content Analysis for Scene Intensity Estimation in Movies using low-level multimodal features
Supervisor: Dr. Tanaya Guha, Indian Institute of Technology Kanpur

The project aims at developing a computational model to estimate scene intensity profile in movies or videos. Scene intensity can be understood as a measure of excitement or activity in a scene.

• Exploited computable video features, namely average shot length, color variance, motion content, lighting key, motion energy, and harmonicity, to compute scene intensity.
• Incorporated facial emotion detection using the optical flow of facial interest points.
• Created a small dataset by manually timestamping scene boundaries and conducted a survey asking people how critical they considered these scenes to be in a particular movie.
• Various cinematic principles and video features were exploited for robust scene intensity estimation.

Selected as the best project in the course, which comprised around 30 students.

report | presentation | paper presentation


Dictionary Learning and Sparse representation based Image Processing Applications
Supervisor: Dr. Ketan Rajawat, Indian Institute of Technology Kanpur

The project aimed at exploring various dictionary learning algorithms (k-SVD, MOD, OMP) and implementing sparse-representation-based applications in image processing such as image denoising, inpainting, classification, and compression.

• Implemented image inpainting (filling in corrupted pixels in the target region) using sparse representation over a dictionary learned from patches randomly sampled from the source region of the image (a minimal sketch of the patch-dictionary pipeline follows this list).
• Compared sparse-representation-based image denoising using an overcomplete DCT dictionary with state-of-the-art methods.
• Implemented sparse-representation-based image classification on the MNIST dataset.
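
A minimal scikit-learn sketch of the patch-dictionary pipeline (dictionary learning plus OMP sparse coding and reconstruction); the image, patch size, and dictionary size are placeholders.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

image = np.random.rand(64, 64)                              # stand-in for a grayscale image
patches = extract_patches_2d(image, (8, 8)).reshape(-1, 64)
mean = patches.mean(axis=1, keepdims=True)
patches -= mean                                             # work on zero-mean patches

dico = MiniBatchDictionaryLearning(n_components=100, transform_algorithm="omp",
                                   transform_n_nonzero_coefs=5)
codes = dico.fit(patches).transform(patches)                # sparse codes via OMP
recon = codes @ dico.components_ + mean                     # reconstructed patches
restored = reconstruct_from_patches_2d(recon.reshape(-1, 8, 8), image.shape)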

report


Age and Gender Recognition of a Speaker from Interactive Voice Response (IVR) Systems
Supervisor: Dr. R. M. Hegde, Indian Institute of Technology Kanpur

The project aimed at building a system for Age and Gender Recognition using speech features.

• Pre-processed the speech signals and extracted useful long-term and short-term features, including MFCCs (Mel Frequency Cepstral Coefficients), Shifted Delta Cepstral (SDC) features, pitch, and the first three formants.
• Trained a 128-mixture GMM with MAP adaptation on the MFCCs and used WSNMF for dimensionality reduction (a minimal sketch follows this list).
• Analysed the performance of various machine learning classifiers such as Support Vector Machines (SVM), Random Forests, and Decision Trees.
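
A minimal sketch of MFCC extraction and per-class GMM scoring with librosa and scikit-learn; the file lists and class labels are hypothetical, and the MAP adaptation and WSNMF steps are omitted.

import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T        # (frames, 13)

train = {"male_adult": ["m1.wav", "m2.wav"],                    # hypothetical file lists
         "female_adult": ["f1.wav", "f2.wav"]}
models = {}
for label, files in train.items():
    feats = np.vstack([mfcc_features(f) for f in files])
    models[label] = GaussianMixture(n_components=128, covariance_type="diag").fit(feats)

def classify(path):
    feats = mfcc_features(path)
    # Pick the class whose GMM assigns the highest average frame log-likelihood.
    return max(models, key=lambda lbl: models[lbl].score(feats))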

report | presentation


Analysis of Benford's Law in Digital Image Forensics
Supervisor: Dr. Sumana Gupta, Indian Institute of Technology Kanpur

Analysed various applications of Benford's law in Digital Image Forensics.

• JPEG and JPEG2000 images approximately follow this law, with JPEG2000 found to follow it more closely. Deviation from the law in different forensic setups can be used as a fingerprint; further, the amount of deviation from Benford's law in compressed images can be used to detect forgery (a minimal sketch follows this list).
• Performed simulations to detect multiple compressions in JPEG images, glare in UCID images, etc.
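
A minimal sketch of the first-digit analysis, comparing block-DCT coefficients of a grayscale image against the Benford distribution P(d) = log10(1 + 1/d); the image and thresholds are placeholders.

import numpy as np
from scipy.fft import dctn

benford = np.log10(1 + 1 / np.arange(1, 10))          # expected first-digit probabilities

image = np.random.rand(64, 64) * 255                  # stand-in for a grayscale image
blocks = image.reshape(8, 8, 8, 8).swapaxes(1, 2).reshape(-1, 8, 8)   # 8x8 blocks
coeffs = dctn(blocks, axes=(1, 2), norm="ortho")      # block DCT, as in JPEG
ac = np.abs(coeffs.reshape(-1, 64)[:, 1:]).ravel()    # drop each block's DC term
ac = ac[ac >= 1]                                      # keep coefficients with a leading digit
first_digits = (ac / 10 ** np.floor(np.log10(ac))).astype(int)
observed = np.bincount(first_digits, minlength=10)[1:10] / first_digits.size
deviation = np.abs(observed - benford).sum()          # large deviation hints at tampering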

report


Non-Parametric Method for Indoor Fire-Fighter Tracking using Data Depth-based Localizers
Supervisor: Dr. R. M. Hegde, Indian Institute of Technology Kanpur

The project aimed at developing maximum-depth-based, kNN depth-based, and hybrid MD-kNN localizers for ad-hoc sensor networks, with performance improvements in terms of computation time and robustness.

• Implemented depth-based localizers in MATLAB and R for various depth functions such as Tukey, Liu, Oja, L1, and Mahalanobis depth, and analysed the time complexity and robustness of each depth function.
• Performed training at each grid point using offline observation vectors collected from all the anchors in the network, and mapped the online observation vector to the appropriate grid point using the localizers.
• Analysed the Intel Lab dataset to assess the robustness, localization success rate, and computation time of the localizers.

report

Internships

Samsung Research and Development Institute, Bengaluru (SRIB) [May'15-June'15]
Supervisor: Srinivas Rao Kudavelly, Principal Engineer, Innovation & Enterprise Biz Division / HME (Health and Medical Equipment) / Ultrasound

Pyramidal Implementation of Lucas-Kanade-Tomasi (LKT) Feature Tracking Algorithm for 3D Images

• The project aimed at a C++ implementation of the Lucas-Kanade-Tomasi (LKT) feature tracking algorithm for 3D ultrasound images (echocardiograms). A pyramidal implementation of the algorithm improves local accuracy and robustness (an illustrative 2-D sketch follows this list).
• Analysed the sensitivity of the algorithm to various parameters.
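
An illustrative 2-D OpenCV sketch of pyramidal LKT tracking; the internship implementation was in C++ and extended the idea to 3-D echocardiogram volumes, so the file name and parameters here are assumptions.

import cv2

cap = cv2.VideoCapture("echo_sequence.avi")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # maxLevel sets the number of pyramid levels; larger motion needs more levels.
    new_points, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, points, None, winSize=(21, 21), maxLevel=3)
    points = new_points[status.ravel() == 1].reshape(-1, 1, 2)   # keep successfully tracked points
    prev_gray = gray
cap.release()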

Received a pre-placement offer for a full-time position at Samsung.



inspired by this website