
Shashi Kant Gupta

Data Scientist (Conversational AI)
RingCentral Innovation (India) Pvt Ltd

About Me

I am a Data Scientist on the Conversational AI team at RingCentral Innovation (India). At RingCentral, I have worked on multilingual speech recognition, text punctuation restoration, speaker identification, and spoken language identification.

Before joining RingCentral, I completed my bachelor's and master's degrees at IIT Kanpur, where I was advised by Dr. Gabriel Kreiman (Harvard Medical School, Boston, USA) and Prof. K. S. Venkatesh (IIT Kanpur, India). For my master's thesis ("An Integrated Computational Model of Visual Search Combining Eccentricity, Bottom-up, and Top-down Cues"), I worked at the intersection of computer vision, deep learning, cognitive science, and neuroscience. During my undergraduate years, I was awarded the prestigious Khorana Scholarship by IUSSTF, WINStep Forward, and DBT (India) to work with Dr. Gabriel Kreiman at Harvard Medical School, Boston, USA on computational neuroscience and deep learning. Before that, I also explored robotics during my internship at the Centre for Smart Systems, SUTD Singapore, and as part of the Humanoid IITK team.

During my academic career, I have had the good fortune to work with some incredible researchers: Dr. Gabriel Kreiman (Harvard Medical School, Boston, and CBMM, MIT), Mengmi Zhang (NTU Singapore and A*STAR Singapore), Prof. K. S. Venkatesh (IIT Kanpur, India), and Prof. Nisheeth Srivastava (IIT Kanpur, India).

In my free time, I enjoy gardening, pencil sketching, kirigami, and watching anime.


Publications

Shashi Kant Gupta, Sushant Hiray, Prashant Kukde, "Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech", Interspeech 2023 [paper]

This work improves a spoken language identification (LangId) system for a challenge on building systems that remain reliable for non-standard, accented (Singaporean accent), spontaneous, code-switched, and child-directed speech collected via Zoom. We propose a two-stage encoder-decoder E2E model. The encoder consists of 1D depthwise separable convolutions with Squeeze-and-Excitation (SE) layers and a global context, and the decoder uses an attentive temporal pooling mechanism to obtain a fixed-length, time-independent feature representation. The model has about 22.1M parameters, which is relatively light compared to large-scale pre-trained speech models. We achieved an EER of 15.6% in the closed track and 11.1% in the open track (baseline system: 22.1%). We also curated additional LangId data from YouTube videos featuring Singaporean speakers, which will be released for public use.

Shashi Kant Gupta, Mengmi Zhang, Chia-Chien Wu, Jeremy M. Wolfe, Gabriel Kreiman, "Visual Search Asymmetry: Deep Nets and Humans Share Similar Inherent Biases", Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS 2021) [paper]

Visual search is a ubiquitous and often challenging daily task, exemplified by looking for the car keys at home or a friend in a crowd. An intriguing property of some classical search tasks is an asymmetry such that finding a target A among distractors B can be easier than finding B among A. To elucidate the mechanisms responsible for asymmetry in visual search, we propose a computational model that takes a target and a search image as inputs and produces a sequence of eye movements until the target is found. The model integrates eccentricity-dependent visual recognition with target-dependent top-down cues. We compared the model against human behavior in six paradigmatic search tasks that show asymmetry in humans. Without prior exposure to the stimuli or task-specific training, the model provides a plausible mechanism for search asymmetry. We hypothesized that the polarity of search asymmetry arises from experience with the natural environment. We tested this hypothesis by training the model on augmented versions of ImageNet in which the biases of natural images were either removed or reversed. The polarity of search asymmetry disappeared or was altered depending on the training protocol. This study highlights how classical perceptual properties can emerge in neural network models, without task-specific training, as a consequence of the statistical properties of the developmental diet fed to the model. All source code and data are publicly available here.

Shivi Gupta, Shashi Kant Gupta, "Investigating Emotion-Color Association in Deep Neural Networks", Annual Conference of the Cognitive Science Society 2021 (Abstract) [paper]

Deep neural network representations correlate well with neural responses measured in primate brains and with psychological representations from human similarity-judgement tasks, making them plausible models for human behavior. This study investigates whether DNNs can learn an implicit association between colors and emotions from images. In an experiment, subjects were asked to select a color for a given emotion-inducing image. These human responses (decision probabilities) were modeled using representations extracted from pre-trained DNNs for the images and the colors (a square of the color). The resulting model showed a fuzzy linear relationship with the decision probabilities. Finally, it was presented as a model for emotion classification, specifically with very few training examples, showing improved accuracy over a standard classification model. This analysis may be of relevance to psychologists studying these associations and to AI researchers modelling emotional intelligence in machines.

Shashi Kant Gupta, "Reinforcement Based Learning on Classification Task Could Yield Better Generalisation and Adversarial Accuracy", Workshop on Shared Visual Representations in Human and Machine Intelligence (SVRHM), NeurIPS 2020 [paper]

Deep learning has become enormously popular in computer vision, attaining near or above human-level performance on various vision tasks. Recent work, however, has demonstrated that deep neural networks are very vulnerable to adversarial examples: inputs that look similar to the original data but fool the model into predicting the wrong class. Humans are very robust against such perturbations; one possible reason is that humans do not learn to classify from an error between a "target label" and a "predicted label", but from reinforcements they receive on their predictions. In this work, we propose a novel method to train deep learning models on image classification. We use a reward-based optimization function, similar to the vanilla policy gradient method in reinforcement learning, instead of the conventional cross-entropy loss. An empirical evaluation on CIFAR-10 showed that our method learns a more robust classifier than the same architecture trained with cross-entropy loss (under adversarial training). At the same time, our method generalizes better: the gap between test and train accuracy stays below 2% most of the time, whereas for the cross-entropy model it mostly remains above 2%.

Shashi Kant Gupta, "A More Biologically Plausible Local Learning Rule for ANNs", Beyond Backpropagation Workshop, NeurIPS 2020 [paper]

The backpropagation algorithm is often debated for its biological plausibility, and various learning methods have been proposed in search of more biologically plausible alternatives. Most try to solve the "weight transport problem" and propagate errors backward through the architecture by some alternative means. In this work, we investigate a different approach that uses only local information capturing spike-timing information, with no propagation of errors. The proposed learning rule is derived from the concepts of spike-timing-dependent plasticity and neuronal association. A preliminary evaluation on binary classification of the MNIST and IRIS datasets with two hidden layers shows performance comparable with backpropagation. The model learned with this method also suggests better adversarial robustness against the FGSM attack than a model trained by backpropagating cross-entropy loss. The local nature of the learning opens the possibility of large-scale distributed and parallel learning in the network. Finally, the proposed method is more biologically grounded and may help in understanding how biological neurons learn different abstractions.
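The attentive temporal pooling used in the Interspeech 2023 system above can be sketched roughly as follows. This is a minimal NumPy sketch of the general technique, not the paper's implementation; the parameter shapes and the mean-plus-std pooled statistics are assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_temporal_pooling(frames, w, b, v):
    """Collapse a (T, D) sequence of encoder frames into one fixed-length vector.

    frames: (T, D) frame-level encoder outputs
    w: (D, D), b: (D,), v: (D,) -- attention parameters (hypothetical shapes)
    Returns a (2*D,) utterance-level vector whose size is independent of T.
    """
    scores = np.tanh(frames @ w + b) @ v          # (T,) unnormalized attention scores
    alpha = softmax(scores)                        # attention weights over time
    mean = (alpha[:, None] * frames).sum(axis=0)   # attention-weighted mean, (D,)
    var = (alpha[:, None] * (frames - mean) ** 2).sum(axis=0)
    std = np.sqrt(np.maximum(var, 1e-9))           # attention-weighted std, (D,)
    return np.concatenate([mean, std])             # fixed-length representation
```

Because the pooled statistics are weighted sums over the time axis, the output length depends only on the feature dimension D, which is what makes the representation time-independent.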
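The emotion-color modeling setup above, which compares pre-trained DNN features of an image against features of candidate color squares, could be sketched as below. The cosine-similarity-plus-softmax scoring is an illustrative assumption, not the model actually fitted in the paper.

```python
import numpy as np

def color_choice_probs(img_feat, color_feats, temperature=1.0):
    """Turn DNN feature similarities into a distribution over color choices.

    img_feat: (D,) features of an emotion-inducing image from a pre-trained DNN
    color_feats: (C, D) features of C solid-color squares from the same DNN
    Returns (C,) probabilities, one per candidate color.
    """
    img = img_feat / np.linalg.norm(img_feat)
    cols = color_feats / np.linalg.norm(color_feats, axis=1, keepdims=True)
    scores = cols @ img                    # cosine similarity per color
    e = np.exp(scores / temperature)       # softmax over candidate colors
    return e / e.sum()
```

Such a distribution could then be compared against the human decision probabilities collected in the experiment.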
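The reward-based objective in the SVRHM 2020 paper above, a vanilla-policy-gradient-style loss in place of cross-entropy, can be sketched for a single example as follows. The +/-1 reward scheme is an assumption for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(logits, label, rng):
    """One REINFORCE-style step for classification.

    Instead of minimizing cross-entropy against the true label, sample a
    predicted class from the softmax policy, receive a reward, and compute
    the gradient of -reward * log pi(action) with respect to the logits.
    """
    probs = softmax(logits)
    action = rng.choice(len(probs), p=probs)    # sampled class prediction
    reward = 1.0 if action == label else -1.0   # assumed +/-1 reward scheme
    grad = probs.copy()
    grad[action] -= 1.0                         # d(-log pi(action)) / d logits
    return reward * grad, reward
```

Descending this gradient increases the probability of rewarded predictions and decreases that of penalized ones; since the softmax Jacobian term sums to zero, the returned gradient always sums to zero as well.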
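The local, backpropagation-free learning idea in the last paper above can be illustrated with a simple Hebbian-style update, where a layer's weights change using only the pre- and post-synaptic activity available at that layer. This is a generic stand-in with an assumed multiplicative normalization for stability, not the exact STDP-derived rule from the paper.

```python
import numpy as np

def local_hebbian_update(w, pre, post, lr=0.01):
    """Update one layer's weights using only local activity.

    w: (N_out, N_in) weights, pre: (N_in,) input activity,
    post: (N_out,) output activity. No error signal is propagated
    from other layers; everything the update needs is local.
    """
    dw = lr * np.outer(post, pre)        # Hebbian: co-active units strengthen
    w_new = w + dw
    # Normalize each output neuron's weights so they cannot grow unboundedly.
    norms = np.linalg.norm(w_new, axis=1, keepdims=True)
    return w_new / np.maximum(norms, 1e-9)
```

Because each layer updates independently of the others, layers could in principle be trained in a distributed, parallel fashion, which is the scalability point the abstract raises.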