Graduate Student pursuing research in Text Mining and Machine Learning at Stony Brook University


Greetings! I am a PhD candidate in the Department of Computer Science at Stony Brook University, USA, and a member of the Data Science Lab. I am advised by Prof. Steven Skiena. My research interests lie at the intersection of Text Mining, Machine Learning and Computational Social Science. At the Data Science Lab, I have worked on several projects focused on representation learning for Natural Language Processing. My thesis focuses on developing statistical models for detecting and analyzing linguistic variation in social media [proposal]. I am fortunate to have collaborated on projects with Rami Al-Rfou and Bryan Perozzi at Stony Brook; Abhradeep Guha Thakurta, who is currently at Apple; and Yashar Mehdad from Yahoo! Research (now at Airbnb). I am also a part of HLAB, where I work with Prof. Andrew Schwartz on analyzing language on social media with a human-centric focus.


  • 2016

    Domain Adaptation for Named Entity Recognition in Online Media with Word Embeddings.

    In this paper, we propose methods to effectively adapt models learned on one domain to other domains using distributed word representations. First, we analyze the linguistic variation present across domains to identify key linguistic insights that can boost performance across domains. We propose methods to capture domain-specific semantics of word usage in addition to global semantics. We then demonstrate how to effectively use such domain-specific knowledge to learn NER models that outperform previous baselines in the domain adaptation setting. pdf

  • 2016

    Multiscale Graph Embeddings for Interpretable Network Classification.

    We present Walklets, a novel approach for learning multiscale representations of vertices in a network. These representations clearly encode multiscale vertex relationships in a continuous vector space suitable for multi-label classification problems. Unlike previous work, the latent features generated using Walklets are analytically derivable and human-interpretable. pdf

  • 2016

    Freshman or Fresher? Quantifying the Geographic Variation of Internet Language [ICWSM, 2016]

    We present a new computational technique to detect and analyze statistically significant geographic variation in language. While previous approaches have primarily focused on lexical variation between regions, our method identifies words that demonstrate semantic and syntactic variation as well. We extend recently developed techniques for neural language models to learn word representations which capture differing semantics across geographical regions. In order to quantify this variation and ensure robust detection of true regional differences, we formulate a null model to determine whether observed changes are statistically significant. Our method is the first such approach to explicitly account for random variation due to chance while detecting regional variation in word meaning. Our analysis reveals interesting facets of language change at multiple scales of geographic resolution – from neighboring states to distant continents. pdf

  • 2015

    Statistically Significant Detection of Linguistic Change [WWW, 2015]

    We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book-ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium. Project Page

  • 2015

    Multilingual Named Entity Recognition [SDM, 2015]

    We build a Named Entity Recognition (NER) system for 40 languages using only language-agnostic methods. Our system relies only on unsupervised methods for feature generation. We obtain training data for the NER task through a semi-supervised technique that does not rely on any language-specific or orthographic features. This approach allows us to scale to a large set of languages for which little human expertise and human-annotated training data are available. pdf

  • 2015

    Robustness and Stability of Dropout

    Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Dropout is a popular heuristic that has been shown in practice to avoid local minima when training these networks. We investigate the robustness and stability properties of dropout. We empirically validate our stability assertions for dropout in the context of convex empirical risk minimization (ERM) and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) L2-regularization-based methods on several benchmark datasets. pdf

  • 2014

    Inducing Language Networks [Complex Networks IV, 2014]

    We induce networks on continuous space representations of words from the Polyglot and Skipgram models. We compare the structural properties of these networks and demonstrate that they differ from networks constructed through other run-of-the-mill methods. We also demonstrate that these networks exhibit a rich and varied community structure. pdf

  • 2013

    Sex Differences in the Human Connectome [BHI, 2013]

    We investigate sex differences across male and female connectomes, identifying several discriminative features. One of our main findings is a statistical difference between the sexes at the pars orbitalis, a region that has been shown to play a role in language production. pdf


Freshman or Fresher? Quantifying the Geographic Variation of Internet Language

  Vivek Kulkarni, Bryan Perozzi, Steven Skiena
  10th International Conference on Web and Social Media (ICWSM 2016)

Statistically Significant Detection of Linguistic Change

  Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, Steven Skiena
  24th International World Wide Web Conference (to appear in WWW 2015)

Polyglot-NER: Massive Multilingual Named Entity Recognition

  Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steven Skiena
  SIAM International Conference on Data Mining (SDM 2015) 

A Paper Ceiling: Explaining the Persistent Underrepresentation of Females in Printed News Coverage

  Eran Shor, Arnout van de Rijt, Alex Miltsov, Vivek Kulkarni, and Steven Skiena
  American Sociological Review

Inducing Language Networks from Continuous Space Word Representations

  Bryan Perozzi, Rami Al-Rfou, Vivek Kulkarni, Steven Skiena
  Fifth Workshop on Complex Networks (CompleNet 2014)

Sex Differences in the Human Connectome

  Vivek Kulkarni, Jagat Pudipeddi Sastry, Leman Akoglu et al. 
  Brain and Health Informatics, 2013


I interned at Yahoo! Research in the summer of 2016 and at Google in the summers of 2013 and 2015. I also spent a couple of years working at Microsoft and Juniper Networks before joining graduate school.


Awarded the prestigious Renaissance Technologies Fellowship (2014-2017).


Please email me if you would like to get in touch!