Selected Publications

User-factor adaptation is the problem of adapting NLP models to real-valued human attributes, or factors, that capture fine-grained differences between individuals. These factors can include both known factors (e.g. demographics, personality) and latent factors that can be inferred simply from an unlabeled collection of a person’s tweets. Our approach to user-factor adaptation is similar to feature augmentation, a common technique in domain adaptation, with the addition of being able to adapt to continuous variables. We find that we can improve on popular NLP tasks by putting language back into its human context.
EMNLP, 2017.
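The continuous analogue of feature augmentation mentioned above can be sketched in a few lines; the feature vector and factor scores below are hypothetical, and this is a minimal illustration of the idea rather than the paper's implementation:

```python
def augment(features, user_factors):
    """Continuous feature augmentation: concatenate the original
    feature vector with one copy scaled by each real-valued user
    factor, so a downstream model can learn factor-specific weights."""
    augmented = list(features)
    for factor in user_factors:
        augmented.extend(f * factor for f in features)
    return augmented

# Hypothetical example: two base features, two latent user factors
# (e.g., scores inferred from an unlabeled collection of tweets).
x = [1.0, 0.5]
u = [0.8, -0.2]
print(augment(x, u))  # -> [1.0, 0.5, 0.8, 0.4, -0.2, -0.1]
```

With binary domain indicators in place of the real-valued factors, this reduces to the standard feature-augmentation construction from domain adaptation.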

Natural language processing has increasingly moved from modeling documents and words toward studying the people behind the language. This move to working with data at the user or community level has presented the field with different characteristics of linguistic data. In this paper, we empirically characterize various lexical distributions at different levels of analysis, showing that, while most features are decidedly sparse and non-normal at the message level (as with traditional NLP), they follow the central limit theorem to become much more log-normal or even normal at the user and county levels. Finally, we demonstrate that modeling lexical features at the correct level of analysis leads to marked improvements in common social scientific prediction tasks.
ACL, 2017.
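The aggregation effect described above can be seen in a small simulation (the usage rate and corpus sizes are hypothetical; this is a sketch of the phenomenon, not the paper's analysis): a word's message-level counts are overwhelmingly zero, while its per-user averages concentrate around the true usage rate.

```python
import random

def simulate(num_users=500, msgs_per_user=200, p_use=0.03, seed=0):
    """Sample sparse 0/1 message-level counts of a word and aggregate
    them into per-user averages, which the central limit theorem
    drives toward a much more normal distribution."""
    rng = random.Random(seed)
    message_level, user_level = [], []
    for _ in range(num_users):
        counts = [1 if rng.random() < p_use else 0
                  for _ in range(msgs_per_user)]
        message_level.extend(counts)
        user_level.append(sum(counts) / msgs_per_user)
    return message_level, user_level

msgs, users = simulate()
# msgs is almost entirely zeros; the values in users cluster tightly
# around the true rate of 0.03.
```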

We present Walklets, a novel approach for learning multiscale representations of vertices in a network. In contrast to previous works, these representations explicitly encode multiscale vertex relationships in a way that is analytically derivable. Walklets generates these multiscale relationships by subsampling short random walks on the vertices of a graph. By ‘skipping’ over steps in each random walk, our method generates a corpus of vertex pairs which are reachable via paths of a fixed length. This corpus can then be used to learn a series of latent representations, each of which captures successively higher-order relationships from the adjacency matrix. We demonstrate the efficacy of Walklets’ latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, DBLP, Flickr, and YouTube. Our results show that Walklets outperforms new methods based on neural matrix factorization. Specifically, we outperform DeepWalk by up to 10% and LINE by 58% Micro-F1 on challenging multi-label classification tasks. Finally, Walklets is an online algorithm, and can easily scale to graphs with millions of vertices and edges.
ASONAM, 2017.
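The corpus-building step above, skipping over steps of short random walks, can be sketched as follows (a toy illustration on a hypothetical path graph; the released implementation differs):

```python
import random

def walklet_pairs(graph, walk_length, num_walks, scale, seed=0):
    """Generate (vertex, context) pairs linked by paths of exactly
    `scale` steps by skipping over intermediate steps of short
    random walks, the corpus-building step of Walklets."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_walks):
        for start in graph:
            walk = [start]
            for _ in range(walk_length - 1):
                walk.append(rng.choice(graph[walk[-1]]))
            # Pair each vertex with the one `scale` steps later.
            for i in range(len(walk) - scale):
                pairs.append((walk[i], walk[i + scale]))
    return pairs

# Toy path graph 0-1-2-3 as an adjacency list.
g = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
corpus = walklet_pairs(g, walk_length=5, num_walks=10, scale=2)
```

Feeding such a corpus to a Skip-gram style model yields representations that capture the chosen (here second-order) scale of proximity.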

Content on the Internet is heterogeneous and arises from various domains like News, Entertainment, Finance and Technology. Understanding such content requires identifying named entities (persons, places and organizations) as one of the key steps. Traditionally, Named Entity Recognition (NER) systems have been built using available annotated datasets (like CoNLL and MUC) and demonstrate excellent performance. However, these models fail to generalize onto other domains like Sports and Finance, where conventions and language use can differ significantly. Furthermore, several domains do not have large amounts of annotated labeled data for training robust Named Entity Recognition models. A key step towards addressing this challenge is to adapt models learned on domains where large amounts of annotated training data are available to domains with scarce annotated data. In this paper, we propose methods to effectively adapt models learned on one domain onto other domains using distributed word representations. First, we analyze the linguistic variation present across domains to identify key linguistic insights that can boost performance across domains. We propose methods to capture domain-specific semantics of word usage in addition to global semantics. We then demonstrate how to effectively use such domain-specific knowledge to learn NER models that outperform previous baselines in the domain adaptation setting.
CoRR, 2016.

Theano is a Python library that allows one to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers, especially in the machine learning community, and has shown steady performance improvements. Theano has been actively and continuously developed since 2008; multiple frameworks have been built on top of it, and it has been used to produce many state-of-the-art machine learning models.
CoRR, 2016.

We present a new computational technique to detect and analyze statistically significant geographic variation in language. While previous approaches have primarily focused on lexical variation between regions, our method identifies words that demonstrate semantic and syntactic variation as well. We extend recently developed techniques for neural language models to learn word representations which capture differing semantics across geographical regions. In order to quantify this variation and ensure robust detection of true regional differences, we formulate a null model to determine whether observed changes are statistically significant. Our method is the first such approach to explicitly account for random variation due to chance while detecting regional variation in word meaning. Our analysis reveals interesting facets of language change at multiple scales of geographic resolution – from neighboring states to distant continents.
ICWSM, 2016.
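The role of the null model can be illustrated with a simple permutation test (a simplified stand-in: the paper's null model operates on learned word representations, and the counts below are hypothetical). The idea is to shuffle observations across the two regions and check how often chance alone produces a difference as extreme as the observed one.

```python
import random

def permutation_pvalue(region_a, region_b, statistic, trials=1000, seed=0):
    """Estimate how often a difference at least as large as the
    observed one arises when region labels are assigned at random."""
    rng = random.Random(seed)
    observed = statistic(region_a, region_b)
    pooled = region_a + region_b
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        null = statistic(pooled[:len(region_a)], pooled[len(region_a):])
        if null >= observed:
            extreme += 1
    return extreme / trials

def usage_gap(a, b):
    """Difference in a word's mean per-document usage."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

# Hypothetical per-document counts of one word in two regions.
p = permutation_pvalue([3, 4, 5, 4], [0, 1, 0, 1], usage_gap)
# A small p suggests the regional difference is unlikely to be chance.
```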

We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Books Ngram corpus. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
WWW, 2015.
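The change point detection step can be sketched with a minimal mean-shift detector (a stand-in for the paper's statistically sound procedure; the usage series below is hypothetical):

```python
def change_point(series, min_segment=2):
    """Return the index that best splits a usage time series into two
    segments with different means, or None if no split reduces the
    total squared error."""
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)
    best_idx, best_cost = None, sse(series)
    for i in range(min_segment, len(series) - min_segment + 1):
        cost = sse(series[:i]) + sse(series[i:])
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx

# Hypothetical monthly frequencies of a word whose usage shifted.
usage = [1, 1, 2, 1, 1, 8, 9, 8, 9, 8]
print(change_point(usage))  # -> 5
```

A production detector would additionally test whether the detected shift is statistically significant rather than noise, which is the point of the paper's procedure.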

We build a Named Entity Recognition (NER) system for 40 languages using only language-agnostic methods. Our system relies only on unsupervised methods for feature generation. We obtain training data for the task of NER through a semi-supervised technique that does not rely on any language-specific or orthographic features. This approach allows us to scale to a large set of languages for which little human expertise and annotated training data are available.
SDM, 2015.

Recent Publications


Human Centered NLP with User Factor Adaptation. EMNLP, 2017.


On the Distribution of Lexical Features at Multiple Levels of Analysis. ACL, 2017.


Don’t Walk, Skip! Online Learning of Multi-scale Network Embeddings. ASONAM, 2017.


Domain Adaptation with Named Entity Recognition in Online Media with Word Embeddings. CoRR, 2016.


A Paper Ceiling: Explaining the persistent underrepresentation of women in printed news. ASR, 2015.


Statistically Significant Detection of Linguistic Change. WWW, 2015.


Polyglot NER: Massive Multilingual Named Entity Recognition. SDM, 2015.


To drop or not to drop: Robustness, consistency and differential privacy properties of dropout. CoRR, 2015.


Recent & Upcoming Talks

Focused Representation Learning
Feb 17, 2017 1:00 PM
Focused Representation Learning
Feb 14, 2017 1:00 PM
Statistical Models for Linguistic Variation in Online Media
Oct 5, 2016 1:00 PM
Statistically Significant Detection of Linguistic Change
Nov 1, 2015 1:00 PM


I have been a teaching assistant for the following courses at Stony Brook University:

  • CSE512: Machine Learning (Graduate)
  • CSE549: Computational Biology (Graduate)
  • CSE305: Introduction to Database Systems