Kenny Shirley
Principal Inventive Scientist
kshirley at
Statistics Research Department
AT&T Labs Research

I am a member of the Statistics Research Department at AT&T Labs in New York City, where I work on hierarchical Bayesian modeling, MCMC methods, visualization of hierarchies, text mining, and other topics related to applied statistics.



Next Thursday (May 21st) my colleague Wei Wang will present our paper "Breaking Bad: Detecting Malicious Domains Using Word Segmentation" at the Web 2.0 Security and Privacy (W2SP) workshop in San Jose, CA. In the paper we describe how we segmented a set of domain names into individual tokens, and then used the resulting bag of words as an additional feature set to improve the predictive accuracy of our model for detecting malicious domains. The outcome variable ("maliciousness") was gathered from the Web of Trust, a crowdsourced website reputation rating service.

Some highlights of this project were:

  1. Reading Peter Norvig's chapter of the book Beautiful Data, which includes a description of the word segmentation algorithm, along with Python code to implement it. On a side note, this is the second place I've seen an example of using statistics to break a substitution cipher. The first time was in this very entertaining paper by Persi Diaconis.
  2. Using the R package glmnet to do lasso-penalized logistic regression (a really nice way to handle large numbers of features).
  3. Discovering that the names of certain basketball players are strongly associated with malicious domains (at least according to our definition of "malicious"), including "kobe", "jordan", and "lebron". I guess Kevin Durant and Carmelo Anthony are probably jealous that their names aren't yet showing up in the domain names of phishing websites as often as their peers.
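
To give a flavor of the segmentation step, here is a minimal sketch in Python of unigram-based word segmentation in the spirit of the algorithm in Norvig's chapter, followed by a bag-of-words count over the resulting tokens. It is not the code from the paper: the tiny vocabulary `TOY_COUNTS`, the unknown-word penalty, and the function names `word_logprob`, `segment`, and `bag_of_words` are all made up for illustration, and a real application would use word counts from a large corpus.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy unigram counts, for illustration only (not from the paper).
TOY_COUNTS = {
    "free": 50, "online": 40, "sale": 30, "cheap": 25,
    "shoes": 20, "shoe": 15, "jordan": 8, "kobe": 5,
}
TOTAL = sum(TOY_COUNTS.values())


def word_logprob(word):
    """Log-probability of a single token under the toy unigram model."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    # Assumed penalty for unknown tokens: shrinks by a factor of 10
    # for each extra character, so long unknown strings are very unlikely.
    return -math.log(TOTAL) - len(word) * math.log(10)


@lru_cache(maxsize=None)
def segment(text):
    """Return (log-probability, tokens) for the most probable segmentation."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_logprob, rest_tokens = segment(rest)
        candidates.append((word_logprob(first) + rest_logprob,
                           (first,) + rest_tokens))
    # Tuples compare by log-probability first, so max() picks the best split.
    return max(candidates)


def bag_of_words(labels):
    """Count tokens across the segmented labels (one label per domain)."""
    counts = Counter()
    for label in labels:
        counts.update(segment(label)[1])
    return counts


if __name__ == "__main__":
    for label in ["cheapjordanshoes", "freekobeshoes"]:
        print(label, "->", " ".join(segment(label)[1]))
    print(bag_of_words(["cheapjordanshoes", "freekobeshoes"]))
```

The token counts from `bag_of_words` are the kind of feature that could then be fed, alongside other features, into a penalized classifier such as the lasso logistic regression mentioned in item 2.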


I'm happy to report that my R package for visualizing topic models, LDAvis, is now on CRAN! It's a D3.js interactive visualization that's designed to help you interpret the topics in a topic model fit to a corpus of text using LDA. I co-wrote it with Carson Sievert, and we also wrote a paper about it (including a user study) that we shared at the 2014 ACL Workshop on Interactive Language Learning, Visualization, and Interfaces in Baltimore last June. Here are the relevant links -- we'd love to hear any questions/comments/feedback.


Last night I gave a talk for the NYC Sports Analytics Meetup Group, and it was a blast! There were lots of great sports researchers and enthusiasts in the crowd. My talk was about Baseball Hall of Fame voting, of course. Here is a link to the slides from my talk.
