The goal of our text mining projects is the automatic discovery of noteworthy patterns and trends in large document collections. We are interested not only in identifying such patterns and trends but also in revealing them to human users through useful user interfaces. Our hypothesis is that the process of discovering meaningful patterns is best accomplished by cooperation between automatic methods and human expertise. Our work involves models for document clustering and for topic modeling. The interplay between such models and interactive user interfaces is an area of current investigation.

Publications

Probabilistic Explicit Topic Modeling Using Wikipedia
Joshua Hansen, Eric Ringger, Kevin Seppi
In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013)
Despite popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections with relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topic analysis suffers from a lack of identifiability between topics, not only across independently analyzed corpora but also across distinct runs of the algorithm on the same corpus. This paper introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). Both methods estimate topic-word distributions a priori from Wikipedia articles, with each article corresponding to one topic and the article title serving as a topic label. LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess their effectiveness by means of crowd-sourced user studies on two tasks: topic label generation and document label generation. We find that LDA-STWD improves substantially upon the performance of the state of the art on the document labeling task, and that both methods otherwise perform on par with a state-of-the-art post hoc method.
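To make the a priori estimation step concrete, below is a minimal sketch of the idea shared by LDA-STWD and EDA: each Wikipedia article becomes one topic whose word distribution is its smoothed relative word frequencies, with the article title as the topic label. The tokenizer, the smoothing constant, and the explicit_topics function are illustrative assumptions, not the authors' exact preprocessing.
<code python>
# Minimal sketch: one smoothed topic-word multinomial per Wikipedia article,
# labeled by the article title (illustrative, not the paper's exact pipeline).
from collections import Counter

def explicit_topics(articles, vocab, alpha=0.01):
    """articles: dict mapping title -> article text; vocab: list of word types."""
    index = {w: i for i, w in enumerate(vocab)}
    topics = {}
    for title, text in articles.items():
        counts = Counter(w for w in text.lower().split() if w in index)
        total = sum(counts.values()) + alpha * len(vocab)
        # One smoothed multinomial over the vocabulary per article (= per topic).
        topics[title] = [(counts.get(w, 0) + alpha) / total for w in vocab]
    return topics

# Example: two tiny "articles" act as labeled topics.
topics = explicit_topics(
    {"Baseball": "bat ball pitcher bat", "Banking": "loan bank interest"},
    vocab=["bat", "ball", "pitcher", "loan", "bank", "interest"],
)
print(topics["Baseball"])
</code>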
Semantic Models as a Combination of Free Association Norms and Corpus-based Correlations
Derrall Heath, David Norton, Eric Ringger, Dan Ventura
In Proceedings of the Seventh IEEE International Conference on Semantic Computing (ICSC 2013)
We present computational models capable of understanding and conveying concepts based on word associations. We discover word associations automatically using corpus-based semantic models with Wikipedia as the corpus. The best model effectively combines corpus-based models with preexisting databases of free association norms gathered from human volunteers. We use this model to play human-directed and computer-directed word guessing games (games with a purpose similar to Catch Phrase or Taboo) and show that this model can measurably convey and understand some aspect of word meaning. The results highlight the fact that human-derived word associations and corpus-derived word associations can play complementary roles in semantic models.
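As a rough illustration of the combination idea, the sketch below interpolates a corpus-derived association score (cosine similarity of co-occurrence vectors) with a score from a database of free association norms. The linear weighting, the function names, and the toy data are assumptions for illustration; the paper's actual combination may differ.
<code python>
# Hedged sketch: combine a corpus-based correlation with a human free
# association norm into a single word-association score (illustrative only).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def association(w1, w2, cooc, norms, lam=0.5):
    corpus_score = cosine(cooc[w1], cooc[w2])   # corpus-based correlation
    norm_score = norms.get((w1, w2), 0.0)       # free association norm from humans
    return lam * corpus_score + (1 - lam) * norm_score

cooc = {"dog": [4, 1, 0], "cat": [3, 2, 0], "bank": [0, 0, 5]}
norms = {("dog", "cat"): 0.61}
print(association("dog", "cat", cooc, norms))
</code>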
Evaluating Supervised Topic Models in the Presence of OCR Errors
Daniel Walker, Eric Ringger, Kevin Seppi
The Conference on Document Recognition and Retrieval XX (DRR 2013)
Received best student paper award
Topic discovery using unsupervised topic models degrades as error rates increase in OCR transcriptions of historical document images. Despite the availability of metadata, analyses by supervised topic models, such as Supervised LDA and Topics Over Nonparametric Time, exhibit similar degradation.
Topics Over Nonparametric Time: A Supervised Topic Model Using Bayesian Nonparametric Density Estimation
Daniel Walker, Eric Ringger, Kevin Seppi
Proceedings of the 9th Bayesian Modeling Applications Workshop (UAI 2012)
We introduce a new supervised topic model that uses a nonparametric density estimator to model the distribution of real-valued metadata given a topic. The model is similar to Topics Over Time, but replaces the beta distributions used in that model with a Dirichlet process mixture of normals. The use of a nonparametric density estimator allows for the fitting of a greater class of metadata densities. We compare our model with existing supervised topic models in terms of prediction and show that it is capable of discovering complex metadata distributions in both synthetic and real data.
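The sketch below illustrates the density-estimation component on synthetic timestamps, using scikit-learn's variational approximation to a Dirichlet process mixture of normals rather than the authors' own inference procedure; the truncation level and the data are illustrative assumptions.
<code python>
# Hedged sketch: fit a (truncated, variational) Dirichlet process mixture of
# normals to real-valued metadata such as document timestamps.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

timestamps = np.concatenate([
    np.random.normal(1840, 5, 200),   # one burst of documents
    np.random.normal(1900, 15, 300),  # a broader, later burst
]).reshape(-1, 1)

dpmm = BayesianGaussianMixture(
    n_components=20,  # truncation level; unused components get near-zero weight
    weight_concentration_prior_type="dirichlet_process",
).fit(timestamps)

# Log-density of candidate metadata values under the fitted mixture.
print(dpmm.score_samples(np.array([[1845.0], [1950.0]])))
</code>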
Knowledge Homogeneity and Specialization in the Apache HTTP Server Project
Alexander MacLean, Landon Pratt, Charles Knutson, Eric Ringger
Proceedings of the 7th International Conference on Open Source Systems (OSS 2011)
We present an analysis of developer communication in the Apache HTTP Server project. Using topic modeling techniques we expose latent sub-communities arising from developer specialization within the greater developer population.
Cliff Walls: An Analysis of Monolithic Commits Using Latent Dirichlet Allocation
Landon Pratt, Alexander MacLean, Charles Knutson, Eric Ringger
Proceedings of the 7th International Conference on Open Source Systems (OSS 2011)
Large commits, which we refer to as “Cliff Walls”, are a significant challenge to studies of software evolution because they do not appear to represent incremental development. We used Latent Dirichlet Allocation to extract topics from over 2 million commit log messages, taken from 10,000 SourceForge projects. The topics generated through this method were then analyzed to determine the causes of over 9,000 of the largest commits. We found that branch merges, code imports, and auto-generated documentation were significant causes of large commits.
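For readers unfamiliar with the pipeline, a hedged sketch of the basic analysis follows: fit LDA over commit log messages and inspect the topics of a large commit. The gensim calls are standard, but the messages and topic count are illustrative, not the study's actual configuration.
<code python>
# Hedged sketch: LDA over commit log messages (toy data, illustrative settings).
from gensim import corpora, models

commit_messages = [
    "merge branch release into trunk",
    "initial import of vendor source tree",
    "regenerate api documentation",
]
texts = [msg.split() for msg in commit_messages]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
# Topic mixture for the log message of a suspiciously large commit:
print(lda.get_document_topics(dictionary.doc2bow("merge branch trunk".split())))
</code>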
The Topic Browser: An Interactive Tool for Browsing Topic Models
Matthew Gardner, Joshua Lutes, Jeff Lund, Josh Hansen, Dan Walker, Eric Ringger, Kevin Seppi
Proceedings of the Workshop on Challenges of Data Visualization (NIPS 2010)
We present the Topical Guide (formerly “the Topic Browser”), an interactive tool that incorporates both prior work in displaying topic models and novel ideas that greatly enhance the visualization of these models. The Topical Guide is a general tool for browsing the entire output of a topic model along with the analyzed corpus. With expert interaction, the Topical Guide together with the underlying topic models can provide valuable insights into a given corpus.
Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
Dan Walker, Bill Lund, Eric Ringger
EMNLP 2010
We show the effects of OCR errors on both document-level topic analysis (document clustering) and word-level topic analysis (LDA), using both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but in the case of LDA they exhibit failure trends similar to those of models trained on unprocessed OCR output.
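The low-frequency filter mentioned above can be sketched in a few lines: drop word types whose corpus frequency falls below a threshold before fitting the model, since OCR errors tend to produce rare, garbled types. The threshold and the toy documents below are assumptions for illustration.
<code python>
# Minimal sketch of a low-frequency word filter for noisy OCR text.
from collections import Counter

def filter_low_frequency(docs, min_count=5):
    counts = Counter(w for doc in docs for w in doc)
    kept = {w for w, c in counts.items() if c >= min_count}
    # OCR errors tend to produce rare, garbled types, so they fall below the cutoff.
    return [[w for w in doc if w in kept] for doc in docs]

docs = [["the", "kingdom", "k1ngdom", "the"], ["the", "kingdom", "throne"]]
print(filter_low_frequency(docs, min_count=2))
</code>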
Bisecting Document Clustering Using Model-Based Methods
Aaron Davis
Master's Thesis. Advised by Eric Ringger.
We use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs Sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human annotated document classes when compared to the existing base clustering algorithms, with marked improvement in two out of three data sets.
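A hedged sketch of the bisecting strategy follows: repeatedly split the largest cluster in two with a model-based two-way clusterer until the target number of clusters is reached. In the thesis the base splitter is EM or Gibbs sampling on a mixture of multinomials; here split_in_two is a placeholder for any such splitter, and the toy splitter is purely illustrative.
<code python>
# Hedged sketch of bisecting model-based clustering (splitter is a placeholder).
def bisecting_cluster(docs, k, split_in_two):
    clusters = [list(docs)]
    while len(clusters) < k:
        # Choose the largest cluster to bisect (a cohesion score could be used instead).
        target = max(clusters, key=len)
        clusters.remove(target)
        left, right = split_in_two(target)  # e.g. EM on a 2-component multinomial mixture
        clusters.extend([left, right])
    return clusters

# Illustrative splitter: alternate assignment stands in for a real model-based split.
toy_split = lambda c: (c[::2], c[1::2])
print(bisecting_cluster(["d1", "d2", "d3", "d4", "d5"], k=3, split_in_two=toy_split))
</code>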
Model-Based Document Clustering with a Collapsed Gibbs Sampler
Daniel Walker, Eric Ringger
In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD 2008)
We explore the convergence rate, the possibility of label switching, and chain summarization methodologies for document clustering with a mixture-of-multinomials model using Gibbs sampling, and show that fairly simple methods can be employed while still producing clusterings of superior quality compared to those produced with the EM algorithm. We shed further light on the effective use of Gibbs sampling for document clustering.
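A compact sketch of collapsed Gibbs sampling for a mixture-of-multinomials clustering model is given below. The Dirichlet hyperparameters, iteration count, and toy count matrix are illustrative assumptions, not the paper's experimental settings.
<code python>
# Hedged sketch: collapsed Gibbs sampling for mixture-of-multinomials clustering.
import numpy as np
from scipy.special import gammaln

def collapsed_gibbs(counts, K, alpha=1.0, beta=0.1, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    z = rng.integers(K, size=D)                      # cluster assignment per document
    nk = np.zeros(K)                                 # documents per cluster
    nkw = np.zeros((K, V))                           # word counts per cluster
    for d in range(D):
        nk[z[d]] += 1
        nkw[z[d]] += counts[d]
    for _ in range(iters):
        for d in range(D):
            nk[z[d]] -= 1
            nkw[z[d]] -= counts[d]
            nd = counts[d].sum()
            # Collapsed conditional: prior term plus the Dirichlet-multinomial
            # predictive likelihood of document d under each cluster.
            logp = (np.log(nk + alpha)
                    + gammaln(nkw.sum(axis=1) + V * beta)
                    - gammaln(nkw.sum(axis=1) + nd + V * beta)
                    + (gammaln(nkw + counts[d] + beta) - gammaln(nkw + beta)).sum(axis=1))
            p = np.exp(logp - logp.max())
            z[d] = rng.choice(K, p=p / p.sum())
            nk[z[d]] += 1
            nkw[z[d]] += counts[d]
    return z

counts = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 5, 3]])  # document-by-word counts
print(collapsed_gibbs(counts, K=2))
</code>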
Sentiment Regression: Using Real-Valued Scores to Summarize Overall Document Sentiment
Adam Drake, Eric Ringger, Dan Ventura
In Proceedings of the Second IEEE International Conference on Semantic Computing (ICSC 2008)
We consider a sentiment regression problem: summarizing the overall sentiment of a review with a real-valued score. Empirical results on a set of labeled reviews show that real-valued sentiment modeling is feasible, and several algorithms improve upon baseline performance. We also analyze performance as the granularity of the classification problem moves from two-class (positive vs. negative) towards infinite-class (real-valued).
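As a minimal illustration of the task setup, the sketch below predicts a real-valued sentiment score from bag-of-words features using ridge regression, which stands in for the algorithms compared in the paper; the reviews and scores are toy examples.
<code python>
# Hedged sketch: real-valued sentiment regression from bag-of-words features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

reviews = ["great film, loved every minute",
           "dull plot and weak acting",
           "decent, though a bit slow"]
scores = [4.5, 1.0, 3.0]   # e.g. star ratings on a continuous scale

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(reviews, scores)
print(model.predict(["a great but slow film"]))
</code>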

Questions?

Please contact Eric Ringger or Kevin Seppi, or visit the Natural Language Processing research lab in room 3346 TMCB.
