== Publications ==
  
^ [[http://link.springer.com/chapter/10.1007/978-3-642-40722-2_7|Probabilistic Explicit Topic Modeling Using Wikipedia]] ^^
| {{media:nlp:120px-explicit-topics-wikipedia.png}} | Joshua Hansen, Eric Ringger, Kevin Seppi |
| ::: | ''' In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013) ''' |
| ::: | Despite the popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections with relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topic analysis suffers from a lack of identifiability between topics, not only across independently analyzed corpora but also across distinct runs of the algorithm on the same corpus. This paper introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). Both methods estimate topic-word distributions a priori from Wikipedia articles, with each article corresponding to one topic and the article title serving as a topic label. LDA-STWD and EDA overcome the non-identifiability, isolation, and uninterpretability of LDA output. We assess their effectiveness by means of crowd-sourced user studies on two tasks: topic label generation and document label generation. We find that LDA-STWD improves substantially upon the performance of the state of the art on the document labeling task, and that both methods otherwise perform on par with a state-of-the-art post hoc method. |
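The mechanism shared by LDA-STWD and EDA is inference against a fixed, precomputed topic-word matrix rather than one learned from the target corpus. A minimal sketch of that idea, assuming a row-stochastic matrix phi estimated from Wikipedia article word counts; the function and variable names are illustrative, not taken from the paper's code:

<code python>
import numpy as np

# Sketch of explicit topic inference in the spirit of LDA-STWD/EDA:
# the topic-word distributions phi are fixed a priori (one row per
# Wikipedia article, estimated from that article's word counts), and
# only the document's topic mixture theta is inferred.

def infer_theta(doc_word_ids, phi, n_iters=50):
    """EM-style folding-in of one document against fixed topics.

    doc_word_ids -- token ids of the document's words
    phi          -- (n_topics, vocab_size) row-stochastic matrix
    """
    n_topics = phi.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)
    for _ in range(n_iters):
        # E-step: responsibility of each topic for each token
        resp = theta[:, None] * phi[:, doc_word_ids]
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: re-estimate the document's topic mixture
        theta = resp.sum(axis=1)
        theta /= theta.sum()
    return theta  # topic k is labeled by the k-th article's title
</code>

Because phi never changes, topic k means the same thing across corpora and across runs, which is the identifiability property the paper is after.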
  
  
^ [[http://darci.cs.byu.edu/dheath/pubs/icsc_2013.pdf|Semantic Models as a Combination of Free Association Norms and Corpus-based Correlations]] ^^
| {{media:nlp:120px-semantic-models.png}} | Derrall Heath, David Norton, Eric Ringger, Dan Ventura |
| ::: | ''' In Proceedings of the Seventh IEEE International Conference on Semantic Computing (ICSC 2013) ''' |
| ::: | We present computational models capable of understanding and conveying concepts based on word associations. We discover word associations automatically using corpus-based semantic models with Wikipedia as the corpus. The best model effectively combines corpus-based models with preexisting databases of free association norms gathered from human volunteers. We use this model to play human-directed and computer-directed word guessing games (games with a purpose, similar to Catch Phrase or Taboo) and show that this model can measurably convey and understand some aspect of word meaning. The results highlight the fact that human-derived word associations and corpus-derived word associations can play complementary roles in semantic models. |
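A hedged illustration of the combination idea: blend a corpus-derived association score with a human free-association norm when one exists. The weighting scheme below is an assumption for illustration, not the authors' actual model:

<code python>
# Hypothetical blend of corpus-derived and human-derived association
# scores; alpha and the fallback behavior are illustrative assumptions.

def combined_association(cue, target, corpus_score, norms, alpha=0.5):
    """corpus_score: corpus-based association in [0, 1] (e.g., scaled PMI).
    norms: dict mapping (cue, target) to the fraction of volunteers who
    answered target when prompted with cue."""
    human = norms.get((cue, target))
    if human is None:
        return corpus_score          # no norm recorded for this pair
    return alpha * human + (1 - alpha) * corpus_score
</code>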
  
  
^ [[http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1568659|Evaluating Supervised Topic Models in the Presence of OCR Errors]] ^^
| {{media:nlp:120px-supervised-noisy-tm.png}} | Daniel Walker; Eric Ringger; Kevin Seppi |
| ::: | ''' The Conference on Document Recognition and Retrieval XX (DRR 2013) ''' |
| ::: | ''' Received best student paper award ''' |
  
  
^ [[http://www.abnms.org/uai2012-apps-workshop/papers/WalkerEtal.pdf|Topics Over Nonparametric Time: A Supervised Topic Model Using Bayesian Nonparametric Density Estimation]] ^^
| {{media:nlp:120px-supervised-tonpt.png}} | Daniel Walker; Eric Ringger; Kevin Seppi |
| ::: | ''' Proceedings of the 9th Bayesian Modeling Applications Workshop (UAI 2012) ''' |
| ::: | We introduce a new supervised topic model that uses a nonparametric density estimator to model the distribution of real-valued metadata given a topic. The model is similar to Topics Over Time, but replaces the beta distributions used in that model with a Dirichlet process mixture of normals. The use of a nonparametric density estimator allows for the fitting of a greater class of metadata densities. We compare our model with existing supervised topic models in terms of prediction and show that it is capable of discovering complex metadata distributions in both synthetic and real data. |
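The key modeling move is replacing Topics Over Time's per-topic beta density over timestamps with a Dirichlet process mixture of normals. As a rough stand-in for the paper's own estimator, scikit-learn's truncated DP mixture can fit such a density for the timestamps assigned to one topic; the data below is a placeholder:

<code python>
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Placeholder: timestamps of tokens currently assigned to one topic,
# scaled to [0, 1]. In the real model this feeds back into sampling.
timestamps = np.random.rand(500, 1)

dpmm = BayesianGaussianMixture(
    n_components=10,  # truncation level of the Dirichlet process
    weight_concentration_prior_type="dirichlet_process",
).fit(timestamps)

# log p(timestamp | topic); in a supervised topic model this term
# weights topic assignments by how well they explain the metadata.
log_density = dpmm.score_samples(timestamps)
</code>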
  
  
  
^ [[http://flosshub.org/sites/flosshub.org/files/MacLean2011a.pdf|Knowledge Homogeneity and Specialization in the Apache HTTP Server Project]] ^^
| {{media:nlp:120px-apache-knowledge.png}} | Alexander MacLean; Landon Pratt; Charles Knutson; Eric Ringger |
| ::: | ''' Proceedings of the 7th International Conference on Open Source Systems (OSS 2011) ''' |
| ::: | We present an analysis of developer communication in the Apache HTTP Server project. Using topic modeling techniques, we expose latent sub-communities arising from developer specialization within the greater developer population. |
  
  
^ [[http://sequoia.cs.byu.edu/lab/files/pubs/Pratt2011.pdf|Cliff Walls: An Analysis of Monolithic Commits Using Latent Dirichlet Allocation]] ^^
| {{media:nlp:120px-comment-analysis-lda.png}} | Landon Pratt; Alexander MacLean; Charles Knutson; Eric Ringger |
| ::: | ''' Proceedings of the 7th International Conference on Open Source Systems (OSS 2011) ''' |
| ::: | Large commits, which we refer to as "Cliff Walls", are a significant challenge to studies of software evolution because they do not appear to represent incremental development. We used Latent Dirichlet Allocation to extract topics from over 2 million commit log messages, taken from 10,000 SourceForge projects. The topics generated through this method were then analyzed to determine the causes of over 9,000 of the largest commits. We found that branch merges, code imports, and auto-generated documentation were significant causes of large commits. |
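At a much smaller scale, the pipeline resembles standard LDA over short documents. A toy version with gensim, with three hand-written messages standing in for the 2 million real commit logs:

<code python>
from gensim import corpora, models

# Toy stand-in for the study's corpus of commit log messages.
commit_logs = [
    "merge branch release 2.0 into trunk",
    "import initial code base from cvs",
    "regenerate api documentation for release",
]
texts = [msg.lower().split() for msg in commit_logs]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Fit LDA and inspect the discovered topics.
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
</code>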
  
  
^ [[http://cseweb.ucsd.edu/~lvdmaaten/workshops/nips2010/papers/gardner.pdf|The Topic Browser: An Interactive Tool for Browsing Topic Models]] ^^
| {{media:nlp:120px-topic-browser.png}} | Matthew Gardner; Joshua Lutes; Jeff Lund; Josh Hansen; Dan Walker; Eric Ringger; Kevin Seppi |
| ::: | ''' Proceedings of the Workshop on Challenges of Data Visualization (NIPS 2010) ''' |
| ::: | We present the Topical Guide (formerly "the Topic Browser"), an interactive tool that incorporates both prior work in displaying topic models and some novel ideas that greatly enhance the visualization of these models. The Topical Guide is a general tool for browsing the entire output of a topic model along with the analyzed corpus. With expert interaction, the Topical Guide together with the underlying topic models can provide valuable insights into a given corpus. |
  
  
^ [[http://nlp.cs.byu.edu/~dan/papers/emnlp_2010.pdf|Evaluating Models of Latent Document Semantics in the Presence of OCR Errors]] ^^
| {{media:nlp:120px-noisyocr-lds.png}} | Dan Walker; Bill Lund; Eric Ringger |
| ::: | ''' EMNLP 2010 ''' |
| ::: | We show the effects of OCR errors both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA), on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but in the case of LDA they exhibit failure trends similar to models trained on unprocessed OCR output. |
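The alleviation technique named in the abstract, filtering low-frequency words (many of which are OCR-garbled strings), is simple to state in code; the cutoff below is an arbitrary illustration:

<code python>
from collections import Counter

# Drop words whose corpus-wide frequency falls below a cutoff before
# fitting a topic model; rare OCR-garbled tokens are mostly removed.
def filter_low_frequency(docs, min_count=5):
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]
</code>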
  
  
^ [[http://contentdm.lib.byu.edu/cdm/singleitem/collection/ETD/id/1964/rec/1|Bisecting Document Clustering Using Model-Based Methods]] ^^
| {{media:nlp:140px-bisecting-clustering.png}} | Aaron Davis |
| ::: | ''' Master's Thesis. Advised by Eric Ringger. ''' |
| ::: | We use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human-annotated document classes when compared to the existing base clustering algorithms, with marked improvement on two out of three data sets. |
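The bisecting skeleton is independent of the base clusterer: repeatedly pick the largest cluster and split it in two. In the sketch below, KMeans stands in for the thesis's model-based two-way split (EM or Gibbs sampling on a mixture of multinomials):

<code python>
import numpy as np
from sklearn.cluster import KMeans

def bisecting_cluster(X, n_clusters):
    """Split the largest cluster in two until n_clusters remain.
    KMeans is an illustrative stand-in for a model-based 2-way split."""
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < n_clusters:
        # Pick the largest current cluster and bisect it.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters
</code>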
  
  
^ [[http://faculty.cs.byu.edu/~ringger/CS601R/papers/WalkerRingger-Gibbs-kdd2008.pdf|Model-Based Document Clustering with a Collapsed Gibbs Sampler]] ^^
| {{media:nlp:120px-document-clustering.png}} | Daniel Walker; Eric Ringger |
| ::: | ''' In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD 2008) ''' |
| ::: | We explore the convergence rate, the possibility of label switching, and chain summarization methodologies for document clustering with a mixture of multinomials model using Gibbs sampling, and show that fairly simple methods can be employed while still producing clusterings of superior quality compared to those produced with the EM algorithm. We shed further light on the effective use of Gibbs sampling for document clustering. |
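For concreteness, here is a compact collapsed Gibbs sampler for a mixture of multinomials, with the mixture weights and per-cluster word distributions integrated out. This is a sketch of the general technique, not the paper's code:

<code python>
import numpy as np
from scipy.special import gammaln

def collapsed_gibbs(doc_word, K, alpha=1.0, beta=0.01, n_sweeps=100, seed=0):
    """doc_word: (D, V) matrix of word counts per document.
    Returns one cluster label per document."""
    rng = np.random.default_rng(seed)
    D, V = doc_word.shape
    z = rng.integers(K, size=D)
    n_k = np.bincount(z, minlength=K).astype(float)  # docs per cluster
    n_kw = np.zeros((K, V))                          # word counts per cluster
    for d in range(D):
        n_kw[z[d]] += doc_word[d]
    for _ in range(n_sweeps):
        for d in range(D):
            # Remove document d from its current cluster.
            n_k[z[d]] -= 1
            n_kw[z[d]] -= doc_word[d]
            # log p(z_d = k | rest): collapsed prior term plus
            # Dirichlet-multinomial predictive likelihood of d's counts.
            logp = np.log(n_k + alpha)
            totals = n_kw.sum(axis=1)
            logp += gammaln(totals + V * beta) \
                  - gammaln(totals + doc_word[d].sum() + V * beta)
            logp += (gammaln(n_kw + doc_word[d] + beta)
                     - gammaln(n_kw + beta)).sum(axis=1)
            p = np.exp(logp - logp.max())
            z[d] = rng.choice(K, p=p / p.sum())
            # Add document d back under its new label.
            n_k[z[d]] += 1
            n_kw[z[d]] += doc_word[d]
    return z
</code>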
  
  
^ [[http://synapse.cs.byu.edu/papers/drv.icsc2008.pdf|Sentiment Regression: Using Real-Valued Scores to Summarize Overall Document Sentiment]] ^^
| {{media:nlp:150px-sentiment-regression.png}} | Adam Drake; Eric Ringger; Dan Ventura |
| ::: | ''' In Proceedings of the Second IEEE International Conference on Semantic Computing (ICSC 2008) ''' |
| ::: | We consider a sentiment regression problem: summarizing the overall sentiment of a review with a real-valued score. Empirical results on a set of labeled reviews show that real-valued sentiment modeling is feasible, and several algorithms improve upon baseline performance. We also analyze performance as the granularity of the classification problem moves from two-class (positive vs. negative) towards infinite-class (real-valued). |
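A minimal regression baseline in this spirit, assuming tf-idf features and ridge regression; the paper evaluates several algorithms, so this specific pipeline is only an illustration:

<code python>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Placeholder reviews paired with real-valued scores.
reviews = ["great product, loved it",
           "terrible, broke in a day",
           "okay, does the job"]
scores = [4.5, 1.0, 3.0]

# Bag-of-words features feeding a linear regressor.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(reviews, scores)
print(model.predict(["loved it, great value"]))
</code>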