Novel topic n-gram count LM incorporating document-based topic distributions and n-gram counts. Haidar, M. A. & O'Shaughnessy, D. In 2014 22nd European Signal Processing Conference (EUSIPCO), pages 2310-2314, Sep. 2014.
In this paper, we introduce a novel topic n-gram count language model (NTNCLM) that uses the topic probabilities of training documents and document-based n-gram counts. The topic probability distribution of each document is computed by averaging the topic probabilities of the words observed in that document. The document's topic probabilities are then multiplied by its n-gram counts, and the products are summed over all training documents; the resulting sums serve as the topic-dependent n-gram counts from which the NTNCLMs are built. The NTNCLMs are adapted using the topic probabilities of a development test set, computed in the same way. We compare our approach with a recently proposed TNCLM [1], which does not capture long-range information outside the n-gram events. On the Wall Street Journal (WSJ) corpus, our approach yields significant perplexity and word error rate (WER) reductions over the TNCLM.
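The count accumulation described in the abstract can be sketched compactly. The following is a minimal Python illustration, not the authors' implementation: it assumes per-word topic probabilities are already available from some topic model (the `word_topic_probs` input and the function name are hypothetical), and it produces fractional per-topic n-gram counts by weighting each document's n-gram counts with that document's averaged topic distribution and summing over documents.

```python
from collections import Counter, defaultdict

def topic_ngram_counts(documents, word_topic_probs, num_topics, n=3):
    """Accumulate per-topic n-gram counts (illustrative sketch).

    documents: list of token lists, one per training document.
    word_topic_probs: dict word -> list of P(topic | word); assumed to be
        precomputed by a topic model (hypothetical input).
    Returns: dict topic_id -> Counter of fractional n-gram counts.
    """
    topic_counts = defaultdict(Counter)
    for doc in documents:
        # Document topic distribution: average of the topic
        # probabilities of the words observed in the document.
        doc_topics = [0.0] * num_topics
        for w in doc:
            probs = word_topic_probs.get(w, [1.0 / num_topics] * num_topics)
            for k in range(num_topics):
                doc_topics[k] += probs[k]
        doc_topics = [p / max(len(doc), 1) for p in doc_topics]

        # Document-based n-gram counts.
        ngrams = Counter(tuple(doc[i:i + n]) for i in range(len(doc) - n + 1))

        # Weight the document's n-gram counts by its topic probabilities
        # and sum over all documents to obtain the per-topic counts.
        for k in range(num_topics):
            for gram, c in ngrams.items():
                topic_counts[k][gram] += doc_topics[k] * c
    return topic_counts
```

Under this reading, each topic's accumulated counts would then be used to estimate a topic-specific n-gram LM, with adaptation weights taken from the topic probabilities of the development set computed the same way; smoothing and mixture details follow the paper, not this sketch.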
