Multilingual Topic Models

Krstovski, Kriste; Kurtz, Michael J.; Smith, David A.; Accomazzi, Alberto

arXiv:1712.06704 (stat)

[Submitted on 18 Dec 2017]

Abstract: Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. Various document representation schemes possess different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview on latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides means to discover where different concept vocabularies require improvement.

Comments:	18 pages, 9 figures
Subjects:	Machine Learning (stat.ML); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:1712.06704 [stat.ML]
	(or arXiv:1712.06704v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1712.06704

Submission history

From: Kriste Krstovski
[v1] Mon, 18 Dec 2017 22:45:20 UTC (2,924 KB)
