A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets

Samardzic, Tanja; Gutierrez, Ximena; Bentz, Christian; Moran, Steven; Pelloni, Olga

arXiv:2403.03909 (cs)

[Submitted on 6 Mar 2024 (v1), last revised 16 Apr 2024 (this version, v2)]

Abstract: Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP. Linguistic diversity of these data sets is typically measured as the number of languages or language families included in the sample, but such measures do not consider structural properties of the included languages. In this paper, we propose assessing linguistic diversity of a data set against a reference language sample as a means of maximising linguistic diversity in the long run. We represent languages as sets of features and apply a version of the Jaccard index suitable for comparing sets of measures. In addition to the features extracted from typological data bases, we propose an automatic text-based measure, which can be used as a means of overcoming the well-known problem of data sparsity in manually collected features. Our diversity score is interpretable in terms of linguistic features and can identify the types of languages that are not represented in a data set. Using our method, we analyse a range of popular multilingual data sets (UD, Bible100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, XQuAD). In addition to ranking these data sets, we find, for example, that (poly)synthetic languages are missing in almost all of them.

Comments:	Accepted to NAACL 2024 Findings
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.03909 [cs.CL]
	(or arXiv:2403.03909v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.03909

Submission history

From: Tanja Samardzic
[v1] Wed, 6 Mar 2024 18:14:22 UTC (879 KB)
[v2] Tue, 16 Apr 2024 10:00:41 UTC (883 KB)
