Studying the Difference Between Natural and Programming Language Corpora

Casalnuovo, Casey; Sagae, Kenji; Devanbu, Prem


arXiv:1806.02437 (cs)

[Submitted on 6 Jun 2018]


Abstract: Code corpora, as observed in large software systems, are now known to be far more repetitive and predictable than natural language corpora. But why? Does the difference simply arise from the syntactic limitations of programming languages? Or does it arise from the differences in authoring decisions made by the writers of these natural and programming language texts? We conjecture that the differences arise not entirely from syntax, but also from the fact that reading and writing code is un-natural for humans and requires substantial mental effort, so people prefer to write code in ways that are familiar to both reader and writer. To support this argument, we present results from two sets of studies: 1) a first set aimed at attenuating the effects of syntax, and 2) a second set aimed at measuring the repetitiveness of text written in other settings (e.g., second language, technical/specialized jargon), which are also effortful to write. We find that this repetition in source code is not entirely the result of grammar constraints, and thus some repetition must result from human choice. While the evidence we find of similar repetitive behavior in technical and learner corpora does not conclusively show that such language is used by humans to mitigate difficulty, it is consistent with that theory.

Comments:	Preprint
Subjects:	Computation and Language (cs.CL)
MSC classes:	68N15, 68T50
Cite as:	arXiv:1806.02437 [cs.CL]
	(or arXiv:1806.02437v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1806.02437

Submission history

From: Casey Casalnuovo
[v1] Wed, 6 Jun 2018 22:00:32 UTC (4,462 KB)
