
Punctuation: a universal complement to the mathematical perfection of language

Before getting to the research into mathematics and punctuation, I’m setting the scene with snippets from a February 13, 2023 online article by Dan Falk for Aperio magazine, which seems to function both as a magazine and an advertisement for postdoctoral work in Israel funded by the Azrieli Foundation,

Four centuries ago, Galileo famously described the physical world as a realm that was rooted in mathematics. The universe, he wrote, “cannot be read until we have learnt the language and become familiar with the characters in which it is written. It is written in mathematical language, and the letters are triangles, circles and other geometrical figures, without which means it is humanly impossible to comprehend a single word.”

Since Galileo’s time, scientists and philosophers have continued to ponder the question of why mathematics is so shockingly effective at describing physical phenomena. No one would deny that this is a deep question, but for philosopher Balthasar Grabmayr, an Azrieli International Postdoctoral Fellow at the University of Haifa, even deeper questions lie beneath it. Why does mathematics work at all? Does mathematics have limits? And if it does, what can we say about those limits?

Grabmayr found his way to this field from a very different passion: music. Growing up in Vienna, he attended a music conservatory and was set on becoming a classical musician. Eventually, he began to think about what made music work, and then began to think about musical structure. “I started to realize that, actually, what I’m interested in — what I found so attractive in music — is basically mathematics,” he recalls. “Mathematics is the science of structure. I was completely captured by that.”

One of Grabmayr’s main areas of research involves Gödel coding, a technique that, roughly put, allows mathematics to study itself. Gödel coding lets you convert statements about a system of rules or axioms into statements within the original system.
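For anyone curious what "coding" means here, this is a minimal sketch of one classic textbook scheme (prime-power encoding), not necessarily the variant Grabmayr studies. The symbol table `SYMBOLS` and the function names are my own hypothetical choices for illustration. The idea is that a string of symbols becomes a single whole number, so statements about strings can be rephrased as statements about numbers.

```python
# Hypothetical symbol table for a toy arithmetic language (illustration only).
SYMBOLS = {"0": 1, "S": 2, "+": 3, "=": 4}


def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        # candidate is prime if no smaller prime divides it
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def godel_encode(formula):
    """Encode a string as the product of prime_i ** code(symbol_i)."""
    codes = [SYMBOLS[ch] for ch in formula]
    number = 1
    for prime, code in zip(first_primes(len(codes)), codes):
        number *= prime ** code
    return number


def godel_decode(number):
    """Recover the string by reading off the exponent of each prime in turn."""
    inverse = {code: symbol for symbol, code in SYMBOLS.items()}
    symbols = []
    candidate = 2
    while number > 1:
        exponent = 0
        while number % candidate == 0:
            number //= candidate
            exponent += 1
        if exponent:  # only primes divide here, since smaller factors are gone
            symbols.append(inverse[exponent])
        candidate += 1
    return "".join(symbols)


print(godel_encode("0=0"))  # 2**1 * 3**4 * 5**1 = 810
print(godel_decode(810))    # "0=0"
```

Because every whole number factors into primes in exactly one way, the encoding can always be undone, which is what makes the number a faithful stand-in for the formula.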

Gödel coding is named for the Austrian logician Kurt Gödel, who in the 1930s developed his famous “incompleteness theorems,” which point to the inherent limitations of mathematics. Although expressed as an equation, Gödel’s proof was based on the idea that a sentence such as “This statement is unprovable” is both true and unprovable. As Rebecca Goldstein’s biography of Gödel declares, he “demonstrated that in every formal system of arithmetic there are true statements that nevertheless cannot be proved. The result was an upheaval that spread far beyond mathematics, challenging conceptions of the nature of the mind.”

Grabmayr’s work builds on the program that Gödel began nearly a century ago. “What I’m really interested in is what the limitations of mathematics are,” he says. “What are the limits of what we can prove? What are the limits of what we can express in formal languages? And what are the limits of what we can calculate using computers?” (That last remark shows that Gödel coding is of interest well beyond the philosophy of mathematics. “We’re surrounded by it,” says Grabmayr. “I mean, without Gödel coding there wouldn’t be any computers.”)

Another potential application is in cognitive science and the study of the mind. Psychologists and other scientists have long debated to what extent the mind is, or is not, like a computer. When we “think,” are we manipulating symbols the way a computer does? The jury is still out on that question, but Grabmayr believes his work can at least point toward some answers. “Cognitive science is based on the premise that we can use computational models to capture certain phenomena of the brain,” he says. “Artificial intelligence, also, is very much concerned with trying to formally capture our reasoning, our thinking processes.”

Albert Visser, a philosopher and logician at Utrecht University in the Netherlands and one of Grabmayr’s PhD supervisors, sees a number of potential payoffs for this research. “Balthasar’s work has some overspill to computer science and linguistics, since it involves a systematic reflection both on coding and on the nature of syntax,” he says. “The discussion of ideas from computer science and linguistics in Balthasar’s work is also beneficial in the other direction.” [emphases mine]

Now for the research into punctuation in European languages. From an April 19, 2023 Henryk Niewodniczanski Institute of Nuclear Physics Polish Academy of Sciences press release (also on EurekAlert but published April 20, 2023),

A moment’s hesitation… Yes, a full stop here – but shouldn’t there be a comma there? Or would a hyphen be better? Punctuation can be a nuisance; it is often simply neglected. Wrong! The most recent statistical analyses paint a different picture: punctuation seems to “grow out” of the foundations shared by all the (examined) languages, and its features are far from trivial.

To many, punctuation appears as a necessary evil, to be happily ignored whenever possible. Recent analyses of literature written in the world’s current major languages require us to alter this opinion. In fact, the same statistical features of punctuation usage patterns have been observed in several hundred works written in seven, mainly Western, languages. Punctuation, all ten representatives of which can be found in the introduction to this text, turns out to be a universal and indispensable complement to the mathematical perfection of every language studied. Such a remarkable conclusion about the role of mere commas, exclamation marks or full stops comes from an article by scientists from the Institute of Nuclear Physics of the Polish Academy of Sciences (IFJ PAN) in Cracow, published in the journal Chaos, Solitons & Fractals.

“The present analyses are an extension of our earlier results on the multifractal features of sentence length variation in works of world literature. After all, what is sentence length? It is nothing more than the distance to the next specific punctuation mark –  the full stop. So now we have taken all punctuation marks under a statistical magnifying glass, and we have also looked at what happens to punctuation during translation,” says Prof. Stanislaw Drozdz (IFJ PAN, Cracow University of Technology).
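To make the "distance to the next punctuation mark" idea concrete, here is a minimal sketch that counts the words in each run of text between consecutive marks. The set of marks in `MARKS` and the function name are my assumptions for illustration; the study's exact list of ten marks and its tokenisation rules may differ.

```python
import re

# Hypothetical set of punctuation marks; the researchers' list may differ.
MARKS = ".,;:!?()…"


def gaps_between_marks(text, marks=MARKS):
    """Return the number of words in each run of text between consecutive marks."""
    # A run of adjacent marks (e.g. "?!") is treated as a single boundary.
    pieces = re.split("[" + re.escape(marks) + "]+", text)
    counts = [len(piece.split()) for piece in pieces]
    # Drop empty pieces, such as the one after a final full stop.
    return [count for count in counts if count > 0]


sample = "Yes, a full stop here. But should there be a comma there? Maybe!"
print(gaps_between_marks(sample))             # [1, 4, 7, 1]
print(gaps_between_marks(sample, marks="."))  # sentence lengths only: [5, 8]
```

Passing `marks="."` recovers the special case from the quote, where sentence length is just the distance to the next full stop.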

Two sets of texts were studied. The main analyses concerning punctuation within each language were carried out on 240 highly popular literary works written in seven major Western languages: English (44), German (34), French (32), Italian (32), Spanish (32), Polish (34) and Russian (32). This particular selection of languages was based on a criterion: the researchers assumed that no fewer than 50 million people should speak the language in question, and that the works written in it should have been awarded no fewer than five Nobel Prizes for Literature. In addition, for the statistical validity of the research results, each book had to contain at least 1,500 word sequences separated by punctuation marks. A separate collection was prepared to observe the stability of punctuation in translation. It contained 14 works, each of which was available in each of the languages studied (two of the 98 language versions, however, were omitted due to their unavailability). In total, authors in both collections included such writers as Conrad, Dickens, Doyle, Hemingway, Kipling, Orwell, Salinger, Woolf, Grass, Kafka, Mann, Nietzsche, Goethe, La Fayette, Dumas, Hugo, Proust, Verne, Eco, Cervantes, Sienkiewicz and Reymont.

The attention of the Cracow researchers was primarily drawn to the statistical distribution of the distance between consecutive punctuation marks. It soon became evident that in all the languages studied, it was best described by one of the precisely defined variants of the Weibull distribution. A curve of this type has a characteristic shape: it grows rapidly at first and then, after reaching a maximum value, descends somewhat more slowly to a certain critical value, below which it reaches zero with small and constantly decreasing dynamics. The Weibull distribution is usually used to describe survival phenomena (e.g. population as a function of age), but also various physical processes, such as increasing fatigue of materials.
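The press release does not say which variant of the Weibull distribution was used. As a sketch only, here is the discrete Weibull form, which is my assumption for illustration since distances are counted in whole words. The parameter names `p` and `beta` and the values below are hypothetical, not fitted values from the paper.

```python
def discrete_weibull_pmf(k, p, beta):
    """Probability that the next mark arrives after exactly k words (k = 1, 2, ...).

    Assumed form: P(X = k) = (1 - p)**((k - 1)**beta) - (1 - p)**(k**beta).
    """
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)


# Illustrative parameters only (not taken from the study).
for k in range(1, 11):
    print(k, round(discrete_weibull_pmf(k, p=0.2, beta=1.3), 4))
```

With `beta = 1` this reduces to the ordinary geometric distribution, where a mark is equally likely after every word. Values of `beta` above 1 make a mark increasingly likely the longer the run of words gets.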

“The concordance of the distribution of word sequence lengths between punctuation marks with the functional form of the Weibull distribution was better the more types of punctuation marks we included in the analyses; for all marks the concordance turned out to be almost complete. At the same time, some differences in the distributions are apparent between the different languages, but these merely amount to the selection of slightly different values for the distribution parameters, specific to the language in question. Punctuation thus seems to be an integral part of all the languages studied,” notes Prof. Drozdz, only to add after a moment with some amusement: “…and since the Weibull distribution is concerned with phenomena such as survival, it can be said with not too much tongue-in-cheek that punctuation has in its nature a literally embedded struggle for survival.”

The next stage of the analyses consisted of determining the hazard function. In the case of punctuation, it describes how the conditional probability of success – i.e. the probability of the next punctuation mark – changes if no such mark has yet appeared in the analysed sequence. The results here are clear: the language characterised by the lowest propensity to use punctuation is English, with Spanish not far behind; Slavic languages proved to be the most punctuation-dependent. The hazard function curves for punctuation marks in six of the languages studied appeared to follow a similar pattern; they differed mainly in vertical shift.

German proved to be the exception. Its hazard function is the only one that intersects most of the curves constructed for the other languages. German punctuation thus seems to combine the punctuation features of many languages, making it a kind of Esperanto punctuation. The above observation dovetails with the next analysis, which was to see whether the punctuation features of original literary works can be seen in their translations. As expected, the language most faithfully transforming punctuation from the original language to the target language turned out to be German.
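The hazard function described above can be estimated directly from data. This is a minimal sketch, assuming you already have a list of word counts between consecutive marks; the function name and the sample numbers are hypothetical, and the researchers' actual estimation procedure may differ.

```python
from collections import Counter


def empirical_hazard(gaps):
    """Estimate h(k) = P(gap == k | gap >= k) for each observed gap length k."""
    counts = Counter(gaps)
    remaining = len(gaps)  # number of gaps with length >= the current k
    hazard = {}
    for k in sorted(counts):
        hazard[k] = counts[k] / remaining
        remaining -= counts[k]
    return hazard


# Made-up gap lengths, for illustration only.
print(empirical_hazard([1, 1, 2, 3, 3, 3]))
# h(1) = 2/6, h(2) = 1/4, h(3) = 3/3
```

A language whose curve sits higher on such a plot is one where, at any given point in a run of words, a punctuation mark is more likely to come next.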

In spoken communication, pauses can be justified by human physiology, such as the need to catch one’s breath or to take a moment to structure what is to be said next in one’s mind. And in written communication?

“Creating a sentence by adding one word after another while ensuring that the message is clear and unambiguous is a bit like tightening the string of a bow: it is easy at first, but becomes more demanding with each passing moment. If there are no ordering elements in the text (and this is the role of punctuation), the difficulty of interpretation increases as the string of words lengthens. A bow that is too tight can break, and a sentence that is too long can become unintelligible. Therefore, the author is faced with the necessity of ‘freeing the arrow’, i.e. closing a passage of text with some sort of punctuation mark. This observation applies to all the languages analysed, so we are dealing with what could be called a linguistic law,” states Dr Tomasz Stanisz (IFJ PAN), first author of the article in question.

Finally, it is worth noting that the invention of punctuation is relatively recent – punctuation marks did not occur at all in old texts. The emergence of optimal punctuation patterns in modern written languages can therefore be interpreted as the result of their evolutionary advancement. However, the excessive need for punctuation is not necessarily a sign of such sophistication. English and Spanish, contemporarily the most universal languages, appear, in the light of the above studies, to be less strict about the frequency of punctuation use. It is likely that these languages are so formalised in terms of sentence construction that there is less room for ambiguity that would need to be resolved with punctuation marks.

The Henryk Niewodniczański Institute of Nuclear Physics (IFJ PAN) is currently one of the largest research institutes of the Polish Academy of Sciences. A wide range of research carried out at IFJ PAN covers basic and applied studies, from particle physics and astrophysics, through hadron physics, high-, medium-, and low-energy nuclear physics, condensed matter physics (including materials engineering), to various applications of nuclear physics in interdisciplinary research, covering medical physics, dosimetry, radiation and environmental biology, environmental protection, and other related disciplines. The average yearly publication output of IFJ PAN includes over 600 scientific papers in high-impact international journals. Each year the Institute hosts about 20 international and national scientific conferences. One of the most important facilities of the Institute is the Cyclotron Centre Bronowice (CCB), which is an infrastructure unique in Central Europe, serving as a clinical and research centre in the field of medical and nuclear physics. In addition, IFJ PAN runs four accredited research and measurement laboratories. IFJ PAN is a member of the Marian Smoluchowski Kraków Research Consortium: “Matter-Energy-Future”, which in the years 2012-2017 enjoyed the status of the Leading National Research Centre (KNOW) in physics. In 2017, the European Commission granted the Institute the HR Excellence in Research award. As a result of the categorization of the Ministry of Education and Science, the Institute has been classified into the A+ category (the highest scientific category in Poland) in the field of physical sciences.

Here’s a link to and a citation for the paper,

Universal versus system-specific features of punctuation usage patterns in major Western languages by Tomasz Stanisz, Stanisław Drożdż, and Jarosław Kwapień. Chaos, Solitons & Fractals Volume 168, March 2023, 113183 DOI: https://doi.org/10.1016/j.chaos.2023.113183

This paper is behind a paywall but the publishers do offer a preview of sorts.

There is also an earlier, less polished, open access version on the preprint server arXiv,

Universal versus system-specific features of punctuation usage patterns in major Western languages by Tomasz Stanisz, Stanislaw Drozdz, Jaroslaw Kwapien. arXiv:2212.11182 [cs.CL] (or arXiv:2212.11182v1 [cs.CL] for this version) DOI: https://doi.org/10.48550/arXiv.2212.11182 Posted Wed, 21 Dec 2022 16:52:10 UTC (1,073 KB)

Evolution of literature as seen by a classicist, a biologist and a computer scientist

Studying intertextuality shows how books are related in various ways and are reorganized and recombined over time. Image courtesy of Elena Poiata.

I find the image more instructive when I read it from the bottom up. For those who prefer to read from the top down, there’s this April 5, 2017 University of Texas at Austin news release (also on EurekAlert),

A classicist, biologist and computer scientist all walk into a room — what comes next isn’t the punchline but a new method to analyze relationships among ancient Latin and Greek texts, developed in part by researchers from The University of Texas at Austin.

Their work, referred to as quantitative criticism, is highlighted in a study published in the Proceedings of the National Academy of Sciences. The paper identifies subtle literary patterns in order to map relationships between texts and more broadly to trace the cultural evolution of literature.

“As scholars of the humanities well know, literature is a system within which texts bear a multitude of relationships to one another. Understanding what is distinctive about one text entails knowing how it fits within that system,” said Pramit Chaudhuri, associate professor in the Department of Classics at UT Austin. “Our work seeks to harness the power of quantification and computation to describe those relationships at macro and micro levels not easily achieved by conventional reading alone.”

In the study, the researchers create literary profiles based on stylometric features, such as word usage, punctuation and sentence structure, and use techniques from machine learning to understand these complex datasets. Taking a computational approach enables the discovery of small but important characteristics that distinguish one work from another — a process that could require years using manual counting methods.
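As a rough idea of what a "literary profile" might look like in practice, here is a minimal sketch that turns a text into a handful of numbers. The four features and the function name are my own hypothetical picks for illustration; they are not the Quantitative Criticism Lab's feature set, which was designed for Latin and Greek.

```python
import re


def stylometric_profile(text):
    """Return a few simple style measurements for a text."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    n_words = len(words)
    return {
        "mean_sentence_length": n_words / len(sentences),
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "commas_per_word": text.count(",") / n_words,
        "type_token_ratio": len(set(words)) / n_words,
    }


print(stylometric_profile("I came, I saw, I conquered. Did you?"))
```

Profiles like this, computed for many texts, are the kind of dataset that machine learning techniques can then compare or cluster.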

“One aspect of the technical novelty of our work lies in the unusual types of literary features studied,” Chaudhuri said. “Much computational text analysis focuses on words, but there are many other important hallmarks of style, such as sound, rhythm and syntax.”

Another component of their work builds on Matthew Jockers’ literary “macroanalysis,” which uses machine learning to identify stylistic signatures of particular genres within a large body of English literature. Implementing related approaches, Chaudhuri and his colleagues have begun to trace the evolution of Latin prose style, providing new, quantitative evidence for the sweeping impact of writers such as Caesar and Livy on the subsequent development of Roman prose literature.

“There is a growing appreciation that culture evolves and that language can be studied as a cultural artifact, but there has been less research focused specifically on the cultural evolution of literature,” said the study’s lead author Joseph Dexter, a Ph.D. candidate in systems biology at Harvard University. “Working in the area of classics offers two advantages: the literary tradition is a long and influential one well served by digital resources, and classical scholarship maintains a strong interest in close linguistic study of literature.”

Unusually for a publication in a science journal, the paper contains several examples of the types of more speculative literary reading enabled by the quantitative methods introduced. The authors discuss the poetic use of rhyming sounds for emphasis and of particular vocabulary to evoke mood, among other literary features.

“Computation has long been employed for attribution and dating of literary works, problems that are unambiguous in scope and invite binary or numerical answers,” Dexter said. “The recent explosion of interest in the digital humanities, however, has led to the key insight that similar computational methods can be repurposed to address questions of literary significance and style, which are often more ambiguous and open ended. For our group, this humanist work of criticism is just as important as quantitative methods and data.”

The paper is the work of the Quantitative Criticism Lab (www.qcrit.org), co-directed by Chaudhuri and Dexter in collaboration with researchers from several other institutions. It is funded in part by a 2016 National Endowment for the Humanities grant and the Andrew W. Mellon Foundation New Directions Fellowship, awarded in 2016 to Chaudhuri to further his education in statistics and biology. Chaudhuri was one of 12 scholars selected for the award, which provides humanities researchers the opportunity to train outside of their own area of special interest with a larger goal of bridging the humanities and social sciences.

Here’s another link to the paper along with a citation,

Quantitative criticism of literary relationships by Joseph P. Dexter, Theodore Katz, Nilesh Tripuraneni, Tathagata Dasgupta, Ajay Kannan, James A. Brofos, Jorge A. Bonilla Lopez, Lea A. Schroeder, Adriana Casarez, Maxim Rabinovich, Ayelet Haimson Lushkov, and Pramit Chaudhuri. PNAS Published online before print April 3, 2017, doi: 10.1073/pnas.1611910114

This paper appears to be open access.