Tortured phrases: What they are, how they are detected, and how to avoid them
For anyone who routinely carries out transactions online, CAPTCHA codes are nothing new and present little difficulty. No matter how strange or distorted the numerals or characters that make up the code may look on the screen, the brain rarely makes mistakes. Yet this seemingly simple task can defeat even highly advanced artificial intelligence (AI). After all, CAPTCHA is a somewhat laboured acronym for “completely automated public Turing test to tell computers and humans apart,” and it is an effective test indeed. (The Turing test is named after Alan Turing, the English mathematician who is regarded by many as the father of theoretical computer science and AI.)
Why is the task easy for people but difficult for machines? It is easy for us because we know the essential features of a given letter or numeral and, so long as those features are present, we identify it quickly and confidently (for instance, a capital “A” is essentially a triangle with its base missing, whereas a “V” is the same thing upside down). A machine has no such built-in sense of which features matter and which distortions can be ignored.
Now, if recognizing characters is a challenge, consider the difficulty in interpreting the meaning of even a simple word, because part of that meaning, more often than not, resides in adjacent words or context. Take the famous quip often attributed to Groucho Marx: “Time flies like an arrow; fruit flies like a banana.” To “get” it, you should know that “flies” is a verb in the first part and a noun in the second, and that “like” is a preposition in the first part but a verb in the second. You may have realized that many words in a language have more than one meaning: a plant can mean a green plant or it can mean a factory; a coach can mean an instructor or a bus in which many people travel together. When you type the word “bank” in the search box, you may have in mind the financial institution. But searches are carried out by programs, which may also interpret the word as meaning the edge of a river and include results for the word in this sense as well.
Given the pressure on researchers to publish, and to do so in a foreign language, many researchers whose first language is not English may write the paper in their first language and then use a software program to translate it into English. Yet the introduction to this post will have convinced you that the task is far from easy. The difficulty arises mainly for two reasons: (1) Words in the original language also have multiple meanings, and a computer may not always choose corresponding words with the intended meaning. (2) Words also change their meaning depending on the word or words they have been paired with. Some pairs are appropriate and some are not, and computers are not always smart enough to know the difference. Take a few simple pairs of words: “artificial intelligence,” “big data,” and “random value.” Simple enough, right? But what if they are rendered as “counterfeit consciousness,” “colossal information,” and “irregular esteem”? These pairs virtually never occur in idiomatic English.
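To make the second reason concrete, here is a minimal sketch of how word-by-word substitution can turn an established term into a tortured phrase. The tiny synonym table and the function name `naive_paraphrase` are assumptions made purely for illustration; real translation and paraphrasing tools are far more elaborate, and this is not a description of how any particular tool works.

```python
# Hypothetical one-word "thesaurus". Each entry is a plausible synonym in
# SOME context, which is exactly why blind substitution goes wrong.
NAIVE_SYNONYMS = {
    "artificial": "counterfeit",
    "intelligence": "consciousness",
    "big": "colossal",
    "data": "information",
    "random": "irregular",
    "value": "esteem",
}


def naive_paraphrase(text: str) -> str:
    """Replace each word independently, ignoring the words around it."""
    words = text.split()
    return " ".join(NAIVE_SYNONYMS.get(word.lower(), word) for word in words)


for term in ("artificial intelligence", "big data", "random value"):
    print(f"{term!r} -> {naive_paraphrase(term)!r}")
```

Because each word is looked up on its own, the fixed pairing that gives “artificial intelligence” its meaning is lost, even though every individual substitution looks defensible.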
In fact, these pairs are so odd and rare that they and similar weird phrases aroused the curiosity of some computer scientists, who encountered them in the pages of Microprocessors and Microsystems and a few other journals.1 Further probing showed that the phrases were probably the outcome of automated translation or paraphrasing, used not so much to switch between languages as to thwart automated detection of plagiarism. Plagiarism-detection software is yet another hurdle for those writing up their research in a language that is not their first, usually English. This “translation plagiarism” seeks to lower the “similarity percentage,” and for that, the researcher-authors must paraphrase: they must avoid the phrases used in a paper they have referred to and replace them with different words that convey the same meaning, a task even more difficult if your grasp of English is weak. Enter tortured phrases, or weird English phrases.
The computer scientists continued their examination of such weird phrases, which they labelled “tortured phrases,” and found those mentioned above, and many more, to be concentrated in about 500 papers, many of them published in special issues of the journal mentioned above. To facilitate their search, the team, led by Cabanac, Labbé, and Magazinov, developed the Problematic Paper Screener,2 a software tool (talk of setting a thief to catch a thief) for tracking down papers that contain tortured phrases.
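The basic idea of screening text against a list of known tortured phrases can be sketched as follows. This is only an illustration under stated assumptions, not the actual implementation of the Problematic Paper Screener: the dictionary `KNOWN_TORTURED_PHRASES` and the function `find_tortured_phrases` are hypothetical names, and the four entries are simply the examples quoted in this post and its references.

```python
import re

# Tortured phrase -> the established term it probably stands for.
# The entries come from this post; a real list would be far longer.
KNOWN_TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "colossal information": "big data",
    "irregular esteem": "random value",
    "bosom peril": "breast cancer",
}


def find_tortured_phrases(text: str) -> list:
    """Return (tortured phrase, likely intended term) pairs found in text."""
    # Lower-case and collapse whitespace so line breaks do not hide a phrase.
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = []
    for phrase, intended in KNOWN_TORTURED_PHRASES.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            hits.append((phrase, intended))
    return hits


sample = "We apply counterfeit consciousness to colossal\ninformation sets."
for phrase, intended in find_tortured_phrases(sample):
    print(f"{phrase!r} probably stands for {intended!r}")
```

A fixed list like this can only flag phrases someone has already spotted, so in practice such a list would need to keep growing as new tortured phrases are reported.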
As the pressure to publish mounts, such translation plagiarism is likely to be used even more widely: by January 2022, Cabanac, Labbé, and Magazinov had found nearly 3,200 papers containing tortured phrases or weird English phrases even in reputable and peer-reviewed journals.3
Technology has come a long way. More than 50 years ago, Philip Broughton published the “Systematic Buzz Phrase Projector.”4 This was simply a table comprising 30 buzzwords arranged in three columns and ten rows, with the cells in each column numbered from 0 at the top to 9 at the bottom. To generate an impressive-sounding but meaningless phrase, all you had to do was think of a 3-digit number and then build a 3-word phrase by taking, from each column in turn, the word that corresponded to each digit. Then came programs that could generate entire research papers that were devoid of any authentic content but could pass for regular papers. Some of them were even published.
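The table lookup described above is simple enough to sketch in a few lines. The function name `buzz_phrase` is hypothetical, and the three word lists are illustrative placeholders that may not match Broughton's original table exactly; only the mechanism (one digit per column) is taken from the description above.

```python
# Three columns of ten buzzwords each, indexed 0-9 from top to bottom.
# NOTE: illustrative word lists; they may differ from the original table.
COLUMN_1 = ["integrated", "total", "systematized", "parallel", "functional",
            "responsive", "optional", "synchronized", "compatible", "balanced"]
COLUMN_2 = ["management", "organizational", "monitored", "reciprocal",
            "digital", "logistical", "transitional", "incremental",
            "third-generation", "policy"]
COLUMN_3 = ["options", "flexibility", "capability", "mobility", "programming",
            "concept", "time-phase", "projection", "hardware", "contingency"]


def buzz_phrase(number: int) -> str:
    """Build a 3-word phrase from a 3-digit number, one digit per column."""
    if not 0 <= number <= 999:
        raise ValueError("number must be between 0 and 999")
    digits = f"{number:03d}"  # zero-pad so that 7 is read as 0, 0, 7
    columns = (COLUMN_1, COLUMN_2, COLUMN_3)
    return " ".join(column[int(d)] for column, d in zip(columns, digits))


print(buzz_phrase(257))
```

With these placeholder lists, 257 selects row 2 of the first column, row 5 of the second, and row 7 of the third, giving “systematized logistical projection.”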
Researchers desperate to add more papers to their CV or résumé as quickly as possible may resort to such means, but that is not a solution: such papers can be traced, and their authors may not only be forced to retract the publications but also lose any benefits won on the strength of those publications. The practice is also highly unethical and weakens the very foundation of academic publishing, which is trust. Generating entire papers using computer programs may be rare, but using AI-based tools to lower the “similarity percentage” is becoming increasingly common, and increasingly easy to detect, again by using AI together with lists of known tortured phrases.
The arms race between software packages that generate tortured phrases or entire papers and software packages that detect the use of such packages is likely to continue so long as researchers are judged solely by their published output. The responsibility to dampen that race thus falls not only upon those who evaluate researchers but also upon researchers themselves: while the former should devise better means to evaluate researchers, the latter should shun the use of such devious means and either strive to write better or take the help of genuine editing or translation services.
1 Else H. 2021. “Tortured phrases” give away fabricated research papers. Nature 596: 328–329 [https://www.nature.com/articles/d41586-021-02134-0]
2 Problematic Paper Screener [https://www.irit.fr/~Guillaume.Cabanac/problematic-paper-screener]
3 Cabanac G, Labbé C, and Magazinov A. 2022. “Bosom peril” is not “breast cancer”: How weird computer-generated phrases help researchers find scientific publishing fraud [https://thebulletin.org/2022/01/bosom-peril-is-not-breast-cancer-how-weird-computer-generated-phrases-help-researchers-find-scientific-publishing-fraud/]
4 Broughton P. 1968. How to win at wordsmanship: the systematic buzz phrase projector. Newsweek (8 May): 104 [https://www.gsrc.ca/buzzword.htm]