NoRaRe: A Multilingual Database of Word and Concept Properties

August 09, 2021

A new study in Behavior Research Methods by a team of researchers in Germany presents the Database of Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts (NoRaRe), an openly curated resource for interdisciplinary studies with data from psychology and linguistics.

Data collection is becoming easier and more frequent in all areas of research. For studies that investigate different aspects of language, whether in psychology or linguistics, large amounts of data on the properties of words are particularly important. The problem, oftentimes, is finding and making use of the data once it has been collected.

Psychologists conduct studies that collect the frequency of words in everyday language, the emotional connotations of words, as well as other properties such as age of acquisition. Linguists, on the other hand, evaluate the relationships between words across languages to reconstruct their history. But, so far, the vast amounts of data that exist in both fields are incompatible, hindering interdisciplinary study and often limiting results to a single language. A new paper in Behavior Research Methods by a team of researchers from the Max Planck Institutes for the Science of Human History (Jena, Germany) and Evolutionary Anthropology (Leipzig, Germany) presents a new database that makes data from psychology and linguistics accessible and comparable, creating a framework for reproducible data analyses and interdisciplinary studies.

By combining data from psychology and linguistics, the Database of Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts (NoRaRe) hopes to advance research in both fields by allowing the comparison of word and concept properties across languages. With these capabilities, NoRaRe enables a deeper understanding of language and offers researchers the possibility of answering new questions.

Using NoRaRe

NoRaRe is the first online resource to provide standardized data to make cross-linguistic comparisons possible and reproducible. The properties offered in the database range from automatically collected variables such as word frequencies (norms) to psycholinguistic studies with human participants (ratings) to comparative data from within or across languages (relations). This allows researchers either to compare the same variable (e.g., age-of-acquisition ratings) across languages or to combine different variables to investigate their specific research question. The database is designed to be extended with new data sets and is carefully curated, with automatic tests for consistency.

The database currently includes small word lists from underrepresented languages, as well as large-scale studies with several thousand items for languages such as English, German, Dutch, Italian, Russian and Chinese. In addition, NoRaRe is openly curated on GitHub (https://github.com/concepticon/norare-data) and accessible through a web interface (https://digling.org/norare/), making it possible for researchers to expand the database by contributing their own word lists.

A solid base for a database

“In the NoRaRe project, we could build on several years of experience in collaborative coding and data curation,” says Robert Forkel, chief programmer of the study.

The NoRaRe database expands on the Concepticon project (https://concepticon.clld.org), a database of detailed information on the concepts for which scholars provide translations in linguistic fieldwork.

“While Concepticon provides detailed information about the datasets in which concepts are translated into different languages of the world, NoRaRe adds a new dimension by providing information on the properties of concepts and words,” says Johann-Mattis List, senior author of the study.

A versatile resource with room to grow

To test the capabilities of the new database, a recent study investigated whether the frequency of words in related languages is more similar than in unrelated languages. The study used three word lists with frequency data for English, German and Chinese from the NoRaRe database. The results showed that English and German have more words in common with similar frequency than either language has with Chinese. But there are many more questions that could be answered with the data in NoRaRe. For example, do children who speak different languages learn words for the same concepts at the same age? Do word frequencies explain the stability of word meanings over time?
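As a rough illustration of this kind of comparison, the sketch below correlates frequency values for shared concepts across pairs of languages. It does not reproduce the study's actual method, which is not described here. The Spearman rank correlation is one plausible measure chosen for the example, the function names are hypothetical, and the numbers are invented placeholders rather than values from NoRaRe.

```python
from itertools import combinations

# Invented placeholder scores, NOT taken from NoRaRe. Real word lists
# would be loaded from the database instead.
TOY_FREQUENCIES = {
    "English": {"HAND": 5.1, "WATER": 5.0, "STONE": 4.2, "MOON": 4.0, "LOUSE": 2.1},
    "German": {"HAND": 5.0, "WATER": 4.9, "STONE": 4.3, "MOON": 3.9, "LOUSE": 2.0},
    "Chinese": {"HAND": 4.8, "WATER": 5.2, "STONE": 3.7, "MOON": 4.4, "LOUSE": 1.9},
}


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = shared_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / (var_x * var_y) ** 0.5


def spearman(freq_a, freq_b):
    """Spearman rank correlation over the concepts both lists contain."""
    shared = sorted(set(freq_a) & set(freq_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared concepts")
    ranks_a = average_ranks([freq_a[c] for c in shared])
    ranks_b = average_ranks([freq_b[c] for c in shared])
    return pearson(ranks_a, ranks_b)


def compare_all(frequencies):
    """Correlate every pair of languages in the given mapping."""
    return {
        (a, b): spearman(frequencies[a], frequencies[b])
        for a, b in combinations(frequencies, 2)
    }


if __name__ == "__main__":
    for (lang_a, lang_b), rho in compare_all(TOY_FREQUENCIES).items():
        print(f"{lang_a} vs. {lang_b}: rho = {rho:.2f}")
```

Restricting the comparison to concepts that appear in both lists matters in practice, because word lists for different languages rarely cover exactly the same set of concepts.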

“If researchers find that they cannot answer their question with the data already present in NoRaRe, it is easy to add new word lists,” says Annika Tjuka, lead author of the current paper. “They can either contribute the data themselves or point us to an existing list so that we can add it. The GitHub platform allows them to post improvements and recommendations directly.”

Although not every variable is currently available for all languages, NoRaRe gives researchers the opportunity to identify these gaps and to fill them by contributing new data.

“We invite researchers to use the database for their interdisciplinary studies and to collaborate with us to make the database as inclusive as possible,” says Tjuka.