Language plays an increasingly important role in search. Revealing the similarities and differences between languages through network science might seem daunting, but researchers at the Tokyo University of Science have found a way to make it work.
Researchers use a word co-occurrence network (WCN) to understand the commonalities and differences between languages. A WCN is a text-analysis method that represents potential relationships between people, organizations, concepts, biological organisms, and more as a graph.
People worldwide generate an incredible amount of data in a variety of languages each day. There are about 7,000 different languages in the world, so to think there are “several quintillion bytes of data” being generated in nearly all of them daily is mind-boggling.
While this poses an interesting challenge for data analysis, researchers have proposed complex network theory as one solution. In a word co-occurrence network (WCN), one of the main types of semantic network, words form the vertices of the network, and edges connect words on the basis of a string of words called an “N-gram.”
The “N” refers to the number of consecutive words in a sentence that are analyzed at one time. Previous research has been limited to WCNs with a maximum “N” of two and has found that these WCNs can capture the characteristic features of multiple languages fairly well.
What happens if the words in the string are phrased differently? What happens when “N” is increased beyond two?
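To make the N-gram idea concrete, here is a minimal Python sketch of how a WCN might be built from a tokenized text. The function name build_wcn, the toy sentence, and the rule that every pair of distinct words inside an N-word window is linked are assumptions made for illustration; the study's exact construction may differ.

```python
from collections import defaultdict
from itertools import combinations


def build_wcn(tokens, n=2):
    """Return an undirected WCN as {word: set of neighbouring words}.

    Hypothetical helper: links every pair of distinct words that share
    a window of n consecutive tokens.
    """
    graph = defaultdict(set)
    for word in tokens:
        graph[word]  # register every word as a vertex, even if isolated
    # Slide a window of n consecutive words over the text.
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Assumption: all distinct words in the same window co-occur.
        for a, b in combinations(set(window), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


tokens = "the cat sat on the mat".split()
wcn2 = build_wcn(tokens, n=2)  # only adjacent words are linked
wcn3 = build_wcn(tokens, n=3)  # words up to two positions apart are linked

print(sorted(wcn2["the"]))
print(sorted(wcn3["the"]))
```

With N set to two, “the” is linked only to the words directly next to it; with N set to three, it also picks up “sat”, which is why a larger N yields a denser network from the same text.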
Professor Tohru Ikeguchi from Tokyo University of Science led a research team to investigate syntactic dependencies and relations in languages by using WCNs with an “N” greater than two. The study was published in Nonlinear Theory and Its Applications, IEICE on April 1, 2022.
The research team converted well-known works in eight languages into WCNs, including the New Testament of the Christian Bible, United Nations proceedings, the Paris Agreement, and novels by different authors. The documents were chosen because they have been translated into many languages.
Professor Ikeguchi wrote in the conclusion that the important features of each language appear in the networks with more than three co-occurrences, and that some of the network indices used to evaluate the structural features of the networks depend on the text data.
“The network indices that are dependent on the text data include the number of words and vertices, the density of the network, the triangle clustering coefficient and the square clustering coefficient,” explains Professor Ikeguchi in the report. “However, the research team also observed that some indices remained independent of the text data, such as the triangle clustering coefficient and the average shortest-path length, thereby enabling the description of the similarities and differences between languages.”
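For readers unfamiliar with these indices, the sketch below shows how three of them could be computed on a small undirected graph stored as an adjacency dictionary. It uses common textbook definitions (density, the average local triangle clustering coefficient, and the average shortest-path length over reachable pairs); the function names and the toy graph are assumptions, the square clustering coefficient is omitted, and the study's exact definitions may differ.

```python
from collections import deque
from itertools import combinations


def density(graph):
    """Fraction of possible edges that are actually present."""
    n = len(graph)
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # each edge counted twice
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def average_clustering(graph):
    """Mean over vertices of the share of neighbour pairs that are linked."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue  # vertices with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2 * links / (k * (k - 1))
    return total / len(graph)


def average_shortest_path_length(graph):
    """Mean hop distance over all reachable ordered pairs (BFS per vertex)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy graph (hypothetical): a triangle a-b-c with d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(density(toy), average_clustering(toy), average_shortest_path_length(toy))
```

Applied to WCNs built from the same document in different languages, indices like these give a compact numerical basis for comparing the networks.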