Rafael Brown’s Post

CEO & Founder at Symbol Zero // Microsoft Regional Director

Highlighting: "The amount of training data for each language was not the only factor at play: The study found that the chatbot had particular difficulty with low-resource languages that were structurally different from English."

-----

AI training for LLMs reflects what you put into it. And if you train only in English, you don't reflect the rest of the world. OpenAI and most AI solutions made in English-speaking countries have major failings outside of English.

-----

RestOfWorld: "We tested ChatGPT in Bengali, Kurdish, and Tamil. It failed. Outside of English, ChatGPT makes up words, fails logic tests, and can't do basic information retrieval." (6 Sept 2023) (Andrew Deck)

"ChatGPT is being used all over the world, generating Amazon listings in China and call center scripts in the Philippines. But while ChatGPT thrives in English, Spanish, Japanese, and other dominant languages, it struggles to produce the same quality of text in languages like Bengali, Swahili, Urdu, and Thai — languages that have millions of speakers, but appear much less often online. When Rest of World tested ChatGPT’s ability to respond in underrepresented languages, we found problems reaching far beyond translation errors, including fabricated words, illogical answers and, in some cases, complete nonsense.

Take Tigrinya, a language which has over 7 million speakers, with the vast majority located in Eritrea and the northern part of Ethiopia. Tigrinya shares a similar script to Amharic, a more dominant Ethiopian language, but there are significant differences between the two. When asked to list examples of African countries, ChatGPT mixed up Tigrinya and Amharic, adding characters that don’t exist in Tigrinya. It created an output to this simple question that is challenging to read for native speakers of both languages. ChatGPT also inserted countries that aren’t on the African continent, including Jordan and Canada.

Researchers told Rest of World that proper nouns — names, places, and institutions — are a persistent weak point for ChatGPT, and that this problem is common across many underrepresented languages. “If you ask ChatGPT in Tigrinya or Amharic the simplest and most frequently asked questions, it gives you gibberish, a mix of Tigrinya and Amharic, or even made-up words,” said Asmelash Teka Hadgu, co-founder and chief technology officer of Lesan, a startup developing machine translation products for Ethiopian languages. “Chatbots like ChatGPT are utterly broken or useless for these languages.”

Many of these languages are what AI researchers call “low-resource.” AI language models are largely trained on data scraped from the internet. While languages like Bengali are some of those most spoken in the world, they are less represented online, so there is less digitized text available to train models tailored to them."

RestOfWorld: https://lnkd.in/g2hPqKUc

#ai #fakeai #genai #degenai #generativeai #degenerativeai

We tested ChatGPT in Bengali, Kurdish, and Tamil. It failed.

restofworld.org

Rafael Brown

CEO & Founder at Symbol Zero // Microsoft Regional Director

2mo

Highlighting: "Many of these languages are what artificial intelligence researchers call “low-resource.” AI language models are largely trained on data scraped from the internet. While languages like Bengali are some of those most spoken in the world, they are less represented online, so there is less digitized text available to train models tailored to them. A tool like ChatGPT, built on this data, frequently produces less intelligent — and, at times, unintelligible — responses in low-resource languages. A recent study by researchers at the University of Oregon made a similar finding, testing ChatGPT’s ability to complete several writing tasks in 37 different languages. In low-resource languages, the chatbot routinely underperformed in the tasks. The amount of training data for each language was not the only factor at play: The study found that the chatbot had particular difficulty with low-resource languages that were structurally different from English. Currently, OpenAI does not include any language guidelines in its usage policy for ChatGPT." https://restofworld.org/2023/chatgpt-problems-global-language-testing/

Rafael Brown

CEO & Founder at Symbol Zero // Microsoft Regional Director

2mo

Highlighting: "Much has been made of the tendency of AI chatbots to “hallucinate” — shorthand for fabrications that chatbots state as facts. This problem is common with ChatGPT responses in low-resource languages. But in multiple instances, rather than generating fake numbers or other facts, Rest of World found that ChatGPT simply makes up words. When asked to explain the U.S. asylum application process, ChatGPT answered fluidly and succinctly in English. In Haitian Creole, though, it frequently used “Ewoyezi” as the main subject of its sentence, seemingly to refer to asylum. “Ewoyezi” is not in the Haitian Creole dictionary. Only after being pressed multiple times to explain what “Ewoyezi” means did ChatGPT finally respond that the word does not exist. “Suffice it to say that it’s full of syntactical errors, errant Frenchisms, and most damningly of all, words that just don’t exist,” said Laura Wagner, a Haitian Creole team lead at Respond Crisis Translation. She frequently works with Haitian migrants applying for asylum in the U.S., and reviewed the text for Rest of World." https://restofworld.org/2023/chatgpt-problems-global-language-testing/

Rafael Brown

CEO & Founder at Symbol Zero // Microsoft Regional Director

2mo

Highlighting: "Tamil is one language with a rich history of literature. Spoken by over 78 million people, it is an official language of both Sri Lanka and the Indian state of Tamil Nadu. Venpa is one popular style of metered poetry, which is commonly used in published works in Tamil-language literature. We asked ChatGPT to write one — but the English venpa was far more adept than the Tamil one. In English, ChatGPT was able to create a poem that had fluid and poetic descriptions. Its Tamil counterpart, meanwhile, was incorrectly structured, and rife with errors and garbled phrases. Despite venpa being a style of poetry that originated in Tamil, ChatGPT struggled to produce a legible poem in the language. “If I were to rate the above poetry, like a Tamil teacher in a school, I would give zero marks for ChatGPT,” Sankar, a Chennai-based developer who is the creator of the Tamil version of Wordle, told Rest of World. “And I may ask to meet with ChatGPT’s parents.”" https://restofworld.org/2023/chatgpt-problems-global-language-testing/

John Woodworth

Proactive Solutionist, On a Quest for Knowledge | Technology | Innovation | Security | Robotics | IoT | Optics | CGI | Sci-Fi | Video-Games | 3D-Animation | Quanta | Gravity |

2mo

Everything works like magic if you do your best to *not* understand it. My *biggest* problem with language models is the complete lack of a testing framework. The *very first thing* I would have done to test theories of translation would be to create languages (think ISO) for testing and benchmarking. These standardized languages would also be used for accurate calculations of made-up-isms (aka hallucinations). Apparently, we threw out everything we knew about software development, engineering, and science to make these fraud factories. Just because doing something is difficult doesn't mean we just ignore it and pretend we shouldn't be doing it.

Gil Steiner

I write about Technology, Web Dev & Games | Follow Me | Techno Luddite | CTO @ NoWaste | Frontend Architect | Freelance Frontend Engineer | Entrepreneur | Game Designer & Developer | Creativity Enthusiast

2mo

This means biases within biases
