Abed Khooli’s Post
This article (https://lnkd.in/dyzimUhj) announces new Moroccan Arabic ('darija') models (Atlas Chat), based on Gemma-2, and an evaluation dataset ('darija' MMLU, apparently translated from an Arabic translation of the original MMLU).
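For anyone who wants to try the models, a minimal sketch of loading a checkpoint with Hugging Face transformers. The model id below is an assumption based on the announcement, so verify the exact published name in the article.

# Minimal sketch, assuming an Atlas Chat checkpoint is published on the
# Hugging Face Hub; the model id is a guess -- verify it in the article.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MBZUAI-Paris/Atlas-Chat-9B"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "شنو هي نماذج اللغة الكبيرة؟"  # a darija-style question
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))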
More Relevant Posts
-
One piece of content for all languages 😍 AI makes the impossible possible with content translation #AI #GenAI #translation #dataintelligence #data #content #youtube #language
Instant, live translation. It's time to offer content suited to all languages, and AI makes it possible, effortlessly.👇 On September 12, Synthesia is doing the launch live. Looking forward to seeing this 😀 P.S.: Here is the LinkedIn event (in English): https://lnkd.in/dPdXDQD4
-
Once data has been sourced and extracted, and before it is analysed and converted into actionable insights, FACT360's unique natural language extraction tool classifies it all into topics, classes and categories in the native language - supporting large, complex investigations globally. The FACT360 ACC tool can classify Arabic documents - including Optical Character Recognition output, all digital communications, and written and typed Arabic documents. The platform then classifies millions of these communications into topics, classes and categories for any legal body to extract for further forensic investigation. One of the unique aspects of the ACC tool is that the natural language processing is done in native Arabic, so no data ever gets lost or misconstrued in translation. Our clients no longer need large, high-cost teams of native speakers, and full automation at scale considerably reduces the time it takes to complete their investigations. #arabic #naturallanguagetool #nativelanguage
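FACT360's ACC pipeline is proprietary, but the core idea - classifying Arabic text into topics in the native language, with no translation step - can be illustrated with an off-the-shelf multilingual model. A minimal sketch (not FACT360's actual tool), using a public zero-shot classifier:

# Illustrative only -- not FACT360's pipeline. Zero-shot topic
# classification of Arabic text in the native language, with no
# translation step, using a multilingual NLI model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")

doc = "تم تحويل مبلغ كبير إلى حساب خارجي دون تفويض"  # sample Arabic sentence
labels = ["تحويلات مالية", "موارد بشرية", "شؤون قانونية"]  # candidate topics
result = classifier(doc, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # top topic and its score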
-
Live Translate is one of the coolest AI features that debuted earlier this year with #GalaxyS24. It’s fun when your voice calls, face-to-face conversations, and text messages get automatically translated into your preferred language. But developing the feature involved solving complex challenges. Take Hindi, for example. SRI-Bangalore’s Giridhar Jakki and his language AI team had to ensure more than 20 regional dialects, tonal inflections, punctuation and colloquialisms were covered. Additionally, it is common for Hindi speakers to mix English words into their conversations. This required the team to carry out multiple rounds of AI model training with a combination of translated and transliterated data. “Every language has its challenges,” says Jakki. “But when you consider the end goal of bringing people the ability to communicate in other languages, it’s worth every ounce of effort.” Read on... https://lnkd.in/g5hWjW9k
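As an illustration of what "transliterated data" means here (this is not Samsung's pipeline), a minimal sketch using the open-source indic-transliteration package to render romanized Hindi in Devanagari. Note how the embedded English word gets converted too, which is exactly the code-mixing problem the team had to solve:

# Illustrative sketch only -- not Samsung's pipeline. Naive romanized-
# Hindi -> Devanagari transliteration; embedded English words are
# converted as well, which shows why code-mixed text needs special handling.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

romanized = "kal meeting hai"  # code-mixed Hindi-English ("there is a meeting tomorrow")
print(transliterate(romanized, sanscript.ITRANS, sanscript.DEVANAGARI))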
-
Extended Abstract accepted at Wiki Workshop 2024 (https://wikiworkshop.org/) Heartiest congratulations on the acceptance of the following work at Wiki Workshop 2024. The title, list of authors and abstract of the work are as follows: Title: WikiTransfer: Knowledge Transfer from High-Resource to Low-Resource Language in Multilingual Wikipedia. Authors: Paramita Das, Amartya R., Animesh Mukherjee. Abstract: To address content disparities across multilingual Wikipedia versions, in this work we propose a lightweight approach to enhance information across diverse linguistic communities. By identifying cross-lingual similarities between Hindi and English, we employ powerful machine translation techniques, such as the MBART model, further improving the model to accurately translate English text into Hindi. Our approach aims to improve overall content quality, demonstrated through its effectiveness in enhancing existing Wikipedia articles at the section level. This adaptable framework offers the potential for extending content quality enhancements across various language pairs.
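For readers curious about the MBART setup the abstract mentions, a minimal sketch of English-to-Hindi translation with the stock many-to-many checkpoint (the authors' fine-tuned weights are not linked here, so this uses the public model):

# Minimal sketch: English -> Hindi translation with the stock MBART-50
# many-to-many checkpoint (not the authors' fine-tuned weights).
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "en_XX"  # source language: English
text = "Wikipedia articles differ widely in coverage across languages."
inputs = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"],  # target: Hindi
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])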
-
Translation Studies is in urgent need of updating its "official curricula" on: 1) "Transcreation" as a task of transferring culturally hypersignified semiotic frames between super-globalized languages. If you need help, start with "Decoding Advertisements" by Judith Williamson. As for the super-global languages... well, check how (or what) Spanish youtubers talk to their target audience. You'll get the idea after their first sentence. 2) "AI and translation studies" in the age of ChatGPT4, Gemini, etc. All those comparative tests with G.Translate and Deepl were fine, but we are now in a radically different scenario. You're welcome.
-
How do you explain AI to a preschooler? Marina Pantcheva, PhD used an Asterix and Obelix metaphor for that in her keynote at #2024TEF. Asterix (the human) is very smart, and Obelix (the AI) fell into a magical cauldron full of data as a baby - a cauldron with 10 trillion words, which is the amount of data LLMs are trained on. Now he's very strong, though not as smart as his human friend. Marina talked about things that LLMs do better than humans and worse than humans, about human-in-the-loop vs human-in-the-cockpit, and about the future of translation. Marina mentioned the linguistic determinism of LLMs - say, if a model is trained on Arabic texts, it "knows" a lot about camels and not so much about snow. So I asked a question about low-resource languages and how the evolution of LLMs will affect them. Marina's answer was very optimistic. Apparently, if you train a language model on a mix of a gazillion words of English and 10% Swahili, it will perform better on Swahili than a language model trained purely on all existing Swahili texts. Sounds very promising! I recommend watching the recording of the keynote - it is available for free on YouTube (I'll put the link in the comments) and it is absolutely brilliant and entertaining. #LITranslators #2024TEF #LLMs #LowResourceLanguages #translationfuture
-
Next up, automatic voice translation of English videos 😊 - EWE version (2 of 2) Below is an example generated by our partner Sauti (cc Saint H. Doe-Tamakloe & Dr. Keita Broadwater, MBA) using our API. The example includes Twi, Ewe & original English versions of the video, but since LinkedIn only allows one video at a time, I am only attaching EWE here. You can watch all the videos on YouTube here: TWI - https://lnkd.in/ddjRyEEx Original English Input - https://lnkd.in/dibTEJNK EWE - https://lnkd.in/dF8hdrMf The underlying algorithm is composed of 3 steps: 1) ASR (speech recognition) 2) Translation 3) TTS (text to speech). NOTE: all models are ours - locally developed and owned - no external APIs 👍🏾 To learn more about our partner’s work please see mysauti.ai Want to harness these powerful AI tools? Sign up for a free API account at translation.ghananlp.org Prefer a mobile app? Search for “Khaya” on all stores All links 👉🏾 linktr.ee/nlpghana
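For the curious, the three-step cascade can be expressed as a tiny composition. This is just a conceptual sketch with pluggable callables, not GhanaNLP's actual API (sign up at translation.ghananlp.org for that):

# Conceptual sketch of the ASR -> Translation -> TTS cascade described
# in the post. The callables are parameters, not GhanaNLP's API: plug
# in any speech recognition, translation, and TTS implementations.
from typing import Callable

def dub(audio: bytes,
        asr: Callable[[bytes], str],
        mt: Callable[[str], str],
        tts: Callable[[str], bytes]) -> bytes:
    text = asr(audio)        # 1) speech recognition: audio -> source text
    translated = mt(text)    # 2) translation: source text -> target text
    return tts(translated)   # 3) text-to-speech: target text -> audio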
-
Llama 3.1 is getting great results. The secret sauce? It learned more languages... 😏 #MultilingualAI #GlobalAI Llama 3.0: "To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages. However, we do not expect the same level of performance in these languages as in English." Llama 3.1: “Data mix summary. Our final data mix contains roughly 50% of tokens corresponding to general knowledge, 25% of mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.” “We use a vocabulary with 128K tokens. Our token vocabulary combines 100K tokens from the tiktoken tokenizer with 28K additional tokens to better support non-English languages. Compared to the Llama 2 tokenizer, our new tokenizer improves compression rates on a sample of English data from 3.17 to 3.94 characters per token. This enables the model to “read” more text for the same amount of training compute. We also found that adding 28K tokens from select non-English languages improved both compression ratios and downstream performance, with no impact on English tokenization.” Llama 3.1 paper: The Llama 3 Herd of Models (ai.meta.com)
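The "characters per token" metric in that quote is easy to reproduce. A minimal sketch using tiktoken's cl100k_base, the 100K-token vocabulary the Llama 3 tokenizer builds on (without Llama 3's extra 28K non-English tokens):

# Measuring characters per token with tiktoken's cl100k_base encoding,
# the 100K-token base vocabulary referenced in the quote above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Compression ratio is simply characters divided by tokens."
tokens = enc.encode(text)
print(f"{len(text) / len(tokens):.2f} characters per token")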
-
#AI_Meets_Culture! 🌍 I'm excited to share with you my latest project, #TunDerja – an AI-powered translator that brings Tunisian Derja to the world! Tunisian Derja is a dialect rich in cultural heritage, distinct from other forms of Arabic. It carries centuries of history, and I wanted to preserve and promote it in the digital age. That’s why I developed TunDerja, a tool that translates Tunisian Derja into English, making it easier for people worldwide to experience the beauty and depth of our language. 🌟 In my latest blog on #LinkedIn, I walk you through the journey of building this AI tool: - From data gathering and fine-tuning AI models to the challenges of creating a conversational translation experience. - How I used OpenAI’s GPT-4o-mini to bring Tunisian culture to life through technology. 🔧 Note: #TunDerja is still in its early stages, and I would greatly appreciate your feedback as we continue to train the model on more data to improve accuracy and enhance its capabilities. Your input will be invaluable in shaping its future! 💬 💡 If you're interested in how AI can be used to bridge cultural gaps and preserve languages, I’d love for you to check it out here: https://lnkd.in/erEkMweB #ai #openai #gpt-4o #gpt #culture #technology #womenintech
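For readers wondering what a GPT-4o-mini translation call looks like, a minimal sketch. The author's actual prompts and data-tuning are described in the blog; this only shows the plain API shape, and the system prompt is an invented placeholder:

# Minimal sketch of a Derja -> English translation request with
# gpt-4o-mini; the system prompt is illustrative, not TunDerja's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Translate Tunisian Derja into natural English."},
        {"role": "user", "content": "شنوة أحوالك اليوم؟"},
    ],
)
print(resp.choices[0].message.content)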
-
Imagine a dictionary that doesn’t just cover one version of a language but opens the door to 33 unique dialects. The IgboAPI Dataset is a groundbreaking resource tailored to empower Igbo language technologies. Developed with the rich diversity of Igbo dialects in mind, this dataset covers 33 unique dialects, over 5,000 Igbo words, and nearly 28,000 parallel sentences. Created by expert lexicographers, IgboAPI aims to improve representation in machine translation and language models, transforming how the Igbo language is processed and understood. The goal? To go beyond "Standard Igbo" and reflect the authentic richness of all Igbo dialects. At EqualyzAI, we’re driven by this same mission—to bring the fullness of African languages into technology. This research is a call to expand Igbo representation in tech, ensuring future Igbo language tools are inclusive, accurate, and serve everyone. View the full paper to see how this work is changing the landscape for Igbo language technology. - https://lnkd.in/d2a-qwN7 #igbolanguagetech #multidialect #linguisticdiversity #igboresearch #africanlanguages Authors: Ifeoma Okoh, Chris Chinenye Emezue, Chinedu Mbonu, Chiamaka Chukwuneke, Dr. Daisy Monika Lal, Dr. Ignatius Ezeani, Paul Rayson, Ijemma Onwuzulike, Chukwuma Okeke, Gerald Nweya, Bright Ogbonna, Chukwuebuka Uchenna Oraegbunam, Ph.D., Esther Chidinma Awo-Ndubuisi, Akudo Amarachukwu Osuagwu, Obioha Nmezi
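To make the multi-dialect idea concrete, here is a hypothetical sketch of what a dialect-aware dictionary entry could look like. This is an invented illustration, not the IgboAPI schema, and the dialect forms are placeholders:

# Hypothetical illustration -- not the IgboAPI schema. One headword
# mapped to per-dialect variants plus parallel example sentences.
from dataclasses import dataclass, field

@dataclass
class Entry:
    headword: str                  # "Standard Igbo" form
    gloss: str                     # English meaning
    variants: dict[str, str] = field(default_factory=dict)        # dialect -> form
    examples: list[tuple[str, str]] = field(default_factory=list)  # (igbo, english)

entry = Entry(
    headword="mmiri",
    gloss="water",
    variants={"Onitsha": "mmili", "Ngwa": "mmini"},  # invented placeholder forms
    examples=[("Mmiri dị ọcha.", "The water is clean.")],
)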