Casmir Anyaegbu’s Post


Data Scientist | Data Analyst | Sales Analyst | Python | Pandas | Seaborn | Machine Learning | R | SQL | Power BI | Tableau | Looker Studio | Excel | STATA | EViews | Dashboard | Researcher

#Boom #My_Models have been #trained

#Project_title: Sentiment Analysis for Customer Feedback: Product and Service Improvements with Precision

The processing times and speeds give insight into the efficiency of the data preprocessing stage for these datasets. Here's a breakdown of what the figures imply:

#Training_Dataset (3.6 Million Reviews)
- Total Time Taken: 41 minutes and 31 seconds
- Processing Speed: 1444.68 iterations per second (it/s)

#Test_Dataset (400,000 Reviews)
- Total Time Taken: 4 minutes and 28 seconds
- Processing Speed: 1488.19 iterations per second (it/s)

#Observations_and_Implications:
1. Processing Speed: The processing speed, measured in iterations per second, is nearly identical across the training and test datasets. The slight variation could be due to differences in system resources, data structure, or concurrent processes running during preprocessing.
2. Scalability: Handling a large dataset of 3.6 million reviews within a reasonable timeframe (under 42 minutes) demonstrates good scalability and efficiency in the preprocessing step. This matters for large-scale projects, where preprocessing can become a bottleneck.
3. Progress Feedback: Using tqdm for progress feedback is particularly valuable here. It not only estimates the time remaining but also helps in monitoring the run for potential issues.
4. Optimization Opportunities: While the current processing speed is efficient, there may still be room for improvement, such as parallel processing, more efficient stopword-removal algorithms, or hardware upgrades.
5. Practical Considerations: Total processing time and speed are important metrics for planning and resource allocation, especially when preprocessing is a recurring task or many datasets must be processed frequently.
Overall, these metrics are a positive indication of the system's ability to handle extensive preprocessing tasks, an essential step in preparing data for subsequent analysis or machine learning modeling. The code is in the comment section. AMDARI #NLP #DataScience #PredictiveModelling #SentimentAnalysis #CustomerFeedback #ServiceImprovements Omowumi Victoria Samson
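Point 4 above mentions parallel processing as one optimization avenue. A minimal sketch of that idea, splitting the text column across CPU cores with multiprocessing (this is an illustration, not the project's code; `parallel_clean`, `clean_chunk`, and the inline stopword list are placeholders assumed for the example):

```python
from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd

# Placeholder stopword list for illustration only; a real run would use a
# full list such as NLTK's stopwords.words("english").
STOP_WORDS = {"a", "an", "the", "is", "are", "this", "and", "it", "of", "to"}

def remove_stopwords(text: str) -> str:
    # Keep word order, drop stopwords (case-insensitive match).
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def clean_chunk(chunk: pd.Series) -> pd.Series:
    # Worker function: clean one slice of the text column.
    return chunk.apply(remove_stopwords)

def parallel_clean(texts: pd.Series, workers: int = cpu_count()) -> pd.Series:
    # Split the column into one chunk per worker and clean them in parallel.
    chunks = np.array_split(texts, workers)
    with Pool(workers) as pool:
        return pd.concat(pool.map(clean_chunk, chunks))
```

The trade-off: process start-up and data serialization add overhead, so the speedup only pays off on large columns like the 3.6M-review training set, not on small ones.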

Casmir Anyaegbu


4mo

from tqdm import tqdm

total_rows = len(test_dataset)
tqdm.pandas(total=total_rows)
test_dataset['stop words'] = test_dataset['text'].progress_apply(remove_stopwords)
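The `remove_stopwords` helper isn't shown in the post. A minimal sketch of what it might look like, assuming a simple token-level filter (the inline stopword list here is a placeholder; NLTK's `stopwords.words("english")` would be the usual choice):

```python
# Placeholder stopword list for illustration; swap in a full list for real use.
STOP_WORDS = {"a", "an", "the", "is", "are", "this", "and", "it", "of", "to"}

def remove_stopwords(text: str) -> str:
    """Drop stopwords from a review while preserving word order."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
```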

Casmir Anyaegbu


4mo

# I would love to see a progress bar when we process all 3.6 million reviews
from tqdm import tqdm

total_rows = len(train_dataset)
tqdm.pandas(total=total_rows)
train_dataset['stop words'] = train_dataset['text'].progress_apply(remove_stopwords)
