We’re excited to announce the release of the #DSSATX 2024 playlist, featuring in-depth sessions from leading data science professionals. This year’s event in Austin brought together experts to discuss advanced applications of predictive analytics, generative AI, and machine learning across industries.

A Few Featured Sessions:

• #DataLabeling: The Unsung Hero of Machine Learning
Lydia Chen, Director of Data Science at Gannett, emphasizes the critical role of precise data labeling in developing effective machine learning models.

• Lessons Learned Applying LLMs in Healthcare
Hasham Ul Haq, Sr Machine Learning Engineer at John Snow Labs, shares real-world applications and insights from implementing large language models in the healthcare sector.

• Intelligent Decision Companion in Industrial Environments
Shahid Bashir, Data Analytics and Cybersecurity Lead at Phillips66 Lubricantes Grupo Beredali, discusses the integration of intelligent decision systems within industrial settings.

• Automated Parsing and Response Generation for Financial Documents Using LLMs
Dr. Lovedeep S., Data Science Manager at Arthur J. Gallagher, explores the automation of financial document processing through large language models.

• Generative AI’s Role in Shaping Future Engineering Teams
Fatma Tarlaci examines how #generativeAI is influencing the structure and dynamics of engineering teams.

• Fueling the Data Engine: How LLMs Can Ignite Your Data Enablement Strategy
Sreevani Konda, Director of Data Analytics and Data Governance at Fidelity Investments, highlights the role of large language models in enhancing data strategies.

• Leveraging #LLMs for Text Augmentation in #NLP Tasks
Ariel Gamiño, Lead #AI and #MachineLearning Engineer at Trajector Medical, discusses the application of large language models for text augmentation in natural language processing tasks.
WATCH HERE: https://lnkd.in/g-jcGppx

We're excited to be back at Oracle HQ in Austin on Feb 19-20! CFP is now open [in comments]

#datascience #genAI #enterpriseAI
Data Science Salon’s Post
More Relevant Posts
-
Building LLM applications? Extracting clean data from PDFs is crucial for feeding your LLMs, but real-world data is far from perfect. 🌍 Join Shuveb Hussain and Yujian Tang with OSS4AI in a practical workshop to tackle your data preparation challenges!

What you'll learn:
▸ Extracting data from scanned PDFs, handwritten forms, and complex tables.
▸ Comparing and choosing the right libraries and techniques.
▸ Practical methods to extract clean, usable text for your LLMs.

Walk away with the skills to:
✅ Structure your data for easy processing.
✅ Maximize your LLMs with high-quality input.

Perfect for:
✦ Machine Learning Engineers
✦ Data Scientists
✦ Software Developers (NLP/AI)
✦ Document AI Specialists
✦ Anyone working with LLMs

📅 Date: Thursday, 25th July
🕒 Time: 9:00 - 10:00 GMT-7

Don't miss out! Register today ➡ https://lu.ma/gxra6lza

#LLM #DataExtraction #MachineLearning #AI #Workshop
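As a taste of the "clean, usable text" problem, here is a minimal, stdlib-only sketch of the cleanup step that typically follows raw extraction. The extraction itself would come from a library such as pypdf; the regexes and the sample string below are illustrative assumptions, not workshop material:

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Normalize text extracted from a PDF before feeding it to an LLM."""
    # Re-join words hyphenated across line breaks: "extrac-\ntion" -> "extraction"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse single newlines inside paragraphs, but keep blank lines (paragraph breaks)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze repeated spaces/tabs and strip surrounding whitespace
    text = re.sub(r"[ \t]+", " ", text).strip()
    return text

raw = "Data extrac-\ntion from scanned\nPDFs is messy.\n\nNew paragraph."
print(clean_extracted_text(raw))
```

Real PDFs add further wrinkles (multi-column layouts, running headers, tables), which is where the library comparison in the workshop comes in.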
-
Are vector databases the future of data management in AI? In the lightning-paced world of AI, vector databases are taking center stage. But why? Let's find out.

🌟 What are Embeddings?
🔹 Embeddings are numerical representations of data that can be understood by machine learning models.
🔹 Typically, embeddings are multi-dimensional numerical arrays, which are essentially vectors.

🌟 Why use vector databases?
🔹 More than 80% of today's data is unstructured.
🔹 Keyword tagging of unstructured data is not efficient.
🔹 Vector embeddings improve data accessibility.
🔹 They facilitate efficient nearest neighbor searches.
🔹 They allow for similarity searches across different data types, including text, images, videos, and audio.
🔹 Indexing enables quick query processing and data retrieval.
🔹 They convert vectors into structures that are searchable.

🌟 Where to use vector databases?
🔹 Long-term memory for GPT-4 and enhanced contextual memory.
🔹 Semantic search beyond exact matches.
🔹 Similarity search, where plain keyword search falls short.
🔹 Recommendation engines that provide smarter e-commerce suggestions.

🌟 Popular Databases
🔹 Pinecone
🔹 Weaviate
🔹 Chroma
🔹 Redis Vector DB

These tools redefine how we index and retrieve data in modern AI applications. Vector databases hold transformative potential. Drop a comment and share your thoughts!

#AI #MachineLearning #DataScience #VectorDatabases #SoftwareDevelopment #NLP

💡 Stay updated on software architecture insights at follow.paramg.com. ✍️ Connect with me on blog.paramg.com for more in-depth articles.
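To make the "nearest neighbor search" idea concrete, here is a toy sketch. The document names and vectors are made up, and real embeddings would come from a model, but the cosine-similarity lookup is exactly the operation a vector database accelerates with indexes:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings"; in practice these would be model outputs with hundreds of dims
docs = {
    "cat video": [0.9, 0.1, 0.0],
    "dog photo": [0.8, 0.3, 0.1],
    "tax form":  [0.0, 0.1, 0.95],
}

def nearest(query_vec, store):
    """Brute-force scan; a vector database replaces this with an ANN index."""
    return max(store, key=lambda name: cosine_similarity(query_vec, store[name]))

print(nearest([0.8, 0.3, 0.1], docs))  # a pet-like query retrieves a pet document
```

The key point: similarity is geometric, so the same lookup works for text, images, or audio, as long as each is embedded into the same vector space.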
-
🤔 𝗔𝗿𝗲 𝗗𝗲𝗲𝗽 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹𝘀 𝗔𝗹𝘄𝗮𝘆𝘀 𝗕𝗲𝘁𝘁𝗲𝗿❓ 𝗟𝗲𝘁'𝘀 𝗘𝘅𝗽𝗹𝗼𝗿𝗲❗

The debate between deep learning and classical machine learning models continues to spark discussion in data science. Let's explore different perspectives on this intriguing question:

𝟭. 𝗗𝗲𝗲𝗽 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴'𝘀 𝗦𝘁𝗿𝗲𝗻𝗴𝘁𝗵𝘀: Deep learning models, with their ability to automatically learn intricate patterns from raw data, excel in tasks like image recognition, natural language processing, and speech recognition. Their hierarchical architecture allows them to capture complex relationships within data, leading to state-of-the-art performance in many domains.

𝟮. 𝗟𝗶𝗺𝗶𝘁𝗮𝘁𝗶𝗼𝗻𝘀 𝗼𝗳 𝗗𝗲𝗲𝗽 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴: While deep learning models shine in certain areas, they also come with limitations. They often require massive amounts of training data, extensive computational resources, and careful hyperparameter tuning. In domains with limited data, or where interpretability is crucial, traditional machine learning approaches may be more suitable.

𝟯. 𝗩𝗲𝗿𝘀𝗮𝘁𝗶𝗹𝗶𝘁𝘆 𝗼𝗳 𝗖𝗹𝗮𝘀𝘀𝗶𝗰𝗮𝗹 𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴: Classical machine learning models, such as decision trees, support vector machines, and logistic regression, offer simplicity, interpretability, and efficiency. They can perform well on smaller datasets, where feature engineering plays a crucial role, or when transparency and explainability are essential, such as in regulatory compliance or medical diagnosis.

𝟰. 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗠𝗮𝘁𝘁𝗲𝗿𝘀: The choice between deep learning and classical machine learning depends on the nature of the data, the complexity of the problem, the available computational resources, and the specific requirements of the application. It's essential to weigh these factors carefully and choose the approach that best suits the task at hand.

In conclusion, while deep learning models have achieved remarkable success in many domains, they are not always superior to classical machine learning models. The choice between the two depends on the specific requirements and constraints of the problem. By understanding the strengths and limitations of each approach, data scientists can make informed decisions to tackle real-world challenges effectively.

𝗪𝗵𝗮𝘁'𝘀 𝘆𝗼𝘂𝗿 𝘁𝗮𝗸𝗲 𝗼𝗻 𝘁𝗵𝗲 𝗱𝗲𝗯𝗮𝘁𝗲❓ 𝗦𝗵𝗮𝗿𝗲 𝘆𝗼𝘂𝗿 𝘁𝗵𝗼𝘂𝗴𝗵𝘁𝘀 𝗶𝗻 𝘁𝗵𝗲 𝗰𝗼𝗺𝗺𝗲𝗻𝘁𝘀 𝗯𝗲𝗹𝗼𝘄❗ 💡🤖

#DeepLearning #MachineLearning #DataScience #AI #Perspective 🌐🚀
-
Excited to announce my completion of the "Understanding Artificial Intelligence" course on DataCamp! 🎓 As an aspiring data analyst, I'm passionate about bridging the gap between AI and data. 💡 I'm thrilled to continue learning about AI and exploring how it can enhance data analysis to drive insights and innovation. #ArtificialIntelligence #DataAnalysis #DataCamp #AI #DataScience #DataAnalytics #TechEducation
-
In this paper, the authors present evidence that models trained solely on data generated by other models perform worse than the original model, and that performance degenerates with each generation of the model if this process is repeated recursively. The models essentially lose the ability to represent information that is less frequent in the initial training data. This data supports a hypothesis that the scientific community has been discussing for some time. Given that training data for large language models is mostly extracted from the internet and that the internet has been flooded with synthetic data (generated by generative AI), this could be bad news for the next generations of large language models. https://lnkd.in/eTG3cC94
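The recursive-training effect is easy to reproduce in miniature. The toy simulation below is my own illustration, not the paper's experiment: it repeatedly fits a Gaussian to a small sample and then resamples from the fit. The fitted spread collapses toward zero over generations, mirroring the loss of low-frequency information the authors describe:

```python
import random
import statistics

random.seed(0)

def resample_generation(sample, n):
    """Fit a Gaussian to the sample, then draw a fresh sample from the fit."""
    mu = statistics.fmean(sample)
    sigma = statistics.pstdev(sample)  # MLE estimate; slightly biased low
    return [random.gauss(mu, sigma) for _ in range(n)]

n = 20
data = [random.gauss(0.0, 1.0) for _ in range(n)]  # "real" data with std 1.0
initial_std = statistics.pstdev(data)

# Each generation trains only on the previous generation's output
for generation in range(300):
    data = resample_generation(data, n)

final_std = statistics.pstdev(data)
print(initial_std, final_std)  # the fitted distribution's spread shrinks drastically
```

With each generation the estimator re-fits a finite sample of its own output, so estimation error compounds and the tails vanish first, which is the qualitative mechanism behind "model collapse."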
AI models collapse when trained on recursively generated data - Nature
nature.com
-
It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. In the paper "AI models collapse when trained on recursively generated data", the authors consider what may happen to GPT-{n} once LLMs contribute much of the text found online. They find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. They refer to this effect as 'model collapse' and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). They build theoretical intuition behind the phenomenon, portray its ubiquity among all learned generative models, and demonstrate that it must be taken seriously if we are to sustain the benefits of training on large-scale data scraped from the web. Indeed, data collected from genuine human interactions with systems will become increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.

URL: https://lnkd.in/gEsKPGJU

#llms #genai #generatieveai #capgemini #llm #opensourceai #capgeminiindia #ai #artificialintelligence #software #modelcollapse #safety #ai4good #airisks #airiskmanagement #airisk #aisafety #redteam #redteaming
-
#Vector #database #AI/ML #GenAI

A vector database is a specialized type of database optimized for storing, indexing, and querying high-dimensional vectors: mathematical representations of data points. Vectors are used to represent complex data such as text, images, audio, and other unstructured data. Each vector is a list of numbers (features) that captures the essential information of the original data in a compact form. Vector databases are crucial for applications including machine learning, natural language processing, and recommendation systems.

Key Features of Vector Databases:
1. High-Dimensional Data Handling: Vector databases are designed to efficiently store and process high-dimensional data, often with hundreds or thousands of dimensions.
2. Similarity Search: They provide advanced algorithms for similarity search, enabling fast and accurate retrieval of vectors similar to a given query vector. This is essential for tasks like nearest neighbor search.
3. Indexing: Vector databases use specialized indexing techniques, such as hierarchical navigable small world (HNSW) graphs, approximate nearest neighbor (ANN) search, and other spatial indexing methods, to speed up queries.
4. Scalability: They are built to handle large-scale datasets, often distributed across multiple nodes for scalability and high availability.
5. Integration with Machine Learning Pipelines: Vector databases are commonly integrated with machine learning models and pipelines to support real-time inference and recommendation systems.

Common Use Cases:
1. Image and Video Retrieval: Searching for similar images or videos based on visual content.
2. Text and Document Search: Finding documents or text snippets that are semantically similar to a query.
3. Recommendation Systems: Recommending products, content, or users based on similarity measures.
4. Anomaly Detection: Identifying unusual patterns or outliers in high-dimensional data.

How Vector Databases Work:
1. Data Ingestion: Vectors are generated from raw data using feature extraction methods, often machine learning models such as neural network embeddings.
2. Indexing: The vectors are indexed using efficient data structures that support quick similarity searches. Techniques like KD-trees, R-trees, and graph-based methods are commonly used.
3. Querying: A query vector is compared against the indexed vectors to find the most similar ones. This involves calculating distances (e.g., Euclidean distance, cosine similarity) between vectors.
4. Result Retrieval: The database returns the vectors (and the associated original data) closest to the query vector, based on the defined similarity metric.
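The four steps above can be sketched in a few lines. This is a deliberately naive in-memory store with a brute-force linear scan (real systems replace that scan with ANN indexes such as HNSW); the vectors and payloads are made-up toy data:

```python
import math

class TinyVectorStore:
    """A minimal sketch of the ingest -> index -> query -> retrieve flow."""

    def __init__(self):
        self._items = []  # (vector, payload) pairs; a flat list, no real index

    def add(self, vector, payload):
        """Ingestion: in practice, `vector` would come from an embedding model."""
        self._items.append((vector, payload))

    def query(self, vector, k=2):
        """Brute-force nearest neighbours by Euclidean distance."""
        scored = sorted(self._items, key=lambda item: math.dist(vector, item[0]))
        return [payload for _, payload in scored[:k]]

store = TinyVectorStore()
store.add([0.1, 0.9], "doc about dogs")
store.add([0.2, 0.8], "doc about cats")
store.add([0.9, 0.1], "doc about taxes")
print(store.query([0.12, 0.88], k=2))
```

Swapping `math.dist` for cosine similarity, and the sorted scan for an HNSW or KD-tree index, turns this sketch into the shape of a production vector database.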
-
🚀 Breakthrough in AI: GraphRAG - Elevating Retrieval-Augmented Generation to New Heights! 🧠🔍

Microsoft Research has unveiled GraphRAG, a game-changing approach to retrieval-augmented generation (RAG) that's now available on GitHub! This innovative tool offers unprecedented capabilities in information retrieval and comprehensive response generation.

🔑 Key Advancements of GraphRAG:

1️⃣ LLM-Generated Knowledge Graphs:
• Creates rich, structured representations of private datasets
• Enables complex reasoning and connection of disparate information

2️⃣ Whole-Dataset Reasoning:
• Identifies top themes and semantic concepts across entire datasets
• Outperforms baseline RAG in understanding global context

3️⃣ Superior Query Performance:
• Excels in "connecting the dots" across multiple pieces of information
• Provides more relevant and comprehensive answers

4️⃣ Enhanced Provenance and Trustworthiness:
• Offers source grounding for generated responses
• Enables quick human auditing of LLM outputs

5️⃣ Hierarchical Semantic Clustering:
• Organizes data into meaningful semantic clusters
• Allows for pre-summarization of themes and concepts

🔬 Impressive Results:
• Consistently outperforms baseline RAG in comprehensiveness and diversity
• Maintains high faithfulness to source material

💡 Real-World Applications:
• Analysis of complex, real-world datasets (e.g., the VIINA dataset)
• Applicable to social media, news articles, workplace productivity, and scientific domains

🛠️ Technical Process:
1. An LLM processes the entire dataset to create a knowledge graph
2. The graph is used for bottom-up clustering and semantic organization
3. Both structures inform context window population at query time

🔮 Future Directions:
• Expanding to new domains and use cases
• Developing robust evaluation frameworks
• Enhancing metrics for accuracy and context relevance

GraphRAG represents a significant leap forward in RAG technology, offering powerful capabilities for understanding and querying complex datasets. Its ability to reason over entire datasets and provide trustworthy, comprehensive answers makes it an invaluable tool for data analysis and information retrieval.

#ArtificialIntelligence #GraphRAG #MachineLearning #NLP #DataAnalysis #KnowledgeGraphs #InformationRetrieval #AIResearch #MicrosoftAI #AzureAI #DataScience #LargeLanguageModels #AIInnovation #DataVisualization #CognitiveComputing #TextAnalytics #AIApplications #ResearchAndDevelopment #TechInnovation #FutureOfAI
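As a rough intuition for the "knowledge graph + clustering" pipeline, the stdlib-only toy below builds a graph from hand-written entity triples and groups entities via connected components. This is purely illustrative: GraphRAG itself extracts the triples with an LLM and uses proper community detection over a much richer graph, not this simplification:

```python
from collections import defaultdict

# Toy "LLM-extracted" relations; real GraphRAG derives these from documents
triples = [
    ("Alice", "works_at", "Acme"),
    ("Bob", "works_at", "Acme"),
    ("Acme", "based_in", "Austin"),
    ("Quark", "studied_by", "CERN"),
    ("CERN", "located_in", "Geneva"),
]

# Build an undirected adjacency map (the knowledge graph)
graph = defaultdict(set)
for head, _relation, tail in triples:
    graph[head].add(tail)
    graph[tail].add(head)

def connected_components(graph):
    """Group entities into clusters; a stand-in for community detection."""
    seen, clusters = set(), []
    for node in list(graph):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        clusters.append(component)
    return clusters

for cluster in connected_components(graph):
    print(sorted(cluster))  # each cluster could then be pre-summarized by an LLM
```

The payoff is the same as in GraphRAG's design: once entities are clustered, each cluster can be summarized ahead of time, and those summaries answer "whole dataset" questions that chunk-level RAG misses.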
-
📢 Last chance to submit your session proposals and become a speaker at Data Science Summit Machine Learning Edition 2024!

🌐 June 13 Online and VOD
🏢 June 14 Onsite in Warsaw

💡 If you have an idea for an interesting talk on a topic within the broad area of technology:
◾️ Machine Learning ◾️ Artificial Intelligence ◾️ Data Science ◾️ Trends in ML / AI ◾️ LLMs ◾️ GPT ◾️ NLP ◾️ Computer Vision ◾️ Reinforcement Learning ◾️ Deployment / MLOps (I'm on the program committee behind this specific track!) ◾️ Predictive Analytics ◾️ AI Use Cases ◾️ AI in Healthcare ◾️ AI Business Transformation…

➡️ ...hurry up and apply here: https://lnkd.in/dkrM8YD9

#DataScienceSummit #DataScience #MachineLearning #MLOps #LLM
-
🚀 Project Update: Multilabel Classification with BERT on Imbalanced Datasets 🚀

I recently completed a project centered on implementing three distinct BERT-family models for multilabel classification, tailored specifically for a highly imbalanced dataset. Tackling data imbalance is crucial for ensuring that models perform reliably across all labels, especially underrepresented ones. Here's a breakdown of my approach and key insights:

🔍 Project Highlights:
• Custom Hugging Face Dataset: Built from scratch to integrate seamlessly with the models, allowing efficient data handling and preprocessing.
• Custom Configuration & Training Components: Developed a custom configuration, tokenizer vocabulary, and collate function to maximize training efficiency and model compatibility.
• Class Weights for Imbalance Handling: Incorporated class weights to improve performance on underrepresented classes, which is essential in multilabel tasks.
• Compute Metrics & Evaluation: Implemented custom metrics to evaluate model performance, providing a robust understanding of how each model handles the imbalanced data.
• Confusion Matrix for Detailed Analysis: Generated confusion matrices to visualize each model's performance, particularly highlighting areas impacted by data imbalance.

⚙️ Experiments Conducted:
• Experiment 1: RoBERTa Base (Model: roberta-base)
• Experiment 2: DistilBERT (Model: distilbert-base-uncased)
• Experiment 3: DistilRoBERTa Base (Model: distilroberta-base)

⚙️ Models Evaluated: Each of the three models underwent rigorous testing, enabling a comparative analysis of their effectiveness in managing the nuances of the data. Despite the limited size and significant imbalance of the dataset, this work illustrates the application of BERT-family models in challenging multilabel classification scenarios.

📉 Insights: While the dataset limitations affected final performance, this project showcases strategies for managing imbalanced data in complex classification tasks. It also underscores the value of model comparison for informed decision-making in real-world AI applications.

Excited to leverage these insights in future projects and continue tackling challenging AI problems! If you're interested in multilabel classification, BERT models, or handling data imbalance, let's connect and exchange ideas!

#NLP #PyTorch #MachineLearning #BERT #RoBERTa #DistilBERT #DeepLearning #DataScience #MultilabelClassification #AI #HuggingFace

Check it out here: https://lnkd.in/gUtbHuwC
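The class-weighting idea can be sketched independently of any framework. Below is a small, self-contained illustration with toy labels (my own code, not the project's): one common recipe sets a per-label `pos_weight` of negatives/positives, the same convention PyTorch's `BCEWithLogitsLoss` uses, and plugs it into a weighted binary cross-entropy so that errors on rare labels cost more:

```python
import math

# Toy multilabel targets for 6 samples over 3 labels; label 2 is the rare one
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]

n_samples, n_labels = len(Y), len(Y[0])

# pos_weight[j] = (# negatives for label j) / (# positives for label j)
pos_counts = [sum(row[j] for row in Y) for j in range(n_labels)]
pos_weight = [(n_samples - p) / p for p in pos_counts]
print(pos_weight)  # rare labels get the largest weights

def weighted_bce(y_true, y_prob, pos_weight):
    """Binary cross-entropy with per-label positive weights, averaged over labels."""
    total = 0.0
    for y, p, w in zip(y_true, y_prob, pos_weight):
        total += -(w * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

With these weights, missing the rare third label is penalized far more heavily than missing the common first one, which is what pushes the model to keep learning the underrepresented classes.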
GenAI Solutions Architect @ Google
2w · Seriously the best time ever!!! Can’t wait to have y’all back in Austin again ❤️❤️