Evaluating synthetic data https://ift.tt/gAfTWLH
Assessing the plausibility and usefulness of data we generated from real data

Synthetic data serves many purposes and has been gathering attention for a while, partly due to the convincing capabilities of LLMs. But what is "good" synthetic data, and how can we know we managed to generate it?

Photo by Nigel Hoare on Unsplash

What is synthetic data?
Synthetic data is data that has been generated with the intent to look like real data, at least in some respects (the schema at the very least, statistical distributions, …). It is usually generated randomly, using a wide range of models: random sampling, noise addition, GANs, diffusion models, variational autoencoders, LLMs, … It is used for many purposes, for instance: training and education (e.g., discovering a new database or teaching a course), data augmentation (i.e., creating new samples to train a model), sharing data while protecting privacy (especially useful from an open-science point of view), and conducting research while protecting privacy. It is particularly used in software testing and in sensitive domains like healthcare technology: having access to data that behaves like real data without jeopardizing patient privacy is a dream come true.

Synthetic data quality principles

Individual plausibility
For a sample to be useful it must, in some way, look like real data. The ultimate goal is that generated samples are indistinguishable from real samples: hyper-realistic faces, sentences, medical records, … Obviously, the more complex the source data, the harder it is to generate "good" synthetic data.

Usefulness
In many cases, especially data augmentation, we need more than one realistic sample: we need a whole dataset. And generating a single sample is not the same as generating a whole dataset: the problem is very well known under the name of mode collapse, which is especially frequent when training a generative adversarial network (GAN). Essentially, the generator (more generally, the model that generates synthetic data) can learn to produce a single type of sample and miss out entirely on the rest of the sample space, leading to a synthetic dataset that is not as useful as the original one. For instance, if we train a model to generate animal pictures and it finds a very efficient way to generate cat pictures, it could stop generating anything other than cat pictures (in particular, no dog pictures). Cat pictures would then be the "mode" of the generated distribution. This type of behaviour is harmful if our initial goal is to augment our data or to create a training dataset. What we need is a dataset that is realistic in itself, which, in absolute terms, means that any statistic derived from this dataset should be close enough to the same statistic computed on real data. Statistically speaking, univariate and multivariate distributions should be the same (or at least "close enough"); a small sketch of such a check follows this excerpt.

Privacy
We...
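To make the "close enough" criterion above concrete, here is a minimal sketch (not from the original article) of comparing a real and a synthetic tabular dataset: a per-column Kolmogorov-Smirnov statistic as the univariate check, and the largest gap between correlation matrices as a coarse multivariate check. It assumes numeric columns, and the file names in the usage comment are placeholders.

```python
# Illustrative sketch: comparing real vs. synthetic tabular data.
# Assumes numeric columns; file names in the usage comment are placeholders.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def univariate_report(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """KS statistic per shared column (0 = identical distributions, 1 = disjoint)."""
    rows = []
    for col in real.columns.intersection(synth.columns):
        stat, pvalue = ks_2samp(real[col].dropna(), synth[col].dropna())
        rows.append({"column": col, "ks_stat": stat, "p_value": pvalue})
    return pd.DataFrame(rows)


def correlation_gap(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Largest absolute difference between the two correlation matrices."""
    cols = real.columns.intersection(synth.columns)
    return float(np.max(np.abs(real[cols].corr().values - synth[cols].corr().values)))


# Hypothetical usage:
# real = pd.read_csv("real_patients.csv")        # placeholder file
# synth = pd.read_csv("synthetic_patients.csv")  # placeholder file
# print(univariate_report(real, synth))
# print("max correlation gap:", correlation_gap(real, synth))
```

Low KS statistics and a small correlation gap do not prove the synthetic data is useful for every downstream task, but they are a cheap first filter before heavier, task-specific evaluation.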
FedCAP: Robust Federated Learning via Customized Aggregation and Personalization https://ift.tt/CepMvdb arXiv:2410.13083v1 Announce Type: new Abstract: Federated learning (FL), an emerging distributed machine learning paradigm, has been applied to various privacy-preserving scenarios. However, due to its distributed nature, FL faces two key issues: the non-independent and identical distribution (non-IID) of user data and vulnerability to Byzantine threats. To address these challenges, in this paper, we propose FedCAP, a robust FL framework against both data heterogeneity and Byzantine attacks. The core of FedCAP is a model update calibration mechanism to help a server capture the differences in the direction and magnitude of model updates among clients. Furthermore, we design a customized model aggregation rule that facilitates collaborative training among similar clients while accelerating the model deterioration of malicious clients. With a Euclidean norm-based anomaly detection mechanism, the server can quickly identify and permanently remove malicious clients. Moreover, the impact of data heterogeneity and Byzantine attacks can be further mitigated through personalization on the client side. We conduct extensive experiments, comparing multiple state-of-the-art baselines, to demonstrate that FedCAP performs well in several non-IID settings and shows strong robustness under a series of poisoning attacks. via cs.LG updates on arXiv.org https://ift.tt/QwsCkH2 October 18, 2024 at 05:00AM
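The abstract mentions a Euclidean norm-based anomaly detection mechanism on the server. The snippet below is not FedCAP's actual rule (consult the paper for that); it is only a generic illustration of the idea: flag clients whose update norm deviates strongly from the robust center of the norms across clients. The threshold, the MAD-based spread estimate, and the flattened-update representation are assumptions.

```python
# Generic illustration (not the paper's exact rule): flag client model updates
# whose Euclidean norm is far from the median norm across all clients.
import numpy as np


def flag_suspicious_clients(updates: dict[str, np.ndarray], tol: float = 3.0) -> list[str]:
    """Return client ids whose update norm is more than `tol` MADs from the median."""
    norms = {cid: float(np.linalg.norm(u)) for cid, u in updates.items()}
    values = np.array(list(norms.values()))
    median = np.median(values)
    mad = np.median(np.abs(values - median)) + 1e-12  # robust spread estimate
    return [cid for cid, n in norms.items() if abs(n - median) / mad > tol]


# Hypothetical usage with flattened parameter deltas per client:
# suspects = flag_suspicious_clients({"client_a": delta_a, "client_b": delta_b})
```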
Continual Learning: A Primer https://ift.tt/Es3IJk9
Plus paper recommendations

Training large language models currently costs somewhere between $4.3 million (GPT-3) and $191 million (Gemini) [1]. As soon as new text data is available, for example through licensing agreements, re-training with this data can improve model performance. However, at these costs (and not just at these levels; which company has $1 million to spare for just the final training run, not to speak of the preliminary experiments before it?), frequent re-training from scratch is prohibitively expensive.

Photo by Dan Schiumarini on Unsplash

This is where continual learning (CL) comes in. In CL, data arrives incrementally over time and cannot be (fully) stored. The machine learning model is trained solely on the new data; the challenge is catastrophic forgetting: performance on old data drops. The performance drops because the model adapts its weights to the current data only, as there is no incentive to retain information gained from previous data. To combat forgetting and retain old knowledge, many methods have been proposed. These methods can be grouped into three central categories*: rehearsal-based, regularization-based, and architecture-based. In the following sections, I will detail each category and introduce select papers to explore further. While I focus on classification problems, all covered ideas are largely equally valid for, e.g., regression tasks, but might require adaptations. At the end, I recommend papers to further explore CL.

Rehearsal-based methods
Schematic view of the rehearsal-based category. Besides original data from the current task, data from old tasks is replayed from a small memory buffer. Image by the author.

Methods from the rehearsal-based category (also called memory- or replay-based) maintain an additional small memory buffer. This buffer can either store samples from old tasks or hold generative models. In the first case, the stored samples can be real samples [2], synthetic samples [3], or merely feature representations of old data [4]. As the memory size is commonly limited, the challenge is which samples (or features) to store and how to best exploit the stored data. Strategies range from storing samples that are most representative of a data class (say, the most average cat image) [5] to ensuring diversity [6]. In the second case, the additional memory buffer is used to store one or more generative models. These models are maintained alongside the main neural network and are trained to generate task-specific data. After training, these models can dynamically be queried for data of tasks that are no longer available. The generative networks are usually GANs (e.g., [7]) or VAEs (e.g., [8]). In both cases, the replayed data is mostly combined with the current task's data to perform joint training, though other variants exist (e.g., [9]); a simplified replay-buffer sketch follows this excerpt.

Architecture-based methods
Schematic...
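As a concrete (if simplified) picture of the rehearsal idea described above, the sketch below keeps a fixed-size memory via reservoir sampling and mixes stored samples into each new training step. It illustrates the category only and is not a reimplementation of any cited paper; the buffer size, sample count, and the loss function in the usage comment are assumptions.

```python
# Simplified rehearsal/replay buffer using reservoir sampling.
# Illustrates the category only; not a reimplementation of any cited method.
import random


class ReplayBuffer:
    def __init__(self, capacity: int = 500):
        self.capacity = capacity
        self.buffer = []   # stored (x, y) pairs from past tasks
        self.seen = 0      # total number of samples observed so far

    def add(self, x, y):
        """Reservoir sampling: every observed sample has equal keep-probability."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((x, y))
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.buffer[idx] = (x, y)

    def sample(self, k: int):
        """Draw up to k stored samples from earlier tasks."""
        return random.sample(self.buffer, min(k, len(self.buffer)))


# Hypothetical training loop, mixing replayed samples into each step:
# buffer = ReplayBuffer(capacity=500)
# for x, y in current_task_loader:
#     old = buffer.sample(k=32)                 # samples from earlier tasks
#     loss = loss_fn(model, [(x, y)] + old)     # joint loss on new + old data
#     buffer.add(x, y)
```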
✌️☀️Ranking algorithm development and evaluation: click analyses vs. manual ratings👇

The paper "Large-Scale Validation and Analysis of Interleaved Search Evaluation" compares the quality of relevance evaluation in the course of ranking algorithm development via clicks versus manual evaluations, e.g. by quality raters.

Interleaving is a method used in information retrieval to compare the effectiveness of different search algorithms by interleaving their results into a single ranked list presented to users. This technique allows researchers and practitioners to evaluate which search algorithm better satisfies user needs based on actual user interactions, such as clicks, rather than relying solely on theoretical or simulated measures of effectiveness.

The study's findings demonstrate that the interleaving method is highly sensitive, capable of detecting differences in search quality with relatively small sample sizes. This contrasts with absolute metrics, which often require significantly larger data sets to reliably identify any differences. One key advantage of interleaving over absolute measurements arises from two aspects of its experimental design: first, interleaving exhibits increased sensitivity because it constitutes a paired test in terms of both queries and users. Second, by prompting a direct expression of preference between two options, interleaving measures differences in relevance more directly and reliably.

In evaluating the value of a click compared to a manually judged query, the analysis indicated that approximately ten interleaved queries with clicks provide evaluation power equivalent to one manually judged query. Assuming each manually judged query requires at least five document evaluations, the feedback from two interleaved queries equates to at least one evaluated document. However, the consistency of these relationships across different document collections and user populations remains unclear.

Perhaps interesting for Hanns Kronenberg, Marcus Tandler, Marie Haynes, Michael King, Kevin Indig ...

🔔Click the bell or follow me for more about Semantic SEO, E-E-A-T, Content Marketing and (Thought)Leadership.
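To make the mechanics concrete, here is a minimal sketch of team-draft interleaving, one common variant: the two rankers take turns contributing their next unused result, and clicks are later credited to the ranker that contributed the clicked document. This is a generic illustration, not the exact procedure or crediting scheme evaluated in the paper.

```python
# Minimal team-draft interleaving sketch (generic illustration only).
import random


def team_draft_interleave(ranking_a, ranking_b, depth=10):
    """Merge two rankings; remember which ranker contributed each document."""
    interleaved, credit, used = [], {}, set()
    while len(interleaved) < depth:
        # Randomize which ranker picks first in each round to avoid position bias.
        for team, ranking in random.sample([("A", ranking_a), ("B", ranking_b)], 2):
            doc = next((d for d in ranking if d not in used), None)
            if doc is None:
                continue
            interleaved.append(doc)
            credit[doc] = team
            used.add(doc)
            if len(interleaved) >= depth:
                break
        if all(d in used for d in ranking_a) and all(d in used for d in ranking_b):
            break
    return interleaved, credit


def score_clicks(clicked_docs, credit):
    """Count clicks per ranker; the ranker with more credited clicks 'wins' the query."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in credit:
            wins[credit[doc]] += 1
    return wins
```

Aggregating these per-query wins over many interleaved queries is what gives the paired, preference-based comparison the post describes.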
A Note on Shumailov et al. (2024): `AI Models Collapse When Trained on Recursively Generated Data' https://ift.tt/pXjIaDg arXiv:2410.12954v1 Announce Type: new Abstract: The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate the effects of fitting a distribution (through Kernel Density Estimation, or KDE) or a model to the data, followed by repeated sampling from it. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the outcomes reported are a statistical phenomenon and may be unavoidable. via cs.LG updates on arXiv.org https://ift.tt/QwsCkH2 October 18, 2024 at 05:00AM
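The fit-then-resample loop the note studies is easy to reproduce in a toy setting. The sketch below is an illustration, not the paper's experiment: it repeatedly fits a Gaussian KDE to the previous generation's samples and resamples from it, and printing summary statistics per generation makes the purely statistical drift across generations visible. Sample size, generation count, and the initial distribution are arbitrary choices.

```python
# Toy fit-and-resample loop (illustration only, not the paper's experiment):
# repeatedly fit a KDE to the previous generation's samples, then resample.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)  # "real" data, generation 0

for generation in range(1, 11):
    kde = gaussian_kde(samples)          # fit a density to the current samples
    samples = kde.resample(1000)[0]      # draw the next generation from that fit
    print(f"gen {generation:2d}  mean={samples.mean():+.3f}  std={samples.std():.3f}")
```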
Voice RAG for Structured and Unstructured Data https://ift.tt/cmzuFT0
Voice RAG with GPT-4o Realtime for Structured and Unstructured Data
All about the GPT-4o Realtime API and a step-by-step guide to implementing Voice RAG, with a practical Python example

Photo from oneusefulthing.org
The accompanying code for this tutorial is: here

Introduction
In the ever-evolving landscape of artificial intelligence, the introduction of the Azure OpenAI GPT-4o Realtime API marks a significant milestone. As an AI enthusiast and developer, I was thrilled to explore this cutting-edge technology and its potential applications. This blog delves into the intricacies of the GPT-4o Realtime API, exploring its features, capabilities, and practical uses. Whether you're a seasoned developer or an AI enthusiast, this comprehensive guide will give you a detailed understanding of how to leverage the GPT-4o Realtime API to create immersive, real-time speech-to-speech experiences.

What's GPT-4o Realtime? 🤔
The GPT-4o Realtime API is designed to enable developers to build low-latency, multimodal experiences in their applications. Imagine having natural, seamless conversations with AI-powered voice assistants that understand and respond in real time. Unlike traditional methods that required multiple models to handle speech recognition, text processing, and speech synthesis, the GPT-4o Realtime API streamlines the process into a single API call, significantly reducing latency and improving the naturalness of interactions.

Demo 😱

Key Features and Capabilities

Low-Latency Speech-to-Speech Interactions
The GPT-4o Realtime API supports fast, real-time speech-to-speech interactions. This is achieved through a persistent WebSocket connection that allows asynchronous, streaming communication between the user and the model. This setup ensures that responses are generated quickly, maintaining the flow of natural conversation; a minimal connection sketch follows this excerpt.

Multimodal Support
The API can handle various input and output modalities, including text, audio, and function calls. This flexibility allows developers to create rich, interactive experiences that respond to user inputs in multiple formats.

Function Calling
One of the standout features of the GPT-4o Realtime API is its support for function calling. This enables voice assistants to perform actions or retrieve context-specific information based on user requests. For example, a voice assistant could place an order or fetch customer details to personalize responses.

Voice Activity Detection (VAD)
The API includes advanced voice activity detection capabilities, which automatically handle interruptions and manage the flow of conversation. This ensures that the system can respond appropriately to user inputs without unnecessary delays.

Integration with Existing Tools
The GPT-4o Realtime API is designed to work seamlessly with existing tools and services. For instance, it can be integrated with Twilio's Voice APIs to build and...
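As a rough picture of the persistent WebSocket session described above, here is a hedged sketch using the `websockets` library. The endpoint URL, headers, and event payloads follow the publicly documented OpenAI Realtime API at the time of writing and should be treated as assumptions (the Azure-hosted variant uses a resource-specific URL and an api-key header, and newer `websockets` versions rename `extra_headers` to `additional_headers`); the accompanying code linked in the post is the authoritative, full example.

```python
# Hedged sketch of the persistent Realtime WebSocket session. Assumptions:
# endpoint URL, header names, and event shapes follow the public Realtime API
# docs; the Azure-hosted endpoint and auth differ. Verify before use.
import asyncio
import json
import os

import websockets


async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"  # assumed endpoint
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # assumed beta header
    }
    # Note: newer websockets versions use `additional_headers` instead of `extra_headers`.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session: output modalities, voice, and server-side VAD.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",
                "turn_detection": {"type": "server_vad"},
            },
        }))
        # Ask the model to respond (audio would normally be streamed in first).
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))  # e.g. response.audio.delta, response.done
            if event.get("type") == "response.done":
                break


asyncio.run(main())
```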
What can be measured when it comes to making your growth campaigns more targeted and effective? This short extract from my book on data-driven SEO shows my thinking...

Even with the best will to collect the data, not everything worth measuring can be measured. Although this is likely to be true of all marketing channels, not just SEO, it's not the greatest reason for data scientists not to move into SEO. If anything, I'd argue the opposite, in the sense that many things in SEO are measurable and that SEO is data rich. There are things we would like to measure, such as:

• Search query: Google, for some time, has been hiding the search query detail of organic traffic, for which the keyword detail in Google Analytics is shown as "Not Provided." Naturally, this would be useful, as there is a many-keywords-to-one-URL relationship, so getting the breakdown would be crucial for attribution-modeling outcomes such as leads, orders, and revenue.

• Search volume: Google Ads does not fully disclose search volume per search query. The search volume data for long-tail phrases provided by Ads is reallocated to broader matches because it's profitable for Google to encourage users to bid on these terms, as there are more bidders in the auction. Google Search Console (GSC) is a good substitute (a query sketch follows below), but it is first-party data and is highly dependent on your site's presence for your hypothesis keyword.

• Segment: This would tell us who is searching, not just the keyword, which of course would in most cases vastly affect the outcomes of any machine-learned SEO analysis, because a millionaire searching for "mens jeans" would expect different results from another user of more modest means. After all, Google is serving personalized results. Not knowing the segment simply adds noise to any SERPs model or otherwise.

https://amzn.eu/d/357QNmP
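When GSC is used as the substitute described above, query and impression data can be pulled programmatically through the Search Analytics endpoint. The sketch below uses google-api-python-client; the property URL, date range, and credentials file are placeholders, and field names should be checked against the current Search Console API reference.

```python
# Hedged sketch: pulling query/page data from Google Search Console's
# Search Analytics API. Property URL, dates, and the credentials file are
# placeholders; verify field names against the current API reference.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)  # placeholder credentials file
service = build("searchconsole", "v1", credentials=creds)

response = service.searchanalytics().query(
    siteUrl="https://www.example.com/",  # placeholder property
    body={
        "startDate": "2024-09-01",
        "endDate": "2024-09-30",
        "dimensions": ["query", "page"],
        "rowLimit": 500,
    },
).execute()

for row in response.get("rows", []):
    query, page = row["keys"]
    print(query, page, row["clicks"], row["impressions"], row["position"])
```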
Is your content not ranking or getting any clicks? Here are a few things to check if that's the case:

1. GSC to see if it's been indexed
- Manually inspect your URL
- Check the coverage report
- Request your URL to be crawled

2. Plug your URL in as a page filter in GSC
- Are relevant queries showing?
- Are any queries showing?
- Are other URLs showing up for those queries?
- Are there any impressions?
- If no queries, then Google likely doesn't find your content relevant
- Or it's a cannibalization issue

3. Does your page have any internal links?
- Check if it's orphaned (a small check for this is sketched after the list)
- Check the anchor text used
- Add links from other relevant pages

4. Is it relevant to the query?
- Is it helpful for the searcher?
- Does it answer their question?
- Is it filled with fluff?
- Is it too salesy?

5. Is it the same quality as, or better than, what's out there?
- Does it match the quality of other articles?
- Are you adding anything new to the conversation?
- Add images, data, insights, quotes, etc.

6. Is it AI-generated without any human edits?
- Rewrite it
- Don't worry about AI detection; worry about quality

It really depends on why your page isn't ranking, but relevancy, quality, and intent seem to be the reason 75% of the time. The remaining 25% could be due to indexing issues. If it's page-specific, then it's probably a content issue... if it's site-wide, then it's likely a technical SEO/content audit issue.
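For point 3 above, here is a small, hedged sketch of one way to spot an orphaned page: fetch a set of known pages and check whether any of them link to the target URL. The target URL and the candidate page list are placeholders (in practice you might feed it URLs from your sitemap or a crawler export).

```python
# Hedged sketch: check whether a target URL receives internal links from a set
# of pages (one way to spot an orphaned page). URLs below are placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def pages_linking_to(target_url: str, candidate_pages: list[str]) -> list[str]:
    """Return the candidate pages that contain a link to target_url."""
    linking = []
    for page in candidate_pages:
        html = requests.get(page, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        hrefs = {urljoin(page, a.get("href", "")) for a in soup.find_all("a")}
        if target_url in hrefs:
            linking.append(page)
    return linking


# Hypothetical usage with a handful of known pages (e.g., from your sitemap):
# print(pages_linking_to("https://www.example.com/new-post/",
#                        ["https://www.example.com/", "https://www.example.com/blog/"]))
```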
Knowledge graphs empower machines to extract meaningful knowledge from data by presenting information in a machine-readable format. 💡But did you know you can also create a “content” knowledge graph that is particularly useful for SEO initiatives? Although structured like a general knowledge graph, a content knowledge graph functions as a representation of the entities on your website and the relationships between them. Read our article to learn how a knowledge graph is structured & how the entities on your website can contribute to your own content knowledge graph ⬇️ https://bit.ly/49787ST #KnowledgeGraph #SemanticSearch #GenerativeAI
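As a small illustration of what "the entities on your website and the relationships between them" can look like in practice, the snippet below builds schema.org JSON-LD for an article that is about one entity and mentions another. The types come from schema.org; the headline, names, and URLs are invented for the example and are not from the linked article.

```python
# Illustrative only: schema.org JSON-LD describing an article and the entities
# it relates to. Types come from schema.org; names and URLs are invented examples.
import json

article_markup = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "What Is a Content Knowledge Graph?",  # example headline
    "about": {
        "@type": "Thing",
        "name": "Knowledge graph",
        "sameAs": "https://en.wikipedia.org/wiki/Knowledge_graph",
    },
    "mentions": [
        {"@type": "Organization", "name": "Example Corp", "url": "https://www.example.com"},
    ],
}

# Typically embedded in the page as <script type="application/ld+json">:
print(json.dumps(article_markup, indent=2))
```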
Snowflake Makes Strategic Investment in LLM Developer Mistral AI

Snowflake, a data cloud company, has made a strategic investment in the first round of venture capital for Mistral AI, a developer of large language models (LLMs), and announced the launch of a new Snowflake solution that uses Mistral AI tech, which allows users to leverage their enterprise data in "a wide range of use cases."

Through the multi-year partnership, Mistral AI and Snowflake said they will deliver capabilities for enterprises to tap into LLMs. According to Snowflake, "any user with SQL skills can leverage smaller LLMs to cost-effectively address specific tasks such as sentiment analysis, translation, and summarization in seconds."

"For more complex use cases, Python developers can go from concept to full-stack AI apps such as chatbots in minutes, combining the power of foundation LLMs — including Mistral AI's LLMs in Snowflake Cortex — with chat elements," according to the announcement.

As of January 2024, users at 691 of the 2023 Forbes Global 2000 companies are Snowflake Data Cloud customers, the company said.

https://lnkd.in/eKNyiTf5
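The "any user with SQL skills" claim maps onto Snowflake's Cortex SQL functions. The snippet below is a hedged sketch calling those functions from Python via the Snowflake connector; the Cortex function and model names follow Snowflake's public documentation at the time of writing and should be verified, and the connection parameters, table, and column are placeholders.

```python
# Hedged sketch: calling Snowflake Cortex LLM functions over the Python connector.
# Function/model names follow Snowflake's public Cortex docs (verify against the
# current documentation); account, user, warehouse, table, and column are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # placeholder
    user="my_user",         # placeholder
    password="***",         # placeholder
    warehouse="my_wh",      # placeholder
)

with conn.cursor() as cur:
    # Sentiment scoring on a column, entirely in SQL (placeholder table/column):
    cur.execute("""
        SELECT review_text,
               SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment
        FROM reviews
        LIMIT 10
    """)
    for text, sentiment in cur.fetchall():
        print(sentiment, text[:60])

    # Free-form completion with a Mistral model hosted in Cortex:
    cur.execute(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large', 'Summarize: ' || %s)",
        ("Snowflake has invested in Mistral AI ...",),
    )
    print(cur.fetchone()[0])

conn.close()
```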
🚀 Exciting Read Alert! Dive into our latest article by the brilliant Emilia Gjorgjevska 🧙♀️ 😎

In this piece, Emilia explores the intersection of #Artificial #Intelligence and #SEO, shedding light on how to build #trust in these rapidly evolving fields. Whether you're a digital marketer, SEO specialist, or tech enthusiast, this article is packed with valuable insights on:

📑 The role of AI and #semantic #knowledge in enhancing SEO strategies and organizing information.
📈 The importance of maintaining high #data #quality, seamless data integration, and ethical practices in AI.
👥 How to leverage AI for better #user #experience, #content #creation, and ensuring #data #privacy and #security.

Don't miss out on this opportunity to stay ahead of the curve and learn how to effectively integrate AI into your SEO efforts. Click the link below to read the full article and elevate your #digital #marketing game! 📈

👉 Read the article: https://wor.ai/aqZ3cU