One question, which might become relevant for the current "Death of Software" debate, is to what lengths OpenAI and other foundation model providers will go to obtain more training data once they've scraped and used most publicly available data on the Internet (which is likely the case soon, if it isn't already). Deals with large book publishers and movie rights owners seem like a logical next step. How about internal documents from enterprises? Or large amounts of data from e.g. hospitals, insurers, law firms, or accounting firms to improve model performance in those areas? The assumption is, of course, that personally identifiable data will be removed before training. If it turns out to be too hard to buy this kind of data – and if it's important for e.g. OpenAI to get better at healthcare – they could simply buy a few hospitals. Curious to hear if people think this is a realistic scenario.
The future is all about how useful the data is, not only how much of it you have (signal/noise ratio). The most valuable AI company of this decade is not yet known. Scraping and getting access to public data is not the way to obtain high-quality data. The interaction and feedback loops with human intelligence are. And there are multiple ways to build tools for that. There are already much better (and more productive) ways to use language models outside the usual 'chat' environment. None of them is made by OpenAI. And they don't have access to the valuable data produced, as it is not text but structured knowledge and metadata that can be used during training and fine-tuning. There were so many tech 'revolutions' in the past. What makes us think that OpenAI will remain the market leader? Too much money in a company can kill innovation by making it focus too much on monetization. We have really only just started with AI.
Are we not already seeing organisations grappling with this very problem, by issuing updated T&Cs that state they can use your data for training LLMs? I’m sure there was a backlash against this only recently, but the company that did it escapes me. It was somebody big like Adobe, or Slack. There’s also the danger that LLMs start “eating” their own regurgitations, in the form of synthetic data. Ed Zitron wrote a scathing piece about that idea only recently too.
Except, I don't think accessing patient records for the purpose of training AI could be considered a "valid reason." Though, I am not a HIPAA expert. "Accessing a patient’s chart without a valid reason can lead to serious consequences, primarily due to violations of the Health Insurance Portability and Accountability Act (HIPAA) in the United States."
Interesting thoughts. Maybe you don't have to buy several hospitals. Maybe one or two established software providers, or a bunch of small start-ups in one or more specific verticals, will be enough to acquire the necessary data. It is difficult to say whether OpenAI will prefer to be the gateway to vertical AI solutions (e.g. à la Marketplace) in the future and let all other providers survive and earn money, or whether it will become the all-embracing god-mode AI monopoly itself.
Without acquiring proprietary data, their lead over competitors might shrink, as everyone will have access to the same data. So, there is no doubt that they will go on a shopping spree (similar to the agreement with Reddit). I doubt they want to acquire large non-tech businesses, as running them would add unnecessary complexity. However, there may be older SaaS businesses that have had a hard time innovating but are sitting on a large amount of data. This might be an exit opportunity for them, even if the only value is the data. Then comes the question: how good are the unit economics for LLMs if the price for access keeps dropping and they have to keep going on data shopping sprees to stay ahead? Will they look more like content businesses?
It appears that, at least for selling to enterprises, the origin of training data and licensing are becoming more and more relevant to comply with copyright law. This remains a significant question mark until a few lawsuits, like Getty vs. Stability AI, have been decided. If these are decided in favour of copyright owners, it might trigger a race to see who can scale the licensing of large amounts of data the fastest, regardless of whether that data is publicly available or not.
No need to buy hospitals: the market for this kind of data flow already exists, and there are companies specializing in it. E.g. https://dandelionhealth.ai/
You can get a lot of data without having to actually buy the whole thing. Plus, they don't need the actual data of patient record 2G4F but can do just as well with synthetic data that looks just like the data for patient 2G4F. That's a more likely scenario IMHO.
It’s exactly why my thesis is data infrastructure > AI for most industrial B2B (ip AEC) for the next few years
It's time to build 🏗️
For foundation models imho the main ways to get better fast re data are:
- access to proprietary data
- curation of data (not all data is good, most is counter-productive)
- generation of synthetic data
imho a bit counterintuitive: the 2nd and 3rd are undervalued and where most of the alpha is. Why? Proprietary data sounds good, but when you zoom in on a concrete problem domain, important data is rarely unobtainable and proprietary data is rarely better than non-proprietary data. There are counter-examples in niches, but these don't matter as much for foundation model training aimed at increasing the models' reasoning capability. hth