Come on over to our booth to grab some delicious Data fortune cookies and pick up a fun DatologyAI-branded fidget cube! You can find us at booth 303, right next to the entrance. We can't wait to see you!! #NeurIPS2024
DatologyAI
Technology, Information and Internet
Redwood City, California 2,082 followers
better data, better models, better business
About us
DatologyAI builds tools to automatically select the best data on which to train deep learning models. Our tools leverage cutting-edge research—much of which we perform ourselves—to identify redundant, noisy, or otherwise harmful data points. The algorithms that power our tools are modality-agnostic—they’re not limited to text or images—and don’t require labels, making them ideal for realizing the next generation of large deep learning models. Our products allow customers in nearly any vertical to train better models at lower cost.
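To make the idea of label-free, modality-agnostic curation concrete, here is a minimal sketch of one generic ingredient of such pipelines: embedding-based redundancy filtering. It is purely illustrative and is not DatologyAI's actual algorithm; the function name, the 0.95 threshold, and the toy vectors are invented for the example, and the only assumption is that each training example has already been embedded into a vector.

```python
# Hypothetical sketch of embedding-based redundancy filtering -- not
# DatologyAI's actual algorithm. It drops examples whose embeddings are
# nearly identical to an example already kept, which works for any
# modality that can be embedded and needs no labels.
import numpy as np

def filter_redundant(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of examples to keep, greedily dropping near-duplicates.

    embeddings: (n, d) array of L2-normalized example embeddings.
    threshold: cosine similarity above which two examples count as redundant.
    """
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        # Compare against everything we have already decided to keep.
        if kept and np.max(embeddings[kept] @ emb) > threshold:
            continue  # too similar to a kept example -> treat as redundant
        kept.append(i)
    return kept

# Toy usage: three vectors, the second is a near-copy of the first.
vecs = np.array([[1.0, 0.0], [0.999, 0.045], [0.0, 1.0]])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print(filter_redundant(vecs))  # -> [0, 2]
```

Because a filter like this operates on embeddings rather than raw text or pixels, the same logic can be applied to any modality and never needs labels, which is the property the description above emphasizes.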
- Website: www.datologyai.com
- Industry: Technology, Information and Internet
- Company size: 2-10 employees
- Headquarters: Redwood City, California
- Type: Privately Held
- Founded: 2023
Locations
- Primary: 1001 Main St, Redwood City, California 94063, US
Updates
-
We’re thrilled to announce that DatologyAI will be at NeurIPS 2024 in Vancouver, Canada! 🎉 Visit us at Booth 303 from December 10-12 to learn more about how we’re revolutionizing data curation. If you’re passionate about unlocking the power of your data or just curious about what we do, we’d love to meet you! Let’s talk about how data curation can transform your AI initiatives. See you! #NeurIPS2024 #DataCuration #DatologyAI
-
DatologyAI reposted this
Nothing matters more for a startup's success than its ability to ship quickly. And DatologyAI has been shipping amazingly quickly. Congratulations to the team on another incredible release! 🚀 https://lnkd.in/ggvNwAxd
-
Starting your Thanksgiving holiday with some fresh-out-of-the-oven DatologyAI Data:
Two weeks ago, we at DatologyAI released our first results demonstrating massive gains from data curation on contrastive image-text models. Today, I'm incredibly excited to share our newest results, applying our curation pipeline to LLMs. It's absolutely astonishing to see what a small, incredibly talented group of individuals can accomplish, and boy have we cooked! Starting with an exact-deduplicated version of Red Pajama v1 as our baseline and manipulating only the data the model is trained on:

Train Faster -- Training on our curated data reached the same baseline performance 7.7x faster, meaning results cost dramatically less to obtain and iteration speed improves drastically.

Train Better -- Push the frontier of what's possible with a given budget, improving performance by 8.5 absolute percentage points (60.5% Datology vs. 52.0% RPJv1). This isn't just an artifact of the Red Pajama baseline: compared to the strongest publicly curated datasets, DataComp-LM and FineWeb-Edu, our curated data improves performance by 4.4% and 6.1%, respectively.

Train Smaller -- Better data enables you to train smaller models. Reduce cost per query at inference by 2.1x while simultaneously increasing performance over the baseline by 5.7%.

As with our image-text results, we present these results both at a high level (https://lnkd.in/g_hMR5Tx) and in an extremely meaty technical deep dive for all of you who want the nitty-gritty details (https://lnkd.in/gY5tpq3s). We are just getting started on our journey and are so excited about what's in store.

Are you training or customizing your own text models and want to improve performance, training efficiency, and inference efficiency through better data? Get in touch (https://lnkd.in/gSGckr6s)! Are you a data-obsessed researcher, engineer, or somewhere in between who wants to push the bounds of what's possible with better data? We're hiring Members of Technical Staff across a number of roles (https://lnkd.in/gHCwPk8e).
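For readers unfamiliar with the baseline: "exact deduplication" simply means keeping one copy of each literally identical document before training. The sketch below illustrates this generic technique with a content hash; it is not DatologyAI's pipeline, and the exact_dedupe function name and toy corpus are invented for the example.

```python
# Illustrative sketch of exact deduplication of a text corpus: keep one
# copy of each distinct document, detected here via a SHA-256 content hash.
# A generic baseline technique, not DatologyAI's curation pipeline.
import hashlib

def exact_dedupe(documents):
    """Yield each document the first time its exact content is seen."""
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = ["the cat sat", "a dog barked", "the cat sat"]
print(list(exact_dedupe(corpus)))  # -> ['the cat sat', 'a dog barked']
```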
-
Train better, train faster, train smaller with DatologyAI!
Models are what they eat: high-quality data leads to high-quality models, enabling faster training of better models with fewer parameters. However, identifying and curating high-quality data at scale, automatically, is an incredibly challenging problem that requires deep expertise. Our goal at DatologyAI is to make state-of-the-art data curation accessible to anyone who wants to train a model, and we've been hard at work realizing this vision over the last year. On a personal note, I am so proud of the incredible work our small but mighty team has accomplished, and today I'm incredibly excited to share our first set of results at DatologyAI! We focused on contrastive models (à la CLIP) trained on the large-scale DataComp dataset, and the results we've been able to achieve have exceeded our already high expectations!

Train Faster - Training on DatologyAI's optimized dataset, we were able to reach the same performance with up to ~98% less compute, meaning that models cost dramatically less to train and train dramatically faster!

Train Better - Models trained on our optimized data for the same compute budget achieve up to 13 absolute percentage points better performance than models trained on raw data.

Train Smaller - Train models with >60% fewer parameters to better performance by training on our curated data.

Check out our high-level blog post here (https://shorturl.at/jkYqk), and if you're interested in all the nitty-gritty details, check out our technical deep dive here (https://shorturl.at/Mt0k9). We are so excited about these results, and we are just getting started! Stay tuned for more exciting results on text models coming very soon!
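For context on what "contrastive models (à la CLIP)" refers to: such models are trained to pull matched image-text pairs together and push mismatched pairs apart under a symmetric cross-entropy objective. Below is a minimal NumPy sketch of that objective, purely for illustration; the function name, the fixed 0.07 temperature, and the random embeddings are placeholders, and this is not DatologyAI's training code.

```python
# Minimal sketch of a CLIP-style symmetric contrastive objective in plain
# NumPy. Real training uses learned image/text encoders and a learnable
# temperature; the embeddings here are random placeholders.
import numpy as np

def clip_contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over cosine similarities of matched pairs."""
    # Normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # the i-th image matches the i-th text

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)                       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()                   # diagonal = correct pairs

    # Average the image->text and text->image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(clip_contrastive_loss(img, txt))
```

The relevance to data curation is that noisy or mismatched image-text pairs directly corrupt this pairwise objective, which is why better-curated pairs can yield the compute and performance gains described in the post.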
-
DatologyAI reposted this
Incredible results and exciting announcement today from DatologyAI! If you are training or fine-tuning AI models, you can't afford not to use Datology. The gains in speed, cost, performance and efficiency that Datology's platform unlocks are astounding.
DatologyAI’s Image-Text Data Curation: Train Better, Faster, Smaller
datologyai.com
-
DatologyAI reposted this
Check out what I've been working on for the past 6 months! tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models! See this X/Twitter thread for more information as well: https://lnkd.in/g9RwS7uG
-
DatologyAI reposted this
High-quality data leads to better models! At DatologyAI, we've made data curation accessible! Our curation pipeline enables training faster (up to 98% less compute), better (up to 13% higher performance), and smaller (>60% fewer parameters). Check out the blog posts below: https://lnkd.in/dAySHZGK