The Mass Data Fragmentation Cleanup

One of the most inspiring news stories of the past year is the recent launch of The Ocean Cleanup, an ambitious project led by 24-year-old Dutch inventor Boyan Slat. He began his quest in 2013 (when he was 18) to rid our oceans of the millions of tons of plastic waste that clog them and destroy marine life across the planet. Every year, 8 million metric tons of plastic find their way into the ocean, and that figure is expected to grow twentyfold within seven years. There are now an estimated 1.8 trillion pieces of plastic in the Pacific Ocean. Slat's first target is the “Great Pacific Garbage Patch,” a gyre of plastic spanning more than 1.6 million square kilometers, where debris that has washed off land collects in the ocean.

There is a parallel in the IT world. Over the past 50 years, we have created an enormous and uncontrollable sea of data. Some of it is important (e.g., college transcripts and tax records), yet a great deal of it is waste: irrelevant and duplicative files parked on active servers, on storage devices and in the cloud, consuming computing and energy resources. Indeed, market researcher IDC (via StorageReview.com) estimates that 60% of all storage is made up of copies. Former Google CEO Eric Schmidt noted that the amount of data created from the dawn of civilization until 2003 was 5 exabytes. Today we generate roughly 50,000 GB per second -- which works out to more than 4 exabytes, nearly that entire amount, every single day.

The sheer volume of data in computing has grown at a Malthusian rate.

It reminds me of a great quote by the former Microsoft CTO Nathan Myhrvold: “Software is a gas; it expands to fill its container.”

If this trend continues, the annual market for data storage -- including the cloud -- could blow well past $100 billion in the next few years. Today, about 60% of this is primary storage, but over the next few years, secondary storage (storage used for backups, analytics, testing, development, etc.) will make up at least half of it. Yet for all of the time, effort and money spent on data, most of it is not usable in any practical way. It is bottled up in silos that are frequently replicated, unsearchable and poorly used -- all while consuming precious IT and energy resources. In finance, this would fall into the category of stranded assets -- assets that have been written down in value and are turning into liabilities.

We face a challenge much like the plastic stranded in the ocean: we not only have a massive amount of data; we have mass data fragmentation. And mass data fragmentation is not some single heap we can point to and clean up. Imagine if every piece of plastic in the Great Pacific Garbage Patch were evenly distributed across the ocean -- we would never be able to attack the problem. As digital storage grows exponentially, including within the cloud, the problem becomes increasingly difficult to address.

There are three leading causes of mass data fragmentation:

  • Multiple copies of data within IT silos: Even within the same application environments, most organizations still create multiple copies of the same data. For example, it is common to see four different backups of the same data across virtual, physical, database and cloud environments. Despite the existence of deduplication technologies and software, this phenomenon continues today and is only getting worse (a minimal duplicate-detection sketch follows this list).
  • Multiple copies of data because of fragmentation across IT silos: Secondary IT operations such as backups, file sharing/storage, provisioning for test/dev and analytics are done in completely separate, siloed islands that typically don’t share data with one another.
  • Dark data fragmented across locations: There's a vast canyon of corporate and personal data that is unknown to IT leaders, organizations or even consumers, whether it’s on a personal device, on the largest storage arrays in the data center or in popular cloud storage services like Amazon S3.
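To make the “identify the copies” idea concrete, here is a minimal sketch -- not any vendor's actual deduplication engine, and using hypothetical paths -- that fingerprints files by content hash and groups exact duplicates across two storage locations. Production deduplication works at the block or object level and across backup formats, but the principle is the same: content that hashes the same is the same, whichever silo it sits in.

  import hashlib
  import os
  from collections import defaultdict

  def file_fingerprint(path, chunk_size=1 << 20):
      """Return a SHA-256 digest of a file's contents, read in 1 MB chunks."""
      digest = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              digest.update(chunk)
      return digest.hexdigest()

  def find_duplicates(roots):
      """Group files under the given directory trees by identical content."""
      by_hash = defaultdict(list)
      for root in roots:
          for dirpath, _, filenames in os.walk(root):
              for name in filenames:
                  path = os.path.join(dirpath, name)
                  try:
                      by_hash[file_fingerprint(path)].append(path)
                  except OSError:
                      continue  # unreadable file; skip it
      # Keep only content seen more than once: these are the redundant copies.
      return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

  if __name__ == "__main__":
      # Hypothetical silos, e.g. a backup share and a test/dev copy of the same data.
      duplicates = find_duplicates(["/mnt/backup", "/mnt/testdev"])
      wasted = sum(os.path.getsize(p[0]) * (len(p) - 1) for p in duplicates.values())
      print(f"{len(duplicates)} duplicated files, ~{wasted / 1e9:.1f} GB of redundant capacity")

Even a toy script like this, pointed at a couple of real shares, tends to surface a surprising amount of redundant capacity -- which is the point: you cannot consolidate copies you have never enumerated.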

This is not only an operational and cost issue; it is also a huge security hole. How can you protect your data if you do not know what you have and where it is?

And finally, we are moving into an era of artificial intelligence. A technology that can mine greater value from existing data -- such as helping find a cure for a currently incurable cancer -- is an enormous incentive to make better use of the data we have today.

Clearly, the technology and IT industry -- which is so good at generating data -- will begin to falter under its own data gravity unless it implements systems that:

  • Produce fewer copies.
  • Better utilize what exists (resulting in fewer copies).
  • Identify all the sources of data replication and allow us to take overly redundant stores out of service, freeing up capacity for the data that matters.

Most CIOs I know will tell you they spend 80% or more of their resources just keeping the lights on for existing systems and users, leaving only a fraction for innovation. With industries becoming ever more competitive because of digital transformation, spending investment dollars on multiple copies of the same thing looks like a failing strategy. Bold technical leadership will start to move in the other direction.

Or, in the words of economist E. F. Schumacher, “Any intelligent fool can make things bigger and more complex … it takes a touch of genius -- and a lot of courage -- to move in the opposite direction.”




This article originally appeared in Forbes.
