The Mass Data Fragmentation Cleanup

One of the most inspiring news stories of the past year is the recent launch of The Ocean Cleanup, an ambitious project led by 24-year-old Dutch inventor Boyan Slat. He began his quest in 2013 (when he was 18) to rid our oceans of the millions of tons of plastic waste that clog them and destroy marine life across the planet. Every year, 8 million metric tons of plastic find their way into the ocean, and that figure is expected to grow twentyfold within seven years. There are now an estimated 1.8 trillion pieces of plastic in the Pacific Ocean. Slat's first target is the “Great Pacific Garbage Patch,” a gyre of plastic spanning more than 1.6 million square kilometers, where debris that has washed off land collects in the ocean.

There is a parallel in the IT world. Over the past 50 years, we have created an enormous and uncontrollable sea of data. Some of it is important (e.g., college transcripts and tax records), yet a great deal of it is waste: irrelevant and duplicative files parked on active servers, on storage devices and in the cloud, consuming computing and energy resources. Indeed, market researcher IDC (via StorageReview.com) estimates that 60% of all storage is made up of copies. Former Google CEO Eric Schmidt noted that the amount of data created from the dawn of civilization until 2003 was 5 exabytes. Today we generate roughly 50,000 GB per second -- which works out to more than 4 exabytes, nearly that entire amount, every single day.

The sheer volume of data in computing has grown at a Malthusian rate.

It reminds me of a great quote by the former Microsoft CTO Nathan Myhrvold: “Software is a gas; it expands to fill its container.”

If this trend continues, the annual market for data storage -- including the cloud -- could blow well past $100 billion in the next few years. Today, about 60% of this is primary storage, but over the next few years, secondary storage (storage used for backups, analytics, testing, development, etc.) will make up at least half of it. Yet for all of the time, effort and money spent on data, most of it is not usable in any practical way. It is bottled up in silos that are frequently replicated, unsearchable and poorly used -- all while consuming precious IT and energy resources. In finance, this would fall into the category of stranded assets -- assets that have been written down in value and are turning into liabilities.

We face a challenge much like the plastic stranded in the ocean: we not only have a massive amount of data; we have mass data fragmentation. And mass data fragmentation is not some single heap we can point to and clean up. Imagine if every piece of plastic in the Great Pacific Garbage Patch were evenly distributed across the ocean -- we would never be able to attack the problem. As digital storage grows exponentially, including within the cloud, the problem becomes increasingly difficult to address.

There are three leading causes of mass data fragmentation:

  • Multiple copies of data within IT silos: Even within the same application environments, most organizations still create multiple copies of the same data. For example, it is common to see four different backups of the same data across virtual, physical, database and cloud environments. Despite the existence of deduplication technologies and software, this phenomenon continues today and is only getting worse (a minimal duplicate-detection sketch follows this list).
  • Multiple copies of data because of fragmentation across IT silos: Secondary IT operations such as backups, file sharing/storage, provisioning for test/dev and analytics are done in completely separate, siloed islands that typically don’t share data with one another.
  • Dark data fragmented across locations: There's a vast canyon of corporate and personal data that is unknown to IT leaders, organizations or even consumers, whether it’s on a personal device, on the largest storage arrays in the data center or in popular cloud storage services like Amazon S3.
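To make the “identify the copies” idea concrete, here is a minimal sketch -- not any vendor's actual deduplication engine, and using hypothetical paths -- that fingerprints files by content hash and groups exact duplicates across two storage locations. Production deduplication works at the block or object level and across backup formats, but the principle is the same: content that hashes the same is the same, whichever silo it sits in.

  import hashlib
  import os
  from collections import defaultdict

  def file_fingerprint(path, chunk_size=1 << 20):
      """Return a SHA-256 digest of a file's contents, read in 1 MB chunks."""
      digest = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              digest.update(chunk)
      return digest.hexdigest()

  def find_duplicates(roots):
      """Group files under the given directory trees by identical content."""
      by_hash = defaultdict(list)
      for root in roots:
          for dirpath, _, filenames in os.walk(root):
              for name in filenames:
                  path = os.path.join(dirpath, name)
                  try:
                      by_hash[file_fingerprint(path)].append(path)
                  except OSError:
                      continue  # unreadable file; skip it
      # Keep only content seen more than once: these are the redundant copies.
      return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

  if __name__ == "__main__":
      # Hypothetical silos, e.g. a backup share and a test/dev copy of the same data.
      duplicates = find_duplicates(["/mnt/backup", "/mnt/testdev"])
      wasted = sum(os.path.getsize(p[0]) * (len(p) - 1) for p in duplicates.values())
      print(f"{len(duplicates)} duplicated files, ~{wasted / 1e9:.1f} GB of redundant capacity")

Even a toy script like this, pointed at a couple of real shares, tends to surface a surprising amount of redundant capacity -- which is the point: you cannot consolidate copies you have never enumerated.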

This is not only an operational and cost issue; it is also a huge security hole. How can you protect your data if you do not know what you have and where it is?

And finally, we are moving into an era of artificial intelligence. A technology that can mine greater value from existing data -- such as helping find a cure for a currently incurable cancer -- is an enormous incentive to make better use of the data we have today.

Clearly, the technology and IT industry -- which is so good at generating data -- will begin to falter under its own data gravity unless it implements systems that:

  • Produce fewer copies.
  • Better utilize what exists (resulting in fewer copies).
  • Identify all the sources of data replication and allow us to take overly redundant stores out of service, freeing up capacity for the data that matters.

Most CIOs I know will tell you they spend 80% or more of their resources just keeping the lights on for existing systems and users, leaving only a fraction for innovation. With industries becoming ever more competitive because of digital transformation, spending investment dollars on multiple copies of the same thing looks like a failing strategy. Bold technical leadership will start to move in the other direction.

Or, in the words of economist E. F. Schumacher, “Any intelligent fool can make things bigger and more complex … it takes a touch of genius -- and a lot of courage -- to move in the opposite direction.”




This article originally appeared in Forbes.
