DIK(I)W? Start with Data

Seeking wisdom?

You begin with data, organize related data into information, correlate and analyze information into knowledge, and then build insight. In time, you achieve wisdom. [appropriate sound effect – maybe the Windows ta-da?]

 The concept of this flow from...

data->

information->

knowledge->

wisdom.

...is called the DIKW pyramid.  I’ve added that insight part – because I think it takes specificity to reach wisdom.  In other words, wisdom has context that happens after knowledge.

Your goal is wisdom for your organization, and that path to wisdom begins with good, clean, organized, known data for integration between systems.

Data’s one of my favorite things.  When I was in my early 20s, I earned(??) the nickname of Datahead.  The things I could remember and classify, putting everything in its mental place, were, I guess, impressive.

For example, I know my library card number.  Do you? Credit card number and all of the peripherals?

Many of the things we used to memorize are now abstracted; we keep phone numbers in our contacts, and we just need to know how to reference the person. Our phones supply their names when they call us.

But – data.  Today’s topic is data. So, we’ll discuss a few properties about it before we talk – another time – about using and integrating it.

Data Types

We start with the type of the data.

String, Boolean, Numeric data and its subtypes.  BLOB!

I’ll unpack these.

We have to start with BLOB – don’t we? It stands for Binary Large Object. Think of images, video, audio, etc.  Now you know a thing.  BLOB! For a video, you could consume part of it, but to do so, you have to have access to the whole to get your part.  So, the BLOB is considered a piece of data.

Most data stored in a database for use in systems is structured.

Boolean

Boolean – it’s binary.  Yes or no for data if you choose this type.  If data people are lazy, they don’t default a “Yes” or “No,” and you run into a three-condition Boolean, which is a logical nightmare and the cause of not only all the moral problems in the world but also the prevalence of skinks in my yard and black cats in my home. Definitely. Maybe.

So, anyway – the three-condition Boolean tomfoolery is Yes, No, or Unset.  It’s best practice to default to yes or no…because unset is just evil…see skinks, cats.

A Boolean is the answer to any question you can ask a Magic 8 Ball.  You can substitute True for Yes and False for No.  Same thing.  Truly. 
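Here's a minimal sketch in Python of taming the three-condition Boolean. It assumes "unset" shows up as None, and the function name, field name, and rows are all made up for illustration. The point is only that the default gets decided once, on purpose, instead of leaking into every check downstream.

```python
from typing import Optional


def normalize_flag(value: Optional[bool], default: bool = False) -> bool:
    """Collapse a three-state flag (True, False, None) into a real Boolean."""
    # None stands in for "unset"; swap it for the agreed default.
    return default if value is None else value


# Hypothetical rows: the second one never had its flag set.
rows = [
    {"name": "A", "opted_in": True},
    {"name": "B", "opted_in": None},
    {"name": "C", "opted_in": False},
]

cleaned = [normalize_flag(row["opted_in"]) for row in rows]
print(cleaned)
```

Whether the default should be Yes or No is a business decision, not a coding one. The sketch picks No only because something had to be picked.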

Numeric

Numeric data types include integers (and subsets thereof) and floating point numbers – those with decimals.

With numeric datatypes, there are no letters.

Sometimes in data integration or use, numbers are converted to text, and when they are, they’re called strings.  Leading to the string datatype.

String

Alphanumeric text is called a string.  It’s a best practice to give it a length constraint for when it’s a single piece of data.  When it’s something like a document, then an overall size limit may suffice.

I’ll give you an address length sample below.

Date (date/time)

Date formats may vary, so date storage should make clear which time zone applies and which format is in use.

Dates can also be transmitted as strings and reconverted to dates if their format supports that – and the information about that format’s known.

For example, much of the world considers dates as dd/mm/yyyy, where dd = day, mm= month, and yyyy (or yy) = year.  In America, we use mm/dd/yyyy. 

So, if you merely see 01/02/2024, are you working with January 2nd or February 1st?  Need to know – via specification – what the data represents.
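To make the ambiguity concrete, here's a small Python sketch that parses the same string under both conventions using the standard library's datetime.strptime. The variable names are just for illustration.

```python
from datetime import datetime

raw = "01/02/2024"

# Same string, two specifications, two different dates.
as_us = datetime.strptime(raw, "%m/%d/%Y").date()    # month first
as_intl = datetime.strptime(raw, "%d/%m/%Y").date()  # day first

print(as_us.isoformat())    # 2024-01-02
print(as_intl.isoformat())  # 2024-02-01
```

Both parses succeed without complaint, which is the problem: nothing in the data tells you which one is right. The printed form, yyyy-mm-dd, is the ISO 8601 style, and it's one way to transmit dates without the guessing.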

Data Cleanliness and Quality

If data types are clearly articulated in structured data (think database), then we know what to expect for each field and can ensure that structure within storage and rules around what’s stored.

A lot of data, though, is unstructured, and that cleanliness and quality and decisions regarding use have to be made long after the data is stored in its original form.

Think about a driver’s license number.  It’s a string – has a certain length it won’t exceed, but the length may vary by state. If I need to transmit that data somewhere, and it’s expecting one fewer character than my actual driver’s license number, what happens? Knowing the answer, and discussing it as early as possible, lets you decide how to handle it before it becomes a problem.

(Yes, things like this do happen).

Another good example is that US export data lengths were established back in the EDI (electronic data interchange) era.  An address line is 35 characters. Along came the Automated Export System (AES) from the US government, demanding address data.  What’d it allow per line?  32 characters.

What’s the solution in this case?  Truncation.  Send it the first 32.  But decisions like this require discussion and best practices. 
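A minimal Python sketch of that truncation decision follows. The 32-character limit comes from the story above; the function name and the sample address are hypothetical. The sketch also reports whether anything was cut, on the assumption that someone will want to know.

```python
from typing import Tuple

AES_LINE_LIMIT = 32  # per-line limit from the story above


def fit_address_line(line: str, limit: int = AES_LINE_LIMIT) -> Tuple[str, bool]:
    """Return the line cut to `limit` characters and whether anything was lost."""
    trimmed = line.strip()
    return trimmed[:limit], len(trimmed) > limit


# Hypothetical address line that is exactly 35 characters long.
source_line = "1234 North Example Boulevard Ste 56"
sent, was_truncated = fit_address_line(source_line)
print(repr(sent), was_truncated)
```

Here the suite number is what falls off the end. Whether that's acceptable, or whether the address should be abbreviated by a person instead, is exactly the kind of discussion the truncation rule needs.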

For numbers, truncation to the left of any decimals is disastrous. Does your company make 7 million each year or 70 million?
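The same trick applied to a number, sketched in Python. The 7-digit field and the function name are assumptions for illustration, and the sketch only considers non-negative whole numbers. It shows the naive cut first, then a version that refuses to send a value that doesn't fit.

```python
def fit_amount(amount: int, max_digits: int) -> str:
    """Format a non-negative whole amount, refusing to drop digits that don't fit."""
    digits = str(amount)
    if len(digits) > max_digits:
        raise ValueError(
            f"{amount} needs {len(digits)} digits; field allows {max_digits}"
        )
    return digits


revenue = 70_000_000

# Naive truncation to a hypothetical 7-digit field: 70 million becomes 7 million.
print(str(revenue)[:7])

try:
    fit_amount(revenue, max_digits=7)
except ValueError as err:
    print("rejected:", err)
```

For strings, losing the tail is a judgment call. For numbers, rejecting the value and raising the issue is usually the safer default.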

This turned into storytelling with Heather, but hopefully the discussion of cleanliness and quality at time of use is helpful in understanding some challenges in implementation if we’re looking for accuracy.

Data Classification and Labeling

When we store data in files, typically it’s in the form of information – data intermingled together.  In databases, it’s structured for use in a variety of contexts; some data is the result of computations of combined data, logically or mathematically.

We talk about data classification and labeling for use within (or restriction from) AI and also for privacy.  

In other words, classification/labeling are required to keep your HR sensitive information outside of generative AI computations.  No access for you, Microsoft Copilot.
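As a rough illustration of why the labels matter, here's a Python sketch that filters documents by label before anything is handed to an AI tool. The labels, class, and function are all hypothetical, and this isn't how Microsoft Copilot or any specific product enforces access. It only shows the idea that an allow-list needs labels to work from.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    label: str  # hypothetical labels: "public", "internal", "hr-sensitive"


# Only explicitly allowed labels get through.
ALLOWED_FOR_AI = {"public", "internal"}


def ai_eligible(docs):
    """Keep documents with an allowed label; unlabeled ones stay out."""
    return [d for d in docs if d.label in ALLOWED_FOR_AI]


docs = [
    Document("Holiday schedule", "public"),
    Document("Salary review notes", "hr-sensitive"),
    Document("Untitled draft", ""),  # never labeled
]

print([d.title for d in ai_eligible(docs)])
```

Note that the unlabeled draft is excluded too. An allow-list treats "we don't know what this is" the same as "no," which is the safer way to fail.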

Data Bias

A piece of data – a datum – is just a thing. Bias enters in the context of usage and what’s measured/determined and stored in that datum. So, a piece of data is just a representation of a very simple single thing.  

In other words, it’s the key and how we might derive the value. Back to key/value pairs.

What do we do about that?

I saw Phaedra Boinodiris speak last October at the Northwest Arkansas Tech Summit about bias in AI.  Her book can be found here – it’s on my “to acquire and read” list…which is an immense list.

https://www.amazon.com/AI-Rest-Us-Phaedra-Boinodiris/dp/B0C6W1KJ49/

Data, AI.  Yeah – outside of scope here today while we’re hanging out with each individual datum.

Data Privacy

Do we even want to go here today?  It’s the foundation of everything we try to achieve in working with data – ensuring people have access to, and a say in the use of, their own data.  Others are restricted without explicit permission.

So, one paragraph.  If you’d like to follow an expert in the field, look at Brian Blakley.  

The Data of Grammar

I’ll keep this short – this extension of the rather long opening piece.

Grammar has “data types.” So, engineers who are writing, or working to upskill so they can write more, think of it that way.

·         Parts of speech (noun, verb, adjective, adverb, pronoun, preposition)

·         Specialized usage (verbals such as gerunds and participles)

·         Advanced readability concepts – including when to break the rules.  I love sentence fragments because they introduce conversation to writing; we don’t, after all, speak in complete sentences to each other at all times. They only work, though, when you’ve established language mastery.

Some definitely complex rules in here.  Things like American English versus British English.  Organization versus organisation on spelling. Collective nouns in American English that feel “weird” to the rest of the world.

An example is to refer to staff as singular – very American. Staff is versus staff are. A fix? Say "staff members."

Keeping Up Appearances

Erik Boemanns and I recorded a podcast, and that’ll release soon.  We talk about reducing the profitability of cybercrime.  Let’s do THAT, right?

I also wrote about that at Elnion this week, since it’s on my mind and heart. You’ll recognize themes from posts I do here. https://elnion.com/2024/08/05/devaluing-cybercrime/

Another Elnion piece from last week.  I write there once a week. My birthday article – My Library, Your Library - https://elnion.com/2024/07/29/my-library-your-library/

Upcoming?  Tomorrow I’m talking with a group in the morning about my long software career.

Codistac, Redux

About a year ago I procured an office at Missouri State University’s efactory, an amazing entrepreneurial space.  I had grand plans of offering training about cyber hygiene and doing culture work within organizations, especially with HR teams from hiring practices through to knowledge of security awareness at the level we need and really making the case for it.

Efactory offers great training rooms.  I was super excited. After making some inroads and planning, a life event made it difficult to get to that office on a regular basis. That, coupled with expanded work with Missouri Cybersecurity Center of Excellence, is leading me to abandon the office and that business model and refocus Codistac on the 3 exact areas where I shine and where work comes to me.

·         Writing (please act shocked).  Specifically brand-boosting messaging for technology companies that speaks to customers in their language instead of yours.

·         Software requirements, specifically for data integration. Yes, with security considerations in here, much like Moms try to sneak nutrition into good-tasting food.

·         Business strategy work for technology companies.

You can take a look at the redesign at https://www.codistac.com.  I’m seeking business in all of these areas. A 4th and natural extension is software testing – mostly from a quality “how brittle is your happy path” perspective.

Some Quick Asks

Will you follow two pages?  One is Codistac, where I plan to start writing regularly soon.  The other is the Missouri Cybersecurity Center of Excellence – same.

Also, please follow this newsletter.  It arrives every other Tuesday.

I write every day, so please follow me, too, and interact if you see something that's worthy.

If you find my services intriguing, let's talk. I'm part time at Missouri Cybersecurity Center of Excellence and do have some space for additional work at Codistac beyond the clients I'm already serving.

Brian Blakley

Information Security & Data Privacy Leadership - CISSP, FIP, CIPP/US, CIPP/E, CIPM, CISM, CISA, CRISC, Certified CISO

4mo

Heather Noggle, thanks for the mention! Your insights on the DIKIW pyramid are spot-on. As a fellow data privacy advocate, I appreciate your emphasis on data cleanliness and quality, which are vital for protecting personal information. Clean, well-classified data not only aids analysis - but also - as you know - ensures ethical/lawful handling of personal information which builds & maintains trust. Thank you for supporting our community and sharing valuable insights.

spot on, strongly agreed well and defined

April Webster Halden

Cyber Problem Solver, Veteran Supporter, Mama

4mo

Heather Noggle I love your writing. Data, storage, accessibility is key. :) 😍

Chris Marshall

A better way to protect your company against ransomware | Many backups fail to recover. We fix that.

4mo

Great article! When people don’t give careful thought to their data types and structure, bad things can happen. An ounce of prevention…

🔒Ivette B.

Privacy Minded. Innovation Impact Strategist. | Transformative. Empathetic. Strategic. Resilient. Bridge-builder.

4mo

Very insightful. Love the framing of it. Makes sense and a good reminder.... it's how we interact and integrate w IT that matters. 😉🎶🕳🗻🧭🌐
