Big Data: What the Heck are Pig and Hive?

Bernard Marr

📖 Internationally Best-selling #Author🎤 #KeynoteSpeaker🤖 #Futurist💻 #Business, #Tech & #Strategy Advisor

Published Jan 20, 2015

Big data is full of terms which can be confusing if you aren’t an expert in computer science. Luckily, it isn’t vital to know the exact technical details of every framework, protocol, file system and analytical tool to understand how big data can help you solve problems. But an overview of the most common terms is very useful for managers who will most likely want to be in frequent and close communication with technical staff when preparing a data strategy.

In this post I will give an overview of two terms you may come across frequently once you dive into the “nuts and bolts” of big data: Pig and Hive.

(In order to understand this best you might want to read my article on Hadoop first, as both Pig and Hive are technologies which run “on top” of Hadoop).

“Big data” as a concept concerns the collection, storage and analysis of large amounts of information. Pig and Hive both come into play during the last (but by no means least) stage of that process – the analysis stage. This is where your data is “cleaned” of distracting information not vital to the task at hand, and put into a format from which those all-important insights can be gleaned.

In many ways they do the same thing: they are different tools which data specialists will deploy in different circumstances, but usually to achieve the same ends.

Both are open-source – maintained by the Apache Software Foundation – meaning they are completely free to use, and can be modified by anyone to create custom versions suitable to specific tasks.

The principle of Open Source is important – in fact, fundamental – to big data, as it means the technologies are available to anyone, anywhere in the world at very little or indeed no cost.

Companies still make money out of them – by repackaging them with friendly interfaces, manuals and user support, or creating custom versions for specific jobs or industries. For example, Amazon distributes its own version (known as a “fork”, as it branches off from the original code) of Hive as part of its commercial Amazon Web Services offering.

So, what do they actually do? Essentially, they allow programmers to create tools which can query and evaluate enormous quantities of data very quickly and efficiently, even when that data is spread across a distributed network of many storage devices.

As we all know by now (unless you are totally new to the subject of big data, in which case, welcome!), data can be either structured (fits nicely into tables), unstructured (doesn’t) or semi-structured (somewhere in the middle).

Pig and Hive do their work through a function known as MapReduce: they translate a programmer’s instructions into MapReduce jobs which run on Hadoop. This basically makes semi-structured data more structured, so it can be understood and quantified by computers, which find unstructured data very difficult to understand. The results can then be passed on to humans, who will hopefully draw actionable conclusions from them.

MapReduce, as the name implies, is two procedures. The first is mapping, which is reading the data from the database and putting it into a particular format. For example, which of the data in this huge database refer to customer satisfaction ratings? And which refer to the ages of our customers?

The second is reducing – carrying out mathematical operations on the data to come out with a particular figure. In the example above, you would be able to work out what percentage of customers under 35 rated your service highly.
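To make the two steps concrete, here is a minimal sketch in plain Python rather than on Hadoop itself. The customer records, the field names and the threshold for a “high” rating are all invented for illustration, and a real MapReduce job would spread this work across many machines.

```python
from collections import defaultdict

# Hypothetical customer records; the field names are assumptions for this sketch.
records = [
    {"age": 28, "rating": 5},
    {"age": 34, "rating": 2},
    {"age": 41, "rating": 4},
    {"age": 22, "rating": 4},
    {"age": 57, "rating": 5},
]

HIGH_RATING = 4  # assumed threshold for "rated your service highly"


def map_record(record):
    """Map step: turn one raw record into a (key, value) pair."""
    group = "under_35" if record["age"] < 35 else "35_and_over"
    rated_highly = 1 if record["rating"] >= HIGH_RATING else 0
    return group, rated_highly


def reduce_group(values):
    """Reduce step: collapse all values for one key into a percentage."""
    return 100.0 * sum(values) / len(values)


def run(records):
    grouped = defaultdict(list)
    for key, value in map(map_record, records):
        grouped[key].append(value)  # gather the mapped values by key
    return {key: reduce_group(values) for key, values in grouped.items()}


result = run(records)
print(result)
```

The map step only labels each record, and the reduce step only does arithmetic on the labels. That separation is what lets the labelling be shared out across many machines before the figures are combined.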

Hive and Pig both offer systems for carrying out MapReduce functions – or creating software programs capable of automatically completing lots and lots of MapReduce functions, very quickly, and reporting the results to analysts.

Both use their own programming languages, known as Pig Latin and HiveQL, to give programmers quicker access to the MapReduce functions of Hadoop. This means they are great for “modelling” – quickly building simulated systems which can be tinkered with in order to get the output you are looking for (e.g. an answer to a question about your business).

Hive might be a little easier to use because HiveQL is a declarative language based on SQL, which is something many programmers are already familiar with. Pig Latin, on the other hand, is a procedural data-flow language, in which the programmer spells out each step. It might take a little longer to learn but should give programmers more flexibility.
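The difference between the two styles can be sketched in Python. This is only an analogy: the first half uses Python’s built-in sqlite3 module, whose SQL stands in for the declarative style of HiveQL, and the second half uses ordinary step-by-step Python to stand in for the data-flow style of Pig Latin. Neither is actual HiveQL or Pig Latin, and the table name and data are made up.

```python
import sqlite3

# Hypothetical (age, rating) pairs, invented for this sketch.
rows = [(28, 5), (34, 2), (41, 4), (22, 4), (57, 5)]

# Declarative style: describe the result you want and let the engine
# decide how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, rating INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
declarative = conn.execute(
    "SELECT AVG(rating) FROM customers WHERE age < 35"
).fetchone()[0]
conn.close()

# Step-by-step style: spell out each stage of the data flow yourself.
young = [row for row in rows if row[0] < 35]   # 1. filter the records
ratings = [rating for _, rating in young]      # 2. keep only the rating column
stepwise = sum(ratings) / len(ratings)         # 3. aggregate into an average

print(declarative, stepwise)
```

Both halves arrive at the same average. The first is shorter and familiar to anyone who knows SQL, while the second makes it easy to insert an extra step, or inspect the data, at any point along the way.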

So there you have it – Pig and Hive in a nutshell. I have tried to keep it as simple as possible to give an understanding of the actual mechanics of the operations that are taking place, and why they are useful, rather than technical details. If you want a more in-depth understanding then this article is a good place to start.

As always, I hope this was useful. Please let me know if you have any views or comments on the topic or would like to add something to this description.

--------------

I really appreciate that you are reading my post. Here, at LinkedIn, I regularly write about big data as well as management and technology issues and trends. If you would like to read my regular posts then please click 'Follow' and send me a LinkedIn invite. And, of course, feel free to also connect via Twitter, Facebook and The Advanced Performance Institute.

About: Bernard Marr is a globally recognized expert in strategy, performance management, analytics, KPIs and big data. He helps companies and executive teams manage, measure, analyze and improve performance.

His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance

Photo: Shutterstock.com

Ashwin Ittoo, PhD

Result-Driven AI / Digital Transformation | Regulation & Compliance | Full Professor ULiège | Co-Founder AIASHI

Is the shift not towards Spark and Mesos (at the expense of the Hadoop/Hive ecosystems)?

Om Sharma

Nicely presented. Thanks for sharing.

Paul Coetser

|| SAP Whisperer || Design & Delivery Assurance Advisor, Fractional CTO #thesapwhisperer

A most excellent article Bernard. Nicely written for those who don't know Big Data or are not familiar with its terms. Well done mate. I will be writing an article soon as well that sets out how SAP HANA has these functions either natively or supported in its stack. Good work!
