Big Data: What the Heck are Pig and Hive?

Big data is full of terms which can be confusing if you aren’t an expert in computer science. Luckily, it isn’t vital to know the exact technical details of every framework, protocol, file system and analytical tool to understand how big data can help you solve problems. But an overview of the most common terms is very useful for managers who will most likely want to be in frequent and close communication with technical staff when preparing a data strategy.

In this post I will give an overview of two terms you may come across frequently once you dive into the “nuts and bolts” of big data: Pig and Hive.

(In order to understand this best you might want to read my article on Hadoop first, as both Pig and Hive are technologies which run “on top” of Hadoop).

“Big data” as a concept concerns the collection, storage and analysis of large amounts of information. Pig and Hive both come into play during the last (but by no means least) stage of that process – the analysis stage. This is where your data is “cleaned” of distracting information not vital to the task at hand, and put into a format from which those all-important insights can be gleaned.

In many ways they both do the same thing – they are different tools which data specialists will deploy in different circumstances – but usually to achieve the same ends.

Both are open source, maintained by the Apache Software Foundation, meaning they are completely free to use and can be modified by anyone to create custom versions suited to specific tasks.

The principle of Open Source is important – in fact, fundamental – to big data, as it means the technologies are available to anyone, anywhere in the world at very little or indeed no cost.

Companies still make money out of them by repackaging them with friendly interfaces, manuals and user support, or by creating custom versions for specific jobs or industries. For example, Amazon distributes its own version (known as a “fork”, as it branches off from the original code) of Hive as part of its commercial Amazon Web Services offering.

So, what do they actually do? Essentially, they allow programmers to create tools which can query and evaluate enormous quantities of data held on distributed networks (where the data is spread across many storage devices), and to do so very quickly and efficiently.

As we all know by now (unless you are totally new to the subject of big data, in which case, welcome!), data can be structured (fits nicely into tables), unstructured (doesn’t) or semi-structured (somewhere in the middle).

Pig and Hive work by turning a programmer’s instructions into jobs for MapReduce, the processing framework built into Hadoop. Among other things, this can be used to make semi-structured data more structured, so it can be understood and quantified by computers. Computers find unstructured data very difficult to understand. The structured results can then be passed on to humans, who will hopefully draw actionable conclusions from them.
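To make the idea of "more structured" concrete, here is a minimal sketch in plain Python (not Pig or Hive). The log lines, the field names and the to_row helper are all made up for illustration; the point is only that loosely formatted text ends up as rows with the same columns every time.

```python
import re

# Hypothetical semi-structured input: each line has a date followed by
# key=value pairs, in no fixed order, and some pairs may be missing.
raw_lines = [
    "2015-03-01 customer=17 age=28 rating=5",
    "2015-03-01 customer=23 rating=4 age=41",
    "2015-03-02 customer=31 age=33",
]

FIELD = re.compile(r"(\w+)=(\d+)")


def to_row(line):
    """Turn one loosely formatted line into a row with fixed columns."""
    date, _, rest = line.partition(" ")
    fields = {key: int(value) for key, value in FIELD.findall(rest)}
    return {
        "date": date,
        "customer": fields.get("customer"),
        "age": fields.get("age"),
        "rating": fields.get("rating"),  # None when the line has no rating
    }


# Every row now has the same four columns, so it fits neatly into a table.
table = [to_row(line) for line in raw_lines]
```

Once every record has the same columns, questions such as "what is the average rating?" become straightforward to ask.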

MapReduce, as the name implies, consists of two procedures. The first is mapping, which means reading the raw data and putting the relevant parts of it into a particular format. For example, which of the data in this huge database refer to customer satisfaction ratings? And which refer to the ages of our customers?

The second is reducing – carrying out mathematical operations on the data to come out with a particular figure. In the example above, you would be able to work out what percentage of customers under 35 rated your service highly.
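The two steps can be sketched in a few lines of plain Python. This is a toy illustration that runs on one machine, not how Hadoop itself is programmed; the survey records, the field names and the threshold for "rated highly" are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical survey records; the field names are assumptions.
records = [
    {"customer": "a", "age": 28, "rating": 5},
    {"customer": "b", "age": 41, "rating": 4},
    {"customer": "c", "age": 33, "rating": 2},
    {"customer": "d", "age": 30, "rating": 4},
    {"customer": "e", "age": 52, "rating": 5},
]

HIGH_RATING = 4  # assumed cut-off for "rated your service highly"


def map_phase(record):
    """Map: emit a (key, value) pair of age group and a 1/0 'rated highly' flag."""
    group = "under_35" if record["age"] < 35 else "35_and_over"
    return group, 1 if record["rating"] >= HIGH_RATING else 0


def shuffle(pairs):
    """Group the mapped values by key (the framework does this between the steps)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Reduce: collapse one group's flags into a single percentage."""
    return 100.0 * sum(values) / len(values)


grouped = shuffle(map_phase(record) for record in records)
results = {group: reduce_phase(values) for group, values in grouped.items()}
```

With this sample data, two of the three customers under 35 rated the service highly, so results["under_35"] comes out at roughly 66.7 percent. On a real cluster the map and reduce steps run in parallel across many machines, which is where the speed comes from.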

Hive and Pig both offer systems for carrying out MapReduce functions – or creating software programs capable of automatically completing lots and lots of MapReduce functions, very quickly, and reporting the results to analysts.

Both use their own programming languages, known as Pig Latin and HiveQL, to give programmers quicker access to the MapReduce functions of Hadoop. This means they are great for “modelling” – quickly building simulated systems which can be tinkered with in order to get the output you are looking for (e.g. an answer to a question about your business).

Hive might be a little easier to use because HiveQL is a declarative language based on SQL, which is something many programmers are already familiar with. Pig Latin, on the other hand, is a procedural language in which the programmer spells out each step of the processing. It might take a little longer to learn but should give programmers more flexibility.
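The difference in style can be illustrated with plain Python. This is only an analogy: the SQL below runs on SQLite, a small database bundled with Python, and stands in for the declarative HiveQL approach, while the step-by-step version stands in for the Pig Latin approach. Neither is actual HiveQL or Pig Latin, and the sample rows are made up.

```python
import sqlite3

# Hypothetical rows: (customer, age, rating)
rows = [("a", 28, 5), ("b", 41, 4), ("c", 33, 2), ("d", 30, 4), ("e", 52, 5)]

# Declarative style: describe the result you want and let the engine
# decide how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (customer TEXT, age INTEGER, rating INTEGER)")
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
declarative_avg = conn.execute(
    "SELECT AVG(rating) FROM ratings WHERE age < 35"
).fetchone()[0]
conn.close()

# Step-by-step style: spell out each stage of the data flow yourself.
under_35 = [row for row in rows if row[1] < 35]   # step 1: filter by age
ratings = [row[2] for row in under_35]            # step 2: keep the rating column
stepwise_avg = sum(ratings) / len(ratings)        # step 3: aggregate
```

Both versions arrive at the same average. The first is shorter and familiar to anyone who knows SQL; the second makes every intermediate step visible, which is easier to adjust when the processing gets complicated.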

So there you have it – Pig and Hive in a nutshell. I have tried to keep it as simple as possible to give an understanding of the actual mechanics of the operations that are taking place, and why they are useful, rather than technical details. If you want a more in-depth understanding then this article is a good place to start.

As always, I hope this was useful. Please let me know if you have any views or comments on the topic or would like to add something to this description.

--------------

I really appreciate that you are reading my post. Here, at LinkedIn, I regularly write about big data as well as management and technology issues and trends. If you would like to read my regular posts then please click 'Follow' and send me a LinkedIn invite. And, of course, feel free to also connect via Twitter, Facebook and The Advanced Performance Institute.

About : Bernard Marr is a globally recognized expert in strategy, performance management, analytics, KPIs and big data. He helps companies and executive teams manage, measure, analyze and improve performance.

His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance

Photo: Shutterstock.com

Ashwin Ittoo, PhD

Result-Driven AI / Digital Transformation | Regulation & Compliance | Full Professor ULiège | Co-Founder AIASHI

9y

Is the shift not towards Spark and Mesos (at the expense of the Hadoop/Hive ecosystems)?

Om Sharma

Technical Architect | Project Manager | TCS | Ex- Accenture | Ex - Infosys | Banking | JAVA | Microservice |DevOps| AWS | API Gateway | APIGEE

9y

Nicely presented. Thanks for sharing.

Paul Coetser

|| SAP Whisperer || Design & Delivery Assurance Advisor, Fractional CTO #thesapwhisperer

9y

A most excellent article, Bernard. Nicely written for those who don't know Big Data or are not familiar with its terms. Well done, mate. I will be writing an article soon as well that sets out how SAP HANA has these functions either natively or supported in its stack. Good work!
