This document discusses data-intensive computing, which involves the production, manipulation, and analysis of large datasets ranging from hundreds of megabytes to petabytes in size. It characterizes challenges in data-intensive applications including scalable algorithms, metadata management, high-performance computing platforms, distributed file systems, and data reduction techniques. The document then provides a historical perspective on technologies that have enabled data-intensive computing such as high-speed networking, data grids, cloud computing, databases, and programming models like MapReduce.
Data-intensive computing is concerned with the production, manipulation, and analysis of large-scale data, ranging in size from hundreds of megabytes (MB) to petabytes (PB).

Characterizing data-intensive computations
• Data-intensive applications not only deal with huge volumes of data but, very often, also exhibit compute-intensive properties.
• Datasets are commonly persisted in several formats and distributed across different locations.

Challenges ahead
• Scalable algorithms that can search and process massive datasets
• New metadata management technologies that can scale to handle complex, heterogeneous, and distributed data sources
• Advances in high-performance computing platforms aimed at providing better support for accessing in-memory multiterabyte data structures
• High-performance, highly reliable, petascale distributed file systems
• Data signature-generation techniques for data reduction and rapid processing
• New approaches to software mobility for delivering algorithms that are able to move the computation to where the data are located
• Specialized hybrid interconnection architectures that provide better support for filtering multigigabyte data streams coming from high-speed networks and scientific instruments
• Flexible and high-performance software integration techniques

Historical perspective
A historical perspective on data-intensive computing covers the evolution of storage, networking technologies, algorithms, and infrastructure software all together:
• The early age: high-speed wide-area networking
• Data grids
• Data clouds and "Big Data"
• Databases and data-intensive computing
The early age: high-speed wide-area networking
• In 1989, the first experiments in high-speed networking as a support for remote visualization of scientific data led the way.
• Two years later, the potential of using high-speed wide-area networks for enabling high-speed, TCP/IP-based distributed applications was demonstrated at Supercomputing 1991.
• Another important milestone was set with the Clipper project.

Data grids
• Data grids combine huge computational power with large storage facilities.
• A data grid provides services that help users discover, transfer, and manipulate large datasets stored in distributed repositories.
• Data grids offer two main functionalities: high-performance and reliable file transfer for moving large amounts of data, and scalable replica discovery and management for easy access to distributed datasets.

Characteristics that introduce new challenges:
• Massive datasets
• Shared data collections
• Unified namespace
• Access restrictions

Data clouds and "Big Data"
• Big Data problems arise not only in scientific computing but also in industries built around searching, online advertising, and social media.
• It is critical for such companies to efficiently analyze these huge datasets because they constitute a precious source of information about their customers.
• Log analysis is an example of this type of workload.

Cloud technologies support data-intensive computing in several ways:
• By providing a large amount of compute instances on demand
• By providing a storage system
• By providing frameworks and programming APIs

Databases and data-intensive computing
• Data-intensive computing concerns the development of applications that are mainly focused on processing large quantities of data.
• Distributed database technologies are another important enabler for data-intensive computing.

Storage systems
Several trends are reshaping storage systems:
• Growing popularity of Big Data
• Growing importance of data analytics in the business chain
• Presence of data in several forms, not only structured
• New approaches and technologies for computing

The main categories of storage systems are:
• High-performance distributed file systems and storage clouds
  – Lustre
  – IBM General Parallel File System (GPFS)
  – Google File System (GFS)
  – Sector
  – Amazon Simple Storage Service (S3)
• NoSQL systems
  – Apache CouchDB and MongoDB
  – Amazon Dynamo
  – Google Bigtable
  – Hadoop HBase

High-performance distributed file systems and storage clouds

Lustre
• The Lustre file system is a massively parallel distributed file system that covers the needs of anything from a small workgroup of clusters to a large-scale computing cluster.
• The file system is used by several of the Top 500 supercomputing systems.
• Lustre is designed to provide access to petabytes (PB) of storage and to serve thousands of clients with an I/O throughput of hundreds of gigabytes per second (GB/s).

IBM General Parallel File System (GPFS)
• A high-performance distributed file system developed by IBM, initially to support the RS/6000 supercomputer and Linux computing clusters.
• GPFS is built on the concept of shared disks.
• GPFS distributes the metadata of the entire file system and provides transparent access to it.

Google File System (GFS)
• GFS supports the execution of distributed applications in Google's computing cloud.
• The system has been designed to be a fault-tolerant, highly available, distributed file system built on commodity hardware and standard Linux operating systems.
• It is optimized for large files.
• Workloads primarily consist of two kinds of reads: large streaming reads and small random reads.

Sector
• Sector is a storage cloud that supports the execution of data-intensive applications deployed on commodity hardware across a wide-area network.
• Compared to other file systems, Sector does not partition a file into blocks; instead, it replicates entire files on multiple nodes.
• The system's architecture is composed of four types of nodes: a security server, one or more master nodes, slave nodes, and client machines.

Amazon Simple Storage Service (S3)
• Amazon S3 is the online storage service provided by Amazon.
• It is designed to support high availability, reliability, scalability, and virtually infinite storage.
• The system offers a flat storage space organized into buckets.
• Each bucket can store multiple objects, each identified by a unique key; objects are identified by unique URLs and exposed through HTTP.

NoSQL systems
• Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB, Terrastore)
• Graph databases (AllegroGraph, Neo4j, FlockDB, Cerebrum)
• Multivalue databases (OpenQM, Rocket U2, OpenInsight)
• Object databases (ObjectStore, JADE, ZODB)
• Tabular stores (Google Bigtable, Hadoop HBase, Hypertable)
• Tuple stores (Apache River)

Apache CouchDB and MongoDB
– Document stores.
– Schema-less.
– Expose a RESTful interface and represent data in JSON format.
– Allow querying and indexing data by using the MapReduce programming model.
– Use JavaScript as a base language for data querying and manipulation rather than SQL.

Amazon Dynamo
– The main goal of Dynamo is to provide an incrementally scalable and highly available storage system.
– Its infrastructure serves 10 million requests per day.
– Objects are stored and retrieved with a unique identifier (key).

Google Bigtable
– Scales up to petabytes of data across thousands of servers.
– Bigtable provides storage support for several Google applications.
– Bigtable's key design goals are wide applicability, scalability, high performance, and high availability.
– Bigtable organizes the data storage in tables whose rows are distributed over the distributed file system supporting the middleware.

Apache Cassandra
– Designed for managing large amounts of structured data spread across many commodity servers.
– Cassandra was initially developed by Facebook.
– Currently, it provides storage support for several very large Web applications such as Facebook, Digg, and Twitter.
– It is classified as a second-generation distributed database.
– Its data model is based on the column family abstraction.

Hadoop HBase
– A distributed database that is part of the Hadoop distributed programming platform.
– HBase is designed by taking inspiration from Google Bigtable.
– Its main goal is to offer real-time read/write operations for tables with billions of rows and millions of columns by leveraging clusters of commodity hardware.

Programming platforms
• Data-intensive applications process large quantities of information and require runtime systems able to efficiently manage huge volumes of data.
• Database management systems based on the relational model have proved unsuccessful for this purpose: the data are often unstructured or semistructured and come as files of large size, or as a huge number of medium-sized files, rather than as rows in a database.
• Distributed workflows are another common approach to processing large amounts of data.

The MapReduce programming model
• A programming model introduced by Google for processing large quantities of data.
• Computations are expressed in terms of two functions: map and reduce.
• Data transfer and management are completely handled by the distributed storage infrastructure.

Examples of MapReduce
• Distributed grep: recognition of patterns within text streams.
• Count of URL-access frequency: intermediate key-value pairs <URL, 1>, final pairs <URL, total-count>.
• Reverse Web-link graph: <target, source> pairs aggregated into <target, list(source)>.
• Term vector per host: a variant of word counting.
• Inverted index: <word, document-id> pairs aggregated into <word, list(document-id)>.
• Distributed sort.
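As an illustration of the model (not tied to any specific framework), here is a minimal Python sketch of the word-count pattern that underlies several of the examples above; the function names and the in-memory "shuffle" step are assumptions made for readability.

```python
from collections import defaultdict
from itertools import chain

# map: emit an intermediate <word, 1> pair for every word in a document
def map_fn(doc_id, text):
    return [(word.lower(), 1) for word in text.split()]

# reduce: sum all counts associated with the same word
def reduce_fn(word, counts):
    return word, sum(counts)

def run_mapreduce(documents):
    # "map" phase: apply map_fn to every input record
    intermediate = chain.from_iterable(
        map_fn(doc_id, text) for doc_id, text in documents.items())

    # "shuffle" phase: group intermediate values by key
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # "reduce" phase: aggregate each group independently
    return dict(reduce_fn(word, counts) for word, counts in groups.items())

if __name__ == "__main__":
    docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
    print(run_mapreduce(docs))   # {'the': 2, 'quick': 1, ...}
```

In a real deployment the shuffle step is performed by the runtime over the distributed storage infrastructure, which is exactly the part of the work the model takes away from the developer.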
• Statistical algorithms such as Support Vector Machines (SVM), Linear Regression (LR), Naive Bayes (NB), and Neural Networks (NN) can also be expressed as MapReduce computations.
• Their two major stages map directly onto the model (a sketch of this pattern follows the framework list below):
  – Analysis: operates directly on the data input file and is embarrassingly parallel.
  – Aggregation: operates on the intermediate results of the previous stage and is aimed at aggregating, summing, and/or elaborating them to present the data in their final form.

Variations and extensions of MapReduce
• MapReduce constitutes a simplified model for processing large quantities of data, and this model can be applied to several different problem scenarios.
• Variations and extensions aim at extending the MapReduce application space and providing developers with an easier interface for designing distributed algorithms.

Frameworks:
• Hadoop
• Pig
• Hive
• Map-Reduce-Merge
• Twister
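To make the Analysis/Aggregation split referenced above concrete, here is a small sketch (an assumed example, not taken from any of the frameworks listed) that computes the mean of a large numeric dataset: the analysis stage produces partial sums per chunk, and the aggregation stage combines them.

```python
# Analysis stage: each chunk is processed independently (embarrassingly parallel),
# emitting a compact (sum, count) pair instead of the raw data.
def analyze(chunk):
    return sum(chunk), len(chunk)

# Aggregation stage: combine the intermediate results into the final statistic.
def aggregate(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]   # stand-in for distributed input splits
    partials = [analyze(c) for c in chunks]          # would run as map tasks
    print(aggregate(partials))                       # 3.5, would run as a reduce task
```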
Hadoop
• Apache Hadoop is an open-source software framework for the distributed storage and processing of big data using the MapReduce programming model.
• Initially developed and supported by Yahoo!, whose Hadoop infrastructure counts 40,000 machines and more than 300,000 cores.
• http://hadoop.apache.org/

Pig
• A platform that allows the analysis of large datasets.
• It is based on a high-level language for expressing data analysis programs: developers express their programs in a textual language called Pig Latin.
• https://pig.apache.org/

Hive
• Provides a data warehouse infrastructure on top of Hadoop MapReduce.
• It provides tools for easy data summarization and for querying data in the style of a classical data warehouse.
• https://hive.apache.org/

Map-Reduce-Merge
• An extension of the MapReduce model that adds a phase for merging data already partitioned and sorted by the map and reduce modules.

Twister
• An extension of the MapReduce model that allows the creation of iterative executions of MapReduce jobs (see the sketch after the list of alternatives below):
  1. Configure Map
  2. Configure Reduce
  3. While Condition Holds True Do
     a. Run MapReduce
     b. Apply Combine Operation to Result
     c. Update Condition
  4. Close

Alternatives to MapReduce
• Sphere
• All-Pairs
• DryadLINQ
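The Twister-style iterative loop listed above can be sketched as a driver that keeps re-running a MapReduce round until convergence. This is only an illustration of the control flow; the helper names and the convergence test are assumptions, not Twister's actual API.

```python
# Hypothetical building blocks: run_mapreduce() executes one round over the
# (static) input data plus the current model, combine() collapses the round's
# output into a new model, and converged() implements the loop condition.
def iterative_mapreduce(data, model, run_mapreduce, combine, converged,
                        max_rounds=100):
    for _ in range(max_rounds):                  # 3. while condition holds true
        results = run_mapreduce(data, model)     # 3a. run MapReduce
        new_model = combine(results)             # 3b. apply combine operation
        if converged(model, new_model):          # 3c. update/check condition
            return new_model
        model = new_model
    return model

if __name__ == "__main__":
    # toy usage: repeatedly halve a value until it stops changing noticeably
    final = iterative_mapreduce(
        data=None, model=1.0,
        run_mapreduce=lambda d, m: [m / 2],
        combine=lambda rs: rs[0],
        converged=lambda old, new: abs(old - new) < 1e-6)
    print(final)
```

Iterative algorithms such as k-means or PageRank fit this pattern: each round is a full MapReduce job, and only the small model (centroids, rank vector) is carried between rounds.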
Sphere
• Sphere is tightly coupled with the Sector Distributed File System (SDFS) and is built on top of Sector's API for data access.
• Sphere implements the stream processing model (Single Program, Multiple Data) through user-defined functions (UDFs).
• UDFs are expressed in terms of programs that read and write streams.
• A Sphere client sends a request for processing to the master node, which returns the list of available slaves; the client then chooses the slaves on which to execute Sphere processes.

All-Pairs
• An abstraction for applications, common in fields such as biometrics, that compare every element of one dataset against every element of another (a sketch follows below, after DryadLINQ).
• The typical workflow consists of four stages: (1) model the system; (2) distribute the data; (3) dispatch batch jobs; and (4) clean up the system.

DryadLINQ
• A Microsoft Research project that investigates programming models for writing parallel and distributed programs that scale from a small cluster to a large datacenter.
• Its goal is to automatically parallelize the execution of applications without requiring the developer to know about distributed and parallel programming.
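As a purely illustrative sketch of the All-Pairs pattern mentioned above (the function names are assumptions, not the actual All-Pairs interface), the abstraction takes two sets and a comparison function and produces the full matrix of pairwise results; the engine's job is to run the independent comparisons as distributed batch jobs.

```python
# All-Pairs(A, B, F) -> matrix M with M[i][j] = F(A[i], B[j])
# Every cell is independent, so a real engine can farm the comparisons
# out as batch jobs after distributing the data to the compute nodes.
def all_pairs(set_a, set_b, compare):
    return [[compare(a, b) for b in set_b] for a in set_a]

if __name__ == "__main__":
    # toy stand-in for a biometric similarity function
    similarity = lambda x, y: len(set(x) & set(y)) / len(set(x) | set(y))
    templates = ["acgt", "acgg", "ttgg"]
    for row in all_pairs(templates, templates, similarity):
        print(["%.2f" % v for v in row])
```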
Aneka MapReduce programming
• Aneka provides support for developing MapReduce applications on top of its middleware.
• The Mapper and Reducer abstractions are exposed through the Aneka MapReduce APIs.
• Three classes are of importance for application development:
  – Mapper<K,V>
  – Reducer<K,V>
  – MapReduceApplication<M,R>
• The submission and execution of a MapReduce job is performed through the class MapReduceApplication<M,R>.

Map Function APIs
• IMapInput<K,V> provides access to the input key-value pair on which the map operation is performed.
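Aneka's Mapper<K,V> is a .NET class; as a language-neutral illustration of the map-side contract described above, here is a hypothetical Python analogue in which the runtime hands the map function one key-value input at a time and collects the intermediate pairs it emits. Class and method names here are assumptions, not the Aneka API.

```python
from collections import namedtuple

MapInput = namedtuple("MapInput", ["key", "value"])  # stand-in for IMapInput<K,V>

class WordCounterMapper:
    """Hypothetical analogue of a Mapper<string,int> implementation."""
    def map(self, map_input, emit):
        # map_input exposes the input key (e.g., a line number) and value (the line text)
        for word in map_input.value.split():
            emit(word, 1)            # emit an intermediate <word, 1> pair

if __name__ == "__main__":
    pairs = []
    emit = lambda k, v: pairs.append((k, v))
    WordCounterMapper().map(MapInput(0, "data intensive data"), emit)
    print(pairs)   # [('data', 1), ('intensive', 1), ('data', 1)]
```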
Reduce Function APIs
• Reduce(IReduceInputEnumerator<V> input).
• The reduce operation is applied to a collection of values that are mapped to the same key.

MapReduceApplication<M,R>
• Jobs are submitted and executed through the InvokeAndWait method inherited from ApplicationBase<M,R>.
• The WordCounterMapper and WordCounterReducer classes implement the classic word-count example.
• The parameters that can be controlled include:
  – Partitions
  – Attempts
  – SynchReduce
  – IsInputReady
  – FetchResults
  – LogFile
• WordCounter job: the driver program that configures and runs the application.

Runtime support
• Task scheduling: MapReduceScheduler class.
• Task execution: MapReduceExecutor class.

Distributed file system support
• Differently from the other programming models supported by Aneka, the MapReduce model does not leverage the default Storage Service for storage and data transfer; it uses a distributed file system implementation instead.
• Its requirements in terms of file management are significantly different with respect to the other models.
• Distributed file system implementations guarantee high availability and better efficiency.
• Aneka provides the capability of interfacing with different storage implementations.
• The required operations include retrieving the location of files and file chunks and accessing a file by means of a stream.
• File access is mediated by the classes SeqReader and SeqWriter.

Example application
• MapReduce is a very useful model for processing large quantities of semistructured data, such as logs or Web pages.
• The example application parses the logs produced by the Aneka container.

Parsing Aneka logs
• Aneka components produce a lot of information that is stored in the form of log files.
• Each log entry follows the format: DD MMM YY hh:mm:ss level - message
• The example is developed in four steps (a minimal log-parsing sketch follows below):
  – Mapper design and implementation
  – Reducer design and implementation
  – Driver program
  – Result
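To give a flavor of the mapper and reducer for this example, here is a minimal Python sketch (not Aneka code; the choice of aggregating entries by logging level, the helper names, and the sample log lines are assumptions for illustration) that parses lines in the DD MMM YY hh:mm:ss level - message format and counts how many entries each level produced.

```python
from collections import defaultdict

# Mapper: extract the logging level from one log line and emit <level, 1>.
# Expected line format: "DD MMM YY hh:mm:ss level - message"
def map_log_line(line):
    try:
        header, _message = line.split(" - ", 1)
        level = header.split()[-1]          # token right before the " - " separator
        return [(level, 1)]
    except ValueError:
        return []                           # skip malformed lines

# Reducer: sum the counts emitted for the same level.
def reduce_level(level, counts):
    return level, sum(counts)

if __name__ == "__main__":
    log = [
        "15 Mar 11 10:20:01 INFO - container started",
        "15 Mar 11 10:20:05 DEBUG - heartbeat sent",
        "15 Mar 11 10:21:17 INFO - task scheduled",
    ]
    groups = defaultdict(list)
    for line in log:
        for level, one in map_log_line(line):
            groups[level].append(one)
    print(dict(reduce_level(lvl, c) for lvl, c in groups.items()))
    # {'INFO': 2, 'DEBUG': 1}
```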