BDA Mid Term Self


Big data is characterized by the following key features:

Volume: Big data refers to datasets that are too large and complex to be
processed using traditional data processing tools. The amount of data generated
by individuals and organizations today is growing exponentially, and big data can
refer to terabytes, petabytes, or even exabytes of data.

Velocity: Big data is often generated at a very high velocity, with data streaming
in from various sources in real-time. This speed of data generation poses
challenges for traditional data processing tools, which are often unable to keep
up with the rate of data flow.

Variety: Big data is characterized by its variety of data types and sources,
including structured, semi-structured, and unstructured data. It can come from
various sources, such as social media, sensors, mobile devices, and more.

Veracity: Big data can be highly unreliable, with issues such as inaccuracies,
errors, and biases that can affect the quality of the data. This makes it challenging
to analyze and draw accurate conclusions from the data.

Value: The main goal of big data is to extract value from large and complex
datasets. The insights generated from big data can help organizations make more
informed decisions, improve their products and services, and identify new
opportunities.

Variability: Big data can have high variability due to the dynamic nature of data
sources and the constantly evolving environment in which the data is generated.

Complexity: Big data can be complex to manage and analyze due to the sheer
volume, velocity, and variety of data. This requires specialized tools, techniques,
and skills to process, store, and analyze the data effectively.

Types Of Big Data

Big data can generally be categorized into three types:

Structured Data:
Structured data refers to the data that can be organized in a predefined format,
and its structure is known in advance. Structured data is commonly stored in
databases, spreadsheets, and tables. Examples of structured data include sales
records, financial transactions, and customer information.

Unstructured Data:
Unstructured data refers to the data that does not have a specific format or
structure. It can be any type of data such as text, images, audio, and video. This
type of data is challenging to manage and analyze because of its complexity.
Examples of unstructured data include social media posts, emails, and sensor
data.

Semi-structured Data:
Semi-structured data refers to the data that is not completely structured, but it
has some organization. This type of data is similar to unstructured data, but it
has some predefined tags, labels, or metadata that provide some level of
structure. Examples of semi-structured data include XML files, JSON files, and log
files.
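
As a small illustration, the Python sketch below (the record contents are invented) parses a JSON string: the keys act as tags that give the data partial structure, even though different records may carry different fields.

import json

# A semi-structured record: the field names are embedded in the data itself.
record = '{"user": "alice", "posts": 42, "tags": ["sports", "travel"]}'

parsed = json.loads(record)
# Fields can be read by name, but a given field may simply be absent.
print(parsed["user"], parsed.get("email", "no email on record"))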

These three types of data require different methods of storage, processing, and
analysis, and they all present different challenges and opportunities for
businesses and organiza<ons looking to extract value from their data.

Advantages of Big Data:

1. Better decision-making: With big data, organizations can make better
decisions based on insights gleaned from large data sets. They can analyze
customer behavior, market trends, and other important metrics to inform
business decisions.
2. Improved efficiency: Big data technologies can help organizations
automate repetitive tasks and improve operational efficiency. For
example, businesses can use machine learning algorithms to automate
fraud detection, reducing the time and effort required to identify
fraudulent activities.
3. Enhanced customer experience: Big data can help businesses understand
their customers better, allowing them to tailor their products and services
to meet their customers' needs. This can result in higher customer
satisfaction and loyalty.
4. New revenue streams: By analyzing big data, organizations can identify
new business opportunities and revenue streams. For example, a retailer
can analyze customer purchase data to offer personalized product
recommendations and increase sales.
5. Competitive advantage: Companies that effectively leverage big data can
gain a significant competitive advantage in their industries. By using data
to make more informed decisions, they can outmaneuver competitors
and improve their bottom line.

Challenges of Big Data:

1. Data quality: Big data often comes from disparate sources, making it
difficult to ensure data quality. Poor data quality can lead to inaccurate
insights and poor decision-making.
2. Data privacy and security: With the abundance of data comes increased
risks of data breaches and cyberattacks. Companies must take measures
to protect their data and ensure compliance with privacy regulations.
3. Talent shortage: The demand for skilled data scientists and analysts has
far outpaced supply, making it challenging for companies to find and
retain the talent they need to effectively analyze big data.
4. Infrastructure and cost: Big data requires significant infrastructure and
computing resources to manage and analyze. This can be costly and
difficult for smaller organizations to manage.
5. Legal and ethical concerns: Big data raises legal and ethical concerns, such
as data privacy, ownership, and potential bias in algorithms. Companies
must navigate these concerns to ensure they are using data ethically and
responsibly.

BDA in finance

Big data analytics has revolutionized the way financial institutions operate, by
enabling them to leverage massive amounts of data to extract valuable insights
and make more informed business decisions. Here are some of the key
applications of big data analytics in finance:

Risk Management: Big data analytics can be used to identify and mitigate
potential risks in the financial sector. By analyzing large datasets, financial
institutions can detect patterns and anomalies that may indicate fraudulent
activities or other risks.

Customer Segmentation: Big data analytics can help financial institutions
segment their customers based on demographics, purchasing behavior, and
other factors. This enables them to create more personalized marketing
campaigns, develop new products and services, and improve customer
satisfaction.

Fraud Detection: Big data analytics can be used to detect fraudulent activities in
real-time. By analyzing large volumes of data from multiple sources, financial
institutions can identify unusual transactions or behavior that may indicate
fraud.
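
As a toy illustration (not a production model, and the amounts below are invented), the Python sketch flags incoming transactions that sit far above a customer's historical average:

from statistics import mean, stdev

history = [42.0, 18.5, 63.2, 27.9, 35.4, 51.3, 29.8]   # past transaction amounts
incoming = [44.1, 1240.0, 38.7]                         # new transactions to screen

mu = mean(history)
sigma = stdev(history)

# Flag anything more than three standard deviations above the historical mean.
flagged = [t for t in incoming if t > mu + 3 * sigma]
print(flagged)   # [1240.0]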

Investment Management: Big data analytics can help investment managers
make better investment decisions by analyzing market trends and identifying
opportunities. They can also use data analytics to track the performance of their
investments in real-time and make adjustments accordingly.

Compliance Monitoring: Big data analytics can help financial institutions ensure
compliance with regulatory requirements. By analyzing large volumes of data,
they can detect potential compliance issues and take appropriate action to
address them.

Credit Risk Assessment: Big data analytics can help financial institutions assess
the credit risk of their customers. By analyzing data on their credit history,
income, and other factors, they can determine the likelihood of default and set
appropriate interest rates and credit limits.

Overall, big data analytics has the potential to transform the financial sector by
providing valuable insights and enabling more informed decision-making.

DBMS stands for Database Management System. It is a software system that
helps manage data in a structured way, allowing users to store, retrieve, and
manipulate data easily and efficiently. Some of the advantages of using a DBMS
are:

Advantages:

Data Integrity: DBMS systems ensure data integrity by providing mechanisms for
enforcing constraints, such as unique keys, referential integrity, and data
validation rules.

Data Consistency: DBMS systems ensure data consistency by allowing multiple
users to access and modify data concurrently, while maintaining data
consistency and preventing data corruption.

Data Security: DBMS systems provide a secure environment for data storage and
access by allowing users to define access control policies and by enforcing these
policies.

Data Sharing: DBMS systems allow multiple users to share data simultaneously,
providing a way to collaborate on projects and share information across the
organization.

Data Independence: DBMS systems provide data independence by separating
the logical view of data from the physical view, making it easier to modify the
database structure without affecting the applications that use the database.

Disadvantages:

Cost: DBMS systems can be expensive to acquire and maintain, especially for
small businesses and individual users.

Complexity: DBMS systems can be complex to design, implement, and maintain,
requiring specialized skills and knowledge.

Performance Overhead: DBMS systems can introduce performance overhead
due to the additional processing required to manage data, especially when
dealing with large volumes of data.

Single Point of Failure: DBMS systems can become a single point of failure, with
the entire application depending on the availability of the database server.

Vendor Lock-In: DBMS systems can create vendor lock-in, with users becoming
dependent on a particular DBMS vendor and its proprietary technologies.

A file processing system is a software system that manages data stored in
individual files on a computer's file system. In a file processing system, data is
often stored in a specific format, and applications must be written to read and
write data in that format. Data redundancy and inconsistency can occur because
the same data may be stored in multiple files, and updating that data in one file
may not update it in another.

On the other hand, a database management system (DBMS) is a software system
designed to manage large amounts of data stored in a structured format. In a
DBMS, data is stored in tables with predefined relationships between them.
Applications can access and manipulate data stored in the DBMS through a
query language such as SQL. A DBMS provides mechanisms for ensuring data
consistency and integrity, and it can support multiple concurrent users accessing
the same data.
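
The Python sketch below contrasts the two approaches on a tiny invented dataset: with file processing the program itself must know the column layout, while with a DBMS (here sqlite3, used only as a stand-in) the same question is a short SQL query.

import csv
import sqlite3

# Create a small example data file (the contents are made up).
rows = [["1", "Alice", "Pune"], ["2", "Bob", "Mumbai"], ["3", "Chen", "Pune"]]
with open("customers.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# File processing: the application hard-codes the fact that column 2 holds the city.
with open("customers.csv", newline="") as f:
    count_file = sum(1 for r in csv.reader(f) if r[2] == "Pune")

# DBMS: the same data sits in a table, and SQL expresses the question directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
count_db = conn.execute("SELECT COUNT(*) FROM customers WHERE city = 'Pune'").fetchone()[0]

print(count_file, count_db)   # both print 2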

Some key differences between file processing systems and DBMS include:

Data storage: In a file processing system, data is stored in individual files on a
file system. In a DBMS, data is stored in a database, which can be organized into
tables, views, indexes, and other structures.

Data access: In a file processing system, applications must be written to read
and write data in the specific format in which it is stored. In a DBMS, applications
can access and manipulate data stored in the database using a query language
such as SQL.

Data redundancy and inconsistency: In a file processing system, the same data
may be stored in multiple files, which can lead to data redundancy and
inconsistency. In a DBMS, data is stored in a structured format with predefined
relationships between tables, which can help ensure data consistency.

Concurrency control: In a file processing system, managing concurrent access to
data can be challenging. In a DBMS, mechanisms such as locking and transaction
management are used to ensure data consistency and integrity when multiple
users access the same data concurrently.

Scalability: File processing systems can be limited in terms of their scalability,
particularly when dealing with large amounts of data. DBMSs are designed to
handle large amounts of data and are typically more scalable.

Database Management Systems (DBMS) have many applications in business,
and here are some examples:

Customer relationship management (CRM): DBMS can be used to store and
manage customer information such as contact details, purchase history, and
customer preferences. This information can be used to improve customer
engagement and satisfaction. An example of a CRM application is Salesforce.

Inventory management: DBMS can be used to track inventory levels, monitor
stock movements, and automate order fulfillment. This can help businesses
avoid stockouts and reduce carrying costs. An example of an inventory
management system is Fishbowl.

Financial management: DBMS can be used to store and manage financial data
such as transactions, account balances, and financial statements. This
information can be used to manage cash flow, forecast revenues, and analyze
financial performance. An example of a financial management system is
QuickBooks.

Human resource management (HRM): DBMS can be used to store and manage
employee information such as personal details, job history, and performance
metrics. This information can be used to streamline HR processes, improve
employee engagement, and support decision-making. An example of an HRM
system is Workday.

Marketing automation: DBMS can be used to store and manage marketing data
such as campaign metrics, lead information, and customer behavior. This
information can be used to automate marketing activities, personalize
marketing messages, and optimize marketing ROI. An example of a marketing
automation system is HubSpot.

Overall, DBMS can be used to improve business efficiency, decision-making, and
customer satisfaction across various business functions.

ER diagrams, or Entity-Relationship diagrams, are visual representations of the
entities, relationships, and attributes involved in a particular system or
application. They are commonly used in database design and serve as a way to
organize and communicate the structure of the system.

There are three types of keys in ER diagrams:


Primary key: A primary key is a unique identifier for each record in a table. It is
used to ensure that each record is uniquely identifiable and can be used as a
reference for relationships with other tables. In ER diagrams, a primary key is
denoted by an underline.

Foreign key: A foreign key is a field in one table that refers to the primary key of
another table. It is used to establish relationships between tables and to ensure
data integrity by enforcing referential integrity constraints. In ER diagrams, a
foreign key is denoted by a dotted underline.

Alternate key: An alternate key is a candidate key that is not chosen as the
primary key. It still uniquely identifies each record in the table and can be used
as an alternative identifier when needed. In ER diagrams, an alternate key is
denoted by a dashed underline.

Overall, ER diagrams provide a clear and concise way to visualize the
relationships between entities in a system and the keys that are used to link
them together.
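
The sketch below (via Python's sqlite3; the table and column names are made up) shows how the three key types typically appear in a table definition: a PRIMARY KEY, a UNIQUE column acting as an alternate key, and a FOREIGN KEY referencing another table.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,     -- primary key: unique row identifier
    email       TEXT UNIQUE NOT NULL     -- alternate key: also unique, not chosen as primary
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)  -- foreign key
);
""")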

In database management systems, integrity constraints are rules that ensure the
correctness and consistency of data stored in a database. There are several
types of integrity constraints, including:

Entity Integrity Constraint: This constraint ensures that each entity in a table has
a unique identifier, also known as a primary key. The primary key cannot be null
or empty, and it must be unique for each entity.

Referential Integrity Constraint: This constraint ensures that the relationships
between tables are maintained correctly. It requires that a foreign key in one
table must match a primary key in another table, or be null.

Domain Integrity Constraint: This constraint ensures that the values stored in a
database column meet certain predefined criteria, such as data type, range, or
format.

Check Constraint: This constraint ensures that the values stored in a database
column meet a specific condition or set of conditions, specified by the user.

User-defined Integrity Constraint: This constraint allows the user to define
custom rules that must be satisfied by the data stored in a database. This type
of constraint is often used to enforce business rules or other requirements
specific to a particular application or organization.
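
A short sketch of how such constraints are enforced in practice (using Python's sqlite3; the schema is invented): the CHECK constraint rejects an out-of-range price, and the foreign key rejects an order that refers to a product that does not exist.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # sqlite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    price      REAL CHECK (price >= 0)                   -- domain/check constraint
);
CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES products(product_id)   -- referential integrity
);
""")

for bad_insert in ("INSERT INTO products VALUES (1, -5.0)",   # violates CHECK
                   "INSERT INTO orders VALUES (1, 999)"):     # violates FOREIGN KEY
    try:
        conn.execute(bad_insert)
    except sqlite3.IntegrityError as err:
        print("rejected:", err)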

ACID is an acronym for the four properties that guarantee the reliability and
consistency of database transactions: Atomicity, Consistency, Isolation, and
Durability. These properties are essential for maintaining the integrity of data in
a database and ensuring that transactions are processed reliably.

Atomicity:
Atomicity refers to the property that guarantees that a transaction is treated as
a single, indivisible unit of work. It means that either all the operations within a
transaction are executed successfully or none of them is. If any operation within
the transaction fails, the entire transaction is rolled back, and the database is
returned to its previous state.

Consistency:
Consistency ensures that a transaction brings the database from one valid state
to another. The transaction should follow all the predefined rules and
constraints of the database. It means that any data written to the database
should be consistent with the database's schema and constraints. Inconsistent
data should not be written to the database.

Isolation:
Isolation refers to the property that ensures that concurrent transactions do not
interfere with each other. Each transaction should execute independently of
other transactions, without affecting the results of other concurrent
transactions. This is achieved by ensuring that transactions run in isolation and
are not affected by other transactions until they are completed.

Durability:
Durability refers to the property that guarantees that once a transaction is
committed, its effects are permanent and will survive any subsequent system
failures. It means that the changes made by a transaction should be recorded in
a permanent storage medium such as a hard disk. These changes should persist
even if the system crashes, power is lost, or some other catastrophic event
occurs.

Together, the ACID properties provide a set of guarantees that ensure that
transactions are reliable, consistent, and recoverable in the event of failures.
This makes ACID a critical set of properties for modern database systems.
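
A minimal sketch of atomicity in practice, using Python's sqlite3 (the accounts and amounts are made up): the two updates of a money transfer either commit together or are rolled back together.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # commits on success, rolls back automatically if an exception occurs
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        # If anything above had failed, neither balance would change.
except sqlite3.Error:
    pass

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 30), ('bob', 120)]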

Normalization refers to the process of organizing data in a database to reduce
redundancy and improve data consistency. It involves breaking down a larger
table into smaller, more manageable tables and establishing relationships
between them. The goal of normalization is to eliminate data duplication and
reduce the likelihood of inconsistencies.

There are several types of normalization, including:

First Normal Form (1NF): In this form, each column of a table should contain only
atomic (indivisible) values, and each row must be unique. For example, consider
a table that stores customer orders. Instead of storing all the items in a single
column, each item is stored in a separate row, with a unique order ID for each
item.

Second Normal Form (2NF): In this form, the table should meet the
requirements of 1NF and every non-key column should be functionally
dependent on the table's primary key. For example, in a table that stores orders
and products, each product should have its own unique identifier, and
information about the product, such as its name, price, and description, should
be stored in a separate table.

Third Normal Form (3NF): In this form, the table should meet the requirements
of 2NF, and all non-key columns should be dependent only on the primary key
and not on other non-key columns. For example, in a table that stores customer
orders, the customer's name and address should be stored in a separate table
from the order details.

Boyce-Codd Normal Form (BCNF): In this form, every determinant (a column or
set of columns on which some other column is functionally dependent) must be
a candidate key. For example, consider a table that stores the relationship
between students and the courses they are taking. The table can be split into
two tables: one that stores information about the student (e.g., name, ID) and
another that stores information about the courses (e.g., course name, ID). The
relationship between the two tables is established through a foreign key that
links the two tables.

These forms of normalization can be used to progressively refine the structure
of a database to reduce data redundancy, improve data consistency, and
improve query performance.
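
As a small illustration (the data is invented), the sketch below shows the kind of decomposition normalization aims for, using plain Python structures as stand-ins for tables: the denormalized rows repeat the customer's details on every order, while the normalized version stores them once and refers to them by key.

# Denormalized: customer details are repeated on every order row.
denormalized = [
    {"order_id": 1, "customer": "Alice", "city": "Pune",   "item": "Pen"},
    {"order_id": 2, "customer": "Alice", "city": "Pune",   "item": "Book"},
    {"order_id": 3, "customer": "Bob",   "city": "Mumbai", "item": "Pen"},
]

# Normalized: customer details live in one table, keyed by customer_id ...
customers = {
    1: {"name": "Alice", "city": "Pune"},
    2: {"name": "Bob",   "city": "Mumbai"},
}

# ... and each order refers to its customer by key instead of repeating the details.
orders = [
    {"order_id": 1, "customer_id": 1, "item": "Pen"},
    {"order_id": 2, "customer_id": 1, "item": "Book"},
    {"order_id": 3, "customer_id": 2, "item": "Pen"},
]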

A weak entity set is an entity set that cannot be uniquely identified by its
attributes alone. In other words, it depends on another entity set (the strong
entity set) for its identity. For example, consider an entity set called "Order Item"
that represents the items ordered in a customer's order. Each item has attributes
of its own (such as a product name or SKU), but these do not identify it uniquely
on their own, since multiple orders can contain the same item. It therefore
depends on the "Order" entity set to give it context and make it unique.
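
In SQL terms (sketched below via Python's sqlite3, with invented names), a weak entity set typically gets a composite primary key that combines the owning entity's key with a partial key of its own, so an order item is identified only within its order.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY
);

CREATE TABLE order_items (
    order_id    INTEGER NOT NULL REFERENCES orders(order_id),  -- key of the strong entity
    line_number INTEGER NOT NULL,                               -- partial (discriminator) key
    sku         TEXT,
    PRIMARY KEY (order_id, line_number)   -- identity depends on the owning order
);
""")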

Codd's rules refer to a set of guidelines for relational database management
systems (RDBMS) proposed by Dr. Edgar F. Codd, who introduced the relational
model in 1970. These rules outline the basic principles that should be followed
to create a relational database that is both efficient and easy to maintain. The
12 rules are as follows:

1. The Information Rule: All information in a relational database is
represented explicitly at the logical level and in exactly one way.

2. Guaranteed Access Rule: Each and every datum (atomic value) is
guaranteed to be accessible by using a combination of the table name,
primary key value, and column name.

3. Systematic Treatment of Null Values: Null values (distinct from empty
character strings or a string of spaces) are supported in the system for
representing missing information and inapplicable information in a
systematic way.

4. Dynamic Online Catalog Based on the Relational Model: The database
description is represented at the logical level in the same way as ordinary
data, so that authorized users can apply the same relational language to
its interrogation as they apply to regular data.

5. Comprehensive Data Sublanguage Rule: A relational system may support
several languages and various modes of terminal use (for example, the
fill-in-the-blanks mode). However, there must be at least one language
whose statements are expressible, per some well-defined syntax, as
character strings and whose ability to support all of the following is
comprehensive:

Data Definition
View Definition
Data Manipulation (Interactive and by Program)
Integrity Constraints
Authorization

6. View Updating Rule: All views that are theoretically updatable are also
updatable by the system.

7. High-Level Insert, Update, and Delete: The capability of handling a base
relation or a derived relation as a single operand applies not only to the
retrieval of data, but also to the insertion, update, and deletion of data.

8. Physical Data Independence: Application programs and terminal activities
remain logically unimpaired whenever any changes are made in either
storage representations or access methods.

9. Logical Data Independence: Application programs and terminal activities
remain logically unimpaired when information-preserving changes of any
kind that theoretically permit unimpairment are made to the base tables.

10. Integrity Independence: Integrity constraints specified to the system are
stored in the catalog, not in application programs.

11. Distribution Independence: A relational DBMS has distribution
independence.

12. Nonsubversion Rule: If a relational system has a low-level (single-record-
at-a-time) language, that low level cannot be used to subvert or bypass
the integrity rules or constraints expressed in the higher-level
multiple-records-at-a-time language.

Hadoop is a distributed data processing system that allows for the storage and
analysis of large data sets. It is an open-source software framework that is used
to store and process Big Data. Hadoop is made up of a number of components
that work together to store and process data efficiently.

Hadoop Ecosystem is a collection of tools and technologies that are used to
extend the functionality of Hadoop. The Hadoop Ecosystem includes various
components such as Hive, Pig, HBase, Sqoop, Flume, Spark, and many more.
These components allow Hadoop to integrate with other systems and
technologies, making it a powerful platform for Big Data processing.

Let's take a look at some of the Hadoop ecosystem components and their
functionalities:

Hive: Hive is a data warehousing tool that allows SQL-like queries to be
executed on Hadoop data sets. It allows analysts to use familiar SQL commands
to extract meaningful insights from Big Data. For example, a company can use
Hive to analyze customer behavior by querying the data from their website and
mobile app.

Pig: Pig is a platform for analyzing large data sets that allows for the creation of
complex data processing pipelines using a high-level scripting language. Pig can
be used to extract insights from data in real-time. For example, a
telecommunications company can use Pig to monitor network performance
and identify potential issues before they become critical.

HBase: HBase is a distributed NoSQL database that is used to store and manage
large amounts of structured and semi-structured data. HBase is used to store
data that requires fast access and retrieval. For example, a social media
platform can use HBase to store user profiles and their social connections.

Sqoop: Sqoop is a tool that allows for the transfer of data between Hadoop and
other relational databases. Sqoop can be used to import data from a relational
database to Hadoop, or export data from Hadoop to a relational database. For
example, a retail company can use Sqoop to transfer data from their inventory
system to Hadoop for analysis.

Flume: Flume is a distributed system that is used to collect, aggregate, and
move large amounts of data from various sources to Hadoop. Flume can be
used to collect data from sources such as web servers, social media platforms,
and sensors. For example, a healthcare company can use Flume to collect data
from patient monitoring devices and store it in Hadoop for analysis.

Spark: Spark is a fast and powerful data processing engine that is used to
perform data processing tasks in-memory. Spark can be used for a variety of
data processing tasks such as data cleansing, data transformation, and data
modeling. For example, a financial services company can use Spark to analyze
credit card transactions in real-time to identify potential fraudulent activities.
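
A minimal PySpark sketch of that idea (assuming pyspark is installed and a local session is acceptable; the file name and column names are invented): transactions far above the overall average amount are pulled out for review.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fraud-screen").master("local[*]").getOrCreate()

# Load card transactions from a CSV file (path and columns are assumed).
txns = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Compute the mean and standard deviation of the amount column.
stats = txns.agg(F.avg("amount").alias("mean"), F.stddev("amount").alias("sd")).first()

# Flag transactions more than three standard deviations above the mean.
flagged = txns.filter(F.col("amount") > stats["mean"] + 3 * stats["sd"])
flagged.show()

spark.stop()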

In conclusion, the Hadoop ecosystem is a collection of tools and technologies
that extend the functionality of Hadoop. These tools allow businesses to store,
process, and analyze large amounts of data in a cost-effective manner. The
Hadoop ecosystem components listed above are just a few examples of the
many tools available to businesses looking to leverage Big Data to gain valuable
insights into their operations.

Hadoop HDFS (Hadoop Distributed File System) has the following core
components:

NameNode: The NameNode is the central component of the HDFS
architecture. It stores metadata about all the files and directories in the
Hadoop file system. This metadata includes the file names, permissions,
timestamps, and the physical locations of the file blocks.

DataNode: The DataNode is responsible for storing and serving the actual data
blocks of files. Each DataNode stores a portion of the data blocks and sends
heartbeats to the NameNode to inform it of its status.

Secondary NameNode: The Secondary NameNode is responsible for
performing periodic checkpoints of the NameNode's metadata. It merges the
edits log with the file system image and creates a new checkpoint.

Block: A block is the smallest unit of data that can be stored in HDFS. By
default, each block is 128 MB in size. HDFS stores files as a series of blocks, and
each block is replicated to multiple DataNodes for fault tolerance.
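
Back-of-the-envelope block math, assuming the default 128 MB block size and the default replication factor of 3 (the 1 GB file size is just an example):

import math

BLOCK_SIZE_MB = 128      # HDFS default block size
REPLICATION = 3          # HDFS default replication factor

file_size_mb = 1024      # a 1 GB file

blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)   # 8 blocks
block_replicas = blocks * REPLICATION              # 24 block replicas across DataNodes
raw_storage_mb = file_size_mb * REPLICATION        # 3072 MB of raw cluster capacity used

print(blocks, block_replicas, raw_storage_mb)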

Rack: A rack is a collection of DataNodes that are physically close to each other.
HDFS replicates data blocks across multiple racks to improve fault tolerance
and reduce the likelihood of data loss.

Namespace: The namespace is the hierarchical structure of files and directories
in HDFS. The root of the namespace is represented by a forward slash (/).

File System Image: The file system image is a file that contains the metadata
about all the files and directories in HDFS. The NameNode reads this file when
it starts up and creates an in-memory representation of the namespace.

Edit Log: The edit log is a file that contains a record of all the changes that have
been made to the namespace since the last checkpoint. The NameNode uses
the edit log to rebuild the in-memory namespace after a restart.

Distributed computing and parallel computing are two related but distinct
concepts. In distributed computing, tasks are divided among multiple
independent computers or nodes that communicate and coordinate their
actions to achieve a common goal. Parallel computing, on the other hand,
involves breaking down a single task into smaller, independent sub-tasks that
are executed simultaneously on multiple processors or cores within a single
computer.
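
A minimal parallel-computing sketch in Python (the task and the four-way split are arbitrary): one job is broken into independent sub-tasks that run on separate worker processes, and their partial results are combined at the end. A distributed version would spread the same chunks across machines rather than local cores.

from multiprocessing import Pool

def sum_of_squares(chunk):
    # Each worker process handles one independent sub-task.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]        # split the work four ways
    with Pool(processes=4) as pool:
        partials = pool.map(sum_of_squares, chunks)
    print(sum(partials))                           # combine the partial results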

Here are some challenges that are common to both distributed and parallel
computing:

Communication: In both distributed and parallel computing, communication
between different nodes or processors is critical. However, communication can
be a significant bottleneck as it can introduce latency, consume network
bandwidth, and increase the overall complexity of the system.

Synchronization: In a distributed system, nodes need to be synchronized to
ensure that they are all working on the same task and are up-to-date with each
other's progress. Similarly, in parallel computing, sub-tasks need to be
synchronized to ensure that they are executed in the correct order and that
their results are combined correctly.

Load balancing: In both distributed and parallel computing, workload
distribution is crucial to achieve optimal performance. Load balancing involves
distributing the workload evenly across all nodes or processors to ensure that
no single node or processor is overwhelmed with work.

Fault tolerance: In a distributed system, nodes can fail, and communication
links can break. In parallel computing, individual processors can also fail.
Therefore, fault tolerance mechanisms must be in place to ensure that the
system can continue to operate correctly despite these failures.

Scalability: Both distributed and parallel computing should be designed to scale
up or down easily. As the workload or the number of nodes or processors
increases, the system should be able to handle the additional workload
efficiently without a significant decrease in performance.

Programming complexity: Both distributed and parallel computing require
specialized programming techniques, which can be more complex than
traditional programming methods. The programmer must manage the
communication and synchronization between nodes or processors and ensure
that the system is fault-tolerant and scalable.

RDBMS and Hadoop are two different technologies used for storing and
processing data, and there are several differences between them:

Data Model: RDBMS (Relational Database Management System) is based on a
structured data model with tables, columns, and rows. On the other hand,
Hadoop is based on a distributed file system with a schema-less data model
that allows for unstructured and semi-structured data.

Data Processing: RDBMS is designed for online transaction processing (OLTP)
applications that require real-time processing of data. Hadoop is designed for
batch processing of large datasets using the MapReduce programming model
(a toy word count in that style is sketched at the end of this section).

Scalability: RDBMS is vertically scalable, which means it can scale up by adding
more hardware resources to a single server. Hadoop, on the other hand, is
horizontally scalable, which means it can scale out by adding more commodity
servers to a cluster.

Cost: RDBMS can be expensive due to licensing costs and hardware
requirements. Hadoop is open source software, and the hardware
requirements are relatively low, which makes it a cost-effective solution for
processing large datasets.

Data Storage: RDBMS stores data in a structured format and requires a
predefined schema for data storage. Hadoop, on the other hand, can store
both structured and unstructured data without a predefined schema.

Performance: RDBMS is optimized for handling small to medium-sized datasets
with high speed, whereas Hadoop is optimized for processing large datasets
with high efficiency.

In summary, RDBMS and Hadoop have different strengths and are used for
different purposes. RDBMS is good for transactional applications, while Hadoop
is better suited for handling large amounts of data in batch processing.
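
As promised above, here is a toy word count in the MapReduce style, written in plain single-machine Python (the documents are invented): the map step emits (word, 1) pairs and the reduce step sums the counts per word. Hadoop runs the same pattern in parallel across many nodes over data stored in HDFS.

from collections import defaultdict

documents = ["big data needs big storage", "hadoop stores big data"]

# Map: emit a (key, value) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + Reduce: group the pairs by key and sum the values per word.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}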
