BDA Mid Term Self
Volume: Big data refers to datasets that are too large and complex to be
processed using traditional data processing tools. The amount of data generated
by individuals and organizations today is growing exponentially, and big data can
refer to terabytes, petabytes, or even exabytes of data.
Velocity: Big data is often generated at a very high velocity, with data streaming
in from various sources in real-time. This speed of data generation poses
challenges for traditional data processing tools, which are often unable to keep
up with the rate of data flow.
Variety: Big data is characterized by its variety of data types and sources,
including structured, semi-structured, and unstructured data. It can come from
various sources, such as social media, sensors, mobile devices, and more.
Veracity: Big data can be highly unreliable, with issues such as inaccuracies,
errors, and biases that can affect the quality of the data. This makes it challenging
to analyze and draw accurate conclusions from the data.
Value: The main goal of big data is to extract value from large and complex
datasets. The insights generated from big data can help organizations make more
informed decisions, improve their products and services, and identify new
opportunities.
Variability: Big data can have high variability due to the dynamic nature of data
sources and the constantly evolving environment in which the data is generated.
Complexity: Big data can be complex to manage and analyze due to the sheer
volume, velocity, and variety of data. This requires specialized tools, techniques,
and skills to process, store, and analyze the data effectively.
Structured Data:
Structured data refers to the data that can be organized in a predefined format,
and its structure is known in advance. Structured data is commonly stored in
databases, spreadsheets, and tables. Examples of structured data include sales
records, financial transactions, and customer information.
Unstructured Data:
Unstructured data refers to the data that does not have a specific format or
structure. It can be any type of data such as text, images, audio, and video. This
type of data is challenging to manage and analyze because of its complexity.
Examples of unstructured data include social media posts, emails, and sensor
data.
Semi-structured Data:
Semi-structured data refers to the data that is not completely structured, but it
has some organization. This type of data is similar to unstructured data, but it
has some predefined tags, labels, or metadata that provide some level of
structure. Examples of semi-structured data include XML files, JSON files, and log
files.
These three types of data require different methods of storage, processing, and
analysis, and they all present different challenges and opportunities for
businesses and organizations looking to extract value from their data.
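As an illustration, the short Python sketch below contrasts the three kinds of data; the field names and records are invented purely for this example.

import csv
import io
import json

# Structured data: fixed columns known in advance (e.g., a sales record).
structured_csv = "order_id,customer,amount\n1001,Alice,250.00\n1002,Bob,99.50\n"
for row in csv.DictReader(io.StringIO(structured_csv)):
    print(row["order_id"], row["amount"])

# Semi-structured data: self-describing tags, but fields can vary per record.
semi_structured = '{"order_id": 1003, "customer": "Carol", "tags": ["priority"]}'
record = json.loads(semi_structured)
print(record.get("tags", []))   # optional field, may be absent in other records

# Unstructured data: free text with no schema; needs parsing or NLP to analyze.
unstructured = "Customer emailed to say the parcel arrived late but intact."
print(len(unstructured.split()), "words")
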
Challenges of big data:
1. Data quality: Big data often comes from disparate sources, making it
difficult to ensure data quality. Poor data quality can lead to inaccurate
insights and poor decision-making.
2. Data privacy and security: With the abundance of data comes increased
risks of data breaches and cyberattacks. Companies must take measures
to protect their data and ensure compliance with privacy regulations.
3. Talent shortage: The demand for skilled data scientists and analysts has
far outpaced supply, making it challenging for companies to find and
retain the talent they need to effectively analyze big data.
4. Infrastructure and cost: Big data requires significant infrastructure and
computing resources to manage and analyze. This can be costly and
difficult for smaller organizations to manage.
5. Legal and ethical concerns: Big data raises legal and ethical concerns, such
as data privacy, ownership, and potential bias in algorithms. Companies
must navigate these concerns to ensure they are using data ethically and
responsibly.
BDA in finance
Big data analytics has revolutionized the way financial institutions operate by
enabling them to leverage massive amounts of data to extract valuable insights
and make more informed business decisions. Here are some of the key
applications of big data analytics in finance:
Risk Management: Big data analytics can be used to identify and mitigate
potential risks in the financial sector. By analyzing large datasets, financial
institutions can detect patterns and anomalies that may indicate fraudulent
activities or other risks.
Customer Segmentation: Big data analytics can help financial institutions
segment their customers based on demographics, purchasing behavior, and
other factors. This enables them to create more personalized marketing
campaigns, develop new products and services, and improve customer
satisfaction.
Fraud Detection: Big data analytics can be used to detect fraudulent activities in
real-time. By analyzing large volumes of data from multiple sources, financial
institutions can identify unusual transactions or behavior that may indicate
fraud.
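A minimal sketch of the idea behind flagging unusual transactions, assuming a simple z-score rule over transaction amounts; real fraud systems use far richer features and models, and the numbers below are invented.

from statistics import mean, stdev

amounts = [120.0, 80.5, 95.0, 110.0, 101.5, 5000.0, 88.0]   # hypothetical card transactions
mu, sigma = mean(amounts), stdev(amounts)

for amount in amounts:
    z = (amount - mu) / sigma
    if abs(z) > 2:   # flag amounts far from the typical spend
        print(f"Possible fraud: {amount} (z-score {z:.1f})")
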
Compliance Monitoring: Big data analytics can help financial institutions ensure
compliance with regulatory requirements. By analyzing large volumes of data,
they can detect potential compliance issues and take appropriate action to
address them.
Credit Risk Assessment: Big data analytics can help financial institutions assess
the credit risk of their customers. By analyzing data on their credit history,
income, and other factors, they can determine the likelihood of default and set
appropriate interest rates and credit limits.
Overall, big data analytics has the potential to transform the financial sector by
providing valuable insights and enabling more informed decision-making.
Advantages of a DBMS:
Data Integrity: DBMS systems ensure data integrity by providing mechanisms for
enforcing constraints, such as unique keys, referential integrity, and data
validation rules.
Data Consistency: DBMS systems keep data consistent while multiple users
access and modify it concurrently, using concurrency control to prevent
conflicting updates and data corruption.
Data Security: DBMS systems provide a secure environment for data storage and
access by allowing users to define access control policies and by enforcing these
policies.
Data Sharing: DBMS systems allow multiple users to share data simultaneously,
providing a way to collaborate on projects and share information across the
organization.
Disadvantages:
Cost: DBMS systems can be expensive to acquire and maintain, especially for
small businesses and individual users.
Single Point of Failure: DBMS systems can become a single point of failure, with
the entire application depending on the availability of the database server.
Vendor Lock-In: DBMS systems can create vendor lock-in, with users becoming
dependent on a particular DBMS vendor and its proprietary technologies.
Some key differences between file processing systems and DBMS include:
Data redundancy and inconsistency: In a file processing system, the same data
may be stored in multiple files, which can lead to data redundancy and
inconsistency. In a DBMS, data is stored in a structured format with predefined
relationships between tables, which can help ensure data consistency.
Financial management: DBMS can be used to store and manage financial data
such as transactions, account balances, and financial statements. This
information can be used to manage cash flow, forecast revenues, and analyze
financial performance. An example of a financial management system is
QuickBooks.
Human resource management (HRM): DBMS can be used to store and manage
employee information such as personal details, job history, and performance
metrics. This information can be used to streamline HR processes, improve
employee engagement, and support decision-making. An example of an HRM
system is Workday.
Marketing automation: DBMS can be used to store and manage marketing data
such as campaign metrics, lead information, and customer behavior. This
information can be used to automate marketing activities, personalize
marketing messages, and optimize marketing ROI. An example of a marketing
automation system is HubSpot.
Foreign key: A foreign key is a field in one table that refers to the primary key of
another table. It is used to establish relationships between tables and to ensure
data integrity by enforcing referential integrity constraints. In ER diagrams, a
foreign key is denoted by a dotted underline.
Alternate key: An alternate key is a unique identifier for each record in a table
that is not the primary key. It is used as a backup key in case the primary key is
lost or cannot be used. In ER diagrams, an alternate key is denoted by a dashed
underline.
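A minimal sketch of a foreign key enforcing referential integrity, using Python's built-in sqlite3 module; the table and column names are illustrative only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite only enforces FKs when this is on
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)   -- foreign key
)""")
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (100, 1)")         # OK: customer 1 exists
try:
    conn.execute("INSERT INTO orders VALUES (101, 99)")    # fails: no customer 99
except sqlite3.IntegrityError as e:
    print("Rejected by the foreign key constraint:", e)
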
In database management systems, integrity constraints are rules that ensure the
correctness and consistency of data stored in a database. There are several
types of integrity constraints, including:
Entity Integrity Constraint: This constraint ensures that each entity in a table has
a unique identifier, also known as a primary key. The primary key cannot be null
or empty, and it must be unique for each entity.
Domain Integrity Constraint: This constraint ensures that the values stored in a
database column meet certain predefined criteria, such as data type, range, or
format.
Check Constraint: This constraint ensures that the values stored in a database
column meet a specific condition or set of conditions, specified by the user.
User-defined Integrity Constraint: This constraint allows the user to define
custom rules that must be satisfied by the data stored in a database. This type
of constraint is often used to enforce business rules or other requirements
specific to a particular application or organization.
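A minimal sqlite3 sketch of entity, domain, and check constraints rejecting bad data; the account table is hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE account (
    account_id INTEGER PRIMARY KEY,                 -- entity integrity: unique, not null
    balance    REAL NOT NULL CHECK (balance >= 0)   -- domain/check constraint on the value
)""")
conn.execute("INSERT INTO account VALUES (1, 500.0)")       # satisfies all constraints
try:
    conn.execute("INSERT INTO account VALUES (2, -50.0)")   # violates CHECK (balance >= 0)
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
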
ACID is an acronym for the four properties that guarantee the reliability and
consistency of database transactions: Atomicity, Consistency, Isolation, and
Durability. These properties are essential for maintaining the integrity of data in
a database and ensuring that transactions are processed reliably.
Atomicity:
Atomicity refers to the property that guarantees that a transaction is treated as
a single, indivisible unit of work. It means that either all the operations within a
transaction are executed successfully or none of them is. If any operation within
the transaction fails, the entire transaction is rolled back, and the database is
returned to its previous state.
Consistency:
Consistency ensures that a transaction brings the database from one valid state
to another. The transaction should follow all the predefined rules and
constraints of the database. It means that any data written to the database
should be consistent with the database's schema and constraints. Inconsistent
data should not be written to the database.
Isolation:
Isolation refers to the property that ensures that concurrent transactions do not
interfere with each other. Each transaction should execute independently of
other transactions, without affecting the results of other concurrent
transactions. This is achieved by ensuring that transactions run in isolation and
are not affected by other transactions until they are completed.
Durability:
Durability refers to the property that guarantees that once a transaction is
committed, its effects are permanent and will survive any subsequent system
failures. It means that the changes made by a transaction should be recorded in
a permanent storage medium such as a hard disk. These changes should persist
even if the system crashes, power is lost, or some other catastrophic event
occurs.
Together, the ACID properties provide a set of guarantees that ensure that
transactions are reliable, consistent, and recoverable in the event of failures.
This makes ACID a critical set of properties for modern database systems.
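A minimal sqlite3 sketch of atomicity: a hypothetical money transfer either commits fully or is rolled back when one of its statements fails.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, "
             "balance REAL NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:   # one transaction: commit on success, roll back if an error is raised
        conn.execute("UPDATE account SET balance = balance + 200 WHERE id = 2")   # succeeds
        conn.execute("UPDATE account SET balance = balance - 200 WHERE id = 1")   # violates CHECK
except sqlite3.IntegrityError:
    print("Transfer failed; the whole transaction was rolled back")

print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# [(1, 100.0), (2, 50.0)] -- the earlier credit to account 2 was undone as well
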
First Normal Form (1NF): In this form, each column of a table should contain only
atomic (indivisible) values, and each row must be unique. For example, consider
a table that stores customer orders. Instead of storing all the items in a single
column, each item is stored in a separate row, identified by the order ID together
with the item.
Second Normal Form (2NF): In this form, the table should meet the
requirements of 1NF and every non-key column should be functionally
dependent on the table's primary key. For example, in a table that stores orders
and products, each product should have its own unique identifier, and
information about the product, such as its name, price, and description, should
be stored in a separate table.
Third Normal Form (3NF): In this form, the table should meet the requirements
of 2NF, and all non-key columns should be dependent only on the primary key
and not on other non-key columns. For example, in a table that stores customer
orders, the customer's name and address should be stored in a separate table
from the order details.
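A minimal sketch of what the order data might look like after normalization, written as SQLite DDL; the schema is hypothetical and only illustrates where the dependencies end up.

import sqlite3

conn = sqlite3.connect(":memory:")
# 1NF: one atomic value per column, one order item per row.
# 2NF: product attributes depend on product_id alone, so they move to product.
# 3NF: the customer's name and address depend on customer_id, not on the order.
conn.executescript("""
CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE product    (product_id  INTEGER PRIMARY KEY, name TEXT, price REAL, description TEXT);
CREATE TABLE orders     (order_id    INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(customer_id));
CREATE TABLE order_item (order_id    INTEGER REFERENCES orders(order_id),
                         product_id  INTEGER REFERENCES product(product_id),
                         quantity    INTEGER,
                         PRIMARY KEY (order_id, product_id));
""")
print("normalized schema created")
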
A weak entity set is an entity set that cannot be uniquely identified by its
attributes alone. In other words, it depends on another entity set (the strong
entity set) for its identity. For example, consider an entity set called "Order Item"
that represents the items ordered in a customer's order. An order item has
descriptive attributes (such as a product name or SKU), but those attributes alone
do not uniquely identify it, since multiple orders can contain the same item.
Therefore, it depends on the "Order" entity set to give it context and make it
unique.
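A minimal sqlite3 sketch of the same idea: order_item has only a partial key (a line number) and borrows order_id from its owning order, so it cannot outlive that order. Names are illustrative only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE "order" (order_id INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE order_item (
    order_id INTEGER NOT NULL REFERENCES "order"(order_id) ON DELETE CASCADE,
    line_no  INTEGER NOT NULL,          -- discriminator (partial key)
    sku      TEXT,
    quantity INTEGER,
    PRIMARY KEY (order_id, line_no)     -- identity comes from the owning order
);
""")
conn.execute("""INSERT INTO "order" VALUES (1, '2024-01-01')""")
conn.execute("INSERT INTO order_item VALUES (1, 1, 'SKU-42', 2)")
conn.execute("""DELETE FROM "order" WHERE order_id = 1""")            # remove the owner ...
print(conn.execute("SELECT COUNT(*) FROM order_item").fetchone())     # (0,) ... its items go too
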
Data Definition
View Definition
Data Manipulation (Interactive and by Program)
Integrity Constraints
Authorization
View Updating Rule: All views that are theoretically updatable are also
updatable by the system.
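A minimal sqlite3 sketch touching a few of these language features: data definition (CREATE TABLE), view definition (CREATE VIEW), and data manipulation by program (INSERT/SELECT). The employee table is hypothetical, and authorization (GRANT/REVOKE) is not shown because SQLite does not support it.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL)")  # data definition
conn.execute("CREATE VIEW high_earners AS "
             "SELECT name, salary FROM employee WHERE salary > 50000")                      # view definition
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, 'Alice', 60000), (2, 'Bob', 40000)])                                  # data manipulation
print(conn.execute("SELECT * FROM high_earners").fetchall())    # [('Alice', 60000.0)]
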
Let's take a look at some of the Hadoop ecosystem components and their
functionalities:
Hive: Hive is a data warehousing tool that allows SQL-like queries to be
executed on Hadoop data sets. It allows analysts to use familiar SQL commands
to extract meaningful insights from Big Data. For example, a company can use
Hive to analyze customer behavior by querying the data from their website and
mobile app.
Pig: Pig is a platform for analyzing large data sets that allows for the creation of
complex data processing pipelines using a high-level scripting language (Pig
Latin). Pig jobs typically run in batch over very large data sets. For example, a
telecommunications company can use Pig to analyze network performance logs
and identify potential issues.
HBase: HBase is a distributed NoSQL database that is used to store and manage
large amounts of structured and semi-structured data. HBase is used to store
data that requires fast access and retrieval. For example, a social media
platform can use HBase to store user profiles and their social connections.
Sqoop: Sqoop is a tool that allows for the transfer of data between Hadoop and
relational databases. Sqoop can be used to import data from a relational
database into Hadoop, or export data from Hadoop to a relational database. For
example, a retail company can use Sqoop to transfer data from their inventory
system to Hadoop for analysis.
Spark: Spark is a fast and powerful data processing engine that is used to
perform data processing tasks in-memory. Spark can be used for a variety of
data processing tasks such as data cleansing, data transformation, and data
modeling. For example, a financial services company can use Spark to analyze
credit card transactions in real-time to identify potential fraudulent activities.
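A minimal PySpark sketch of the Spark idea (assumes the pyspark package and a local Spark installation; the CSV path and column names are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("txn-summary").getOrCreate()

# Read a hypothetical CSV of card transactions and aggregate it in memory.
txns = spark.read.csv("transactions.csv", header=True, inferSchema=True)
summary = (txns.groupBy("card_id")
               .agg(F.count("*").alias("n_txns"),
                    F.avg("amount").alias("avg_amount")))
summary.show(5)
spark.stop()
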
Hadoop HDFS (Hadoop Distributed File System) has the following core
components:
NameNode: The NameNode is the master server that manages the file system
namespace and the metadata about files and blocks, and tracks which
DataNodes hold the replicas of each block.
DataNode: The DataNode is responsible for storing and serving the actual data
blocks of files. Each DataNode stores a portion of the data blocks and sends
heartbeats to the NameNode to inform it of its status.
Block: A block is the smallest unit of data that can be stored in HDFS. By
default, each block is 128 MB in size. HDFS stores files as a series of blocks, and
each block is replicated to multiple DataNodes for fault tolerance.
Rack: A rack is a collection of DataNodes that are physically close to each other.
HDFS replicates data blocks across multiple racks to improve fault tolerance
and reduce the likelihood of data loss.
File System Image: The file system image is a file that contains the metadata
about all the files and directories in HDFS. The NameNode reads this file when
it starts up and creates an in-memory representation of the namespace.
Edit Log: The edit log is a file that contains a record of all the changes that have
been made to the namespace since the last checkpoint. The NameNode uses
the edit log to rebuild the in-memory namespace after a restart.
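A back-of-the-envelope sketch of how a file maps onto blocks and replicas, assuming the default 128 MB block size and a replication factor of 3:

BLOCK_SIZE_MB = 128
REPLICATION = 3

file_size_mb = 1000                               # a hypothetical ~1 GB file
num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)    # ceiling division -> 8 blocks
stored_mb = file_size_mb * REPLICATION            # bytes kept across DataNodes
                                                  # (the last block is not padded)
print(f"{num_blocks} blocks, roughly {stored_mb} MB of raw storage")
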
Distributed computing and parallel computing are two related but distinct
concepts. In distributed computing, tasks are divided among multiple
independent computers or nodes that communicate and coordinate their
actions to achieve a common goal. Parallel computing, on the other hand,
involves breaking down a single task into smaller, independent sub-tasks that
are executed simultaneously on multiple processors or cores within a single
computer.
Here are some challenges that are common to both distributed and parallel
computing: dividing work evenly across processors or nodes (load balancing),
the overhead of communication and synchronization between tasks, tolerating
failures, and the difficulty of debugging and testing concurrent programs.
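A minimal Python sketch of the parallel side of this distinction: the work is split into independent sub-tasks that run on several cores of one machine. A distributed version would ship the same sub-tasks to separate nodes (for example with Spark), which is not shown here.

from multiprocessing import Pool

def partial_sum(chunk):
    # independent sub-task: sum one slice of the data
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]    # split into 4 independent slices
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total == sum(data))                  # True: same result, computed in parallel
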
RDBMS and Hadoop are two different technologies used for storing and
processing data, and there are several differences between them: an RDBMS
stores structured data under a fixed schema and typically scales up on a single
server, while Hadoop stores structured, semi-structured, and unstructured data
across a cluster of commodity machines and scales out by adding nodes; an
RDBMS is queried with SQL and suits low-latency transactional workloads,
whereas Hadoop processes data in large batches with frameworks such as
MapReduce and Spark.
In summary, RDBMS and Hadoop have different strengths and are used for
different purposes. RDBMS is good for transactional applications, while Hadoop
is better suited for handling large amounts of data in batch processing.