BDA Mid Term Self
Characteristics of Big Data:
Volume: Big data refers to datasets that are too large and complex to be
processed using traditional data processing tools. The amount of data
generated by individuals and organizations today is growing exponentially, and
big data can refer to terabytes, petabytes, or even exabytes of data.
Velocity: Big data is often generated at a very high velocity, with data streaming
in from various sources in real-time. This speed of data generation poses
challenges for traditional data processing tools, which are often unable to keep
up with the rate of data flow.
Variety: Big data is characterized by its variety of data types and sources,
including structured, semi-structured, and unstructured data. It can come from
various sources, such as social media, sensors, mobile devices, and more.
Veracity: Big data can be highly unreliable, with issues such as inaccuracies,
errors, and biases that can affect the quality of the data. This makes it
challenging to analyze and draw accurate conclusions from the data.
Value: The main goal of big data is to extract value from large and complex
datasets. The insights generated from big data can help organizations make
more informed decisions, improve their products and services, and identify
new opportunities.
Variability: Big data can have high variability due to the dynamic nature of data
sources and the constantly evolving environment in which the data is
generated.
Complexity: Big data can be complex to manage and analyze due to the sheer
volume, velocity, and variety of data. This requires specialized tools,
techniques, and skills to process, store, and analyze the data effectively.
Structured Data:
Structured data refers to data that is organized in a predefined format whose
structure is known in advance. Structured data is commonly stored in
databases, spreadsheets, and tables. Examples of structured data include sales
records, financial transactions, and customer information.
Unstructured Data:
Unstructured data refers to data that does not have a specific format or
structure. It can be any type of data, such as text, images, audio, and video. This
type of data is challenging to manage and analyze because it lacks a fixed
schema. Examples of unstructured data include social media posts, emails,
images, and videos.
Semi-structured Data:
Semi-structured data refers to data that is not completely structured but has
some organization. It is similar to unstructured data, but it carries
predefined tags, labels, or metadata that provide some level of
structure. Examples of semi-structured data include XML files, JSON files, and
log files.
These three types of data require different methods of storage, processing, and
analysis, and they all present different challenges and opportunities for
businesses and organizations looking to extract value from their data.
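As a small illustration (the records and field names below are made up, not taken from any specific system), the same customer interaction can appear in all three forms:

import json

# Structured: a fixed set of typed columns, as in a database row.
structured_row = ("C1001", "Asha Rao", "2024-03-01", 2499.00)

# Semi-structured: keys/tags give partial structure, but fields can vary per record.
semi_structured = json.loads(
    '{"customer_id": "C1001", "channel": "mobile_app", "tags": ["returning", "premium"]}'
)

# Unstructured: free text with no predefined schema.
unstructured = "Loved the delivery speed, but the packaging was damaged again."

print(structured_row[1], semi_structured["channel"], len(unstructured.split()))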
Challenges of Big Data:
1. Data quality: Big data often comes from disparate sources, making it
difficult to ensure data quality. Poor data quality can lead to inaccurate
insights and poor decision-making.
2. Data privacy and security: With the abundance of data comes increased
risks of data breaches and cyberattacks. Companies must take measures
to protect their data and ensure compliance with privacy regulations.
3. Talent shortage: The demand for skilled data scientists and analysts has
far outpaced supply, making it challenging for companies to find and
retain the talent they need to effectively analyze big data.
4. Infrastructure and cost: Big data requires significant infrastructure and
computing resources to manage and analyze. This can be costly and
difficult for smaller organizations to manage.
5. Legal and ethical concerns: Big data raises legal and ethical concerns,
such as data privacy, ownership, and potential bias in algorithms.
Companies must navigate these concerns to ensure they are using data
ethically and responsibly.
BDA in finance
Big data analytics has revolutionized the way financial institutions operate, by
enabling them to leverage massive amounts of data to extract valuable insights
and make more informed business decisions. Here are some of the key
applications of big data analytics in finance:
Risk Management: Big data analytics can be used to identify and mitigate
potential risks in the financial sector. By analyzing large datasets, financial
institutions can detect patterns and anomalies that may indicate fraudulent
activities or other risks.
Customer Segmentation: Big data analytics can help financial institutions
segment their customers based on demographics, purchasing behavior, and
other factors. This enables them to create more personalized marketing
campaigns, develop new products and services, and improve customer
satisfaction.
Fraud Detection: Big data analytics can be used to detect fraudulent activities
in real-time. By analyzing large volumes of data from multiple sources, financial
institutions can identify unusual transactions or behavior that may indicate
fraud.
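A minimal sketch of this idea in Python (the amounts and the threshold are hypothetical; real fraud systems combine many more signals and trained models): flag a new transaction whose amount is unusually far from the customer's historical average.

from statistics import mean, stdev

# Hypothetical history of one customer's recent transaction amounts.
history = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.9]
new_amount = 4999.0

avg, sd = mean(history), stdev(history)

# Flag the new transaction if it is more than 3 standard deviations above
# the customer's historical average (threshold chosen only for illustration).
z = (new_amount - avg) / sd
if z > 3:
    print(f"Flag for review: {new_amount} (z-score {z:.1f})")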
Compliance Monitoring: Big data analytics can help financial institutions ensure
compliance with regulatory requirements. By analyzing large volumes of data,
they can detect potential compliance issues and take appropriate action to
address them.
Credit Risk Assessment: Big data analytics can help financial institutions assess
the credit risk of their customers. By analyzing data on their credit history,
income, and other factors, they can determine the likelihood of default and set
appropriate interest rates and credit limits.
Overall, big data analytics has the potential to transform the financial sector by
providing valuable insights and enabling more informed decision-making.
Advantages of DBMS:
Data Security: DBMS systems provide a secure environment for data storage
and access by allowing users to define access control policies and by enforcing
these policies.
Data Sharing: DBMS systems allow multiple users to share data simultaneously,
providing a way to collaborate on projects and share information across the
organization.
Disadvantages of DBMS:
Cost: DBMS systems can be expensive to acquire and maintain, especially for
small businesses and individual users.
Single Point of Failure: DBMS systems can become a single point of failure,
with the entire application depending on the availability of the database
server.
Vendor Lock-In: DBMS systems can create vendor lock-in, with users becoming
dependent on a particular DBMS vendor and its proprietary technologies.
Some key differences between file processing systems and DBMS include:
Data redundancy and inconsistency: In a file processing system, the same data
may be stored in multiple files, which can lead to data redundancy and
inconsistency. In a DBMS, data is stored in a structured format with predefined
relationships between tables, which can help ensure data consistency.
Applications of DBMS:
Financial management: DBMS can be used to store and manage financial data
such as transactions, account balances, and financial statements. This
information can be used to manage cash flow, forecast revenues, and analyze
financial performance. An example of a financial management system is
QuickBooks.
Human resource management (HRM): DBMS can be used to store and manage
employee information such as personal details, job history, and performance
metrics. This information can be used to streamline HR processes, improve
employee engagement, and support decision-making. An example of an HRM
system is Workday.
Foreign key: A foreign key is a field in one table that refers to the primary key
of another table. It is used to establish relationships between tables and to
ensure data integrity by enforcing referential integrity constraints. In ER
diagrams, a foreign key is denoted by a dotted underline.
Alternate key: An alternate key is a candidate key that was not chosen as the
primary key. It still uniquely identifies each record in the table and can serve
as a backup identifier when the primary key is unsuitable for a particular
lookup. In ER diagrams, an alternate key is denoted by a dashed underline.
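A small sketch of these keys in SQL, run here through Python's built-in sqlite3 module (the table and column names are made up for illustration): email acts as an alternate key via a UNIQUE constraint, and orders.customer_id is a foreign key referencing customers.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,     -- primary key
    email       TEXT NOT NULL UNIQUE     -- alternate (candidate) key
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)  -- foreign key
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1)")        # OK: customer 1 exists
try:
    conn.execute("INSERT INTO orders VALUES (11, 99)")   # fails: no customer 99
except sqlite3.IntegrityError as e:
    print("Referential integrity violation:", e)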
Entity Integrity Constraint: This constraint ensures that each entity in a table
has a unique identifier, also known as a primary key. The primary key cannot
be null or empty, and it must be unique for each entity.
Domain Integrity Constraint: This constraint ensures that the values stored in a
database column meet certain predefined criteria, such as data type, range, or
format.
Check Constraint: This constraint ensures that the values stored in a database
column meet a specific condition or set of conditions, specified by the user.
User-defined Integrity Constraint: This constraint allows the user to define
custom rules that must be satisfied by the data stored in a database. This type
of constraint is often used to enforce business rules or other requirements
specific to a particular application or organization.
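A short sketch of these constraints in SQL via sqlite3 (illustrative schema; the salary rule stands in for a user-defined business rule, which in practice might also be implemented with triggers or application logic):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE employees (
    emp_id  INTEGER PRIMARY KEY,                      -- entity integrity: unique, non-null
    name    TEXT NOT NULL,                            -- domain integrity: type plus NOT NULL
    age     INTEGER CHECK (age BETWEEN 18 AND 65),    -- check constraint on a column
    salary  REAL CHECK (salary >= 0)                  -- user-defined business rule
)
""")

conn.execute("INSERT INTO employees VALUES (1, 'Ravi', 30, 50000.0)")       # accepted
try:
    conn.execute("INSERT INTO employees VALUES (2, 'Meera', 15, 40000.0)")  # age fails CHECK
except sqlite3.IntegrityError as e:
    print("Constraint violated:", e)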
ACID is an acronym for the four properties that guarantee the reliability and
consistency of database transactions: Atomicity, Consistency, Isolation, and
Durability. These properties are essential for maintaining the integrity of data
in a database and ensuring that transactions are processed reliably.
Atomicity:
Atomicity refers to the property that guarantees that a transaction is treated
as a single, indivisible unit of work. It means that either all the operations
within a transaction are executed successfully or none of them is. If any
operation within the transaction fails, the entire transaction is rolled back, and
the database is returned to its previous state.
Consistency:
Consistency ensures that a transaction brings the database from one valid
state to another. The transaction should follow all the predefined rules and
constraints of the database. It means that any data written to the database
should be consistent with the database's schema and constraints. Inconsistent
data should not be written to the database.
Isolation:
Isolation refers to the property that ensures that concurrent transactions do
not interfere with each other. Each transaction should execute independently
of other transactions, without affecting the results of other concurrent
transactions. This is achieved by ensuring that transactions run in isolation and
are not affected by other transactions until they are completed.
Durability:
Durability refers to the property that guarantees that once a transaction is
committed, its effects are permanent and will survive any subsequent system
failures. It means that the changes made by a transaction should be recorded
in a permanent storage medium such as a hard disk. These changes should
persist even if the system crashes, power is lost, or some other catastrophic
event occurs.
Together, the ACID properties provide a set of guarantees that ensure that
transactions are reliable, consistent, and recoverable in the event of failures.
This makes ACID a critical set of properties for modern database systems.
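A minimal sketch of atomicity using sqlite3 (hypothetical accounts table): the two balance updates inside the transaction either both take effect or, when one fails, are rolled back together, leaving the database in its previous state.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")  # succeeds
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")  # violates CHECK
except sqlite3.IntegrityError:
    print("Transfer failed; the whole transaction was rolled back")

# Both balances are unchanged: [(1, 100.0), (2, 50.0)]
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())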
First Normal Form (1NF): In this form, each column of a table should contain
only atomic (indivisible) values, and each row must be unique. For example,
consider a table that stores customer orders. Instead of cramming all of an
order's items into a single column, each item gets its own row, identified by
the combination of the order ID and the item.
Second Normal Form (2NF): In this form, the table should meet the
requirements of 1NF, and every non-key column should be fully functionally
dependent on the entire primary key (no partial dependency on only part of a
composite key). For example, in a table that stores orders and products, each
product should have its own unique identifier, and information that depends
only on the product, such as its name, price, and description, should be moved
to a separate products table.
Third Normal Form (3NF): In this form, the table should meet the requirements
of 2NF, and all non-key columns should depend only on the primary key, not on
other non-key columns (no transitive dependencies). For example, in a table
that stores customer orders, the customer's name and address depend on the
customer, not the order, so they should be stored in a separate customer table.
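A compact sketch of the normalized shape described above, again using sqlite3 (illustrative table and column names): product and customer details live in their own tables and are referenced from orders, rather than being repeated on every order row.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 1NF: atomic columns only (e.g. no comma-separated lists of items in one cell).
-- 2NF: product details depend on product_id alone, so they get their own table.
-- 3NF: customer name/address depend on customer_id, not on the order itself.
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT, price REAL, description TEXT);
CREATE TABLE orders    (order_id    INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(customer_id),
                        order_date  TEXT);
""")
print("Normalized schema created")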
A weak entity set is an entity set that cannot be uniquely identified by its
attributes alone. In other words, it depends on another entity set (the strong
entity set) for its identity. For example, consider an entity set called "Order
Item" that represents the items in a customer's order. An order item has
attributes of its own (such as a line number or product SKU), but these do not
uniquely identify it on their own, since many different orders can contain the
same item. It therefore depends on the key of the "Order" entity set to give it
context and make it unique.
Codd's rules (selected): The comprehensive data sublanguage rule requires one
well-defined language that supports:
Data Definition
View Definition
Data Manipulation (Interactive and by Program)
Integrity Constraints
Authorization
View Updating Rule: All views that are theoretically updatable are also
updatable by the system.
Let's take a look at some of the Hadoop ecosystem components and their
functionalities:
Hive: Hive is a data warehousing tool that allows SQL-like queries to be
executed on Hadoop data sets. It allows analysts to use familiar SQL commands
to extract meaningful insights from Big Data. For example, a company can use
Hive to analyze customer behavior by querying the data from their website and
mobile app.
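A minimal sketch of such a Hive-style query (assuming a cluster where Spark is configured against the Hive metastore and a hypothetical page_views warehouse table; PySpark is used here only as a convenient way to submit the SQL):

from pyspark.sql import SparkSession

# Assumes Spark is set up to use the Hive metastore on the cluster.
spark = (SparkSession.builder
         .appName("customer-behavior")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical warehouse table of website/app page views.
daily_views = spark.sql("""
    SELECT customer_id, COUNT(*) AS views
    FROM page_views
    WHERE view_date = '2024-03-01'
    GROUP BY customer_id
    ORDER BY views DESC
    LIMIT 10
""")
daily_views.show()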
Pig: Pig is a platform for analyzing large data sets that allows complex data
processing pipelines to be built with a high-level scripting language (Pig
Latin). Pig jobs run as batch workflows on the cluster. For example, a
telecommunications company can use Pig to process network performance logs and
identify recurring issues before they become critical.
HBase: HBase is a distributed NoSQL database that is used to store and manage
large amounts of structured and semi-structured data. HBase is used to store
data that requires fast access and retrieval. For example, a social media
platform can use HBase to store user profiles and their social connections.
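A small sketch of storing and reading a user profile from Python (assuming the HappyBase client, an HBase Thrift gateway at the hypothetical host below, and a pre-created 'user_profiles' table with a 'profile' column family):

import happybase

# Assumed Thrift gateway host; table and column family are illustrative.
connection = happybase.Connection('hbase-thrift.example.com')
table = connection.table('user_profiles')

# HBase stores raw bytes, addressed by row key and column family:qualifier.
table.put(b'user:1001', {
    b'profile:name': b'Asha Rao',
    b'profile:followers': b'1542',
})

row = table.row(b'user:1001')
print(row[b'profile:name'])
connection.close()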
Sqoop: Sqoop is a tool that allows for the transfer of data between Hadoop and
other relational databases. Sqoop can be used to import data from a relational
database to Hadoop, or export data from Hadoop to a relational database. For
example, a retail company can use Sqoop to transfer data from their inventory
system to Hadoop for analysis.
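Sqoop is driven from the command line; a hedged sketch of such an import, launched here from Python (the JDBC connection string, table name, user, and target directory are hypothetical, and exact options can vary by Sqoop version and database driver):

import subprocess

# Hypothetical MySQL inventory database imported into HDFS for analysis.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/retail",
    "--table", "inventory",
    "--username", "etl_user", "-P",          # -P prompts for the password
    "--target-dir", "/data/raw/inventory",
    "--num-mappers", "4",
], check=True)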
Spark: Spark is a fast and powerful data processing engine that is used to
perform data processing tasks in-memory. Spark can be used for a variety of
data processing tasks such as data cleansing, data transformation, and data
modeling. For example, a financial services company can use Spark to analyze
credit card transactions in real-time to identify potential fraudulent activities.
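A brief PySpark sketch in the spirit of that example (the input file and columns are hypothetical; a production pipeline would use streaming input and trained models rather than a fixed multiplier):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("txn-screening").getOrCreate()

# Hypothetical input file with columns: card_id, amount, merchant, ts.
txns = spark.read.csv("/data/transactions.csv", header=True, inferSchema=True)

# Compare each transaction to the card's average spend, computed in memory.
avg_spend = txns.groupBy("card_id").agg(F.avg("amount").alias("avg_amount"))
suspicious = (txns.join(avg_spend, "card_id")
                  .filter(F.col("amount") > 10 * F.col("avg_amount")))

suspicious.show(20)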
Hadoop HDFS (Hadoop Distributed File System) has the following core
components:
NameNode: The NameNode is the master node that manages the file system
namespace and the metadata recording which blocks make up each file and which
DataNodes hold them.
DataNode: The DataNode is responsible for storing and serving the actual data
blocks of files. Each DataNode stores a portion of the data blocks and sends
heartbeats to the NameNode to inform it of its status.
Block: A block is the smallest unit of data that can be stored in HDFS. By
default, each block is 128 MB in size. HDFS stores files as a series of blocks, and
each block is replicated to multiple DataNodes for fault tolerance.
Rack: A rack is a collection of DataNodes that are physically close to each other.
HDFS replicates data blocks across multiple racks to improve fault tolerance
and reduce the likelihood of data loss.
File System Image: The file system image is a file that contains the metadata
about all the files and directories in HDFS. The NameNode reads this file when
it starts up and creates an in-memory representation of the namespace.
Edit Log: The edit log is a file that contains a record of all the changes that have
been made to the namespace since the last checkpoint. The NameNode uses
the edit log to rebuild the in-memory namespace after a restart.
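A few illustrative HDFS shell interactions, run here from Python (assuming a configured Hadoop client on the PATH; the file and directory paths are hypothetical). The fsck report shows how a file is split into blocks and where the replicas live:

import subprocess

# Copy a local file into HDFS; the NameNode records which DataNodes hold each block.
subprocess.run(["hdfs", "dfs", "-put", "sales_2024.csv", "/data/sales/"], check=True)

# List the directory, then show the file's block and replica placement.
subprocess.run(["hdfs", "dfs", "-ls", "/data/sales"], check=True)
subprocess.run(["hdfs", "fsck", "/data/sales/sales_2024.csv",
                "-files", "-blocks", "-locations"], check=True)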
Distributed computing and parallel computing are two related but distinct
concepts. In distributed computing, tasks are divided among multiple
independent computers or nodes that communicate and coordinate their
actions to achieve a common goal. Parallel computing, on the other hand,
involves breaking down a single task into smaller, independent sub-tasks that
are executed simultaneously on multiple processors or cores within a single
computer.
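A tiny illustration of the parallel side of this distinction, using Python's standard multiprocessing module (the workload and numbers are made up): one task is split into independent sub-tasks that run simultaneously on multiple cores of a single machine.

from multiprocessing import Pool

def chunk_sum(chunk):
    # Independent sub-task: process one slice of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # split one task into 4 sub-tasks
    with Pool(processes=4) as pool:
        partials = pool.map(chunk_sum, chunks)   # sub-tasks run on separate cores
    print(sum(partials))

In distributed computing, the same decomposition would instead be shipped to separate machines that coordinate over a network (as MapReduce does), rather than to cores within one computer.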
Here are some challenges that are common to both distributed and parallel
computing:
Coordination and synchronization: work must be split, scheduled, and
synchronized across processors or nodes, which adds overhead.
Load balancing: work should be spread evenly so that no processor or node
becomes a bottleneck while others sit idle.
Fault tolerance: the failure of one processor or node should not corrupt
results or halt the entire computation.
Debugging and testing: concurrent execution makes race conditions and
timing-dependent bugs hard to reproduce and diagnose.
RDBMS and Hadoop are two different technologies used for storing and
processing data, and there are several differences between them:
Data model: an RDBMS stores structured data in tables with a fixed schema
(schema-on-write), while Hadoop can store structured, semi-structured, and
unstructured data and applies structure at read time (schema-on-read).
Scalability: an RDBMS typically scales vertically on a single powerful server,
while Hadoop scales horizontally across clusters of commodity machines.
Processing: an RDBMS is optimized for low-latency transactions and interactive
SQL queries, while Hadoop is optimized for high-throughput batch processing of
very large datasets.
In summary, RDBMS and Hadoop have different strengths and are used for
different purposes. RDBMS is good for transactional applications, while Hadoop
is better suited for handling large amounts of data in batch processing.