SQL Basics
NAME      MARKS
KARTHI P  12,34,55,98
BHUVI P   87,34
HARI      23,54,67
MOHAN S   12,23,23,24,34
VINAY     11,22,33,44,55,66,77,88,99
It helps companies to analyze their business data for taking critical business decisions.
Transactional databases cannot answer the complex business questions that a data warehouse built through ETL can.
A Data Warehouse provides a common data repository
ETL provides a method of moving the data from various sources into a data warehouse.
As data sources change, the Data Warehouse will automatically update.
Well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
Allow verification of data transformation, aggregation and calculations rules.
ETL process allows sample data comparison between the source and the target system.
ETL process can perform complex transformations and requires a staging area to store the data.
ETL helps to migrate data into a Data Warehouse and convert data of various formats and types to adhere to one consistent system.
ETL is a predefined process for accessing and manipulating source data into the target database.
ETL in data warehouse offers deep historical context for the business.
It helps to improve productivity because it codifies and reuses without a need for technical skills.
2) Source to Target Count Testing: Make sure that the count of records loaded in the target matches the expected count.
3) Source to Target Data Testing: Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.
4) Data Quality Testing: Make sure that the ETL application appropriately rejects, replaces with default values and reports invalid data.
5) Performance Testing: Make sure that data is loaded in the data warehouse within the prescribed and expected time frames.
6) Production Validation Testing: Validate the data in the production system and compare it against the source data.
7) Data Integration Testing: Make sure that the data from various sources has been loaded properly to the target system and all the threshold values are checked.
8) Application Migration Testing: In this testing, ensure that the ETL application is working fine on moving to a new box or platform.
9) Data & Constraint Check: The datatype, length, index, constraints, etc. are tested in this case.
10) Duplicate Data Check: Test if there is any duplicate data present in the target system. Duplicate data can lead to incorrect analytical reports.
Apart from the above ETL testing methods, other testing methods like system integration testing, user acceptance testing, incremental testing,
regression testing, retesting and navigation testing are also carried out to make sure that everything is smooth and reliable.
Given below is the list of objects that are treated as essential for validation in this testing (see the query sketch after this list):

Duplicate Check
1. Need to validate that the unique key, primary key and any other column that should be unique as per the business requirements do not have any duplicate rows.
2. Check if any duplicate values exist in any column which is extracted from multiple columns in the source and combined into one column.
3. As per the client requirements, need to ensure that there are no duplicates in a combination of multiple columns within the target only.

Date Validation
Date values are used in many areas in ETL development; sometimes the updates and inserts are generated based on the date values.

Complete Data Validation
1. To validate the complete data set in the source and target table, a minus query is the best solution.
2. We need to run source minus target and target minus source.
3. If the minus query returns any rows, those should be considered as mismatching rows.
4. Need to match rows between source and target using an intersect statement.
5. The count returned by intersect should match with the individual counts of the source and target tables.
6. If the minus query returns rows and the intersect count is less than the source count or the target count, then we can consider that duplicate rows exist.

Data Cleanness
Unnecessary columns should be deleted before loading into the staging area.
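A minimal T-SQL sketch of the minus/intersect comparison described under Complete Data Validation above; source_table and target_table are hypothetical names for tables with identical column lists:

-- rows present in source but missing (or different) in target
select * from source_table
except
select * from target_table;

-- rows present in target but not in source
select * from target_table
except
select * from source_table;

-- matching rows; compare this count with the individual table counts
select count(*) as matching_rows
from ( select * from source_table
       intersect
       select * from target_table ) t;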
What to do when a defect is found in production but not during the QA phase?
Reproduce the problem on the production and the testing environment.
If the problem occurs only on the production environment, then it may be due to a configuration issue.
On the other side, if it also occurs on the QA environment, then check the impact of that issue on the application.
Do a clear root cause analysis.
Investigate the issue to find out how long that defect has been around.
Determine the fix for the ticket and list out the areas where it can have more impact.
If the issue is impacting more customers, then go for a hotfix and deploy it on the QA environment.
The testing team should focus on testing all the regression scenarios around the fix.
If the applied fix works fine, then it should be deployed to production, and a post-release sanity check should be done so that the issue does not occur again.
Do a retrospection meeting.
Bug: It is the consequence/outcome of a coding error.
Defect: Testing is the process of identifying defects, where a defect is any variance between actual and expected results, i.e. the application actually doesn't meet the requirements.
An error found by the tester is called a Defect; once the defect is accepted by the development team, it is called a Bug.
After a defect is raised, the team will check whether it is a valid defect. If not valid, the bug is rejected, and its new status is REJECTED. If the same defect has already been reported, it is given the status DUPLICATE.
For example:
A customer dimension may hold attributes such as name, address, and phone number.
Over time, a customer's details may change (e.g. move addresses, change phone number, etc).
A slowly changing dimension is able to accommodate these changes, with some SCD patterns having the added ability to preserve the history of those changes.
Deciding on which type of slowly changing dimension pattern to implement will vary based on your business requirements.
SCD Type 0
There are situations where you ignore any changes. For example, when an employee joins an organization, there are joining-related attributes such as the joined Designation and JoinedDate, etc. that should not change over time.
The following is the example for Type 0 of Slowly Changing Dimensions in Data Warehouse.
In the above Customer Dimension, FirstDesignation, JoinedDate and DateFirstPurchase are the attributes that will not be updated, which is Type 0 SCD.
SCD Type 1
In the Type 1 SCD, you simply overwrite data in dimensions. There can be situations where you don't have the entire data when the record is initiated in the dimension. For example, when the customer record is initiated, you may not get all attributes. Therefore, when the customer record is initiated at the operational database, there will be empty or null values in the customer dimension. Once the ETL is executed, those empty records will be created in the data warehouse. Once these attributes are filled in at the operational database, the next ETL run will simply overwrite the corresponding records in the dimension.
SCD Type 2
Type 2 Slowly Changing Dimensions in Data Warehouse is the most popular dimension type used in the data warehouse.
As we discussed, a data warehouse is used for data analysis, and if you need to analyze data, you need to accommodate historical aspects of the data.
For SCD Type 2, we need to include three more attributes such as StartDate, EndDate and IsCurrent, as shown below.
Type 2 Slowly Changing Dimensions in Data Warehouse
In the above customer dimension, there are two records; let us say that the customer whose CustomerCode is AW00011012 has changed over time. Since you cannot simply overwrite the record with the new value (you would no longer see the previous record), a new record is created with a new CustomerKey. All the new transactions will be related to CustomerKey 11013, while previous transactions remain related to CustomerKey 11012. This mechanism helps to preserve the historic aspect of the customer. (A small query sketch of this pattern appears at the end of this SCD section.)
SCD Type 3
Type 3 Slowly Changing Dimension in Data Warehouse is a simple implementation where history is kept in an additional column.
If we relate the same scenario that we discussed under Type 2 SCD to Type 3 SCD, the customer dimension would look like below.
Type 3 SCD
As you can see, historical aspects of the data are preserved as a different column. However, this method will not be scalable if you need to keep a long history.
Typically, this would be better suited to implement name changes of an employee. In some cases, female employees will change their names after their marriage; in such situations, you can use Type 3 SCD since these types of changes do not need the full history. Further, this technique will allow only the last version of the history to be kept, unlike Type 2 SCD.
SCD Type 4
As we discussed in SCD Type 2, we maintain the history by adding a different version of the row to the dimension. However, if the changes are rapid in nature, Type 2 SCD will not be scalable.
For example, let us assume we want to keep the customer risk type depending on his previous payments.
Since this is an attribute related to the customer, it should be stored in the customer dimension. This means every month there will be a new version of each customer record. If you have 1,000 customers, you are looking at 12,000 records per year. As you can imagine, this Slowly Changing Dimension will grow rapidly.
SCD Type 4 is introduced in order to fix this issue. In this technique, a rapidly changing column is moved out of the dimension into a new dimension table. This new dimension is linked to the fact table as shown in the below diagram.
With the above implementation of Type 4 Slowly Changing Dimensions in Data Warehouse, you are eliminating the unnecessary growth of the customer dimension.
However, you still have the capability of performing the required analysis.
SCD Type 6
Type 6 Slowly Changing Dimensions in Data Warehouse is a combination of Type 2 and Type 3 SCDs. This means that Type 6 SCD has both columns and rows in its implementation.
With this implementation, you can further improve the analytical capabilities in the data warehouse.
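A minimal T-SQL sketch of the Type 2 mechanism described above; DimCustomer, stg_Customer and the Address column are hypothetical names used only for illustration:

-- expire the current version of any customer whose tracked attribute changed
update d
set d.EndDate = getdate(), d.IsCurrent = 0
from DimCustomer d
join stg_Customer s on s.CustomerCode = d.CustomerCode
where d.IsCurrent = 1
  and s.Address <> d.Address;

-- insert the new version; a new surrogate CustomerKey is generated (assumes an identity column)
insert into DimCustomer (CustomerCode, Address, StartDate, EndDate, IsCurrent)
select s.CustomerCode, s.Address, getdate(), null, 1
from stg_Customer s
where not exists ( select 1 from DimCustomer d
                   where d.CustomerCode = s.CustomerCode and d.IsCurrent = 1 );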
To find the manager name for an employee where empid and managerid are in the same table, use a self join.
Either an inner self join or a left join works, but a left join also gives you data for the employees who
don't have a managerid (see the sketch below).
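A sketch of the self join, assuming a hypothetical emp table with empid, name and managerid columns:

select e.name as employee, m.name as manager
from emp e
left join emp m on e.managerid = m.empid;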
To find the Nth highest salary, use dense_rank (we can use it inside a WITH/CTE).
Don't use row_number: it gives a sequential number, so duplicate salaries get different numbers;
when there are duplicates it is better to use dense_rank (see the sketch below).
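A sketch of the Nth highest salary query using dense_rank, assuming a hypothetical emp table with a salary column (here N = 3):

with cte as (
    select salary, dense_rank() over (order by salary desc) as rnk
    from emp
)
select distinct salary from cte where rnk = 3;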
show a single or same row from a table twice in the results: use union all
find departments that have fewer than 3 employees: use join, group by, having (see the sketch below)
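A sketch, assuming hypothetical dept and emp tables joined on deptid:

select d.deptname, count(e.empid) as emp_count
from dept d
left join emp e on e.deptid = d.deptid
group by d.deptname
having count(e.empid) < 3;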
ISNULL
case,Max
i/p:
name sales
april 100
jan 200
june 300
dec 150
i/p:
Date sales
2021-10-01 100
2021-01-01 100
2021-03-01 100
2021-02-01 100
previous quarter sales use lag( )
i/p:
year sales quarter
2021 100 Q1
2021 200 Q2
2021 300 Q3
2020 500 Q1
2020 300 Q2
2020 400 Q3
Lead()
i/p:
year sales quarter
2021 100 Q1
2021 200 Q2
2021 300 Q3
2020 500 Q1
2020 300 Q2
2020 400 Q3
left,right,len,charindex
i/p:
id Name
1 cooper,adam
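A sketch of splitting the comma-separated Name into lastname and firstname using left, substring, len and charindex; t is a hypothetical table name for the input above:

select id, Name,
       left(Name, charindex(',', Name) - 1) as lastname,
       substring(Name, charindex(',', Name) + 1, len(Name)) as firstname
from t;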
fetch the TOP 50% of records
fetch the last 5 records
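Sketches for both, assuming a hypothetical emp table with an id column:

-- top 50 percent of records
select top 50 percent * from emp order by id;

-- last 5 records (by id)
select top 5 * from emp order by id desc;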
Which is faster in or EXISTS in SQL?
group by
Union, since it removes duplicates.
ORACLE SQL
Windows functions RANK,LAG,LEAD
last 30 days
select query to retrieve all the student names that start with 'M' without the like operator: use charindex, left or substring
tomorrow's date
yesterday's date
How to display employees who joined in leap years: either use the single condition year%4=0,
or use the full leap-year condition in a case expression against the column and flag each row
as leap or not a leap year.
insert into ship(id,name) values (1,'sherin');
SELECT * FROM [shipPERS] AS A
WHERE NOT EXISTS ( SELECT NULL FROM ship AS B WHERE A.SHIPPERID=B.SHIPPERID)
Use minus (a-b) and (b-a) operators and analyse the scenarios in the next sheet manually.
Else we can use a full outer join on the primary keys (where a.PK = b.PK); make sure the
primary keys don't have duplicates.
select emp,salary,count(*) from emp
group by emp,salary
having count(*) >1
WITH CTE as( select *,ROW_NUMBER() over (partition by email order by email) as RN from
customer)
select * from CTE where RN>1
select *, case when email = lag(email) over (order by email) then 'Yes' else 'No' end as duplicate
from customer order by email
select id,
case when Name='name' then value else '' end as name,
case when Name='Gender' then value else '' end as Gender,
case when Name='salary' then value else '' end as salary
from empl
o/p id name gender salary
1 Adam
1 male
1 50,000
month monthno sales
oct 10 100
jan 1 100
mar 3 100
feb 2 100
select year,quartername,sales ,lag(sales) over (partition by year order
by quarter) as previousquartersales from sales_detail
replace(address,' ',''), replace(address, char(9), '')
A NOT IN query doesn't give this emp id (NOT IN returns no rows when the subquery contains NULLs), whereas NOT EXISTS will give it.
CHAR is used to store strings of fixed length
VARCHAR2 is used to store strings of variable length
SELECT column FROM table
ORDER BY RAND ( )
LIMIT 1
SELECT ROUND(235.415, -1) AS RoundValue
SELECT ROUND(235.415, 1) AS RoundValue
SELECT ROUND(235.415, -2) AS RoundValue;
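Assuming SQL Server ROUND semantics, the three statements above return 240.000, 235.400 and 200.000 respectively (a negative length rounds to the left of the decimal point).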
SELECT concat('000',substring(phone,6,15)) a,*
FROM Shippers;
select * from (select *, row_number() over (partition by id order by (select null)) as rw from ship) a
where rw=1
select id,name from emp group by id,name
select id,name from emp union select id,name from emp
select id,name from emp union select null,null from dual where 1=2
Window function: a function which uses values from one or multiple rows to return a value for each row.
(This contrasts with an aggregate function, which returns a single value for multiple rows.)
Yes, we can join as long as the column values involved in the join can be converted into one datatype.
Blocking: occurs if a transaction tries to acquire an incompatible lock on a resource that
another transaction has already locked. The blocked transaction remains blocked until the
blocking transaction releases the lock.
Deadlock: occurs when two or more transactions have a resource locked, and each transaction
requests a lock on a resource that another transaction has already locked. Neither of the
transactions here can move forward, as each one is waiting for the other to release the lock.
In this case, SQL Server intervenes and ends the deadlock by cancelling one of the
transactions, so the other transaction can move forward.
select * from emp where charindex('M',name)=1
select * from emp where Left(name,1)='M'
select * from emp where substring(name,1,1)='M'
select name,cast(dob as date) from emp where day(dob)=9 and month(dob) =10
select name,cast(dob as date) from emp where year(dob)=2017
select dateadd(day,-1,cast(getdate() as date)) = yesterday's date
select name,cast(dob as date) from emp where cast(dob as date)=dateadd(day,-1,cast(getdate() as date))
select name,cast(dob as date) from emp where cast(dob as date)=dateadd(day,1,cast(getdate() as date))
select name,cast(dob as date) from emp where cast(dob as date) between
dateadd(day,-1,cast(getdate() as date)) and cast(getdate() as date)
select name,cast(dob as date) from emp where cast(dob as date) between
dateadd(day,-7,cast(getdate() as date)) and dateadd(day,-1,cast(getdate() as date))
Alter table Employees
add constraint FK_Dept_Employees_Cascade_Delete
foreign key (DeptId) references Departments(Id) on delete cascade
Begin Try
Begin Tran
Commit Tran
End Try
Begin Catch
Rollback Tran
End Catch
Create function UDF_ExtractNumbers
(
@input varchar(255)
)
Returns varchar(255)
As
Begin
-- position of the first non-numeric character
Declare @alphabetIndex int = Patindex('%[^0-9]%', @input)
-- keep removing non-numeric characters until none are left
While @alphabetIndex > 0
Begin
Set @input = Stuff(@input, @alphabetIndex, 1, '' )
Set @alphabetIndex = Patindex('%[^0-9]%', @input )
End
Return @input
End
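A quick usage check of the function above (the input string is just an example):

select dbo.UDF_ExtractNumbers('abc123def45') = '12345'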
select * from ( select row_number() over (order by outletnumber) as rn, * from wrap.dimoutlet ) a
where rn between 55 and 60
select * from emp where year(hiredate)%4=0
select * from (
select *, case when (year(hiredate)%4=0 and year(hiredate)%100<>0) or year(hiredate)%400=0
then 'Leapyear'
else 'Not a leap year'
end as status
from emp) a
where a.status='Leapyear'
1) 1,2,3,4,5
2) 1,2,3
3) 4,5
4) 4,5
5) (cartesian product)
a-1,1,1,1,1,1,2,2,2,2,2,2,3,4,5
b-1,1,1,1,1,1,2,2,2,2,2,2,3,null,null
6) a-1,1,1,1,1,1,2,2,2,2,2,2,3
b-1,1,1,1,1,1,2,2,2,2,2,2,3
7) a-1,1,1,1,1,1,2,2,2,2,2,2,3
b-1,1,1,1,1,1,2,2,2,2,2,2,3
select id,
Max(case when Name='name' then value else '' end) as name,
Max(case when Name='Gender' then value else '' end) as Gender,
Max(case when Name='salary' then value else '' end) as salary
from empl
group by id
o/p id name gender salary
1 Adam male 50000
month monthno sales
jan 1 100
feb 2 100
mar 3 100
oct 10 100
o/p: lag(sales)
year sales quarter previousquarter
2021 100 Q1 NULL
2021 200 Q2 100
2021 300 Q3 200
2020 500 Q1 NULL
2020 300 Q2 500
2020 400 Q3 300
o/p: lead(sales)
year sales quarter nextquarter
2021 300 Q3 NULL
2021 200 Q2 300
2021 100 Q1 200
2020 400 Q3 NULL
2020 300 Q2 400
2020 500 Q1 300
o/p
id name lastname firstname
If the counts are the same then we can use intersect.
select country, city1, city2, city3
from (select *, 'city' + cast(row_number() over (partition by country order by country) as varchar(10)) as columnsequence
from <source_table>) temp
pivot
( max(city)
for columnsequence in (city1, city2, city3)) dd
o/p: lag(sales,2) - gives the sales from 2 quarters back
year sales quarter previousquarter
2021 100 Q1 NULL
2021 200 Q2 NULL
2021 300 Q3 100
2020 500 Q1 NULL
2020 300 Q2 NULL
2020 400 Q3 500
Database: a collection of data in the form of tables, e.g. ODS.
An Organized collection of related data which stores data in a tabular format
A database is any collection of data organized for storage, accessibility, and
retrieval.
Data Integration Layer − The integration layer transforms the data from the
staging layer and moves the data to a database, where the data is arranged
into hierarchical groups, often called dimensions, and into facts and aggregate
facts. The combination of facts and dimensions tables in a DW system is called
a schema.
Access Layer − The access layer is used by end-users to retrieve the data for
analytical reporting and information
Data Purging: When data needs to be deleted from the data warehouse, it can
be a very tedious task to delete data in bulk. The term data purging refers to
methods of permanently erasing and removing data from a data warehouse.
When you delete data, you are removing it on a temporary basis, but
when you purge data, you are permanently removing the data and freeing up
memory or storage space. In general, the data that is deleted is usually junk
data such as null values or extra spaces in the row. Using this approach, users
can delete multiple files at once and maintain both efficiency and speed.
OLAP: the main objective is data analysis; it is used for analysing the business.
Data is denormalised.
Involves historical processing of information.
Useful in analysing the business; we need to know whether the business process is in gain/loss.
It focuses on information out (the outcome of existing data).
Based on star, snowflake and fact constellation schemas - dimensional.
Contains historical data.
Provides summarized and consolidated data (which means we are taking high-level
info for analysis purposes).
Examples of OLAP:DWH
Annual financial performance
Trends in marketing leads
Features of OLAP:
They manage historical data.
These systems do not make changes to the transactional data.
Their primary objective is data analysis and not data processing.
Data warehouses store data in multidimensional format.
Advantages of OLAP:
Businesses can use a single multipurpose platform for planning, budgeting,
forecasting, and analysing.
Information and calculations are very consistent in OLAP.
Adequate security measures are taken to protect confidential data.
Disadvantages of OLAP:
Traditional tools in this system need complicated modelling procedures.
Therefore, maintenance is dependent on IT professionals.
Collaboration between different departments might not always be possible.
Fact: Facts are the business events that you want to measure - quantitative
information about the business process, also called measurements/metrics.
A fact table holds numeric data and the foreign keys of the dimensions; it can be
seen as a table which captures the interaction between these different dimensions,
e.g. quantity, sales amount, profit, turnover etc.
Dimensional modelling:
A process of arranging data into dimensions and facts.
Dimensions and facts are the building blocks of a dimensional model.
Data Model: The data models are used to represent the data and how it is
stored in the database and to set the relationship between data items
Data model tells how the logical structure of a database is modelled.
Pictorial representation of tables
represents the relationship between the tables
1)conceptual data model
2)Logical data model
3)physical data model
Schemas: a schema is a logical description of the entire database.
A database uses the relational model, while a data warehouse uses star, snowflake
and fact constellation (galaxy) schemas.
In a data warehouse or a data mart, there are three areas of where data
integrity needs to be enforced:
Database level
We can enforce data integrity at the database level. Common ways of
enforcing data integrity include:
Referential integrity
The relationship between the primary key of one table and the foreign key of
another table must always be maintained. For example, a primary key cannot
be deleted if there is still a foreign key that refers to this primary key.
Primary key / Unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a
table can be uniquely identified.
Not NULL vs. NULL-able
For columns identified as NOT NULL, they may not have a NULL value.
Valid Values
Only allowed values are permitted in the database. For example, if a column
can only have positive integers, a value of '-1' cannot be allowed.
ETL process
For each step of the ETL process, data integrity checks should be put in place
to ensure that source data is the same as the data in the destination. Most
common checks include record counts or record sums.
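A minimal sketch of the record count / record sum check mentioned above, assuming hypothetical source_table and target_table with an amount column:

select (select count(*)    from source_table) as source_count,
       (select count(*)    from target_table) as target_count,
       (select sum(amount) from source_table) as source_sum,
       (select sum(amount) from target_table) as target_sum;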
Access level
We need to ensure that data is not altered by any unauthorized means either
during the ETL process or in the data warehouse. To do this, there needs to be
safeguards against unauthorized access to data (including physical access to
the servers), as well as logging of all data access history. Data integrity can only
be ensured if there is no unauthorized access to the data.
ETL Process
SQL Commands
DDL - defines the db schemas
DML - manipulates the data in the db
DCL - deals with rights, permissions and other controls of the db
DQL - queries the data in the db (SELECT)
TCL - manages the transactions of the db
DBMS: DBMS is a software application that interacts with users, applications and
the database itself to capture and analyse data. Data stored in the database
can be retrieved, modified and deleted, and the data can be in any form:
strings, images, numbers.
Types of DBMS: Hierarchical, object-oriented, network and relational.
Constraints: constraints are used to specify limits on the data that can go into a
table. They can be specified while altering or creating the table.
PrimaryKey: a primary key is used as a unique identifier for each record in the
table. We cannot store NULL values in it, and a table supports only one primary key.
Composite key: a composite key is a key having two or more columns that
together can uniquely identify a row in a table. In the score table, the primary key is the
composition of two columns (subjectid + studentid), as in the sketch below.
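A sketch of the composite key described above, using a hypothetical score table:

create table score (
    subjectid int not null,
    studentid int not null,
    marks int,
    constraint pk_score primary key (subjectid, studentid)   -- composite primary key
);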
Triggers:
A trigger is a special type of stored procedure that automatically runs when an
event occurs in the database server.
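A minimal sketch of an AFTER INSERT trigger, assuming hypothetical emp and emp_audit tables:

create trigger trg_emp_insert
on emp
after insert
as
begin
    -- log every newly inserted employee id into an audit table
    insert into emp_audit (empid, action_date)
    select empid, getdate() from inserted;
end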
1NF: Scalable table design which can be extended. It has four rules:
1) Each column should contain atomic values (a single value).
2) All values in a column must have the same datatype.
3) Column names should be unique; the same name leads to confusion during
retrieval.
4) The order in which the data is saved doesn't matter.
BCNF/3.5NF: Prerequisites: the table should be in 3NF. When we have more than one
candidate key, BCNF comes into play: it divides the table so that each table has one
candidate key.
Types of Loading:
1.Syntax analysis
2.Semantic analysis
- Workflow: a set of instructions that specifies the way to execute the tasks to the
Informatica server.
Data mart: an enterprise data warehouse can be divided into subsets, also called
data marts, which are focused on a particular business unit or
department. Data marts allow selected groups of users to easily access
specific data without having to search through an entire data
warehouse.
E.g., Marketing, Sales, HR or finance of an organization.
Data mining is considered as a process of extracting data from large
data sets. It looks for hidden patterns within the data set and tries to
predict future behavior. Data mining is primarily used to discover and
indicate relationships among the data sets. Data mining aims to enable
business organizations to view business behaviors, trends and relationships
that allow the business to make data-driven decisions. It is also known
as Knowledge Discovery in Databases (KDD).
DWH strategies :
1)Top down
2)Bottom Up
1) Top down: an enterprise warehouse has to be created initially, and later
independent subject areas are derived from the enterprise warehouse.
4NF: Prerequisites: the table should be in BCNF and it should not have any
multi-valued dependencies.
DWH features/characteristics
Subject oriented:
Used to track or analyse the data for a particular area. In
business terms, it should be built based on the business's
functional requirements, especially in regard to a specific area
under discussion, e.g. how many transactions are happening
per day? (a specific kind of data), customers, sales.
Integrated
All the data from diverse sources must undergo the ETL
process, which involves cleaning junk data and removing redundancy.
Time-Variant:
Historical data is kept in a data warehouse. For example,
one can retrieve data from 3 months, 6 months, 12 months, or
even older data from a data warehouse. This contrasts with a
transaction system, where often only the most recent data is
kept. For example, a transaction system may hold the most
recent address of a customer, where a data warehouse can
hold all addresses associated with a customer.
Non-volatile:
Once data is in the data warehouse, it will not change, so
historical data in a data warehouse should never be altered.
Once we dump the data, the data is static; we won't
change/modify the data.
importance of ETL testing?
Ensure data is transformed efficiently and quickly from one
system to another.
Data quality issues during ETL processes, such as duplicate
data or data loss, can also be identified and prevented by ETL
testing.
Assures that the ETL process itself is running smoothly and is
not hampered.
Ensures that all data implemented is in line with client
requirements and provides accurate output.
Ensures that bulk data is moved to the new destination
completely and securely.
SCD Type
Type 0 Ignore any changes and audit the changes.
Type 1 Overwrite the changes
Type 2 History will be added as a new row.
Type 3 History will be added as a new column.
Type 4 A new dimension will be added
Type 6 Combination of Type 2 and Type 3
Once the test-cases are ready and approved, the next step is
to perform pre-execution check.
Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is separately tested.
Integration Testing
In integration testing, the various modules of the application
are brought together and then tested against the number of
inputs.
CREATE,ALTER,DROP,TRUNCATE,RENAME,COMMENT
DELETE,UPDATE,INSERT,MERGE,CALL,LOCK TABLE
GRANT,REVOKE
SELECT
COMMIT,ROLLBACK,SET TRANSACTION ,SAVEPOINT
A factless fact table is a fact table that does not have any
measures, i.e. any numeric fields that can be aggregated.
Table: collection of data in the form of rows and columns. A table refers to a
collection of data organised in the form of rows and columns.
Field refers to the number of columns in a table.
Example: Suppose after normalization we have two
tables: first, the Student table and second, the Branch table.
The Student table has the attributes Roll_no, Student_name,
Age, and Branch_id. The Branch table is related to
the Student table with Branch_id as the foreign key in
the Student table. If we want the names of students
along with the name of the branch, then we need
to perform a join operation. The problem here is that if
the table is large we need a lot of time to perform the
join operations. So, we can add the Branch_name data
from the Branch table to the Student table,
and this will help in reducing the time that would have
been used in the join operation and thus optimize the
database.
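A sketch of the denormalization step described above, assuming hypothetical Student and Branch tables:

-- add the redundant column to Student
alter table Student add Branch_name varchar(100);

-- copy the branch name over so future reads avoid the join
update s
set s.Branch_name = b.Branch_name
from Student s
join Branch b on b.Branch_id = s.Branch_id;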
Advantages of Denormalization
Query execution is fast since we have to join fewer
tables.
Analyze Business Requirements: To perform ETL
Testing effectively, it is crucial to understand and
capture the business requirements through the use of
data models, business flow diagrams, reports, etc.
Identifying and Validating Data Source: To proceed, it is
necessary to identify the source data and perform
preliminary checks such as schema checks, table
counts, and table validations. The purpose of this is to
make sure the ETL process matches the business model
specification.
Design Test Cases and Preparing Test Data: Step three
includes designing ETL mapping scenarios, developing
SQL scripts, and defining transformation rules. Lastly,
verifying the documents against business needs to
make sure they cater to those needs. As soon as all the
test cases have been checked and approved, the pre-
execution check is performed. All three steps of our
ETL processes - namely extracting, transforming, and
loading - are covered by test cases.
Test Execution with Bug Reporting and Closure: This
process continues until the exit criteria (business
requirements) have been met. In the previous step, if
any defects were found, they were sent to the
developer for fixing, after which retesting was
performed. Moreover, regression testing is performed
in order to prevent the introduction of new bugs
during the fix of an earlier bug.
Summary Report and Result Analysis: At this step, a
test report is prepared, which lists the test cases and
their status (passed or failed). As a result of this report,
stakeholders or decision-makers will be able to
properly maintain the delivery threshold by
understanding the bug and the result of the testing
process.
Test Closure: Once everything is completed, the
reports are closed.
3. Name some tools that are used in ETL.
Primary key is a candidate key
Surrogate key is a primary key
Delete (DML): we can either delete all the rows in one go or delete rows one by one.
Here we can use the "ROLLBACK" command to restore the tuples because DELETE
does not auto-commit.
Delete from table
Delete from table where condition
Drop (DDL): we can drop (delete) the whole structure in one go, i.e. it removes the
named elements of the schema. Here we can't restore the table by using the
"ROLLBACK" command because it auto-commits.
Drop table;
Truncate (DDL): it is used to delete all the rows of a table in one go. We can't delete a
single row, as the WHERE clause is not used here. By using this command the
existence of all the rows of the table is lost. Here we can't restore the tuples
of the table by using the "ROLLBACK" command.
TRUNCATE table;
Union: it can be used to combine the result sets of two different SELECT statements.
It removes duplicate rows between the various select statements.
The data types of the result set of each select statement should be the same.
Inner Join It can be used to retrieve only matched records between both tables
CHARINDEX(): this function searches for a substring in a string and returns the position,
e.g. select charindex('n','saranya') = 5.
PATINDEX(): returns the position of a pattern in a string; if the pattern is not found, this
function returns 0. In PATINDEX we can use wildcards, but in CHARINDEX we cannot.
select LTRIM(RTRIM('_name.','.'),'_')
give all the data in uppercase
UCASE()/upper SELECT UCASE(CustomerName)
Count(*), count(1), count(columnname): count(*) and count(1) are the same;
count(columnname) doesn't count the null values.
Display current date: select getdate();
ROW_NUMBER: returns an increasing unique number for each row, even if there are duplicates.
RANK: returns an increasing number for each row, but when there are duplicates the same
rank is applied to all the duplicate rows; the next row after the duplicates gets the rank it
would have been assigned had there been no duplicates. So the RANK function skips ranks
when there are duplicates.
DENSE_RANK: returns an increasing number for each row, but when there are duplicates the
same rank is applied to all the duplicate rows; DENSE_RANK does not skip ranks when there
are duplicates, so the ranks remain in sequence.
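A small sketch comparing the three, assuming a hypothetical emp table with a salary column:

select salary,
       row_number() over (order by salary desc) as row_num,    -- e.g. 1,2,3,4 even for ties
       rank()       over (order by salary desc) as rnk,        -- e.g. 1,1,3,4 (skips after ties)
       dense_rank() over (order by salary desc) as dense_rnk   -- e.g. 1,1,2,3 (no gaps)
from emp;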
WHERE & HAVING: the "Where" clause is used to filter the records from a table based on a
specified condition; the "Having" clause is used to filter the records from the groups based
on the specified condition.
GROUP BY & HAVING: the GROUP BY clause summarizes the rows into groups and the HAVING
clause applies one or more conditions to these groups.
SUBQUERY: a query statement placed inside another select statement, also called an inner
query. The subquery is executed first and passes its value to the outer query.
CORRELATED SUBQUERY: if the subquery depends on the outer query for its values, then it is
called a correlated subquery. A correlated subquery gets executed once for every row
that is selected by the outer query.
select name,
(select SUM(sales) from sales where productid=tblproducts.id) as total,*
from tblproducts
DATEPART:
select datepart(year,'2022-09-28') = 2022
select datepart(quarter,'2022-09-28') = 3
select datepart(month,'2022-09-28') = 9
select datepart(dayofyear,'2022-09-28') = 271
select datepart(weekday,getdate()) = 4 (depends on the current day)
select datepart(week,'2022-09-28') = 40
1,2,3,4 1,2,3 are duplicates
1,1,1,4
1,1,1,2
A View is technically a virtual logical copy of the table