Multi-Dimensional Information-Driven Many-Objective Software Remodularization Approach
RESEARCH ARTICLE
Abstract Most search-based software remodularization (SBSR) approaches designed to address the software remodularization problem (SRP) utilize only structural information-based coupling and cohesion quality criteria. In practice, however, designing the modules of a software system also requires other aspects of coupling and cohesion, such as lexical and changed-history information. Considering only limited aspects of software information in SBSR may therefore generate a sub-optimal modularization solution. Additionally, such a modularization can be good from the quality metrics perspective but may not be acceptable to developers. To produce a remodularization solution acceptable from both the quality metrics and the developers' perspectives, this paper exploits more dimensions of software information to define the quality criteria as modularization objectives. These objectives are then optimized simultaneously using a tailored many-objective artificial bee colony (MaABC) algorithm to produce a remodularization solution. To assess the effectiveness of the proposed approach, we applied it to five software projects. The obtained remodularization solutions are evaluated with software quality metrics and against the developers' view of remodularization. Results demonstrate that the proposed software remodularization approach is effective in generating good-quality modularization solutions.
Keywords software restructuring, remodularization, multi-objective optimization, software coupling and cohesion
Received August 16, 2021; accepted May 24, 2022
E-mail: [email protected]
1 Introduction
Frequent changes made in the source code during maintenance often degrade the design of the software system [1]. To improve the design of a degraded software system, remodularization of source code is generally carried out [2]. In the previous two decades, many remodularization approaches based on deterministic automated software clustering and search-based automated software clustering have been proposed (e.g., [2−7]). It has been found that for a large and complex software system, the search-based automated software clustering approaches are more effective for modularizing the source code than the deterministic automated software clustering approaches [1−7]. The concept of transforming the software remodularization problem (SRP) into a search-based single/multi/many-objective optimization problem makes it more promising and opens ample opportunity for the application of metaheuristic optimization algorithms [1]. Such a technique of addressing the SRPs is generally regarded as search-based software remodularization (SBSR). The SBSR performance depends on the suitability of the metaheuristic optimization algorithm and the effective design of the objective functions.
Most of the existing metaheuristic optimization algorithms designed to address the SRP have exploited the existing framework of the canonical metaheuristic optimization algorithms (e.g., [7−11]). Overall, many SBSR approaches have successfully exploited the framework of an existing canonical metaheuristic optimization algorithm and provided a customized version of the SBSR algorithm. Alongside the considerable progress in developing customized or tailored versions of the SBSR algorithm, tremendous growth in designing various objective functions reflecting the different aspects of software quality has also been observed [2−7]. The modularization quality (MQ) [2] is a widely used software quality metric for the objective/fitness function in different single-objective SBSR approaches [5,8]. Other multi-objective SBSR approaches [6−7] use a refined version of modularization quality (MQ) [6]. The definition of the MQ metric uses only structural information; therefore, it only represents the structural aspect of software quality. To improve the meaningfulness of the generated remodularization solutions, the approaches [5,9,12−13] use the lexical aspects of software quality along with the structural aspects. Recently, some studies [1,9,14] exploited the conceptual aspects of software quality along with the structural and lexical aspects.
2 Front. Comput. Sci., 2023, 17(3): 173209
Even though the existing SBSR approaches perform well in addressing particular aspects of software remodularization, many challenges remain unaddressed. For example, most software remodularization approaches either consider limited dimensions of artefacts or give equal importance to each dimension of information in defining the objective functions. However, it is commonly observed that the different dimensions of structural, lexical, and changed-history information contribute differently, rather than equally, to the software remodularization process.
Apart from defining the objective functions, some issues remain unaddressed in designing the search-based metaheuristic optimization algorithm. Most SBSR approaches use traditional multi-objective optimization algorithms to solve remodularization problems having a large number (more than three) of objective functions. Traditional multi-objective optimization algorithms work well on optimization problems with a small number (three or fewer) of objective functions but do not work well with a large number of objective functions.
To address the aforementioned issues concerning the definition of the objective functions and the design of the metaheuristic optimization algorithm for the SRP, we introduce an improved definition of the objective functions and a many-objective metaheuristic optimization algorithm. In particular, we incorporate the different dimensions of structural, lexical, and changed-history information into the existing definition of software package coupling and cohesion and introduce different categories of coupling and cohesion metrics. For each type (i.e., structural, lexical, and changed-history) of source code information used for defining the coupling and cohesion metrics, we also consider its different dimensions together with their relative importance. Apart from the objective functions defined in terms of structural, lexical, and changed-history information, we also consider other supportive objectives, i.e., the number of clusters and the difference between the minimum and the maximum number of modules in a cluster. The major contributions of this paper are summarized as follows:
● A variety of class coupling methods based on the different dimensions and combinations of structural, lexical, and changed-history information is introduced, and these class couplings are further used to define the different types of software coupling and cohesion for the remodularization of software systems.
● In the computation of class coupling, the different dimensions of structural, lexical, and changed-history information are given different weights according to their importance in the coupling.
● To optimize the different objectives and produce the software remodularization solutions, an existing many-objective artificial bee colony (MaABC) algorithm has been tailored by incorporating various strategies.
● The proposed approach's superiority is validated by applying it to different instances of many-objective SRPs. Specifically, five object-oriented software projects having varying characteristics are considered as the test problems.
The remaining part of the paper is organized as follows. Section 2 presents related works based on structural, lexical, and changed-history remodularization. Section 3 provides a detailed description of the proposed software remodularization methodology. Section 4 explains the configuration designed for the experimentation. Section 5 presents a discussion of the results. Section 6 covers the various types of threats that can affect the validity of the results and their mitigation strategies. Section 7 concludes the work and suggests future directions.
2 Related works
In the literature on search-based software engineering, a variety of SBSR approaches addressing the different aspects of software design improvement have been proposed (e.g., [1−11]). These SBSR approaches exploit the framework of the existing metaheuristic optimization algorithms and tailor them to suit the SRPs. Based on the number of objectives involved in the SRP, the SBSR approaches can be divided into single-objective, multi-objective, and many-objective SBSR approaches. Further, based on the type of information used in designing the objective functions, these SBSR approaches can be divided into 1) structural information-based SBSR approaches (S-SBSR approaches), 2) structural + lexical information-based SBSR approaches (SL-SBSR approaches), and 3) structural + lexical + changed-history information-based SBSR approaches (SLC-SBSR approaches).
2.1 S-SBSR approaches
The design of the objective functions in many SBSR approaches is based on the source code's structural information. Over the last two decades, many SBSR approaches based on structural information have been proposed to address the different aspects of SRPs. The availability of various tools for extracting the different dimensions of structural information from the source code has made the structural information-based SBSR approaches more popular in the research community. In structural information-based SBSR, the structural information of the source code implementation is used to define the different modularization criteria as objective functions. Different programming paradigms use different constructs to realize the different implementation requirements. Therefore, the structural information can vary according to the underlying programming language used for system implementation. For example, in Java-based object-oriented software, the classes can be considered as modules, and method calls and inheritance relationships can be used to compute the class coupling. On the other hand, in a procedural programming language such as C, the source files can be considered as modules and function calls can be used to compute the module coupling. Based on the number of quality criteria used as the objective function, the structural information-based SBSR approaches can be divided into 1) single-objective structural information-based SBSR,
Amarjeet PRAJAPATI et al. Multi-dimensional information-driven many-objective software remodularization approach 3
2) multi/many-objective structural information-based SBSR.
In the literature on SBSR, many approaches have used structural information to define the modularity criteria as objective/fitness functions for the formulation of the SRP as a single or multi-objective SRP. In the context of the single-objective SRP, the structural information is used to define a single modularization quality criterion as a single objective function. In the direction of single-objective structural information-based SBSR, Mancoridis et al. (1998) [2] presented the first approach, in which the structural source code information is used to define the single-objective function, which is then optimized using a genetic-based metaheuristic optimization algorithm. In particular, the approach uses the structural information to define the modularization quality (MQ) metric for assessing software modularity quality, and MQ is further used as the objective function in the genetic algorithm-based software remodularization framework. Moreover, this approach's input is in the form of a module dependency graph (MDG), and the MQ is defined in terms of the inter-connectivity and intra-connectivity of the graph partition. A similar definition of MQ has also been used in [8,15] to quantify the quality of software partitioning for software remodularization. Doval et al. (1999) [15] treated the software partitioning problem as an optimization problem and used the Genetic Algorithm (GA) to find a good modularization solution from the search space with MQ as the fitness. Mahdavi et al. (2003) [8] transformed the software system into weighted and unweighted MDGs and used multiple hill-climbing optimizers to find a good partitioning solution with MQ as the fitness function.
Abdeen et al. (2011) [7] also treated the SRP as a restructuring of the existing package structure of an object-oriented software system. Their approach optimizes software modularization by minimizing the direct package cyclic dependencies. For this, a fitness function based on package coupling and cohesion is used to guide the Simulated Annealing (SA) driven optimization process of the software remodularization. To compute the package coupling and package cyclic dependencies, different types of class relationships such as method calls and class inheritance have been used. Prajapati and Chhabra (2017) [9] use the different dimensions of structural information to quantify the modularization quality of object-oriented software systems. Further, they used the modularization quality as a fitness function to optimize the system's modular structure. To explore and exploit the search space to find a good modularization solution, their optimization approach uses a Harmony Search (HS) based metaheuristic optimization algorithm.
Apart from the single-objective structural information-based SBSR approaches, there has also been tremendous growth in the direction of multi-objective structural information-based SBSR approaches. In the multi-objective formulation of the SRP, the different types of structural information are used to define the different aspects of software quality metrics, and they are optimized simultaneously using a multi-objective metaheuristic optimization algorithm for the software remodularization. In the direction of multi-objective structural information-based SBSR, the approach reported by Praditwong et al. (2011) [6] is regarded as the base work. This work defines two sets of remodularization criteria as objective functions. It uses an evolutionary multi-objective metaheuristic optimizer to optimize each set of objectives simultaneously to improve the software modular structure.
To evaluate the effectiveness of the composite objectives [6] with the inclusion and exclusion of some supportive objectives in the multi-objective formulation of the software module clustering problem, Barros (2012) [16] presented an empirical analysis. The study concluded that excluding some objectives (i.e., supportive objectives) from the composite objective can also help in generating effective and efficient results. To improve the metaheuristic optimization algorithm's effectiveness and efficiency for the multi-objective software module clustering problem, Kumari and Srinivas (2016) [17] presented a hyper-heuristic based evolutionary algorithm. The approach generated a better module clustering solution than the existing GA-based approaches under the same set of composite objectives [6]. In the direction of improving the metaheuristic optimization algorithm for the multi-objective SRP, Prajapati and Geem (2020) [18] proposed a Harmony Search (HS) based software remodularization approach for the software architecture reconstruction activity.
2.2 SL-SBSR approaches
The different dimensions of the structural relationships existing among the software entities are the most widely used source code information in software remodularization. Many software remodularization approaches have used structural relationships in defining the software modularity criteria as fitness/objective functions evaluating the modularization solution. However, other source code information, such as the linguistic information embedded in comments and identifiers, has not gained much attention in the search-based software engineering community. Some researchers have demonstrated the usefulness of the lexical information of the source code in restructuring software systems for different purposes.
In some works, the researchers have combined the lexical information with the structural information to guide the remodularization process. In other works, researchers have used the lexical information separately. In the combined structural and lexical information-based SBSR approaches, single or multiple dimensions of structural and lexical information are used together to define the objective functions. Further, they are optimized using single or multi-objective metaheuristic optimization algorithms. The structural and lexical information-based SBSR approaches have been found more effective in generating remodularization solutions as intended by the software developers than the SBSR approaches based on structural information alone.
To evaluate the effectiveness of combined structural and lexical similarity in software remodularization, Anquetil and Lethbridge (1999) [5] use various features corresponding to the different types of structural and lexical information. Specifically, they used the formal and non-formal features of the source code to quantify the coupling of entities and used the Bunch framework to restructure the software systems. To
generate a meaningful decomposition of the software packages, the approach presented in [13] uses the semantic and structural relationships existing between the packages' classes. They combined a structural coupling measure, i.e., information-flow-based coupling (ICP), and a semantic coupling measure, i.e., the conceptual coupling between classes (CCBC), with their relative weights into a single coupling measure. Even though the approach does not use search-based software engineering concepts, the coupling measures can easily be used as fitness functions.
To improve the meaningfulness of the remodularization solutions produced by multi-objective software remodularization, Prajapati and Chhabra (2017) [9] suggested the use of different aspects of structural and lexical class dependency information to compute the package coupling and cohesion for object-oriented software systems. They used the NSGA-II based multi-objective evolutionary algorithm as the optimization technique for software remodularization. For cases where no tool is available to extract the structural information, Kargar et al. (2017) [19] suggested the use of the lexical information embedded in the source code of the software system. The authors proposed the semantic dependency graph (SDG), based on the lexical information, as an alternative to the call dependency graph (CDG). To partition the SDG and generate a clustering solution, they used the hill-climbing algorithm as a search optimizer. Their results demonstrate that the SDG can be a good alternative if the software systems are developed in different programming languages and the extraction of the CDG is not feasible.
2.3 SLC-SBSR approaches
It is commonly observed that software developers generally modularize the different software elements based on various types of information, not only a single aspect of design information. Keeping these facts in mind, various software remodularization approaches have been proposed that utilize different types of artefacts. Even though the structural and lexical-based software remodularization approaches perform well in producing good-quality software remodularization solutions, incorporating other artefacts such as changed-history can make the modularization solution more effective.
In the literature on SBSR, only a few papers have considered the combined structural, lexical, and changed-history information in the formulation of the SBSR problem. Mkaouer et al. (2015) [1] presented the first study in which the structural, lexical, and changed-history information is exploited to define the different types of modularity metrics. They used the concept of many-objective optimization to optimize all conflicting modularization quality criteria. Later, other researchers also made contributions in this direction.
Rathee and Chhabra (2018) [20] utilized the structural, lexical, and changed-history information with their relative importance in defining the modularization criteria. Additionally, they explored the multiple dimensions of structural and lexical information that can improve the effectiveness of the modularization criteria. Recently, Prajapati et al. (2020) [14] also exploited the structural, lexical, and changed-history information to compute the class coupling strength for the computation of object-oriented software package coupling and cohesion.
3 Proposed approach
In the proposed SBSR, the major contributions are divided into two parts. The first part focuses on designing the remodularization objective model, and the second part concentrates on the development of the customized many-objective optimization algorithm.
3.1 SRP formulation
Software remodularization is the problem of optimizing the distribution of software entities into existing modules to improve the quality of software systems. The definition of software entities and modules can vary according to the programming paradigms used for the implementation and the abstractions. In this work, we mainly focus on object-oriented software systems, where classes are taken as the software entities and packages as the modules. A good distribution of source code classes into packages has to satisfy many (often conflicting) quality criteria. The number and definition of the quality criteria used in the remodularization process vary according to the purpose and quality requirements. The quality requirements for large and complex systems are becoming very numerous. Moreover, well-modularized software systems have to satisfy various dimensions of software quality. To make the remodularization solution more useful, we consider the following software quality criteria for the remodularization, based on structural, lexical, and changed-history information.
Structural software package coupling and cohesion The structural package coupling and structural software package cohesion of an object-oriented software project measure the degree of inter-relatedness of packages and the degree of intra-relatedness of packages corresponding to the structural relationships, respectively. To compute the structural package coupling and cohesion, using the structural coupling of the classes is a common and effective approach. To capture the structural coupling between the classes, many coupling metrics have been proposed in the object-oriented software engineering literature. In this direction, the study conducted by Li and Henry [21] formulated various types of structural information-based class coupling metrics (e.g., data abstraction coupling (DAC) and message passing coupling (MPC)) to compute the class coupling.
The response for a class (RFC) and coupling between object classes (CBO) metrics introduced by Chidamber and Kemerer [22] are the other two most widely used class coupling metrics for computing the structural class coupling. Martin [23] also introduced the Afferent Coupling (Ca) and Efferent Coupling (Ce) metrics based on the structural class dependency information to compute the class coupling. Briand et al. [24] designed several structural information-based coupling metrics to compute the coupling between the classes. Further, Briand et al. [25] built a unified framework based on the coupling metrics developed in the literature [26−27] to compute the class coupling of object-oriented software systems. To exploit
the importance of polymorphism in coupling computation, Lee et al. [28] introduced an information flow-based coupling (ICP) metric.
The aforementioned coupling metrics can be useful to quantify and evaluate the quality of object-oriented software systems, but they cannot be effective if used as a fitness/objective function to guide the remodularization process in search-based software modularization. These software coupling metrics use only limited types of structural class information to compute the coupling. However, the remodularization process requires class coupling metrics as fitness/objective functions that cover various dimensions of class coupling information, because software developers generally use various dimensions of the class coupling information, according to their relative importance, when grouping software modules into software components. Therefore, in this study, we also exploit various structural information dimensions to compute the class coupling used in the software package coupling and cohesion computation. The definitions of the different types of structural class dependencies are derived from the studies [9,29]. These are the 1) extends relationship (EX), 2) implements relationship (IM), 3) is-of-type relationship (IT), 4) reference (RE), 5) method calls relationship (CA), 6) has parameter relationship (HP), 7) returns relationship (RT), and 8) throws relationship (TH).
The structural software package coupling and cohesion are the two most important remodularization quality criteria and are widely used in SBSR as objective functions. The definition of these two quality criteria can vary according to the considered structural information. The package coupling and cohesion are defined as follows.
$$SP_{cou}(M) = \sum_{i=1}^{n} \sum_{j \notin T_i} \varnothing(c_i, c_j), \tag{1}$$
$$SP_{coh}(M) = \sum_{i=1}^{n} \sum_{j \in T_i} \varnothing(c_i, c_j), \tag{2}$$
where M is the particular modularization containing n classes, T_i is the set of classes that are in the same package as class c_i, and ∅(c_i, c_j) computes the connection strength between classes c_i and c_j. The value of ∅(c_i, c_j) is computed as follows.
$$\varnothing(c_i, c_j) = \sum_{r \in R} w_r \times n_r(c_i, c_j). \tag{3}$$
R is the set of different types of structural relationships, i.e., R = {EX, IM, IT, RE, CA, HP, RT, TH}, w_r is the relative weight of a particular relationship r ∈ R, and n_r is the frequency of the r-type relationship.
Computing the relative weights of the relationships is an essential and challenging task because it has a significant impact on the class coupling value and on the quantification of remodularization quality. In this study, we use the relative weight computation approach presented in [9].
Lexical software package coupling and cohesion Once the underlying domain vocabulary at package level is identified and a vector representation is used for efficiently processing the recognized domain vocabulary, the next step is to compute concept relatedness using the vector representation of the identified domain vocabulary. This paper determines relatedness using a freely available software package called WordNet. WordNet is a widely used resource for determining the semantic similarity between two words. It is an extensive lexical database of English (containing semantic/conceptual relations among different words) that organizes verbs, adverbs, nouns, and adjectives into hierarchical structures of synsets based on underlying semantic relations such as synonymy, antonymy, and hyponymy. The degree of semantic/lexical relatedness between two words/concepts, say c1 and c2, is measured using the path length between the concepts (synsets) represented by c1 and c2, respectively. The path length-based lexical relatedness in this paper is measured using the metric proposed by Wu et al. [30]. The following expression gives the mathematical formulation for measuring this conceptual/lexical similarity between two concepts c1 and c2:
$$Sim(c_1, c_2) = \frac{2 \times dep(c)}{len(c_1, c) + len(c_2, c) + 2 \times dep(c)}, \tag{4}$$
Here, c represents the lowest common subsumer (LCS) of the two considered concepts c1 and c2, and dep(c) is the depth of c in the hierarchy. The LCS is commonly defined as the deepest common ancestor/parent of both c1 and c2. Moreover, len(c1, c) and len(c2, c) represent the total number of edges on the path between the two given nodes.
The source code of a software system is a rich source for acquiring domain-specific vocabulary [20]. We consider that this domain-specific vocabulary can be built by tokenizing six main parts of the underlying source code of a class. These parts are the comment sections introduced by developers to elaborate a particular section, the identifiers used for naming classes, the member variable declaration section, the signatures used to describe different methods, the identifiers used as parameter names, and the body sections of methods. Further, we believe that the semantic information represented by these six parts is of different relevance. Therefore, tokens extracted from each part are treated separately and a unique weight is assigned to each part when combining the overall lexical strength. The weights assigned to the different parts are estimated by formulating a probabilistic lexical model and utilizing the Expectation-Maximization (EM) algorithm as used in [31].
Having built the domain vocabulary model (by extracting tokens from six different parts and assigning unique weights to them) and determined the lexical similarity between any two concepts using Eq. (4), the next step is to measure the overall values of the two considered quality parameters, namely cohesion and coupling, at the package level, say P. The mathematical formulations used for measuring these parameters are given in Eqs. (5) to (7).
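As an illustration of the structural measures in Eqs. (1) to (3), the following is a minimal Python sketch rather than the implementation used in this work. It assumes that the modularization M is given as a class-to-package mapping, that the relationship frequencies n_r(c_i, c_j) are available as a dictionary, and that T_i excludes c_i itself. The weight values, the class and package names, and the helper names `connection_strength` and `package_coupling_cohesion` are all hypothetical; in particular, the weights are not those derived in [9].

```python
# Hypothetical relative weights w_r for the eight relationship types
# (illustrative values only, not those computed in [9]).
WEIGHTS = {"EX": 4.0, "IM": 3.0, "IT": 2.0, "RE": 1.0,
           "CA": 2.0, "HP": 1.0, "RT": 1.0, "TH": 1.0}


def connection_strength(counts, weights=WEIGHTS):
    """Eq. (3): weighted sum of relationship frequencies for one class pair."""
    return sum(weights[r] * n for r, n in counts.items())


def package_coupling_cohesion(package_of, relations, weights=WEIGHTS):
    """Eqs. (1)-(2): accumulate strengths over ordered class pairs (c_i, c_j).

    package_of: dict mapping class name -> package name (the modularization M)
    relations:  dict mapping (c_i, c_j) -> {relationship type: frequency}
    Pairs missing from `relations` contribute zero to both sums.
    """
    coupling = 0.0
    cohesion = 0.0
    for (ci, cj), counts in relations.items():
        if ci == cj:
            continue  # assumption: a class is not paired with itself
        strength = connection_strength(counts, weights)
        if package_of[ci] == package_of[cj]:
            cohesion += strength  # c_j is in T_i (same package)
        else:
            coupling += strength  # c_j is not in T_i (different package)
    return coupling, cohesion


# Toy modularization: classes A and B share package p1, C sits in p2.
package_of = {"A": "p1", "B": "p1", "C": "p2"}
relations = {
    ("A", "B"): {"CA": 3, "IT": 1},
    ("B", "C"): {"EX": 1},
    ("C", "A"): {"RE": 2, "TH": 1},
}
coupling, cohesion = package_coupling_cohesion(package_of, relations)
```

In a search-based setting, a candidate solution would simply be a different `package_of` mapping, with coupling to be minimized and cohesion to be maximized.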
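The path-length-based similarity of Eq. (4) can be sketched in Python as follows. This is a hedged illustration on a small hand-built is-a hierarchy that stands in for WordNet; it does not call any WordNet API. The concept names, the `PARENT` table, and the helper names are hypothetical, and depth is assumed to be counted in nodes so that the root has depth 1.

```python
# Toy is-a hierarchy (child -> parent); a hypothetical stand-in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(concept))


def wu_palmer(c1, c2):
    """Eq. (4): 2*dep(c) / (len(c1, c) + len(c2, c) + 2*dep(c)), c = LCS."""
    path1, path2 = ancestors(c1), ancestors(c2)
    # The first node on c1's upward path that also lies on c2's path
    # is the deepest shared ancestor, i.e., the lowest common subsumer.
    lcs = next(c for c in path1 if c in path2)
    len1, len2 = path1.index(lcs), path2.index(lcs)  # edges up to the LCS
    return 2 * depth(lcs) / (len1 + len2 + 2 * depth(lcs))


same_branch = wu_palmer("dog", "cat")    # LCS is "animal"
cross_branch = wu_palmer("dog", "car")   # LCS is the root "entity"
```

Concepts that share a deep ancestor score closer to 1, while concepts related only through the root score lower, which is the behaviour the lexical coupling and cohesion measures rely on.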
\[
\mathrm{LexStrength}_k(N, T) = \frac{\sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{T} \mathrm{Sem}(c_i, c_j)}{N \ast (N - 1)}, \tag{5}
\]

\[
\mathrm{SemCoh}(P) = \mathrm{Average}\left( \sum_{k=1}^{6} w_k \ast \mathrm{LexStrength}_k(N, N) \right), \tag{6}
\]

\[
\mathrm{SemCup}(P) = \mathrm{Average}\left( \sum_{k=1}^{6} w_k \ast \mathrm{LexStrength}_k(N, T) \right). \tag{7}
\]

Here, N is the total number of lexemes present in the DV of package P extracted from the kth part, and T is the total number of lexemes present in the DV of the rest of the subsystem (the obtained modular structure excluding the package P being considered) extracted from the same kth part.
Changed-history software package coupling and cohesion During the life cycle of a software system, changes are made to its components. Developers make these changes and record them as change commits in the version history or repository [3,4]. A frequent co-change pattern among components such as classes can therefore be considered change coupling among them. In the literature, most coupling and cohesion measurements are computed statically by analysing the source code [12,22,32]. The majority of source code metrics include LOC, coupling, cohesion, function points, and inheritance [12,32−34]. Static analysis of the source code can give a stagnant picture because it does not entirely reflect software evolution. In the present development scenario, software evolution is recorded by version or configuration management systems in software repositories such as GitHub and SourceForge [3,4]. Such repositories contain valuable information about the evolution of software applications in terms of their change history. The change history lists the changes made in the past and collects the software changelogs.
Apart from the structural and semantic analysis, change commits should be analyzed to compute dependency metrics that can contribute to the remodularization, restructuring, and clustering of legacy or existing software system modules. Various studies have investigated evolutionary data for computing cohesion and coupling [32−34], co-change patterns [35], and change impact analysis. Other studies [3,4,36−38] exploited the change history and computed different metrics to further restructure software systems by applying various classification and clustering techniques.
Additionally, Amarjeet et al. [14] used structural and change-coupling information to remodularize the software system. Traditional dependency measures are widely used to measure the quality of a software system, but in the present development scenario it is also necessary to analyze change repositories to predict the various types of dependencies and to compute coupling and cohesion alongside structural and semantic measures. Through this, a developer can view the historical co-change pattern of software entities. A review of the literature shows that none of the studies has fully utilized structural, semantic, and change-history information for dependency measurement as input to search-based techniques. With all three, the maintainer can implement future changes and restructure the system effectively.
Considering the importance of the past co-change behaviour of software entities, change-coupling metrics have been computed based on the software change history. We explored the classes that were changed together in the past and may therefore have some proximity. If several change commits indicate a co-change pattern among a few classes, then this pattern should be considered as possible change coupling among these classes. Here, we measure the change-coupling pattern at the attribute, method, and class levels. For this purpose, three metrics are computed: attribute-level change coupling, method-level change coupling, and class-level change coupling. These coupling measures are then used to calculate the overall coupling and cohesion among the packages as follows:

\[
\mathrm{ChC}(c_i) = \alpha_1 \ast \mathrm{AChC}(c_i) + \alpha_2 \ast \mathrm{MChC}(c_i) + \alpha_3 \ast \mathrm{CChC}(c_i), \tag{8}
\]

\[
\mathrm{ChCoH}(c_i) = \beta_1 \ast \mathrm{AChC}(c_i) + \beta_2 \ast \mathrm{MChC}(c_i) + \beta_3 \ast \mathrm{CChC}(c_i), \tag{9}
\]

where ChC(c_i) is the change coupling of class c_i, AChC(c_i) is the attribute-level change coupling of c_i, MChC(c_i) is the method-level change coupling of c_i, CChC(c_i) is the class-level change coupling of c_i, and α1, α2, and α3 are the coupling weights assigned to each type of coupling. Similarly, β1, β2, and β3 are the corresponding cohesion weights. Change coupling is illustrated in Fig. 1.
Two classes c_i and c_j may have attribute-level change coupling if some attributes of c_i and c_j are frequently changed together. This information needs to be extracted from the changelogs or commits recorded in the version or change history of the software system. The definition of AChC(c_i) is given below.
\[
\mathrm{AChC}(c_i) = \sum_{j=1,\, j \neq i}^{n} \mathrm{ACh}_{\mathrm{prox}}(c_i, c_j), \tag{10}
\]

where AChC(c_i) is the attribute-level change coupling of class c_i, ACh_prox(c_i, c_j) is the attribute-level change coupling (proximity) between the classes c_i and c_j, and n is the total number of classes. The attribute-level change coupling (proximity) ACh_prox(c_i, c_j) between the classes c_i and c_j is computed as follows:

\[
\mathrm{ACh}_{\mathrm{prox}}(c_i, c_j) = \sum_{k=1}^{NAc_i} \sum_{l=1}^{NAc_j} \frac{|c_i a_k \cap c_j a_l|}{|c_i a_k \cup c_j a_l|}, \tag{11}
\]

where NAc_i and NAc_j are the numbers of attributes of the classes c_i and c_j, respectively. c_i a_k is the list of attributes of class c_i (k = 1 to NAc_i), and c_j a_l is the list of attributes of class c_j (l = 1 to NAc_j). |c_i a_k ∩ c_j a_l| gives the number of change commits in which the attributes of classes c_i and c_j are changed together. |c_i a_k ∪ c_j a_l| gives the number of change commits in which the attributes of classes c_i or c_j or both are changed.
Two classes c_i and c_j may have method-level change coupling if a few methods of c_i and c_j were frequently changed together in the past. This information needs to be extracted from the changelogs or commits recorded in the version or change history of the software system. The definition of MChC(c_i) is given below.

\[
\mathrm{MChC}(c_i) = \sum_{j=1,\, j \neq i}^{n} \mathrm{MCh}_{\mathrm{prox}}(c_i, c_j). \tag{12}
\]

Here, MChC(c_i) is the method-level change coupling of class c_i, MCh_prox(c_i, c_j) is the method-level change coupling (proximity) between the classes c_i and c_j, and n is the total number of classes. The method-level change coupling (proximity) MCh_prox(c_i, c_j) between the classes c_i and c_j is computed as follows:

\[
\mathrm{MCh}_{\mathrm{prox}}(c_i, c_j) = \sum_{k=1}^{NMc_i} \sum_{l=1}^{NMc_j} \frac{|c_i m_k \cap c_j m_l|}{|c_i m_k \cup c_j m_l|}, \tag{13}
\]

where NMc_i and NMc_j are the numbers of methods of the classes c_i and c_j, respectively. c_i m_k is the list of methods of class c_i (k = 1 to NMc_i), and c_j m_l is the list of methods of class c_j (l = 1 to NMc_j). |c_i m_k ∩ c_j m_l| gives the number of change commits in which the methods of classes c_i and c_j are changed together. |c_i m_k ∪ c_j m_l| gives the number of change commits in which the methods of classes c_i or c_j or both are changed.
Two classes c_i and c_j may have class-level change coupling a) if class c_i inherits from another class c_j, or b) if class c_i realizes the interface of class c_j. Such classes may have frequently changed together in the past, and this type of coupling information can be extracted from the changelogs or commits recorded in the version or change history of the software system. The definition of CChC(c_i) is given below.

\[
\mathrm{CChC}(c_i) = \sum_{j=1,\, j \neq i}^{n} \mathrm{CCh}_{\mathrm{prox}}(c_i, c_j), \tag{14}
\]

where CChC(c_i) is the class-level change coupling of class c_i, CCh_prox(c_i, c_j) is the class-level change coupling (proximity) between the classes c_i and c_j, and n is the total number of classes. The class-level change coupling (proximity) CCh_prox(c_i, c_j) between the classes c_i and c_j is computed as follows:

\[
\mathrm{CCh}_{\mathrm{prox}}(c_i, c_j) = \sum_{j=1,\, j \neq i}^{n} \frac{|c_i \cap c_j|}{|c_i \cup c_j|}. \tag{15}
\]

Here, |c_i ∩ c_j| gives the number of change commits in which the classes c_i and c_j are changed together, and |c_i ∪ c_j| gives the number of change commits in which the classes c_i or c_j or both are changed.
The overall process of computing the different levels of change coupling involves the following three steps: Step-1: Mining the software repository to extract all the changelogs/commits of the subject software application. Step-2: Filtering and pre-processing the available change commits. Step-3: Exploring the relevant change commits to extract the co-change pattern and compute the AChC(c_i), MChC(c_i), CChC(c_i), and ChC(c_i) change-coupling metrics as described above. The process is illustrated in Fig. 2.
Software remodularization as many-objective optimization problem In a many-objective optimization model of any problem, a large number (i.e., more than three) of objective functions (often conflicting), along with some equality and inequality constraints, have to be optimized to generate the solutions. The many-objective optimization definition of the software remodularization problem can be given as follows:

\[
\begin{aligned}
\min\ & F(d) = [f_1(d), f_2(d), \ldots, f_M(d)]^{T}, \quad M > 3, \\
& d_i^{\mathrm{Lower}} \le d_i \le d_i^{\mathrm{Upper}}, \quad i = 1, \ldots, n.
\end{aligned} \tag{16}
\]

In the context of the software remodularization problem, f_1(d), f_2(d), ..., f_M(d) are the quality criteria defined in terms of the decision variable d. d_i^Lower and d_i^Upper are the lower and upper bounds of the ith decision variable, and n is the number of decision variables. In software remodularization, we have defined different sets of objective functions based on the
structural, lexical, and changed-history dependency information. The definition of each group of many-objective SRP (i.e., variants of the many-objective optimization problem) is given as follows:
● MaSRV-1: In this variant, the objective functions are: changed-history coupling (to maximize), changed-history cohesion (to maximize), changed-history MQ (to maximize), number of clusters (to maximize), and difference between minimum and maximum number of modules in a cluster (to minimize).
● MaSRV-2: In this variant, the objective functions are: lexical coupling (to maximize), lexical cohesion (to maximize), lexical MQ (to maximize), number of clusters (to maximize), and difference between minimum and maximum number of modules in a cluster (to minimize).
● MaSRV-3: In this variant, the objective functions are: structural coupling (to maximize), structural cohesion (to maximize), structural MQ (to maximize), number of clusters (to maximize), and difference between minimum and maximum number of modules in a cluster (to minimize).
● MaSRV-4: In this variant, the objective functions are: changed-history + structural coupling (to maximize), changed-history + structural cohesion (to maximize), changed-history + structural MQ (to maximize), number of clusters (to maximize), and difference between minimum and maximum number of modules in a cluster (to minimize).
● MaSRV-5: In this variant, the objective functions are: lexical + structural coupling (to maximize), lexical + structural cohesion (to maximize), lexical + structural MQ (to maximize), number of clusters (to maximize), and difference between minimum and maximum number of modules in a cluster (to minimize).
● MaSRV-6: In this variant, the objective functions are: changed-history + lexical + structural coupling (to maximize), changed-history + lexical + structural cohesion (to maximize), changed-history + lexical + structural MQ (to maximize), number of clusters (to maximize), and difference between minimum and maximum number of modules in a cluster (to minimize).
The different many-objective software remodularization formulations (i.e., MaSRV-1, MaSRV-2, MaSRV-3, MaSRV-4, MaSRV-5, and MaSRV-6) represent various aspects of SRPs. Therefore, optimizing each aspect of the SRP can produce different remodularization solutions.
Software remodularization solution encoding In the reformulation of any problem as a search-based optimization problem, the design of the objective functions and the encoding of the candidate solution play an important role. Generating all possible solutions for an optimization problem is a challenging task, and it becomes more difficult for a real-world optimization problem. In an optimization problem, the candidate solution is commonly defined in terms of the set of decision variables. A particular instance of the decision variables represents a specific candidate solution. To generate all possible candidate solutions for a specific optimization problem, an effective encoding mechanism for representing candidate solutions is required. For software remodularization optimization problems, the integer vector-based encoding mechanism (Bavota et al. 2010; Praditwong et al. 2011; Prajapati and Chhabra 2017) is a widely accepted representation technique. Therefore, we also use this encoding mechanism in this work.
3.2 MaABC
A variety of metaheuristic optimization approaches have been proposed to generate a good representative sample of the approximation of Pareto optimal solutions for many-objective optimization problems. These approaches are broadly categorized as 1) Dimensionality reduction approach [39], 2) Relaxed dominance approach [40], 3) Diversity injection-based approach [41], 4) Aggregation-based approach [42], 5) Indicator-based approach [43], 6) Reference set-based approach [44], 7) Preference-based approach [45,46], and 8) Decomposition-based approach [47].
Despite substantial development in the design of various categories of many-objective algorithms, their application to real-world optimization problems has gained little attention. The divergent characteristics of real-world optimization problems make the application of many-objective algorithms difficult and challenging. In the past few years, some researchers have tried to tailor metaheuristic algorithms to different real-world problems such as the scheduling of various industrial tasks [48], automotive calibration optimization [49], and the optimization of hybrid car controllers [50]. The flexibility in transforming various software engineering tasks into search-based optimization problems creates a large opportunity for metaheuristic optimization algorithms. In the past two decades, under the umbrella of search-based software engineering (SBSE) [51], a large number of metaheuristic optimization algorithms have been developed to address various software engineering problems.
Most software engineering problems have been addressed using conventional multi-objective metaheuristic optimization algorithms. However, the increase in the number of objective functions in many software engineering problems demands more advanced multi-objective metaheuristic optimization algorithms, i.e., many-objective metaheuristic optimization algorithms. To address many-objective software remodularization, the studies [10,11] proposed a many-objective optimization algorithm by incorporating various strategies into the artificial bee colony (ABC) algorithm [52]. Even though these approaches perform effectively, there is still room for improvement in generating more effective remodularization solutions. In this work, we exploit the basic framework of MaABC [10,11] to optimize the software quality. The basic framework remains similar to the original MaABC algorithm, with some minor changes in the selection techniques.
The MaABC framework consists of four major components, and each component is responsible for performing
specialized activities. The components are: 1) initialization phase, 2) send employed bees, 3) send onlooker bees, and 4) send scout bees. The flowchart of the MaABC is presented in Fig. 3. The detailed operations involved in each of these phases are given in the following paragraphs.
Initialization phase The initialization phase of the MaABC performs several activities that set up a strong basis for the smooth working of the next phases. The first activity of this phase is the initialization of candidate solutions for the first generation's population. The algorithm requires a set of initial candidate solutions, i.e., a population, with which the optimization process can proceed. We therefore generate a set of candidate solutions, equal in number to the population size, from the search space of the remodularization problem. To generate the initial candidate solutions, we use the random initialization approach, where the value of each decision variable of each candidate solution is selected randomly. Apart from the population initialization, the various parameters of the algorithm also need to be initialized. Therefore, in this phase, appropriate values are assigned to each of the parameters so that the optimization process can proceed in the intended optimal direction.
To demonstrate the population initialization, let us consider that the set of all possible remodularization solutions in the search space has t members, i.e., s = {s_1, s_2, ..., s_t}. A particular software remodularization solution s_i can be represented as a set of n decision variables, i.e., s_i = {s_i^1, s_i^2, ..., s_i^n}. The range of a particular decision variable is s_i^j ∈ [LB^j, UB^j], where UB^j and LB^j are the upper and lower bounds of the jth decision variable of candidate solution s_i. For example, the remodularization solution demonstrated in Fig. 1 can be represented as s = {1, 2, 2, 1, 3, 1, 3, 3}, where the range of each decision variable is from 1 to 8. In this example, the classes {s1, s4, s6} belong to package {m1}, the classes {s2, s3} belong to package {m2}, and the classes {s5, s7, s8} belong to package {m3}. In the context of the ABC algorithm, a candidate solution is viewed as a food source, and the candidate solution's fitness is considered the quality of the food source. After initialization of the population, the objective functions and the fitness function of each candidate solution are computed. Based on the non-domination relations, the external archives CA and DA are updated according to their update rules. The details of the update rules for the CA and DA archives are given in the following paragraphs. The trial value TR associated with each candidate solution of the population is set to zero (i.e., TR_1 = TR_2 = ··· = TR_N = 0).
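The integer-vector encoding in the example above can be illustrated with a short Java sketch. It is only a minimal illustration under stated assumptions, not the authors' implementation: the class name `ModularizationEncoding` and its methods are hypothetical, position j of the vector is assumed to hold the package label of class j, and the two helper methods mirror the "number of clusters" and "difference between minimum and maximum number of modules in a cluster" objectives in their simplest reading.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical decoder for the integer-vector encoding (illustration only). */
final class ModularizationEncoding {

    /** Groups 1-based class indices by the package label stored at each position. */
    static Map<Integer, List<Integer>> decode(int[] solution) {
        Map<Integer, List<Integer>> packages = new TreeMap<>();
        for (int classIdx = 0; classIdx < solution.length; classIdx++) {
            packages.computeIfAbsent(solution[classIdx], k -> new ArrayList<>())
                    .add(classIdx + 1);
        }
        return packages;
    }

    /** Number of non-empty packages in the decoded solution. */
    static int clusterCount(int[] solution) {
        return decode(solution).size();
    }

    /** Difference between the largest and smallest package sizes (0 if empty). */
    static int sizeSpread(int[] solution) {
        if (solution.length == 0) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (List<Integer> members : decode(solution).values()) {
            min = Math.min(min, members.size());
            max = Math.max(max, members.size());
        }
        return max - min;
    }

    public static void main(String[] args) {
        int[] s = {1, 2, 2, 1, 3, 1, 3, 3};
        System.out.println(decode(s));
        System.out.println(clusterCount(s) + " clusters, size spread " + sizeSpread(s));
    }
}
```

For s = {1, 2, 2, 1, 3, 1, 3, 3}, the decoder places classes 1, 4, and 6 in package 1, classes 2 and 3 in package 2, and classes 5, 7, and 8 in package 3, which gives three clusters and a size spread of 1.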
Send employed bees After the initialization of the food sources/candidate solutions in the population, the employed bees staying in the hive fly towards the food sources. An individual employed bee is assigned to each food source. Each employed bee starts searching for new food sources based on the information about the food sources held in its memory, in the hope of finding a better food source. In this study, to create the new solutions for the employed bees, we use the information of the candidate solutions of the current population and the candidate solutions of the CA and DA archives. The candidate solutions of CA and DA are considered because they help guide the employed bees towards more promising candidate solutions. To balance convergence and diversity in the resultant population, we fixed the consideration probabilities of the current population and the CA+DA candidate solutions at 0.4 and 0.6, respectively. The objective functions and fitness of the newly generated candidate solution are then computed. If the fitness of the newly generated candidate solution is better than that of the current solution, the current solution of the population is replaced with the newly generated candidate solution; otherwise, the newly generated candidate solution is discarded and the current solution is preserved for the next generation. The trial value of each unimproved candidate solution of the population is incremented by 1. The detailed description of the procedure is given in Algorithm 1.
food source having the better food source quality will attract a larger number of onlooker bees, and a food source having poor quality may have a smaller number of onlooker bees. The movement probability of an onlooker bee towards a particular food source is computed in terms of the candidate solutions' fitness values. The searching nature of the onlooker bees around the food source is the same as the employed bees' searching behaviour. The only difference between the employed bees and the onlooker bees is the number of bees deployed on a particular food source. The pseudocode of the onlooker bees is given in Algorithm 2.
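The bookkeeping rules described for the employed and onlooker bees can be sketched in Java as follows. This is a hedged illustration rather than the authors' Algorithm 1 or Algorithm 2: the class `BeePhaseSketch` and all method names are hypothetical, fitness is assumed to be a single non-negative number where larger is better, the trial counter is assumed to reset on improvement as in the standard ABC algorithm, and the onlooker movement probability is assumed to be the usual fitness-proportional form, which the text does not spell out.

```java
import java.util.Random;

/** Hypothetical helper illustrating ABC bookkeeping rules (not the authors' code). */
final class BeePhaseSketch {

    /**
     * Greedy replacement: accept the candidate if it is fitter; otherwise keep the
     * current food source and increment its trial counter. Returns true on replacement.
     */
    static boolean greedyReplace(double[] fitness, int[] trials, int i, double candidateFitness) {
        if (candidateFitness > fitness[i]) {
            fitness[i] = candidateFitness;
            trials[i] = 0; // assumption: reset on improvement, as in standard ABC
            return true;
        }
        trials[i]++;
        return false;
    }

    /** Fitness-proportional selection probabilities (assumes non-negative fitness). */
    static double[] selectionProbabilities(double[] fitness) {
        double total = 0.0;
        for (double f : fitness) {
            total += f;
        }
        double[] p = new double[fitness.length];
        for (int i = 0; i < fitness.length; i++) {
            // Fall back to a uniform distribution when all fitness values are zero.
            p[i] = total > 0 ? fitness[i] / total : 1.0 / fitness.length;
        }
        return p;
    }

    /** Roulette-wheel choice of a food source index for one onlooker bee. */
    static int pickFoodSource(double[] probabilities, Random rng) {
        double r = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probabilities.length; i++) {
            cumulative += probabilities[i];
            if (r < cumulative) {
                return i;
            }
        }
        return probabilities.length - 1; // guard against floating-point rounding
    }

    /** True: guide from the current population (probability 0.4); false: from CA+DA. */
    static boolean guideFromPopulation(Random rng) {
        return rng.nextDouble() < 0.4;
    }
}
```

With fitness values {1.0, 3.0}, the sketch assigns the two food sources selection probabilities of 0.25 and 0.75, so the better source attracts more onlooker bees on average.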
Since the size of the CA and DA archives is fixed, they can accommodate only a limited number of candidate solutions. If the number of candidate solutions exceeds CA's size, then the extra candidate solutions from the CA are deleted based on
4 Experimentation setup
To evaluate the effectiveness of the proposed approach, we conducted an empirical analysis. For this purpose, an experimentation setup was designed. The details of the experimental setup are given as follows.
System configuration To implement the proposed approach, we use the Java programming language. The implemented approach is executed on a personal computer
having an Intel Core i7-1160G7 4.4 GHz CPU and the Microsoft Windows 10 Home operating system.
Problem instances The selection of problem instances to evaluate the proposed approach is an important task, as it helps to generalize the results. We select a sample of problem instances that can represent the whole population of problem instances. The major characteristics of the chosen problem sets are given in Table 1.

Table 1 Description of the selected problem instances
Software systems   Version   #Classes   #Relationships   #Modules
JFreeChart         0.9.21    401        1420             50
JHotDraw           6.0b1     398        2125             17
JavaCC             1.5       154        722              6
JUnit              3.81      100        276              6
DOM4J              1.5.2     195        930              16

The selected problem instances are open-source software systems developed in the Java programming language and easily available on the web. These software projects have also been used by different researchers and academicians to validate similar approaches.
Results collection The proposed approach works in a randomized environment (i.e., a metaheuristic optimization algorithm). Therefore, it is not guaranteed that the proposed approach will produce the same results on different executions over the same problem instance. To validate the stochastic results, we collect the results of 31 runs for each problem instance. The other challenge is that each sample result is a set of trade-off solutions (i.e., a Pareto front) rather than a single solution, due to the multi-objective nature of the problem. To select a single best solution from the Pareto front, we use the trade-off worthiness metric [45].
Evaluation criteria To evaluate the performance of the proposed approach, we use internal and external quality metrics. The internal quality metric is used to evaluate the software remodularization solution from the perspective of the design of the software structure, while the external quality metric is used to compare the software remodularization solution with the authoritative and developers' perspectives. For the internal quality metric, we use modularization quality (MQ) [6], a widely used software quality metric for evaluating modularization quality. For the external quality metric, we use the MoJoFM [55] metric to compare the software remodularization solution with the authoritative and developers' perspectives of software remodularization.
Competitive approaches To assess the proposed approach, we compare the results with the existing remodularization approaches. The significant differences between our approach and the existing approaches [1,6,9,14,56] relate to the information used to define the objective functions and to the customized metaheuristic optimization algorithm. A brief description of the existing software remodularization approaches is given in Table 2.
Authoritative modularization In the past three decades, many software remodularization approaches have been proposed. Different software remodularization approaches generally suggest different remodularization solutions. The unavailability of a benchmark or ground truth software modularization makes evaluating these remodularization approaches a challenging task. The limited availability of software engineers for verifying remodularization solutions makes it even more challenging. These issues can lead to 1) difficulty in assessing the modularization techniques, 2) risk in trusting a particular modularization technique, and 3) researchers following flawed strategies in improving the accuracy of the methods.
To evaluate the proposed approach, we create a ground truth software modularization of the selected software projects. In this study, the ground truth software modularization is the modularization solution verified by software developers and system designers who have a good understanding of the project and problem domain. A ground truth software modularization created without the involvement of the application's or software project's own developers is known as an authoritative modularization. It is easier for academicians and researchers to create an authoritative modularization than a ground truth software modularization. Even so, it is very challenging to create the authoritative modularization without the involvement of the application's own developers, because there is a chance of omitting original design decisions and including spurious ones. In this study, we try to overcome these challenges.
Creating an authoritative modularization with the help of software developers who are not part of the system development team is a time-consuming task. In addition, software engineers show little interest in systems on which they have not worked. Moreover, such outside software developers do not have a correct, complete, and idealized picture of the system's modular structure. To reduce the software developers' effort while ensuring the generation of an accurate authoritative modularization, we first extract the original modularization of the software system. Then we involve the software developers in completing the process. From the software systems, we extract the bundle definition files for the original modularization. If the bundle definition files are not available, we extract the existing package organization as the original modularization.
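The fallback step of taking the existing package organization as the original modularization can be sketched in Java as follows. This is a minimal sketch under assumptions, not the authors' tooling: the class `PackageModularization` and its methods are hypothetical, the input is assumed to be a list of fully qualified class names, and each Java package is assumed to form one module. The class names in the usage note are illustrative inputs only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: derives modules from Java package names (illustration only). */
final class PackageModularization {

    /** Returns the package part of a fully qualified class name ("" for the default package). */
    static String packageOf(String fullyQualifiedName) {
        int lastDot = fullyQualifiedName.lastIndexOf('.');
        return lastDot < 0 ? "" : fullyQualifiedName.substring(0, lastDot);
    }

    /** Groups class names by package; each package is treated as one module. */
    static Map<String, List<String>> groupByPackage(List<String> classNames) {
        Map<String, List<String>> modules = new TreeMap<>();
        for (String name : classNames) {
            modules.computeIfAbsent(packageOf(name), k -> new ArrayList<>()).add(name);
        }
        return modules;
    }
}
```

For example, passing `junit.framework.TestCase`, `junit.framework.Assert`, and `junit.runner.BaseTestRunner` yields two modules, `junit.framework` with two classes and `junit.runner` with one, which the developers could then review and refine.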
Developers' view for software remodularization Apart from the authoritative software remodularization, we also evaluated the proposed approach against the modularization solution suggested by developers. In this case, we select only 20 modules of each software project, show them to the developers, and collect their suggestions. To conduct this task, we involve 21 software developers who were not part of the system development, because finding the original application's developers is not feasible for academicians. The selected developers are four PhD students, five M.Tech students, six B.Tech final-year students, and six software developers working on similar projects. These students and software developers were selected based on their knowledge and understanding of the selected software projects.

5 Experimental results and discussion
To make the results more comprehensive, we broadly divide the experimental results into two parts: 1) the results related to the internal quality metric (i.e., MQ results) and 2) the results related to the external quality metric (i.e., MoJoFM results). The details of the results and the discussion are provided in the following subsections.

5.1 MQ results
The mean MQ values of the remodularization solutions obtained through the different variants of the remodularization approach are presented in Table 3. As the variant MaSRV-6 represents our proposed approach, we compared MaSRV-6 with the other variants, i.e., MaSRV-1, MaSRV-2, MaSRV-3, MaSRV-4, and MaSRV-5. For this comparison, the Wilcoxon rank-sum test (0.05 significance level) is applied. The symbols “−”, “+”, and “≈” are used to represent whether the MQ value of a variant is significantly worse than, significantly better than, or not significantly different from that of the proposed approach, i.e., MaSRV-6.
Looking at the mean MQ values of each variant of the software remodularization, the mean MQ value of the variant MaSRV-6 is greater than that of the remaining variants, MaSRV-1, MaSRV-2, MaSRV-3, MaSRV-4, and MaSRV-5, in all cases. If we rank the performance of each variant in terms of MQ values, the variant MaSRV-6 gets rank 1, and the variants MaSRV-1, MaSRV-2, MaSRV-3, MaSRV-4, and MaSRV-5 get ranks 4, 6, 5, 3, and 2, respectively, in all cases. These results indicate that the proposed approach MaSRV-6 is the best performer and MaSRV-2 is the worst performer in generating remodularization solutions in terms of MQ values. According to the Wilcoxon rank-sum test results, the proposed MaSRV-6 variant performs significantly better than MaSRV-1, MaSRV-2, and MaSRV-3 in all cases. There is only one case (i.e., JavaCC) where MaSRV-6 is not significantly better than MaSRV-4, and there are only two cases (i.e., JHotDraw and DOM4J) where MaSRV-6 is not significantly better than MaSRV-5.
The Wilcoxon rank-sum test results obtained by comparing the proposed approach and the existing approaches with respect to the MQ values are provided in Table 4. The statistical test results presented in Table 4 are described as follows: the entries of the first row and first column are the names of the existing and proposed approaches (i.e., MaSRV-6), and the remaining rows and columns are the comparison results in
Table 5 Mean and Std of MoJoFM obtained through comparison of variants and authoritative modularization (each cell: mean [Std])
Software systems   MaSRV-1          MaSRV-2          MaSRV-3          MaSRV-4          MaSRV-5          MaSRV-6
JFreeChart         35.213 [2.426]   57.487 [5.353]   65.341 [6.732]   85.384 [6.845]   91.742 [8.328]   95.459 [7.462]
JHotDraw           27.348 [2.173]   52.485 [5.374]   67.395 [6.395]   82.394 [6.784]   90.304 [8.374]   93.754 [7.971]
JavaCC             38.384 [2.943]   49.948 [4.394]   61.485 [5.309]   80.406 [6.471]   89.418 [8.485]   96.803 [9.312]
JUnit              35.942 [3.294]   53.492 [4.582]   63.485 [5.394]   8.404 [6.482]    89.578 [7.405]   93.456 [8.348]
DOM4J              29.994 [2.384]   51.394 [4.394]   66.320 [5.942]   82.482 [6.042]   90.493 [8.483]   92.403 [8.193]
terms of significant difference for all five problem instances. There are five symbols in each table entry, and each symbol corresponds to one of the problem instances (i.e., JFreeChart, JHotDraw, JavaCC, JUnit, and DOM4J). The “+” and “−” symbols are used to show “significant” differences between the approaches, and the “≈” symbol is used to denote “no significant” difference between the approaches. Specifically, “+” denotes that the approach in the row performs significantly better than the approach in the column on the corresponding problem instance. Similarly, “−” denotes that the approach in the column performs significantly better than the approach in the row on the problem instance at that position. The “≈” symbol indicates no significant difference between the row's approach and the column's approach on the problem instance at that position. For example, consider the symbols [≈ + + ≈ +] placed in the second row and seventh column of Table 4. The first symbol, i.e., “≈”, denotes no significant difference between the proposed approach and PRAJ-2020 on the JFreeChart problem instance. The second symbol, i.e., “+”, denotes that the proposed approach performs significantly better than PRAJ-2020 on the JHotDraw problem instance.
Even though Table 4 presents the comparison results of each approach with the other approaches, here we focus only on the results of the proposed approach against the existing approaches. The results show that the proposed approach performs significantly better than the existing approaches in most cases. The proposed approach performs significantly better than PRAD-2011 and PRAJ-2017 in all problem instances, while there are a few cases where it does not perform significantly better than MKAO-2015 and PRAJ-2020. Overall, these results validate that the proposed approach is capable of generating remodularizations with better MQ values.

5.2 MoJoFM results
The MoJoFM external quality metric computes the similarity between the remodularization obtained through the proposed approach and the authoritative remodularization or the developers' view of remodularization. First, we present the MoJoFM results of the proposed and authoritative remodularizations; then we present the MoJoFM results of the proposed remodularization and the developers' view of remodularization. Table 5 presents the mean and standard deviation of MoJoFM computed between the different variants of remodularization and the authoritative modularization. Here, a larger MoJoFM value indicates that the two compared remodularizations are more similar to each other, whereas a smaller MoJoFM value indicates that they are more dissimilar. Looking at the MoJoFM results of each software remodularization variant, the variant MaSRV-6 (i.e., the proposed approach) achieves better values than the remaining variants, i.e., MaSRV-1, MaSRV-2, MaSRV-3, MaSRV-4, and MaSRV-5, in most of the cases. On the other hand, the variant MaSRV-1 has the worst MoJoFM values among the variants in most cases. The good values of the variant MaSRV-6 indicate that the proposed approach can generate modularizations that are more similar to the authoritative modularization. The overall pattern of the MoJoFM values of the software remodularization variants is MaSRV-1 < MaSRV-2 < MaSRV-3 < MaSRV-4 < MaSRV-5 < MaSRV-6 in all problem instances.
To test the similarity of the remodularization solutions produced by the different variants of the remodularization approach with the remodularization solution suggested by the developers, we also computed the MoJoFM values between these two modularizations. Table 6 presents the MoJoFM results obtained by comparing the different software remodularization variants of the proposed approach with the modularization suggested by the developers. As in the case of the authoritative modularization, the variant MaSRV-6 achieves better MoJoFM values than the remaining variants, i.e., MaSRV-1, MaSRV-2, MaSRV-3, MaSRV-4, and MaSRV-5, in most of the cases. The variant MaSRV-1 shows the lowest MoJoFM values in almost all cases and is regarded as the worst performer among all variants. The MoJoFM values of MaSRV-5 are closer to those of MaSRV-6 in most cases than those of the remaining variants. Hence, MaSRV-5 is the second-best performer in terms of similarity with the developers' view of modularization. Overall, the order of the variants with respect to the MoJoFM values for all problem instances is MaSRV-1 < MaSRV-2 < MaSRV-3 < MaSRV-4 < MaSRV-5 < MaSRV-6. These observations validate that the proposed MaSRV-6 has a strong capability to generate remodularization solutions that are readily acceptable to software developers.
Apart from comparing the remodularization results of the different software remodularization variants in terms of MoJoFM values with respect to the authoritative and
Table 6 Mean and Std of MoJoFM obtained through comparison of variants and developers’ modularization
Software systems MaSRV-1 MaSRV-2 MaSRV-3 MaSRV-4 MaSRV-5 MaSRV-6
26.846 48.384 65.058 85.405 92.048 94.954
JFreeChart [2.394] [4.294] [5.395] [6.495] [6.495] [7.584]
35.485 53.223 67.645 83.565 90.475 93.475
JHotDraw [3.472] [4.385] [5.762] [6.351] [8.576] [8.221]
38.875 47.575 64.567 81.332 89.412 94.392
JavaCC [2.485] [3.472] [4.998] [6.305] [7.337] [7.894]
32.574 45.566 68.058 81.048 91.494 93.204
JUnit [2.495] [4.001] [5.485] [7.049] [7.595] [7.434]
36.048 46.954 66.495 83.204 89.496 95.312
DOM 4J [2.495] [3.595] [6.503] [7.596] [7.123] [8.563]
developers’ view of software remodularization, we have also compared the best-performing variant, i.e., MaSRV-6 (the proposed approach), with the existing software remodularization approaches. Tables 7 and 8 present the Wilcoxon test results obtained by comparing the proposed and existing remodularization approaches. The meaning of the symbols used in Table 7 is the same as described for Table 4. Table 7 presents the Wilcoxon results obtained by comparing all the considered approaches in terms of the authoritative modularization. The results presented in Table 7 show that the proposed approach performs significantly better than the existing approaches in most cases. The second-best performer among all approaches is PRAJ-2020, and the third-best performer is MKAO-2015. PRAD-2011 is the worst performer, and JALA-2019 and PRAJ-2017 are average performers.

To validate the supremacy of the proposed approach (i.e., MaSRV-6) over the existing software remodularization approaches from the developers’ view, we compared the remodularization results of the proposed approach with those of the existing approaches. Table 8 presents the MoJoFM-based comparison between the remodularizations of the proposed approach and the existing approaches. These results are shown in terms of the Wilcoxon rank-sum test (i.e., whether an approach performs significantly better, significantly worse, or with no significant difference) and are obtained by comparing all the considered approaches in terms of the developers’ suggested modularization. The results show that the proposed approach performs significantly better than the existing approaches in most cases. If we look at the results of the remaining approaches, we find that the Amarjeet et al. [14] approach is the second-best and MKAO-2015 is the third-best approach among all approaches. The results of PRAD-2011 are the worst in most cases, and the results of JALA-2019 and PRAJ-2017 are average.

Table 7 Comparison of the proposed approach and the existing approaches in terms of MoJoFM with respect to the authoritative modularization
Approach    PRAJ-2020    PRAD-2011    JALA-2019    PRAJ-2017    MKAO-2015    Proposed
PRAJ-2020   NA           [− − − − −]  [− − − − −]  [− − − − −]  [− − − − −]  [≈ + + + +]
PRAD-2011   [+ + + + +]  NA           [− ≈ + ≈ −]  [+ + + + +]  [+ + + + +]  [+ + + + +]
JALA-2019   [+ + + + +]  [+ ≈ − ≈ +]  NA           [+ + + + +]  [+ + + ≈ +]  [+ + + + +]
PRAJ-2017   [+ + + + +]  [− − − − −]  [− − − − −]  NA           [− − − − −]  [+ + + + +]
MKAO-2015   [+ + + + +]  [− − − − −]  [− − − ≈ −]  [+ + + + +]  NA           [+ + + + +]
Proposed    [≈ − − − −]  [− − − − −]  [− − − − −]  [− − − − −]  [− − − − −]  NA

Table 8 Comparison of the proposed approach and the existing approaches in terms of MoJoFM with respect to the developers’ view
Approach    PRAJ-2020    PRAD-2011    JALA-2019    PRAJ-2017    MKAO-2015    Proposed
PRAJ-2020   NA           [− − − − −]  [− − − − −]  [− − − − −]  [− − − − −]  [+ ≈ + + +]
PRAD-2011   [+ + + + +]  NA           [− − ≈ − +]  [+ + + ≈ +]  [≈ − + ≈ +]  [+ + + + +]
JALA-2019   [+ + + + +]  [+ + ≈ + −]  NA           [+ + + + +]  [+ − + + +]  [+ + + + +]
PRAJ-2017   [+ + + + +]  [− − − ≈ −]  [− − − − −]  NA           [− − ≈ ≈ +]  [+ + + + +]
MKAO-2015   [+ + + + +]  [≈ + − ≈ −]  [− + − − −]  [+ + ≈ ≈ −]  NA           [+ + + + +]
Proposed    [− ≈ − − −]  [− − − − −]  [− − − − −]  [− − − − −]  [− − − − −]  NA

5.3 Discussion
In the formulation of the SRP as a many-objective optimization problem, the definitions of the objective functions and the design of the metaheuristic optimization algorithm contribute significantly to finding good remodularization solutions. Apart from evaluating the quality of a software remodularization solution, the objective functions help guide the remodularization process towards the expected solution. For object-oriented software systems, software package coupling and cohesion are the two main objective functions used in SBSR, often together with some other supportive objective functions. The software package coupling and cohesion are generally defined in terms of the class coupling of the software system. Therefore, the definition of class coupling is of major importance in SBSR. To compute the coupling between the source code classes, various types of source code information, such as structural, lexical, and changed-history information, are used. The different types of information used in computing the class coupling for software package coupling and cohesion computation have different importance in generating the remodularization solution in SBSR. For example, structural information helps guide the remodularization process towards a remodularization solution that is better from the structural quality point of view. The use of lexical or textual information can lead the remodularization process towards a remodularization solution that is more semantically coherent with the developers’ perspective of software remodularization. Similarly, changed-history information-based class coupling can lead the remodularization process towards a remodularization solution that is logically coherent from the developers’ perspective.

To generate a sophisticated software remodularization solution that is more effective from both the structural quality point of view and the developers’ perspective of software remodularization, the use of different types of structural, lexical, and changed-history information in the computation of class coupling is highly recommended. Many previous researchers have used various dimensions of structural, lexical, and changed-history information to compute the class coupling for software remodularization. The common issue in all these approaches is that they give equal importance to the different dimensions of the structural, lexical, and changed-history
information in class coupling computation. However, it is a well-known fact that the different dimensions of the structural, lexical, and changed-history information do not contribute equally to computing the class coupling, i.e., each dimension of structural, lexical, and changed-history information has its own importance in the contribution to class coupling computation.

To reveal the importance of different types of source code information, such as structural, lexical, and changed-history information, as well as their different dimensions in the computation of class coupling for the purpose of software remodularization, an empirical study with various variants of the set of objective functions has been conducted. Each variant of the many-objective software remodularization exploited different types and combinations of source code information for the computation of class coupling. The results presented in Sections 5.1 and 5.2 demonstrate that the use of different structural, lexical, and changed-history information with their relative importance helps produce remodularization solutions that are highly acceptable to software developers. On the other hand, individual information (i.e., either structural, lexical, or changed-history) used alone to compute the class coupling cannot generate a remodularization solution that is acceptable to software developers. So, the results presented in this study support the assumption that the use of different dimensions of structural, lexical, and changed-history information with their relative importance helps produce remodularization solutions that are highly acceptable to software developers. The comparative results also validate this assumption. Different existing approaches use different types and dimensions of structural, lexical, and changed-history information with equal importance in the computation of class coupling for software remodularization.

6 Threats to validity
The experimental setup designed to evaluate the proposed approach produces good results. However, there are many factors associated with the experimentation that can affect the proposed approach’s outcome. Therefore, it becomes necessary for the empirical study to identify the possible threats that can affect the validity of the results and the actions required to mitigate their impact. In this study, we have identified the following four categories of threats and applied suitable actions to mitigate them.

Conclusion validity In conclusion validity, there must exist a causal relationship between the experimentation and the treatment outcome. The major threats in our study that can affect the conclusion validity are: 1) initialization of the population: the random initialization of the population may bias the results if the initialization happens to be favourable. To mitigate this threat, we executed the algorithm many times with different random initializations. 2) use of a statistical approach to evaluate and compare the results obtained through multiple executions: the inappropriate use of a statistical method can mislead the conclusion. To mitigate this threat, we used a non-parametric statistical test, i.e., the Wilcoxon rank-sum test. The Wilcoxon rank-sum test produces good results with data of non-normal distribution.

Internal validity The internal validity is concerned with the validity of the results corresponding to the various treatment factors. In our approach, the different parameter settings of the algorithm can affect the results. To mitigate this threat, we used the trial-and-error method to determine the values of the parameters of the algorithms.

Construct validity Construct validity is concerned with the relationship between theory and outcome. The output of the approach must be consistent with the theory. The major threat to this validity is the cost of the evaluation. Different metaheuristic approaches may incur a different cost per iteration. To provide an equal assessment cost to each algorithm, we assigned an equal number of fitness evaluations to each algorithm instead of an equal number of iterations. The other threat can be the appropriateness of the quality measures. In this approach, we used well-accepted quality measures to evaluate the quality of the solutions.

External validity This validity is concerned with the generalization of the approach’s outcome to the broader perspective of the problem instances. In this approach, we have considered problem instances having different characteristics, ranging from small to large. This set of problem instances has already been used by different researchers in the area of software remodularization.

7 Conclusion and future works
In this paper, we introduced a multi-dimensional information-driven many-objective search-based software engineering approach for modularizing the software system. The approach aims at generating a remodularization solution that improves the software quality by optimizing different versions of coupling and cohesion measures, such as structural-based coupling and cohesion, lexical-based coupling and cohesion, and changed-history-based coupling and cohesion. To optimize the different quality metrics to produce the remodularization solution, we used the MaABC algorithm, a many-objective search-based software engineering approach with some effective changes in selection strategies. Specifically, in this study, we addressed several issues of existing software remodularization approaches that are restricted to the use of particular types of structural or lexical information-based cohesion and coupling. In this continuation, we created multiple variants of many-objective SRPs using different sets of objective functions. An extensive experiment on the remodularization of five object-oriented systems was performed to validate the supremacy of the proposed contributions. The obtained results show that the proposed contributions can generate a more effective remodularization solution than traditional SBSR approaches. In future work, the proposed work can be integrated with IDEs and configuration management systems to provide an automated remodularization recommendation system.

References
1. Mkaouer W, Kessentini M, Shaout A, Koligheu P, Bechikh S, Deb K, Ouni A. Many-objective software remodularization using NSGA-III.
ACM Transactions on Software Engineering and Methodology, 2015, 24(3): 17
2. Mancoridis S, Mitchell B S, Rorres C, Chen Y F R, Gansner E R. Using automatic clustering to produce high-level system organizations of source code. In: Proceedings of the 6th International Workshop on Program Comprehension. 1998, 45–52
3. Parashar A, Chhabra J K. Mining software change data stream to predict changeability of classes of object-oriented software system. Evolving Systems, 2016, 7(2): 117–128
4. Parashar A, Chhabra J K. Package-restructuring based on software change history. National Academy Science Letters, 2017, 40(1): 21–27
5. Anquetil N, Lethbridge T C. Experiments with clustering as a software remodularization method. In: Proceedings of the 6th Working Conference on Reverse Engineering. 1999, 235–255
6. Praditwong K, Harman M, Yao X. Software module clustering as a multi-objective search problem. IEEE Transactions on Software Engineering, 2011, 37(2): 264–282
7. Abdeen H, Ducasse S, Sahraoui H A, Alloui I. Automatic package coupling and cycle minimization. In: Proceedings of the 16th Working Conference on Reverse Engineering. 2009, 103–112
8. Mahdavi K, Harman M, Hierons R M. A multiple hill climbing approach to software module clustering. In: Proceedings of the International Conference on Software Maintenance. 2003, 315–324
9. Amarjeet, Chhabra J K. Harmony search based remodularization for object-oriented software systems. Computer Languages, Systems & Structures, 2017, 47: 153–169
10. Amarjeet, Chhabra J K. FP-ABC: fuzzy-Pareto dominance driven artificial bee colony algorithm for many-objective software module clustering. Computer Languages, Systems & Structures, 2018, 51: 1–21
11. Amarjeet, Chhabra J K. Many-objective artificial bee colony algorithm for large-scale software module clustering problem. Soft Computing, 2018, 22(19): 6341–6361
12. Marcus A, Poshyvanyk D, Ferenc R. Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Transactions on Software Engineering, 2008, 34(2): 287–300
13. Bavota G, De Lucia A, Marcus A, Oliveto R. Software re-modularization based on structural and semantic metrics. In: Proceedings of the 17th Working Conference on Reverse Engineering. 2010, 195–204
14. Prajapati A, Parashar A, Chhabra J K. Restructuring object-oriented software systems using various aspects of class information. Arabian Journal for Science and Engineering, 2020, 45(12): 10433–10457
15. Doval D, Mancoridis S, Mitchell B S. Automatic clustering of software systems using a genetic algorithm. In: Proceedings of the 9th International Workshop Software Technology and Engineering Practice. 1999, 73–81
16. de Oliveira Barros M. An analysis of the effects of composite objectives in multiobjective software module clustering. In: Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation. 2012, 1205–1212
17. Kumari A C, Srinivas K. Hyper-heuristic approach for multi-objective software module clustering. Journal of Systems and Software, 2016, 117: 384–401
18. Prajapati A, Geem Z W. Harmony search-based approach for multi-objective software architecture reconstruction. Mathematics, 2020, 8: 1906
19. Kargar M, Isazadeh A, Izadkhah H. Semantic-based software clustering using hill climbing. In: Proceedings of the International Symposium on Computer Science and Software Engineering Conference. 2017, 55–60
20. Rathee A, Chhabra J K. Clustering for software remodularization by using structural, conceptual and evolutionary features. Journal of Universal Computer Science, 2018, 24(12): 1731–1757
21. Li W, Henry S. Object-oriented metrics that predict maintainability. Journal of Systems and Software, 1993, 23(2): 111–122
22. Chidamber S R, Kemerer C F. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 1994, 20(6): 476–493
23. Martin R. OO design quality metrics-an analysis of dependencies. In: Proceedings of the Workshop on Pragmatic and Theoretical Directions in Object-Oriented Software Metrics. 1994
24. Briand L, Devanbu P, Melo W. An investigation into coupling measures for C++. In: Proceedings of the 19th International Conference on Software Engineering. 1997, 412–421
25. Briand L C, Daly J W, Wüst J K. A unified framework for coupling measurement in object-oriented systems. IEEE Transactions on Software Engineering, 1999, 25(1): 91–121
26. Hitz M, Montazeri B. Measuring coupling and cohesion in object-oriented systems. In: Proceedings of the International Symposium on Applied Corporate Computing. 1995
27. Eder J, Kappel G, Schreft M. Coupling and cohesion in object-oriented systems. Klagenfurt: University of Klagenfurt, 1994
28. Lee Y S, Liang B, Wu S, Wang F. Measuring the coupling and cohesion of an object-oriented program based on information flow. In: Proceedings of the International Conference on Software Quality. 1995
29. Savić M, Ivanović M, Radovanović M. Analysis of high structural class coupling in object-oriented software systems. Computing, 2017, 99(11): 1055–1079
30. Wu Z, Palmer M. Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics. 1994, 133–138
31. McLachlan G J, Krishnan T. The EM Algorithm and Extensions. 2nd ed. Hoboken: John Wiley & Sons, Inc., 2008
32. Zimmermann T, Weißgerber P, Diehl S, Zeller A. Mining version histories to guide software changes. IEEE Transactions on Software Engineering, 2005, 31(6): 429–445
33. Beck F, Diehl S. Evaluating the impact of software evolution on software clustering. In: Proceedings of the 17th Working Conference on Reverse Engineering. 2010, 99–108
34. Oliva G A, Santana F W S, Gerosa M A, de Souza C R B. Towards a classification of logical dependencies origins: a case study. In: Proceedings of the 12th International Workshop on Principles of Software Evolution and the 7th annual ERCIM Workshop on Software Evolution. 2011, 31–40
35. Beyer D, Noack A. Clustering software artifacts based on frequent common changes. In: Proceedings of the 13th International Workshop on Program Comprehension. 2005, 259–268
36. Fluri B. Assessing changeability by investigating the propagation of change types. In: Proceedings of the 29th International Conference on Software Engineering. 2007, 97–98
37. D'Ambros M, Lanza M, Robbes R. On the relationship between change coupling and software defects. In: Proceedings of the 16th Working Conference on Reverse Engineering. 2009, 135–144
38. Sun X, Li B, Zhang Q. A change proposal driven approach for changeability assessment using FCA-based impact analysis. In: Proceedings of the 36th International Conference on Computer Software and Applications. 2012, 328–333
39. Saxena D K, Duro J A, Tiwari A, Deb K, Zhang Q. Objective reduction in many-objective optimization: linear and nonlinear algorithms. IEEE Transactions on Evolutionary Computation, 2013, 17(1): 77–99
40. Yuan Y, Xu H, Wang B, Yao X. A new dominance relation-based evolutionary algorithm for many-objective optimization. IEEE Transactions on Evolutionary Computation, 2016, 20(1): 16–37
41. Yang S, Li M, Liu X, Zheng J. A grid-based evolutionary algorithm for many-objective optimization. IEEE Transactions on Evolutionary
Computation, 2013, 17(5): 721–736
42. Díaz-Manríquez A, Toscano-Pulido G, Coello C A C, Landa-Becerra R. A ranking method based on the R2 indicator for many-objective optimization. In: Proceedings of the IEEE Congress on Evolutionary Computation. 2013, 1523–1530
43. Zitzler E, Künzli S. Indicator-based selection in multiobjective search. In: Proceedings of the 8th International Conference on Parallel Problem Solving from Nature. 2004, 832–842
44. Wang H, Jiao L, Yao X. Two_Arch2: an improved two-archive algorithm for many-objective optimization. IEEE Transactions on Evolutionary Computation, 2015, 19(4): 524–541
45. Rachmawati L, Srinivasan D. Multiobjective evolutionary algorithm with controllable focus on the knees of the Pareto front. IEEE Transactions on Evolutionary Computation, 2009, 13(4): 810–824
46. Rachmawati L, Srinivasan D. Preference incorporation in multi-objective evolutionary algorithms: a survey. In: Proceedings of the IEEE International Conference on Evolutionary Computation. 2006, 962–968
47. Zhang Q, Li H. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation, 2007, 11(6): 712–731
48. Yuan Y, Xu H. Multiobjective flexible job shop scheduling using memetic algorithms. IEEE Transactions on Automation Science and Engineering, 2015, 12(1): 336–353
49. Lygoe R J, Cary M, Fleming P J. A real-world application of a many-objective optimisation complexity reduction process. In: Proceedings of the 7th International Conference on Evolutionary Multi-Criterion Optimization. 2013, 641–655
50. Narukawa K, Rodemann T. Examining the performance of evolutionary many-objective optimization algorithms on a real-world application. In: Proceedings of the 6th International Conference on Genetic and Evolutionary Computing. 2012, 316–319
51. Harman M, Jones B F. Search-based software engineering. Information and Software Technology, 2001, 43(14): 833–839
52. Karaboga D. An idea based on honey bee swarm for numerical optimization. Kayseri: Erciyes University, 2005
53. Li K, Chen R, Fu G, Yao Z. Two-archive evolutionary algorithm for constrained multiobjective optimization. IEEE Transactions on Evolutionary Computation, 2019, 23(2): 303–315
54. Prajapati A. Two-archive fuzzy-Pareto-dominance swarm optimization for many-objective software architecture reconstruction. Arabian Journal for Science and Engineering, 2021, 46(4): 3503–3518
55. Andritsos P, Tzerpos V. Information-theoretic software clustering. IEEE Transactions on Software Engineering, 2005, 31(2): 150–165
56. Jalali N S, Izadkhah H, Lotfi S. Multi-objective search-based software modularization: structural and non-structural features. Soft Computing, 2019, 23(21): 11141–11165

Amarjeet Prajapati is currently working as Assistant Professor in Department of Computer Science Engineering & Information Technology at Jaypee Institute of Information Technology, India. He received his PhD degree from National Institute of Technology, India. He has published many research papers in reputed journals and conferences. His area of interest includes software engineering, metaheuristic algorithms, machine learning, and soft computing.

Anshu Parashar is currently working as Assistant Professor in Department of Computer Science & Engineering at Thapar Institute of Engineering & Technology (TIET), India. He received his PhD degree from National Institute of Technology, India. He has published many research papers in reputed journals and various national and international conferences. His area of interest includes software engineering, machine learning, blockchain and cloud computing for sustainable development.

Amit Rathee is currently working as Assistant Professor in the Department of Computer Science, Government College Barota, India. He was awarded a PhD degree from National Institute of Technology, India in 2020. He has presented and published many research papers in reputed journals and various national and international conferences. His research interests include software engineering, human-computer interaction, soft computing, algorithms, and artificial intelligence.