
Ad Hoc Networks 128 (2022) 102786

Contents lists available at ScienceDirect

Ad Hoc Networks
journal homepage: www.elsevier.com/locate/adhoc

Data analytics of social media 3.0: Privacy protection perspectives for integrating social media and Internet of Things (SM-IoT) systems
Sara Salim ∗, Benjamin Turnbull, Nour Moustafa
School of Engineering and Information Technology, University of New South Wales, Canberra, Australia

ARTICLE INFO

Keywords: Social media; Internet of Things (IoT); Data analytics; Privacy preservation

ABSTRACT

With the rapid evolution of web technologies, Web 3.0 aims to expand on current and emerging social media platforms such as Facebook, Twitter, and TikTok, and to integrate emerging computing paradigms, including the Internet of Things (IoT); the result is named social media 3.0. The combination of these platforms in Web 3.0 promises consumers greater integration, interaction, and more seamless movement between physical spaces. However, ensuring the privacy of data across such systems is a potential challenge in this space. In this study, we propose a new privacy-preserving social media 3.0 framework that illustrates the interaction of social media (SM) and IoT services and estimates how this interaction could impact users’ behaviors. The framework consists of three main components. First, a new relational dataset, named SM-IoT, is designed to dynamically connect users with their IoT services and assist in processing data heterogeneity. Second, a data pre-processing module is employed for filtering heterogeneous data and providing a certain level of privacy preservation on the data. Third, data analytics using different statistical and machine/deep learning methods are applied to examine data complexity and identify users’ behaviors. The results revealed that our proposed framework can efficiently identify users’ behaviors from social media 3.0 data sources. The outcomes of comparing our SM-IoT dataset with two other well-known SM datasets, namely Pokec and Renren, as well as the Activity Recognition with Ambient Sensing (ARAS) IoT dataset, reveal the fidelity of our dataset for use in future evaluations of privacy-preserving and machine learning-based decision-making techniques.

1. Introduction

Web 2.0 has fundamentally changed how the Internet is used, from a source of data about the world to a medium used principally for communication, user-generated content, data sharing, and society building. Web 2.0 platforms have been at the center of this change [1]. According to Berners-Lee [2], the first execution of the web, representing Web 1.0, can be considered the read-only web. In other words, the early web permitted users only to search for and read information. Early instances of shopping cart applications, mostly used by e-commerce websites, basically fall under the Web 1.0 category, as they aimed to introduce products to potential clients without any active communication or information flow between clients and producers. As a result of the lack of active user engagement with web applications, Web 2.0 emerged, marking the start of the “Read-Write-Publish” era [1]. Web 2.0 has dramatically altered the web environment, allowing users to contribute content and communicate with other users. New concepts such as Blogs, Social Media (SM), and Video Streaming were introduced to web service users during this era. Facebook, YouTube, Flickr, TikTok, and Twitter are a few examples of Web 2.0 innovations.

The concept of web services is not new, as it exists across many modern web properties. However, in the context of Web 3.0, it becomes the overwhelming focus [1]. Web 3.0 promises the potential for SM platforms that can speak to each other directly, providing integration and ease of use to consumers. Web 3.0 achieves this by combining semantic markups and technologies that aim to bridge the divide between human web users and computerized applications with web services and direct API integration between disparate systems [1]. Complex SM platforms, such as Twitter and Facebook, where the data are the basis for sophisticated models of the relationships between users, can be leveraged in combinatory ways to increase the relevance of what is shown to end-users. This has significant advantages for the recommendation algorithms that underpin these platforms, and also for advertising purposes.

Recently, SM platforms have been advanced as a principal producer of big data, analysis of which has shown significant impacts [3]. In [4], some practices and factors for which SM data analysis has been applied to innovative planning are discussed. Also, based on the fact that a user can be a participant in several social networks simultaneously, a composite

∗ Corresponding author.
E-mail address: [email protected] (S. Salim).

https://doi.org/10.1016/j.adhoc.2022.102786
Received 15 February 2021; Received in revised form 16 November 2021; Accepted 14 January 2022
Available online 31 January 2022
1570-8705/© 2022 Elsevier B.V. All rights reserved.

social network in which the user can exhibit different behaviors may be structured but, at the same time, reveal some common latent concerns and preferences. Wong et al. [5] proposed an enhanced model for learning multi-view user representations using the knowledge gained from various social networks to anticipate human behaviors, which can improve social advertising, predictions of preferences, and service recommendations.

User representation and modeling have been widely used for various applications, such as community recognition and recommendation [3]. One of the key challenges faced is that integrating multiple data sources, such as user profiles and their sensing data collected by IoT devices, causes difficulty in managing big data on a single SM platform. Such data is often available to the platform owners only. In real life, numerous SM users are members of several social platforms simultaneously. However, constrained by the features within each platform, any single one provides only a partial view of a user, based on the data collected and the purpose of the SM platform. Therefore, integrating the data of various SM platforms is essential for enhancing user modeling and developing accurate decision-making systems [3]. Unlike traditional network-embedding datasets, where either the entire structure is a single platform or each platform involved in it is a homogeneous network, there is a need to focus on multiple social platforms. However, it is hard to adequately combine knowledge in this setting, as it relies not only on linking disparate data sources but also on objective applications for improving recommendation engines or evaluating other models such as privacy-preserving ones.

For preserving the data privacy of users, SM data has significant potential for use as a tool to evaluate privacy models [6]. In [7], three different Facebook datasets have been frequently used for studying privacy-preserving challenges. The first is the SNAP dataset, which comprises user relationships and a set of node features, including gender, birthday, employer, and location [8]. Each attribute is specified by a binary value, which represents the absence or presence of the corresponding feature. The other two datasets, from Caltech and MIT, contain the relationships or links between users at the California Institute of Technology who were active at a specific date in 2005 [7]. Those datasets contain node features that represent specific information, including whether a user is a student or faculty member, gender, graduation year, and scholastic major [9]. Although the datasets have been widely used for evaluating various privacy models, they are very limited, single platform-based datasets [10].

Research Motivation: Despite the continuous evolution of SM datasets, existing benchmarks do not suit collaborative machine learning-enabled privacy-preserving algorithms and recommender systems [10]. This makes it more difficult to perform meaningful comparisons and improvements of different structures, which is critical for enhancing the efficiency of both privacy frameworks and recommendation systems. As stated in [11], this problem persuaded several authors to compose their own programs and crawl social networks to accumulate data for experimental purposes. Structuring a benchmark suite to evaluate the performance of a privacy model and enhance recommendation systems is a matter of concern to researchers [10].

Benchmarking SM data sources is considerably more challenging than ever, as social networks are still in their infancy, constantly in flux, and, consequently, not completely understood. Also, SM data are reasonably complicated and heterogeneous compared with those of previous systems, for example, those with single-node processing stages [12]. These approaches also have issues related to ethics. Although some researchers have begun to develop benchmarks for SM data, it is still unclear whether these benchmarks can be used to precisely evaluate the credibility of Machine Learning (ML). Moreover, existing datasets are not robust enough for effectively assessing privacy preservation and predicting users’ preferences [10]. More specifically, the development of a new dataset that can be used to train and validate new variations of privacy-preserving-based ML methods and recommender systems is a significant issue, which will be addressed in this study.

Research Contribution: This study presents a comprehensive framework for generating the SM-IoT dataset, a simulated dataset integrating the data of SM and the social Internet of Things (IoT) in a privacy-preserving manner. For the SM data, individual user information of the type found on common SM platforms is generated. Such data includes given name, gender, age, preferences, SM usage patterns, and personality traits. Social IoT data includes information about a user’s smart home IoT devices and the indicative preferences generated. Both sets of data are combined in the complete SM-IoT dataset and sanitized in an integrated framework to realize the principle of privacy preservation and provide the whole dataset, as well as IoT-based SM network platforms, with a certain level of privacy. Aside from integrating the simulated SM users’ profiles and IoT data, this framework attempts to protect data privacy in ML-based decision-making against any potential attacks in IoT-based SM networks. We do this by applying a privacy preservation mechanism to the generated data. Privacy preservation, in this regard, protects the original data of IoT-based SM networks from being published or revealed by unauthorized users by transforming the original data into a different data distribution that does not allow end-users to infer the original data. Here, applying a privacy preservation mechanism in this framework strengthens its privacy preservation capability and provides a preserved and more statistically considered version of the generated SM-IoT dataset to the ML-based applications. Since only the preserved version of the dataset is used while the original data is protected with a privacy preservation mechanism, the privacy of IoT-based SM data against any potential inference attack could be guaranteed in this framework. It is the authors’ opinion that our framework will be particularly useful for SM, social IoT, and medical data consumers who want to conduct ML-based decision-making but are dealing with highly sensitive data, and where inaccurate training data could have significant implications. The proposed framework will coordinate such data utilization by ensuring that original/private information is not shared. Moreover, the newly generated SM-IoT dataset would have several diverse analysis use-cases, including evaluations of the credibility of ML and privacy-preservation models, and predictions of users’ preferences for recommendation services.

The key contributions of this study are structured as follows:

1. We propose a new privacy-preserving social media 3.0 framework, revealing the interaction of SM and IoT services and measuring how this would model users’ behaviors.
2. We generate a new realistic SM-IoT dataset that integrates social media and IoT data, including user profiles, IoT profiles, properties and links, and connected simulated IoT devices.
3. We employ the correlation coefficient matrix to statistically analyze the correlations of the dataset’s proposed features.
4. We evaluate the performance of privacy preservation approaches and the decision making of ML and deep learning, using the proposed dataset compared with three benchmarking ones.

The remainder of this paper is organized as follows: the background to this study and related work are discussed in Section 2; Section 3 presents the framework for generating the SM-IoT dataset; the dataset’s properties and a comparison with those of other state-of-the-art datasets are explained in Section 4; and finally, the conclusion to this study and future directions are described in Section 5.

2. Background and related work

This section discusses the background to and previous studies of SM, the usability of its data, and its data collection techniques.


2.1. Background

The amount of user-generated SM data is exploding. Twitter, the popular microblogging platform comprising profiles, connections, short messages, and images, saw 500 million tweets generated per day and had 330 million monthly active users in April 2020 [13]. Facebook, the largest general-use SM platform, had up to 2.6 billion monthly active users in May 2020 and generates petabytes of data per day [14]. Also, SM has shaped our interactions, as it is where people can communicate directly with friends, future employers, and popular people/celebrities.

SM is an Internet-based technology that allows people to share ideas, feelings, and information by creating virtual networks and communities [15]. While SM has a significant role in connecting people to the world, the social IoT links a large number of heterogeneous physical devices to the Internet [16]. By design, SM provides fast transmission of data such as personal information, videos, and photos. Large-scale big data, such as that generated by SM, has a wide range of potential applications for various data consumers and is one of the assets that underpin the profitability of SM platform businesses [17]. Such data allows the microtargeting of advertising, as well as feeding the recommendation engines that the majority of social media platforms rely on.

There are also significant third-party benefits. For example, many e-commerce applications also collect this information and, wherever possible, link to customer SM profiles as a means of collecting information such as locations, interests, lifestyles, and personality traits, as well as IoT data on social platforms. This allows them to study customer behaviors, utilize the viral aspects of SM platforms to propagate their advertising reach, and also understand long-term trends in their customer base [18,19]. Public policymakers investigate SM data to acquire demographic information that can be used to influence strategic decisions. The use of SM at a governmental level has both advantages and disadvantages: SM data is considered inaccurate compared with paid polling services, but it may be available when these services are not, is far more timely, and has benefits of scale that traditional polling services cannot compete with.

Although SM and social IoT data are the key drivers for companies that develop and operate these decision-making systems, data privacy is still an issue in this space. Specifically, there is a need to protect such data from privacy breaches, such as deanonymisation through inference attacks [19,20]. Attackers may be able to exploit and misuse users’ data using sophisticated and targeted hacking techniques [21,22]. This includes targeted attempts at phishing, scams, spear-phishing, and opinion-shaping, as well as mechanisms for deanonymisation and private data leakage. Using such hacking techniques, sensitive information about users and their relationships could be illegally published in SM and violate the users’ privacy; thus, protecting such information has become challenging [20]. Social networks are unique in this respect, as they occupy a unique connection to the world, combine private data, and are heavily interconnected.

The availability of SM datasets is vital in all disciplines covering social users, such as research on privacy preservation and predicting interests. As well as for analysis, SM data are studied in such diverse fields as data mining, network science, privacy preservation, and recommendation systems. The majority of research being conducted in these fields takes the form of analyzing SM data and its usage as the basis for further investigation. Therefore, increasing numbers of SM datasets are used in the literature, with more regularly becoming publicly available. Since the adoption of SM platforms, such as Facebook and Twitter, and the social IoT by the general public, they are increasingly being studied, and there is more demand for ethically acquired and ground-truth datasets.

2.2. Related work

Since its inception in 1989 [2], the World Wide Web has made significant strides as the world’s largest data structure. The World Wide Web has gone through different stages of growth during its life cycle, from Web 1.0 to Web 4.0. In the literature [1], Web 1.0 is defined as a web of information connections, Web 2.0 is described as a web of people connections, Web 3.0 is described as a web of knowledge connections, and Web 4.0 is explained as a web of intelligence connections. Going by the trend of constant evolution, the web is now slowly but steadily transitioning to a more data-centric phase in the context of Web 3.0.

In 2006 [1], the New York Times’s John Markoff proposed Web 3.0 as the third generation of the web. Web 3.0’s core concept is to structure data and connect it to improve discovery, automation, integration, and reuse through applications [17]. Web 3.0 aims to connect, incorporate, and analyze data from different SM platforms to create a new knowledge stream. It can improve data management, promote mobile internet usability, stimulate creativity and innovation, facilitate globalization phenomena, increase customer satisfaction, and help in the organization of SM-related big data collaboration.

Early research into SM-related big data had researchers collecting SM data from individuals through the use of questionnaires, interviews, and surveys [23]. This had several disadvantages: these processes are labor-intensive, difficult to operate at scale, and usually conducted in small communities, thereby limiting their scope for analysis. The advancement of online SM platforms has introduced a great shift in SM data research [24], as it has significantly improved the availability and quantification of SM data. Many online SM datasets have been collected automatically through programs or scripts [25–27], while some SM platforms, including Facebook and Twitter, provide APIs for data collection [28].

The main challenges of collecting SM data are limited processing power, storage capacity, and users’ privacy concerns [23]. As SM platforms are usually huge in terms of the size of their user populations, volumes of user-generated content, and update velocities, which continue to grow rapidly [3], crawling them requires a robust system with huge storage capacity and great processing power. Moreover, they usually put restrictions on their data access rates and the information available. There are also licenses for the data collected, which impede the concept of open and complete datasets [28].

Given these issues, it is often infeasible to collect complete SM datasets [11]. Some studies, such as those in [11,27], used various SM crawling methods to collect users’ profile data from a large SM platform for analysis purposes, while others, such as that in [26], obtained samples of users’ interaction data through non-rate-limited APIs. However, their success in terms of data quality is still a challenge, and the degree to which the collected data represent those in the original dataset is still ambiguous [10]. This failure to consider the quality and potential bias of data collected from a single stand-alone SM platform reduces the effectiveness and validity of the results produced. These approaches have also highlighted the ethical considerations posed by such data collection methods.

While numerous research studies [26,27,29–31] have been conducted to generate SM datasets, the creation of realistic SM datasets that include recent data features and scenarios remains an unexplored subject. More significantly, some datasets lack IoT-associated data, while others neglected to include any new features. In some cases, the generation environment was not realistic, and in others the privacy preservation mechanisms were not diverse. This study aims to solve these flaws by developing the new SM-IoT dataset and evaluating it using a privacy preservation mechanism, based on ML and deep learning algorithms.

Since the IoT has raised concerns about data privacy and security, large datasets are needed for analyzing network flows, distinguishing


between normal and abnormal traffic, and detecting malicious conduct. The development of a realistic IoT dataset is critical for privacy-preserving models. Over the years, research involving IoT smart device datasets has generally been of one of three types. The first type comprises studies that utilize specially developed laboratories that have existing sensors and actuators across multiple platforms [32]. In many cases, these are developed with use-cases in mind, to replicate homes or commercial or industrial systems. Studies from this category are often focused on the human interaction elements and research questions that are prevalent in understanding smart environments, and revolve around how people work with multitudes of these interconnected systems. The second category of IoT smart device dataset creation focuses on individuals and the Personal Area Network. This has applicability for monitoring health and wellbeing [33,34]. The third category of research is in simulated datasets; using tested data, it is possible to emulate large numbers of interconnected devices. There are advantages and disadvantages to this approach: the achievable scale and ease of creation are considerable. However, simulation has limitations and may not accurately reflect the implementation of physical devices.

In this regard, the literature has introduced a variety of datasets to assist researchers in simulating IoT devices and producing IoT datasets, including [33–36]. While several datasets remain private for various reasons, including privacy issues and a lack of privacy protection mechanisms, others have become publicly accessible. The stated aim of these studies and the datasets associated with them was to generate evidence of malicious behavior. Although this is significant, we chose to concentrate our efforts on making social IoT data smart enough by using it to provide long-term classification and prediction of users’ preferences as well as behaviors through monitoring the IoT devices. Consequently, the inclusion of the IoT data in the SM dataset is the proposed dataset’s key novelty. To be specific, we used common middleware (Node-RED) [37] to simulate the presence of IoT in SM data. Also, the Message Queuing Telemetry Transport (MQTT) protocol [35], a publish–subscribe communication protocol widely used for lightweight network communication such as IoT, is used to simulate IoT traffic. Unlike previous studies [26,27], IoT data are generated as part of our SM-IoT dataset.

Combining real information from SM into an IoT dataset raises other considerations, particularly that it is possible to infer sensitive information. Additionally, if data is collected for research purposes, it is possible for a third party to infer the true preferences of any individual from the data source. If the data from individuals are obscured, the analysis results will be inaccurate, a problem that privacy preservation techniques need to resolve. Although numerous studies [38–40] have presented privacy-preserving mechanisms for SM data and social IoT data, providing an accurate and trustworthy data utility level from such data while maintaining a high privacy level remains a major challenge [39].

To fill this research need, this study generates, sanitizes, and combines simulated SM data of Facebook users with the social IoT of smart home devices. It does this to realize the principle of privacy preservation while enhancing the efficiency of data analytics and monitoring applications. Such data sanitization significantly impacts privacy preservation, as it conflicts with its own mechanisms; various privacy-preserving mechanisms have different functionality for preserving sensitive information from the original data. To understand the dataset’s representation, its compatibility with other benchmark datasets, and its potential for use in data mining, exploratory analysis is performed. Moreover, considering big data privacy preservation in the context of SM platforms and social IoT systems, our framework mainly describes the way of integrating those data from the data utility perspective, simulates the logical association among the SM users and IoT devices based on their preferences, and formulates the optimal privacy preservation scheme that enables existing SM and social IoT systems to achieve a high utility level from their data while maintaining a certain level of data privacy. To some extent, this framework addresses the challenge of providing an accurate and trustworthy data utility level from IoT-based SM networks while maintaining a high privacy level.

3. Framework for generating SM-IoT dataset

The framework for integrating the data of simulated SM users’ profiles and IoT smart devices is shown in Fig. 1. It contains three main steps: data generation, data pre-processing, and data analysis, as described in the following sub-sections.

3.1. Data generation

In this stage, the SM-IoT dataset is generated by simulating the data of SM users and the ground-truth of the IoT of the two major SM platforms, Facebook and the social IoT. This dataset contains newly generated profile and network data consisting of 𝐸 edges and 𝑆 IoT devices linked to 90% of the 𝑈 users. To do this, Python scripts were developed to generate the users’ data, while the IoT data were collected using Node-RED, a flow-based development tool originally developed to link hardware devices, APIs, and online services as part of the IoT. It does this using JavaScript functions.

For this work, the Node-RED tool was used on Ubuntu VMs to simulate a variety of IoT sensors linked with a public IoT hub. Within the Node-RED program, we developed JavaScript code that simulated several IoT sensors associated with SM users, including atmospheric pressure, air quality, temperature, and humidity. The association between the SM users and IoT devices is generated using Python code based on the preferences produced in Algorithm 2, with approximately 8% of the IoT devices linked to each interested user, out of the total of 90% of the SM users. The code for each of these linked devices utilized the subscribe and publish functions of the MQTT protocol. MQTT is a lightweight protocol for Machine-to-Machine (M2M) communication and is a popular option for IoT systems. MQTT operates with a publish/subscribe model, where each device publishes data, with a topic and a payload, to an MQTT broker, which is a server-based system.

The topic is used to organize the published data and provides the mechanism for other systems to connect to the broker and receive information from the topics they subscribe to. As shown in the example of Fig. 2, we applied the following IoT scenarios in the simulation environment of the IoT dataset:

1. Air_Quality Service (Topic: /Smarthome/Air_Quality), which measures the air quality and generates a health effects statement (Healthy or Unhealthy for sensitive groups) based on the measurements.
2. Dishwasher (Topic: /Smarthome/Dishwasher), which turns the dishwasher on or off and notifies the users of the remaining time while a cycle is running.
3. DoorBell (Topic: /Smarthome/DoorBell), which notifies the user when someone rings any bell in the home.
4. Dryer (Topic: /Smarthome/Dryer), which informs the user of the dryer’s status (on or off) or displays an error message if idle.
5. Fan (Topic: /Smarthome/Fan), which displays the status of the existing fans and turns them off after a certain time.
6. Fridge_Sensor (Topic: /Smarthome/Fridge_Sensor), which measures the temperature of the fridge and, when required, adjusts it to below a threshold.
7. Garage_door (Topic: /Smarthome/Garage_door), which closes or opens based on a probabilistic input.
8. GPS_Tracker (Topic: /Smarthome/GPS_Tracker), which generates longitude and latitude information of predefined entities.
9. Heating_System (Topic: /Smarthome/Heating_System), which adjusts the heating system based on the weather status (cold/hot).
10. Home_Weather (Topic: /Smarthome/Home_Weather), which produces information about the humidity, air pressure, and temperature of the home.


11. Motion_Light (Topic: /Smarthome/Motion_Light), which turns off or on upon a pseudo-randomly generated signal.
12. Security_System (Topic: /Smarthome/Security_System), which warns the user if any undefined persons are detected, along with their exact location and time.
13. Smart_AC (Topic: /Smarthome/Smart_AC), which adjusts the home’s temperature by starting up the air-conditioning system.
14. Smart_Bulb (Topic: /Smarthome/Smart_Bulb), which turns on or off based on time intervals.
15. Smart_Doors (Topic: /Smarthome/Smart_Doors), which opens or closes the door or warns the user based on the generated signal.
16. Smart_Kitchen_Appliances (Topic: /Smarthome/Smart_kitchen_Appliances), which turns on or off based on a predefined setting and generates information on the status of the appliance.
17. Smart_Plug (Topic: /Smarthome/Smart_Plug), which turns on or off according to the input demands.
18. Smart_Vacuum (Topic: /Smarthome/Smart_Vacuum), which turns the vacuum on or off, recharges it when necessary, and generates information about its current location.
19. Smart_Window (Topic: /Smarthome/Smart_Window), which opens or closes the window and notifies the users when it has been open for longer than a threshold.
20. Smoke_Alarm (Topic: /Smarthome/Smoke_Alarm), which warns the users once smoke is detected.
21. Sound_System (Topic: /Smarthome/Sound_System), which turns the sound system on or off and controls its volume.
22. Swimming_Pool (Topic: /Smarthome/Swimming_Pool), which regulates the automated pool pump based on the reading of the water level of the pool.
23. TV_Sensor (Topic: /Smarthome/TV_Sensor), which turns the TV on or off.
24. Washer (Topic: /Smarthome/Washer), which turns the washer on or off and generates information about the current washing cycle and the remaining time.
25. Watering_System (Topic: /Smarthome/Watering_System), which regulates the irrigation system of the garden based on the chance of rain.

Fig. 1. Framework for the generation of SM-IoT dataset.

Fig. 2. Flowchart example of water_system simulation in the Node-RED.

In the IoT simulation environment, we designed a standard smart home device configuration. Initially, 25 IoT devices were mimicked to


Fig. 3. Flowchart of data flow in the Node-RED.

operate locally. MQTT messages were broadcast on a regular basis from all users to brokers in the test environment. As shown in Fig. 3, the connections permitted us to mimic normal IoT traffic because the MQTT brokers operated as mediators, linking smart devices to smartphone applications.

3.1.1. Descriptions of SM-IoT tables
The Entity-Relationship Diagram (ERD) of the proposed SM-IoT dataset is shown in Fig. 4 and was designed using Microsoft SQL Server. As shown in the figure, there are eleven tables in the SM-IoT dataset which contain all the following entities of the SM and IoT relationships.

• 'Users_Data': consists of the generated biographical data of SM users, such as name, date of birth, gender, and age. Other personal information, such as relationship status, family information, preferences, and personality traits, mimics a user's data on real SM platforms. This table is dynamically designed to store users' and their friends' data; for example, a user could be a friend of others or vice versa. For every user x_i in the dataset, the user's age is randomly generated within predefined upper and lower bounds. This age is used to define the corresponding date of birth and, together with the randomly generated gender type, helps in suggesting a proper name for the user. The zodiac sign is determined from the generated date of birth, while the personality traits and the user's usage pattern are assigned well-considered values based on a predefined spread probability and the user's other data. The pseudo-code for generating the users' data is shown in Algorithm 1.
• 'Users_Relations': stores the links and relationships between users and their friends and families on an SM platform. Every user has well-generated values for its relationships with other users, which could be zero (i.e., no), single or multiple, based on the other users' data, such as status, family members, close friends, and children, as well as the users' personality traits. From this table, the degree to which users and their friends share preferences, and possibly have the same ones, can be estimated.
• 'Users_Preferences': shows users' and their friends' preferences that reflect real life, such as interest in nature, art, and athletics, with two more sub-classifications providing a total of two classification levels for each user; for example, athletic interests are sub-classified as winter, racing, team sports, etc. Each sub-class is further categorized more specifically, for example, team sports like soccer, basketball, hockey, volleyball, etc. Because an individual's gender, age, and personality traits are related to his/her differences, they have been the subject of various empirical investigations [41,42], which highlight the controlling role of these features in shaping those differences. In our framework, inspired by these studies, the preferences of user x_i are assigned based on his/her gender, age, and personality traits, with more weight given to any preference supported by a smart device attached to x_i or by a followed page or group, as explained in Algorithm 2.
• 'Pref_Lookup_Table' (preferences): includes the names of preferences logged in the Users_Preferences table and serves as a lookup table containing 'pref_id' and its corresponding 'pref_name'.
• 'Subpref_Lookup_Table' (sub-preferences): includes the sub-classifications of the main preferences logged in the Users_Preferences table and is a lookup table containing 'Subpref_id' and its corresponding 'Subpref_name'.
• 'Pages_and_Groups': presents generic information regarding both the groups and pages SM users might follow. It includes the page category indicating the main interest of a page/group, page email and website, signup date, an integer (public) indicating whether it is a public or private group, and its number of followers.
• 'Users_Followpages': states the links between users and the pages they follow in an SM network. Every user has generated values for links to followed pages/groups, which could be zero (i.e., no), single or multiple. These links are 70% based on the user's preferences and 30% randomly generated. From this table, it can be estimated to what degree users and their friends are interested in specific preferences.
• 'Users_SmartDevices': displays the links between SM users and smart IoT devices. Every user has randomly generated values for the smart IoT devices installed at home, which could be zero (i.e., no), single or multiple. From this table, more about a user's interest in the new technologies of smart IoT devices can be determined.
• 'SDevice_Lookup_Table': includes the names of smart devices and serves as a lookup table containing 'SDevice_id' and its corresponding 'SDevice_name'.


Fig. 4. Diagram of proposed entity relationships in SM-IoT dataset.

• 'SDevice_Status': shows detailed information about the smart IoT devices linked to a specific user, including the times at which notifications about the devices were sent, the status of the devices at that time, and one or more messages from the devices updating the user on their status. Other information related to specific smart IoT devices is also provided, such as the air quality index attribute, which holds values obtained from the air quality sensor.

3.1.2. Relationships of SM-IoT tables
There are two specific types of relationships among all the tables of the proposed SM-IoT dataset, one-to-one and one-to-many, in which each table participates in only one at any given time, as shown in Fig. 4.

• One-to-many relationship: exists between Users_Data and Users_Relations, as a user may have zero (i.e., no), one or many relationships. The same type exists between Users_Data and Users_Preferences, Users_Data and Users_Followpages, and Users_Data and Users_SmartDevices. Also, Users_SmartDevices and SDevice_Status are related to each other in a one-to-many relationship as, for every user, there can be zero or more smart devices, for each of which there are multiple data and messages captured at different times.
• One-to-one relationship: exists between Users_Preferences and Pref_Lookup_Table and between Users_Preferences and Subpref_Lookup_Table, as there are no multiple names for the same preference or sub-preference. There is one more relationship of this type between Users_Followpages and Pages_and_Groups, as there are no multiple names for the same page, while Users_SmartDevices and SDevice_Lookup_Table are related to one another in a one-to-one relationship as, for the same smart device, there is a single name.

3.1.3. Specifications of features
In the SM-IoT dataset, the data of the users' profiles are tree-structured, as features are constructed by inheriting new paths from those trees, as explained in sub-section 4.1. For the other IoT dataset, the data of 25 smart home IoT devices, each with several observations over a certain period, are obtained from Node-RED and attached to their corresponding users based on the relationships among the users and smart devices. Table 1 shows detailed specifications of the SM-IoT dataset's features.

3.2. Data pre-processing

Since several real-world datasets contain missing or incomplete data, they should be cleaned and filtered to improve ML model classification accuracy [43]. SM datasets often do not allocate values for all their useful features; for example, a user's preferences might be intentionally left blank. This is exhibited in the SM-IoT dataset as, to mimic reality, a large amount of data for which users did not assign feature values is generated. While privacy-aware users have the option of making their profiles' data private and accessible only by listed friends, Table 2 shows that nearly 49% of our users' data was maintained as 'public'.

Typically, conventional supervised ML models are expected to be trained using a large amount of labeled training data to achieve elevated accuracy. Therefore, the performance of ML algorithms can be improved only by filtering datasets to handle missing values and remove irrelevant features [43]. To deal with incomplete data values in the SM-IoT dataset, an imputation technique consistent with that outlined in [43] is used, in which missing values are replaced with statistical measures, such as feature vector means and medians, or static values such as zero. Here, missing values are replaced with zeros to ensure that the data is not skewed towards central data points. Furthermore, since this dataset has different types of features, including both categorical and numeric values, there is a need to pre-process this data before conducting the data analysis phase, as well as to preserve the privacy of the users' data, as explained in the following.

• Feature mapping: As the SM-IoT dataset is comprised of more than only numeric data, a mapping function is necessary to


transform categorical features into numeric ones. For instance, the preferences and sub-preferences features are transformed to ordered numbers (as 1, 2, etc.). The complexity of such mapping is O(N), where N is the number of instances for each categorical feature.

Algorithm 1: Generate Users_Data

Input: PS (population size), F (number of features), upper and lower values of the features to be generated
Output: Dataset file with PS rows and F columns
Initialize: F_Type = [user_id, account_name, name_given, family_name, name_middle, gender, dob_year, dob_month, dob_day, age, sign_in_zodiac, ..., status],
    Personality_traits = [Honesty-Humility, Emotionality, ..., influence_viewer],
    SM_usage_pattern = [usage, average_per_day_conversation_start_with_family, ..., chance_of_liking_individual_family_post]
for i = 1 to PS do
    for j = 1 to F do
        age_{ij} ← random[min_age–max_age]
        dob_year_{ij} ← current_year − age_{ij}
        dob_month_{ij} ← random month within dob_year_{ij}
        dob_day_{ij} ← random day within dob_month_{ij}
        gender_{ij} ← randomly generated with equal probability
        status_{ij} ← randomly generated based on gender_{ij} and age_{ij}
        family_name_{ij} ← randomly generated based on status_{ij} and dob_year_{ij}
        name_given_{ij} ← generated based on gender_{ij}, age_{ij}, dob_year_{ij}
        chance of middle name (chance) ← set a random probability as 20% none, 70% 1 middle name, 5% 2 middle names, and 5% 3 middle names
        chance ← random[0%–100%]
        if chance ≤ 20% then
            name_middle_{ij} ← none
        else if chance > 20% and chance ≤ 90% then
            name_middle_{ij} ← randomly generate 1 middle name based on gender_{ij}, age_{ij} and name_given_{ij}
        else if chance > 90% and chance ≤ 95% then
            name_middle_{ij} ← randomly generate 2 middle names based on gender_{ij}, age_{ij}, name_given_{ij}
        else if chance > 95% then
            name_middle_{ij} ← randomly generate 3 middle names based on gender_{ij}, age_{ij}, name_given_{ij}
        end if
        if F_Type_j == account_name or gender then
            x_{ij} ← generated value based on F_Type[name_given and family_name]
        else if F_Type_j == age or sign_in_zodiac then
            x_{ij} ← generated value based on F_Type[dob_year, dob_month, dob_day]
        else if j ∈ Personality_traits then
            Personality_spread_{ij} ← set a random probability for Personality_traits_{ij}
            Personality_value_{ij} ← generate a random personality value based on Personality_spread_{ij}
            Personality_traits_{ij} ← Personality_value_{ij}
        else if j ∈ SM_usage_pattern then
            Personality_score_{ij} ← random[0.01–7.00]
            SM_usage_{ij} ← random[1–10]    % 1 means low usage, 5 is moderate and 10 is high usage
            SM_usage_pattern_{ij} ← Personality_score_{ij} × SM_usage_{ij}
        else
            x_{ij} ← randomly generated value according to F_Type_j
        end if
    end for
end for

Algorithm 2: Generate Users_Preferences

Input: Personality_traits, age, gender, Users_SmartDevices, Users_Followpages
Output: Users' preferences and sub-preferences
age_intervals ← A [<21], B [21–35], C [36–45], D [46–55], E [56–70], F [71–80], G [>80]
Define age_preferences, gender_preferences, personality_preferences by distributing the set of preferences based on psychology studies on the relationship between the age, gender, personality_traits, and the interests, respectively.
for i = 1 to users do
    u_age_i ← read user's age
    % Preferences based on age interval
    if u_age_i ≤ 20 then
        u_age_pref_i ← age_preferences_A
    else if u_age_i > 20 and u_age_i ≤ 35 then
        u_age_pref_i ← age_preferences_B
    else if u_age_i > 35 and u_age_i ≤ 45 then
        u_age_pref_i ← age_preferences_C
    else if u_age_i > 45 and u_age_i ≤ 55 then
        u_age_pref_i ← age_preferences_D
    else if u_age_i > 55 and u_age_i ≤ 70 then
        u_age_pref_i ← age_preferences_E
    else if u_age_i > 70 and u_age_i ≤ 80 then
        u_age_pref_i ← age_preferences_F
    else if u_age_i > 80 then
        u_age_pref_i ← age_preferences_G
    end if
    % Preferences based on gender
    u_gender_i ← read user's gender
    if u_gender_i is Male then
        u_gender_pref_i ← gender_preferences_M
    else
        u_gender_pref_i ← gender_preferences_F
    end if
    % Preferences based on personality traits
    Personality_traits_i ← read user's personalities
    % Find the maximum personality score, which mostly affects the behavior of the user
    u_personality_i ← maximum(Personality_traits_i)
    u_personality_pref_i ← personality_preferences based on the selected personality traits
    % Find the common preferences
    common_pref_i ← Intersect(u_age_pref_i, u_gender_pref_i, u_personality_pref_i)
    % Weight the user's usage of smart devices
    u_SDevice_i ← read smart devices attached to user i from Users_SmartDevices
    % Weight the user's followed pages and groups
    u_Page_i ← read pages and groups followed by user i from Users_Followpages
    supported_common_pref_i ← put more weight on each common_pref_i that is supported by a smart device and a followed page or group
end for
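As a rough illustration of Algorithms 1 and 2, the following Python sketch draws the age-dependent fields of one synthetic user, maps an age to its interval label, and intersects three preference sets before weighting the supported ones. It is not the authors' script: the function names, the default age bounds (13 to 90), the reference year, and the `boost` factor are assumptions made for this example. Only the interval boundaries and the 20/70/5/5 middle-name split are taken from the pseudocode.

```python
import calendar
import random

# Inclusive upper bounds of age intervals A-F in Algorithm 2; older ages fall in G.
AGE_INTERVALS = [(20, "A"), (35, "B"), (45, "C"), (55, "D"), (70, "E"), (80, "F")]


def age_interval(age):
    """Return the Algorithm 2 interval label (A-G) for an age."""
    for upper, label in AGE_INTERVALS:
        if age <= upper:
            return label
    return "G"


def middle_name_count(chance):
    """Map a draw in [0, 100] to 0-3 middle names (20% / 70% / 5% / 5%)."""
    if chance <= 20:
        return 0
    if chance <= 90:
        return 1
    if chance <= 95:
        return 2
    return 3


def generate_user(rng, min_age=13, max_age=90, current_year=2022):
    """Draw the age-dependent fields of one user (bounds and year are assumed)."""
    age = rng.randint(min_age, max_age)
    dob_year = current_year - age
    dob_month = rng.randint(1, 12)
    days_in_month = calendar.monthrange(dob_year, dob_month)[1]
    return {
        "age": age,
        "gender": rng.choice(["male", "female"]),  # equal probability
        "dob_year": dob_year,
        "dob_month": dob_month,
        "dob_day": rng.randint(1, days_in_month),
        "middle_names": middle_name_count(rng.uniform(0, 100)),
    }


def weighted_preferences(age_pref, gender_pref, personality_pref, supported, boost=2.0):
    """Intersect the three preference sets and up-weight the supported ones.

    `supported` stands for preferences backed by an attached smart device or a
    followed page/group; `boost` is a hypothetical weight, not a paper value.
    """
    common = set(age_pref) & set(gender_pref) & set(personality_pref)
    return {pref: (boost if pref in supported else 1.0) for pref in sorted(common)}
```

Passing a seeded `random.Random` instance keeps the generated users reproducible across runs, which is convenient when comparing dataset variants.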
• Privacy preservation mechanism (Data normalization): As the SM-IoT dataset generates significant volumes of data, there is a need to effectively achieve the principle of privacy preservation, keeping the private data secure while improving the data utility level and thus the performance of data applications such as data analytics. Applying a data normalization step significantly impacts privacy preservation processes, as it preserves sensitive information of the original data by using a new transformed/scaled shape. This phase helps ML models converge and achieve their goals, as scaling the data within a particular range eliminates bias from it without modulating the data's statistical properties or dramatically decreasing the utility level. Additionally, it transforms the numeric values of the feature mapping step to a particular scale without changing the variances of the original features and thus guarantees a certain level of privacy and maintains a significant level of utility. In our case, we scale the data between 0 and 1. A simple Min–max (MM) normalization approach, which enables a linear transformation of the original range of data features while preserving the statistical relationships among them [44], maps a value x_k of a feature v into x'_k in the range [0,1], as formulated in Eq. (1).

$$x'_k = \frac{x_k - \min(v)}{\max(v) - \min(v)} \tag{1}$$

where min(v) and max(v) refer to the minimum and maximum values of a feature, respectively. In this step, normalization is used for enforcing a certain level of data privacy on the SM-IoT data. It also transforms the values of the previous mapping function of the dataset to a specific scale, without changing the original features' variances.
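The two pre-processing steps can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names `encode_categories` and `min_max` are hypothetical, categories are numbered in order of first appearance, missing entries (`None`) are imputed as 0 in line with the zero-imputation choice described earlier, and a constant feature is mapped to 0.0 to avoid division by zero.

```python
def encode_categories(values):
    """Map categories to 1, 2, ... in order of first appearance (single O(N) pass).

    Missing entries (None) are imputed as 0.
    """
    codes = {}
    encoded = []
    for value in values:
        if value is None:
            encoded.append(0)  # zero imputation for missing entries
            continue
        if value not in codes:
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded, codes


def min_max(values):
    """Scale values to [0, 1] following Eq. (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no spread, so map it to 0.0.
        return [0.0] * len(values)
    return [(x - lo) / (hi - lo) for x in values]


# Example: a preference column with one missing entry.
prefs = ["art", None, "sports", "art", "nature"]
encoded, codes = encode_categories(prefs)
scaled = min_max(encoded)
```

Here `encoded` holds the ordinal codes and `scaled` holds the same column after Eq. (1), so the smallest code maps to 0 and the largest to 1.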


Table 1
Specifications of features of the dataset.
Feature Description of features
User_id String, users’ nicknames mapped to hashed numbers
Account_name String, user’s display name
Name_given String, user’s first name
Name_middle_1 String, user’s middle name
Name_middle_2 String, user’s middle name
Name_middle_3 String, user’s middle name
Name_family String, user’s family name
Signup_date Datetime, time at which user registered at the site
Last_login Datetime, the last time at which a user logged in
Public Bool, 1 — all friendships public
Gender Bool, 0 — male, 1 — female
Dob_year Integer, user’s year of birth
Dob_month Integer, user’s month of birth
Dob_day Integer, user’s day of birth
Age Integer, user’s age
Sign_in_zodiac String, user’s zodiac based on birth date
Status/0 String, user's relationship (married, unmarried, divorce)
Status/1 String, user’s relationship (unmarried, divorce)
Mood String, user’s mood representation
Spouse_id String, user’s spouse’s id
Personality/honesty-humility Integer, user’s honesty-humility degree
Personality/emotionality Integer, user's emotionality degree
Personality/eXtraversion Integer, user's eXtraversion degree
Personality/agreeableness Integer, user's agreeableness degree
Personality/conscientiousness Integer, user's conscientiousness degree
Personality/openness to experience Integer, user's openness to experience degree
Personality/influence_ideastarter Integer, user's influence_ideastarter degree
Personality/influence_amplifier Integer, user's influence_amplifier degree
Personality/influence_curator Integer, user’s influence_curator degree
Personality/influence_commentator Integer, user’s influence_commentator degree
Personality/influence_viewer Integer, user’s influence_viewer degree
Pref String, user’s preference
Subpref String, user’s sub-preference
Social_media_patterns/usage Integer, user’s SM usage rate
Average_per_day_conversation_start_with_family Integer, user’s conversation rate
Average_per_day_conversation_start_with_friend Integer, user’s conversation rate
Average_per_day_conversation_start_with_close_friend Integer, user’s conversation rate
Chance_of_liking_individual_close_friend_post Integer, user’s liking rate
Chance_of_liking_individual_family_post Integer, user’s liking rate
Parent/0 String, user’s parent’s id
Parent/1 String, user’s parent’s id
Father String, user’s father’s id
Mother String, user’s mother’s id
Friend_id String, user’s friend’s id
Friends_close_id String, user’s close friend’s id
Child_id String, user’s child’s id
Page_category String, the main interest of this page/group (education, communities, public figures, artists, etc.)
Page_id Integer, the id of user’s following page
SDevice_name String, names of user’s smart devices
Sensor_no Integer, no. of the sensor of the smart device
Timestamp Datetime, time at which smart device observed
SDevice_state String, the status of smart device (ON, OFF or idle)
SDevice_message String, a message from the smart device to the linked user with detailed updates of its states

Table 2
Numbers and percentages of users' privacy representations.
Public feature No. of users (%)
Privacy-aware 51
Non-privacy-aware 49

Table 3
Statistics of SM-IoT dataset.
Property Count
No. of users (U) 1,000,000
No. of edges (E) 30,000,000
Nodes in largest WCC/SCC 1,000,000
Edges in largest WCC/SCC 30,000,000
No. of categories of pages/groups 60
No. of pages/groups 1,000,000
No. of attributes/features 53
No. of preferences 7
No. of sub-preferences 55
No. of smart devices (S) 25
No. of sensors 1250

3.3. Data analysis

An analysis of the generated SM-IoT dataset is conducted. Table 3 shows its basic properties, such as the numbers of users U, edges E, feature representations, preference classifications, smart IoT devices S that may be linked to SM users, and representations of pages and groups users may follow.
Also, Table 4 and Fig. 5 show a breakdown of the SM-IoT users' genders based on age distributions, with the dominant group, aged above 50, comprising approximately 52% of SM users.


Table 4
Genders of users based on distribution by age.
Gender/age_range ≤20 21–30 31–40 41–50 >50 Total
All users (%) 7.73% 16.63% 16.10% 7.30% 52.26% 100.00%
Male (%) 3.87% 8.32% 8.05% 3.65% 26.11% 49.99%
Females (%) 3.87% 8.32% 8.05% 3.66% 26.12% 50.01%

Fig. 5. Breakdown of users’ genders based on distribution by age.

Fig. 6. Percentages of users following/not following pages and groups.

The percentages of users following and not following public pages and groups, depicted in Fig. 6, might indicate their interests and show an approximate 2% participation rate for every user.
Also, the involvement of IoT devices in the dataset is represented in Fig. 7, where the percentages of users with and without smart devices indicate that, of the total 90% of users interested in these devices, approximately 8% are linked to each user.
As the SM-IoT dataset contains simulated important SM dates, such as the last sign-in and sign-up dates for every user's generated data, these data can be weighed to gain a complete understanding of the user's involvement in SM. Other profile data are represented as 51 features, such as personality characteristics, age, gender, name, social preferences, lifestyle habits, and the data, status, and messages of IoT devices. Based on this information, detailed analyses of users' preferences and recommendations can be segmented by, for example, age, gender, and personality traits.

4. Comparison of SM-IoT and other datasets

The SM-IoT dataset has a variety of properties that collectively distinguish it from other existing datasets, three of which are particularly significant: (1) the data are generated using a well-simulated framework to produce exact representative models of the SM users with the help of the Python script and Node-RED; (2) SM's user-centric property includes all the data that might be related to those users, such as IoT data, which provide a full representation of all the users' interests and activities; and (3) it incorporates demographic, relational and highly indicative features of users so as not to overload the dataset with irrelevant/redundant ones.
These properties are highlighted and their qualities and restrictions are discussed in the next sub-section. While the utility of each feature certainly depends on the purpose of the dataset, how this dataset is useful for future research on unexplored areas is emphasized, and


considering its properties, is compared with two well-studied ones in [26,27].

Fig. 7. Percentages of users using/not using smart IoT devices.

4.1. Data properties

4.1.1. Generation property
By directly generating data, the effects of an interviewer and his/her questions as well as the labor-intensive overload of completing surveys are avoided [23], which can eliminate the necessity for high processing power, storage capacity, and access rate, and reduce other sources of weighted error that may be associated with sampling/crawling research studies [28]. At the same time, the Python script provides a standardized users' data template that facilitates data pre-processing, analysis, and comparison. While data generation frequently offers the above advantages, its primary trade-off is the great difficulty of obtaining personal and relational data; for instance, preferences are not products of users' entry values but are interpreted from their other relatively important data and activities, which have percentages of between 23.05% and 94.73%. Also, friendships on SM certainly mean different things to different users. Such issues should be taken into consideration when generating SM data.

4.1.2. User-centric property
As this dataset is user-centric, it includes all the data related to users as IoT data, which produce dynamic representations of users' information for predicting their behaviors in real-time, and can only be applied using the data collected regarding the association between smart IoT devices and the relevant social network. This results in a complete representative model of users for developing a dataset that can serve as a benchmark for e-commerce applications to send appropriate advertisements to individuals/users. As shown in Fig. 7, because IoT devices are linked to 90% of the SM users and can be tracked over time and compared with other data, our dataset can be used to provide additional insight on this period of an SM's life-cycle from an IoT perspective. The disadvantage of including many IoT data is that they are not representative of all the SM users, as 10% of them do not use IoT devices while the other 90% use different combinations of the 25 available devices, each of which has distinct features related to its status.

4.1.3. Feature construction property
SM platforms are, by nature, very rich in features and data [3]. For some analysis purposes, numerous relevant features and even a few which are suspected to be particularly redundant can be identified [4]. Any extreme features irrelevant to the dataset's purpose may result in a high-dimensional overloaded dataset, which may affect the analysis results. Therefore, in the SM-IoT dataset, a feature construction procedure, which transforms datasets into reduced ones containing important features obtained through inferring or creating relevant features, is used to enhance understanding and improve the performance of data analysis.
In this dataset, the users' profiles are tree-structured, as features are constructed by inheriting new paths in trees, as shown in Fig. 8. While this property adds an overload to the dataset's generation, as it is necessary to determine the weight of each feature before adding it to the dataset, it enables a highly feature-considered dataset to be produced without any additional processing power or storage capacity.

4.2. Data comparison

In Table 5, a comparative analysis of the SM-IoT, Pokec [26], Renren [27] and ARAS [34] datasets is presented. It consists of ten parameters, that is, the number of nodes (users), number of edges, number of features, an average of profiles filled, type of data generation and its output format, and percentage of IoT data representations with the number of the smart devices and sensors, of each dataset. It can be observed that the SM-IoT dataset has a different collection technique for IoT data representations, which ultimately reflects complete models of users' representations.
Overall, the conducted comparisons indicate a highly feature-considered, user-representative property of our SM-IoT dataset in both SM and Social IoT data, as the users' data in SM-IoT are tree-structured. In our dataset, unlike the other aforementioned datasets, the features are added by inheriting previously weighted paths (features) in trees to avoid any extreme features irrelevant to the data purpose. Also, as this dataset is user-centric, IoT devices are linked to 90% of the SM users so that they can be monitored over time. This results in a complete representative model of users and develops a dataset that can serve as a benchmark for e-commerce applications to send appropriate advertisements to users. For instance, preferences, which are not products of users' entry values, can be interpreted from their other relatively important features and activities in SM-IoT data with a percentage of up to 94.73%.

4.3. Statistics and machine learning methods

This section describes the evaluation of the consistency of SM-IoT data while training a classifier using statistical measures and machine/deep learning models.

4.3.1. Statistical analysis technique
The Pearson Correlation Coefficient [45] is used to determine the linear relationship between the SM-IoT dataset's features. Its value ranges over [−1, 1], with the magnitude indicating the degree of correlation between two features and the sign indicating whether the correlation is positive or negative.
To measure the correlation coefficient between our dataset's features, we developed R code to rank the strengths of the dataset's features within a range of [−1, 1]. After computing the correlation coefficient matrix, we calculate the correlation average for each feature to determine the most associated features that can be used to enhance the performance of ML models. In Fig. 9, the correlation coefficient average of our SM-IoT dataset's features varies within the range of −1 to 1, with a positive sign and an overall average of 0.50358. Based on the highest correlations, only the representative features are employed to predict users' preferences, while other irrelevant features are neglected, which improves the quality and processing time of data analysis.

4.3.2. Machine and deep learning analysis models
To assess the accuracy/quality of the proposed dataset when utilized to train a classifier, we employed four standard classification models in our experiments. These include Gradient Boosting (GB), Random Forest (RF), Naive Bayes (NB) [46] and Feed Forward (FF) learning models [47], which are briefly described as follows.


Fig. 8. Construction of features of SM-IoT dataset.

Table 5
Comparison of SM-IoT, Pokec, Renren and ARAS.
Parameter Pokec Renren ARAS SM-IoT
No. of nodes 1,637,068 19,567 4 1,000,000
No. of edges 30,622,564 4,500,410 NG 30,000,000
No. of features 58 15 27 53
No. of pages/groups NG 2778 NG 1,000,000
Average of profile filling 40.33% 46% NG 84%
Data generation technique Crawling Crawling Monitoring Simulating
Format of data collected Txt files Csv files NG Csv files
IoT representation No No 100% 90%
No. of Smart Devices NG NG 13 25
No. of sensors NG NG 40 1250

NG: Not Given.

• GB: combines distinct and weak hypotheses. It is an iterative algorithm that creates a highly accurate prediction rule by combining parameterized functions with poor results. This GB classifier had the following parameters: a learning rate of 0.01, 4-fold cross-validation, and a maximum depth of 5. Similar settings were chosen for the compared datasets. These settings were adjusted empirically to obtain the GB model's best results. The GB classifier was initially trained using default parameters; however, it was later found that increasing the tree parameters, especially for the preferences prediction task, increased the training time. Regarding the number of folds, we found that a higher number of folds resulted in a loss of accuracy.
• RF: is an ensemble algorithm that produces several Decision Trees (DT). It is a robust algorithm with high classification accuracy since, rather than building a single tree, it divides a training set into a few subsets, builds a tree for each subset, and combines their decisions. It is less prone to overfitting and provides a more generalized solution than DT. As with the aforementioned classifier, the RF parameters were chosen through experimentation. The RF model was trained with 4-fold cross-validation and 3 stopping rounds. Also, the specificity of the model was shown to improve when the number of trees was increased to 15.
• NB: is a family of simple probabilistic classifiers based on the Bayes theorem and data point independence assumptions. It is highly scalable, as maximum-likelihood training can be accomplished by evaluating a closed-form expression in linear time rather than by the costly iterative approximation used by many other classifiers. It is much quicker than KNN due to KNN's real-time execution. The NB classifier was trained with the same number of folds and a Laplace parameter of 3.
• FF: is also known as the multi-layer perceptron. This classifier is mostly used for supervised ML tasks where the target function is already known. Such networks are extremely important for practicing ML and form the basis of many commercial applications; areas such as computer vision and language processing were greatly affected by their presence. The main goal of a feed-forward network is to approximate some function. The parameters were selected by experimenting, as with the aforementioned classifiers. Higher numbers of epochs affected the model's performance, so we experimented with lower parameter values. Thus, the model was trained for 10 epochs with a binary classification network consisting of a 15-neuron input layer and a single output neuron.

For comparison purposes, two classification tasks have been conducted. The first task was to classify the gender of SM users, while the second one was to predict the pref_name and hobby in Pokec and


Fig. 9. Correlation plots of SM-IoT dataset.
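As an illustration of the closed-form, linear-time training attributed to the NB classifier above, the following minimal Python sketch fits a categorical Naive Bayes model in a single counting pass with Laplace smoothing. It is not the R implementation used in our experiments: the function names, the toy feature values and labels, and the reading of the Laplace setting as an additive smoothing constant of 3 are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, laplace=3.0):
    """Fit a categorical Naive Bayes model with one counting pass over the data.

    `laplace` is an additive smoothing constant (assumed to be > 0).
    """
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            feature_counts[(j, label)][value] += 1
            values[j].add(value)
    return {
        "class_counts": class_counts,
        "feature_counts": feature_counts,
        "values": values,
        "laplace": laplace,
        "n": len(labels),
    }


def predict_nb(model, row):
    """Return the class with the highest smoothed log-posterior for `row`."""
    best_label, best_score = None, -math.inf
    for label, class_count in model["class_counts"].items():
        score = math.log(class_count / model["n"])  # log prior
        for j, value in enumerate(row):
            n_values = len(model["values"][j])
            numerator = model["feature_counts"][(j, label)][value] + model["laplace"]
            denominator = class_count + model["laplace"] * n_values
            score += math.log(numerator / denominator)  # smoothed log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (hobby, device) -> class label.
rows = [
    ("sports", "watch"), ("sports", "phone"), ("music", "phone"),
    ("music", "speaker"), ("sports", "watch"), ("music", "speaker"),
]
labels = ["A", "A", "B", "B", "A", "B"]

model = train_nb(rows, labels)
print(predict_nb(model, ("sports", "watch")))
print(predict_nb(model, ("music", "tablet")))  # "tablet" never appears in training
```

Training touches each feature value once, which is the linear-time property referred to above; the smoothing constant keeps the likelihood of an unseen value such as "tablet" above zero.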

Table 6
Learning models evaluation metrics.
Metric         Equation
Accuracy       $\frac{TP + TN}{TP + FP + TN + FN}$   (2)
Precision      $\frac{TP}{TP + FP}$   (3)
Recall rate    $\frac{TP}{TP + FN}$   (4)

Renren datasets. In this regard, the four models are trained on the SM-IoT, Pokec and Renren datasets only. The ARAS dataset is mainly used to predict the activity that is occurring in the smart home and being observed by the ambient sensors. It includes features related to the activities of daily living captured for the four volunteers in this dataset. Therefore, we could not include the ARAS dataset in this comparison. To measure the performance of the trained models, the following standard metrics from the literature have been applied: accuracy, precision, and recall rate. True positive (TP), false positive (FP), true negative (TN), and false negative (FN) are the four terms used to define these metrics, as shown in Table 6.
All experiments were executed on a PC with an i7 processor and 16 GB of memory, and our code is implemented in R.
The performance of the models on classifying the 𝑔𝑒𝑛𝑑𝑒𝑟 of the SM users is given in Table 7. These results show that, for every model, training on our dataset gives higher accuracy, precision, and recall than training on the other datasets. Specifically, the highest accuracy for classifying users’ 𝑔𝑒𝑛𝑑𝑒𝑟 attribute (83.20%) is obtained using GB on the SM-IoT dataset. GB also achieves the best precision of 80.24% when trained on the SM-IoT dataset, while the best recall value of 99.67% is recorded by the NB model for the same dataset. However, the overall performance varies with the trained model and the dataset used. The RF classifier on the Pokec dataset shows better accuracy and precision than the other models in classifying the 𝑔𝑒𝑛𝑑𝑒𝑟 attribute. The highest accuracy using the Renren dataset is achieved with the GB model. In terms of precision and recall, the performance of the NB and FF models using the Pokec and Renren datasets is nearly the same. This shows that compared with the Pokec and Renren datasets, our dataset features are more relevant for classifying the 𝑔𝑒𝑛𝑑𝑒𝑟 attribute.
Moreover, the performance of the models in predicting the preferences (𝑝𝑟𝑒𝑓_𝑛𝑎𝑚𝑒) of SM users is shown in Table 8. In the same way, we compare the performances of the four models on the aforementioned datasets. As can be seen in Table 8, the GB model achieves the best accuracy of 98.77% and precision of 98.98% when trained on the SM-IoT dataset, while the best recall of 100% is achieved by NB using the SM-IoT dataset. In the case of the preferences prediction task, training on the Renren dataset achieves higher accuracy and precision than training on the Pokec dataset for the GB and NB models, whereas RF scores higher on both measures on the Pokec dataset and FF accuracy is nearly identical on the two. In terms of recall rate, the GB, RF, and FF models yield a better rate on the Pokec dataset than on the Renren dataset.
The results in Tables 7 and 8 confirm that our dataset yields noticeable performance improvements. At the level of the trained models, the GB model with the SM-IoT dataset performs better than the other models in accuracy and precision for predicting the 𝑔𝑒𝑛𝑑𝑒𝑟 as well as the preferences (𝑝𝑟𝑒𝑓_𝑛𝑎𝑚𝑒) of the SM users.
Compared with the Pokec and Renren datasets, the results showed that our dataset can be used to precisely recognize users’ behaviors in smart environments and enhance the predictions of SM users’ preferences to obtain a robust recommendation. Furthermore, like any real-world user dataset, which is of great importance in the development of novel ML [34], our dataset shows the capability to evaluate the credibility of ML and the influence of privacy preservation models. While this advantage adds an overhead to the data’s generation, it enables a highly user-representative dataset to be produced without extreme processing power or storage capacity.

5. Conclusion

This work has proposed a new SM-IoT dataset based on two popular SM platforms, Facebook and a social IoT one. It is the first dataset of its type to be made publicly available and is intended to appeal to researchers with diverse interests, including those keen on investigating the relationships between IoT and genuine spaces. This work has simulated and generated this data. This work has outlined the


Table 7
Accuracy, Precision and Recall comparisons for classifying users’ gender.
        Accuracy (%)              Precision (%)             Recall (%)
Model   SM-IoT   Pokec   Renren   SM-IoT   Pokec   Renren   SM-IoT   Pokec   Renren
GB      83.20    69.21   78.68    80.24    63.12   74.82    87.71    82.19   86.13
RF      76.34    72.77   59.68    70.62    67.87   55.85    85.86    82.33   80.10
NB      58.82    50.04   49.87    53.51    49.76   49.65    99.67    89.42   87.64
FF      61.68    51.15   49.94    54.47    50.28   49.70    98.23    94.65   93.35

Table 8
Accuracy, Precision and Recall comparisons for predicting users’ preferences.
        Accuracy (%)              Precision (%)             Recall (%)
Model   SM-IoT   Pokec   Renren   SM-IoT   Pokec   Renren   SM-IoT   Pokec   Renren
GB      98.77    28.77   43.00    98.98    32.92   43.04    99.36    88.53   77.95
RF      91.74    52.47   41.85    93.71    47.53   44.50    95.29    66.70   57.10
NB      74.14    18.21   30.56    74.14    21.28   34.17    100      32.65   65.65
FF      77.88    39.64   39.60    78.97    30.56   34.98    95.72    76.81   67.24
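The metric definitions in Table 6 (Eqs. (2)-(4)) can be sketched in a few lines of Python. This is a minimal illustration, not the R code used for the experiments; the function names and the toy label vectors are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN and FN for a binary task with the given positive label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall as fractions in [0, 1] (Eqs. (2)-(4))."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical ground truth and predictions, for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg", "pos", "neg", "pos"]
y_pred = ["pos", "pos", "neg", "neg", "pos", "pos", "neg", "pos"]

counts = confusion_counts(y_true, y_pred, positive="pos")
print(counts, classification_metrics(*counts))

# A degenerate predictor that labels every sample as positive.
all_positive = ["pos"] * len(y_true)
print(classification_metrics(*confusion_counts(y_true, all_positive, "pos")))
```

The second call shows a property of the formulas worth keeping in mind when reading recall values close to 100%: if every sample is predicted positive, then FN = TN = 0, so recall is 1 while accuracy and precision both equal the share of positive samples. Table 8 does not list the underlying counts, so this is an observation about the formulas rather than a statement about any particular model.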

data collection method used and described the data. Then, the focal features of the dataset are discussed, with its possibilities and limitations apropos other types of SM data highlighted. These features exemplify the kinds of research that can be handled using such a dataset and provide a point of departure for upcoming studies. Finally, three metrics were used to compare the dataset’s validity: accuracy, precision, and recall. The statistical comparisons showed that the proposed dataset can significantly improve the performance of the trained models. At the level of the models, the GB model trained on the SM-IoT dataset had the highest accuracy and precision, while the NB model trained on the SM-IoT dataset had the highest recall rate. We believe that by further optimizing these models, better results could be obtained.
In the future, this dataset will be used in validating various machine learning-based cybersecurity problems, such as intrusion detection, privacy preservation, fake news detection, and recommender systems in social media and its IoT systems.

6. Dataset access

Our dataset is maintained at UNSW and can be located at the URL http://handle.unsw.edu.au/1959.4/resource/collection/resdatac_1112/1.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] N. Choudhury, World wide web and its journey from web 1.0 to web 4.0, Int. J. Comput. Sci. Inf. Technol. 5 (6) (2014) 8096–8100.
[2] T.J. Berners-Lee, The world-wide web, Comput. Netw. ISDN Syst. 25 (4–5) (1992) 454–459.
[3] A. Gupta, A. Deokar, L. Iyer, R. Sharda, D. Schrader, Big data & analytics for societal impact: Recent research and trends, Inf. Syst. Front. 20 (2) (2018) 185–194.
[4] A.A. Alalwan, Investigating the impact of social media advertising features on customer purchase intention, Int. J. Inf. Manage. 42 (2018) 65–77.
[5] W. Wang, H. Yin, X. Du, W. Hua, Y. Li, Q.V.H. Nguyen, Online user representation learning across heterogeneous social networks, in: Proceedings of The 42nd International ACM SIGIR Conference On Research and Development in Information Retrieval, pp. 545–554.
[6] M.A. Ferrag, L. Maglaras, A. Ahmim, Privacy-preserving schemes for ad hoc social networks: A survey, IEEE Commun. Surv. Tutor. 19 (4) (2017) 3015–3045.
[7] Z. Cai, Z. He, X. Guan, Y. Li, Collective data-sanitization for preventing sensitive information inference attacks in social networks, IEEE Trans. Dependable Secur. Comput. 15 (4) (2016) 577–590.
[8] J. Leskovec, A. Krevl, SNAP datasets, SNAP Datasets: Stanf. Large Netw. Dataset Collect. (2014) http://snap.stanford.edu/data.
[9] Z. He, Z. Cai, J. Yu, Latent-data privacy preserving with customized data utility for social network data, IEEE Trans. Veh. Technol. 67 (1) (2017) 665–673.
[10] D. Mouris, N.G. Tsoutsos, M. Maniatakos, Terminator suite: Benchmarking privacy-preserving architectures, IEEE Comput. Archit. Lett. 17 (2) (2018) 122–125.
[11] M. Siddula, L. Li, Y. Li, An empirical study on the privacy preservation of online social networks, IEEE Access 6 (2018) 19912–19922.
[12] S. Stieglitz, M. Mirbabaie, B. Ross, C. Neuberger, Social media analytics–challenges in topic discovery, data collection, and data preparation, Int. J. Inf. Manage. 39 (2018) 156–168.
[13] 2020, 24 June, URL https://www.wordstream.com/blog/ws/2020/04/14/twitter-statistics, 2020.
[14] 2020, 24 June, URL https://www.oberlo.com/blog/facebook-statistics, 2020.
[15] J. Peng, A. Agarwal, K. Hosanagar, R. Iyengar, Network overlap and content sharing on social media platforms, J. Market. Res. 55 (4) (2018) 571–585.
[16] M. Roopa, S. Pattar, R. Buyya, K.R. Venugopal, S. Iyengar, L. Patnaik, Social internet of things (SIoT): Foundations, thrust areas, systematic review and future directions, Comput. Commun. (2019).
[17] P. Geetha, C. Naikodi, S.L.N. Setty, Design of big data privacy framework—A balancing act, in: Advances in Data Sciences, Security and Applications, Springer, 2020, pp. 253–265.
[18] C. Ju, J. Wang, C. Xu, A novel application recommendation method combining social relationship and trust relationship for future internet of things, Multimed. Tools Appl. 78 (21) (2019) 29867–29880.
[19] M. Seliem, K. Elgazzar, K. Khalil, Towards privacy preserving iot environments: A survey, Wirel. Commun. Mobile Comput. 2018 (2018).
[20] J. Zhang, J. Sun, R. Zhang, Y. Zhang, X. Hu, Privacy-preserving social media data outsourcing, in: IEEE INFOCOM 2018-IEEE Conference On Computer Communications, IEEE, pp. 1106–1114.
[21] D. Yang, B. Qu, P. Cudré-Mauroux, Privacy-preserving social media data publishing for personalized ranking-based recommendation, IEEE Trans. Knowl. Data Eng. 31 (3) (2018) 507–520.
[22] K.K. Mohbey, S. Kumar, V. Koolwal, Advertisement prediction in social media environment using big data framework, in: Multimedia Big Data Computing For IoT Applications, Springer, 2020, pp. 323–341.
[23] Y. Wang, Data preparation for social network mining and analysis, 2014.
[24] M. Lytras, Visvizi, Big data research for social science and social impact, Sustainability 12 (2020).
[25] S. Gella, M. Lewis, M. Rohrbach, A dataset for telling the stories of social media videos, in: Proceedings of The 2018 Conference On Empirical Methods in Natural Language Processing, pp. 968–974.
[26] Y. Ding, S. Yan, Y. Zhang, W. Dai, L. Dong, Predicting the attributes of social network users using a graph-based machine learning method, Comput. Commun. 73 (2016) 3–11.
[27] L. Takac, M. Zabovsky, Data analysis in public social networks, in: International Scientific Conference and International Workshop Present Day Trends of Innovations, Vol. 1.


[28] K. Areekijseree, R. Laishram, S. Soundarajan, Guidelines for online network crawling: A study of data collection approaches and network properties, in: Proceedings of The 10th ACM Conference On Web Science, pp. 57–66.
[29] W. Xiong, J. Wu, H. Wang, V. Kulkarni, M. Yu, S. Chang, X. Guo, W.Y. Wang, TWEETQA: A social media focused question answering dataset, 2019, arXiv preprint arXiv:1907.06292.
[30] S. Salim, B. Turnbull, N. Moustafa, A Blockchain-Enabled Explainable Federated Learning for Securing Internet-of-Things-Based Social Media 3.0 Networks, IEEE Trans. Comput. Soc. Syst. (2021) 1–17, http://dx.doi.org/10.1109/TCSS.2021.3134463.
[31] D. Van Bruwaene, Q. Huang, D. Inkpen, A multi-platform dataset for detecting cyberbullying in social media, Lang. Resour. Eval. 54 (4) (2020) 851–874.
[32] S.S. Intille, K. Larson, J. Beaudin, J. Nawyn, E.M. Tapia, P. Kaushik, A living laboratory for the design and evaluation of ubiquitous computing technologies, in: CHI’05 Extended Abstracts On Human Factors in Computing Systems, pp. 1941–1944.
[33] M.K. O’Brien, N. Shawen, C.K. Mummidisetty, S. Kaur, X. Bo, C. Poellabauer, K. Kording, A. Jayaraman, Activity recognition for persons with stroke using mobile phone technology: toward improved performance in a home setting, J. Med. Internet Res. 19 (5) (2017) e184.
[34] H. Alemdar, H. Ertan, O.D. Incel, C. Ersoy, ARAS human activity datasets in multiple homes with multiple residents, in: 2013 7th International Conference On Pervasive Computing Technologies For Healthcare and Workshops, IEEE, pp. 232–235.
[35] N. Koroniotis, N. Moustafa, E. Sitnikova, B. Turnbull, Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot dataset, Future Gener. Comput. Syst. 100 (2019) 779–796.
[36] Y. Al-Hadhrami, F.K. Hussain, Real time dataset generation framework for intrusion detection systems in IoT, Future Gener. Comput. Syst. 108 (2020) 414–423.
[37] 2021, 27 January, URL https://nodered.org/.
[38] C.C. Aggarwal, S.Y. Philip, Privacy-Preserving Data Mining: Models and Algorithms, Springer Science & Business Media, 2008.
[39] R. Mendes, J.P. Vilela, Privacy-preserving data mining: methods, metrics, and applications, IEEE Access 5 (2017) 10562–10582.
[40] L. Zhang, X. Zhu, X. Han, J. Ma, Differentially privacy-preserving social IoT, in: 2019 11th International Conference On Wireless Communications and Signal Processing (WCSP), IEEE, pp. 1–6.
[41] A. Ion, C.D. Nye, D. Iliescu, Age and gender differences in the variability of vocational interests, J. Career Assess. 27 (1) (2019) 97–113.
[42] P. Adamopoulos, A. Ghose, V. Todri, The impact of user personality traits on word of mouth: Text-mining social media platforms, Inf. Syst. Res. 29 (3) (2018) 612–640.
[43] A. Aleryani, W. Wang, B. De La Iglesia, Dealing with missing data and uncertainty in the context of data mining, in: International Conference On Hybrid Artificial Intelligence Systems, Springer, pp. 289–301.
[44] C. Saranya, G. Manikandan, A study on normalization techniques for privacy preserving data mining, Int. J. Eng. Technol. (IJET) 5 (3) (2013) 2701–2704.
[45] H. Akoglu, User’s guide to correlation coefficients, Turk. J. Emerg. Med. 18 (3) (2018) 91–93.
[46] S. Salim, N. Moustafa, B. Turnbull, Privacy-encoding models for preserving utility of machine learning algorithms in social media, in: 2020 IEEE 19th International Conference On Trust, Security and Privacy in Computing and Communications (TrustCom), IEEE, 2020, pp. 856–863.
[47] T.T. Truong, D. Dinh-Cong, J. Lee, T. Nguyen-Thoi, An effective deep feedforward neural networks (DFNN) method for damage identification of truss structures using noisy incomplete modal data, J. Build. Eng. 30 (2020) 101244.

Sara Salim is a Ph.D. student at the School of Engineering and IT (SEIT), University of New South Wales (UNSW) at Canberra. She received her bachelor’s in computer science in 2012 from the Faculty of Computer and Information, Zagazig University, Egypt and her Masters in Optimization and Operation Research Applications in 2016 from the Faculty of Computer and Information, Menoufia University, Egypt. She enrolled in UNSW Canberra to initiate her Ph.D. studies in the field of Privacy Preservation with a particular interest in Social Networks and the Internet of Things. Her research interests include Cyber Security, Privacy Preservation and Artificial Intelligence techniques.

Benjamin Turnbull is an Associate Professor at the University of New South Wales at the Australian Defence Force, Canberra. He is the Program Director for UNSW Online Masters (Cyber), and Honours Coordinator (Cyber). His research focuses on the intersection of cyber-security, simulation, scenario-based learning, and the security of heterogeneous devices and future networks. He is also a Certified Information Systems Security Professional (CISSP). Ben has been working in digital forensics, network security, and simulation for 17 years. His previous work as a defence research scientist saw him develop and deploy new technologies to multiple clients, globally.

Nour Moustafa is Coordinator of Postgraduate Cyber Discipline & Leader of Intelligent Security at the School of Engineering & Information Technology (SEIT), University of New South Wales (UNSW)’s UNSW Canberra, Australia. He was a Post-doctoral Fellow at UNSW Canberra from June 2017 till December 2018. He received his Ph.D. degree in the field of Cyber Security from UNSW Canberra in 2017. He obtained his Bachelor and Master degrees of Computer Science in 2009 and 2014, respectively, from the Faculty of Computer and Information, Helwan University, Egypt. His areas of interest include Cyber Security, in particular, Network Security, IoT security, intrusion detection systems, statistics, Deep learning and machine learning techniques. He has several research grants totalling over AUD 1.2 million. He has been awarded the prestigious 2020 Australian Spitfire Memorial Defence Fellowship award. He is also a Senior IEEE Member, ACM Distinguished Speaker, as well as CSCRC and Spitfire Fellow. He has served his academic community as the guest associate editor of IEEE transactions journals, including IEEE Transactions on Industrial Informatics, IEEE IoT Journal, as well as the journals of IEEE Access, Future Internet and Information Security Journal: A Global Perspective. He has also served in leadership roles at over seven conferences, including as vice-chair, session chair, Technical Program Committee (TPC) member and proceedings chair, among them 2020–2021 IEEE TrustCom and the 2020 33rd Australasian Joint Conference on Artificial Intelligence.
