1 Introduction

Nowadays, people increasingly rely on location-based services (LBS), such as Google Maps and AMap. As an essential technique for LBS systems, spatial keyword search [8, 17, 31, 33, 41, 44, 45, 47, 56, 57, 60] has become an important research topic, and a great deal of effort [22, 38, 42, 43, 46] has been devoted to making location-based services efficient and of high quality.

In recent years, extensive studies have been carried out on processing Spatial Keyword Queries (SKQ) [10, 14, 28, 34, 36, 46, 48, 50, 51]. Pioneering works [18, 39, 52] mainly deal with SKQs with boolean or approximate keyword matching. Though some classical indexing structures, such as the IR-tree [18], S2I [39] and MHR-tree [52], successfully reduce the computational and I/O costs, they are unable to retrieve objects whose keywords are synonyms of, but literally different from, the query keywords. More recently, some semantic-aware SKQ approaches [15, 19, 37] have been proposed to capture the semantic meanings of keywords learned via natural language processing tools (e.g. word embedding [1]), so that more meaningful results can be returned. However, since the semantic vectors are high-dimensional, these indices, e.g. the NIQ-tree proposed in [37], show an unsatisfactory pruning effect in query processing due to the ‘curse of dimensionality’. More effective solutions are thus needed to facilitate both effective and efficient querying.

Therefore, this paper aims to design a more effective indexing and querying framework for semantic-aware SKQs. This is a challenging problem for two main reasons: firstly, the semantic understanding of keywords requires large-scale latent features (e.g. parsed by word embedding), with each phrase represented by a high-dimensional vector [1, 26], leading not only to more memory space in storage, but also to higher I/O cost in data access. Secondly, each spatial object is associated with large-scale latent features in semantic space, which severely deteriorates the pruning effect of multi-dimensional spatial keyword search algorithms [37]. This explains the poor performance of existing semantic-aware indexing methods when the dimensionality (for describing semantics) is large. A key problem is thus how to index high-dimensional data in relatively low dimensions while still guaranteeing accurate semantic-aware SKQ results.

Motivated by the effectiveness of pivot-based metric indexes [3, 35, 49], we propose a novel hybrid indexing structure called the S2R-tree to address all the above issues. The S2R-tree adopts a hierarchical structure to seamlessly integrate information in the spatial and semantic domains. A pivot-based space mapping mechanism is designed, so that high-dimensional vectors can be transformed to low-dimensional coordinates in which the data tend to have high variance. In this way, the S2R-tree indexes the semantic vectors using the low-dimensional pivot-based coordinates rather than the original vectors, so that large dead space can be avoided. A query processing algorithm on top of the S2R-tree is further proposed to prune the search space based on theoretical bounds. We also exploit pivot-based principles of data partitioning and filtering, so as to prune in the original semantic space while using only the low-dimensional S2R-tree. The contributions of this paper can be summarized as:

  1. We design a set of pivot-based principles for space transformation and partitioning, so that high-dimensional semantic vectors can be rationally mapped to low-dimensional coordinates with high data variance.

  2. We propose a novel hybrid indexing structure, the S2R-tree, which not only integrates spatial and semantic information seamlessly, but also represents semantic information by pivot-based coordinates in low dimensions, so that the pruning effect can be improved significantly.

  3. We design an efficient and accurate SKQ processing algorithm on top of the S2R-tree, which can greatly prune the high-dimensional search space based on theoretical bounds.

  4. We conduct an extensive experimental analysis and make comparisons with baseline algorithms, demonstrating the efficiency of our method.

The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 formalizes the problem. Section 4 introduces the baseline algorithm and our solution. Section 5 presents the experimental results. Finally, Section 6 concludes the paper and discusses future work.

2 Related work

Spatial keyword query is widely used in location-based devices and services; it takes both the spatial and textual relevance between a query and the objects in the dataset into consideration. A range of contributions have been made in the literature studying different aspects of spatio-textual querying [5,6,7, 9, 11, 12, 16, 23,24,25, 30, 32, 53]. Some efforts support the Spatial Keyword Boolean Query (SKBQ) [18, 20], which requires exact keyword matches and may therefore return few or no results. To overcome this problem, much work has been done to support the Spatial Keyword Approximate Query (SKAQ) [29, 39, 52], which ensures that the query results are no longer sensitive to spelling errors and conventional spelling differences. Numerous works study spatial keyword queries on road networks, collective queries [4, 21], diversified querying [54], why-not questions [13, 58], interactive querying [48, 59] and so on.

Many classical indexing structures have been proposed to support spatial keyword queries, such as the IR-tree [18], MHR-tree [52], S2I [39], RCA [55], etc. All of them lack semantics for objects and queries, which makes them unable to retrieve objects whose keywords are synonyms of, but literally different from, the query keywords. For this purpose, we need a novel structure that can accurately return objects with both high spatial and high semantic similarity to the query.

In [37], Qian et al. propose an iDistance-based hybrid indexing structure, called the NIQ-tree, which is the first to take semantic relevance into consideration. The NIQ-tree combines a Quadtree for the spatial domain, iDistance for the semantic domain, and inverted lists for keywords. It is organized hierarchically, so that pruning in all domains can be performed simultaneously. However, since the semantics of a user’s query are represented by high-dimensional vectors obtained via word embedding [1] or LDA [2], query processing becomes inefficient due to the ‘curse of dimensionality’, which severely deteriorates the pruning effect of multi-dimensional spatial keyword search algorithms. More effective methods that can enhance the efficiency of high-dimensional data management are thus highly sought after.

As pivot-based indexing methods [3, 35, 49] have proven successful in indexing high-dimensional metric data, we adopt them to address the inefficiency of pruning in high-dimensional space. A variant of iDistance [27], the M-Index [35], adopts a pivot-based Voronoi space partitioning technique in which each node is represented by a pivot permutation; based on the similarity of permutations, the M-Index improves pruning efficiency and thus search performance. Traina et al. [49] propose a pivot-based metric indexing method that greatly improves indexing performance by referring to a set of pivots when building the index. However, as far as we know, existing semantic-aware spatial keyword indexes cannot meet the accuracy and efficiency requirements of semantic-based retrieval for SKQ. In this paper, we propose a novel hybrid indexing structure which not only integrates spatial and semantic information seamlessly, but also represents semantic information by pivot-based coordinates in low dimensions, so that the pruning effect can be improved significantly.

3 Problem definition

In this section, we briefly introduce some basic definitions and the statement of our problem. Table 1 summarizes the notations used throughout the paper.

Table 1 Summary of notations

Definition 1 (Spatial object)

A spatial object is an object in geographical space, such as a restaurant, a cinema or a library. Each spatial object o is modelled as o = (ε,μ), where ε is a geographical location consisting of a longitude and a latitude in two-dimensional space, and μ is a set of keywords describing o.

Definition 2 (Spatial keyword query)

A spatial keyword query is a user’s input. Each query is formalized as q = (ε,μ), where ε is the position of q, represented by a longitude and a latitude in two-dimensional geographical space, and μ is a set of words representing the user’s interests, such as ‘Sichuan restaurant’. We use the term query for short in the rest of this paper.

Example 1

Figure 1 shows an example with a query q and seven spatial objects, where each spatial object is a point-of-interest (POI) with a location and a set of keywords. The query aims to find a POI w.r.t. ‘coffee shop’ close to the query location. The keywords of objects o2 (‘STARBUCKS’) and o3 (‘ENO COFFEE’) are semantically consistent with the query keywords, while the other objects obviously have low semantic relevance to the query. We thus use the term semantic vector to model the semantics of any given set of keywords.

Fig. 1 An example of spatial keyword query

Definition 3 (Semantic vector)

A semantic vector describes the semantic information of a set of keywords as a d-dimensional vector γ = (γ1,γ2,⋯,γd) in latent semantic space, where each component γi represents a latent feature capturing useful syntactic and semantic properties of the words. We use γq and γo to denote the semantic vectors of a query q and a spatial object o, respectively, in the rest of this paper.

Next we discuss the ranking of spatial objects.

Definition 4 (Distance function)

A distance function is used to calculate the distance between an object and a query, so that the objects best meeting the user’s requirements can be returned. In the spatial domain, we use the Euclidean distance, denoted dist(q.ε,o.ε), to measure spatial distance, and normalize it to [0,1] with a sigmoid function, as shown in Eq. 1 [40].

$$ \mathcal{S}\mathcal{D}(q,o)=\frac{2}{1+e^{-dist(q.\varepsilon ,o.\varepsilon )}}-1 $$
(1)

In the semantic domain, we also utilize the Euclidean distance to calculate the semantic distance \(\mathcal {TD}(q,o)\). In order to let \(\mathcal {S}\mathcal {D}(q,o)\) and \(\mathcal {TD}(q,o)\) contribute equally to the result, we normalize it to [0,1] via Eq. 2.

$$ \mathcal{TD}(q,o)=\frac{\sqrt{\textstyle{\sum}_{i=0}^{d-1}(\gamma_{qi}-\gamma_{oi})^{2}}}{\max\mathcal {TD}}\\ $$
(2)

where γ∗i (∗ ∈ {q,o}) is the i-th component of ∗’s semantic vector γ∗; d is the dimensionality of the semantic vectors; \( \max \mathcal {TD} \) is the maximum possible pairwise distance in semantic space.

Considering both spatial and semantic influences on the final result, we use a weighting factor λ to balance the two distances. The distance function \(\mathcal {D}(q,o)\) [10] is then computed as follows:

$$ \mathcal{D}(q,o)=\lambda \times \mathcal{S}\mathcal{D}(q,o)+(1-\lambda )\times \mathcal{TD}(q,o) $$
(3)
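The three distances above can be sketched directly in Python (a minimal sketch; the function names and the λ default are our own, and \(\max \mathcal {TD}\) is assumed to be precomputed over the dataset):

```python
import math

def spatial_dist(q_loc, o_loc):
    """Euclidean distance between two 2D locations."""
    return math.hypot(q_loc[0] - o_loc[0], q_loc[1] - o_loc[1])

def SD(q_loc, o_loc):
    """Normalized spatial distance via a sigmoid (Eq. 1)."""
    return 2.0 / (1.0 + math.exp(-spatial_dist(q_loc, o_loc))) - 1.0

def TD(gamma_q, gamma_o, max_td):
    """Normalized semantic distance in [0, 1] (Eq. 2)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(gamma_q, gamma_o)))
    return d / max_td

def D(q_loc, o_loc, gamma_q, gamma_o, max_td, lam=0.5):
    """Combined distance (Eq. 3): lam balances spatial vs. semantic."""
    return lam * SD(q_loc, o_loc) + (1 - lam) * TD(gamma_q, gamma_o, max_td)
```

Note that λ = 0 makes the ranking purely semantic and λ = 1 purely spatial, matching the roles of the two terms in Eq. 3.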

Example 2

We continue with the example in Fig. 1, where the spatial distances between the query location and the spatial objects are also given. According to Eq. 2, we can derive the semantic distances between q and o2 and o3, i.e. \(\mathcal {TD}(q,o_{2})=0.141\) and \(\mathcal {TD}(q,o_{3})=0.063\); both are thus highly relevant to the query keywords in semantics. When considering spatial distance and semantics simultaneously, however, the distance from q to o2, \(\mathcal {D}(q,o_{2})=0.1955\), is much smaller than that to o3, \(\mathcal {D}(q,o_{3})=0.3315\), so o2 has a higher priority to be returned. This coincides with our observation that o3 would incur much more travel cost than o2.

Problem formalization

Given a set of spatial objects O, a spatial keyword query q and an integer k, the problem of this paper is to return the top-k objects in O that are most similar to q according to Eq. 3.

4 Main algorithms

In this section, we first introduce the baseline algorithm and then propose our solution.

4.1 NIQ*-tree based algorithm

Towards the SKQ problem defined in Section 3, the NIQ-tree [37] can be modified into a two-layered structure, called the NIQ*-tree, which combines the spatial and semantic domains hierarchically.

As shown in Fig. 2, we adopt a spatial-first method because of the better pruning effect in the spatial domain due to its 2D nature. The NIQ*-tree utilizes a Quadtree to index objects according to their spatial closeness, since the Quadtree is more stable when a new object is inserted. For each leaf node of the Quadtree, all objects are further organized by an iDistance index in the semantic layer, such that objects are grouped and managed by their semantic coherence; a B+-tree is then constructed to organize these objects according to their key values, calculated as follows:

$$ key=i \times r+ \mathcal{TD}(p_{i},o) $$
(4)

where i is the identifier of the cluster Ci; r is a constant mapping the objects in Ci into the range [i × r,(i + 1) × r); pi is the reference point of Ci. The basic form of a node in the NIQ*-tree is \(n=(p,\mathcal {R},o,r)\), where p is the pointer(s) to its child node(s); \( \mathcal {R} \) is the minimum bounding rectangle (MBR for short) covering all objects contained by n in the spatial domain; o and r are the center point and radius of a semantic hyper-sphere covering all objects contained by n in the semantic domain.
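The iDistance key of Eq. 4 can be sketched as below (a sketch; the function name is ours, and r is assumed to be chosen larger than any intra-cluster semantic distance so that the key ranges of different clusters do not overlap):

```python
def idistance_key(i, r, td_pi_o):
    """Eq. 4: map object o of cluster C_i, lying at semantic distance
    td_pi_o from the reference point p_i, into the range [i*r, (i+1)*r)."""
    assert 0.0 <= td_pi_o < r, "r must exceed intra-cluster distances"
    return i * r + td_pi_o
```

Because each cluster occupies a disjoint key interval, a one-dimensional B+-tree over these keys supports range scans restricted to a single cluster.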

Fig. 2 NIQ*-tree

During the search process, we use a priority queue to traverse the spatial-layer nodes according to the best match distance \(\mathcal D_{bm}\) to a query q, which is calculated as follows:

$$ \mathcal D_{bm}(q,n)=\lambda \times \min \mathcal {D}_{s}(q,n)+(1-\lambda )\times \min \mathcal {D_{T}}(q,n) $$
(5)

and

$$ \min \mathcal D_{T}(q,n)=\left\{\begin{array}{ll} 0 & \mathcal{TD}(q,n.o)\leq n.r\\ \mathcal {TD}(q,n.o)-n.r & \mathcal {TD}(q,n.o) > n.r \end{array}\right.; $$

where λ is a weighting factor between 0 and 1; \(\min \mathcal {D}_{s}(q,n)\) is the minimum spatial distance between the query location q.ε and the spatial MBR of n; \(\min \mathcal {D_{T}}(q,n)\) is the minimum possible semantic distance between q and n; \(\mathcal {TD}(q,n.o)\) is the semantic distance between q and n.o. Note that \( \mathcal D_{bm}(q,n) \) is a lower bound distance to q for all unvisited objects.

During query processing, we dynamically maintain the top-k minimum distances over all scanned objects and keep the k-th minimum distance as an upper bound. If the node fetched from the priority queue is a non-leaf node, we add all its child nodes to the queue; otherwise, we access the objects in the iDistance nodes of n that intersect the search space, and then update the top-k results and the upper bound. The search terminates when the lower bound is no less than the upper bound, indicating that the remaining unvisited objects cannot be better than the current top-k results.

4.2 S2R-tree based algorithm

Since pruning in semantic space is usually inefficient due to the high dimensionality, a more rational solution is to transform the semantic vectors into a low-dimensional space for indexing. To this end, motivated by the OmniR-tree [49], this section proposes a more effective hybrid indexing structure, called the Spatial-Semantic R-tree (S2R-tree for short), to improve querying efficiency without sacrificing accuracy.

Index structure

The S2R-tree adopts a two-layer structure, as shown in Fig. 3. Similar to the NIQ*-tree, it is a spatial-first structure: an R-tree is used to group objects according to their geographical coordinates, due to its superior pruning effect in 2D space.

Fig. 3 An example of S2R-tree

In the semantic domain, motivated by [49], the S2R-tree adopts a pivot-based indexing method to avoid the ‘large dead space’ phenomenon of iDistance. The basic idea is to map high-dimensional semantic vectors to a low-dimensional space, so that more effective indexing can be achieved. More specifically, we select a set P = (p0,p1,⋯,pm− 1) of m pivots (known as reference points) in the original space, so that a d-dimensional semantic vector γ = (γ0,γ1,⋯,γd− 1) can be transformed to an m-dimensional pivot-based coordinate \( \gamma ^{P} = ({\gamma ^{P}_{0}}, {\gamma ^{P}_{1}}, ..., \gamma ^{P}_{m-1}) \).

Definition 5 (Pivot-based coordinates)

Let P = (p0,p1,⋯,pm− 1) be a set of pivots in d-dimensional space. The pivot-based coordinate of a semantic vector is a vector in the pivot-based system subject to P. Given a semantic vector γ in the original d-dimensional space, it is projected to an m-dimensional pivot-based coordinate \( \gamma ^{P} = ({\gamma ^{P}_{0}}, {\gamma ^{P}_{1}},\cdots , \gamma ^{P}_{m-1}) \), where each component \( {\gamma ^{P}_{i}} (0 \leq i < m) \) is the semantic distance between γ and pivot pi in the original semantic space, i.e. \({\gamma ^{P}_{i}} = \mathcal {TD}(\gamma ,p_{i})\), calculated as in Eq. 2.

Example 3

As shown in Fig. 3b, o1,o2,⋯,o7 are the objects in Example 1, and we choose objects o0 and o4 as pivots p0 and p1 respectively, i.e. P = {o0,o4}. The semantic vectors (e.g. of o0 and o4) obtained by Word2Vec are d-dimensional, while in the pivot-based system each object is represented by an m-dimensional coordinate, where each dimension is the semantic distance between the object and a pivot in the original space. Assuming the distances between o2 and the pivots in the original space are \(\mathcal {TD}(o_{2},o_{0})=0.5\) and \(\mathcal {TD}(o_{2},o_{4})=0.95\), the pivot-based coordinate of o2 is \({o_{2}^{P}}=(\mathcal {TD}(o_{2},o_{0}), \mathcal {TD}(o_{2},o_{4}))=(0.5,0.95)\).
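The mapping of Definition 5 can be sketched as follows (helper names are ours; plain Euclidean distance stands in for \(\mathcal {TD}\), omitting the \(\max \mathcal {TD}\) normalization of Eq. 2 for brevity):

```python
import math

def td(u, v):
    """Semantic distance in the original d-dimensional space
    (unnormalized Euclidean distance, cf. Eq. 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pivot_coords(gamma, pivots):
    """Definition 5: the i-th pivot-based coordinate of gamma is its
    distance to pivot p_i, mapping d dimensions down to m = len(pivots)."""
    return [td(gamma, p) for p in pivots]
```

The result is an m-dimensional point, regardless of the original dimensionality d, which is what the low-dimensional R-tree in the semantic layer indexes.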


By projecting the semantic vectors to pivot-based coordinates, the data is transformed from the high-dimensional space to a lower-dimensional one (i.e. from d to m dimensions). Note that the data variance in the low-dimensional representation is expected to be maximized, and it is affected by the choice of pivots in P. We apply the HF algorithm [49] to generate P, in which pivots are located on the border of the dataset so as to best distinguish objects in semantic space collectively. We finally utilize an R-tree to index the pivot-based coordinates in m dimensions, which is expected to be effective given that m is always small.

Next, we discuss how to reasonably generate the pivot set P, so as to maintain the properties of the original space in the low-dimensional space. We aim to find a set of m pivots with the greatest dissimilarity in the original space, such that they can best distinguish objects in the semantic space collectively. Since finding such a set of pivots is known to be NP-complete, we apply a heuristic method [49]. As shown in Algorithm 2, we first initialize P with the pair of points (i.e. p0 and p1) with the maximum pairwise distance in the dataset. Afterwards, we keep adding semantic vectors into P until m pivots are selected (i.e. |P| = m). In each round, the semantic vector with the maximum distance to all partially selected pivots in P is chosen as the new pivot (Lines 2–4), since this ensures high data variance in the low-dimensional space. Specifically, we use the following formula to measure the distance between a semantic vector and all pivots in P:

$$ \mathcal{MG}(o,P)=\sum\limits_{p \in P} \mathcal {TD}(o,p) $$
(6)

where \(\mathcal {TD}(o,p)\) is the semantic distance between o and p.
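The greedy selection above can be sketched as follows (a sketch of Algorithm 2 under our own signature; `dist` is any metric playing the role of \(\mathcal {TD}\), and the farthest-pair initialization is done by brute force):

```python
def select_pivots(vectors, m, dist):
    """Heuristic pivot selection: start from the pair with the maximum
    pairwise distance, then greedily add the vector maximizing the
    MG score of Eq. 6 (sum of distances to the pivots chosen so far)."""
    # Initialization: the farthest pair becomes p0 and p1.
    _, i0, j0 = max((dist(u, v), i, j)
                    for i, u in enumerate(vectors)
                    for j, v in enumerate(vectors) if i < j)
    pivots, chosen = [vectors[i0], vectors[j0]], {i0, j0}
    while len(pivots) < m:
        # Pick the unchosen vector with the largest MG(o, P).
        _, idx = max((sum(dist(vectors[i], p) for p in pivots), i)
                     for i in range(len(vectors)) if i not in chosen)
        pivots.append(vectors[idx])
        chosen.add(idx)
    return pivots
```

The brute-force initialization is quadratic in the dataset size; the HF algorithm [49] uses cheaper seeding in practice, but the greedy MG-based growth step is the same idea.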


The format of each node of the S2R-tree is \( n = (ptr, \mathcal {R}, \mathcal {B}) \), where ptr is the pointer(s) to the child node(s); \( \mathcal {R} \) is the spatial MBR covering all objects contained by n geographically; \( \mathcal {B} \) is the minimum bounding box (MBB for short) covering the (m-dimensional) pivot-based coordinates of the objects contained by n. The MBB of each node in the spatial layer is computed from those of its child nodes in a bottom-up way. In this way, we derive the hybrid index S2R-tree, which integrates spatial and low-dimensional semantic information seamlessly.

Query processing

On top of the S2R-tree, the SKQ processing is carried out in spatial and semantic spaces collaboratively. Algorithm 1 shows the details.

We use an upper bound \(\mathcal {D}_{ub}\) and a lower bound \(\mathcal {D}_{lb}\) to delimit the current search scope (Lines 1–2). Query processing proceeds as follows. Starting from the root node, we traverse the S2R-tree using a priority queue Q (Line 4), and keep visiting the nodes popped from Q (Lines 5–16). In this procedure, we dynamically maintain the top-k candidates among the objects seen so far in C, initialized as empty (Line 3): for any object o ∈ C and any visited object o′∉C, it holds that \(\mathcal D(q, o) \leq \mathcal D(q, o^{\prime }) \). We also keep track of the distance of the k-th object in C, which is an upper bound \(\mathcal D_{ub} = \max \left \{ \mathcal D(q,o)|o \in C \right \} \) on the distance of the final results (since C contains the best objects found so far). Every time a node n is popped from the queue Q, we perform:

  • if n is a leaf node of the S2R-tree, we visit all objects belonging to n (Lines 7–12). More specifically, we calculate their actual distances to the query. If \(\mathcal D(q, o) <\mathcal D_{ub} \), meaning that the object o is superior to at least one of the top-k candidates in C, then o replaces the worst object in C, and \(\mathcal D_{ub} \) is updated accordingly.

  • if n is a non-leaf node, we simply insert all its child nodes into Q (Lines 14–15), in which all nodes are ranked by their minimum possible distance \( \min \mathcal {D}(q, n) \) to the query q, defined as:

    $$ \min \mathcal{D}(q,n)=\lambda \times \min \mathcal {SD}(q,n)+(1-\lambda )\times \min \mathcal {TD}(q,n) $$
    (7)

where λ is a weighting factor between 0 and 1; \(\min \mathcal {SD}(q,n)\) is the minimum spatial distance from q to node n, and \(\min \mathcal {TD}(q,n)\) is the minimum possible semantic distance from q to any object contained in node n, which can be calculated as follows (see Lemma 2 for the detailed proof):

$$ \min \mathcal {TD}(q,n)=\max (\min \mathcal {TD}_{i}(q,n),0\leq i<m ) $$
(8)

where \(\min \mathcal {TD}_{i}(q,n)\) is the minimum possible semantic distance between q and n w.r.t. (the bound derived from) the pivot pi, such that (see Lemma 1 for the detailed proof)

$$ \min \mathcal {TD}_{i}(q,n)=\left\{\begin{array}{ll} \min(n.\mathcal{B}_{i})-\mathcal {TD}(q,p_{i})\text{ ,}& \mathcal {TD}(q,p_{i})\leq \min(n.\mathcal{B}_{i})\\ 0\text{ ,}&\min(n.\mathcal{B}_{i})< \mathcal {TD}(q,p_{i})\leq \max(n.\mathcal{B}_{i})\\ \mathcal {TD}(q,p_{i})-\max(n.\mathcal{B}_{i})\text{ ,} & \mathcal {TD}(q,p_{i})>\max(n.\mathcal{B}_{i}) \end{array}\right. $$
(9)

where \(n.\mathcal {B}_{i}\) refers to the i-th dimension of node n’s MBB in the low-dimensional space, and \(\min (n.\mathcal {B}_{i})\) and \(\max (n.\mathcal {B}_{i})\) are its minimum and maximum values respectively, denoting the minimum and maximum distances between any object o ∈ n and pivot pi in the original space, i.e. \( \min (n.\mathcal {B}_{i}) \leq \mathcal {TD}(o, p_{i}) \leq \max (n.\mathcal {B}_{i}) \).
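The bounds of Eqs. 8 and 9 translate into a few lines of code (function and argument names are ours; `mbb` is the list of per-dimension (min, max) extents of \(n.\mathcal {B}\), and `td_q_pivots[i]` holds \(\mathcal {TD}(q,p_{i})\)):

```python
def min_td_i(td_q_pi, b_min, b_max):
    """Per-pivot lower bound of Eq. 9, derived from the triangle
    inequality, where [b_min, b_max] is the i-th extent of the MBB."""
    if td_q_pi <= b_min:
        return b_min - td_q_pi
    if td_q_pi <= b_max:
        return 0.0
    return td_q_pi - b_max

def min_td(td_q_pivots, mbb):
    """Node-level lower bound of Eq. 8: each per-pivot bound is valid,
    so the tightest (maximum) of them is still a valid lower bound."""
    return max(min_td_i(t, lo, hi)
               for t, (lo, hi) in zip(td_q_pivots, mbb))
```

With the numbers of Example 4 (\(\mathcal {TD}(q,p_{1})=0.45\), extent [0.12, 0.3]), `min_td_i` yields 0.15.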

Before proving that \( \min \mathcal {D}(q, n) \) is the minimum possible distance from q to any object o belonging to node n, we first use Lemma 1 to establish a bound on the distance between the query q and the objects in n, derived from a pivot pi in the original space using the S2R-tree.

Lemma 1

Given a query q and a node n of the S2R-tree, when considering the pivot \(p_{i} \in P\), the distance \( \min \mathcal {TD}_{i}(q,n)\) in Eq. 9 is a lower bound on the semantic distance between q and any object o in n with respect to pi, i.e. \( \forall o \in n: \mathcal {TD}(q, o) \geq \min \mathcal {TD}_{i}(q,n) \).

Proof

As stated above, given a node n of the S2R-tree, \(n.\mathcal {B}_{i}\) refers to the i-th dimension of node n’s MBB in the low-dimensional space, and \(\min (n.\mathcal {B}_{i})\) and \(\max (n.\mathcal {B}_{i})\) denote the minimum and maximum distances, respectively, between any object o ∈ n and pivot pi in the original space.

In accordance with the triangle inequality, for any given points of o, q and pi in the original space, we have:

$$ \mathcal{TD}(o,p_{i})\leq \mathcal{TD}(q,o)+\mathcal{TD}(q,p_{i}) $$
(10)
  • if \(\mathcal {TD}(q,p_{i})\leq \min (n.\mathcal {B}_{i})\) and according to Eq. 10, we have:

    $$ \left.\begin{array}{ll} \mathcal{TD}(q,o) \geq \mathcal{TD}(o,p_{i})-\mathcal{TD}(q,p_{i})\\ \\ \forall o \in n,\mathcal {TD}(o,p_{i})\geq \min(n.\mathcal{B}_{i}) \end{array}\right\}\Rightarrow \mathcal{TD}(q,o) \geq \min(n.\mathcal{B}_{i})-\mathcal{TD}(q,p_{i}) $$

    Thus,

    $$ \min \mathcal {TD}_{i}(q,n)=\min(n.\mathcal{B}_{i})-\mathcal{TD}(q,p_{i}) $$
    (11)
  • if \(\min (n.\mathcal {B}_{i}) < \mathcal {TD}(q,p_{i})\leq \max (n.\mathcal {B}_{i})\), q may lie within the semantic range of n w.r.t. pi, i.e.

    $$ \min \mathcal {TD}_{i}(q,n)=0 $$
    (12)
  • if \(\mathcal {TD}(q,p_{i}) > \max (n.\mathcal {B}_{i})\) and according to Eq. 10, we have:

    $$ \left.\begin{array}{ll} \mathcal {TD}(q,o) \geq \mathcal {TD}(q,p_{i})-\mathcal {TD}(o,p_{i})\\ \\ \forall o \in n,\mathcal {TD}(o,p_{i})\leq \max(n.\mathcal{B}_{i}) \end{array}\right\}\Rightarrow \mathcal {TD}(q,o) \geq \mathcal {TD}(q,p_{i})-\max(n.\mathcal{B}_{i}) $$

    Thus,

    $$ \min \mathcal {TD}_{i}(q,n)=\mathcal{TD}(q,p_{i})-\max(n.\mathcal{B}_{i}) $$
    (13)

By combining Eqs. 11, 12 and 13, Eq. 9 is proven. □

Example 4

According to Fig. 3, we have \(\min (R_{5}.\mathcal {B}_{1})=0.12\) and \(\max (R_{5}.\mathcal {B}_{1})=0.3\). Assume that the semantic distance between q and p1 is \( \mathcal {TD}(q,p_{1}) =0.45\), which is greater than \( \max (R_{5}.\mathcal {B}_{1}) \). Thus \(\forall o \in R_{5},\mathcal {TD}(q,o) \geq \mathcal {TD}(q,p_{1})-\max (R_{5}.\mathcal {B}_{1})=0.15\), i.e. \(\min \mathcal {TD}_{1}(q,R_{5})=0.15\).

Based on Lemma 1 and considering multiple pivots, Eq. 8 gives a lower bound on the semantic distance between the query q and any object in node n (in the original semantic space), as stated in Lemma 2.

Lemma 2

Given a query q and a node n of the S2R-tree, the minimum possible semantic distance \(\min \mathcal {TD}(q,n)\) from q to all objects in n can be calculated as in Eq. 8.

Proof

According to Lemma 1, for each pivot pi we have a minimum possible semantic distance \( \min \mathcal {TD}_{i}(q,n) (0\leq i<m)\) between q and all objects in n. Since each of these is a valid lower bound, their maximum is also a valid lower bound, which gives Eq. 8. Thus Lemma 2 is proven. □

Based on Lemmas 1 and 2, \( \min \mathcal {D}(q, n) \) is the minimum possible distance between the query q and node n, as shown in Theorem 1.

Theorem 1

Given a query q and a node n of the S2R-tree, \( \min \mathcal {D}(q, n) \) in Eq. 7 is the minimum possible distance between q and any object o ∈ n, i.e. \( \min \mathcal {D}(q, n) \leq \mathcal{D}(q, o) \).

Proof

According to Lemma 2, in the semantic domain, the minimum possible semantic distance between q and n can be calculated using Eq. 8, i.e.

$$ \forall o \in n, \min \mathcal {TD}(q,n) \leq \mathcal {TD}(q,o) $$
(14)

and in spatial domain, we have:

$$ \forall o \in n, \min \mathcal {SD}(q,n) \leq \mathcal {SD}(q,o) $$
(15)

Combining Eqs. 14 and 15, and considering the joint impact of the spatial and semantic distances via the weighting factor λ, we have:

$$ \begin{array}{@{}rcl@{}} \min \mathcal{D}(q, n) &=& \lambda \times \min \mathcal {SD}(q,n)+(1- \lambda) \times \min \mathcal {TD}(q,n)\\ & \leq& \lambda \times \mathcal {SD}(q,o)+(1- \lambda) \times \mathcal {TD}(q,o)=\mathcal {D}(q,o) \end{array} $$

i.e. \(\min \mathcal {D}(q, n) \leq \mathcal{D}(q, o)\). Theorem 1 is proven. □

Note that all information needed for \( \min \mathcal {D}(q, n) \) can be obtained from the S2R-tree structure. That means the pivot-based indexing in the low-dimensional space allows us to accurately prune objects in the original (high-dimensional) semantic space. Every time a node n is popped from Q, we can also derive a lower bound distance for all unvisited objects, i.e. \(\mathcal D_{lb} = \min \mathcal {D}(q, n) \), according to Theorem 2.

Theorem 2

Every time a node n is popped out from Q, for any unvisited object o, it must hold that \(\min \mathcal{D}(q,n) \leq \mathcal{D}(q,o)\).

Proof

Let n′ be any unvisited node in Q other than n. As \(\min \mathcal {D}(q,n^{\prime })\) is the minimum possible distance from the query q to n′ and n is the top element of Q, we have:

$$ \left.\begin{array}{ll} \forall o \in n^{\prime},\min \mathcal{D}(q,n^{\prime})\leq \mathcal{D}(q,o)\\ \\ \min \mathcal{D}(q,n)\leq \min \mathcal{D}(q,n^{\prime}) \end{array}\right\}\Rightarrow \min \mathcal{D}(q,n)\leq \mathcal{D}(q,o) $$
(16)

i.e. \(\min \mathcal {D}(q,n)\) is the lower bound distance \(\mathcal D_{lb}\) to q for all unvisited objects. Thus, Theorem 2 can be proven. □

Once the condition \(\mathcal D_{lb}\geq \mathcal D_{ub} \) is met (Lines 15–16), we can safely terminate the SKQ search process, since no unvisited object can replace the current top-k candidates found in C so far. Otherwise, the search stops when Q is empty, meaning that all nodes of the S2R-tree have been traversed. Finally, we return all spatial objects in the candidate set C to the user (Line 18).
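The whole best-first procedure can be sketched as follows (a sketch of Algorithm 1; the node attributes `.is_leaf`, `.children` and `.objects`, and the callables `min_dist` (the lower bound of Eq. 7) and `obj_dist` (the exact distance of Eq. 3), are our own interface assumptions):

```python
import heapq
import itertools
import math

def topk_search(root, q, k, min_dist, obj_dist):
    """Best-first top-k search over an S2R-tree-like structure.
    Returns the k (distance, object) pairs with the smallest distances."""
    tick = itertools.count()                 # tie-breaker for both heaps
    queue = [(min_dist(q, root), next(tick), root)]
    best = []                                # max-heap: (-D, tick, object)
    d_ub = math.inf                          # distance of current k-th result
    while queue:
        d_lb, _, n = heapq.heappop(queue)
        if d_lb >= d_ub:                     # Theorem 2: safe termination
            break
        if n.is_leaf:
            for o in n.objects:              # Lines 7-12: exact distances
                d = obj_dist(q, o)
                if len(best) < k:
                    heapq.heappush(best, (-d, next(tick), o))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, next(tick), o))
                if len(best) == k:
                    d_ub = -best[0][0]       # update the upper bound
        else:
            for c in n.children:             # Lines 14-15: enqueue children
                heapq.heappush(queue, (min_dist(q, c), next(tick), c))
    return sorted(((-nd, o) for nd, _, o in best), key=lambda t: t[0])
```

Any admissible `min_dist` (one never exceeding the true distance of an object in the subtree, as guaranteed by Theorem 1) preserves correctness; a tighter bound only triggers the termination test earlier.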

Assume that n objects are stored in an S2R-tree and the top-k objects most similar to the query are to be returned. When k is small, the time complexity is determined by the fanout f of the S2R-tree, i.e. O(log_f n); when k is large (i.e. close to n), the time complexity is O(n). The space complexity of the S2R-tree is O(n).

5 Experiments

In this section, we conduct several experiments to compare our proposed algorithms and present the results.

5.1 Experiment settings

On one hand, we use a real dataset of online check-in records from Foursquare, which consists of user IDs, check-in times, venues with geo-locations (points of interest) and other information written in plain English. There are 422,030 objects in the whole dataset. On the other hand, we use another dataset crawled from DaZhong Comments, which records information about shops (points of interest) written in Chinese, including the name and location of each shop and users’ comments on each POI.

To compare the semantic similarity between different venues, we utilize Word2Vec to derive the semantic vectors, where each component is a latent feature of a word. Word2vec is a group of related models used to produce word embeddings: shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2vec takes a large corpus of text as input and produces a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector. Word vectors are positioned such that words sharing common contexts in the corpus are located close to one another in the space.

In our experiments, we apply the HF algorithm [49] to generate all pivots. These well-selected pivots are located on the border of the dataset so as to best distinguish objects in semantic space collectively.

All algorithms are implemented in C++ and evaluated on the same datasets, Foursquare and DaZhong Comments. We compare our method with the NIQ*-tree and the R-tree in terms of query time and I/O cost, the latter measured by the number of visited objects. The default parameter values are given in Table 2. During the experiments, we vary one parameter while keeping the others at their default values. The experiments are conducted on a PC with 8 GB of memory.

Table 2 Default values of parameters

5.2 Comparison with proposed algorithms

In this subsection, we compare our algorithm with the baselines under the different parameter settings shown in Table 2.

Effect of |D|

As shown in Fig. 4, the efficiency of the three algorithms follows a similar tendency on both Foursquare and DaZhong Comments. Both query time and I/O cost trend upward as the dataset size grows, because more objects must be retrieved and more distance calculations must be performed. As |D| becomes larger, the efficiency of the NIQ*-tree approaches that of the R-tree owing to its weaker spatial pruning. Meanwhile, both the S2R-tree and the NIQ*-tree outperform the R-tree, since they prune in the spatial and semantic spaces collaboratively. The S2R-tree achieves the best performance, especially on large datasets, which can be explained by its pruning in the lower dimensional space.

Fig. 4 Effect of |D|

Effect of k

Figure 5 shows the performance of the R-tree, NIQ*-tree and S2R-tree when k ranges from 10 to 50. All three algorithms show a slight increase on both datasets. As shown in Fig. 5a and c, the number of visited objects trends upward as k grows from 10 to 50. This is because returning more objects with high similarity to the query requires visiting more objects and performing more distance calculations. Additionally, the R-tree and NIQ*-tree perform almost identically, which shows that the NIQ*-tree has poor pruning ability in high dimensional space. The S2R-tree is significantly superior to both in I/O cost and query time, though it is more sensitive to k than the R-tree and NIQ*-tree, since its efficiency stems from pruning dead space by mapping the high dimensional data into a lower dimensional space; this matches our expectation.

Fig. 5 Effect of k

Effect of λ

From Fig. 6, it is clear that as λ increases, both the number of visited objects and the query time decrease, owing to the better pruning ability in the 2D spatial domain. More specifically, the NIQ*-tree is superior to the R-tree only when λ is less than 0.5, while the S2R-tree outperforms both throughout. As shown in Fig. 6a and b, once λ exceeds 0.6 the number of visited objects drops sharply for all three algorithms: as λ approaches 1, the spatial distance accounts for an increasing proportion of the final distance. The S2R-tree remains the best, as expected, because of its efficient pruning in the low-dimensional space. Furthermore, the R-tree is the most sensitive to changes in λ because it prunes only in the spatial domain.
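Assuming the common linear combination of spatial and semantic distances (the exact form and normalization used in the paper may differ), the role of λ can be sketched as follows: as λ approaches 1, the spatial term dominates the final distance, which is why purely spatial pruning becomes increasingly effective:

```python
def combined_distance(spatial_dist, semantic_dist, lam):
    # Weighted combination of (normalized) spatial and semantic distances:
    # a larger lambda emphasizes spatial proximity over semantic similarity.
    return lam * spatial_dist + (1.0 - lam) * semantic_dist

# With lambda = 0.8, a spatially near but semantically far object beats
# a spatially far but semantically near one.
near_far = combined_distance(0.1, 0.9, 0.8)   # 0.8*0.1 + 0.2*0.9 = 0.26
far_near = combined_distance(0.9, 0.1, 0.8)   # 0.8*0.9 + 0.2*0.1 = 0.74
print(near_far < far_near)  # True
```

With λ = 0.2 the ranking of the same pair reverses, which explains why the purely spatial R-tree is the most sensitive to λ among the three methods.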

Fig. 6 Effect of λ

5.3 Evaluation on S2R-tree parameters

In this subsection, we further evaluate the performance of S2R-tree by varying the parameters c and m.

Effect of m

As shown in Fig. 7, the number of pivots m in the pivot-based space affects the performance of our proposed indexing structure. As m increases from 2 to 8, we observe from Fig. 7a and c that the number of visited objects decreases slightly, while Fig. 7b and d show that the query time descends gently. This is because more dead space is pruned in the low dimensional pivot-based space, so less time is spent on distance calculation, which matches our expectation. It is also noted from Fig. 7 that, for a fixed m, a larger data size |D| incurs more I/O cost and query time on both Foursquare and DaZhong Comments.

Fig. 7 Effect of m

Effect of c

According to Fig. 8, the performance of the S2R-tree is affected by c, the capacity of an S2R-tree leaf node. On one hand, we observe from Fig. 8a and c that the number of visited objects remains almost constant with respect to c but increases with the data size |D|: although the number of objects in each leaf node grows, the efficiency of dead-space pruning improves correspondingly, so the number of visited objects stays nearly constant as c ranges from 30 to 150. On the other hand, as shown in Fig. 8b and d, the query time decreases as c grows from 30 to 150, since more dead space is pruned and less time is spent computing distances. Additionally, for a fixed c, a larger data size |D| again incurs more I/O cost and query time.

Fig. 8 Effect of c

In summary, we conclude that the S2R-tree based search algorithm is more efficient than the two baseline algorithms, the R-tree and NIQ*-tree, in almost all test settings. This demonstrates the superior querying performance of the S2R-tree, owing not only to pruning in a low dimensional space after coordinate transformation, but also to the effectiveness of a series of bounds used in query optimization.

6 Conclusion and future work

This paper proposes a novel hybrid index structure, the S2R-tree, which integrates spatial and semantic information seamlessly. Instead of indexing objects in the original semantic space, we carefully design a pivot-based space mapping mechanism to transform the high dimensional semantic vectors into a low dimensional space, so that a stronger pruning effect can be achieved. A novel SKQ search algorithm on top of the S2R-tree is further designed, using theoretical bounds to speed up query processing significantly. Extensive experiments demonstrate the efficiency of our methods compared with the baseline algorithms.

In the future, we would like to consider road network constraints in the spatial domain, develop a network topology-aware SKQ querying framework, and further optimize query processing.