SuperPoint
Overview
The SuperPoint model was proposed in SuperPoint: Self-Supervised Interest Point Detection and Description by Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.
This model is the result of self-supervised training of a fully-convolutional network for interest point detection and description. The model is able to detect interest points that are repeatable under homographic transformations and to provide a descriptor for each point. Use of the model on its own is limited, but it can serve as a feature extractor for other tasks such as homography estimation and image matching.
The abstract from the paper is the following:
This paper presents a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, our fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when compared to LIFT, SIFT and ORB.
SuperPoint overview. Taken from the original paper.

Usage tips
Here is a quick example of using the model to detect interest points in an image:
from transformers import AutoImageProcessor, SuperPointForKeypointDetection
import torch
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")
inputs = processor(image, return_tensors="pt")
outputs = model(**inputs)
The outputs contain the list of keypoint coordinates with their respective score and descriptor (a 256-dimensional vector).
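If you want to work with these raw outputs directly, here is a minimal sketch of inspecting them (attribute names follow the model output described on this page; the shapes in the comments are assumptions, since the tensors are padded to the largest number of keypoints in the batch):

# Hedged sketch: inspect the padded batch tensors returned by the model.
keypoints = outputs.keypoints      # assumed shape (batch_size, num_keypoints, 2), relative (x, y) coordinates
scores = outputs.scores            # assumed shape (batch_size, num_keypoints)
descriptors = outputs.descriptors  # assumed shape (batch_size, num_keypoints, 256)
mask = outputs.mask                # assumed shape (batch_size, num_keypoints), 1 where a keypoint is valid
print(keypoints.shape, scores.shape, descriptors.shape, mask.shape)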
You can also feed multiple images to the model. Because SuperPoint outputs a dynamic number of keypoints per image, you will need to use the mask attribute to retrieve the respective information:
from transformers import AutoImageProcessor, SuperPointForKeypointDetection
import torch
from PIL import Image
import requests
url_image_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_1 = Image.open(requests.get(url_image_1, stream=True).raw)
url_image_2 = "http://images.cocodataset.org/test-stuff2017/000000000568.jpg"
image_2 = Image.open(requests.get(url_image_2, stream=True).raw)
images = [image_1, image_2]
processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")
inputs = processor(images, return_tensors="pt")
outputs = model(**inputs)
image_sizes = [(image.height, image.width) for image in images]
outputs = processor.post_process_keypoint_detection(outputs, image_sizes)
for output in outputs:
    for keypoints, scores, descriptors in zip(output["keypoints"], output["scores"], output["descriptors"]):
        print(f"Keypoints: {keypoints}")
        print(f"Scores: {scores}")
        print(f"Descriptors: {descriptors}")
You can then plot the keypoints on the image of your choice to visualize the result:
import matplotlib.pyplot as plt
plt.axis("off")
plt.imshow(image_1)
plt.scatter(
    outputs[0]["keypoints"][:, 0],
    outputs[0]["keypoints"][:, 1],
    c=outputs[0]["scores"] * 100,
    s=outputs[0]["scores"] * 50,
    alpha=0.8,
)
plt.savefig("output_image.png")
This model was contributed by stevenbucaille. The original code can be found here.
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SuperPoint. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
- A notebook showcasing inference and visualization with SuperPoint can be found here. 🌎
SuperPointConfig
class transformers.SuperPointConfig
< source >( encoder_hidden_sizes: typing.List[int] = [64, 64, 128, 128] decoder_hidden_size: int = 256 keypoint_decoder_dim: int = 65 descriptor_decoder_dim: int = 256 keypoint_threshold: float = 0.005 max_keypoints: int = -1 nms_radius: int = 4 border_removal_distance: int = 4 initializer_range = 0.02 **kwargs )
Parameters
- encoder_hidden_sizes (`List`, optional, defaults to `[64, 64, 128, 128]`) — The number of channels in each convolutional layer in the encoder.
- decoder_hidden_size (`int`, optional, defaults to 256) — The hidden size of the decoder.
- keypoint_decoder_dim (`int`, optional, defaults to 65) — The output dimension of the keypoint decoder.
- descriptor_decoder_dim (`int`, optional, defaults to 256) — The output dimension of the descriptor decoder.
- keypoint_threshold (`float`, optional, defaults to 0.005) — The threshold to use for extracting keypoints.
- max_keypoints (`int`, optional, defaults to -1) — The maximum number of keypoints to extract. If `-1`, will extract all keypoints.
- nms_radius (`int`, optional, defaults to 4) — The radius for non-maximum suppression.
- border_removal_distance (`int`, optional, defaults to 4) — The distance from the border to remove keypoints.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a SuperPointForKeypointDetection. It is used to instantiate a SuperPoint model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SuperPoint magic-leap-community/superpoint architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import SuperPointConfig, SuperPointForKeypointDetection
>>> # Initializing a SuperPoint superpoint style configuration
>>> configuration = SuperPointConfig()
>>> # Initializing a model from the superpoint style configuration
>>> model = SuperPointForKeypointDetection(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
SuperPointImageProcessor
class transformers.SuperPointImageProcessor
< source >( do_resize: bool = True size: typing.Dict[str, int] = None do_rescale: bool = True rescale_factor: float = 0.00392156862745098 **kwargs )
Parameters
- do_resize (`bool`, optional, defaults to `True`) — Controls whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by `do_resize` in the `preprocess` method.
- size (`Dict[str, int]`, optional, defaults to `{"height": 480, "width": 640}`) — Resolution of the output image after `resize` is applied. Only has an effect if `do_resize` is set to `True`. Can be overridden by `size` in the `preprocess` method.
- do_rescale (`bool`, optional, defaults to `True`) — Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in the `preprocess` method.
- rescale_factor (`int` or `float`, optional, defaults to `1/255`) — Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess` method.
Constructs a SuperPoint image processor.
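For illustration, here is a minimal sketch of instantiating the processor with its defaults and preparing a single image (reusing the PIL image from the usage examples above; the exact output shape depends on the configured size):

from transformers import SuperPointImageProcessor

# Defaults resize to {"height": 480, "width": 640} and rescale pixel values by 1/255.
processor = SuperPointImageProcessor()
inputs = processor(image, return_tensors="pt")  # image: a PIL.Image, as in the examples above
print(inputs["pixel_values"].shape)  # batched pixel values, resized and rescaled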
post_process_keypoint_detection
< source >( outputs: SuperPointKeypointDescriptionOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, typing.List[typing.Tuple]] ) → List[Dict]
Parameters
- outputs (`SuperPointKeypointDescriptionOutput`) — Raw outputs of the model containing keypoints in a relative (x, y) format, with scores and descriptors.
- target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`) — Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size `(height, width)` of each image in the batch. This must be the original image size (before any processing).
Returns
List[Dict]
A list of dictionaries, one per image in the batch, each containing the keypoints in absolute coordinates according to target_sizes, together with their scores and descriptors as predicted by the model.
Converts the raw output of SuperPointForKeypointDetection into lists of keypoints, scores and descriptors with coordinates absolute to the original image sizes.
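A minimal sketch of calling this method, mirroring the multi-image example above (target sizes are the original (height, width) of each input image, taken before any processing):

# target_sizes must be the original image sizes, not the resized ones fed to the model
image_sizes = [(image.height, image.width) for image in images]
processed_outputs = processor.post_process_keypoint_detection(outputs, image_sizes)
keypoints = processed_outputs[0]["keypoints"]  # absolute (x, y) pixel coordinates for the first image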
preprocess
< source >( images do_resize: bool = None size: typing.Dict[str, int] = None do_rescale: bool = None rescale_factor: float = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs )
Parameters
- images (`ImageInput`) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- do_resize (`bool`, optional, defaults to `self.do_resize`) — Whether to resize the image.
- size (`Dict[str, int]`, optional, defaults to `self.size`) — Dictionary of the form `{"height": int, "width": int}`, specifying the size of the output image after `resize` has been applied. Only has an effect if `do_resize` is set to `True`.
- do_rescale (`bool`, optional, defaults to `self.do_rescale`) — Whether to rescale the image values to [0, 1].
- rescale_factor (`float`, optional, defaults to `self.rescale_factor`) — Rescale factor to rescale the image by if `do_rescale` is set to `True`.
- return_tensors (`str` or `TensorType`, optional) — The type of tensors to return. Can be one of:
  - Unset: Return a list of `np.ndarray`.
  - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
  - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
  - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
  - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
- data_format (`ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST`) — The channel dimension format for the output image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input image.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
Preprocess an image or batch of images.
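For example, the per-call arguments can override the processor defaults; a small sketch using only parameters listed above:

# Skip resizing for an image that is already at the desired resolution,
# and return NumPy arrays instead of PyTorch tensors.
features = processor.preprocess(image, do_resize=False, return_tensors="np")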
resize
< source >( image: ndarray size: typing.Dict[str, int] data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs )
Parameters
- image (`np.ndarray`) — Image to resize.
- size (`Dict[str, int]`) — Dictionary of the form `{"height": int, "width": int}`, specifying the size of the output image.
- data_format (`ChannelDimension` or `str`, optional) — The channel dimension format of the output image. If not provided, it will be inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
Resize an image.
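A small illustrative sketch, assuming a NumPy image in (height, width, num_channels) format (the dummy array and target size below are arbitrary):

import numpy as np

# Dummy (height, width, num_channels) image; resize to the processor's default resolution.
dummy = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
resized = processor.resize(dummy, size={"height": 480, "width": 640})
print(resized.shape)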
SuperPointForKeypointDetection
class transformers.SuperPointForKeypointDetection
< source >( config: SuperPointConfig )
Parameters
- config (SuperPointConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
SuperPoint model outputting keypoints and descriptors. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
SuperPoint model. It consists of a SuperPointEncoder, a SuperPointInterestPointDecoder and a SuperPointDescriptorDecoder. SuperPoint was proposed in SuperPoint: Self-Supervised Interest Point Detection and Description (https://arxiv.org/abs/1712.07629) by Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. It is a fully convolutional neural network that extracts keypoints and descriptors from an image. It is trained in a self-supervised manner, using a combination of a photometric loss and a loss based on the homographic adaptation of keypoints. It is made of a convolutional encoder and two decoders: one for keypoints and one for descriptors.
forward
< source >( pixel_values: FloatTensor labels: typing.Optional[torch.LongTensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )
Parameters
- pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`) — Pixel values. Pixel values can be obtained using SuperPointImageProcessor. See SuperPointImageProcessor.__call__() for details.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The SuperPointForKeypointDetection forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
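A minimal sketch of calling the model instance (rather than forward directly), optionally requesting hidden states; it reuses the processor inputs from the usage examples above, and the hidden_states attribute name is assumed from the output_hidden_states parameter documented here:

import torch

# Calling the module instance runs the pre/post-processing hooks around forward.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
print(len(outputs.hidden_states))  # number of returned hidden states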