API Documentation¶

Tasks¶

Models¶

Encoders¶

class openhands.models.encoder.CNN2D(in_channels=3, backbone='resnet18', pretrained=True)[source]¶

Creates a 2D convolution backbone from the timm library.

Parameters:

in_channels (int) – Number of input channels
backbone (string) – Backbone to use
pretrained (bool, optional) – Whether to use pretrained Backbone. Default: True

forward(x)[source]¶: forward step

class openhands.models.encoder.CNN3D(in_channels, backbone, pretrained=True, **kwargs)[source]¶

Initializes the 3D Convolution backbone.

Supported Backbones

i3d_r50
c2d_r50
csn_r101
r2plus1d_r50
slow_r50
slowfast_r50
slowfast_r101
slowfast_16x8_r101_50_50
x3d_xs
x3d_s
x3d_m
x3d_l

Parameters:

in_channels (int) – Number of input channels
backbone (string) – Backbone to use
pretrained (bool, optional) – Whether to use pretrained Backbone. Default: True
**kwargs (optional) – Passed to the pytorchvideo.models.hub models.

forward(x)[source]¶: forward step

class openhands.models.encoder.DecoupledGCN(in_channels, graph_args, groups=8, block_size=41, n_out_features=256)[source]¶

ST-GCN backbone with Decoupled GCN layers, Self Attention and DropGraph proposed in the paper: Skeleton Aware Multi-modal Sign Language Recognition

Parameters:

in_channels (int) – Number of channels in the input data.
graph_args (dict) – The arguments for building the graph.
groups (int) – Number of Decouple groups to use. Default: 8.
block_size (int) – Block size used for Temporal masking in Dropgraph. Default: 41.
n_out_features (int) – Output Embedding dimension. Default: 256.

forward(x, keep_prob=0.9)[source]¶

Parameters:

x (torch.Tensor) – Input graph sequence of shape \((N, in\_channels, T_{in}, V_{in})\)
keep_prob (float) – The probability to keep the node. Default: 0.9.

Returns:

Output embedding of shape \((N, n\_out\_features)\)

Return type:

torch.Tensor

where:

\(N\) is the batch size,
\(T_{in}\) is the length of the input sequence,
\(V_{in}\) is the number of graph nodes,
\(n\_out\_features\) is the `n_out_features` value.

class openhands.models.encoder.PoseFlattener(in_channels=3, num_points=27)[source]¶

Flattens the pose keypoints across the channel dimension.

Parameters:

in_channels (int) – Number of channels in the input data.
num_points (int) – Number of spatial joints

forward(x)[source]¶

Parameters:: x (torch.Tensor) – Input tensor of shape \((N, in\_channels, T_{in}, V_{in})\)
Returns:: Tensor with channel dimension flattened of shape \((N, T_{in}, in\_channels * V_{in})\)
Return type:: torch.Tensor

where

\(N\) is the batch size,
\(T_{in}\) is the length of the input sequence,
\(V_{in}\) is the number of graph nodes.
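The shape change described above can be illustrated without any tensor library. This is a minimal sketch on nested Python lists, not the library's implementation; the helper name `flatten_pose` is hypothetical, and the channel-major ordering of the flattened axis is an assumption.

```python
def flatten_pose(x):
    """Flatten nested lists of shape (N, C, T, V) into (N, T, C * V)."""
    out = []
    for sample in x:                      # sample has shape (C, T, V)
        n_frames = len(sample[0])
        frames = []
        for t in range(n_frames):
            # Assumption: channels are concatenated one after another per frame.
            frames.append([v for channel in sample for v in channel[t]])
        out.append(frames)
    return out


# One sample with 2 channels, 3 frames and 2 joints; values encode (c, t, v).
x = [[[[c * 100 + t * 10 + v for v in range(2)] for t in range(3)] for c in range(2)]]
y = flatten_pose(x)
print(len(y), len(y[0]), len(y[0][0]))
```

Each frame of the result holds `in_channels * V` values, which is the layout sequence decoders with inputs of shape (batch_size, T, n_features) expect.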

class openhands.models.encoder.SGN(n_frames, num_points, in_channels=2, bias=True)[source]¶

SGN model proposed in Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition

Note

The model only supports inputs with a fixed number of frames.

Parameters:

n_frames (int) – Number of frames in the input sequence.
num_points (int) – Number of spatial points in a graph.
in_channels (int) – Number of channels in the input data. Default: 2.
bias (bool) – Whether to use bias or not. Default: True.

forward(input)[source]¶

Parameters:: input (torch.Tensor) – Input tensor of shape \((N, in\_channels, T_{in}, V_{in})\)
Returns:: Output embedding of shape \((N, n\_out\_features)\)
Return type:: torch.Tensor

where

\(N\) is the batch size,
\(T_{in}\) is the length of the input sequence,
\(V_{in}\) is the number of graph nodes,
\(n\_out\_features\) is the output embedding dimension.

one_hot(bs, spa, tem)[source]¶: get one-hot encodings

class openhands.models.encoder.STGCN(in_channels, graph_args, edge_importance_weighting, n_out_features=256, **kwargs)[source]¶

Spatial temporal graph convolutional network backbone

This module is proposed in Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Parameters:

in_channels (int) – Number of channels in the input data.
graph_args (dict) – The arguments for building the graph.
edge_importance_weighting (bool) – If True, adds a learnable importance weighting to the edges of the graph. Default: True.
n_out_features (int) – Output Embedding dimension. Default: 256.
kwargs (dict) – Other parameters for graph convolution units.

forward(x)[source]¶

Parameters:: x (torch.Tensor) – Input tensor of shape \((N, in\_channels, T_{in}, V_{in})\)
Returns:: Output embedding of shape \((N, n\_out\_features)\)
Return type:: torch.Tensor

where

\(N\) is the batch size,
\(T_{in}\) is the length of the input sequence,
\(V_{in}\) is the number of graph nodes,
\(n\_out\_features\) is the output embedding dimension.

Decoders¶

class openhands.models.decoder.BERT(n_features, num_class, config)[source]¶

BERT decoder module.

Parameters:

n_features (int) – Number of features in the input.
num_class (int) – Number of classes for classification.
config (dict) – Configuration set for BERT layer.

forward(x)[source]¶

Parameters:: x (torch.Tensor) – Input tensor of shape: (batch_size, T, n_features)
Returns:: logits for classification.
Return type:: torch.Tensor

openhands.models.decoder.FullyConnectedClassifier¶: alias of FC

class openhands.models.decoder.RNNClassifier(n_features, num_class, rnn_type='GRU', hidden_size=512, num_layers=1, bidirectional=True, use_attention=False)[source]¶

RNN head for classification.

Parameters:

n_features (int) – Number of features in the input.
num_class (int) – Number of classes for classification.
rnn_type (str) – GRU or LSTM. Default: GRU.
hidden_size (int) – Hidden dim to use for RNN. Default: 512.
num_layers (int) – Number of layers of RNN to use. Default: 1.
bidirectional (bool) – Whether to use bidirectional RNN or not. Default: True.
use_attention (bool) – Whether to use attention for pooling or not. Default: False.

forward(x)[source]¶

Parameters:: x (torch.Tensor) – Input tensor of shape: (batch_size, T, n_features)
Returns:: logits for classification.
Return type:: torch.Tensor

SSL-Models¶

Datasets¶

Augmentations¶

class openhands.datasets.pose_transforms.CenterAndScaleNormalize(reference_points_preset=None, reference_point_indexes=[], scale_factor=1, frame_level=False)[source]¶

Centers and scales the keypoints based on the given reference points.

Parameters:

reference_points_preset (str | None, optional) – can be used to specify existing presets - mediapipe_holistic_minimal_27 or mediapipe_holistic_top_body_59
reference_point_indexes (list) – Point indexes of shape (p1, p2) to use if the preset is not given.
scale_factor (int) – scaling factor. Default: 1
frame_level (bool) – Whether to center and normalize at frame level or clip level. Default: False

calc_center_and_scale(x)[source]¶

Calculates the center and scale value based on the sequence of skeletons.

Parameters:: x (torch.Tensor) – all keypoints for the video clip.
Returns:: center and scale value to normalize
Return type:: [float, float]

calc_center_and_scale_for_one_skeleton(x)[source]¶

Calculates the center and scale values for one skeleton.

Parameters:: x (torch.Tensor) – Spatial keypoints at a timestep
Returns:: center and scale value to normalize for the skeleton
Return type:: [float, float]
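As an illustration of what centering and scaling by two reference points can look like, here is a minimal sketch on plain Python tuples. The specific formulas (midpoint of the two reference points as center, their distance times `scale_factor` as scale) are assumptions for illustration, not confirmed details of the library, and the helper names are hypothetical.

```python
import math


def center_and_scale(skeleton, ref_indexes, scale_factor=1):
    """skeleton: list of (x, y) keypoints; ref_indexes: (p1, p2)."""
    (x1, y1), (x2, y2) = skeleton[ref_indexes[0]], skeleton[ref_indexes[1]]
    # Assumed convention: center is the midpoint, scale is the distance.
    center = ((x1 + x2) / 2, (y1 + y2) / 2)
    scale = math.hypot(x2 - x1, y2 - y1) * scale_factor
    return center, scale


def normalize(skeleton, center, scale):
    """Shift keypoints by the center and divide by the scale."""
    return [((x - center[0]) / scale, (y - center[1]) / scale) for x, y in skeleton]


skeleton = [(0.0, 0.0), (4.0, 0.0), (2.0, 2.0)]
center, scale = center_and_scale(skeleton, (0, 1))
normalized = normalize(skeleton, center, scale)
print(center, scale, normalized)
```

With frame_level=False, one center and scale would be shared by the whole clip; with frame_level=True, they would be recomputed for each skeleton.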

class openhands.datasets.pose_transforms.Compose(transforms)[source]¶

Compose a list of pose transforms

Parameters:: transforms (list) – List of transforms to be applied.
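The idea of composing transforms can be sketched as applying each callable in order to the output of the previous one. The class name `ComposeSketch` is hypothetical, and the real transforms may expect a specific data structure rather than the plain integers used here.

```python
class ComposeSketch:
    """Apply a list of callables one after another."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, data):
        for transform in self.transforms:
            data = transform(data)  # each transform receives the previous output
        return data


pipeline = ComposeSketch([lambda v: v + 1, lambda v: v * 2])
result = pipeline(3)
print(result)
```

Order matters: the same two transforms applied in reverse would give a different result.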

class openhands.datasets.pose_transforms.FrameSkipping(skip_range=1)[source]¶

Skips frames based on the specified jump range.

Parameters:: skip_range (int) – The skip range.

class openhands.datasets.pose_transforms.PoseRandomShift[source]¶: Randomly distributes the zero padding at the end of a video between the initial and final positions.
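One way to picture this redistribution is the sketch below, which works on a flat list where 0 stands for a padded frame. It is an illustration under that assumption only; `random_shift` is a hypothetical helper, not the library's code.

```python
import random


def random_shift(frames, pad=0):
    """Move a random share of the trailing padding to the front."""
    n_pad = 0
    while n_pad < len(frames) and frames[-1 - n_pad] == pad:
        n_pad += 1                              # count trailing padded frames
    content = frames[: len(frames) - n_pad]
    left = random.randint(0, n_pad)             # padding placed before the content
    return [pad] * left + content + [pad] * (n_pad - left)


shifted = random_shift([1, 2, 3, 0, 0])
print(shifted)
```

The sequence length and the order of the real frames are unchanged; only the position of the padding varies between calls.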

class openhands.datasets.pose_transforms.PoseSelect(preset=None, pose_indexes: list = [])[source]¶

Select the given index keypoints from all keypoints.

Parameters:

preset (str | None, optional) – can be used to specify existing presets - mediapipe_holistic_minimal_27 or mediapipe_holistic_top_body_59. If None, then the pose_indexes argument indexes will be used to select. Default: None
pose_indexes (list) – List of indexes to select.

class openhands.datasets.pose_transforms.PoseTemporalSubsample(num_frames, temporal_dim=1)[source]¶

Randomly subsamples num_frames indices from the temporal dimension of the sequence of keypoints. If num_frames is larger than the length of the sequence, then the remaining frames will be padded with zeros.

Parameters:

num_frames (int) – Number of frames to subsample.
temporal_dim (int) – Temporal dimension along which to subsample.

class openhands.datasets.pose_transforms.PoseUniformSubsampling(num_frames, randomize_start_index=False, temporal_dim=1)[source]¶

Uniformly subsamples num_frames indices from the temporal dimension of the sequence of keypoints. If the num_frames is larger than the length of the sequence, then the remaining frames will be padded with zeros.

Parameters:

num_frames (int) – Number of frames to subsample.
randomize_start_index (bool) – While performing interleaved subsampling, select start_index from randint(0, step_size). Default: False
temporal_dim (int) – Temporal dimension along which to subsample.
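The uniform subsampling and zero padding described above can be sketched on a plain list of frames. The step computation (`len(frames) // num_frames`) and the inclusive start range `0..step - 1` are assumptions for illustration; `uniform_subsample` is a hypothetical helper.

```python
import random


def uniform_subsample(frames, num_frames, randomize_start_index=False, pad=0):
    """Pick num_frames evenly spaced frames, padding when the clip is too short."""
    t = len(frames)
    if t >= num_frames:
        step = t // num_frames
        # Optionally start anywhere inside the first step (interleaved sampling).
        start = random.randint(0, step - 1) if randomize_start_index else 0
        picked = [frames[start + i * step] for i in range(num_frames)]
    else:
        picked = list(frames)
    return picked + [pad] * (num_frames - len(picked))


print(uniform_subsample(list(range(10)), 5))
print(uniform_subsample([1, 2, 3], 5))
```

The output length is always `num_frames`, which keeps batch shapes fixed regardless of the clip length.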

class openhands.datasets.pose_transforms.PrependLangCodeOHE(lang_codes: list)[source]¶

Prepends a one-hot encoded vector based on the language of the input video. Ideally, it should be applied last, after all other normalizations/augmentations.

Parameters:: lang_codes – List of sign language codes.
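A one-hot language vector of the kind mentioned above can be built as in the following sketch. The language codes are placeholder values and `lang_one_hot` is a hypothetical helper; how the vector is attached to the keypoint tensor is not shown here.

```python
def lang_one_hot(lang_code, lang_codes):
    """Return a vector with 1.0 at the position of lang_code and 0.0 elsewhere."""
    if lang_code not in lang_codes:
        raise ValueError(f"unknown language code: {lang_code}")
    return [1.0 if code == lang_code else 0.0 for code in lang_codes]


codes = ["lang_a", "lang_b", "lang_c"]  # placeholder codes for illustration
vector = lang_one_hot("lang_b", codes)
print(vector)
```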

class openhands.datasets.pose_transforms.RandomMove(move_range=(-2.5, 2.5), move_step=0.5)[source]¶: Moves all the keypoints randomly in a random direction.

class openhands.datasets.pose_transforms.RotatationTransform(rotation_std: float = 0.2)[source]¶

Applies 2D rotation transformation.

Parameters:: rotation_std (float) – std to use for rotation transformation. Default: 0.2
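A 2D rotation augmentation of this kind can be sketched as drawing an angle from a zero-mean normal distribution and rotating every keypoint about the origin. Treating `rotation_std` as the standard deviation of an angle in radians is an assumption, and `rotate_2d` is a hypothetical helper.

```python
import math
import random


def rotate_2d(points, angle):
    """Rotate (x, y) points counter-clockwise about the origin by angle radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Assumed sampling scheme: angle ~ Normal(0, rotation_std).
angle = random.gauss(0.0, 0.2)
augmented = rotate_2d([(1.0, 0.0), (0.0, 1.0)], angle)

# A quarter turn maps (1, 0) onto (0, 1), up to floating-point error.
quarter = rotate_2d([(1.0, 0.0)], math.pi / 2)
print(augmented, quarter)
```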

class openhands.datasets.pose_transforms.ScaleToVideoDimensions(width: int, height: int)[source]¶

Scale the pose keypoints to the given width and height values.

Parameters:

width (int) – Width of the frames
height (int) – Height of the frames

class openhands.datasets.pose_transforms.ScaleTransform(scale_std=0.2)[source]¶

Applies Scaling transformation

Parameters:: scale_std (float) – std to use for Scaling transformation. Default: 0.2

class openhands.datasets.pose_transforms.ShearTransform(shear_std: float = 0.2)[source]¶

Applies 2D shear transformation

Parameters:: shear_std (float) – std to use for shear transformation. Default: 0.2

class openhands.datasets.pose_transforms.TemporalSample(num_frames, subsample_mode=2)[source]¶

Randomly chooses between uniform and temporal subsampling.

If subsample_mode==2, either random sub-sampling or uniform sampling is chosen at random
If subsample_mode==0, only uniform-sampling (for test sets)
If subsample_mode==1, only sub-sampling (to reproduce results of some papers that use only subsampling)

Parameters:

num_frames (int) – Number of frames to subsample.
subsample_mode (int) – Mode to choose.
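The three modes can be summarized as a small dispatch function. This sketch only returns the name of the sampler that would be used; `pick_sampler` and the string labels are hypothetical.

```python
import random


def pick_sampler(subsample_mode):
    """Map subsample_mode to the sampling strategy it selects."""
    if subsample_mode == 0:
        return "uniform"                          # deterministic, for test sets
    if subsample_mode == 1:
        return "random"                           # random sub-sampling only
    if subsample_mode == 2:
        return random.choice(["uniform", "random"])
    raise ValueError(f"unsupported subsample_mode: {subsample_mode}")


print(pick_sampler(0), pick_sampler(1), pick_sampler(2))
```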

ISLR Datasets¶

Base class¶

class openhands.datasets.isolated.base.BaseIsolatedDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

This module provides the datasets for Isolated Sign Language Classification. Do not instantiate this class directly.

enumerate_data_files(dir)[source]¶

Lists the video files from the given directory.

If the modality is pose, generates .pkl files for all videos in the folder. If no videos are present, checks whether some .pkl files already exist.

load_pose_from_path(path)[source]¶

Loads dumped pose keypoints. The file should contain a dict with:

"keypoints" of shape (T, V, C),
"confidences" of shape (T, V).
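The expected file layout can be illustrated with a round trip through pickle. This is a sketch using nested lists in place of arrays; `load_pose` is a hypothetical helper and the file name is a placeholder.

```python
import os
import pickle
import tempfile


def load_pose(path):
    """Load a pose dict and check that both entries cover the same frames."""
    with open(path, "rb") as f:
        pose = pickle.load(f)
    if len(pose["keypoints"]) != len(pose["confidences"]):
        raise ValueError("keypoints and confidences differ in length T")
    return pose


T, V, C = 2, 3, 2
sample = {
    "keypoints": [[[0.0] * C for _ in range(V)] for _ in range(T)],   # (T, V, C)
    "confidences": [[1.0] * V for _ in range(T)],                     # (T, V)
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clip.pkl")
    with open(path, "wb") as f:
        pickle.dump(sample, f)
    loaded = load_pose(path)

print(len(loaded["keypoints"]), len(loaded["keypoints"][0]))
```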

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

read_video_data(index)[source]¶: Extend this method for dataset-specific formats

Labelled datasets¶

class openhands.datasets.isolated.ASLLVDDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

American Isolated Sign language dataset from the paper:

The American Sign Language Lexicon Video Dataset (https://ieeexplore.ieee.org/abstract/document/4563181). The train-test split has been taken from the paper at https://arxiv.org/pdf/1901.11164.pdf.

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

class openhands.datasets.isolated.AUTSLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

Turkish Isolated Sign language dataset from the paper:

AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

read_video_data(index)[source]¶: Extend this method for dataset-specific formats

class openhands.datasets.isolated.Bosphorus22kDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

Turkish Isolated Sign language dataset (Bosphorus22k) from the paper: https://arxiv.org/pdf/2004.01283.pdf

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶

Dataset includes 22542 videos where 6 signers executed 4+ repetitions of 744 different types of signs.

Train set: all signers except the signer called user_4 (18,018 videos). Test set: the signer user_4 (4525 videos).
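The signer-based split above amounts to a single comparison on the signer name. In this sketch the sample tuples and clip names are invented for illustration, and `split_of` is a hypothetical helper.

```python
TEST_SIGNER = "user_4"


def split_of(signer):
    """Assign a signer to the test set if it is user_4, else to the train set."""
    return "test" if signer == TEST_SIGNER else "train"


# Hypothetical (signer, clip) pairs standing in for the real file listing.
samples = [("user_1", "clip_a"), ("user_4", "clip_b"), ("user_6", "clip_c")]
splits = {"train": [], "test": []}
for signer, clip in samples:
    splits[split_of(signer)].append(clip)

print(splits)
```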

class openhands.datasets.isolated.CSLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

Chinese Isolated Sign language dataset from the paper:

Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶

Format for word-level CSL dataset:

1. naming: P01_25_19_2._color.mp4

P01: 1, signer ID (person)
25_19: (25-1)*20+19=499, label ID
2: 2, the second time performing the sign

experiment setting: split:

train set: signer ID, [0, 1, …, 34, 35]
test set: signer ID, [36, 37, …, 48, 49]
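The naming scheme and split above can be worked through in a few lines. This is a sketch that follows the stated formula; `parse_csl_name` and `csl_split` are hypothetical helpers, not part of the library.

```python
def parse_csl_name(file_name):
    """Return (signer_id, label_id, repetition) for a word-level CSL file name."""
    stem = file_name.split(".")[0]                  # "P01_25_19_2"
    signer, group, offset, repetition = stem.split("_")
    signer_id = int(signer[1:])                     # "P01" -> 1
    label_id = (int(group) - 1) * 20 + int(offset)  # (25-1)*20+19 = 499
    return signer_id, label_id, int(repetition)


def csl_split(signer_id):
    """Signers 0..35 go to the train set, signers 36..49 to the test set."""
    return "train" if signer_id <= 35 else "test"


parsed = parse_csl_name("P01_25_19_2._color.mp4")
print(parsed, csl_split(parsed[0]))
```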

read_video_data(index)[source]¶: Extend this method for dataset-specific formats

class openhands.datasets.isolated.ConcatDataset(datasets, unify_vocabulary=False, **kwargs)[source]¶

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

class openhands.datasets.isolated.DeviSignDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

Chinese Isolated Sign language dataset from the paper:

The devisign large vocabulary of chinese sign language database and baseline evaluations

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶

Check the file “DEVISIGN Technical Report.pdf” inside `Documents` folder for dataset format (page 12) and splits (page 15)

TODO: The train set size is 16k, and test set size is 8k (for 2k classes). Should we use 4k from test set as valset, and only the other 4k for benchmarking?

read_video_data(index)[source]¶: Extend this method for dataset-specific formats

class openhands.datasets.isolated.FingerSpellingDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

Fingerspelling datasets: ‘Argentine’, ‘American’, ‘Chinese’, ‘Indian’, ‘German’, ‘Greek’, ‘Turkish’

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Divides each fingerspelling dataset 80-20 into train and test splits

class openhands.datasets.isolated.GSLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

Greek Isolated Sign language dataset from the paper:

A Comprehensive Study on Deep Learning-based Methods for Sign Language Recognition

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

read_video_data(index)[source]¶: Extend this method for dataset-specific formats

class openhands.datasets.isolated.INCLUDEDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

Indian Isolated Sign language dataset from the paper:

INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

read_video_data(index)[source]¶: Extend this method for dataset-specific formats

class openhands.datasets.isolated.LSA64Dataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

Argentinian Isolated Sign language dataset from the paper:

LSA64: An Argentinian Sign Language Dataset

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶

Dataset includes 3200 videos where 10 non-expert subjects executed 5 repetitions of 64 different types of signs.

Signer-independent splits: For train-set, we use signers 1-8. Val-set & Test-set: Signer-9 & Signer-10

Signer-dependent splits: In the original paper, they split randomly, and do not open-source the splits. Hence we only follow the signer-based splits we have come-up with (as mentioned above)

read_video_data(index)[source]¶: Extend this method for dataset-specific formats

class openhands.datasets.isolated.MSASLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

American Isolated Sign language dataset from the paper:

MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

read_video_data(index)[source]¶: Extend this method for dataset-specific formats

class openhands.datasets.isolated.RWTH_Phoenix_Signer03_Dataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

German Isolated Sign language dataset from the paper:

RWTH-PHOENIX-Weather: A Large Vocabulary Sign Language Recognition and Translation Corpus (https://www-i6.informatik.rwth-aachen.de/~forster/database-rwth-phoenix.php).

The Signer03 cutout has been taken for the experiments:

Image sequence - https://www-i6.informatik.rwth-aachen.de/ftp/pub/rwth-phoenix/rwth-phoenix-weather-signer03-cutout-images_20120820.tgz
Annotations - https://www-i6.informatik.rwth-aachen.de/ftp/pub/rwth-phoenix/rwth-phoenix-weather-signer03-cutout_20120820.tgz

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

class openhands.datasets.isolated.WLASLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶

American Isolated Sign language dataset from the paper:

Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison

read_glosses()[source]¶: Implement this method to construct self.glosses[]

read_original_dataset()[source]¶: Implement this method to read (video_name/video_folder, classification_label) into self.data[]

read_video_data(index)[source]¶: Extend this method for dataset-specific formats