API Documentation
Tasks
Models
Encoders
- class openhands.models.encoder.CNN2D(in_channels=3, backbone='resnet18', pretrained=True)[source]¶
Creates a 2D convolution backbone from the timm library.
- Parameters:
in_channels (int) – Number of input channels
backbone (string) – Backbone to use
pretrained (bool, optional) – Whether to use a pretrained backbone. Default: True.
- class openhands.models.encoder.CNN3D(in_channels, backbone, pretrained=True, **kwargs)[source]¶
Initializes the 3D Convolution backbone.
Supported Backbones
i3d_r50
c2d_r50
csn_r101
r2plus1d_r50
slow_r50
slowfast_r50
slowfast_r101
slowfast_16x8_r101_50_50
x3d_xs
x3d_s
x3d_m
x3d_l
- Parameters:
in_channels (int) – Number of input channels
backbone (string) – Backbone to use
pretrained (bool, optional) – Whether to use a pretrained backbone. Default: True.
**kwargs (optional) – Passed on to the pytorchvideo.models.hub models.
- class openhands.models.encoder.DecoupledGCN(in_channels, graph_args, groups=8, block_size=41, n_out_features=256)[source]¶
ST-GCN backbone with Decoupled GCN layers, Self Attention and DropGraph proposed in the paper: Skeleton Aware Multi-modal Sign Language Recognition
- Parameters:
in_channels (int) – Number of channels in the input data.
graph_args (dict) – The arguments for building the graph.
groups (int) – Number of decoupled groups to use. Default: 8.
block_size (int) – Block size used for temporal masking in DropGraph. Default: 41.
n_out_features (int) – Output Embedding dimension. Default: 256.
- forward(x, keep_prob=0.9)[source]¶
- Parameters:
x (torch.Tensor) – Input graph sequence of shape \((N, in\_channels, T_{in}, V_{in})\)
keep_prob (float) – The probability to keep the node. Default: 0.9.
- Returns:
Output embedding of shape \((N, n\_out\_features)\)
- Return type:
torch.Tensor
- where:
\(N\) is the batch size,
\(T_{in}\) is the length of the input sequence,
\(V_{in}\) is the number of graph nodes,
\(n\_out\_features\) is the value of the n_out_features argument.
- class openhands.models.encoder.PoseFlattener(in_channels=3, num_points=27)[source]¶
Flattens the pose keypoints across the channel dimension.
- Parameters:
in_channels (int) – Number of channels in the input data.
num_points (int) – Number of spatial joints
- forward(x)[source]¶
- Parameters:
x (torch.Tensor) – Input tensor of shape \((N, in\_channels, T_{in}, V_{in})\)
- Returns:
Tensor with channel dimension flattened of shape \((N, T_{in}, in\_channels * V_{in})\)
- Return type:
torch.Tensor
- where:
\(N\) is the batch size,
\(T_{in}\) is the length of the input sequence,
\(V_{in}\) is the number of graph nodes.
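The reshaping described for PoseFlattener.forward can be illustrated with NumPy standing in for torch. This is only a sketch: the helper name `flatten_pose` is hypothetical, and the channel-major ordering of the flattened axis is an assumption that this page does not confirm.

```python
import numpy as np


def flatten_pose(x):
    """Flatten (N, C, T, V) keypoints to (N, T, C * V).

    Hypothetical helper; the real PoseFlattener may order the
    flattened axis differently.
    """
    n, c, t, v = x.shape
    # Bring the time axis next to the batch axis, then merge the
    # channel and joint axes into a single feature axis.
    return x.transpose(0, 2, 1, 3).reshape(n, t, c * v)


# N=4 clips, 3 channels, 16 frames, 27 joints
x = np.arange(4 * 3 * 16 * 27, dtype=np.float32).reshape(4, 3, 16, 27)
out = flatten_pose(x)
print(out.shape)  # (4, 16, 81)
```

With this ordering, the feature at index `c * V + v` of frame `t` holds channel `c` of joint `v`.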
- class openhands.models.encoder.SGN(n_frames, num_points, in_channels=2, bias=True)[source]¶
SGN model proposed in Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition
Note
The model only supports inputs with a fixed number of frames.
- Parameters:
n_frames (int) – Number of frames in the input sequence.
num_points (int) – Number of spatial points in a graph.
in_channels (int) – Number of channels in the input data. Default: 2.
bias (bool) – Whether to use bias or not. Default: True.
- forward(input)[source]¶
- Parameters:
input (torch.Tensor) – Input tensor of shape \((N, in\_channels, T_{in}, V_{in})\)
- Returns:
Output embedding of shape \((N, n\_out\_features)\)
- Return type:
torch.Tensor
- where:
\(N\) is the batch size,
\(T_{in}\) is the length of the input sequence,
\(V_{in}\) is the number of graph nodes,
\(n\_out\_features\) is the output embedding dimension.
- class openhands.models.encoder.STGCN(in_channels, graph_args, edge_importance_weighting, n_out_features=256, **kwargs)[source]¶
Spatial temporal graph convolutional network backbone
This module is proposed in Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
- Parameters:
in_channels (int) – Number of channels in the input data.
graph_args (dict) – The arguments for building the graph.
edge_importance_weighting (bool) – If True, adds a learnable importance weighting to the edges of the graph. Default: True.
n_out_features (int) – Output embedding dimension. Default: 256.
kwargs (dict) – Other parameters for graph convolution units.
- forward(x)[source]¶
- Parameters:
x (torch.Tensor) – Input tensor of shape \((N, in\_channels, T_{in}, V_{in})\)
- Returns:
Output embedding of shape \((N, n\_out\_features)\)
- Return type:
torch.Tensor
- where:
\(N\) is the batch size,
\(T_{in}\) is the length of the input sequence,
\(V_{in}\) is the number of graph nodes,
\(n\_out\_features\) is the output embedding dimension.
Decoders
- class openhands.models.decoder.BERT(n_features, num_class, config)[source]¶
BERT decoder module.
- Parameters:
n_features (int) – Number of features in the input.
num_class (int) – Number of classes for classification.
config (dict) – Configuration set for BERT layer.
- openhands.models.decoder.FullyConnectedClassifier¶
alias of FC
- class openhands.models.decoder.RNNClassifier(n_features, num_class, rnn_type='GRU', hidden_size=512, num_layers=1, bidirectional=True, use_attention=False)[source]¶
RNN head for classification.
- Parameters:
n_features (int) – Number of features in the input.
num_class (int) – Number of classes for classification.
rnn_type (str) – GRU or LSTM. Default: GRU.
hidden_size (int) – Hidden dimension to use for the RNN. Default: 512.
num_layers (int) – Number of RNN layers to use. Default: 1.
bidirectional (bool) – Whether to use a bidirectional RNN. Default: True.
use_attention (bool) – Whether to use attention for pooling. Default: False.
SSL-Models
Datasets
Augmentations
- class openhands.datasets.pose_transforms.CenterAndScaleNormalize(reference_points_preset=None, reference_point_indexes=[], scale_factor=1, frame_level=False)[source]¶
Centers and scales the keypoints based on the given reference points.
- Parameters:
reference_points_preset (str | None, optional) – Name of an existing preset: mediapipe_holistic_minimal_27 or mediapipe_holistic_top_body_59.
reference_point_indexes (list) – Pair of point indexes (p1, p2) to use if no preset is given.
scale_factor (int) – Scaling factor. Default: 1.
frame_level (bool) – Whether to center and normalize at frame level or clip level. Default: False.
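One plausible reading of centering and scaling with two reference points is sketched below using NumPy. Everything specific here is an assumption rather than the library's confirmed behavior: the helper name `center_and_scale`, the (T, V, 2) input layout, centering on the midpoint of the two points, and scaling by the distance between them.

```python
import numpy as np


def center_and_scale(kps, p1, p2, scale_factor=1.0, frame_level=False):
    """Center and scale keypoints of shape (T, V, 2).

    Hypothetical helper: centers on the midpoint of joints p1 and p2
    and divides by their distance (both choices are assumptions).
    """
    mid = (kps[:, p1] + kps[:, p2]) / 2.0                     # (T, 2)
    dist = np.linalg.norm(kps[:, p1] - kps[:, p2], axis=-1)   # (T,)
    if not frame_level:
        # Clip level: share one center and one scale across all frames.
        mid = mid.mean(axis=0, keepdims=True)                 # (1, 2)
        dist = dist.mean(keepdims=True)                       # (1,)
    # Note: coincident reference points would give a zero distance.
    return (kps - mid[:, None, :]) / dist[:, None, None] * scale_factor


# One frame, three joints; joints 0 and 1 act as the reference pair.
kps = np.array([[[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]])
norm = center_and_scale(kps, p1=0, p2=1)
print(norm[0])
```

After normalization the reference pair sits symmetrically around the origin at unit distance, which makes clips from different camera setups comparable.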
- class openhands.datasets.pose_transforms.Compose(transforms)[source]¶
Compose a list of pose transforms
- Parameters:
transforms (list) – List of transforms to be applied.
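The idea of composing transforms can be sketched in a few lines. `ComposeSketch` is a hypothetical stand-in, not the library class, and it assumes each transform is a callable that takes the data and returns the transformed data.

```python
class ComposeSketch:
    """Apply a list of callables in order (hypothetical stand-in)."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, data):
        # Feed the output of each transform into the next one.
        for transform in self.transforms:
            data = transform(data)
        return data


pipeline = ComposeSketch([lambda d: d + 1, lambda d: d * 2])
result = pipeline(3)  # (3 + 1) * 2
print(result)
```

Order matters: swapping the two lambdas above would compute 3 * 2 + 1 instead.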
- class openhands.datasets.pose_transforms.FrameSkipping(skip_range=1)[source]¶
Skips frames based on the specified jump range.
- Parameters:
skip_range (int) – The skip range.
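A minimal sketch of frame skipping, assuming skip_range acts as a fixed stride over the temporal axis. The helper `skip_frames` is hypothetical and works on a plain list of frames rather than a tensor.

```python
def skip_frames(frames, skip_range=1):
    """Keep every skip_range-th frame (hypothetical helper)."""
    if skip_range < 1:
        raise ValueError("skip_range must be >= 1")
    # Slicing with a step keeps frames 0, skip_range, 2 * skip_range, ...
    return frames[::skip_range]


kept = skip_frames(list(range(10)), skip_range=3)
print(kept)  # [0, 3, 6, 9]
```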
- class openhands.datasets.pose_transforms.PoseRandomShift[source]¶
Randomly distributes the zero padding at the end of a video between its initial and final positions.
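The redistribution of trailing padding can be sketched as follows. The helper `random_shift` is hypothetical, and frames are represented as scalars in a list purely to keep the example short.

```python
import random


def random_shift(frames, pad_value=0, rng=random):
    """Move a random share of the trailing padding to the front.

    Hypothetical helper; only padding at the end of the sequence
    is redistributed.
    """
    n_pad = 0
    while n_pad < len(frames) and frames[len(frames) - 1 - n_pad] == pad_value:
        n_pad += 1
    content = frames[: len(frames) - n_pad]
    left = rng.randint(0, n_pad)  # inclusive on both ends
    return [pad_value] * left + content + [pad_value] * (n_pad - left)


shifted = random_shift([5, 6, 7, 0, 0, 0], rng=random.Random(0))
print(shifted)
```

The sequence length and the order of the real frames are unchanged; only the position of the content inside the padded window varies.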
- class openhands.datasets.pose_transforms.PoseSelect(preset=None, pose_indexes: list = [])[source]¶
Select the given index keypoints from all keypoints.
- Parameters:
preset (str | None, optional) – Name of an existing preset: mediapipe_holistic_minimal_27 or mediapipe_holistic_top_body_59. If None, then the pose_indexes argument will be used to select. Default: None.
pose_indexes (list) – List of indexes to select.
- class openhands.datasets.pose_transforms.PoseTemporalSubsample(num_frames, temporal_dim=1)[source]¶
Randomly subsamples num_frames indices from the temporal dimension of the sequence of keypoints. If num_frames is larger than the length of the sequence, then the remaining frames will be padded with zeros.
- Parameters:
num_frames (int) – Number of frames to subsample.
temporal_dim (int) – Index of the temporal dimension along which to subsample.
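The behavior described for random temporal subsampling can be sketched on a plain list of frames. The helper `random_subsample` is hypothetical; sampling without replacement and keeping the chosen indices in temporal order are assumptions.

```python
import random


def random_subsample(frames, num_frames, pad_value=0, rng=random):
    """Pick num_frames frames at random, or zero-pad short sequences.

    Hypothetical helper operating on a list instead of a tensor.
    """
    t = len(frames)
    if t >= num_frames:
        # Sorted indices keep the selected frames in temporal order.
        idx = sorted(rng.sample(range(t), num_frames))
        return [frames[i] for i in idx]
    return list(frames) + [pad_value] * (num_frames - t)


picked = random_subsample(list(range(1, 21)), num_frames=8, rng=random.Random(0))
padded = random_subsample([1, 2, 3], num_frames=5)
print(picked, padded)
```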
- class openhands.datasets.pose_transforms.PoseUniformSubsampling(num_frames, randomize_start_index=False, temporal_dim=1)[source]¶
Uniformly subsamples num_frames indices from the temporal dimension of the sequence of keypoints. If the num_frames is larger than the length of the sequence, then the remaining frames will be padded with zeros.
- Parameters:
num_frames (int) – Number of frames to subsample.
randomize_start_index (bool) – While performing interleaved subsampling, select start_index from randint(0, step_size). Default: False.
temporal_dim (int) – Index of the temporal dimension along which to subsample.
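Interleaved uniform subsampling can be sketched as below. The helper `uniform_subsample` is hypothetical, and computing the step as an integer division of the sequence length by num_frames is an assumption. Python's `random.randint` includes its upper bound, so the sketch draws from `[0, step - 1]` to stay inside the sequence.

```python
import random


def uniform_subsample(frames, num_frames, randomize_start_index=False,
                      pad_value=0, rng=random):
    """Take evenly spaced frames, or zero-pad short sequences.

    Hypothetical helper operating on a list instead of a tensor.
    """
    t = len(frames)
    if t <= num_frames:
        return list(frames) + [pad_value] * (num_frames - t)
    step = t // num_frames  # at least 1 because t > num_frames
    start = rng.randint(0, step - 1) if randomize_start_index else 0
    return [frames[start + i * step] for i in range(num_frames)]


even = uniform_subsample(list(range(10)), num_frames=5)
print(even)  # [0, 2, 4, 6, 8]
```

With `randomize_start_index=True` the same spacing is kept but the first index is drawn at random, which acts as a light temporal augmentation.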
- class openhands.datasets.pose_transforms.PrependLangCodeOHE(lang_codes: list)[source]¶
Prepends a one-hot encoded vector based on the language of the input video. Ideally, it should be applied last, after all other normalizations/augmentations.
- Parameters:
lang_codes (list) – List of sign language codes.
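The one-hot encoding itself can be sketched as follows. The helper `one_hot` and the example codes are hypothetical, and how the library attaches the vector to the keypoint tensor is not shown here because this page does not specify it.

```python
def one_hot(lang_code, lang_codes):
    """Return a one-hot list marking lang_code's position in lang_codes."""
    vec = [0.0] * len(lang_codes)
    # list.index raises ValueError for an unknown code.
    vec[lang_codes.index(lang_code)] = 1.0
    return vec


codes = ["lang_a", "lang_b", "lang_c"]  # placeholder codes
vec = one_hot("lang_b", codes)
print(vec)  # [0.0, 1.0, 0.0]
```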
- class openhands.datasets.pose_transforms.RandomMove(move_range=(-2.5, 2.5), move_step=0.5)[source]¶
Moves all the keypoints randomly in a random direction.
- class openhands.datasets.pose_transforms.RotatationTransform(rotation_std: float = 0.2)[source]¶
Applies 2D rotation transformation.
- Parameters:
rotation_std (float) – std to use for rotation transformation. Default: 0.2
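A 2D rotation augmentation can be sketched with the standard rotation matrix. The helper names are hypothetical; drawing the angle from a zero-mean Gaussian with standard deviation rotation_std, measuring it in radians, and rotating about the origin are all assumptions inferred from the parameter name.

```python
import math
import random


def rotate_points(points, angle):
    """Rotate (x, y) points counter-clockwise about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def random_rotation(points, rotation_std=0.2, rng=random):
    """Rotate by an angle drawn from N(0, rotation_std) (assumed)."""
    angle = rng.gauss(0.0, rotation_std)
    return rotate_points(points, angle)


quarter_turn = rotate_points([(1.0, 0.0)], math.pi / 2)
augmented = random_rotation([(1.0, 0.0), (0.0, 1.0)], rng=random.Random(0))
print(quarter_turn, augmented)
```

Rotation preserves distances between keypoints, so the pose shape is unchanged and only its orientation varies.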
- class openhands.datasets.pose_transforms.ScaleToVideoDimensions(width: int, height: int)[source]¶
Scale the pose keypoints to the given width and height values.
- Parameters:
width (int) – Width of the frames
height (int) – Height of the frames
- class openhands.datasets.pose_transforms.ScaleTransform(scale_std=0.2)[source]¶
Applies Scaling transformation
- Parameters:
scale_std (float) – std to use for Scaling transformation. Default: 0.2
- class openhands.datasets.pose_transforms.ShearTransform(shear_std: float = 0.2)[source]¶
Applies 2D shear transformation
- Parameters:
shear_std (float) – std to use for shear transformation. Default: 0.2
- class openhands.datasets.pose_transforms.TemporalSample(num_frames, subsample_mode=2)[source]¶
Chooses between uniform and random temporal subsampling, depending on subsample_mode:
If subsample_mode==2, either random subsampling or uniform sampling is picked at random.
If subsample_mode==0, only uniform sampling is used (for test sets).
If subsample_mode==1, only random subsampling is used (to reproduce results of some papers that use only subsampling).
- Parameters:
num_frames (int) – Number of frames to subsample.
subsample_mode (int) – Mode to choose.
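The mode dispatch can be sketched as a small selector. The helper `choose_sampler` is hypothetical, it returns a label instead of running a sampler, and picking the two strategies with equal probability in mode 2 is an assumption.

```python
import random


def choose_sampler(subsample_mode, rng=random):
    """Return which sampling strategy a given mode selects."""
    if subsample_mode == 0:
        return "uniform"
    if subsample_mode == 1:
        return "random"
    if subsample_mode == 2:
        # Equal odds are assumed; the page only says "randomly".
        return rng.choice(["uniform", "random"])
    raise ValueError(f"unsupported subsample_mode: {subsample_mode}")


test_time = choose_sampler(0)
train_time = choose_sampler(2, rng=random.Random(0))
print(test_time, train_time)
```

Using mode 0 at evaluation time keeps the sampled frames deterministic, which makes test metrics reproducible.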
ISLR Datasets
Base class
- class openhands.datasets.isolated.base.BaseIsolatedDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
This module provides the datasets for Isolated Sign Language Classification. Do not instantiate this class directly.
- enumerate_data_files(dir)[source]¶
Lists the video files in the given directory.
For the pose modality, generates .pkl files for all videos in the folder. If no videos are present, checks whether some .pkl files already exist.
- load_pose_from_path(path)[source]¶
Load dumped pose keypoints. The file should contain a dict with “keypoints” of shape (T, V, C) and “confidences” of shape (T, V).
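A round trip through such a pose file can be sketched as below. This assumes the .pkl files are ordinary pickles holding NumPy arrays; the helper `load_pose` and the file name are hypothetical and only illustrate the expected layout.

```python
import os
import pickle
import tempfile

import numpy as np


def load_pose(path):
    """Load a pose dict and check the documented shapes.

    Hypothetical helper. Only unpickle files from trusted sources.
    """
    with open(path, "rb") as f:
        pose = pickle.load(f)
    kps, conf = pose["keypoints"], pose["confidences"]
    # keypoints: (T, V, C); confidences: (T, V)
    if kps.ndim != 3 or conf.shape != kps.shape[:2]:
        raise ValueError("unexpected keypoint or confidence shape")
    return pose


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pkl")
    dummy = {"keypoints": np.zeros((10, 27, 3)), "confidences": np.ones((10, 27))}
    with open(path, "wb") as f:
        pickle.dump(dummy, f)
    pose = load_pose(path)

print(pose["keypoints"].shape, pose["confidences"].shape)
```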
Labelled datasets
- class openhands.datasets.isolated.ASLLVDDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
American Isolated Sign language dataset from the paper:
The American Sign Language Lexicon Video Dataset <https://ieeexplore.ieee.org/abstract/document/4563181>.
The train/test split has been taken from the paper <https://arxiv.org/pdf/1901.11164.pdf>.
- class openhands.datasets.isolated.AUTSLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
Turkish Isolated Sign language dataset from the paper:
AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods
- class openhands.datasets.isolated.Bosphorus22kDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
Turkish Isolated Sign language dataset (Bosphorus22k). Link to paper: https://arxiv.org/pdf/2004.01283.pdf
- class openhands.datasets.isolated.CSLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
Chinese Isolated Sign language dataset from the paper:
Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition
- read_original_dataset()[source]¶
Format for the word-level CSL dataset:
Naming: P01_25_19_2._color.mp4
P01: 1, signer ID (person)
25_19: (25-1)*20+19=499, label ID
2: 2, the second time performing the sign
Experiment setting (split):
train set: signer ID, [0, 1, …, 34, 35]
test set: signer ID, [36, 37, …, 48, 49]
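The naming scheme above can be decoded with a small parser. This is a sketch: the helper names and the regular expression are hypothetical and were written against the single example file name given here, and the split function takes the listed signer ID ranges at face value.

```python
import re

# Matches names such as "P01_25_19_2._color.mp4".
_CSL_NAME = re.compile(r"^P(\d+)_(\d+)_(\d+)_(\d+)\._color\.mp4$")


def parse_csl_filename(name):
    """Return (signer_id, label_id, repetition) for a CSL file name."""
    match = _CSL_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    signer, group, index, repetition = (int(g) for g in match.groups())
    label = (group - 1) * 20 + index  # e.g. (25 - 1) * 20 + 19 = 499
    return signer, label, repetition


def csl_split(signer_id):
    """Map a signer ID to its split, following the ranges listed above."""
    return "train" if signer_id <= 35 else "test"


info = parse_csl_filename("P01_25_19_2._color.mp4")
print(info, csl_split(info[0]))
```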
- class openhands.datasets.isolated.ConcatDataset(datasets, unify_vocabulary=False, **kwargs)[source]¶
- class openhands.datasets.isolated.DeviSignDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
Chinese Isolated Sign language dataset from the paper:
The devisign large vocabulary of chinese sign language database and baseline evaluations
- read_original_dataset()[source]¶
Check the file “DEVISIGN Technical Report.pdf” inside `Documents` folder for dataset format (page 12) and splits (page 15)
TODO: The train set size is 16k, and test set size is 8k (for 2k classes). Should we use 4k from test set as valset, and only the other 4k for benchmarking?
- class openhands.datasets.isolated.FingerSpellingDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
Fingerspelling datasets: ‘Argentine’, ‘American’, ‘Chinese’, ‘Indian’, ‘German’, ‘Greek’, ‘Turkish’
- class openhands.datasets.isolated.GSLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
Greek Isolated Sign language dataset from the paper:
A Comprehensive Study on Deep Learning-based Methods for Sign Language Recognition
- class openhands.datasets.isolated.INCLUDEDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
Indian Isolated Sign language dataset from the paper:
INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition
- class openhands.datasets.isolated.LSA64Dataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
Argentinian Isolated Sign language dataset from the paper:
LSA64: An Argentinian Sign Language Dataset
- read_original_dataset()[source]¶
Dataset includes 3200 videos where 10 non-expert subjects executed 5 repetitions of 64 different types of signs.
Signer-independent splits: For the train set, we use signers 1-8. Val set and test set: Signer-9 and Signer-10.
Signer-dependent splits: In the original paper, the authors split randomly and do not open-source the splits. Hence we only follow the signer-based splits we have come up with (as mentioned above).
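The signer-independent split can be sketched as a lookup on the signer ID. The helper `lsa64_split` is hypothetical, and assigning signer 9 to validation and signer 10 to test (in that order) is an assumption about how the sentence above pairs them.

```python
def lsa64_split(signer_id):
    """Return the split for an LSA64 signer ID in 1..10."""
    if 1 <= signer_id <= 8:
        return "train"
    if signer_id == 9:
        return "val"
    if signer_id == 10:
        return "test"
    raise ValueError(f"unknown signer: {signer_id}")


# Each signer contributes 64 signs x 5 repetitions.
counts = {"train": 0, "val": 0, "test": 0}
for signer in range(1, 11):
    counts[lsa64_split(signer)] += 64 * 5

print(counts, sum(counts.values()))
```

The totals add up to the 3200 videos stated above: 2560 for training and 320 each for validation and test.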
- class openhands.datasets.isolated.MSASLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
American Isolated Sign language dataset from the paper:
MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language
- class openhands.datasets.isolated.RWTH_Phoenix_Signer03_Dataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
German Isolated Sign language dataset from the paper:
RWTH-PHOENIX-Weather: A Large Vocabulary Sign Language Recognition and Translation Corpus. <https://www-i6.informatik.rwth-aachen.de/~forster/database-rwth-phoenix.php>
The Signer03 cutout has been taken for the experiments:
Image sequence - https://www-i6.informatik.rwth-aachen.de/ftp/pub/rwth-phoenix/rwth-phoenix-weather-signer03-cutout-images_20120820.tgz
Annotations - https://www-i6.informatik.rwth-aachen.de/ftp/pub/rwth-phoenix/rwth-phoenix-weather-signer03-cutout_20120820.tgz
- class openhands.datasets.isolated.WLASLDataset(root_dir, split_file=None, class_mappings_file_path=None, normalized_class_mappings_file=None, splits=['train'], modality='rgb', transforms='default', cv_resize_dims=(264, 264), pose_use_confidence_scores=False, pose_use_z_axis=False, inference_mode=False, only_metadata=False, multilingual=False, languages=None, language_set=None, seq_len=1, num_seq=1)[source]¶
American Isolated Sign language dataset from the paper: