Kaggle Competition Summary: Google Isolated Sign Language Recognition

Google Isolated Sign Language Recognition

https://www.kaggle.com/c/asl-signs/

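Competition type: deep learning, time series

Background

In the United States, 33 babies are born every day with permanent hearing loss. About 90% of them are born to hearing parents, many of whom may not know American Sign Language (ASL). Without sign language, deaf babies are at risk of language deprivation syndrome.

PopSign is a smartphone game app that makes learning ASL fun, interactive, and accessible. Players match videos of ASL signs with bubbles containing written English words in order to pop them.

Task

The goal of this competition is to classify isolated American Sign Language (ASL) signs. You will create a TensorFlow Lite model that makes predictions on the specified dataset.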

Dataset

train_landmark_files/: one file per sequence, storing the spatial positions of the hand landmarks in each frame.

train.csv: the sign labels
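As a rough illustration of how one landmark file could be turned into a model input, the sketch below reshapes a flat table of (x, y, z) rows into a (frames, landmarks, 3) array. The figure of 543 landmarks per frame and the frame-major row order are assumptions that are not stated above, and `to_frames` is a hypothetical helper name.

```python
import numpy as np

ROWS_PER_FRAME = 543  # assumption: landmarks per frame (not stated in the text above)

def to_frames(xyz_rows, rows_per_frame=ROWS_PER_FRAME):
    """Reshape flat (n_rows, 3) landmark rows into (n_frames, rows_per_frame, 3)."""
    xyz_rows = np.asarray(xyz_rows, dtype=np.float32)
    if xyz_rows.ndim != 2 or xyz_rows.shape[1] != 3:
        raise ValueError("expected an array of shape (n_rows, 3)")
    if xyz_rows.shape[0] % rows_per_frame != 0:
        raise ValueError("row count is not a multiple of rows_per_frame")
    n_frames = xyz_rows.shape[0] // rows_per_frame
    # Rows are assumed to be ordered frame by frame, so a plain reshape is enough.
    return xyz_rows.reshape(n_frames, rows_per_frame, 3)
```

The rows themselves would typically come from reading the x, y, z columns of one file with a table library; that step is omitted here.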

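Evaluation metric

The evaluation metric for this competition is simple classification accuracy. Participants submit a TensorFlow Lite model file. The model must take one or more landmark frames as input and return a float vector (the predicted probability of each sign class) as output.

The model must produce a prediction for a single sample within 100 ms, and the model weight file must be smaller than 40 MB.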
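A minimal sketch of the two checks described above: top-1 accuracy computed from per-class probability vectors, and the average per-sample latency of a prediction function. The names `accuracy` and `mean_latency_ms` and the generic `predict_fn` callable are illustrative assumptions; the real submission is scored by running the TensorFlow Lite model on Kaggle's side.

```python
import time
import numpy as np

def accuracy(probs, labels):
    """Top-1 accuracy: fraction of samples whose argmax matches the label."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    return float(np.mean(np.argmax(probs, axis=1) == labels))

def mean_latency_ms(predict_fn, samples):
    """Average wall-clock time of predict_fn over the samples, in milliseconds."""
    start = time.perf_counter()
    for sample in samples:
        predict_fn(sample)
    return (time.perf_counter() - start) * 1000.0 / len(samples)
```

A local check could then compare `mean_latency_ms(predict_fn, validation_samples)` against the 100 ms budget before submitting, keeping in mind that the scoring hardware may differ from the local machine.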

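Winning solutions

1st place

https://www.kaggle.com/competitions/asl-signs/discussion/406684

My solution combines a 1D CNN with a Transformer, trained from scratch on all of the training data (competition data only), and uses a 4-seed ensemble for submission.

I initially used PyTorch + GPU, but later switched to TensorFlow + Colab TPU (tpuv2-8) to ensure compatibility with TensorFlow Lite.

When inter-frame correlation is strong, a 1D CNN is more effective than a Transformer. In my experiments a pure 1D CNN easily outperformed the Transformer, and I reached a public LB score of 0.80 using only a 1D CNN. The Transformer is still useful, however, when placed on top of the 1D CNN (the 1D CNN can be viewed as a kind of trainable tokenizer).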
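The "1D CNN as a trainable tokenizer" idea can be illustrated with a small NumPy sketch: a temporal convolution first mixes neighbouring frames into tokens, and a single self-attention layer then relates those tokens across the whole sequence. This is a toy forward pass under assumed shapes and weights, not the author's architecture; all function names and the single-head, single-layer layout are assumptions.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Temporal convolution with 'same' padding.
    x: (T, C_in), kernel: (K, C_in, C_out) -> (T, C_out)."""
    k = kernel.shape[0]
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    xp = np.pad(x, ((pad_left, pad_right), (0, 0)))
    # Each output frame mixes K neighbouring input frames.
    windows = np.stack([xp[i:i + x.shape[0]] for i in range(k)], axis=1)  # (T, K, C_in)
    return np.einsum("tkc,kco->to", windows, kernel)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over (T, D) tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def cnn_then_transformer(x, kernel, wq, wk, wv):
    """Convolution produces local tokens; attention adds global context."""
    tokens = np.maximum(conv1d_same(x, kernel), 0.0)    # ReLU
    return tokens + self_attention(tokens, wq, wk, wv)  # residual connection
```

The convolution only sees a few adjacent frames at a time, which suits strongly correlated neighbouring frames, while the attention layer on top can relate distant parts of the sign.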

Regularization
- Drop Path (stochastic depth, p=0.2)
- High rate of Dropout (p=0.8)
- AWP (Adversarial Weight Perturbation, with lambda = 0.2)
Augmentation
- hflip
- Random Affine(Scale, shift, rotate, shear)
- Random Cutout
- Random resample (0.5x ~ 1.5x to original length)
- Random masking
- temporal augmentation
- Spatial augmentation
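Three of the augmentations listed above can be sketched in NumPy for a landmark sequence of shape (frames, landmarks, coordinates). The 0.5x to 1.5x resampling range comes from the list; the masking probability, the assumption that x lies in [0, 1], the linear interpolation, and the left/right index lists are illustrative choices rather than details from the original solution.

```python
import numpy as np

def random_resample(x, rng, low=0.5, high=1.5):
    """Stretch or squeeze the time axis by a random factor (linear interpolation).
    x: (T, N, C)."""
    t = x.shape[0]
    new_t = max(1, int(round(t * rng.uniform(low, high))))
    src = np.linspace(0.0, t - 1, new_t)  # fractional source positions
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, t - 1)
    w = (src - lo)[:, None, None]
    return (1.0 - w) * x[lo] + w * x[hi]

def random_mask_frames(x, rng, p=0.1):
    """Zero out each frame independently with probability p (p is an assumed value)."""
    keep = rng.random(x.shape[0]) >= p
    return x * keep[:, None, None]

def hflip(x, left_idx, right_idx):
    """Mirror x coordinates (assumed to lie in [0, 1]) and swap left/right landmarks."""
    out = x.copy()
    out[..., 0] = 1.0 - out[..., 0]
    out[:, left_idx], out[:, right_idx] = out[:, right_idx].copy(), out[:, left_idx].copy()
    return out
```

A typical use would draw a generator once, for example `rng = np.random.default_rng(seed)`, and apply these functions to each training sample on the fly.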

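2nd place

https://www.kaggle.com/competitions/asl-signs/discussion/406306

We used an approach similar to audio spectrogram classification, built on an EfficientNet-B0 model with heavy augmentation, plus Transformer models (BERT and DeBERTa) as auxiliary models.

The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold out of 8 randomly split folds, together with a DeBERTa and a BERT trained on the full dataset. The single-fold EfficientNet model had a CV score of 0.898 and a leaderboard score of about 0.8.

CNN preprocessing
- Extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points, for a total of 80 points.
- Applied various augmentations and standard normalization.
- Did not drop NaN values; instead filled them with zeros after normalization.
- Interpolated the time axis using "nearest" interpolation to ...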
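The normalization, NaN filling, and nearest-neighbour time resizing steps above can be sketched as follows. The per-channel statistics and the index formula are assumptions about how "standard normalization" and "nearest" interpolation might be implemented, and `target_len` is left as a parameter because the target length is not given above.

```python
import numpy as np

def normalize_and_fill(x):
    """Standardize with NaN-aware statistics, then replace NaNs with zeros.
    x: (T, N, C); statistics are computed per coordinate channel (an assumption)."""
    mean = np.nanmean(x, axis=(0, 1), keepdims=True)
    std = np.nanstd(x, axis=(0, 1), keepdims=True)
    z = (x - mean) / np.where(std > 0, std, 1.0)
    # Filling after normalization puts missing points at the mean (zero).
    return np.nan_to_num(z, nan=0.0)

def resize_time_nearest(x, target_len):
    """Resize the time axis to target_len by nearest-neighbour frame selection."""
    t = x.shape[0]
    idx = np.floor((np.arange(target_len) + 0.5) * t / target_len).astype(int)
    return x[np.clip(idx, 0, t - 1)]
```

Filling with zeros only after normalization means a missing landmark is indistinguishable from a landmark at the mean position, which keeps the "image" fed to the CNN free of extreme values.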
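Transformer preprocessing
- Kept 61 points: 40 lip points and 21 hand points. Of the left and right hands, kept the one with fewer NaNs. If the right hand was kept, it was mirrored to the left hand.
- Applied augmentation, normalization, and NaN filling, in that order.
- Sequences longer than 96 frames were interpolated down to 96; sequences shorter than 96 were left unchanged.
- In addition to the raw positions, used handcrafted features including motion, distances, and cosines of angles.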
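The hand selection and the handcrafted features described above can be sketched in NumPy. The exact feature definitions are assumptions: motion is taken as the frame-to-frame difference, distances as all pairwise point distances, and angles are defined over index triplets supplied by the caller. Mirroring assumes x coordinates in [0, 1].

```python
import numpy as np

def pick_hand(left, right):
    """Keep the hand with fewer NaNs; mirror the right hand onto the left.
    left, right: (T, 21, C) with x in [0, 1] (assumed)."""
    if np.isnan(left).sum() <= np.isnan(right).sum():
        return left
    mirrored = right.copy()
    mirrored[..., 0] = 1.0 - mirrored[..., 0]
    return mirrored

def motion(x):
    """Frame-to-frame displacement; the first frame gets zeros."""
    return np.diff(x, axis=0, prepend=x[:1])

def pairwise_distances(x):
    """Euclidean distance between every pair of points in each frame: (T, N, N)."""
    diff = x[:, :, None, :] - x[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

def angle_cosines(x, triplets):
    """Cosine of the angle at point b for each (a, b, c) index triplet: (T, len(triplets))."""
    a, b, c = (x[:, list(idx)] for idx in zip(*triplets))
    u, v = a - b, c - b
    denom = np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
    return np.sum(u * v, axis=-1) / np.maximum(denom, 1e-8)
```

Distances and angle cosines are unchanged by translation and rotation of the hand, which is one plausible reason to add them next to the raw positions.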
Augmentations
- Random affine
- Random interpolation
- Flip pose
- Finger tree rotate

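3rd place

https://www.kaggle.com/competitions/asl-signs/discussion/406568

We used an ensemble of six conv1d models and two Transformer models. The key points of the solution are data preprocessing, hard augmentation, and ensembling.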

Preprocessing
- 20 lip points, 32 eye points, 42 hand points (left and right hands), and 8 pose points.
- The input sequence is normalized using the shoulder, hip, lip, and eye points.
- NaN values are filled with 0.0.
- A motion embedding is learned from the input sequence.
Augmentation
- Global augmentation (the same transform applied to all frames): rotation (-10, 10), shift (-0.1, 0.1), scale (0.8, 1.2), shear (-1.0, 1.0), and flip (applied only to some signs).
- Time-based augmentation (applied to a subset of frames): apply affine augmentation to 1-8 randomly selected frames, and randomly drop frames (filled with 0.0).
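The global augmentation can be sketched as a single 2D affine transform applied identically to every frame. The parameter ranges are the ones listed above; treating rotation and shear as angles in degrees, transforming about the mean point, and the function names are all assumptions, since the units and pivot are not specified here.

```python
import numpy as np

def random_affine_params(rng):
    """Draw one set of parameters from the ranges listed above."""
    return {
        "angle_deg": rng.uniform(-10.0, 10.0),
        "shift": rng.uniform(-0.1, 0.1, size=2),
        "scale": rng.uniform(0.8, 1.2),
        "shear_deg": rng.uniform(-1.0, 1.0),  # assumption: shear given in degrees
    }

def apply_global_affine(xy, angle_deg=0.0, shift=(0.0, 0.0), scale=1.0, shear_deg=0.0):
    """Apply the same 2D affine transform to every frame.
    xy: (T, N, 2); NaN points stay NaN. The transform pivots on the mean point."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    shear = np.array([[1.0, np.tan(np.deg2rad(shear_deg))],
                      [0.0, 1.0]])
    matrix = (scale * rot) @ shear
    center = np.nanmean(xy, axis=(0, 1), keepdims=True)
    return (xy - center) @ matrix.T + center + np.asarray(shift)
```

One augmented sample would then be `apply_global_affine(xy, **random_affine_params(rng))`. The time-based variant could reuse the same function on a randomly chosen subset of frames.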

4th place

https://www.kaggle.com/competitions/asl-signs/discussion/406673

Modeling
- The first is a model that classifies fixed-length sequences (1DCNN-FixLen).
- The second is a model that classifies variable-length sequences (1DCNN-VariableLen).
Data Augmentation
- Randomly drop frames (p=0.3).
- Augment hand position, size, and angle.
Preprocessing
- Use XY coordinates only.
- Shift the coordinates so that the point between the eyebrows is at (0, 0).
- Compare the number of frames in which the right and left hands are detected, and flip the sample if the left hand is dominant.
- Use the XY coordinates of 21 right-hand feature points (the left hand is flipped) and 40 lip feature points.
- Delete frames in which no hand feature points are detected.
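The preprocessing steps above, together with the random frame dropping from the augmentation list, can be sketched as follows. The array layout, the use of NaN for undetected points, and the reading of "p=0.3" as a per-frame drop probability are assumptions. The sketch also mirrors the lips without swapping left and right lip indices, which a real pipeline would likely need to handle.

```python
import numpy as np

def preprocess(left, right, lips, brow_center):
    """left, right: (T, 21, 2); lips: (T, 40, 2); brow_center: (T, 2).
    Undetected points are NaN. Returns (T_kept, 61, 2)."""
    left_frames = np.sum(~np.isnan(left).all(axis=(1, 2)))
    right_frames = np.sum(~np.isnan(right).all(axis=(1, 2)))
    origin = brow_center[:, None, :]  # the point between the eyebrows becomes (0, 0)
    hand, face = right - origin, lips - origin
    if left_frames > right_frames:
        # Left-handed sample: mirror about the origin so the hand looks like a right hand.
        mirror = np.array([-1.0, 1.0])
        hand = (left - origin) * mirror
        face = face * mirror
    # Keep only frames where at least one hand point was detected.
    detected = ~np.isnan(hand).all(axis=(1, 2))
    return np.concatenate([hand, face], axis=1)[detected]

def random_drop_frames(x, rng, p=0.3):
    """Drop each frame with probability p, always keeping at least one frame."""
    keep = rng.random(x.shape[0]) >= p
    if not keep.any():
        keep[rng.integers(x.shape[0])] = True
    return x[keep]
```

Because frames without a hand are removed, the output length varies from sample to sample, which fits the split into fixed-length and variable-length models described under Modeling.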