

Ruize Xia
Class of 2027
Nanjing, Jiangsu
Projects
- "Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation" with mentor Khushi (Working project)
Project Portfolio
Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation
Started July 14, 2025

Abstract
Sign language is a primary communication channel for millions of Deaf and hard-of-hearing people, yet generating signer video directly from text remains difficult because video diffusion models are expensive to train and evaluate. This paper presents Text2Sign, a text-conditioned diffusion architecture for short sign-language video clips, designed to run on a single NVIDIA L4 GPU rather than on multi-node training infrastructure. The model combines a frozen vision-language text encoder with a three-dimensional encoder-decoder backbone and factorized spatial and temporal attention, reducing the cost of full spatio-temporal attention while preserving motion coherence. Three design choices are examined: whether transformer-style blocks improve on convolution-only baselines, whether a frozen pretrained text encoder yields lower loss than a task-specific encoder trained from scratch under the present short-budget comparison, and whether factorized attention is competitive with full video attention. On a signer-disjoint partition of short clips extracted from How2Sign, the best short-run ablation attains a validation loss of 0.0648, while a longer-run checkpoint reaches 0.00999. A compact evaluation slice of that checkpoint yields SSIM 0.2403 ± 0.0238, PSNR 15.11 ± 0.42 dB, and temporal consistency 1.0000 ± 0.0000; under an 8-step DDIM setting with guidance scale 5.0, the model generates a 32-frame 64×64 clip in 12.60 s (2.54 frames/s) with 3.12 GB peak inference memory on a single NVIDIA L4. In a held-out conditional denoising audit on real validation clips, removing text raises the late-timestep denoising loss from 0.9875 to 0.9891, whereas shuffled prompts remain nearly indistinguishable from the intended prompt. Thus, frozen text conditioning yields a lower short-budget validation loss than the custom-encoder baseline, and the post-revision checkpoint is qualitatively stronger than the earlier baseline in direct side-by-side inspection; however, held-out audits still show only weak prompt-specific separation. The system remains limited to low-resolution short clips and does not yet include expert linguistic evaluation; accordingly, the reported results should be read as a single-GPU research baseline rather than a complete solution to sign-language production. The code is publicly available at https://github.com/xiaruize0911/text2sign.
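
For readers unfamiliar with factorized attention, the sketch below illustrates the general pattern the abstract refers to: attention is applied within each frame, then across frames at each spatial position, instead of over all space-time tokens at once. This is a minimal PyTorch illustration, not the project's actual module; the class name, layer layout, and dimensions are assumptions.

```python
import torch
import torch.nn as nn


class FactorizedSTAttention(nn.Module):
    """Attend over the H*W positions of each frame, then over the T frames
    at each spatial position. This costs roughly O(T*(HW)^2 + HW*T^2)
    versus O((T*HW)^2) for full spatio-temporal attention."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(channels)
        self.norm_t = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Spatial pass: each frame's H*W positions form one token sequence.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        q = self.norm_s(s)
        s = s + self.spatial(q, q, q, need_weights=False)[0]
        # Temporal pass: each spatial position's T frames form one sequence.
        v = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        q = self.norm_t(v)
        v = v + self.temporal(q, q, q, need_weights=False)[0]
        # Restore the (B, C, T, H, W) video layout.
        return v.reshape(b, h * w, t, c).permute(0, 3, 2, 1).reshape(b, c, t, h, w)
```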
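
The reported inference setting (8 DDIM steps, guidance scale 5.0, 32-frame 64×64 clips) corresponds to the standard deterministic DDIM sampling loop with classifier-free guidance. A minimal sketch of that loop is shown below, assuming a noise-prediction model; `model`, `alphas_cumprod`, `text_emb`, and `null_emb` are hypothetical stand-ins, not the repo's actual interfaces.

```python
import torch


@torch.no_grad()
def ddim_sample(model, alphas_cumprod, text_emb, null_emb,
                shape=(1, 3, 32, 64, 64), steps=8, guidance=5.0, device="cuda"):
    # Evenly spaced timesteps from the end of the training schedule down to 0.
    T = alphas_cumprod.shape[0]
    ts = torch.linspace(T - 1, 0, steps, device=device).long()
    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    for i, t in enumerate(ts.tolist()):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[ts[i + 1]] if i + 1 < steps else x.new_tensor(1.0)
        tb = torch.full((shape[0],), t, device=device, dtype=torch.long)
        # Classifier-free guidance: push the conditional noise prediction
        # away from the unconditional one by the guidance scale.
        eps_c = model(x, tb, text_emb)
        eps_u = model(x, tb, null_emb)
        eps = eps_u + guidance * (eps_c - eps_u)
        # Deterministic DDIM update (eta = 0).
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x
```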
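
The conditional denoising audit can be read as follows: noise real validation clips at a late timestep, then compare the model's denoising loss under the intended prompt, an empty prompt, and prompts shuffled across the batch; a conditioning-sensitive model should do worst without text and with mismatched text. The sketch below captures that protocol under assumed names (`encode_text`, the loader format, `t_late`); it is not the repo's evaluation script.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def conditioning_audit(model, encode_text, loader, alphas_cumprod,
                       t_late=900, device="cuda"):
    losses = {"intended": [], "no_text": [], "shuffled": []}
    for clips, prompts in loader:                    # clips: (B, C, T, H, W)
        x0 = clips.to(device)
        b = x0.shape[0]
        t = torch.full((b,), t_late, device=device, dtype=torch.long)
        noise = torch.randn_like(x0)
        a = alphas_cumprod[t].view(b, 1, 1, 1, 1)
        xt = a.sqrt() * x0 + (1 - a).sqrt() * noise  # noise clips at a late step
        variants = {
            "intended": encode_text(list(prompts)),
            "no_text": encode_text([""] * b),
            "shuffled": encode_text([prompts[i] for i in torch.randperm(b).tolist()]),
        }
        for name, cond in variants.items():
            losses[name].append(F.mse_loss(model(xt, t, cond), noise).item())
    return {k: sum(v) / len(v) for k, v in losses.items()}
```

Small gaps between the "intended" and "shuffled" averages would indicate the weak prompt-specific separation the abstract reports.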