SPEAKER RETRIEVAL FOR TV SHOW VIDEOS BY ASSOCIATING AUDIO SPEAKER RECOGNITION RESULT TO VISUAL FACES

Yina Han, Joseph Razik, Gerard Chollet, Guizhong Liu
2008-01-01
Abstract: Person retrieval based solely on visual face recognition is hard because of the well-known problems of illumination, pose, size, and expression variation, which can exceed the variation due to identity. Fortunately, videos are often accompanied by other modalities, such as audio and text. In this paper, we propose a framework that associates the who and when information provided by speaker recognition with the faces present in the frame sequence, in order to retrieve speakers in TV show videos. First, shot segmentation and clustering are carried out. Then face detection and tracking are applied to further locate the faces spatially. Finally, given the identity and time information from the speaker recognition result, we point out three ambiguities to be resolved and propose corresponding solutions.
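The association step can be illustrated with a minimal sketch. The abstract does not specify the three ambiguities or how they are resolved, so the code below is not the authors' method: it assumes a simple temporal-overlap rule, in which each speaker segment (who and when) is linked to the face track that shares the most screen time with it. The names `SpeakerSegment`, `FaceTrack`, `overlap`, and `associate` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SpeakerSegment:
    """Hypothetical speaker recognition output: who spoke, and when (seconds)."""
    speaker: str
    start: float
    end: float


@dataclass(frozen=True)
class FaceTrack:
    """Hypothetical face tracking output: one tracked face and its time span."""
    track_id: int
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def associate(
    segments: List[SpeakerSegment], tracks: List[FaceTrack]
) -> List[Tuple[str, Optional[int]]]:
    """Link each speaker segment to the face track with the largest overlap.

    Assumption: temporal overlap alone decides the match. A segment with no
    overlapping face track (for example, an off-screen speaker) maps to None.
    """
    result = []
    for seg in segments:
        best_id = None
        best_overlap = 0.0
        for track in tracks:
            shared = overlap(seg.start, seg.end, track.start, track.end)
            if shared > best_overlap:
                best_id, best_overlap = track.track_id, shared
        result.append((seg.speaker, best_id))
    return result


# Example input with made-up labels and times.
segments = [
    SpeakerSegment("alice", 0.0, 5.0),
    SpeakerSegment("bob", 5.0, 9.0),
    SpeakerSegment("carol", 20.0, 22.0),
]
tracks = [FaceTrack(1, 0.0, 4.0), FaceTrack(2, 4.5, 10.0)]
matches = associate(segments, tracks)
```

A rule this simple cannot tell apart several faces that are visible during the same speech segment, nor a speaker who is heard while someone else is on screen. Cases of that kind are presumably where the ambiguities mentioned in the abstract arise.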