Text Detection, Tracking and Recognition in Video

Xu-Cheng Yin,Ze-Yu Zuo,Shu Tian,Cheng-Lin Liu

IF: 10.6

2016-01-01

IEEE Transactions on Image Processing

Abstract:The intelligent analysis of video data is currently in wide demand because a video is a major source of sensory data in our lives. Text is a prominent and direct source of information in video, whi...


A new video text detection method.

Jie Yuan,Baogang Wei,Weiming Lu,Lidong Wang

DOI: https://doi.org/10.1145/1998076.1998142

2011-01-01

Abstract:Nowadays, digital libraries contain more and more videos, and how to organize and retrieve those videos effectively has become an urgent problem. Text in videos is a very meaningful clue for video semantic understanding, and it can be used for video organization and retrieval. However, existing text recognition methods cannot deal well with multilingual text or text embedded in a complex background. In this paper, we propose a novel video text detection method. Edge detection and candidate region extraction are first used to obtain rough candidate text regions, and then region refinement is used to obtain the accurate location of each region. Based on our observation that the non-zero pixels of a real text region are uniformly distributed in its binary image, an entropy filter is used to remove non-text regions. Experiments on various videos show that our method is effective and robust against different languages, background complexities and font styles.
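The entropy filter mentioned in this abstract can be illustrated with a rough sketch. The abstract does not say how the entropy is computed, so the version below is only one plausible reading: it splits a binary region into a grid, counts non-zero pixels per cell, and treats a near-uniform spread (high Shannon entropy) as text-like. The function names, the 2x4 grid and the 0.9 ratio are all assumptions made for illustration, not the authors' parameters.

```python
import math


def distribution_entropy(binary, rows=2, cols=4):
    """Shannon entropy (in bits) of how non-zero pixels spread over a grid.

    `binary` is a list of rows of 0/1 values. The grid size is a
    hypothetical choice, not taken from the paper.
    """
    height, width = len(binary), len(binary[0])
    counts = [0] * (rows * cols)
    for y in range(height):
        for x in range(width):
            if binary[y][x]:
                cell = (y * rows // height) * cols + (x * cols // width)
                counts[cell] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def looks_like_text(binary, rows=2, cols=4, min_ratio=0.9):
    """Keep a region only if its entropy is close to the uniform maximum.

    `min_ratio` is an assumed threshold for this sketch.
    """
    max_entropy = math.log2(rows * cols)
    return distribution_entropy(binary, rows, cols) >= min_ratio * max_entropy


# A region whose foreground pixels are evenly spread, and one where they
# are clustered in a single corner.
uniform = [[1, 0, 1, 0, 1, 0, 1, 0] for _ in range(4)]
clustered = [[1, 1, 0, 0, 0, 0, 0, 0] for _ in range(2)] + [[0] * 8 for _ in range(2)]
```

In this sketch the evenly spread region passes the filter and the clustered one is rejected; a real system would tune the grid and threshold on labelled data.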
A Novel Approach to Text Detection and Extraction from Videos by Discriminative Features and Density

Wei Baogang,Zhang Yin,Yuan Jie,Liu Yonghuai,Wang Lidong

2014-01-01

Abstract:Text is very important to video retrieval, indexing, and understanding. However, its detection and extraction are challenging due to varying backgrounds, low contrast between text and non-text regions, and perspective distortion. In this paper, we propose a novel two-phase approach to tackling this problem using discriminative features and edge density. The first phase defines and extracts a novel feature called edge distribution entropy and then uses this feature to remove most non-text regions. The second phase employs a support vector machine (SVM) to further distinguish real text regions from non-text ones. To generate inputs for the SVM, three additional novel features are defined and extracted from each region: a foreground pixel distribution entropy, skeleton/size ratio, and edge density. After text regions have been detected, texts are extracted from those regions that are surrounded by sufficient edge pixels. A comparative study using two publicly accessible datasets shows that the proposed method significantly outperforms four selected state-of-the-art methods for accurate text detection and extraction.
Text Detection and Recognition Technique in Video

ZHU Chengjun,LI Chao,XIONG Zhang

DOI: https://doi.org/10.3969/j.issn.1000-3428.2007.10.078

2007-01-01

Abstract:Text presented in video frames can provide important supplemental information for video indexing and retrieval. To help researchers understand the field more systematically, this paper gives an overview of the state of the art of text detection and recognition techniques in video. Typical techniques and methods based on features and learning theory are discussed, as well as their merits and shortcomings. In light of the present problems, the paper outlines some work and issues that can be researched in the future.
A Research on Video Text Tracking and Recognition

Baokang Wang,Changsong Liu,Xiaoqing Ding

DOI: https://doi.org/10.1117/12.2009441

2013-01-01

Abstract:Nowadays, video has gradually become the mainstream dissemination medium owing to its rich information capacity and intelligibility, and texts in videos often carry significant semantic information, thus contributing greatly to video content understanding and the construction of content-based video retrieval systems. Text-based video analyses usually consist of text detection, localization, tracking, segmentation and recognition. There has been a large amount of research done on video text detection and tracking, but most solutions focus on text content processing in static frames, with few making full use of the redundancy between video frames. In this paper, a unified framework for text detection, localization and tracking in video frames is proposed. We select the edge and corner distribution of text blocks as text features, with which localization and tracking are performed. By making good use of the redundancy between frames, location relations and motion characteristics are determined, thus effectively reducing false alarms and raising the correct rate of localization. Tracking schemes are proposed for static and rolling texts respectively. Through multi-frame integration, text quality is improved, and so is the correct rate of OCR. Experiments demonstrate the reduction of false alarms and the increase in the correct rate of localization and recognition.
Text Recognition in Video Using OCR

陈义,李言俊,孙小炜

DOI: https://doi.org/10.3778/j.issn.1002-8331.2010.10.057

2010-01-01

Abstract:A simple and effective method is presented for real-time text segmentation and recognition in videos. First, an algorithm is provided that detects text events and extracts edges, then applies a size restriction to the edges, and finally removes non-text regions according to the textual energy. Overlaying the detected horizontal and vertical edges enhances the text edges, and the size restriction on edges helps to remove non-text edges. Next, image projection is applied to obtain text regions. Finally, the text regions are processed by OCR technology. The combination of these methods guarantees the performance of this algorithm.
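The image projection step in this abstract is commonly implemented as a row-wise projection profile. The sketch below assumes a binary edge map given as a list of rows and reports the vertical extent of each candidate text line; the function name `row_bands` and the `min_count` threshold are hypothetical, and the paper may project differently (for example in both directions).

```python
def row_bands(edge_map, min_count=1):
    """Return (top, bottom) row ranges whose edge-pixel count reaches min_count.

    `edge_map` is a list of rows of 0/1 values. Each returned band is a
    run of consecutive rows that may contain a text line.
    """
    # Horizontal projection: number of edge pixels in every row.
    profile = [sum(1 for value in row if value) for row in edge_map]

    bands = []
    start = None
    for y, count in enumerate(profile):
        if count >= min_count and start is None:
            start = y  # a band begins
        elif count < min_count and start is not None:
            bands.append((start, y - 1))  # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


# Two groups of edge rows separated by an empty row.
example = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
]
bands = row_bands(example)
```

Applying the same idea to the columns inside each band would give the left and right bounds of the text region.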
Text detection, localization, and tracking in compressed video

Xueming Qian,Guizhong Liu,Huan Wang,Rui Su

DOI: https://doi.org/10.1016/j.image.2007.06.005

2007-01-01

Abstract:Video text information plays an important role in semantic-based video analysis, indexing and retrieval. Video texts are closely related to the content of a video. Usually, the fundamental steps of text-based video analysis, browsing and retrieval consist of video text detection, localization, tracking, segmentation and recognition. Video sequences are commonly stored in compressed formats where MPEG coding techniques are often adopted. In this paper, a unified framework for text detection, localization, and tracking in compressed videos using the discrete cosine transform (DCT) coefficients is proposed. A coarse-to-fine text detection method is used to find text blocks in terms of the block DCT texture intensity information. The DCT texture intensity of an 8x8 block of an intra-frame is approximately represented by seven AC coefficients. The candidate text block regions are further verified and refined. The text block region localization and tracking are carried out by virtue of the horizontal and vertical block texture intensity projection profiles. The appearing and disappearing frames of each text line are determined by the text tracking. The final experimental results show the effectiveness of the proposed methods.
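To make the idea of a block DCT texture intensity concrete, the sketch below computes a few AC coefficients of an 8x8 block with the standard 2-D DCT-II and sums their magnitudes. The abstract does not state which seven AC coefficients are used or how they are combined, so the `AC_POSITIONS` list and the sum of absolute values are assumptions. A compressed-domain method would read the coefficients from the MPEG bitstream rather than recompute them from pixels as done here.

```python
import math

# Hypothetical choice of seven low-frequency AC positions (u, v);
# the paper's actual selection is not given in the abstract.
AC_POSITIONS = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (3, 0), (1, 1)]


def dct_coefficient(block, u, v):
    """Coefficient (u, v) of the orthonormal 8x8 2-D DCT-II of `block`."""
    scale_u = 1 / math.sqrt(2) if u == 0 else 1.0
    scale_v = 1 / math.sqrt(2) if v == 0 else 1.0
    total = 0.0
    for y in range(8):
        for x in range(8):
            total += (
                block[y][x]
                * math.cos((2 * y + 1) * u * math.pi / 16)
                * math.cos((2 * x + 1) * v * math.pi / 16)
            )
    return 0.25 * scale_u * scale_v * total


def texture_intensity(block):
    """Sum of absolute values of the selected AC coefficients (assumed measure)."""
    return sum(abs(dct_coefficient(block, u, v)) for u, v in AC_POSITIONS)


# A flat block has no texture; a block with a sharp vertical edge does.
flat_block = [[128] * 8 for _ in range(8)]
edge_block = [[255] * 4 + [0] * 4 for _ in range(8)]
```

Blocks whose intensity exceeds a threshold would be treated as candidate text blocks in a coarse pass; the threshold itself would have to be chosen empirically.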
A New Technique for Multi-Oriented Scene Text Line Detection and Tracking in Video

Liang Wu,Palaiahnakote Shivakumara,Tong Lu,Chew Lim Tan

DOI: https://doi.org/10.1109/tmm.2015.2443556

IF: 7.3

2015-01-01

IEEE Transactions on Multimedia

Abstract:Text detection and tracking in video is challenging due to contrast, resolution and background variations, and different orientations and text movements. In addition, the presence of both caption and scene texts in video aggravates the problem because these two text types differ in characteristics significantly. This paper proposes a new technique for detecting and tracking video texts of any orientation by using spatial and temporal information, respectively. The technique explores gradient directional symmetry at component level for smoothing edge components before text detection. Spatial information is preserved by forming Delaunay triangulation in a novel way at this level, which results in text candidates. Text characteristics are then proposed in a different way for eliminating false text candidates, which results in potential text candidates. Then grouping is proposed for combining potential text candidates regardless of orientation based on the nearest neighbor criterion. To tackle the problems of multi-font and multi-sized texts, we propose multi-scale integration by a pyramid structure, which helps in extracting full text lines. Then, the detected text lines are tracked in video by matching the subgraphs of triangulation. Experimental results for text detection and tracking on our video dataset, the benchmark video datasets, and the natural scene image benchmark datasets show that the proposed method is superior to the state-of-the-art methods in terms of recall, precision, and F-measure.
Video text detection and segmentation for optical character recognition

Chong-Wah Ngo,Chi-Kwong Chan

DOI: https://doi.org/10.1007/s00530-004-0157-0

IF: 3.9

2005-03-01

Multimedia Systems

Abstract:In this paper, we present approaches to detecting and segmenting text in videos. The proposed video-text-detection technique is capable of adaptively applying appropriate operators for video frames of different modalities by classifying the background complexities. Effective operators such as the repeated shifting operations are applied for the noise removal of images with high edge density. Meanwhile, a text-enhancement technique is used to highlight the text regions of low-contrast images. A coarse-to-fine projection technique is then employed to extract text lines from video frames. Experimental results indicate that the proposed text-detection approach is superior to the machine-learning-based (such as SVM and neural network), multiresolution-based, and DCT-based approaches in terms of detection and false-alarm rates. Besides text detection, a technique for text segmentation is also proposed based on adaptive thresholding. A commercial OCR package is then used to recognize the segmented foreground text. A satisfactory character-recognition rate is reported in our experiments.

computer science, information systems, theory & methods
The Developments and Challenges of Text Detection Algorithms

Yi-xin LI,Jin-wen MA

DOI: https://doi.org/10.16798/j.issn.1003-0530.2017.04.016

2017-01-01

Journal of Signal Processing

Abstract:Recognition and understanding of text from natural scenes is fundamental to a variety of practical applications in the fields of computer vision and intelligent information processing. The objective of text detection algorithms is to detect and localize text regions precisely in natural images. Therefore, text detection is a major part of recognizing and understanding text from natural scenes, and has become a very popular topic in recent years. In this paper, we first introduce the objective, methods, and challenges of text detection. We then review some classic text detection algorithms, and introduce two deep-learning-based algorithms that represent the trends of text detection research. Moreover, we summarize the typical text detection datasets available, as well as the results of representative and leading algorithms on these datasets. Finally, we summarize current research on text detection and the challenges we face, and point out some prospective directions for text detection.
End-to-end video text detection with online tracking

Hongyuan Yu,Yan Huang,Lihong Pi,Chengquan Zhang,Xuan Li,Liang Wang

DOI: https://doi.org/10.1016/j.patcog.2020.107791

IF: 8

2021-05-01

Pattern Recognition

Abstract:Text in videos usually acts as an important semantic cue, which is helpful for video analysis. Video text detection is considered one of the most difficult tasks in document analysis due to the following two challenges: 1) the difficulties caused by video scenes, i.e., motion blur, illumination changes, and occlusion; 2) the properties of text, including variations in fonts, languages, orientations, and shapes. Most existing methods try to improve video text detection through video text tracking, but treat these two tasks separately. This can significantly increase the amount of computation and cannot take full advantage of the supervisory information of both tasks. In this work, we introduce an explainable descriptor, which combines appearance, geometry and PHOC features, to establish a bridge between detection and tracking, and build an end-to-end video text detection model with online tracking to address these challenges together. By integrating these two branches into one trainable framework, they can promote each other and the computational cost is significantly reduced. Besides, the introduced explainable descriptor also gives our end-to-end model inherent interpretability. Experiments on existing video text benchmarks including ICDAR 2013 Video, DOST, Minetto and YVT verify the role of explainable descriptors in improving model expression ability, and the proposed method significantly outperforms state-of-the-art methods. Our method improves F-score by more than 2% on all datasets and achieves 81.52% on the MOTA of the Minetto dataset.

computer science, artificial intelligence,engineering, electrical & electronic
An Automatic Video Text Detection, Localization and Extraction Approach.

Chengjun Zhu,Yuanxin Ouyang,Lei Gao,Zhenyong Chen,Zhang Xiong

DOI: https://doi.org/10.1007/978-3-642-01350-8_1

2009-01-01

Abstract:Text in video is a very compact and accurate clue for video indexing and summarization. This paper presents an algorithm that regards a word group as a special symbol in order to detect, localize and extract video text automatically using a support vector machine (SVM). First, four Sobel operators are applied to get the edge map (EM) of the video frame, and the EM is segmented into blocks of size N×2N. Then character features and character group structure features are extracted to construct a 19-dimension feature vector. We use a pre-trained SVM to partition the blocks into two classes: text and non-text blocks. Second, a dilatation-shrink process is employed to adjust the text position. Finally, text regions are enhanced using multiple-frame information. After binarization of the enhanced text region, the text region with a clean background is recognized by OCR software. Experimental results show that the proposed method can detect, localize, and extract video texts with high accuracy.
Video text rediscovery: Predicting and tracking text across complex scenes

Veronica Naosekpam,Nilkanta Sahu

DOI: https://doi.org/10.1111/coin.12686

2024-06-20

Computational Intelligence

Abstract:Dynamic texts in scene videos provide valuable insights and semantic cues crucial for video applications. However, the movement of this text presents unique challenges, such as blur, shifts, and blockages. While efficient in tracking text, state‐of‐the‐art systems often struggle when text becomes obscured or scenes become complicated. This study introduces a novel method for detecting and tracking video text, specifically designed to predict the location of obscured or occluded text in subsequent frames using a tracking‐by‐detection paradigm. Our approach begins with a primary detector to identify text within individual frames, thus enhancing tracking accuracy. Using the Kalman filter, Munkres algorithm, and deep visual features, we establish connections between text instances across frames. Our technique works on the concept that when text goes missing in a frame due to obstructions, we use its previous speed and location to predict its next position. Experiments conducted on the ICDAR2013 Video and ICDAR2015 Video datasets confirm our method's efficacy, matching or surpassing established methods in performance.
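The prediction idea in this abstract (reuse the previous speed and location when a detection is missing) can be shown with a deliberately simplified constant-velocity tracker. This is not a Kalman filter and omits the Munkres assignment and deep visual features the authors describe; the `Track` class and its fields are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """Centre position and per-frame velocity of one tracked text instance."""

    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0
    missed: int = 0  # consecutive frames without a matched detection

    def update(self, detection):
        """Advance the track by one frame.

        `detection` is an (x, y) centre matched to this track, or None when
        the text was not detected (for example because it is occluded).
        """
        if detection is None:
            # No observation: extrapolate with the last known velocity.
            self.x += self.vx
            self.y += self.vy
            self.missed += 1
        else:
            new_x, new_y = detection
            self.vx, self.vy = new_x - self.x, new_y - self.y
            self.x, self.y = new_x, new_y
            self.missed = 0
        return (self.x, self.y)


track = Track(10, 20)
track.update((12, 21))                 # observed: velocity becomes (2, 1)
predicted = track.update(None)         # occluded: predicted at (14, 22)
```

A fuller implementation would drop a track once `missed` exceeds some limit and would weigh predictions against noisy detections, which is the role the Kalman filter plays in the described method.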

computer science, artificial intelligence
Text from corners: a novel approach to detect text and caption in videos.

Xu Zhao,Kai-Hsiang Lin,Yun Fu,Yuxiao Hu,Yuncai Liu,Thomas S Huang

DOI: https://doi.org/10.1109/TIP.2010.2068553

IF: 10.6

2011-01-01

IEEE Transactions on Image Processing

Abstract:Detecting text and caption from videos is important and in great demand for video retrieval, annotation, indexing, and content analysis. In this paper, we present a corner based approach to detect text and caption from videos. This approach is inspired by the observation that there exist dense and orderly presences of corner points in characters, especially in text and caption. We use several discriminative features to describe the text regions formed by the corner points. The usage of these features is in a flexible manner, thus, can be adapted to different applications. Language independence is an important advantage of the proposed method. Moreover, based upon the text features, we further develop a novel algorithm to detect moving captions in videos. In the algorithm, the motion features, extracted by optical flow, are combined with text features to detect the moving caption patterns. The decision tree is adopted to learn the classification criteria. Experiments conducted on a large volume of real video shots demonstrate the efficiency and robustness of our proposed approaches and the real-world system. Our text and caption detection system was recently highlighted in a worldwide multimedia retrieval competition, Star Challenge, by achieving the superior performance with the top ranking.
Video Text Tracking With a Spatio-Temporal Complementary Model

Yuzhe Gao,Xing Li,Jiajian Zhang,Yu Zhou,Dian Jin,Jing Wang,Shenggao Zhu,Xiang Bai

DOI: https://doi.org/10.1109/TIP.2021.3124313

2021-12-29

Abstract:Text tracking is to track multiple texts in a video, and construct a trajectory for each text. Existing methods tackle this task by utilizing the tracking-by-detection framework, i.e., detecting the text instances in each frame and associating the corresponding text instances in consecutive frames. We argue that the tracking accuracy of this paradigm is severely limited in more complex scenarios, e.g., owing to motion blur, etc., the missed detection of text instances causes the break of the text trajectory. In addition, different text instances with similar appearance are easily confused, leading to the incorrect association of the text instances. To this end, a novel spatio-temporal complementary text tracking model is proposed in this paper. We leverage a Siamese Complementary Module to fully exploit the continuity characteristic of the text instances in the temporal dimension, which effectively alleviates the missed detection of the text instances, and hence ensures the completeness of each text trajectory. We further integrate the semantic cues and the visual cues of the text instance into a unified representation via a text similarity learning network, which supplies a high discriminative power in the presence of text instances with similar appearance, and thus avoids the mis-association between them. Our method achieves state-of-the-art performance on several public benchmarks. The source code is available at https://github.com/lsabrinax/VideoTextSCM.

Computer Vision and Pattern Recognition
You Only Recognize Once: Towards Fast Video Text Spotting

Zhanzhan Cheng,Jing Lu,Yi Niu,Shiliang Pu,Fei Wu,Shuigeng Zhou

DOI: https://doi.org/10.48550/arXiv.1903.03299

2021-10-25

Abstract:Video text spotting is still an important research topic due to its various real-world applications. Previous approaches usually follow a four-stage pipeline: detecting text in individual images, recognizing localized text regions frame by frame, tracking text streams, and generating final results with complicated post-processing, which may suffer from huge computational cost as well as interference from low-quality text. In this paper, we propose a fast and robust video text spotting framework that recognizes the localized text only once instead of frame by frame. Specifically, we first obtain text regions in videos with a well-designed spatial-temporal detector. Then we concentrate on developing a novel text recommender for selecting the highest-quality text from text streams and only recognizing the selected ones. Here, the recommender assembles text tracking, quality scoring and recognition into an end-to-end trainable module, which not only avoids interference from low-quality text but also dramatically speeds up the video text spotting process. In addition, we collect a larger scale video text dataset (LSVTD) for promoting the video text spotting community, which contains 100 text videos from 22 different real-life scenarios. Extensive experiments on two public benchmarks show that our method speeds up the recognition process by 71 times on average compared with the frame-wise manner, and also achieves remarkable state-of-the-art performance.

Computer Vision and Pattern Recognition
Tracking Based Multi-Orientation Scene Text Detection: A Unified Framework With Dynamic Programming.

Chun Yang,Xu-Cheng Yin,Wei-Yi Pei,Shu Tian,Ze-Yu Zuo,Chao Zhu,Junchi Yan

DOI: https://doi.org/10.1109/TIP.2017.2695104

2017-01-01

Abstract:There are a variety of grand challenges for multi-orientation text detection in scene videos, where the typical issues include skew distortion, low contrast, and arbitrary motion. Most conventional video text detection methods using individual frames have limited performance. In this paper, we propose a novel tracking based multi-orientation scene text detection method using multiple frames within a unified framework via dynamic programming. First, a multi-information fusion-based multi-orientation text detection method in each frame is proposed to extensively locate possible character candidates and extract text regions with multiple channels and scales. Second, an optimal tracking trajectory is learned and linked globally over consecutive frames by dynamic programming to finally refine the detection results with all detection, recognition, and prediction information. Moreover, the effectiveness of our proposed system is evaluated with the state-of-the-art performances on several public data sets of multi-orientation scene text images and videos, including MSRA-TD500, USTB-SV1K, and ICDAR 2015 Scene Videos.
Scene Video Text Tracking Based on Hybrid Deep Text Detection and Layout Constraint

Xihan Wang,Xiaoyi Feng,Zhaoqiang Xia

DOI: https://doi.org/10.1016/j.neucom.2019.05.101

IF: 6

2019-01-01

Neurocomputing

Abstract:Video text in real-world scenes often carries rich high-level semantic information and plays an ever-increasingly important role in the content-based video analysis and retrieval. Therefore, the scene video text detection and tracking are important prerequisites of numerous multimedia applications. However, the performance of most existing tracking methods is not satisfactory due to frequent mis-detections, unexpected camera motion and similar appearances between text regions. To address these problems, we propose a new video text tracking approach based on hybrid deep text detection and layout constraint. Firstly, a deep text detection network that combines the advantages of object detection and semantic segmentation in a hybrid way is proposed to locate possible text candidates in individual frames. Then, text trajectories are derived from consecutive frames with a novel data association method, which effectively exploits the layout constraint of text regions in large camera motion. By utilizing the layout constraint, the ambiguities caused by similar text regions are effectively reduced. We conduct experiments on four benchmark datasets, i.e., ICDAR 2015, MSRA-TD 500, USTB-SV1K and Minetto, to evaluate the proposed method. The experimental results demonstrate the effectiveness and superiority of the proposed approach.
Video Text Detection with Fully Convolutional Network and Tracking

Yang Wang,Lan Wang,Feng Su,Jiahao Shi

DOI: https://doi.org/10.1109/icme.2019.00299

2019-01-01

Abstract:Scene text in videos carries rich semantic information that is of great value in various content-based video applications. In this paper, we propose an effective fully convolutional network model for detecting text in videos based on a novel refine block structure. The model hierarchically exploits low-level features from earlier convolutions to refine high-level semantic features, thereby fusing multi-resolution features extracted from the frame to generate high-resolution semantic feature maps for better capturing widely varied appearances of video text. We further complement the individual-frame detection with an efficient correlation filter based text tracking mechanism, and enhance the overall detection performance by matching and combining detection and tracking results. Experiments on public scene text video datasets demonstrate the state-of-the-art performance of the proposed method.
Detecting both superimposed and scene text with multiple languages and multiple alignments in video

Xiaodong Huang,Huadong Ma,Charles X. Ling,Guangyu Gao

DOI: https://doi.org/10.1007/s11042-012-1201-2

IF: 2.577

2012-08-12

Multimedia Tools and Applications

Abstract:Video text often contains highly useful semantic information that can contribute significantly to video retrieval and understanding. Video text can be classified into scene text and superimposed text. Most of the previous methods detect superimposed or scene text separately due to different text alignments. Moreover, because characters of different languages have different edge and texture features, it is very difficult to detect multilingual text. In this paper, we first perform a detailed analysis of the motion patterns of video text, and show that superimposed and scene text exhibit different motion patterns on consecutive frames, which are insensitive to the language of the characters and to the text alignment. Based on our analysis, we define the Motion Perception Field (MPF) to represent the text motion patterns. Finally, we propose text detection algorithms using MPF for both superimposed and scene text with multiple languages and multiple alignments. Experimental results on diverse videos demonstrate that our algorithms are robust, and outperform previous methods for detecting both superimposed and scene texts with multiple languages and multiple alignments.

computer science, information systems, theory & methods,engineering, electrical & electronic, software engineering
Automatic video superimposed text detection based on Nonsubsampled Contourlet Transform

Xiaodong Huang

DOI: https://doi.org/10.1007/s11042-017-4619-8

IF: 2.577

2017-03-25

Multimedia Tools and Applications

Abstract:Compared with other video semantic clues, such as gestures, motions etc., video text generally provides highly useful and fairly precise semantic information, the analysis of which can to a great extent facilitate video and scene understanding. It can be observed that the video texts show stronger edges. The Nonsubsampled Contourlet Transform (NSCT) is a fully shift-invariant, multi-scale, and multi-direction expansion, which can preserve the edge/silhouette of the text characters well. Therefore, in this paper, a new approach has been proposed to detect video text based on NSCT. First of all, the 8 directional coefficients of NSCT are combined to build the directional edge map (DEM), which can keep the horizontal, vertical and diagonal edge features and suppress other directional edge features. Then various directional pixels of DEM are integrated into a whole binary image (BE). Based on the BE, text frame classification is carried out to determine whether the video frames contain the text lines. Finally, text detection based on the BE is performed on consecutive frames to discriminate the video text from non-text regions. Experimental evaluations based on our collected TV videos data set demonstrate that our method significantly outperforms the other 3 video text detection algorithms in both detection speed and accuracy, especially when there are challenges such as video text with various sizes, languages, colors, fonts, short or long text lines.

computer science, information systems, theory & methods,engineering, electrical & electronic, software engineering
