Web Resource Naming Conventions and User Behavior Analysis

Chong Chen, Hongfei Yan
DOI: https://doi.org/10.3772/j.issn.1000-0135.2009.04.013
2009-01-01
Abstract: A Web resource is a file, or a group of files (possibly organized in a directory with subdirectories), that represents a certain thing, meaning, or entity and is worth preserving over the long term. Web resources such as e-books, learning materials, and songs can supply content to digital libraries, educational repositories, and other digital collections. However, Web resources are characterized by chaotic naming, which hinders searching for and organizing them. We examine Web resource naming conventions and user behavior characteristics using statistical methods on 16,284 resources. The data set, consisting of about 61 thousand files, was gathered continuously from the Web from 2003 to 2006. In this paper, we study the distributions of the lengths of resource names, subdirectory names, and file names; the entropy of the character types; the symbols that occur frequently in names; and high-frequency snippet styles and semantic types. These analyses reveal the disorderly naming conventions of Internet users. The results can help clean chaotic names and extract useful information from them for better retrieval, and they illustrate how users behave when sharing and spreading Web resources on the Internet.
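To make two of the statistics named in the abstract concrete, the following is a minimal Python sketch of a name-length histogram and a character-type entropy. It is not the authors' implementation: the abstract does not define the character classes or the exact entropy formula, so the classes used here (digit, letter, space, symbol, other) and the use of Shannon entropy in bits are assumptions, and the function names `char_type`, `char_type_entropy`, and `length_histogram` are hypothetical.

```python
import math
from collections import Counter


def char_type(ch):
    """Map one character to a coarse type (assumed classes, not from the paper)."""
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isspace():
        return "space"
    if ch.isascii():
        return "symbol"  # ASCII punctuation such as '_', '-', '.', '['
    return "other"       # non-ASCII characters, e.g. CJK text


def char_type_entropy(name):
    """Shannon entropy (in bits) of the character-type distribution of a name."""
    counts = Counter(char_type(ch) for ch in name)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def length_histogram(names):
    """Count how many names have each length (in characters)."""
    return Counter(len(name) for name in names)
```

Under these assumptions, a name made of a single character type has entropy 0, while a name that mixes letters, digits, symbols, and spaces evenly reaches 2 bits; higher values indicate a more heterogeneous, and arguably messier, name.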