User Attributes Inference Based on Reviews on Social Media
Yun LIU, Yu-Qing SUN, Ming-Zhu LI
DOI: https://doi.org/10.11897/SP.J.1016.2017.02762
2018-01-01
Abstract: User attribute inference plays an important role in practical applications such as personalized recommendation, marketing, and improving the quality of web services. Existing work mainly relies on identity-related online behaviors, such as a user's query history and user relationships, which are not applicable to social media, where users are often anonymous. Additionally, user reviews are not only fragmented and noisy but also imbalanced in both quantity and distribution. In this paper, we propose a series of methods to address these challenges. To address the imbalance in the quantity of reviews, we supplement the reviews with information about the items a user commented on and with the context of the comments, which reveal the user's preferences and behavior trajectory. In addition, we introduce an ontology database to enrich the semantic features of user comments; the ontology summarizes and generalizes knowledge about words and organizes it into a hierarchical structure. User comments are segmented into words, and each word is mapped to the ontology node that represents the concept shared by words of the same meaning. The resulting hierarchical features reveal the semantic relationships among words and effectively reduce the negative influence of fragmented data and imbalanced quantity. After modeling, the feature dimension is high and the fragmented information has low value. To solve this problem, we adopt information gain to measure the importance of features. Information gain measures the influence of each feature on the attribute inference result and reflects the amount of information the feature contains. In information theory, entropy measures the uncertainty of a random variable. For user attribute inference, the change in the uncertainty of a user attribute after a feature is added is called the information gain, which indicates the amount of information contributed by that feature. The larger the change, the better the feature distinguishes users with different attributes. To reduce the influence of high dimensionality, we use information gain to improve two representative probabilistic feature selection methods: the Probability Wrapped Features Selection algorithm and the Heuristic Probability Feature Selection algorithm. Both methods use feature importance as the selection probability, either in pre-classification or in the iterative learning process. These two methods reduce the search space and improve the convergence rate of feature selection. Taking into account the correlations between features and classifiers on small-scale types of data, we propose the Unbalanced Data Enhancement Learning algorithm, which integrates multiple feature-related classifiers. It retains important features while selecting trivial features with low probability, which makes it better suited to inferring imbalanced attributes. We validate our methods on several real datasets from several aspects, including behavior models, feature selection methods, the influence of parameters, and the degree of imbalance in user attributes. The experimental results show that the proposed approach not only relieves the negative influence of fragmented and noisy data but also effectively resolves the difficulty of attribute classification under an imbalanced user attribute distribution. The results also show that our methods outperform related algorithms.
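The information gain measure described in the abstract can be illustrated with a minimal sketch. It assumes discrete feature values and discrete attribute labels. The function names, the toy data, and the proportional sampling step are hypothetical illustrations of "feature importance as the selection probability"; they are not the paper's Probability Wrapped Features Selection or Heuristic Probability Feature Selection algorithms.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy given the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def sample_features(gains, k, rng):
    """Draw up to k distinct features with probability proportional to gain.

    Illustrative assumption: features whose gain is zero are never drawn.
    """
    pool = dict(gains)
    chosen = []
    while pool and len(chosen) < k:
        names = list(pool)
        weights = [pool[name] for name in names]
        if sum(weights) <= 0:
            break  # only zero-gain features remain
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


# Toy data (hypothetical): four users with a binary attribute "A" or "B".
labels = ["A", "A", "B", "B"]
features = {
    "word_x": [1, 1, 0, 0],  # separates the two attribute values perfectly
    "word_y": [1, 0, 1, 0],  # carries no information about the attribute
}
gains = {name: information_gain(values, labels)
         for name, values in features.items()}
selected = sample_features(gains, k=1, rng=random.Random(0))
```

On this toy data, `word_x` removes all uncertainty about the attribute, so its gain equals the full label entropy of one bit, while `word_y` leaves the uncertainty unchanged and has zero gain.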