Social Network Extraction: Superficial Method and Information Retrieval

Mahyuddin K. M. Nasution, Shahrul Azman Mohd. Noah, Saidah Saad
DOI: https://doi.org/10.48550/arXiv.1601.02904
2016-01-12
Abstract: Social networks have become one of the themes of government issues, mainly in dealing with chaos. The use of the web is steadily gaining ground in these issues. However, most web documents are unstructured and lack semantics. In this paper we propose an information-retrieval-driven method for dealing with the heterogeneity of features on the web. The proposed solution is to compare approaches that have shown the capacity to extract social relations: strength relations and relations based on an online academic database.
Information Retrieval, Social and Information Networks
What problem does this paper attempt to address?
This paper addresses the extraction of entities and their relationships in social networks. Specifically, the authors focus on how to extract social networks from web pages despite the heterogeneity of web-page data and its lack of semantic structure. The paper proposes an information-retrieval-based method to address these challenges and compares several methods for extracting social relationships, including strength relations and relations based on online academic databases.

### Main problems:
1. **Heterogeneity of web-page data and lack of semantic structure**: Most web documents are unstructured and lack explicit semantic information, which makes it difficult to extract social networks from them.
2. **Entity recognition and relationship extraction**: How to effectively identify entities (such as individuals and organizations) in web pages, together with the relationships between them, especially in large-scale data.
3. **Performance evaluation of methods**: How to evaluate different methods for extracting social networks, especially in terms of precision and recall.

### Solutions:
1. **Information-retrieval-driven method**: Use information-retrieval techniques to extract social networks from web pages, focusing on entity recognition and relationship extraction.
2. **Comparison of multiple methods**: Compare supervised and unsupervised learning methods for extracting social networks, especially strength relations (SRS) and underlying strength relations based on URLs (USR).
3. **Experimental verification**: Verify the performance of the different methods through experiments, using a data set of 539 web pages and comparing the results with the benchmark graph in the DBLP online database.
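The comparison against a benchmark graph can be illustrated with a small sketch. It assumes both the extracted network and the benchmark are given as lists of undirected edges between names, and it scores them edge by edge; this edge-level scoring, the function name `evaluate_edges`, and the toy edges are illustrative assumptions, not the paper's evaluation procedure.

```python
def evaluate_edges(extracted, benchmark):
    """Edge-level precision, recall and F-value for two undirected graphs.

    Assumption: each graph is an iterable of (name, name) pairs and the
    direction of a pair does not matter.
    """
    # frozenset makes ("A", "B") and ("B", "A") the same edge.
    ext = {frozenset(edge) for edge in extracted}
    ref = {frozenset(edge) for edge in benchmark}

    correct = len(ext & ref)  # extracted edges that also appear in the benchmark
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Hypothetical toy data: three extracted edges, four benchmark edges.
extracted = [("A", "B"), ("B", "C"), ("A", "D")]
benchmark = [("B", "A"), ("C", "B"), ("C", "D"), ("D", "E")]
precision, recall, f_value = evaluate_edges(extracted, benchmark)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_value:.3f}")
```

Two of the three extracted edges appear in the toy benchmark, and two of the four benchmark edges are recovered, so precision is higher than recall in this example.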
### Formula summary:
- **Jaccard coefficient**: \[ \text{sim}_{\text{jac}}(a, b)=\frac{|a\cap b|}{|a|+|b|-|a\cap b|} \]
- **Conditional probability**: \[ p(b_i\mid a)=\frac{|(q\Rightarrow b_i)=T|}{|M|} \]
- **Improved Jaccard coefficient**: \[ \text{sim}(a, b_i)=\frac{|(q\Rightarrow b_i)=T|}{|M|+|D_{b_i}|-|(q\Rightarrow b_i)=T|} \]
- **TF-IDF calculation**: \[ \text{TF.IDF}_w=\text{tf}(w)\cdot\text{idf}(w)=\left(\sum_{j=1}^{N}\sum_{i=1}^{m}\frac{1}{n}\right)\log\frac{N}{\text{df}(w)} \]
- **Normalized TF-IDF**: \[ \text{tfidf}_{\text{nor}}=(\text{TF.IDF})\left(\frac{N}{\sigma}\right) \]
- **Recall**: \[ \text{Rec}(S_i)=\frac{|\{S\in P(S_i):C(S)=C(S_i)\}|}{|\{S\in P(S_i)\}|} \]
- **Precision**: \[ \text{Prec}(S_i)=\frac{|\{S\in C(S_i):P(S)=P(S_i)\}|}{|\{S\in C(S_i)\}|} \]
- **F-value**: \[ F=\frac{2\cdot\text{Rec}\cdot\text{Prec}}{\text{Rec}+\text{Prec}} \]

Through these methods and formulas, the paper aims to improve the efficiency and accuracy of extracting social networks from web pages.
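As a rough illustration of the conditional-probability and improved-Jaccard formulas, the sketch below works on a toy corpus. It rests on one possible reading of the symbols: \(M\) is the set of documents returned for a query on name \(a\), \(D_{b_i}\) is the set of documents containing \(b_i\), and \(|(q\Rightarrow b_i)=T|\) is the number of documents in \(M\) that also contain \(b_i\). That reading, the helper names, and the toy corpus are all assumptions rather than the paper's implementation.

```python
def docs_containing(name, corpus):
    """Return the ids of all documents that mention `name`."""
    return {doc_id for doc_id, names in corpus.items() if name in names}


def conditional_probability(a, b, corpus):
    """p(b | a): share of the documents retrieved for `a` that also mention `b`."""
    m = docs_containing(a, corpus)  # assumed meaning of M
    if not m:
        return 0.0
    return len(m & docs_containing(b, corpus)) / len(m)


def improved_jaccard(a, b, corpus):
    """sim(a, b): hits / (|M| + |D_b| - hits), with the same reading of M."""
    m = docs_containing(a, corpus)
    d_b = docs_containing(b, corpus)  # assumed meaning of D_b
    hits = len(m & d_b)
    denominator = len(m) + len(d_b) - hits
    return hits / denominator if denominator else 0.0


# Hypothetical corpus: document id -> names found in that document.
corpus = {
    "d1": {"alice", "bob"},
    "d2": {"alice", "bob", "carol"},
    "d3": {"alice"},
    "d4": {"carol"},
}

for other in ("bob", "carol"):
    p = conditional_probability("alice", other, corpus)
    s = improved_jaccard("alice", other, corpus)
    print(f"alice -> {other}: p={p:.3f} sim={s:.3f}")
```

Under this reading the improved coefficient reduces to the ordinary Jaccard coefficient over the two document sets, which is why it is symmetric in the two names while the conditional probability is not.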