The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Recently, association rule mining has been applied to track and relate news documents from several sources because of its performance and scalability. This paper presents an empirical investigation of how the term representation basis, the term weighting, and the association measure affect the quality of relations discovered among news documents. Twenty-four combinations, formed from two term representation bases, four term weightings, and three association measures, are explored, and their results are compared with human judgment of three-level relations: completely related, somehow related, and unrelated. Performance is evaluated by comparing the top-k results of each combination with those of the others using the so-called rank-order mismatch (ROM). The experimental results indicate that the combination of bigram (BG), term frequency with inverse document frequency (TFIDF), and confidence (CONF), as well as the combination of BG, TFIDF, and conviction (CONV), achieves the best performance in finding related documents, placing them in the upper ranks with a ROM of 0.41% on the top-50 mined relations. However, the combination of unigram (UG), TFIDF, and lift (LIFT) performs best at placing irrelevant relations in the lower ranks (top-1100), with a ROM of 9.63%. A detailed analysis of the number of three-level relations with respect to their ranks is also performed to examine the characteristics of the resulting relations. Finally, a discussion and an error analysis are given.
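The three association measures named in the abstract (CONF, CONV, LIFT) have standard textbook definitions in terms of rule support, and a small sketch may help make them concrete. The Python below is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy documents, the use of binary term weights instead of TFIDF, and the choice to measure a document's support as the fraction of vocabulary terms it contains are all hypothetical. ROM is left out because the abstract does not define it.

```python
import math


def association_measures(supp_x, supp_y, supp_xy):
    """Return (confidence, lift, conviction) for a rule X -> Y.

    All arguments are supports expressed as fractions; supp_x and supp_y
    must lie in (0, 1], and supp_xy cannot exceed either of them.
    """
    if not (0 < supp_x <= 1 and 0 < supp_y <= 1
            and 0 <= supp_xy <= min(supp_x, supp_y)):
        raise ValueError("supports must be consistent fractions")
    confidence = supp_xy / supp_x
    lift = confidence / supp_y
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence >= 1.0:
        conviction = math.inf
    else:
        conviction = (1.0 - supp_y) / (1.0 - confidence)
    return confidence, lift, conviction


def bigrams(tokens):
    """Set of adjacent word pairs, a simple stand-in for a bigram (BG) basis."""
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


# Hypothetical toy documents. Terms are weighted 0/1 here (assumption);
# the paper also studies TFIDF and other weightings.
doc_a = bigrams("stocks fall as oil prices rise".split())
doc_b = bigrams("oil prices rise after stocks fall".split())
vocabulary = doc_a | doc_b

# Assumed support: the fraction of vocabulary terms a document contains.
n = len(vocabulary)
conf, lift, conv = association_measures(
    len(doc_a) / n, len(doc_b) / n, len(doc_a & doc_b) / n
)
print(f"CONF={conf:.3f} LIFT={lift:.3f} CONV={conv:.3f}")
```

Under these definitions, a lift above 1 indicates that the two documents share more terms than independence would predict, while confidence and conviction are asymmetric and depend on which document is taken as the rule's antecedent.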
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Nichnan KITTIPHATTANABAWON, Thanaruk THEERAMUNKONG, Ekawit NANTAJEEWARAWAT, "News Relation Discovery Based on Association Rule Mining with Combining Factors" in IEICE TRANSACTIONS on Information,
vol. E94-D, no. 3, pp. 404-415, March 2011, doi: 10.1587/transinf.E94.D.404.
Abstract: Recently, to track and relate news documents from several sources, association rule mining has been applied due to its performance and scalability. This paper presents an empirical investigation on how term representation basis, term weighting, and association measure affect the quality of relations discovered among news documents. Twenty-four combinations initiated by two term representation bases, four term weightings, and three association measures are explored with their results compared to human judgment of three-level relations: completely related, somehow related, and unrelated relations. The performance evaluation is conducted by comparing the top-k results of each combination to those of the others using so-called rank-order mismatch (ROM). The experimental results indicate that a combination of bigram (BG), term frequency with inverse document frequency (TFIDF) and confidence (CONF), as well as a combination of BG, TFIDF and conviction (CONV), achieves the best performance to find the related documents by placing them in upper ranks with 0.41% ROM on top-50 mined relations. However, a combination of unigram (UG), TFIDF and lift (LIFT) performs the best by locating irrelevant relations in lower ranks (top-1100) with 9.63% ROM. A detailed analysis on the number of the three-level relations with regard to their rankings is also performed in order to examine the characteristic of the resultant relations. Finally, a discussion and an error analysis are given.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E94.D.404/_p
@ARTICLE{e94-d_3_404,
author={Nichnan KITTIPHATTANABAWON and Thanaruk THEERAMUNKONG and Ekawit NANTAJEEWARAWAT},
journal={IEICE TRANSACTIONS on Information},
title={News Relation Discovery Based on Association Rule Mining with Combining Factors},
year={2011},
volume={E94-D},
number={3},
pages={404-415},
abstract={Recently, to track and relate news documents from several sources, association rule mining has been applied due to its performance and scalability. This paper presents an empirical investigation on how term representation basis, term weighting, and association measure affect the quality of relations discovered among news documents. Twenty-four combinations initiated by two term representation bases, four term weightings, and three association measures are explored with their results compared to human judgment of three-level relations: completely related, somehow related, and unrelated relations. The performance evaluation is conducted by comparing the top-k results of each combination to those of the others using so-called rank-order mismatch (ROM). The experimental results indicate that a combination of bigram (BG), term frequency with inverse document frequency (TFIDF) and confidence (CONF), as well as a combination of BG, TFIDF and conviction (CONV), achieves the best performance to find the related documents by placing them in upper ranks with 0.41% ROM on top-50 mined relations. However, a combination of unigram (UG), TFIDF and lift (LIFT) performs the best by locating irrelevant relations in lower ranks (top-1100) with 9.63% ROM. A detailed analysis on the number of the three-level relations with regard to their rankings is also performed in order to examine the characteristic of the resultant relations. Finally, a discussion and an error analysis are given.},
keywords={},
doi={10.1587/transinf.E94.D.404},
ISSN={1745-1361},
month={March},}
TY - JOUR
TI - News Relation Discovery Based on Association Rule Mining with Combining Factors
T2 - IEICE TRANSACTIONS on Information
SP - 404
EP - 415
AU - Nichnan KITTIPHATTANABAWON
AU - Thanaruk THEERAMUNKONG
AU - Ekawit NANTAJEEWARAWAT
PY - 2011
DO - 10.1587/transinf.E94.D.404
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E94-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2011
AB - Recently, to track and relate news documents from several sources, association rule mining has been applied due to its performance and scalability. This paper presents an empirical investigation on how term representation basis, term weighting, and association measure affect the quality of relations discovered among news documents. Twenty-four combinations initiated by two term representation bases, four term weightings, and three association measures are explored with their results compared to human judgment of three-level relations: completely related, somehow related, and unrelated relations. The performance evaluation is conducted by comparing the top-k results of each combination to those of the others using so-called rank-order mismatch (ROM). The experimental results indicate that a combination of bigram (BG), term frequency with inverse document frequency (TFIDF) and confidence (CONF), as well as a combination of BG, TFIDF and conviction (CONV), achieves the best performance to find the related documents by placing them in upper ranks with 0.41% ROM on top-50 mined relations. However, a combination of unigram (UG), TFIDF and lift (LIFT) performs the best by locating irrelevant relations in lower ranks (top-1100) with 9.63% ROM. A detailed analysis on the number of the three-level relations with regard to their rankings is also performed in order to examine the characteristic of the resultant relations. Finally, a discussion and an error analysis are given.
ER -