Copyright notice
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Heum PARK, Hyuk-Chul KWON, "Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification," IEICE TRANSACTIONS on Information and Systems,
vol. E94-D, no. 4, pp. 855-865, April 2011, doi: 10.1587/transinf.E94.D.855.
Abstract: This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. The Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision trees. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved better than other methods. However, we found that the Gini-Index still shows a feature-selection bias in text classification, specifically for unbalanced datasets with very large numbers of features. This bias appears in three ways: 1) the Gini values of low-frequency features are low overall (on the purity measure), irrespective of the distribution of features among classes; 2) for high-frequency features, the Gini values are always relatively high; and 3) for specific features belonging to large classes, the Gini values are relatively lower than those of features belonging to small classes. Therefore, to correct that bias and improve feature selection in text classification using the Gini-Index, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features in feature selection. In experiments with the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, we improved overall classification performance when we used the local DR method. The total averages of classification performance increased by 19.4%, 15.9%, 3.3%, 2.8% and 2.9% (kNN) in Micro-F1; 14%, 9.8%, 9.2%, 3.5% and 4.3% (SVM) in Micro-F1; 20%, 16.9%, 2.8%, 3.6% and 3.1% (kNN) in Macro-F1; and 16.3%, 14%, 7.1%, 4.4% and 6.3% (SVM) in Macro-F1, compared with tf*idf, χ², Information Gain, Odds Ratio and the existing Gini-Index methods, respectively, for each classifier.
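For orientation, the sketch below illustrates how Gini-Index feature scoring is commonly formulated in the text-categorization literature, Gini(W) = Σ_i P(W|C_i)^2 * P(C_i|W)^2, the baseline on which the paper builds. This is a minimal Python sketch under that assumption, not the authors' implementation: the three reformulated I-GI expressions appear only in the full text, and the function name and toy corpus here are illustrative inventions.

from collections import defaultdict

def gini_index_scores(docs, labels):
    # docs: list of token lists; labels: parallel list of class labels.
    # Computes Gini(w) = sum_i P(w|C_i)^2 * P(C_i|w)^2 over document
    # frequencies -- the existing Gini-Index formulation from the
    # literature, not the paper's reformulated I-GI expressions.
    df = defaultdict(lambda: defaultdict(int))  # df[w][c]: docs of class c containing w
    class_sizes = defaultdict(int)
    for doc, c in zip(docs, labels):
        class_sizes[c] += 1
        for w in set(doc):
            df[w][c] += 1
    scores = {}
    for w, per_class in df.items():
        total_w = sum(per_class.values())       # docs containing w in any class
        score = 0.0
        for c, n in per_class.items():
            p_w_given_c = n / class_sizes[c]    # P(w | C_i)
            p_c_given_w = n / total_w           # P(C_i | w)
            score += p_w_given_c ** 2 * p_c_given_w ** 2
        scores[w] = score
    return scores

# Toy usage: rank terms by score, then keep the top k.
docs = [["gini", "split", "tree"], ["text", "gini", "class"],
        ["svm", "kernel"], ["svm", "text"]]
labels = ["dm", "dm", "ml", "ml"]
for term, s in sorted(gini_index_scores(docs, labels).items(), key=lambda kv: -kv[1])[:3]:
    print(term, round(s, 3))

Ranking terms this way and keeping the top k over the whole vocabulary, or the top k per class, corresponds to the global and local DR settings the abstract refers to.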
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E94.D.855/_p
@ARTICLE{e94-d_4_855,
author={Heum PARK and Hyuk-Chul KWON},
journal={IEICE TRANSACTIONS on Information and Systems},
title={Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification},
year={2011},
volume={E94-D},
number={4},
pages={855-865},
keywords={},
doi={10.1587/transinf.E94.D.855},
ISSN={1745-1361},
month={April},
}
TY - JOUR
TI - Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 855
EP - 865
AU - Heum PARK
AU - Hyuk-Chul KWON
PY - 2011
DO - 10.1587/transinf.E94.D.855
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E94-D
IS - 4
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - April 2011
ER -