A Comparative Analysis of Machine Learning Models for Banking News Extraction by Multiclass Classification With Imbalanced Datasets of Financial News: Challenges and Solutions.

Varun Dogra; Sahil Verma; Kavita Verma; Noor Zaman Jhanjhi; Uttam Ghosh; Dac Nhuong Le

doi:10.9781/ijimai.2022.02.002

Authors

Varun Dogra, Lovely Professional University
Sahil Verma, Chandigarh University
Kavita Verma, Chandigarh University
Noor Zaman Jhanjhi, Taylor's University
Uttam Ghosh, Meharry School of Applied Computational Sciences (United States)
Dac Nhuong Le, Duy Tan University

Keywords:

Class Imbalance, Ensemble Methods, Machine Learning, N-grams, Sampling, Tf-idf

Abstract

Online portals publish an enormous number of news articles every day. Over the years, numerous studies have concluded that news events have a significant impact on forecasting and interpreting the movement of stock prices. Creating a framework for storing news articles and collecting information for specific domains is an important and untested problem for the Indian stock market. When online news portals produce financial news articles about many subjects simultaneously, finding the articles that matter to a specific domain is nontrivial. A critical component of such a system should therefore include one module for extracting and storing news articles and another for classifying these text documents into one or more specific domains. In the current study, we performed extensive experiments to classify financial news articles into four predefined classes: Banking, Non-Banking, Governmental, and Global. The aim of the multi-class classification was to extract the Banking news and its most correlated news articles from the pool of financial news articles scraped from various web news portals. The news articles divided into these classes were imbalanced. Imbalanced data is a major difficulty for most classifier learning algorithms. However, as recent works suggest, class imbalance is not in itself a problem, and degradation in performance is often correlated with certain factors related to the data distribution, such as the existence of noisy and ambiguous instances near adjacent class boundaries. A variety of solutions for addressing data imbalance have been proposed recently, including over-sampling, down-sampling, and ensemble approaches. We present the various challenges that occur with data imbalance in multiclass classification and solutions for dealing with these challenges.
The paper also compares the performance of various machine learning models on imbalanced data and on data balanced using sampling and ensemble techniques. The results show that the Random Forest classifier, with data balanced using the over-sampling technique SMOTE, performs best in terms of precision, recall, F1, and accuracy. Among the ensemble classifiers, the Balanced Bagging classifier showed results similar to those of the Random Forest classifier with SMOTE. The Random Forest classifier's accuracy, however, was 100%, whereas the Balanced Bagging classifier's was 99%.
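To make the over-sampling idea concrete, the following is a minimal pure-Python sketch of SMOTE-style balancing for a multi-class problem. It is an illustration under stated assumptions, not the authors' pipeline: the function name `smote_oversample`, the two-dimensional feature vectors, and the class counts are all hypothetical, and a real experiment would more likely use tf-idf features with a library implementation. The core step is the one SMOTE is known for: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest same-class neighbours.

```python
import random
from collections import Counter


def smote_oversample(X, y, k=2, seed=0):
    """Grow every smaller class to the size of the largest class by
    interpolating between same-class nearest neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        # A class with a single sample has no neighbour to interpolate with,
        # so this sketch leaves it unchanged.
        if n == target or len(members) < 2:
            continue
        for _ in range(target - n):
            base = rng.choice(members)
            # k nearest same-class neighbours by squared Euclidean distance.
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
            )[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> mate
            X_out.append([a + gap * (b - a) for a, b in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy features; the class names follow the abstract.
X = [
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.70, 0.30], [0.75, 0.25], [0.95, 0.05],   # Non-Banking (majority)
    [0.10, 0.90], [0.20, 0.80],                 # Banking
    [0.50, 0.50], [0.55, 0.45],                 # Governmental
    [0.30, 0.10], [0.35, 0.20], [0.25, 0.15],   # Global
]
y = ["Non-Banking"] * 6 + ["Banking"] * 2 + ["Governmental"] * 2 + ["Global"] * 3

X_bal, y_bal = smote_oversample(X, y)
print(Counter(y), "->", Counter(y_bal))
```

After balancing, every class has as many samples as the largest class, and the original samples are kept unchanged at the front of the output. In practice, over-sampling should be applied to the training split only, so that synthetic points do not leak into the evaluation data.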
