Comparison of Data Mining Classification Algorithms Determining the Default Risk
Cukurova University, Faculty of Arts and Sciences, Department of Statistics, Adana, Turkey

Finally, it was also observed that individuals who do not pay their other debts are likely to default on their loans. The algorithm that best solves the problem is the one selected by comparing the candidates on specific statistical criteria. We therefore transformed the age variable into a categorical variable (one of 3 categories), and the logistic regression analysis and the chi-square analysis for variable selection were repeated. Additionally, people in the "looking for a job" group are 0.7632 times more at risk of going into default than those in other working groups. Moreover, by applying the best algorithm (logistic regression) to the dataset, we determined which characteristics increase the default risk most.

Criteria for comparing classification algorithms:
- Classification accuracy: the ability of the model to correctly predict the class label, expressed as a percentage
- Speed: the time taken to set up the model
- Robustness: the ability of the model to predict correctly even when the data contain noisy observations and missing values
- Scalability: the ability of the model to remain accurate and productive while handling an increasing amount of data
- Interpretability: the level of understanding provided by the model
- Rule structure: the understandability of the algorithm's rule structure

Variables and their coding:
- Education level (1 = illiterate, 2 = primary school, 3 = secondary school, 4 = high school, 5 = higher degree)
- Working status (1 = working, 2 = looking for a job, 3 = retired, 4 = other (nonactive))
- Region (1 = Mediterranean, 2 = Aegean, 3 = Marmara, 4 = Black Sea, 5 = Central Anatolia, 6 = Eastern Anatolia, 7 = Southeastern Anatolia)
- Housing status (1 = paying rent, 2 = not paying rent)
- Individual revenue (1 = low income, 2 = medium income, 3 = higher income)
- Nonpayment of house rent, interest-bearing debt repayment, or home loan payment within the last 12 months (1 = no, 2 = yes)
- Nonpayment of electricity, water, and gas bills within the last 12 months (1 = no, 2 = yes)
- Nonpayment of credit card installments and other debt payments within the last 12 months (1 = no, 2 = yes)

For this reason, different classification algorithms must be compared on the given dataset before the problem is solved. For this analysis, odds ratios were used as a criterion. The dataset structure is shown in Table 1. These results are not only beneficial to the literature; they could also have a significant influence on how financial institutions predict customer default risk.
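The chi-square analysis used for variable selection can be illustrated with a minimal sketch of the test of independence between one categorical variable and the default outcome. The function name and the counts below are hypothetical and are not taken from the TUIK data; the sketch only shows how the statistic is formed from observed and expected counts.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    `table` is a list of rows of observed counts, e.g. rows = categories
    of a candidate variable, columns = (default, no default).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = housing status, columns = (default, no default).
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
# With 1 degree of freedom, the 5% critical value is 3.841; a statistic
# above it suggests the variable is associated with default.
keep_variable = stat > 3.841
```

A variable whose statistic falls below the critical value for its degrees of freedom would be a candidate for removal before the logistic regression is refitted.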
The WEKA data-mining software was developed at the University of Waikato in New Zealand. Data mining is a technique that is based on statistical applications. The study is also valuable in illustrating that DM can be used to determine credit risk, within the framework of the development of academic studies both in Turkey and globally.

Considering these results and the fact that credit institutions should take into account the characteristics and circumstances of their customers, which in turn affect their defaults, the risk of default could be reduced by using DM. This study contributes to the existing literature by suggesting classification algorithms that can be used to determine credit risks. In the second part, the best algorithm, logistic regression, is used to investigate the attributes that may cause default risk.

Copyright © 2019 Begüm Çığşar and Deniz Ünal.
The application model uses classification methods with machine learning algorithms such as Naive Bayes, Multilayer Perceptron, AdaBoost Classifier, Decision Tree, and Support Vector Machine. This model needs to be tested on a larger dataset to obtain the best accuracy value.

The data are analyzed after the discretization process for the continuous variables, similar to the Bayesian group. This classification algorithm is then used to classify the dataset. Classification algorithms give different results for different datasets or different problems.

Financial institutions design models using certain customer characteristics (age, gender, area of residence, income, marital status, previous credit payments, etc.).

Despite its simplicity, it is a powerful algorithm for predictive modeling. Classification is the best-known and most used method of DM. In other words, no single algorithm solves every problem in the best way.

Additionally, those from the Southeastern Anatolia Region are at greater risk of going into default. The lowest and highest odds ratios are shown in Table 8. After both these analyses were run, the odds ratio values were used to determine the probability with which individuals with certain characteristics may default on paying loans. This indicates that banks should extend lower credit amounts to persons of lower income. Furthermore, the results for noneducated individuals indicate that their risk of default is lower than that of other education levels.

Comparing the F-measure values according to the recall criterion, the logistic regression algorithm, along with Naive Bayes and BayesNet, all show a good value of 0.824. These algorithms were compared on the following statistical criteria: root mean squared error, receiver operating characteristic (ROC) area, accuracy, precision, F-measure, and recall. In this study, to determine the best algorithms for the current dataset, all data mining classification algorithms were compared with respect to the suitability of the data and their accuracy rates (the accuracy threshold was taken as 80%). The data for this study were obtained from the TUIK survey for 2015.

These results are not only beneficial to the literature but also have a significant influence in the financial industry in terms of the ability to predict customers' default risk. The odds of not going into default are 1.1008 times higher for people who are not renting a house than for people who are renting. There are many DM methods for detecting problems faced by bankers and insurers, for example, clustering, classification, and association.
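The odds ratios quoted in this section can be read with the help of a small sketch. In logistic regression, the odds ratio of a predictor is the exponential of its fitted coefficient; for a single binary predictor it can also be computed directly from a 2x2 table of counts. The function names and the counts below are hypothetical, and the coefficient is back-calculated from the 1.1008 value reported above purely for illustration.

```python
import math


def odds_ratio_from_coefficient(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


def odds_ratio_from_table(a, b, c, d):
    """Odds ratio from a 2x2 table of counts.

    a, b = (event, no event) in the group of interest
    c, d = (event, no event) in the reference group
    """
    return (a * d) / (b * c)


# A coefficient of ln(1.1008) corresponds to an odds ratio of 1.1008.
beta = math.log(1.1008)
ratio = odds_ratio_from_coefficient(beta)

# Hypothetical counts, not taken from the survey data.
table_ratio = odds_ratio_from_table(20, 80, 10, 90)  # (20*90)/(80*10) = 2.25
```

An odds ratio above 1 means the odds of the modeled outcome are higher in the group of interest than in the reference group; a value below 1 means they are lower.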
The data for this study, which contain the demographic and socioeconomic characteristics of individuals, were obtained from the 2015 TUIK (Turkish Statistical Institute) survey. The confusion matrix is shown in Table 2.
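As an illustration of how the comparison criteria follow from a confusion matrix, here is a minimal sketch for the two-class case. The counts are hypothetical and do not reproduce Table 2, and treating "default" as the positive class is an assumption. WEKA also reports per-class and weighted-average figures, which this sketch does not attempt to reproduce.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def root_mean_squared_error(actual, predicted):
    """RMSE between 0/1 labels and predicted probabilities."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical confusion-matrix counts.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)

# The study keeps only algorithms whose accuracy reaches 80%.
passes_threshold = metrics["accuracy"] >= 0.80
```

With these counts the accuracy is 170/200 = 0.85, so the hypothetical classifier would clear the 80% threshold.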

Big data is not only a subject of interest for researchers but has also become an essential tool in business. The model predicts that the odds of not going into default are 1.2381 times higher for a retired person than for people of other work groups. These new algorithms will eliminate the imperfections of existing algorithms and introduce new approaches. Each object in the dataset is classified according to its similarities. This means that individuals who do not pay their bills fully and on time are likely not to pay their other creditors either.

For this reason, DM using WEKA software is implemented to identify risk groups and ensure that financial institutions extend credit to clients not at risk of default. This means that individuals who do not pay their house rent are more likely not to go into default.

Logistic regression results with the nominal age attribute omitted.

In order to implement the BayesNet algorithm, the dataset being studied should not have any missing data and all variables must be discrete.
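The discreteness requirement can be met by binning continuous variables before training. The sketch below maps age to one of three categories and rejects missing values. The cut points and labels are hypothetical, since the text does not state which boundaries were used, and in WEKA this step would normally be done with one of its discretization filters rather than by hand.

```python
from bisect import bisect_right

# Hypothetical cut points: [.., 30), [30, 50), [50, ..]
AGE_CUTS = (30, 50)
AGE_LABELS = ("young", "middle", "older")


def discretize(value, cuts=AGE_CUTS, labels=AGE_LABELS):
    """Map a continuous value to the label of the bin it falls into."""
    if value is None:
        # BayesNet needs complete data, so missing values are rejected here.
        raise ValueError("missing value: impute or drop the record first")
    return labels[bisect_right(cuts, value)]


ages = [22, 30, 47, 50, 68]
age_groups = [discretize(age) for age in ages]
```

A value equal to a cut point falls into the upper bin, so 30 is labeled "middle" and 50 is labeled "older".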