Using natural language processing for programming language code classification with Multinomial Naive Bayes

Artificial Intelligence

Jun 01, 2023

DOI: https://doi.org/10.18280/ria.370515

Published in: Revue d'Intelligence Artificielle

Publisher: Revue d'Intelligence Artificielle

Ayman Odeh Munther Odeh Hussein Odeh Nada Odeh

Abstract

Classifying Programming Languages scripts is very important task for several reasons such as: automated analysis, code maintenance, code search, quality assurance, and code understanding; this process is similar to processing natural languages, especially high-level languages like Python, Java, C#, C, C++, PHP, JavaScript, and others. Leveraging natural language processing concepts, this research explores the application of the Multinomial Naïve Bayes (MNB) algorithm to identify and classify programming languages used in source code files. MNB is a relatively simple and fast algorithm for text classification. The study utilizes a dataset comprising 12 programming languages and consists of 12,003 samples, totaling 396,090 lines of code. The MNB algorithm is trained on this diverse dataset, and its performance in classifying programming language source code is evaluated. The results of the study demonstrate an impressive accuracy rate of 95.09% in accurately identifying and classifying programming languages. This high accuracy highlights the effectiveness of the applied NLP techniques, specifically the MNB algorithm, in the classification task. The findings of this research have significant implications for multi-programming language editors such as Visual Studio Code and Notepad+ or any programming editor. With the automatic recognition of programming languages enabled by this approach, users can conveniently paste source code into these editors, and the system will automatically identify and classify the programming language being used. This functionality enhances the user experience and streamlines the coding process, particularly in multi-language development environments