SMS Spam detection and classification using ML
Thousands, probably millions, of messages and emails are sent every day. How many of them are spam? Is there a way to classify them at first glance?
Price : 5500
We worked on the SMS Spam Collection data set. It consists of 5574 messages, of which only 747 are spam. The data set is therefore imbalanced, and our final model may be biased toward the majority ‘ham’ class.
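To make the imbalance concrete, here is a small arithmetic sketch using only the two counts quoted above. The variable names are ours, and the "majority baseline" is an illustrative reference point, not a model from this project.

```python
# Counts quoted in the text above
total_messages = 5574
spam_messages = 747
ham_messages = total_messages - spam_messages

spam_share = spam_messages / total_messages
ham_share = ham_messages / total_messages

# A trivial model that labels every message 'ham' would already reach
# this accuracy, so plain accuracy is a weak yardstick on this data.
majority_baseline_accuracy = ham_share

print(f"ham messages:  {ham_messages}")
print(f"spam share:    {spam_share:.1%}")
print(f"ham share:     {ham_share:.1%}")
```

Because of this, per-class measures such as precision and recall for the ‘spam’ class are usually more informative than overall accuracy.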
The data set has only two columns, ‘label’ and ‘text’. We add a new column named ‘text_length’ and use it to visualize the length distribution of the two categories, ‘ham’ and ‘spam’. The longest message in the ‘ham’ subset is 910 characters. Below we show the two distribution plots, with and without this value.
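The ‘text_length’ step can be sketched in plain Python as follows. The four sample rows are hypothetical stand-ins for the real data set, and we use the standard library here rather than assuming any particular data frame library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the real data set has 5574 of these.
messages = [
    {"label": "ham", "text": "Ok, see you at 5"},
    {"label": "ham", "text": "Can you call me when you get home?"},
    {"label": "spam", "text": "WINNER! You have been selected for a free prize. Call now to claim."},
    {"label": "spam", "text": "URGENT: your account has won a cash award, reply YES to collect today"},
]

# Add the derived 'text_length' field: number of characters in 'text'.
for row in messages:
    row["text_length"] = len(row["text"])

# Group the lengths by label so each category can be summarized or plotted.
lengths_by_label = defaultdict(list)
for row in messages:
    lengths_by_label[row["label"]].append(row["text_length"])

for label, lengths in lengths_by_label.items():
    print(label, "mean:", round(mean(lengths), 1), "max:", max(lengths))
```

To reproduce the "without this value" plot, one option is to drop the single longest ‘ham’ length from its list before plotting.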
Insights of the distributions and the statistical analysis
These insights help us answer the first question. Message length appears to carry a pattern that separates spam from ham. Length alone cannot make us 100% sure, but it gives a good chance of flagging a message as spam and being right.
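As a rough illustration of classifying from length alone, here is a minimal sketch. It rests on two assumptions that are ours, not results from this project: that longer messages are the ones more likely to be spam, and a cut-off of 60 characters. The sample messages are hypothetical and were chosen to be easy, so the score says nothing about performance on the real data.

```python
# Hypothetical cut-off in characters; a real value would have to be read
# off the length distributions of the actual data set.
LENGTH_THRESHOLD = 60

def predict_by_length(text, threshold=LENGTH_THRESHOLD):
    """Label a message 'spam' if it is longer than the threshold.

    The direction (longer means spam) is an assumption for this sketch.
    """
    return "spam" if len(text) > threshold else "ham"

# Hypothetical labelled examples, not taken from the data set.
samples = [
    ("ham", "Ok, see you at 5"),
    ("ham", "Can you call me when you get home?"),
    ("spam", "WINNER! You have been selected for a free prize. Call now to claim."),
    ("spam", "URGENT: your account has won a cash award, reply YES to collect today"),
]

correct = sum(predict_by_length(text) == label for label, text in samples)
accuracy = correct / len(samples)
print(f"accuracy on the toy sample: {accuracy:.0%}")
```

A rule like this is best treated as a baseline: any model built on the message text should beat it before it is worth keeping.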
Natural Language Processing
Using NLP, the first thing we did was look at the most common and most significant words in each message category. Below we list the tools we used for the NLP.
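Before turning to those tools, the idea of "most common words per category" can be sketched with the Python standard library alone. The messages, the tiny stop-word list, and the `tokenize` helper are all hypothetical and only illustrate the counting step, not the pipeline used in this project.

```python
import re
from collections import Counter

# Hypothetical toy messages, not taken from the data set.
messages = [
    ("ham", "Are you coming home for dinner tonight?"),
    ("ham", "I will call you when I get home"),
    ("spam", "Free prize! Call now to claim your free ticket"),
    ("spam", "You won a free holiday, call this number now"),
]

# Tiny illustrative stop-word list (an assumption); NLP libraries ship
# much fuller ones.
STOP_WORDS = {"a", "i", "to", "you", "your", "for", "the", "this", "when", "are", "will"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

# One word counter per label.
word_counts = {}
for label, text in messages:
    word_counts.setdefault(label, Counter()).update(tokenize(text))

for label, counts in word_counts.items():
    print(label, counts.most_common(3))
```

Raw counts show which words are common. Finding words that are significant for one category usually needs a weighting such as TF-IDF, which compares a word's frequency in one message against how many messages contain it.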