Course Content

We worked on the SMS Spam Collection data set. It consists of 5,574 messages, of which only 747 are spam. The data set is therefore imbalanced, and the final model may be biased toward the majority ‘ham’ class.
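To make the imbalance concrete, the short sketch below derives the class shares from the two counts quoted above. The variable names are illustrative, and the "majority-class baseline" is an added reference point: the accuracy a model would reach by always predicting ‘ham’.

```python
# Counts taken from the data set description above.
total_messages = 5574
spam_messages = 747
ham_messages = total_messages - spam_messages

# Share of each class in the data set.
spam_share = spam_messages / total_messages
ham_share = ham_messages / total_messages

# A model that always predicts the majority class reaches this accuracy,
# so any useful classifier has to beat it.
baseline_accuracy = max(spam_share, ham_share)

print(f"ham:  {ham_messages} messages ({ham_share:.1%})")
print(f"spam: {spam_messages} messages ({spam_share:.1%})")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Because the baseline is already high, accuracy alone can be misleading on this data; precision and recall for the ‘spam’ class are usually more informative.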

Exploratory Data Analysis

The data set has only two columns, ‘label’ and ‘text’. We add a new column, ‘text_length’, and use it to visualize the length distribution of the two categories, ‘ham’ and ‘spam’. The longest message in the ‘ham’ subset is 910 characters. The two plots below show the distributions with and without this value.
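The ‘text_length’ step can be sketched with pandas as follows. This is a minimal illustration, not the course's own notebook: the four rows are made-up stand-ins for the real messages, and only the column names ‘label’, ‘text’ and ‘text_length’ come from the description above.

```python
import pandas as pd

# Toy rows standing in for the real data set (hypothetical examples).
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"],
    "text": [
        "See you at 5",
        "WINNER! Claim your free prize now by calling this number",
        "Ok",
        "Urgent: your account has won a cash award, reply to claim",
    ],
})

# New column holding the number of characters in each message.
df["text_length"] = df["text"].str.len()

# Summary statistics of message length per category.
length_stats = df.groupby("label")["text_length"].describe()
print(length_stats)

# Drop the single longest 'ham' message so the distribution can be
# inspected with and without that extreme value.
longest_ham_index = df.loc[df["label"] == "ham", "text_length"].idxmax()
df_without_longest = df.drop(index=longest_ham_index)
```

On the real data, plotting ‘text_length’ per label from both `df` and `df_without_longest` would give the two views of the distribution mentioned above.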

Insights from the distributions and the statistical analysis

These insights could help us answer the first question. Message length appears to carry a pattern that helps separate spam from ham. Length alone cannot classify a message with 100% certainty, but it gives a good chance of labeling a message as spam and being right.
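As a rough illustration of how far length alone can go, the sketch below applies a single length threshold to a few made-up messages. Both the threshold of 50 characters and the assumption that longer messages are more likely to be spam are hypothetical choices for this example, not values taken from the course.

```python
def classify_by_length(text, threshold=50):
    """Label a message 'spam' if it has at least `threshold` characters.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return "spam" if len(text) >= threshold else "ham"


# Made-up labeled messages (not from the real data set).
samples = [
    ("ham", "Ok"),
    ("ham", "See you at 5"),
    ("ham", "Hey, are we still meeting for lunch tomorrow at the usual place?"),
    ("spam", "WINNER! Claim your free prize now by calling this number today"),
    ("spam", "Urgent: your account has won a cash award, reply now to claim it"),
]

# Fraction of samples where the length rule agrees with the true label.
correct = sum(classify_by_length(text) == label for label, text in samples)
accuracy = correct / len(samples)
print(f"length-only rule accuracy on the toy samples: {accuracy:.0%}")
```

The long ‘ham’ message shows why the rule cannot be fully reliable: a legitimate message can be as long as a typical spam message.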

Natural Language Processing

Using NLP, the first thing we did was look at the most common and most significant words in each message category. Below we list the tools we used for NLP.
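The word-frequency step described above can be sketched with Python's standard library alone. This is an assumption-laden illustration rather than the course's tooling: the stop-word list is a tiny hand-picked set, the tokenizer is a simple regular expression, and the messages are made up.

```python
import re
from collections import Counter

# Tiny hand-picked stop-word list (an assumption; real projects
# typically use a fuller list from an NLP library).
STOPWORDS = {"a", "the", "to", "you", "your", "at", "is", "for", "and"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOPWORDS]


# Made-up labeled messages (not from the real data set).
messages = [
    ("spam", "Claim your free prize now"),
    ("spam", "Free entry, text WIN to claim"),
    ("ham", "See you at lunch"),
    ("ham", "Lunch is ready"),
]

# One word counter per message category.
word_counts = {"ham": Counter(), "spam": Counter()}
for label, text in messages:
    word_counts[label].update(tokenize(text))

for label, counts in word_counts.items():
    print(label, counts.most_common(3))
```

Raw counts capture the most common words; a weighting scheme such as TF-IDF would be one way to surface the most significant ones.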

SMS Spam detection and classification using ML
