This repository contains a URL classification system built with neural networks. It classifies URLs into categories such as Benign, Malware, Phishing, Spam, and Defacement. The project is based on an original implementation described in a Medium post by Aaditya Jain, Anirudh Bhaskar, Srikanth, and Rohith Ramakrishnan. We have modified the feature extraction and the model used for classification.
Before running this project, make sure you have the following installed:
Clone the repository and create a Python environment with the required packages:
git clone https://github.com/gitzaan/Transfer-Learning-for-Customizable-Web-Filtering
cd Transfer-Learning-for-Customizable-Web-Filtering
conda create -n pyenv python=3.8.5
conda activate pyenv
pip install -r requirements.txt
cd scripts/
python FeatureExtraction.py
The features extracted are explained and visualized in the DataProcessing.ipynb notebook.
The project extracts features from URLs and groups them into three kinds: characteristics of the URL string, domain features, and page features. For a detailed list of features, please refer to the original Medium post.
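As a rough illustration of the first group (URL string characteristics), here is a minimal sketch using only the Python standard library. The function name `url_string_features` and the specific features it computes are assumptions chosen for illustration; they are not necessarily the features computed by FeatureExtraction.py.

```python
from urllib.parse import urlparse


def url_string_features(url: str) -> dict:
    """Compute a few simple lexical features from a URL string.

    Hypothetical example features; the repository's actual feature set
    may differ.
    """
    # urlparse only recognises the host when a scheme is present,
    # so add a default scheme for bare inputs like "example.com/path".
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_dots": url.count("."),
        # Naive count: ignores multi-part suffixes such as "co.uk".
        "num_subdomains": max(host.count(".") - 1, 0),
        "has_https": parsed.scheme == "https",
        "path_depth": len([part for part in parsed.path.split("/") if part]),
        "has_at_symbol": "@" in url,
    }


if __name__ == "__main__":
    print(url_string_features("https://login.example.com/a/b?id=42"))
```

Domain and page features typically need network access (for example WHOIS lookups or fetching the page), so they are left out of this sketch.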
In our modified version of the project, we use a Convolutional Neural Network (CNN) with LSTM layers for classification. The architecture includes convolutional layers, batch normalization, max-pooling, dropout layers, and fully connected layers. The model is trained with the Adam optimizer.
You can train the model using:
cd scripts/
python modelTrain.py
After training the model, we evaluate its performance using metrics such as accuracy and generate a classification report. The report details the model's classification performance for each category.
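To make these metrics concrete, the sketch below computes accuracy and a per-class precision/recall/F1 summary in plain Python. The function name `classification_summary` and the sample labels are hypothetical; the training script may well rely on a library routine (such as scikit-learn's `classification_report`) instead.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Return (accuracy, per-class report) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    accuracy = sum(tp.values()) / len(y_true)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return accuracy, report


if __name__ == "__main__":
    # Made-up labels, purely for illustration.
    truth = ["Benign", "Benign", "Phishing", "Malware"]
    preds = ["Benign", "Phishing", "Phishing", "Malware"]
    acc, rep = classification_summary(truth, preds)
    print(f"accuracy = {acc:.2f}")
    for label, row in rep.items():
        print(label, row)
```

Per-class numbers matter here because URL datasets are often imbalanced, so overall accuracy alone can hide poor recall on a rare category.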
To make predictions on a new URL, follow these steps:
- Load the LabelEncoder and StandardScaler from the saved files.
- Load the pre-trained model.
- Featurize the URL and prepare the features for prediction.
- Standardize the features and reshape them to match the model's input shape.
- Make predictions and convert numerical predictions to class labels.
You can make predictions by running the following:
cd scripts/
python predict_args.py -i <url>
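The prediction steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the contents of predict_args.py: the helper names, the `(1, n_features, 1)` input shape, the class order, and the stand-in `fake_model` are all hypothetical. In the real script the mean, scale, and class names would come from the saved StandardScaler and LabelEncoder, and the probabilities from the trained model.

```python
def standardize(features, mean, scale):
    """Apply (x - mean) / scale per feature, as a fitted scaler would."""
    return [(x - m) / s for x, m, s in zip(features, mean, scale)]


def reshape_for_model(features):
    """Wrap a flat feature list as a batch of one: shape (1, n_features, 1).

    Assumed input shape; the actual model may expect something else.
    """
    return [[[x] for x in features]]


def decode_prediction(probabilities, class_names):
    """Map the highest-probability index back to its class label."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best]


def predict_label(features, mean, scale, model_fn, class_names):
    """Run the standardize -> reshape -> predict -> decode pipeline."""
    batch = reshape_for_model(standardize(features, mean, scale))
    probabilities = model_fn(batch)[0]  # first (only) item in the batch
    return decode_prediction(probabilities, class_names)


if __name__ == "__main__":
    # Hypothetical class order (alphabetical, as a label encoder would sort).
    classes = ["Benign", "Defacement", "Malware", "Phishing", "Spam"]

    def fake_model(batch):
        # Stand-in for the trained network: fixed probabilities per input.
        return [[0.05, 0.05, 0.10, 0.70, 0.10] for _ in batch]

    label = predict_label(
        features=[35.0, 2.0, 1.0],
        mean=[30.0, 1.0, 0.5],
        scale=[10.0, 2.0, 0.5],
        model_fn=fake_model,
        class_names=classes,
    )
    print(label)
```

The scaler and label encoder must be the same objects fitted during training; refitting them on new data would shift the feature scale and could scramble the index-to-label mapping.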
This project is licensed under the MIT License - see the LICENSE file for details.