Final PPT - Phishing Website

PHISHING WEBSITE DETECTION USING MACHINE
LEARNING ALGORITHMS
Abstract
• A phishing attack is one of the simplest ways to obtain sensitive information from unsuspecting
users. Phishers aim to acquire critical information such as usernames, passwords, and bank account
details. Cybersecurity professionals are therefore looking for reliable and stable techniques for
detecting phishing websites. This paper applies machine learning to the detection of phishing
URLs by extracting and analyzing features of legitimate and phishing URLs. Decision Tree,
Random Forest, and Support Vector Machine (SVM) algorithms are used to detect phishing
websites. The paper aims both to detect phishing URLs and to identify the best-performing
algorithm by comparing the accuracy, false positive rate, and false negative rate of each.
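The three comparison criteria named in the abstract can all be derived from a confusion matrix. The following is a minimal Java sketch, assuming labels are encoded as 1 (phishing) and 0 (legitimate); the class name `DetectionMetrics` and its method names are illustrative and not taken from the original project.

```java
/** Illustrative helper (hypothetical name): confusion-matrix metrics for a binary phishing classifier. */
final class DetectionMetrics {
    final int tp; // phishing predicted as phishing
    final int tn; // legitimate predicted as legitimate
    final int fp; // legitimate predicted as phishing
    final int fn; // phishing predicted as legitimate

    /** Assumes every label is either 1 (phishing) or 0 (legitimate). */
    DetectionMetrics(int[] actual, int[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays must have the same length");
        }
        int tpCount = 0, tnCount = 0, fpCount = 0, fnCount = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 1 && predicted[i] == 1) tpCount++;
            else if (actual[i] == 0 && predicted[i] == 0) tnCount++;
            else if (actual[i] == 0 && predicted[i] == 1) fpCount++;
            else fnCount++;
        }
        tp = tpCount; tn = tnCount; fp = fpCount; fn = fnCount;
    }

    /** Share of all URLs that were classified correctly. */
    double accuracy() {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    /** Share of legitimate URLs wrongly flagged as phishing. */
    double falsePositiveRate() {
        return fp + tn == 0 ? 0.0 : (double) fp / (fp + tn);
    }

    /** Share of phishing URLs that were missed. */
    double falseNegativeRate() {
        return fn + tp == 0 ? 0.0 : (double) fn / (fn + tp);
    }

    public static void main(String[] args) {
        DetectionMetrics m = new DetectionMetrics(new int[] {1, 1, 0, 0, 0}, new int[] {1, 0, 0, 1, 0});
        System.out.println(m.accuracy() + " " + m.falsePositiveRate() + " " + m.falseNegativeRate());
    }
}
```

Running the same helper on the predictions of each algorithm over one shared test set gives directly comparable numbers.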
Existing System
• Phishing has become a major concern for security researchers because it is easy to create a fake website
that closely resembles a legitimate one. Experts can identify fake websites, but many users cannot, and
those users become victims of phishing attacks. The annual worldwide impact of phishing has been
estimated to be as high as $5 billion. Phishing attacks succeed largely because of a lack of user awareness.
Because phishing exploits weaknesses in users, it is very difficult to mitigate, which makes it all the more
important to improve phishing detection techniques.
• The common way to detect phishing websites is the "blacklist" method, in which blacklisted URLs and
Internet Protocol (IP) addresses are added to the antivirus database. To evade blacklists, attackers modify
URLs so that they appear legitimate, using obfuscation and other simple techniques such as fast-flux, in
which proxies are automatically generated to host the web page, and algorithmic generation of new URLs.
The major drawback of this method is that it cannot detect zero-hour phishing attacks.
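The limitation of the blacklist method can be seen in a small sketch. The Java code below is an illustration under stated assumptions, not the implementation used by any particular antivirus product: the class name `BlacklistChecker` is hypothetical, matching is done on the host name only, and the domains are reserved example names.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Illustrative blacklist lookup (hypothetical class): flags a URL only if its host is already listed. */
final class BlacklistChecker {
    private final Set<String> blockedHosts; // lower-case host names known to be malicious

    BlacklistChecker(Set<String> blockedHosts) {
        this.blockedHosts = blockedHosts;
    }

    boolean isBlacklisted(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && blockedHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treated as "not listed" in this sketch
        }
    }

    public static void main(String[] args) {
        BlacklistChecker checker = new BlacklistChecker(Set.of("phish.example.com"));
        System.out.println(checker.isBlacklisted("http://phish.example.com/login"));
        // A freshly generated host is not in the list yet, so the lookup cannot flag it.
        System.out.println(checker.isBlacklisted("http://phish2.example.com/login"));
    }
}
```

Because the lookup only knows hosts that have already been reported, a newly registered or algorithmically generated host passes the check until the list is updated, which is the zero-hour gap described above.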
Existing Algorithm
• Support Vector Machine (SVM) is a supervised learning algorithm that can be used for
classification or regression tasks. It finds the hyperplane that best separates classes in an
n-dimensional space. SVM has been employed for phishing website detection because it handles
high-dimensional data effectively.
• K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm that stores the training
instances and classifies new data points based on their proximity to the stored points. It is often
used in phishing website detection because it is simple and effective, especially when there is not
a large amount of training data.
• Gradient Boosting Machine (GBM) is an ensemble learning technique that builds a strong model
by combining multiple weak models, typically decision trees, in sequence. Each new model is
trained to correct the errors made by the existing models. GBM has been shown to be effective in
detecting phishing websites.
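Of the three algorithms above, KNN is the easiest to show in a few lines. The following Java sketch assumes numeric feature vectors, labels of 1 (phishing) and 0 (legitimate), and Euclidean distance; the class name `KnnClassifier` is hypothetical, and ties are resolved in favour of "legitimate" purely as a choice for this example.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative k-nearest-neighbours classifier (hypothetical class) for 0/1 labels. */
final class KnnClassifier {
    private final double[][] features; // stored training instances
    private final int[] labels;        // 1 = phishing, 0 = legitimate
    private final int k;

    KnnClassifier(double[][] features, int[] labels, int k) {
        this.features = features;
        this.labels = labels;
        this.k = k;
    }

    /** Returns the majority label among the k closest training instances. */
    int predict(double[] query) {
        Integer[] order = new Integer[features.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort training indices by distance to the query point.
        Arrays.sort(order, Comparator.comparingDouble(i -> squaredDistance(features[i], query)));
        int neighbours = Math.min(k, order.length);
        int phishingVotes = 0;
        for (int j = 0; j < neighbours; j++) phishingVotes += labels[order[j]];
        // Strict majority needed for "phishing"; a tie falls back to "legitimate".
        return 2 * phishingVotes > neighbours ? 1 : 0;
    }

    // Squared Euclidean distance gives the same ordering as the true distance.
    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        int[] y = {0, 0, 1, 1};
        System.out.println(new KnnClassifier(x, y, 3).predict(new double[] {5, 6}));
    }
}
```

Since every prediction scans all stored instances, this approach suits small training sets, which matches the use case described in the KNN bullet.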
Drawback in Existing System
• Experts can identify fake websites, but many users cannot, and those users become victims of
phishing attacks.
• In real-world scenarios, complex and nonlinear data often cannot be separated cleanly.
Proposed System
• Data Collection: Gather a dataset containing samples of both legitimate and phishing websites.
This dataset should include features extracted from the website content, URL characteristics,
metadata, and possibly historical data on known phishing attacks.
• Data Preprocessing: Clean the dataset by handling missing values, encoding categorical variables,
and scaling numerical features if necessary. This step may also involve feature selection or
dimensionality reduction techniques to improve model performance and reduce computation time.
• Feature Extraction: Extract relevant features from the collected data. Features may include URL
length, domain age, presence of HTTPS, presence of suspicious keywords, HTML and JavaScript
analysis, presence of redirection, WHOIS information, etc.
• Model Training: Train machine learning models on the preprocessed dataset. Experiment with
various algorithms such as Random Forest, Support Vector Machines, Logistic Regression, K-
Nearest Neighbors, Gradient Boosting Machines, Neural Networks, Naive Bayes, and Decision
Trees to find the best-performing models.
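A few of the URL-based features listed above can be computed from the URL string alone. The Java sketch below is illustrative only: the class name `UrlFeatureExtractor`, the keyword list, and the chosen feature order are assumptions, and features that need external lookups (domain age, WHOIS information, HTML and JavaScript analysis) are left out.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

/** Illustrative extractor (hypothetical class) for features that need only the URL string. */
final class UrlFeatureExtractor {
    // Hypothetical keyword list, chosen for illustration only.
    private static final List<String> SUSPICIOUS_KEYWORDS =
            List.of("login", "verify", "update", "secure", "account");

    /** Returns {length, usesHttps, dotsInHost, hostIsIpv4, containsAtSign, keywordCount}. */
    static double[] extract(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        String host = hostOf(lower);
        double length = url.length();
        double usesHttps = lower.startsWith("https://") ? 1 : 0;
        // The number of dots in the host is a rough proxy for subdomain depth.
        double dotsInHost = host.chars().filter(c -> c == '.').count();
        double hostIsIpv4 = host.matches("\\d{1,3}(\\.\\d{1,3}){3}") ? 1 : 0;
        double containsAtSign = url.indexOf('@') >= 0 ? 1 : 0;
        double keywordCount = SUSPICIOUS_KEYWORDS.stream().filter(lower::contains).count();
        return new double[] {length, usesHttps, dotsInHost, hostIsIpv4, containsAtSign, keywordCount};
    }

    // Returns an empty host when the URL cannot be parsed.
    private static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? "" : host;
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        double[] f = extract("http://192.168.0.1/secure-login/verify?acct=1");
        System.out.println(java.util.Arrays.toString(f));
    }
}
```

Each URL becomes one numeric row, which is the form the preprocessing and training steps expect.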
Advantages
• Scalability: ML algorithms can scale to handle large datasets, making them suitable for continuously
monitoring and analyzing a vast number of websites in real-time.
• Adaptability: Machine learning models can adapt to evolving phishing tactics and techniques by
continuously learning from new data. This adaptability is crucial in combating emerging threats in the
dynamic landscape of cyber attacks.
• Accuracy: With appropriate training data and feature engineering, machine learning models can achieve
high accuracy in distinguishing between legitimate and phishing websites, reducing false positives and false
negatives.
• Efficiency: ML algorithms can efficiently process and classify websites, enabling quick identification of
potential phishing threats without significant computational overhead.
• Feature Extraction: ML algorithms can automatically extract relevant features from website content, URL
structures, and metadata, capturing subtle patterns and characteristics indicative of phishing activities.
System Specification
Hardware
• Processor : i3 Processor
• Hard Disk : 500 GB
• RAM : 2 GB
Software
• Operating system : Windows 10
• Coding language : Java
• IDE : Eclipse
• Database : MySQL
Software Specification
Data Flow Diagram
Remote User
Service Provider
Use case Diagram
Class Diagram
Sequence Diagram
List of Modules
Service Provider
• In this module, the Service Provider logs in with a valid user name and password. After a successful login, the Service Provider
can perform operations such as Browse Website URLs and Train & Test Data Sets, View Trained and Tested Accuracy in Bar
Chart, View Trained and Tested Accuracy Results, View Prediction Of Website URL Type, View Website URL Type Ratio,
Download Trained Data Sets, View Website URL Type Ratio Results, and View All Remote Users.
View and Authorize Users
• In this module, the admin can view the list of all registered users, including details such as user name, email, and address, and
can authorize users.
Remote User
• In this module, any number (n) of users may be present. A user must register before performing any operations. Once a user
registers, their details are stored in the database. After successful registration, the user logs in with the authorized user name and
password. After a successful login, the user can perform operations such as PREDICT WEBSITE URL TYPE and VIEW YOUR PROFILE.
Modules Description
1. Data Collection:
1. Gather a dataset of URLs that includes features indicating whether a website is legitimate or phishing.
2. Features can include URL length, domain age, presence of HTTPS, use of subdomains, etc.
3. Label each URL as 0 (legitimate) or 1 (phishing) based on ground truth information.
2. Data Preprocessing:
1. Clean the data by removing any irrelevant features or outliers.
2. Normalize numerical features to ensure they have similar scales.
3. Encode categorical features into numerical values using techniques like one-hot encoding.
3. Split the Data:

1. Divide the dataset into training and testing sets (e.g., 70% for training and 30% for testing).
2. Optionally, you can also use techniques like k-fold cross-validation for better evaluation.
4. Feature Selection:
1. Use feature selection techniques such as Recursive Feature Elimination (RFE) or feature importance from SVM to identify the most relevant
features.
5. Model Training:
1. Initialize an SVM classifier, choosing an appropriate kernel function (e.g., linear, polynomial, radial basis function).
2. Train the SVM model using the training data, fitting the classifier to learn the decision boundary between phishing and legitimate URLs.
6. Model Evaluation:
1. Evaluate the trained SVM model using the testing dataset.
2. Calculate metrics such as accuracy, precision, recall, F1 score, and ROC-AUC to assess the model's performance.
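Steps 2 and 3 above (normalization and the train/test split) can be sketched without any ML library. The Java code below is a minimal illustration, assuming min-max scaling and a shuffled 70/30 split; the class name `DataPreparation`, the method names, and the fixed seed are hypothetical choices for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative preprocessing helpers (hypothetical class): shuffled split and min-max scaling. */
final class DataPreparation {

    /** Returns {trainIndices, testIndices} after shuffling row indices with a fixed seed. */
    static int[][] trainTestSplit(int rowCount, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        int trainSize = (int) Math.round(rowCount * trainFraction);
        int[] train = new int[trainSize];
        int[] test = new int[rowCount - trainSize];
        for (int i = 0; i < rowCount; i++) {
            if (i < trainSize) train[i] = indices.get(i);
            else test[i - trainSize] = indices.get(i);
        }
        return new int[][] {train, test};
    }

    /**
     * Scales each column of data with the minimum and maximum observed in train,
     * so that no statistics from the test rows leak into preprocessing.
     */
    static double[][] minMaxScale(double[][] train, double[][] data) {
        int cols = train[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : train) {
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
        double[][] scaled = new double[data.length][cols];
        for (int i = 0; i < data.length; i++) {
            for (int j = 0; j < cols; j++) {
                double range = max[j] - min[j];
                // A constant column carries no information; map it to 0.
                scaled[i][j] = range == 0 ? 0.0 : (data[i][j] - min[j]) / range;
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        int[][] parts = trainTestSplit(10, 0.7, 42L);
        System.out.println(parts[0].length + " training rows, " + parts[1].length + " test rows");
    }
}
```

Test values outside the training range scale to values below 0 or above 1, which is expected when the scaling parameters come from the training rows only.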
Algorithm Used
Detecting phishing websites using machine learning algorithms typically involves using supervised learning techniques. Some
common machine learning algorithms and methods used for phishing website detection include:
1. Logistic Regression: This algorithm is commonly used for binary classification tasks like phishing detection, where the output is
either phishing or legitimate.
2. Decision Trees: Decision trees can be effective for phishing detection as they can capture complex relationships between features.
Ensemble methods like Random Forests and Gradient Boosting Machines (GBM) can also be used for improved performance.
3. Support Vector Machines (SVM): SVMs are good for binary classification tasks and can be trained to distinguish between
phishing and legitimate websites based on various features.
4. K-Nearest Neighbors (KNN): KNN is a simple and intuitive algorithm that can be used for phishing detection by considering the
similarity between features of websites.
5. Naive Bayes: Naive Bayes classifiers assume independence between features and are particularly useful when dealing with text-
based features extracted from the website content.
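To make the first entry in the list concrete, the following is a minimal Java sketch of logistic regression trained with stochastic gradient descent on the log loss. It is an illustration under stated assumptions, not the project's implementation: the class name `LogisticRegressionSketch`, the learning rate, the epoch count, and the 0.5 decision threshold are all choices made for this example.

```java
/** Illustrative logistic regression (hypothetical class) for 0/1 labels, trained with plain SGD. */
final class LogisticRegressionSketch {
    private final double[] weights;
    private double bias;

    LogisticRegressionSketch(int featureCount) {
        weights = new double[featureCount];
    }

    /** One SGD update per training row, repeated for the given number of epochs. */
    void fit(double[][] x, int[] y, double learningRate, int epochs) {
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                // Gradient of the log loss with respect to the linear score.
                double error = probability(x[i]) - y[i];
                for (int j = 0; j < weights.length; j++) {
                    weights[j] -= learningRate * error * x[i][j];
                }
                bias -= learningRate * error;
            }
        }
    }

    /** Estimated probability that the feature vector belongs to class 1 (phishing). */
    double probability(double[] features) {
        double z = bias;
        for (int j = 0; j < weights.length; j++) z += weights[j] * features[j];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    int predict(double[] features) {
        return probability(features) >= 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        double[][] x = {{0.0}, {0.1}, {0.9}, {1.0}};
        int[] y = {0, 0, 1, 1};
        LogisticRegressionSketch model = new LogisticRegressionSketch(1);
        model.fit(x, y, 0.5, 2000);
        System.out.println(model.predict(new double[] {0.95}));
    }
}
```

The model outputs a probability rather than a hard label, so the 0.5 threshold can be moved to trade false positives against false negatives.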

Algorithm Implementation
• SVM is a supervised learning algorithm that can be effective for binary classification
tasks such as identifying phishing websites. Below is a step-by-step process for using SVM
for phishing website detection:
• Data Collection
• Data Pre-processing
• Split the Data
• Feature Selection
• Model Training
• Model Evaluation
• Model Deployment
Screen Shots
Conclusion
• Phishing website detection using machine learning is a crucial area of research aimed at
combating cyber threats and protecting users from falling victim to fraudulent activities
online. Through this process, various machine learning algorithms are employed to
analyze features of websites and classify them as either legitimate or phishing.
• Machine learning algorithms have demonstrated high accuracy and efficiency in
detecting phishing websites. Techniques such as supervised learning, unsupervised
learning, and deep learning have been successfully applied to classify websites based on
features like URL structure, domain age, SSL certificate validity, and content analysis.
• Machine learning has proven to be a powerful tool in the fight against phishing,
enabling accurate and efficient detection of malicious websites. Continued research and
innovation in this field are essential to stay ahead of cyber threats and protect users'
online safety.
Future work
• Real-time detection: Develop efficient algorithms and models capable of real-time
phishing website detection, suitable for deployment in web browsers, email clients, or
network security systems.
• Scalability: Design scalable solutions that can handle large volumes of web traffic and
rapidly adapt to new phishing tactics and patterns.
THANK YOU
