Fake News Detection
Group Members:
Muhammad Saad ur Rehman
Izhar Mehdi
The study utilizes the Fake News Detection Dataset obtained from Kaggle. The dataset is preprocessed by converting text to lowercase, removing URLs, non-alphanumeric and special characters, tokenization, removing stopwords, and lemmatization using the Natural Language Toolkit (NLTK) library. These preprocessing steps aim to standardize the text data and improve the models' performance.

The models are trained using a pipeline approach, combining text vectorization with TF-IDF and the respective classifier. The scikit-learn library is employed for model implementation and evaluation. The dataset is divided into training and testing sets using the train_test_split function. The models are evaluated based on accuracy, precision, recall, and F1-score, with a focus on their ability to identify fake news instances.
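As a concrete illustration, a minimal sketch of this setup (assuming hypothetical variables X and y hold the preprocessed article texts and their real/fake labels; the study's actual code may differ) could look like:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# X: preprocessed article texts, y: real/fake labels (hypothetical variables)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline couples TF-IDF vectorization with a classifier, as described above
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', RandomForestClassifier(random_state=42))
])
pipeline.fit(X_train, y_train)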
The results reveal that both the Random Forest and SVM classifiers exhibit promising performance in detecting fake news. The Random Forest classifier demonstrates higher accuracy and precision, indicating its proficiency in correctly classifying fake news instances. On the other hand, the SVM classifier exhibits better recall, suggesting its capability to identify a higher proportion of actual fake news cases.

Detailed classification reports, confusion matrices, and precision-recall curves provide insights into the models' performance across different classes. These evaluation metrics aid in understanding the models' strengths and weaknesses, highlighting the trade-off between precision and recall.
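One way these reports and curves could be produced with scikit-learn, reusing the fitted pipeline from the sketch above and assuming the labels are encoded as 0/1:

from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve

y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))   # per-class precision, recall, F1-score
print(confusion_matrix(y_test, y_pred))        # raw counts of correct and incorrect predictions

# The precision-recall curve needs class probabilities rather than hard predictions
scores = pipeline.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)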
Overall, the study underscores the potential of machine learning algorithms, particularly Random Forest and SVM, in combating fake news. The findings contribute to the ongoing efforts to develop robust tools for fact-checking and empowering news consumers to make informed decisions. Future work may focus on refining the models, incorporating additional features, and addressing dataset quality and bias concerns.

II. Data Preprocessing

The data preprocessing section describes the steps taken to prepare the dataset for machine learning. These steps include converting the text to lowercase, removing URLs, non-alphanumeric and special characters, and tokenization. Additionally, stopwords (commonly used words like "the," "is," etc.) are removed, and the remaining words are lemmatized using the Natural Language Toolkit (NLTK) library [5]. These preprocessing techniques help to standardize the text data and improve the performance of machine learning models.
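The later sections refer to a preprocess_text function; a sketch of how these steps could be implemented with NLTK (the report's exact implementation may differ) is:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = text.lower()                                 # lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # remove URLs
    text = re.sub(r'[^a-z0-9\s]', ' ', text)            # remove non-alphanumeric and special characters
    tokens = word_tokenize(text)                        # tokenization
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return ' '.join(tokens)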
III. Experiment with Different Classifiers

To experiment with different classifiers, you can modify the existing code by adding new classifiers and evaluating their performance alongside the ones already implemented. Here's how you can do it based on the provided code:

Import the necessary libraries for the new classifiers you want to use. For example, if you want to use SVM, Random Forest, and Gradient Boosting, you would need to import sklearn.svm.SVC, sklearn.ensemble.RandomForestClassifier, and sklearn.ensemble.GradientBoostingClassifier, respectively.

Update the train_classifiers function to include the new classifiers. Add the new classifiers to the classifiers dictionary and define their corresponding hyperparameter search spaces in the param_grid dictionary. For example:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

classifiers = {
    'Support Vector Machine': SVC(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}
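The corresponding param_grid mentioned above is not shown in the extract; a hypothetical version, together with the grid search loop it would feed, might look like:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Illustrative search spaces only; the report's actual grids may differ
param_grid = {
    'Support Vector Machine': {'classifier__C': [0.1, 1.0, 10.0]},
    'Random Forest': {'classifier__n_estimators': [100, 200]},
    'Gradient Boosting': {'classifier__learning_rate': [0.05, 0.1]}
}

for name, clf in classifiers.items():
    pipeline = Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', clf)])
    search = GridSearchCV(pipeline, param_grid[name], cv=5, scoring='f1_macro')
    search.fit(X_train, y_train)
    print(name, search.best_params_, search.best_score_)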
IV. Hyperparameter Optimization Techniques

The hyperparameter optimization technique used is grid search, which exhaustively searches through a predefined grid of hyperparameter values to find the best combination. However, there are other more advanced techniques that can be implemented to efficiently search for the best hyperparameters. Here's an explanation of some of those techniques:

Random Search: Instead of exhaustively searching through a predefined grid, random search randomly samples hyperparameter values from a given distribution for a fixed number of iterations. This approach allows for a broader exploration of the hyperparameter space and can be more efficient when only a few hyperparameters have a significant impact on the model's performance.

To implement random search, you can use the RandomizedSearchCV class from scikit-learn instead of GridSearchCV. The param_distributions parameter of RandomizedSearchCV takes a dictionary of parameter distributions instead of a grid. Here's an example:

param_distributions = {
    'classifier__C': [0.1, 1.0, 10.0],
    'classifier__kernel': ['linear', 'rbf'],
    'classifier__gamma': [0.1, 0.01, 0.001]
}
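This dictionary can then be passed to RandomizedSearchCV in place of a grid; a sketch, assuming pipeline wraps the TF-IDF vectorizer and the SVC classifier as before:

from sklearn.model_selection import RandomizedSearchCV

# Samples n_iter random combinations instead of exhaustively trying every one
random_search = RandomizedSearchCV(pipeline,
                                   param_distributions=param_distributions,
                                   n_iter=10, cv=5, scoring='f1_macro',
                                   random_state=42)
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)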
Bayesian Optimization: Rather than sampling blindly, Bayesian optimization builds a model of the objective function and uses it to choose promising hyperparameter values to try next. The Optuna library implements this style of search; an objective function defines the search space and returns the score to maximize:

import optuna
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Sample a candidate hyperparameter configuration for this trial
    param_grid = {
        'classifier__C': trial.suggest_loguniform('classifier__C', 0.1, 10.0),
        'classifier__kernel': trial.suggest_categorical('classifier__kernel', ['linear', 'rbf']),
        'classifier__gamma': trial.suggest_loguniform('classifier__gamma', 0.001, 0.1)
    }

    # vectorizer and classifier are assumed to be defined by the surrounding code
    pipeline = Pipeline([
        ('vectorizer', vectorizer),
        ('classifier', classifier)
    ])
    pipeline.set_params(**param_grid)

    # Score each configuration by mean macro-F1 over 5-fold cross-validation
    f1_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1_macro')
    return f1_scores.mean()
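Driving this objective is then a matter of creating a study and letting Optuna propose trials; the number of trials below is an arbitrary choice:

study = optuna.create_study(direction='maximize')   # maximize mean F1 across folds
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)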
V. Feature Engineering

The feature engineering technique used is TF-IDF (Term Frequency-Inverse Document Frequency), which represents each text document as a numerical vector based on the frequency of words in the document relative to their importance in the entire corpus. However, there are more advanced feature engineering techniques that can be explored to potentially improve the model's performance. Here are some examples:

Pre-trained Word Embeddings (Word2Vec or GloVe): Word embeddings are dense vector representations of words that capture semantic and contextual information. Instead of representing words as sparse vectors with TF-IDF, pre-trained word embeddings can be used to encode words into dense vectors. These embeddings can be trained on large corpora and capture more nuanced semantic information than sparse representations.
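As a sketch of the Word2Vec option (using the gensim library, which the report does not mention, so treat this purely as an illustration), each document can be represented by the average of its word vectors:

import numpy as np
from gensim.models import Word2Vec

# Train Word2Vec on the tokenized training corpus (hypothetical setup)
tokenized_docs = [doc.split() for doc in X_train]
w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=2)

def embed(doc):
    # Average the vectors of in-vocabulary words; zero vector if none are known
    vectors = [w2v.wv[t] for t in doc.split() if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

X_train_emb = np.vstack([embed(doc) for doc in X_train])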
To adopt these techniques in the existing code:

Modify the preprocess_text function to tokenize the text data into individual words or subwords as required by the chosen word embeddings or transformer models.

Replace or augment the TfidfVectorizer in the train_classifiers function with the appropriate feature engineering technique. For example, you can use Word2Vec or GloVe to transform the text data into word embedding vectors, or you can use a pre-trained transformer model to extract contextualized representations of the text.

Fine-tune the pre-trained models on your specific dataset if using transformer models. This involves passing your text data through the pre-trained model, extracting the contextualized representations, and training a classification model on top of these representations (see the sketch below).
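A sketch of the simpler feature-extraction variant (no fine-tuning), assuming the Hugging Face transformers library and an arbitrary choice of encoder:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = model(**batch)
    # Use the [CLS] token's hidden state as a fixed-size document representation
    return output.last_hidden_state[:, 0, :].numpy()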
By incorporating advanced feature engineering techniques, you can capture more nuanced semantic information from the text data, potentially improving the model's performance on the classification task.
VI. Ensemble Learning

Ensemble learning is utilized to combine multiple classifiers to improve the overall performance and robustness of the models. Specifically, two ensemble techniques are implemented:

Bagging: Bagging, short for bootstrap aggregating, involves training multiple classifiers independently on different subsets of the training data and then combining their predictions through majority voting (for classification) or averaging (for regression). The Random Forest algorithm is an example of a bagging ensemble method that combines multiple decision trees. In the code, the Random Forest classifier is used as one of the ensemble classifiers. Multiple Random Forest classifiers are trained with different hyperparameter configurations using grid search, and the best-performing Random Forest classifier is selected based on the F1 score. The final ensemble of classifiers includes the selected Random Forest classifier.

Combining classifiers in this way often leads to improved prediction accuracy, generalization performance, and model robustness, making ensemble techniques a powerful approach in machine learning.
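The extract does not show the ensemble code itself; one plausible shape for a majority-voting combination of the tuned classifiers (illustrative only, not the report's exact construction) is scikit-learn's VotingClassifier:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hard voting takes the majority class across the member classifiers
voting = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier(random_state=42)),
    ('svm', SVC(random_state=42))
], voting='hard')

ensemble = Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', voting)])
ensemble.fit(X_train, y_train)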
VII. Explainability and Interpretability

Step 1: Feature Importance Analysis
By employing techniques like permutation importance or SHAP values, the code can identify the features that have the most significant impact on the classifier's decisions. This helps in understanding which factors contribute more to the classification process. However, this specific code does not include feature importance analysis.
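If one wanted to add such an analysis, a lightweight first approximation, using the Random Forest's built-in impurity-based importances rather than permutation importance, could be:

import numpy as np

# Assumes the fitted pipeline from the earlier sketch with a RandomForestClassifier step
vectorizer = pipeline.named_steps['vectorizer']
classifier = pipeline.named_steps['classifier']
feature_names = vectorizer.get_feature_names_out()

# Print the 20 terms the forest relies on most
top = np.argsort(classifier.feature_importances_)[::-1][:20]
for idx in top:
    print(feature_names[idx], classifier.feature_importances_[idx])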
Step 2: LIME (Local Interpretable Model-Agnostic Explanations)
LIME provides explanations for individual predictions by approximating the behavior of a black-box classifier using local interpretable models. In the code, LIME is not explicitly used for explanation generation.
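Adding LIME would only take a few lines on top of a fitted pipeline; a sketch, assuming the lime package and that the pipeline exposes predict_proba over raw text:

from lime.lime_text import LimeTextExplainer

# Class name order must match the label encoding (assumed here)
explainer = LimeTextExplainer(class_names=['real', 'fake'])
explanation = explainer.explain_instance(X_test[0],
                                         pipeline.predict_proba,
                                         num_features=10)
print(explanation.as_list())   # (word, weight) pairs for this single prediction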
Step 3: SHAP (SHapley Additive exPlanations)
SHAP values provide explanations for individual predictions by assigning each feature an importance value based on its contribution to the prediction. However, the code does not use SHAP values for explanation generation.
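Similarly, SHAP values could be added with the model-agnostic KernelExplainer; a sketch (computationally expensive, hence the small samples, and assuming the shap package):

import shap

# Explain the classifier over TF-IDF features, using a small background sample
X_train_vec = pipeline.named_steps['vectorizer'].transform(X_train)
X_test_vec = pipeline.named_steps['vectorizer'].transform(X_test)

background = X_train_vec[:50].toarray()
explainer = shap.KernelExplainer(pipeline.named_steps['classifier'].predict_proba, background)
shap_values = explainer.shap_values(X_test_vec[:5].toarray())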
[Figure: Precision-Recall Curve]
References