Project Description

MM5413 Mar 12, 2024
Overview
In this project, you will apply the business forecasting models and techniques you learned in class to solve
real-world problems. You will be provided with a dataset of S&P 500 stock data and its associated news
information. You are required to submit a written report for grade assessment, together with your code
for reference. You must submit the project (a zipped file containing the written report and code) no later
than the deadline. Please email me ([email protected]) if you have any questions about the project.
Project Descriptions and Datasets
Deciphering Financial News for Stock Return Prediction

Business Problem: Use business forecasting models to predict stock returns.
Data: Stock data (“SP500Panel2019.csv”) are collected from the Thomson Reuters platform covering the
period from January 2019 to December 2019. Text-related variables are constructed using NLP techniques
and deep learning models beforehand.
Descriptions of the variables are shown below:

1. Stock: Stock indicator
2. Date: Date indicator
3. Year: Year indicator
4. Count: Daily count of financial news articles mentioning the stock
5. Volume: Daily abnormal trading volume of stock, measured based on the average approach (Clarke
et al. 2021, Kogan et al. 2020)
6. Return (Target Variable): Daily abnormal return of the stock, measured based on the Fama and
French three-factor model (Chen et al. 2021, Kogan et al. 2020, Luo et al. 2013)
7. CumPrevRet: Cumulative daily abnormal return of stock over the period [-30, -1] before the
publication date of the article
8. Fake: Daily average level of credibility/trustworthiness of financial news, measured by a
deep learning model
9. WC: Daily average of word count per article in financial news
10. WPS: Daily average of words per sentence in financial news
11. Sixltr: Daily average of the percentage of words having six or more letters in financial news
12. Positive: Daily average of positive sentiment score of financial news, measured based on the
Loughran and McDonald (LM) lexicon
13. Negative: Daily average of negative sentiment score of financial news, measured based on the LM
lexicon
14. Uncertainty: Daily average of uncertainty score of financial news, measured based on the LM
lexicon
News-Induced Market Signaling
15. Litigious: Daily average of litigious score of financial news, measured based on the LM lexicon
16. StrongModal: Daily average of strong modal score of financial news, measured based on the LM
lexicon
17. WeakModal: Daily average of weak modal score of financial news, measured based on the LM
lexicon
18. Constraining: Daily average of constraining score of financial news, measured based on the LM
lexicon
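As a starting point, the variable list above can be turned into a small loading routine. The following is a minimal sketch, assuming the CSV header uses exactly the variable names listed above and that pandas can parse the Date column without an explicit format; neither is confirmed here. The helper `load_panel`, the grouping constants, and the sample rows are hypothetical and for illustration only.

```python
import io

import pandas as pd

# Column names copied from the variable list; the grouping is an assumption.
TEXT_FEATURES = [
    "Count", "Fake", "WC", "WPS", "Sixltr", "Positive", "Negative",
    "Uncertainty", "Litigious", "StrongModal", "WeakModal", "Constraining",
]
MARKET_FEATURES = ["Volume", "CumPrevRet"]
TARGET = "Return"


def load_panel(source):
    """Read the panel and sort it by stock and date.

    `source` may be a file path (e.g. "SP500Panel2019.csv") or a file-like
    object. The date format of the real file is not documented here, so
    parsing is left to pandas.
    """
    df = pd.read_csv(source)
    df["Date"] = pd.to_datetime(df["Date"])
    return df.sort_values(["Stock", "Date"]).reset_index(drop=True)


# Tiny made-up sample with a subset of the columns, for illustration only.
sample_csv = io.StringIO(
    "Stock,Date,Year,Count,Volume,Return\n"
    "AAA,2019-01-03,2019,2,0.10,0.004\n"
    "AAA,2019-01-02,2019,1,-0.05,-0.002\n"
    "BBB,2019-01-02,2019,3,0.20,0.010\n"
)
panel = load_panel(sample_csv)
print(panel)
```

Sorting by stock and then by date keeps each stock's observations in time order, which matters for any lag-based feature or time series model built later.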
Evaluation Criteria
In general, you should record your step-by-step progress. The possible efforts include data cleaning, missing
data handling, outlier detection, feature engineering/selection, learning algorithm selection, parameter
tuning, etc. Clarity and organization of your written report are important when evaluating your project.
Please justify why you believe the problem addressed in your project is important and describe the
techniques you used to tackle the problem and the rationale behind your approaches.
To be specific, the assessment scheme is based on the following parts:
Basic Part (Max: 100 points):
Please refer to the last section, “Report Structure,” for more details on the score breakdown.
Deduction Part (Max: 30 points):

Deduct 0.5 points for each typo or spelling error.
Deduct 2 points for each extra page going beyond the 10-page limit.
Deduct 10 points for each day going beyond the submission deadline.
Project Status Meeting

If you have any problems or want to discuss the writing, ideas, methodology, or assessment criteria, please
email me ([email protected]) to schedule a meeting.
Final submission
In the final submission, you must submit a final written report and your Python code for a plagiarism check.
The report should follow the structure given in the next section. The report must be in 12-point
Times New Roman font, single- or double-spaced, and left-justified. Text in figures, tables, and exhibits
may be smaller than 12 points and may be single-spaced. The report MUST be no more than ten pages
(in Word or PDF format), including references and appendices. Please note that the report length will NOT
be used as a criterion for grading. The written report MUST be self-contained: include all necessary details
and information you find important in the report itself rather than in the code or other files.
Report Structure
1. Introduction (20 points)
Describe the business problem you are going to tackle. You may want to put your business problem in a
larger context and motivate the importance of the issue addressed in your project.
• State your business problem.

• Justify why it is important. You can provide a real-life example based on a media report or your
own experience for the justification.
• Describe briefly how you will address the business problem.
• Summarize the results.
• Provide a section list to describe the report’s structure.
2. Data Understanding & Preprocessing (20 points)
Describe the dataset. You may consider the following aspects: the number of data records; the number of
features and a brief description of their meanings, attribute type, range, mean, skewness; outliers; class
imbalance.
You should also consider data preprocessing, such as handling missing values and feature normalization.
• Report the summary statistics.

• Visualize the variable(s).
• Discuss the findings.
• Discuss whether there is any need to address the missing value.
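The checks listed in this section can be sketched in a few lines of pandas. This is an illustrative sketch under stated assumptions, not a required procedure: the helpers `describe_features` and `iqr_outlier_mask` are hypothetical names, Tukey's 1.5 × IQR rule is just one common outlier heuristic, and the demo values are made up.

```python
import numpy as np
import pandas as pd


def describe_features(df, columns):
    """Return count, mean, std, min, max, skewness and missing share per column."""
    stats = df[columns].describe().T[["count", "mean", "std", "min", "max"]]
    stats["skew"] = df[columns].skew()
    stats["missing_share"] = df[columns].isna().mean()
    return stats


def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up values, for illustration only (not project data).
demo = pd.DataFrame({
    "Return": [0.01, -0.02, 0.00, 0.03, 0.50],
    "Count": [1.0, 2.0, np.nan, 4.0, 3.0],
})
summary = describe_features(demo, ["Return", "Count"])
outliers = iqr_outlier_mask(demo["Return"])
print(summary)
print(outliers)
```

A table like `summary` covers the range, mean, skewness, and missing-value items in one place; whether flagged observations should be removed, winsorized, or kept is a judgment call to justify in the report.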
3. Model Building (30 points)
You must implement ONE deep learning model and ONE conventional time series model as the benchmark.
For each model built, indicate the parameter values and describe the conclusions you can draw from it. You
may dedicate a specific subsection to each model used.
Some additional efforts you can consider to improve model performance: e.g., feature normalization,
feature discretization, feature selection, and parameter tuning. Provide a logical explanation of why you
made such an effort.
• Describe the benchmark (time series model) briefly and justify your choice.
• Describe the model architectures and parameters.
• Describe the deep learning model briefly and justify your choice.
• Describe the model architectures and parameters.
• Hypothesize TWO improvements that can be applied to the deep learning model, give the reasoning
behind each, and propose the corresponding changes.
• Implement the proposed changes.
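To make the benchmark requirement concrete, here is a minimal sketch of one possible conventional time series model: a first-order autoregression, AR(1), fitted by ordinary least squares on a chronological training split. The choice of AR(1), the 80/20 split, and the helper names are assumptions for illustration; the brief does not prescribe a specific benchmark. The series is simulated rather than taken from the project data, and the deep learning model is not sketched here.

```python
import numpy as np


def chronological_split(y, train_share=0.8):
    """Split in time order so the test set lies strictly after the training set."""
    cut = int(len(y) * train_share)
    return y[:cut], y[cut:]


def fit_ar1(y):
    """Estimate y_t = c + phi * y_{t-1} + e_t by ordinary least squares."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (c, phi), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return c, phi


def one_step_forecasts(c, phi, history_last, y_test):
    """One-step-ahead forecasts, each using the actual previous observation."""
    lagged = np.concatenate([[history_last], y_test[:-1]])
    return c + phi * lagged


# Simulated AR(1) series with phi = 0.5, for illustration only.
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, len(y)):
    y[t] = 0.5 * y[t - 1] + rng.normal(scale=0.1)

y_train, y_test = chronological_split(y)
c_hat, phi_hat = fit_ar1(y_train)
preds = one_step_forecasts(c_hat, phi_hat, y_train[-1], y_test)
print(f"c = {c_hat:.4f}, phi = {phi_hat:.4f}, forecasts = {len(preds)}")
```

Splitting by time rather than at random avoids training on observations that come after the test period. For a fair comparison, the deep learning model would be evaluated on the same split.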
4. Performance Evaluation (20 points)
Indicate the performance measures (e.g., accuracy, ROC, R²) you have chosen to evaluate the performance
of the models built. You may want to summarize the performance of the built models on the chosen
measures in a table, which makes it easy to compare the different models.
You also need to explain why and how the models differ from each other in terms of
performance.
• State your evaluation metric(s).

• Compare the deep learning model and the benchmark.
• Test your hypotheses by comparing the improved deep learning model against the other models.
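Because the target variable (Return) is continuous, error-based measures are a natural fit alongside R². The sketch below computes MAE, RMSE, and R² with NumPy and prints a small comparison table. The helper `regression_metrics`, the choice of these three measures, and the numbers are assumptions for illustration, not results or required metrics.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for a set of forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
    }


# Made-up actual values and two trivial forecasts, for illustration only.
actual = [0.01, -0.02, 0.03, 0.00]
results = {
    "zero forecast": regression_metrics(actual, [0.0] * 4),
    "perfect forecast": regression_metrics(actual, actual),
}

print(f"{'model':<18}{'MAE':>8}{'RMSE':>8}{'R2':>8}")
for name, m in results.items():
    print(f"{name:<18}{m['MAE']:>8.4f}{m['RMSE']:>8.4f}{m['R2']:>8.3f}")
```

Note that R² computed on a test set can be negative when a model forecasts worse than the test-period mean, so a negative value does not indicate a coding error.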
5. Conclusion, Discussion, and Limitations (10 points)
Summarize the problem addressed and how the conclusions drawn from the built models help you
tackle it. List any remaining problems as future work.
• Highlight the importance of your business problem.

• Summarize the findings.
• Discuss the implications of your findings.
• State the limitation(s).
• State the future direction(s).
6. References (If any)
7. Appendices (If any)
