2024 COMP1702 Coursework


COMP1702 Big Data
Contribution: 100% of course
Course Leader: Hai Huang
Coursework Deadline Date: 8 April 2024 (23:30, UK time)

Feedback and grades are normally made available within 21 calendar days of the coursework
deadline
Learning Outcomes:
1. Explain the concept of Big Data and its importance in a modern economy
2. Explain the core architecture and algorithms underpinning big data processing
3. Analyse and visualize large data sets using a range of statistical and big data technologies
4. Critically evaluate, select and employ appropriate tools and technologies for the development of big data applications

Plagiarism is presenting somebody else's work as your own. It includes copying
information directly from the Web or books without referencing the material;
submitting joint coursework as an individual effort; copying another student's
coursework; stealing coursework from another student and submitting it as your own
work. Suspected plagiarism will be investigated and if found to have occurred will be
dealt with according to the procedures set down by the University. Please see your
student handbook for further details of what is / isn't plagiarism.

All material copied or amended from any source (e.g. internet, books) must be referenced
correctly according to the reference style you are using.

Your work will be submitted for plagiarism checking. Any attempt to bypass our plagiarism
detection systems will be treated as a severe Assessment Offence.

Coursework Submission Requirements

• An electronic copy of your work for this coursework must be fully uploaded on 8
April 2024 using the link on the coursework Moodle page for COMP1702.
• For this coursework you must submit a single report in PDF format. In general,
any text in the document must not be an image (i.e. must not be scanned) and would
normally be generated from other documents (e.g. MS Office using "Save As ...
PDF"). An exception to this is handwritten mathematical notation, but when scanning
do ensure the file size is not excessive.
• There are limits on the file size (see the relevant course Moodle page).
• Make sure that any files you upload are virus-free and not protected by a password or
corrupted; otherwise they will be treated as null submissions.
• Your work will not be printed in colour. Please ensure that any pages with colour are
acceptable when printed in black and white.
• You must NOT submit a paper copy of this coursework.
• All coursework must be submitted as above. Under no circumstances can it be
accepted by academic staff.

The University website has details of the current Coursework Regulations, including details
of penalties for late submission, procedures for Extenuating Circumstances, and penalties for
Assessment Offences. See http://www2.gre.ac.uk/current-students/regs

Detailed Specification
You are expected to work individually and complete a report that addresses the following
tasks. You need to cite all sources you rely on using in-text citations. You can include material
discussed in the lectures or labs, but independent research might be needed. Note: references
should be in Harvard format.

• Task A (25 marks): Hive Data Warehouse Design

Please design a data warehouse in Hive with your own data (a collection with at least 50
records in at least 3 tables). Please implement 10 different queries on that data. Make sure the
data and queries show adequate variety and complexity. Please provide appropriate
explanation/discussion and adequate screenshots to prove your implementation of data,
queries, and results of queries.

• Task B (30 marks): MapReduce Programming

Suppose that there is a computer science bibliography file stored on Hadoop. Each line of this
file contains information about one paper in the following format:

authors|title|conference|years

The different fields are separated by the “|” character, and authors (the first field) are
separated by commas (“,”). You can assume that there are no duplicate records, and each
conference has a different name.

An example line is:

D Zhang, Daniel H, D Cai, J Lu|Self-Taught Hashing for Fast Similarity Search|SIGIR|2010
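
As a rough illustration of the record layout only (not a solution to any of the tasks), the sketch below splits the example line into its four fields. The names `Paper` and `parse_record` are hypothetical, and the sketch assumes that titles never contain the "|" character.

```python
from typing import List, NamedTuple


class Paper(NamedTuple):
    """One bibliography record (hypothetical helper type)."""
    authors: List[str]
    title: str
    conference: str
    year: str


def parse_record(line: str) -> Paper:
    # Fields are separated by "|"; assumes exactly four fields per line.
    authors_field, title, conference, year = line.rstrip("\n").split("|")
    # Authors are separated by commas; strip the space after each comma.
    authors = [name.strip() for name in authors_field.split(",")]
    return Paper(authors, title, conference, year)


paper = parse_record(
    "D Zhang, Daniel H, D Cai, J Lu|Self-Taught Hashing for Fast Similarity Search|SIGIR|2010"
)
print(paper.authors, paper.conference, paper.year)
```

The year is kept as a string here because it is only ever used as part of a grouping key; converting it to an integer would be an equally reasonable choice.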

Please design a MapReduce algorithm (using pseudocode or Java code) for the task
assigned to you. The algorithm is expected to be as efficient as possible.

Task b.1: Output the number of papers by each author for each year.
Task b.2: Output the average number of papers per conference for each year.
Task b.3: Output the number of authors by each conference for each year.
Task b.4: Output the average number of authors per paper for each year.
Task b.5: Output the number of papers by each conference for each year.

You need to select one task based on the last digit d of your student ID number:

d = 0 or 1: please select Task b.1
d = 2 or 3: please select Task b.2
d = 4 or 5: please select Task b.3
d = 6 or 7: please select Task b.4
d = 8 or 9: please select Task b.5
other:      please select Task b.5

For example, if your student id number is: “011340894”, the last digit is 4. Then you need to
do task b.3.

Please ignore any number after “-” in your student id. For example, if your student id
shows “011340894 – 5”, please ignore 5. The digit 4 is the last digit and you need to select
task b.3.
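
The selection rule above can be sketched as a small function. This is an informal illustration: the name `select_task` is hypothetical, and the handling of both "-" and "–" as the separator is an assumption based on the example shown. If in doubt, the tutor's answer takes precedence over this sketch.

```python
import re


def select_task(student_id: str) -> str:
    """Return the Task B sub-task for a student ID (illustrative only)."""
    # Ignore anything after a hyphen or en dash, e.g. "011340894 – 5".
    main_part = re.split(r"[-–]", student_id, maxsplit=1)[0].strip()
    last = main_part[-1] if main_part else ""
    if last and last in "0123456789":
        # Digits pair up: 0/1 -> b.1, 2/3 -> b.2, ..., 8/9 -> b.5.
        return f"b.{int(last) // 2 + 1}"
    # Anything else falls under the "other" case.
    return "b.5"


print(select_task("011340894"))
print(select_task("011340894 – 5"))
```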

In your solution, you first need to highlight your student ID number and the task you need
to solve. A wrong selection will incur 0 marks for the task. If you are not sure which task
to select, please contact your tutor.

Please provide pseudocode or Java code for your task. You should also explain how the
input is mapped into (key, value) pairs by the map stage, i.e., specify what the key is and
what the associated value is in each pair, and how the key(s) and value(s) are computed.
Then you should explain how the output (key, value) pairs of the map stage are processed by
the reduce stage to get the final answer(s). You need to discuss the efficiency of your
algorithm (how does your design make your algorithm efficient?).
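
To make the (key, value) terminology concrete, here is a deliberately unrelated toy example: a single-process Python simulation of the map, shuffle and reduce stages that counts words in paper titles. It is not one of the tasks b.1 to b.5, the second input record is made up, and the function names are hypothetical. A real Hadoop job would implement Mapper and Reducer classes instead.

```python
from collections import defaultdict


def map_stage(line):
    # Key = a lower-cased word from the title field, value = 1.
    title = line.split("|")[1]
    for word in title.lower().split():
        yield word, 1


def shuffle(pairs):
    # Group all values that share a key, as the framework does between stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_stage(key, values):
    # Sum the 1s emitted for each word.
    return key, sum(values)


lines = [
    "D Zhang, Daniel H, D Cai, J Lu|Self-Taught Hashing for Fast Similarity Search|SIGIR|2010",
    "A Author|Fast Search Methods|XYZ|2011",  # made-up record for illustration
]
pairs = [pair for line in lines for pair in map_stage(line)]
counts = dict(reduce_stage(key, values) for key, values in shuffle(pairs).items())
print(counts["fast"], counts["search"], counts["hashing"])
```

Because addition is associative and commutative, the same summing function could also serve as a combiner that pre-aggregates map output before it crosses the network. This is the kind of design point an efficiency discussion is expected to cover.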

• Task C (45 marks): Big Data Project Analysis

ABC Investment Bank Ltd wants to utilise social media and other data feeds to produce
trading strategies and portfolio rebalancing decisions that give it a competitive advantage.
The Bank needs an IT solution that will inform it whether it should buy, sell or hold a
Financial Instrument. Financial Instruments are products such as Equity shares (e.g. Tesco
shares), Government Bonds (IOUs issued by a sovereign government, e.g. UK Gilts), Corporate
Bonds (IOUs issued by a corporate) and Commodities (e.g. Oil, Gold), and these are traded in
different Currencies and Markets globally. Traditionally, trading strategies are devised
manually by Analysts in the Bank's Research team. These Analysts are very expensive to
employ and do not always have a great track record of detecting market trends in a timely
manner. The Bank's Business Head feels there is scope to improve how it makes such
important trading decisions, as they can have a significant effect on the success or failure of an
institution. Not only will this information be used by the Bank to buy, sell or hold various
financial instruments, but it will also be used in advising the Bank's Clients on managing their
portfolios to match their risk appetite. The Bank's Clients are pension funds, insurance
companies, hedge funds and high-net-worth individuals. The Financial Products covered are global in
nature, e.g. US shares such as Google and Ford, UK shares such as HSBC and Next, Indian shares
such as Reliance Industries, Australian shares such as Northern Star Resources, etc.

The sources of data used to produce these trading decisions will be social media, market data,
online news feeds, broker notes and corporate data. For example, if a company has increased
mentions on Twitter, its share price and market performance tend to increase. But this
information should not be used purely in isolation and should be used in conjunction with
market data, corporate data, etc. The Bank's internal enterprise systems containing Corporate
Data, Market Data and Broker Notes are mainly built on Relational Databases (Oracle and MS
SQL Server).

Example data would be:
• Streaming: Click Stream, Streaming Financial product prices (from Market data
vendors like Bloomberg)
• Batch: Call Detail Records
• On-Line: Customer Sentiment
• Unstructured: Text, Pictures, Video

The data volume is expected to be at the scale of 200s of petabytes. The application needs to be
highly available, scalable, and accessible from the worldwide offices (London, New York,
Tokyo, Hong Kong, Sydney) of ABC Investment Bank. The trading decisions produced are
confidential as they give the Bank a significant competitive advantage.

Task C.1 [marks 15] The Business Head has a limited understanding of IT, is aware of
data warehouses, and has suggested that you build only a data warehouse. Explain why you will
be building a Data Lake for this solution. Discuss the approach required for implementing Data Lakes.

Task C.2 [marks 15] The business wants near real time performance, so that when certain financial
products are being discussed on social media, the company can act in a timely manner. It
has been suggested that the parallel distributed processing on a cluster should use MapReduce to
meet this requirement. Provide a detailed assessment of whether MapReduce is optimal for
this requirement. If not, what would be the best approach?

Task C.3 [marks 15] Devise a detailed hosting strategy for this Big Data project and explain how
it will meet the scalability and high availability requirements of this global business.

• Appendix:

Marking Scheme

Task A (25%)
Task A (25) Hive Data Warehouse Design      Marks   achieved well   partially achieved   poorly/not achieved
Correctness of design                       10
Clarity of description                      5
Variety/complexity                          10

Task B (30%)
Task B (30) MapReduce Programming           Marks   achieved well   partially achieved   poorly/not achieved
Algorithm design                            15
Algorithm description                       15

Task C (45%)
Task C.1 (15) Suggestion of storage tool
for big data analysis                       Marks   achieved well   partially achieved   poorly/not achieved
Correctness of choice                       7
Clarity of explanation                      5
Citations/references                        3

Task C.2 (15) Analysis of analytical
data store                                  Marks   achieved well   partially achieved   poorly/not achieved
Correctness of choice                       7
Clarity of explanation                      5
Citations/references                        3

Task C.3 (15) Suggestions of tools for real
time or near real time prediction tasks     Marks   achieved well   partially achieved   poorly/not achieved
Correctness of suggestion                   7
Clarity of explanation                      5
Citations/references                        3
