Web Scraping Fact Sheet
What web data aggregation is, and what it does
What is “web scraping”?
In essence, web data aggregation (“web scraping”) is simply going to a web page, extracting data from it, and sorting that data to make it understandable. When done at scale, the results can be powerful. Search engines are a familiar example: the results you see when you search for something on Google come from crawling and indexing web pages. That alone shows there are some very compelling positive use cases.
How does it work?
If you’ve ever copied and pasted something from a website, you’ve operated as a web scraper, but of course we’re talking about scale. Modern web data aggregation is done with bots and database software that gather large amounts of data, parse that data, and present it in a way that’s understandable to humans. You can find open-source data aggregation tools, as well as proprietary tools and companies that specialize in web scraping.
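The gather, parse, and present steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular tool works: the sample page, the class names "name" and "price", and the helper names `ListingParser` and `scrape` are all made up for the example, and the page is an inline string so that nothing is downloaded.

```python
from html.parser import HTMLParser

# Hypothetical sample page. A real scraper would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li><span class="name">Hotel Alpha</span><span class="price">120</span></li>
    <li><span class="name">Hotel Beta</span><span class="price">95</span></li>
    <li><span class="name">Hotel Gamma</span><span class="price">140</span></li>
  </ul>
</body></html>
"""


class ListingParser(HTMLParser):
    """Collects (name, price) pairs from the sample page's <span> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished (name, price) records
        self._field = None    # field held by the <span> currently open
        self._current = {}    # record being assembled for the current <li>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            css_class = dict(attrs).get("class")
            if css_class in ("name", "price"):
                self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a span we care about.
        if self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            # Keep the record only if both fields were found.
            if "name" in self._current and "price" in self._current:
                name = self._current["name"].strip()
                price = int(self._current["price"].strip())
                self.rows.append((name, price))
            self._current = {}


def scrape(html):
    """Parse the page and return its listings sorted by price, cheapest first."""
    parser = ListingParser()
    parser.feed(html)
    return sorted(parser.rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, price in scrape(SAMPLE_HTML):
        print(f"{name}: {price}")
```

The same three steps apply at scale; the difference is that a bot repeats them across many pages and a database stores the results.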
Wait, What’s a Bot?
A bot is an autonomous program that performs (often repetitive) tasks on the Internet much faster than a human could. Whether a bot is good or bad depends on how it’s configured and how it’s deployed. There are a lot of bots out there: by some estimates, over 40% of Internet traffic is made up of bot activity.
Examples of good bot behavior: Web crawlers, customer service chatbots
Examples of bad bot behavior: Spambots, DDoS attacks, click fraud automation
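One common way a well-behaved crawler stays on the “good bot” side is to read a site’s robots.txt file and honor it before requesting pages. The sketch below uses Python’s standard `urllib.robotparser` module; the robots.txt contents, the bot name "ExampleBot", the example.com URLs, and the helper `build_rules` are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A real crawler would download the file
# from the site it intends to visit.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Load robots.txt text into a parser that can answer permission checks."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


if __name__ == "__main__":
    rules = build_rules(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/public/index.html",
                "https://example.com/private/report.html"):
        allowed = rules.can_fetch("ExampleBot", url)
        print(url, "allowed" if allowed else "disallowed")
    # Seconds the site asks crawlers to wait between requests, if stated.
    print("crawl delay:", rules.crawl_delay("ExampleBot"))
```

Checking permissions and pacing requests this way is a convention rather than a technical barrier, which is part of why bot behavior depends so much on configuration.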
What can be done with aggregated web data?
- Search engine functionality
- Gauging customer sentiment
- E-commerce competitor analysis
- Hotel and flight price comparison
- Market research
- Academic research
- Real estate listings
- Weather data monitoring
- Search engine optimization (SEO)
- Website change detection
- Online reputation management
- Data visualization
Is web scraping legal?
Yes. In fact, the modern Internet wouldn’t work very well without it! Different jurisdictions have laws governing which types of data can be aggregated, how it can be aggregated, and how that data can be used. The European Union’s General Data Protection Regulation (GDPR) is a widely known example, but there are many more, and new laws are being written as we speak.
Web Scraping Does Not Equal Hacking
Web scraping involves extracting data from a website and storing it in a structured format. Hacking involves unauthorized access or manipulation of a computer or network. If web scraping is the equivalent of street photography, hacking would be like setting up a camera in someone’s house.
Why does web data aggregation often have a bad reputation?
Web data aggregation can be done maliciously. For example, aggressive, high-volume automated requests can overload a website in much the same way a DDoS attack does. Also, the collection of personal data at scale often runs afoul of both local laws and the bounds of accepted business behavior. However, this bad reputation comes from a few bad actors; most web scrapers are conducting ethical web data aggregation for one of the reasons listed above.
What is the good that comes out of it?
When collected and sifted in certain ways, a dizzying multitude of anecdotes becomes actionable data. As companies and individuals, we can accomplish much more when we’re able to make sense of the vast amounts of disparate bits of information out there. However, like any technique, web scraping can be misused. That’s why we’re forming this group: to prevent bad actors from sullying an industry that can accomplish a whole lot of good.
The Public Deserves Digital Peace of Mind
The Ethical Web Data Collection Initiative (EWDCI) is an international, industry-led consortium of web data collectors focused on strengthening public trust, promoting ethical guidelines, and helping businesses make informed data aggregation choices.
The EWDCI is dedicated to defining positive and beneficial uses of web data collection and aggregation at scale.
What EWDCI Does:
- Advocate for responsible web data collection and use of personal data
- Educate and guide the industry on ethical resources and tools for web data collection
- Foster consumer confidence in data collection through transparency and accountability
- Enable commercial innovation
- Promote online safety
Our goal is to prevent the passage of harmful legislation worldwide and, potentially, to seek inclusion in federal laws. In addition, we are building a framework to establish an open, participatory process for developing legal and ethical principles for web scraping providers.