What Is Data Mining, aka Web Scraping?
We're sure you've heard the claim that "data is the new oil".
Web scraping is mining data from the World Wide Web for a specific purpose. In its simplest form, it is copying and pasting a specific set of information into a local store such as a Microsoft Excel or Google Sheets file for analysis or some other use.
Some of the most widely used examples include aggregator websites which provide price comparisons for online goods.
There are also sites like archive.org, which scrape publicly available information and preserve it even after the original site is deleted, and shadow libraries, which make paywalled books and articles publicly available for free.
But web scraping can also be used in fascinating ways, with high social impact.
Most recently, a group of Lithuanian activists created a website that allows Russian speakers from around the world to call people living in Russia with limited access to news about the war in Ukraine.
The idea was to form personal human connections through one-on-one conversations over the phone, and to let people know about the atrocities their government was committing in Ukraine.
The website, Call Russia, was made possible by scraping the publicly available phone number data from the web, and repurposing it.
How Does Web Scraping Work?
Web pages contain a lot of useful information in text form (built with HTML or XHTML). Usually, a bot called a web crawler "scrapes" (collects) the data from a site.
Some web pages have built-in mechanisms to prevent web crawlers from scraping data. In response, some web scraping systems have evolved to simulate human browsing using techniques like DOM parsing, computer vision and even natural language processing.
Here is a 5-minute video if you are interested in learning more.
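To make the idea concrete, here is a minimal sketch of what a basic scraper does: fetch a page's HTML, parse it, and pull out the pieces it cares about. It assumes Python with the requests and beautifulsoup4 libraries, and the URL and CSS selector are placeholders rather than any real target.

```python
# Minimal scraping sketch (Python, requests + beautifulsoup4).
# The URL and the ".price" selector are invented placeholders; a real
# scraper is written against the markup of the specific site it targets.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"  # hypothetical product-listing page

response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early if the page could not be fetched

soup = BeautifulSoup(response.text, "html.parser")

# Collect and print the text of every element marked up as a price.
for tag in soup.select(".price"):
    print(tag.get_text(strip=True))
```

A price comparison site would run something like this against many retailers on a schedule and store the results in a database rather than printing them.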
A Super Short History of Scraping
The first-ever web crawler was the World Wide Web Wanderer (its index was known as the Wandex), and it was programmed by an MIT student. This was before the search engine era. The crawler's main purpose was to measure the size of the internet, and it operated from 1993 to 1995.
The first API (Application Programming Interface) crawler came five years later. Today, many major websites, like Twitter, offer web APIs for people to access their public databases.
But why would we want to scrape or mine data in the first place, and why would another party try to prevent us from doing it?
Web scraping applications range from highly successful commercial ideas like price comparison tools to projects concerned with social justice and the ethics of big data.
Web scraping makes us face some important questions.
Should all information be a public good - and equally accessible to all? What about the issue of copyright?
On the commercial side, building a price comparison tool might lead to some businesses losing customers to the competition. Sometimes major corporations like airlines sue scrapers and data miners for copyright infringement on these grounds.
Even though scrapers are technically collecting and displaying data that is already publicly available, the suits tend to argue for copyright infringement. There is no standard outcome for these kinds of lawsuits; it usually depends on factors such as the extent of the information collected or the losses incurred.
Is Web Scraping Legal or Not?
The legality of web scraping is still not fully settled. Terms of use on a specific site might "ban" it, but that is not necessarily enforceable by law in all cases. For the mining of data to be unlawful, it would have to violate an already existing law.
In the United States, the most common ground is copyright infringement. Other jurisdictions differ: in Denmark, for example, the courts found web scraping and crawling to be legal under Danish law.
In France, the French Data Protection Authority ruled that even when publicly available, personal data still cannot be collected and/or repurposed without the knowledge of the person to whom it belongs.
Freedom of Information
When it comes to nonprofit organizations and open access advocates, things get even more interesting.
The Internet Archive (archive.org) is a famous web scraping project. It is a non-profit organization that archives (sometimes deleted) web pages, digital collections, books, PDFs, and videos for researchers, students, and anyone else who takes an interest.
It gets caught in legal grey areas every now and then, when individuals or even governments take legal action to remove specific pieces of content.
When Advocating for Universal Open Access to Information Gets You in Trouble
There are many web scraping projects that advocate for universal open access to information; one well-known example involves PACER.
PACER is the website that houses legal documents from US courts. It stands for Public Access to Court Electronic Records, but access is not free except at a select number of public libraries.
The late Aaron Swartz, open-access advocate and early internet prodigy, used a web scraping program to download millions of PACER documents from one of these public libraries and got into a lot of trouble with the US government and the FBI.
Corporations and governments might be incentivized to outlaw web scraping. However, it is an important tool journalists and researchers use to uncover injustices.
A List of Journalistic Investigations That Used Web Scraping
Collecting and analysing data can be incredibly helpful for all types of research and academic study, leading to a new movement in data science. Journalists also now rely on careful data analysis to reveal new things about our societies and communities.
Reveal ran a project that exposed American police officers who were members of extremist groups on Facebook, posting and engaging with racist, xenophobic, and Islamophobic content.
It was done by scraping data from these extremist groups and from groups of police officers on Facebook, then cross-referencing the two to find overlapping members - and there were many.
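The cross-referencing step itself is conceptually simple. As a rough, purely illustrative sketch in Python (the member names are invented placeholders, not data from the investigation), finding overlapping members amounts to intersecting two scraped membership lists:

```python
# Illustrative only: cross-referencing two scraped membership lists.
# The names below are invented placeholders, not real data.
extremist_group_members = {"alice", "bob", "carol", "dave"}
police_group_members = {"carol", "dave", "erin"}

# Members who appear in both scraped lists.
overlap = extremist_group_members & police_group_members
print(sorted(overlap))  # ['carol', 'dave']
```

The hard parts of such an investigation are the scraping itself and verifying identities across groups, not the final comparison.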
Reuters used similar data analysis techniques to uncover a shocking story about sites where Americans "advertise" children they adopted from abroad, offering them to strangers when they no longer want to deal with them.
Using scrapers, The Verge and The Trace conducted an investigation revealing online gun sales made without a license or background check.
USA Today found that between 2010 and 2018, more than 10,000 bills introduced in statehouses nationwide were almost entirely copied from bills written by special interests. This investigation was also made possible by web scraping.
The Atlantic runs a COVID tracking project which not only collects global data on COVID-19 on a daily basis but also documents the racial disparities of the pandemic.
These are just some of the examples of the ways web scraping can be used for both commercial and social justice purposes. There are many other use cases out there and many more waiting to be realized.
Extensive data analysis and open data science can unlock so many new truths, but are we crossing the line with the kind of data we collect and the methods we use to collect it?
What are the ethics and schools of thought around data collection?
How do we balance privacy with open access?
While it is important that we continue the conversation about open access to documents that are relevant to the public, we have to consider privacy issues as well.
Today many people and organizations agree that collecting and using someone’s personal data without their consent is unethical.
However, what about public data such as news articles that are censored in some countries? Or health-related statistics and data that can be used for public health policy suggestions?
In the US, policymakers used an algorithm to identify high-risk patients for a preventative care program, so that these patients would not end up in the ER.
Researchers later found that black patients placed in the same risk category were in fact sicker than white patients. The algorithm used healthcare costs as a proxy for need, but black patients incur lower costs than white patients with the same illnesses for a variety of reasons, including lack of access to high-quality insurance, so their needs were systematically underestimated.
In another instance, automated hiring tools used by companies like Amazon were found to be favouring men over women and white people over people of colour.
This was because the data analysis tools reproduced sexist and racist patterns in the data they learned from. When the tools searched the web, they found that executive positions were filled mostly by white men, so the machines learned that this was the type of candidate to look for.
Scraping public data for the public good does not always lead to positive results for society. Automation and machine learning need thoughtful intervention. As builders of new technological and social systems, we need to ensure all of our data analysis tools are ethically designed and don't perpetuate our historical systems of injustice and discrimination.
Scraping is highly relevant to the work we do at Mysterium. We care about building an accessible internet where freedom of information and open data science become foundational pillars of the new web.
We are collaborating with developers to build the open web. To learn more about how Mysterium empowers builders of purpose-driven projects, check out our site at NectoLabs.io