Recently I’ve been thinking a lot about the gray area that exists with web scraping. Web scraping involves programmatically downloading a webpage and parsing the page’s content in order to extract information of interest.
For example, let’s say you are in the market to buy some new headphones and want to do some research. In order to compare the features of different headphones, you might go to Amazon.com and run a search for headphones. Let’s pretend you aren’t interested in filtering out the results by reviews or price — you are interested in purely comparing the features of all headphones so you can make a decision as to which pair you want to purchase.
One option you have is to go through Amazon’s search results for headphones, open up each resulting link, and copy and paste the name, description, and price of each headphone into a spreadsheet. Since Amazon organizes the name, description, and price information in a standard way on their pages, you get pretty fast at copying this information from your browser to the spreadsheet, maybe 10 seconds per page. Eventually you copy enough information from Amazon, do your comparison, and purchase the headphones that best suit your needs.
Nothing wrong with performing those navigate, copy, and paste actions for personal use, right? I mean, sure, it’s an inefficient way of doing research to determine which pair of headphones to buy, but there’s nothing illegal about it.
But manually scraping all of this information takes a lot of time. Luckily, you and your four coworkers are all interested in buying headphones, so you decide to split up the manual scraping work. Assuming all of your coworkers are just as fast as you at copying and pasting information from Amazon into a spreadsheet, the five of you can accomplish the same amount of scraping in 20% of the time. Personally, I still think this is clearly on the side of legal.
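To make the arithmetic concrete, here is a rough sketch of the time involved. The 10 seconds per page comes from the scenario above; the 300-page catalog size and the function name are made up for illustration, and the sketch assumes the pages divide evenly among workers who all copy at the same speed.

```python
def manual_scrape_seconds(pages, workers, seconds_per_page=10):
    """Wall-clock seconds to copy `pages` pages when the work is split
    evenly among `workers` people working at the same speed."""
    return pages * seconds_per_page / workers


# Hypothetical catalog size, chosen only for this example.
PAGES = 300

alone = manual_scrape_seconds(PAGES, workers=1)
with_coworkers = manual_scrape_seconds(PAGES, workers=5)

# Fraction of the solo time needed once five people share the work.
fraction = with_coworkers / alone
```

With five people sharing the work, `fraction` comes out to 0.2, which is the 20% figure mentioned above.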
What happens if you have 1000 friends though? Let’s say you are active on an internet forum for audiophiles and all of the forum’s users are interested in comparing and finding the best pair of headphones on Amazon. If you split up the work among 1000 friends, you’ll get the scraping task done roughly 1000 times faster than if you were doing it yourself, leaving you with more time to listen to and enjoy the headphones instead of doing research. Is this legal to do? I think this is where we start to enter a gray area. After all, each person is still performing the same action (navigate, copy, and paste), but now you are coordinating it en masse, so it might disrupt the website’s performance.
Okay, let’s say you don’t have a thousand friends and you don’t want to manually scrape all of those headphone pages on Amazon. You are a computer programmer, however, and realize that you can write a scraping script that does the same thing: it downloads the name, description, and price information of each headphone in the search results on Amazon and saves it into a spreadsheet. You write your program so that it basically mimics your on-screen movements, so it still takes 10 seconds to scrape each page, but at least now you can have your program run 24 hours a day. You don’t get your final spreadsheet of results any faster since your requests are still taking 10 seconds each, but is this legal? Assuming the site’s terms of service don’t specify anything about scrapers, and you’ve dutifully checked the robots.txt file to see if scraping is allowed, I think you’re OK. Basically you’re doing the same thing as in the first scenario, but you are now just automating the task.
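For what it’s worth, here is a minimal sketch of what that "dutiful" scraper might look like in Python, assuming you check robots.txt with the standard library’s urllib.robotparser and wait 10 seconds between pages. The robots.txt content, the example.com URLs, the user agent name, and the fetch function are all hypothetical placeholders, not anything Amazon actually publishes, and a robots.txt check says nothing about what the terms of service allow.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_scrape(urls, parser, fetch, user_agent="headphone-research-bot",
                  delay_seconds=10.0, sleep=time.sleep):
    """Fetch each URL that robots.txt allows, pausing between requests.

    `fetch` is any function that takes a URL and returns the page content;
    it is passed in so the sketch stays independent of a specific HTTP library.
    """
    results = {}
    first_request = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # robots.txt disallows this path, so skip it
        if not first_request:
            sleep(delay_seconds)  # mimic the 10-second human pace
        first_request = False
        results[url] = fetch(url)
    return results
```

Passing fetch and sleep in as arguments is just a design choice for the sketch: it lets you try the logic with fake functions before pointing it at a real site.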
A computer is a powerful machine though, and it can run more than one program at a time. In fact, you could probably run 1000 instances of your program with little difficulty, especially if you divide up the work across multiple computers (everyone has multiple personal computers at home, right?). Is this ok to do?
What if you find a better way to write your program so that it just takes 1 second to scrape each headphone page — after all computers should be able to do this type of work faster than humans. Is that ok? What if you take that 1 second/page scraping program and run a thousand instances of it? Is that ok?
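To put rough numbers on these scenarios, here is a back-of-the-envelope sketch of the request rate each one would send to the site. It assumes every instance fetches one page at a time, back to back; the function and scenario names are mine, and the point at which a rate actually becomes a problem for a given site is not something this calculation can tell you.

```python
def requests_per_second(instances, seconds_per_page):
    """Aggregate request rate when each instance fetches pages back to back."""
    return instances / seconds_per_page


# (instances, seconds per page) for the scenarios described so far.
scenarios = {
    "one program, 10 s/page": (1, 10),
    "1000 programs, 10 s/page": (1000, 10),
    "one program, 1 s/page": (1, 1),
    "1000 programs, 1 s/page": (1000, 1),
}

# Map each scenario label to its combined request rate.
rates = {
    label: requests_per_second(instances, seconds)
    for label, (instances, seconds) in scenarios.items()
}
```

The per-request action is identical in every row; the only thing that changes is how many of those requests land on the site each second.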
What if you write a program that scrapes a page in 0.1 seconds, run thousands of instances of it, and accidentally degrade service to Amazon’s other customers? Essentially, what your scraping program has accomplished is a denial of service (DoS) attack. If you were running multiple instances of your app on all of your personal computers (or even better, on a cloud computing platform where you can get hundreds or thousands of virtual machines to run your code for pennies) and you manage to take down Amazon, you have essentially performed a distributed denial of service (DDoS) attack. In these instances, what you are doing is clearly illegal: you have written a script that, although its primary goal is to scrape headphone information, has accidentally taken down Amazon. Time to get a lawyer because you are probably going to be sued.
So obviously that last example is extreme, and almost everyone would agree it is illegal for good reason: you are negatively affecting Amazon’s ability to do business with other customers. What about all of the other scenarios though? Where is the line drawn if you want to collect all of this information legally? Does scraping, either manual or programmed, only become a problem when you start degrading the website’s service to other users? Or is there some other way to identify what is and isn’t an acceptable way to scrape?
I don’t know the answers to these questions. Vendors of screen scraping services try to anonymize their scraping attempts as if they are doing something bad — and obviously they are doing something bad if they are taking down servers with their high volume of requests. What makes writing a program to replace what you can legally do via a manual copy and paste process wrong?
It feels like web scraping, along with other technologies that don’t have clear legal precedents defined yet, includes a lot of gray area that programmers have to consider and operate in.