Web scraping, also known as data scraping, has rapidly become a quick and efficient way to extract information from web pages and databases. We have brought together everything you need to know about web scraping to keep you informed and protected.
Recently, a massive amount of content was harvested from LinkedIn and Facebook by unknown actors, who later put this data up for sale on the Dark Web.
Web scraping is the process of mining large amounts of information from websites in an automated way using computer programs. Most of the data collected is raw HTML, which is then processed into structured data and stored in a database for different purposes.
There are several options for web scraping, including online services, purpose-built APIs, and custom-developed code. Several tools are also readily available on GitHub, including Scrapy and AutoScraper.
Data scraping also requires a substantial amount of computing resources. Legitimate operators maintain dedicated scraping server farms for this purpose.
Threat actors, on the other hand, prefer to use zombies for web scraping. Zombies are compromised systems that are part of botnets.
We have earlier published a detailed analysis of bot networks. If you would like to know more about botnets, please refer to this blog post.
Difference between Web Scraping and Screen Scraping
Screen scraping is the automated, programmatic use of an application or website by impersonating a web browser to simulate user actions. Programmers widely use screen scraping to access a user's banking data when an open API is not readily available.
Web scraping is an entirely different technique used for several use cases. I have covered some of these use cases in the next section.
What is the use of Web Scraping?
An individual or organization can use web scraping for genuine or malicious purposes. Following are a few examples of web scraping uses, both legitimate and illicit.
1. Indexing of Web content
Search engines like Google and Bing use bots for web scraping. The collected data is categorized, indexed, and stored for analysis. Google calls its bots Googlebot Desktop and Googlebot Smartphone. These bots identify themselves in the HTTP User-Agent header.
Legitimate bots also abide by the robots exclusion standard, aka robots.txt.
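As a rough illustration of what "abiding by robots.txt" means in practice, a well-behaved crawler can check the file before fetching a page using Python's standard library. The domain, path, and bot name below are placeholders, not from any real site.

```python
# Minimal sketch: check robots.txt before crawling (example.com and the bot name are placeholders).
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # download and parse the robots.txt file

url = "https://example.com/products/page1"
if robots.can_fetch("MyCrawlerBot", url):
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)
```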
2. Data collection for marketing research
Many organizations use web scraping to gather information about target markets and customers for market research. They can then use it to determine the feasibility of new ventures, identify trends, and so on.
3. Scraping of contact information
Scrapers harvest contact information such as email IDs, phone numbers, and social media profiles on LinkedIn, Facebook, Twitter, Instagram, and so on.
The scraper can then use this information for marketing purposes.
4. Monitoring of news and updates
Web scraping is used for news monitoring, primarily to identify trends and track news regarding specific people or organizations.
5. Scraping of pricing information
This use case may fall under the illicit use of web scraping. It is a popular method of keeping track of competitors' pricing. The information collected can be used to undercut prices and boost sales. Resellers of popular products in price-sensitive industries are at high risk from this method.
For example, a travel agency can track its competitors' flight booking charges and then undercut their prices. The scraped site, in turn, becomes the victim and loses sales.
6. Content Theft
Content theft is another example of illicit use. Cybercriminals and content thieves can abuse stolen content in multiple ways.
It’s common to create a replica of a legitimate website using stolen content and dupe end customers through online marketplaces. Listings from online directory businesses can also be scraped and sold online or on the Dark Web.
7. Social web scraping
Social media networks like LinkedIn, Facebook, etc., are at a higher risk of information scraping. Cybercriminals can use the scraped information to identify high-profile targets and craft sophisticated spear phishing attacks like BEC, whaling, or clone phishing.
If you would like to learn more about Spear Phishing attacks, please refer to our earlier blog post.
How does Web Scraping work?
The scraper configures the scraping tool with the target website's URL. More advanced bots can also accept search queries, run them on leading search engines like Google, and automatically pick result URLs from which to extract information.
The scraper copies the HTML code of these web pages and harvests the intended data like user information, pricing, or customer reviews.
The scraper then converts the data into a readable format, indexes it, and stores it in a database or spreadsheet for further analysis.
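To make that fetch-parse-store flow concrete, here is a minimal sketch using the requests and BeautifulSoup libraries. The URL and the CSS class names are hypothetical and would have to match the actual structure of the scraped page.

```python
# Minimal sketch of fetch -> parse -> store; the URL and CSS selectors are hypothetical.
import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select(".product"):               # hypothetical class names
    name = item.select_one(".name").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    rows.append({"name": name, "price": price})

# Store the structured data in a spreadsheet-friendly CSV file
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```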
Web scrapers often need to bypass website restrictions, such as limits on the number of sessions. For this reason, scrapers support proxies, since many websites block scrapers to keep the site from slowing down.
A proxy acts as an intermediary between your device and the target web server and provides IP addresses from multiple locations, which also helps overcome the restrictions of geo-blocked websites.
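As a simple illustration, the requests library accepts a proxies mapping so that traffic is routed through an intermediary. The proxy address below is a documentation placeholder; real scrapers typically rotate through a pool of such addresses.

```python
# Sketch: send a request through a proxy (the address below is a placeholder).
import requests

proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)
```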
Is Web Scraping Legal?
Web scraping itself is generally legal and is a standard practice used by search engines like Google, Yahoo, and Bing. However, its legality ultimately boils down to what you do with the data that you acquire.
What are the common types of Web Scrapers?
We can divide Web Scrapers into the following types:
1. Web Scraping software
Several commercial and free web scraping tools are available in the market, and we have covered some of them in the next section. They work like any other software: just install, configure, and they are ready to go. However, they are primarily suitable for smaller web scraping requirements.
2. Browser extensions
These are probably the easiest to configure and run and are best suited for scraping particular websites or web pages in smaller projects.
3. Custom coded scrapers
If you have advanced programming knowledge (specifically in Python), you can build a custom solution yourself with some help from the open-source repositories available on GitHub. There may, however, be limitations, and maintaining such a solution requires significant effort.
4. Cloud-based Web Scraping services
There are professional services available for large-scale web scraping and data analysis projects. Since web scraping is a resource-hungry process, it makes a lot of sense for individuals and organizations to leverage these services for large projects.
5. Web Scraping bots
Cybercriminals and threat actors also use bots to make large-scale web scraping commercially viable. Not to mention that they use this data mainly for illegal activities.
What are the commonly used tools for Web Scraping?
There are a ton of tools available for web scraping, including open-source tools. Some of the popular tools are as follows:
1. Octoparse
Octoparse is easy to use even for non-programmers and is popular for eCommerce data scraping. Scrapers use it for large-scale content scraping. It can store the information in structured files like CSV or JSON for download. The software offers free and paid plans.
2. Scrapestack
Scrapestack is a web scraping REST API that works in real time. The solution is popular and is backed by Apilayer. The Scrapestack API allows scraping web pages within milliseconds by handling millions of proxy IPs, Captchas, and browsers.
3. Import.io
Import.io helps you build datasets by importing data from specific web pages and exporting it in a structured format. It also allows you to integrate the data into applications using webhooks and APIs.
4. Scrapingbee
This tool is a web scraping API that provides headless browsers and proxy management. It can rotate proxies for each request and execute JavaScript on the pages, which helps you retrieve the HTML without getting blocked. They also offer a dedicated API for Google search scraping.
5. Scrapy
Scrapy is an open-source web scraping framework that Python developers use to build scalable web crawlers. It handles much of the heavy lifting of web crawling, such as proxy middleware, request queuing, and more.
The above is not an exhaustive list; several other web scraping tools are available in the market.
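To give a feel for how Scrapy is typically used, here is a minimal spider sketch; the domain, URLs, and CSS selectors are hypothetical and would need to match the target site.

```python
# Minimal Scrapy spider sketch; the domain and selectors are hypothetical.
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Yield one structured item per product element on the page
        for item in response.css(".product"):
            yield {
                "name": item.css(".name::text").get(),
                "price": item.css(".price::text").get(),
            }
        # Follow pagination links, if present
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

A spider like this can be run with `scrapy runspider products_spider.py -o products.json`, which writes the yielded items to a structured output file.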
Best Practices For Web Scraping
If you are planning ethical web scraping for a legitimate purpose, it's a good idea to follow these best practices to avoid getting blocked by the destination website:
1. Always respect the robots exclusion standard (robots.txt)
robots.txt defines specific rules for good behavior, such as how frequently you can scrape, which pages allow scraping, and which pages you should not scrape. It usually sits in the root directory of a website, e.g., http://website.com/robots.txt.
2. Slow the Crawler down
A bot can crawl a website pretty fast, but fast crawling hits the server hard. You should slow the bot down and treat websites nicely. You can do that by introducing a delay of 10-20 seconds between requests, as in the sketch below.
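A minimal sketch of such throttling is shown here; the URLs are placeholders and the delay values simply mirror the 10-20 second suggestion above.

```python
# Sketch: throttle requests with a randomized delay between fetches.
import random
import time
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

for url in urls:
    response = requests.get(url, timeout=10)
    # ... parse response.text here ...
    time.sleep(random.uniform(10, 20))  # wait 10-20 seconds before the next request
```

Scrapy users can achieve a similar effect with the DOWNLOAD_DELAY setting instead of hand-written sleeps.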
3. Scrape in non-peak hours
It is best to scrape the website during off-peak hours. This is also a moral responsibility: it helps reduce the impact on real users, if any, and can significantly improve scraping speed.
4. Use a Headless Browser
A headless browser does not have a GUI, which makes it considerably faster than regular browsers. Also, it need not render the website fully; it can just load the HTML portion, saving time and resources. Tools such as Puppeteer, Playwright, and Selenium can drive browsers in headless mode.
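As an illustrative sketch, Playwright's Python bindings can launch Chromium without a GUI and return the rendered HTML; the URL is a placeholder.

```python
# Sketch: fetch a page with a headless browser using Playwright (the URL is a placeholder).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no GUI, faster and lighter
    page = browser.new_page()
    page.goto("https://example.com/products")
    html = page.content()  # rendered HTML, including JavaScript-generated content
    browser.close()

print(len(html))
```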
5. Be careful of the honey pot traps
Some websites contain hidden links that a human will never click, but a bot clicking on every link will. Programmers design these honeypots to catch web scrapers, and once you are identified, the webmaster can permanently block you from the site.
6. Do not violate copyright
Always consider whether the data you are planning to scrape is copyrighted. Common types of material that may be copyrighted include articles, pictures, videos, databases, etc. Also, you should be well aware that much of the data on the internet is copyrighted work.
Different countries have different policies and exceptions to copyright law. Always make sure that an exception applies within the jurisdiction in which you’re operating.
7. Do not violate GDPR
Respect regional compliance requirements and be extremely careful. Do not scrape any personal data that can identify an individual. Personal data may include name, email, address, phone number, etc.
Please note that this is not legal advice, and you should consult a legal representative to avoid getting into any trouble with law enforcement agencies.
How to Block Web Scraping?
While there is no foolproof way of completely blocking scraping, following the steps below will reduce it significantly:
1. Monitor Traffic logs and Patterns
There are several indicators of web scraping. If you monitor incoming traffic logs and patterns, you can limit or block the offending access.
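As a rough illustration, a short script can count requests per client IP in a web server access log and flag unusually chatty sources. The log path, log format (client IP as the first field), and threshold below are assumptions.

```python
# Sketch: flag IPs with an unusually high request count in an access log.
# The log path, format (IP as the first field), and threshold are assumptions.
from collections import Counter

THRESHOLD = 1000  # requests per log window considered suspicious

counts = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        ip = line.split()[0]  # common log formats start with the client IP
        counts[ip] += 1

for ip, count in counts.most_common(10):
    if count > THRESHOLD:
        print(f"Possible scraper: {ip} made {count} requests")
```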
2. Detect unusual activities
Detect unusual activity and limit the number of actions in a given time window, e.g., only allow a limited number of searches per second from a single user or IP address.
Also, use a Captcha if actions are performed faster than a human could manage. The Captcha will slow scrapers down and make them ineffective.
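A minimal sketch of per-IP rate limiting is shown below, using an in-memory sliding window; the limit is an assumption, and production setups usually rely on a WAF, a reverse proxy, or a shared store such as Redis instead.

```python
# Sketch: naive in-memory per-IP rate limiting (sliding one-second window).
import time
from collections import defaultdict, deque

MAX_REQUESTS_PER_SECOND = 5  # assumed limit
requests_by_ip = defaultdict(deque)

def allow_request(ip: str) -> bool:
    now = time.monotonic()
    window = requests_by_ip[ip]
    # Drop timestamps older than one second
    while window and now - window[0] > 1.0:
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_SECOND:
        return False  # too many requests; serve a Captcha or a 429 instead
    window.append(now)
    return True
```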
3. Enforce registration and login
If possible, you should enforce registration and login, which will act as a good deterrent for Data Scrapers but, unfortunately, also for real users. You can use Passwordless Authentication to reduce the impact on real users. To know more about passwordless authentication, please refer to our earlier post.
4. Block access or enforce Captcha
It will help if you block requests, or enforce a Captcha, for any traffic coming from the IP addresses of scraping services and cloud hosting providers.
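One way to implement this check is to compare the client IP against published IP ranges of cloud and scraping providers. The CIDR ranges below are documentation placeholders; real lists are published by the providers themselves.

```python
# Sketch: check whether a client IP falls inside known datacenter ranges.
# The CIDR ranges below are placeholders; real lists come from cloud providers.
import ipaddress

DATACENTER_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_datacenter_ip(client_ip: str) -> bool:
    ip = ipaddress.ip_address(client_ip)
    return any(ip in network for network in DATACENTER_RANGES)

if is_datacenter_ip("203.0.113.25"):
    print("Serve a Captcha or block the request")
```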
5. Reveal as little as possible
Do not give away much information when you block a suspicious scraping attempt. Limited feedback will leave scrapers clueless about how to fix their scraper.
6. Limit access to your dataset
You should limit access to your dataset by enforcing additional restrictions, for example by capping how many records a single account or IP address can retrieve.
7. Insert honey pots
You can insert honeypots to catch bad bots: create extra elements or links in the page source, but use CSS so that, once rendered in the browser, only the legitimate elements are visible to the user.
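A minimal sketch of the idea using Flask: the page includes a link hidden with CSS that real users never see, and any client that requests it gets flagged. Flask, the route names, and the styling here are illustrative assumptions, not a prescribed implementation.

```python
# Sketch: a CSS-hidden honeypot link; clients that follow it are flagged as likely bots.
# Flask, the route names, and the inline styling are illustrative assumptions.
from flask import Flask, request

app = Flask(__name__)
flagged_ips = set()

@app.route("/")
def index():
    # The trap link is hidden with CSS, so browsers never show it to real users.
    return '<a href="/honeypot-trap" style="display:none">do not follow</a><p>Welcome!</p>'

@app.route("/honeypot-trap")
def trap():
    flagged_ips.add(request.remote_addr)  # remember this client as a likely bot
    return "", 404
```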
8. Deploy Anti-Bot Protection
You should consider employing bot protection capabilities with behavioral analysis to identify bad bots and prevent web scraping. Several advanced services are offered by leading providers, including Imperva, Radware, Cloudflare, and Akamai.
A more comprehensive resource on protecting against web scrapers is available on GitHub as "A guide to preventing Web Scraping."
At Securityfocal, we continuously strive to improve our content and make it helpful for our consumers. Do let us know if you find this article useful, and feel free to share any feedback.