Scraper to scrape websites & mobile apps (with no web interface)
Challenges
Common technical obstacles that a scraper must be able to handle are:
Captcha
Captcha presents a formidable obstacle for scrapers. Many websites now respond to suspicious requests with a captcha, which an automated client cannot easily get past.
HTTP fingerprinting
HTTP fingerprinting lets a server identify a client by the characteristics of its requests (such as its request headers) rather than by IP address alone. This means that even if a new IP is used, requests from blacklisted client machines/devices can still be blocked.
DOM manipulation
Many websites rely on DOM manipulation and dynamic content, making it difficult for generic scraper scripts to access data. Additionally, calling the site's APIs directly fails unless the dynamically generated security token is included in the request header.
DDoS attack protection
Websites are often equipped with DDoS attack protection. Because a scraper's high-volume automated traffic can resemble an attack, this protection can also block scrapers.
Changes in HTML structure
Websites change their HTML structure frequently, which can break scrapers and necessitates immediate script updates.
App-only platforms
The rise of “App-only” platforms poses a new challenge for scrapers: as companies increasingly shift towards this concept, there is no website left for a scraper to target.
Our Approach
The aforementioned points highlight common challenges frequently encountered during web/app scraping. Below, we outline a comprehensive strategy for overcoming these challenges:
IP blocking
To circumvent this issue, we can route requests through a large, diverse pool of IP addresses. Rather than sending requests at a fixed frequency from a single origin, we can randomize both the timing of requests and the IP address each one originates from.
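The randomization described above can be sketched as a small planning step. This is a minimal illustration only, not a definitive implementation: the function name plan_requests, the proxy labels, and the delay bounds are assumptions introduced here, and no network requests are made.

```python
import random


def plan_requests(urls, proxies, min_delay=1.0, max_delay=5.0, seed=None):
    """Pair each URL with a randomly chosen proxy and a random delay.

    Hypothetical helper: the name, parameters, and default delays are
    illustrative assumptions, not values taken from the proposal.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    rng = random.Random(seed)  # a seed makes the plan reproducible for testing
    plan = []
    for url in urls:
        proxy = rng.choice(proxies)               # random request origin
        delay = rng.uniform(min_delay, max_delay)  # random wait before the request
        plan.append((url, proxy, delay))
    return plan
```

A caller would sleep for the planned delay before sending each request through the chosen proxy, and should keep the overall rate within what the target site permits.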
Captcha
Notably, not all websites incorporate captcha security measures due to their potential disruption of user experience and interaction. To address this selectively, we can follow this approach:
- Identify Captcha Implementation: Determine which websites have deployed captcha protection and discern the specific action or event that triggers the captcha. If a recognizable pattern or cause can be isolated, adapt your requests or code to work around it.
- Intricate Captchas: In cases where no discernible pattern can be identified or when identified patterns cannot be circumvented, the solution may involve breaking the captcha. It’s worth noting that some captcha solutions may contain bugs or exploits, allowing for evasion. If evasion is not possible and the captcha lacks known exploits, consider implementing automated OCR-level captcha recognition. If even OCR-level recognition proves ineffective, manual human interaction (human farms) may become necessary.
App-Only Solution
- To gather data from such applications, we can pinpoint the pertinent APIs and devise strategies for their extraction.
- Typically, these APIs are protected by header tokens generated on the user’s end by the app itself, making direct API scraping ineffective. In such scenarios, techniques like those used in app developers’ automation testing can prove to be a fruitful approach.
Detecting changes in web page DOM, Navigation or API structure
Websites change frequently, which can disrupt scraping operations and necessitate prompt script updates. A template-based approach offers a solution: when the rules in a template stop matching the content they were written for, that failure signals a website/API modification, and only the affected template needs to be adjusted.
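One way to detect such changes is to check each extracted record for fields the template expects but did not find. The sketch below assumes records are plain dictionaries; the function names and the 50% threshold are hypothetical choices made for illustration.

```python
def missing_fields(record, required):
    """Return the required field names that are absent or empty in a record."""
    return [name for name in required if record.get(name) in (None, "")]


def template_is_stale(records, required, threshold=0.5):
    """Flag a template as stale when too many records miss required fields.

    The threshold is an assumed default; a real deployment would tune it.
    """
    if not records:
        return False
    failures = sum(1 for record in records if missing_fields(record, required))
    return failures / len(records) >= threshold
```

A single incomplete record may just be a page that lacks the data, so the check looks at the share of failing records before signalling that the template needs an update.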
HTTP Fingerprinting
This challenge can be surmounted through the use of cloud computing. By creating ephemeral cloud instances from preconfigured system images, we can swiftly replace any blocked instance with a new one, ensuring continuous operation.
DOM Manipulation
To extract data from these websites, we can develop a script that mimics user behaviour. The script can log in as a user, navigate, and search within a controlled browser environment. With complete access to the browser’s rendered content, scraping becomes a straightforward process.
DDoS Attack Protection
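Combining the solutions mentioned above (a distributed pool of IP addresses, randomized request timing, and browser-like behaviour) should keep the scraper's traffic from resembling an attack, which helps it get past DDoS protection.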
Scrapers
The scraper will have two components: Templates and a Worker.
Templates
To streamline data scraping and swiftly adapt to webpage changes, we propose the implementation of a rule-based parsing engine.
- Rule Storage: All rules will be centrally stored in a dedicated file known as a template.
- Flexible Data Extraction: Templates define the data to be scraped and provide instructions on how to locate it within the target webpage.
- Adaptability: If the webpage’s structure evolves, modifications need only be made to the pertinent template, leaving the rest of the program (Worker) unaffected.
- Efficiency for New Data: When extracting new data from the same page, this approach significantly expedites the process.
- Enhanced Extensibility: This method allows for the effortless addition of new rules for different websites, enhancing script extensibility.
- Relevance Across Platforms: Templates apply not only to web pages but also to mobile APIs, with each webpage or API having its own template.
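As a rough illustration of the template idea, the sketch below stores rules as JSON and applies them to a page. The template layout, the field names, and the use of regular expressions as rules are all assumptions made to keep the example within the standard library; a production engine would more likely use CSS or XPath selectors.

```python
import json
import re

# Hypothetical template: one regex rule per field, each with one capture group.
EXAMPLE_TEMPLATE = json.dumps({
    "site": "example-shop",
    "fields": {
        "title": {"pattern": r"<h1[^>]*>(.*?)</h1>"},
        "price": {"pattern": r'<span class="price">(.*?)</span>'},
    },
})


def load_template(text):
    """Parse a JSON template and compile the rule for each field."""
    raw = json.loads(text)
    if not raw.get("fields"):
        raise ValueError("template must define at least one field")
    compiled = {
        name: re.compile(rule["pattern"], re.DOTALL)
        for name, rule in raw["fields"].items()
    }
    return {"site": raw.get("site", ""), "fields": compiled}


def apply_template(template, content):
    """Return one value per field, or None where a rule does not match."""
    result = {}
    for name, pattern in template["fields"].items():
        match = pattern.search(content)
        result[name] = match.group(1).strip() if match else None
    return result
```

If the page layout changes, only the pattern strings in the template need editing; load_template and apply_template stay the same.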
Worker
The worker component plays a pivotal role in our solution, encompassing two vital segments:
- Reader: This constitutes the most critical aspect of our approach, as it requires a thorough analysis of each website’s unique security measures and page/content loading mechanisms. This evaluation guides the development of an effective reader module. Some websites deliver their content in the initial HTML document, while others load it dynamically through DOM manipulation.
- Extractor: The extractor component utilizes the predefined rules and patterns outlined in the templates to pinpoint and extract the essential data from the HTML/JSON content gathered by the reader. The extracted data is then stored in a cloud-based NoSQL database for future use.
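The reader/extractor split can be sketched as follows for JSON content. Everything named here is a hypothetical stand-in: read_static replaces a real reader (which would fetch over HTTP or drive a browser), the dotted-path rule syntax is an assumption, and a plain list stands in for the cloud-based NoSQL database.

```python
import json


def read_static(source):
    """Stand-in reader: returns the content it is given unchanged."""
    return source


def extract_json(content, rules):
    """Follow each dotted path (e.g. "product.offers.0.price") into the JSON."""
    data = json.loads(content)
    record = {}
    for field, path in rules.items():
        node = data
        for key in path.split("."):
            if isinstance(node, list):
                in_range = key.isdigit() and int(key) < len(node)
                node = node[int(key)] if in_range else None
            elif isinstance(node, dict):
                node = node.get(key)
            else:
                node = None
            if node is None:
                break  # path does not exist in this payload
        record[field] = node
    return record


class Worker:
    """Runs a reader, applies the template rules, and stores the record."""

    def __init__(self, reader, rules, store):
        self.reader = reader
        self.rules = rules
        self.store = store  # a list here; a database client in practice

    def run(self, source):
        content = self.reader(source)
        record = extract_json(content, self.rules)
        self.store.append(record)
        return record
```

Because the reader is passed in as a function, a site that needs a browser-driven reader can swap one in without touching the extractor or the storage step.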