Web scraping consists of using technical means, including software such as the stabler.tech tools, to collect public data at scale from digital media on the Internet.
This covers all types of media, such as blogs or websites.
The aim of this collection is to store the data in an organised form in a database.
When these operations are run regularly, users of web scraping tools build up a real "treasure trove of data" that they can reuse in a variety of applications: building a pricing and demand prediction strategy, studying the competitive landscape of their company, feeding artificial intelligence models, enriching internal company data, etc. Web scraping also has relevant applications in many domains, such as scientific research, real estate, finance, e-commerce or blockchain, to name a few.
Is the practice of extracting large volumes of data legal? What precautions or good practices must the software provider and its client implement to respect the legal framework and thus guarantee the legality of the practices envisaged?
It is crucial to note first that web scraping is not unlawful or prohibited in itself, and neither are the tools that allow data to be extracted from the Internet.
A detailed study of the legal context highlights four essential points of attention for respecting the rights of third parties and the regulatory requirements, and so ensuring the legality of the operation. stabler.tech has integrated these points into the development of its technologies. It is not our intention here to go into the details of the applicable legal provisions, only to highlight the principles at the core of our practices, so that they respect, by design, the legal environment to which they are subject.
Respecting the sui generis right of the database producer
Some data on the web, when constituted as databases, may receive, in addition to copyright protection, more specific protection by the "sui generis" right of database producers. This protection is acquired when the database producer can demonstrate significant investment in the creation of these databases. The producer with this protection can thus prohibit the extraction of all or part of these contents.
In practice, case law shows that this protection is difficult to obtain, so most websites whose data is scraped can hardly claim protection under the sui generis right of the database producer. Nevertheless, we recommend that our clients make measured use of extractions (quantity, frequency and targeting), as a reasonable Internet user would.
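As an illustration of what "measured use" can mean in terms of quantity and frequency, here is a minimal Python sketch of a throttle that enforces a minimum delay between requests and a maximum number of requests per run. The class name `PoliteThrottle` and its parameters are hypothetical and are not part of the stabler.tech tools; the appropriate values depend on the target site and are not legal thresholds.

```python
import time
from typing import Callable, Optional


class PoliteThrottle:
    """Caps the pace and the total volume of requests in one scraping run.

    Hypothetical helper for illustration; the limits are assumptions
    chosen by the operator, not values defined by any regulation.
    """

    def __init__(
        self,
        min_interval_s: float,
        max_requests: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval_s = min_interval_s
        self.max_requests = max_requests
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None
        self.sent = 0

    def wait_turn(self) -> bool:
        """Wait until the next request is allowed.

        Returns False, without waiting, once the request budget is spent.
        """
        if self.sent >= self.max_requests:
            return False
        if self._last is not None:
            remaining = self.min_interval_s - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
        self.sent += 1
        return True
```

A bot would call `wait_turn()` before each request and stop as soon as it returns `False`. Targeting, the third criterion, remains a matter of which pages the bot is configured to visit.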
Respecting the General Conditions of Use of the digital media concerned
A decision of the Court of Justice of the European Union of 15 January 2015 (C-30/14, Ryanair Ltd v. PR Aviation BV) states that when a digital medium benefits neither from the specific protection provided by the sui generis right of databases nor from copyright protection, the conditions under which third parties may use the data can be set by contractual provisions: this is the principle of the General Conditions of Use - or GCU.
The GCU therefore set the contractual rules to which the parties, the users of the site, are subject. Provided the terms are clear and the information has been given, consent can be inferred from simply browsing the site.
stabler.tech always recommends that its clients check the content of the GCU of the digital media from which they wish to extract data, and comply with the limits those terms set. This check should be repeated each time the bots are relaunched, to ensure that the GCU have not changed.
stabler.tech only provides the data extraction tools and does not have access to the configurations and data extracted by its customers. Under our terms and conditions, our clients are therefore responsible for complying with the GCU of the digital media targeted by their web scraping operations.
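One way to notice that the GCU have changed between two runs is to keep a fingerprint of the text and compare it before each relaunch. The sketch below is a minimal Python illustration under that assumption; the function names are hypothetical, and how the GCU text is retrieved and where the fingerprint is stored are left to the reader.

```python
import hashlib
from typing import Optional


def terms_fingerprint(terms_text: str) -> str:
    """Return a SHA-256 digest of the terms, ignoring whitespace-only differences."""
    normalised = " ".join(terms_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def terms_changed(current_text: str, stored_fingerprint: Optional[str]) -> bool:
    """True when no earlier fingerprint exists or the current text differs from it."""
    if stored_fingerprint is None:
        return True
    return terms_fingerprint(current_text) != stored_fingerprint
```

A changed fingerprint only signals that the text differs; it says nothing about what changed, so the new GCU still need to be read before the bots are relaunched.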
Complying with the laws on access and retention in an Automated Data Processing System
When our clients use the tools provided by stabler.tech, our technology allows them to access digital media and, via computer queries, reproduce the human behaviour of an Internet user in an automated manner.
Depending on the case, these requests can be very targeted or, conversely, more massive and systematic, exceeding the typical browsing speed of an Internet user.
Criminal law punishes the infringement of automated data processing systems. The question may therefore arise as to whether web scraping can constitute such an offence, characterised by attempting to access or remain in an ADPS (Automated Data Processing System), i.e. the digital medium.
Web scraping does not necessarily and systematically constitute fraudulent access to an ADPS. It is up to our clients who use bots to do so in compliance with the terms and conditions of the digital medium and to reproduce, despite the use of a bot, the behaviour of a typical Internet user.
Respecting the principles of the GDPR when processing data
Data protection law
The GDPR and, at national level, the Data Protection Act impose a number of compliance requirements to ensure the lawfulness of personal data processing. Web scraping is undoubtedly a data processing activity and must therefore, where personal data is concerned, meet these requirements. As a French and European company, stabler.tech is subject to these requirements and has integrated them by design, in order to fully respect personal data and the rights of the persons concerned.
Regulation, a source of new use cases for web scraping
When extracting data from digital media using web scraping tools such as those offered by stabler.tech, it is therefore necessary to follow a set of good practices in order to comply with the French and European legal framework.
Far from being a constraint, this legal framework protects the rights of database creators and of users, along with their personal data.
Moreover, our solutions can be used to ensure that users' rights are fully respected. The application of standards can thus generate new use cases for web scraping. In a future article, we will come back to the European "Omnibus" directive, which came into force in France on 28 May 2022 and regulates the notion of "crossed-out prices" when displaying promotions.
Thus, a retailer or a website must now ensure that the crossed-out prices displayed on their media refer to a real price actually charged in the last 30 days.
When thousands - or even millions - of products are offered by these retailers, it is immediately clear that automated means of extracting these "prices charged in the last 30 days" will quickly become crucial, and this is a task for web scraping.
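To make the check concrete, here is a minimal Python sketch that tests whether a crossed-out price matches a price actually observed during the preceding 30 days, following the rule as summarised above. The function name, the shape of the price history (date and price pairs, as a scraper might record them) and the exact-match comparison are assumptions for illustration; the precise legal test should be taken from the directive and its national transposition.

```python
from datetime import date, timedelta
from decimal import Decimal
from typing import Iterable, Tuple


def crossed_out_price_is_supported(
    crossed_out: Decimal,
    observations: Iterable[Tuple[date, Decimal]],
    as_of: date,
    window_days: int = 30,
) -> bool:
    """True if the crossed-out price was observed within the window ending on as_of.

    observations is a hypothetical price history of (date, price) pairs.
    """
    window_start = as_of - timedelta(days=window_days)
    return any(
        window_start <= seen_on <= as_of and price == crossed_out
        for seen_on, price in observations
    )
```

For example, with a recorded history for one product:

```python
history = [
    (date(2022, 5, 1), Decimal("99.90")),
    (date(2022, 6, 10), Decimal("79.90")),
    (date(2022, 6, 20), Decimal("59.90")),
]
# 99.90 was last seen on 1 May, outside the 30 days before 25 June.
crossed_out_price_is_supported(Decimal("99.90"), history, date(2022, 6, 25))  # False
crossed_out_price_is_supported(Decimal("79.90"), history, date(2022, 6, 25))  # True
```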
stabler.tech is available to discuss this with you. Please contact us!