Scrapy
Scrapy is an open-source framework written in Python for extracting data from websites. It is a tool designed for web scraping, which is the automatic retrieval of data from websites.
When programming, we often rely on existing APIs to provide the data our application needs. For example, when building an app that shows the current weather, we need to get the data from somewhere, and most often we use one of the APIs available on the market. But what if we can't find an API that offers the data we are interested in? That is when page scraping is worth considering. In this article I will introduce a tool that helps us scrape pages.
What is page scraping?
Page scraping means extracting content from a web page and saving it, for example for use in your own application. It is used by price-comparison sites such as Ceneo, by search engines such as Google, and by portals that collect job listings from other portals. Keep in mind that what we do later with such data can sometimes be illegal.
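To make the idea concrete, here is a minimal sketch of the "extract some content from a page" step using only the Python standard library. The HTML snippet, the class name TitleExtractor and the function extract_titles are made up for illustration. A real scraper would first download the page over HTTP, and a framework such as Scrapy handles that part for us.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <div class="offer"><h2>Python Developer</h2></div>
  <div class="offer"><h2>Data Engineer</h2></div>
</body></html>
"""


class TitleExtractor(HTMLParser):
    """Collects the text of every <h2> element on the page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside an <h2> element.
        if self.in_title and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles


titles = extract_titles(SAMPLE_HTML)
print(titles)
```

Once extracted, the list can be saved to a file or a database and used by the application.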
What is Scrapy?
Scrapy is a framework written in Python and one of the most popular and powerful tools for scraping websites. It provides the tools you need to efficiently extract data from pages, process it and store it in your preferred structure and format. Scrapy is easy to use, sends requests asynchronously, and can automatically adjust its crawling speed with the AutoThrottle extension.
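As a rough sketch, the automatic throttling can be switched on in a project's settings.py with a few settings. The setting names below are Scrapy's own, but the values are only illustrative assumptions and should be tuned for the site being scraped.

```python
# Fragment of a Scrapy project's settings.py (values are illustrative only).

# Turn the AutoThrottle extension on; it is disabled by default.
AUTOTHROTTLE_ENABLED = True

# Initial download delay, in seconds.
AUTOTHROTTLE_START_DELAY = 1.0

# Upper bound for the delay when the server responds slowly, in seconds.
AUTOTHROTTLE_MAX_DELAY = 30.0

# Average number of requests Scrapy should send in parallel to each remote site.
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Set to True to log throttling stats for every response received.
AUTOTHROTTLE_DEBUG = False
```

The same keys can also be placed in a spider's custom_settings dictionary if the throttling should apply to one spider only.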
Scrapy Spider
The most important part of Scrapy is its Spider classes. Scrapy uses them to collect information from a website. They define which pages our Spider should visit and how it should extract data from them.
An example of a Spider class that extracts quotes from a page:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # Follow the pagination link to the next page, if there is one.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
We save this code in a file named "quotes_spider.py" and start our scraping bot with the command:
scrapy runspider quotes_spider.py -o quotes.jl
When our bot finishes its work we should get a file "quotes.jl", which contains the quotes saved in JSON Lines format, that is, one JSON object per line.
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...
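Because every line of the output is a separate JSON object, the file can be read back line by line with Python's standard json module. The sketch below parses two of the records shown above from an inline string so that it runs on its own. The helper name parse_json_lines is made up for illustration.

```python
import json

# Two records copied from the sample output above, one JSON object per line.
# A raw string keeps the \u201c escapes intact so that json decodes them.
SAMPLE_JL = r'''
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
'''


def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


quotes = parse_json_lines(SAMPLE_JL)
for quote in quotes:
    print(f"{quote['author']}: {quote['text']}")
```

To process the real output, the same function can be given the contents of the file, for example the result of open("quotes.jl", encoding="utf-8").read().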