For a Dutch start-up (which prefers to remain anonymous), we developed several web scrapers using the Scrapy framework. The scrapers were deployed as AWS Lambda functions and scheduled with CloudWatch, keeping running costs low.
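To give an idea of how a scheduled scraper runs on Lambda: each CloudWatch rule can pass the spider to run in its event payload, so a single deployment artifact can back several schedules. The sketch below is illustrative, not the client's actual code; the event shape and the `spider` key are assumptions, and the real function would start a Scrapy crawl instead of the stub shown here.

```python
import json


def handler(event, context):
    """Hypothetical Lambda entry point, triggered by a CloudWatch schedule.

    The scheduled rule passes the spider name in the event payload, so one
    deployed function can serve multiple scraping schedules.
    """
    spider_name = event.get("spider", "default")
    # In the real function this is where the Scrapy crawl would be started
    # (e.g. via scrapy.crawler.CrawlerProcess); stubbed here to keep the
    # sketch dependency-free.
    return {"statusCode": 200, "body": json.dumps({"spider": spider_name})}
```

Passing the spider name in the event, rather than hardcoding it, keeps one Lambda package per project instead of one per website.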
📂 The Case
The start-up approached us and laid out their requirements: a set of web scrapers, one for each kind of website they specified, that would save all scraped data into a single central database. That data would then be served on one website, bringing all the content together in one place. High availability, full automation, and low running costs were required.
⚙️ Our approach
For this project we used Scrapy, a free and open-source web-crawling framework written in Python. We wrote a separate spider for every website, deployed the spiders as AWS Lambda functions, and scheduled them with CloudWatch. The scraped data was then posted to the API of the website that displays all the content from the central database. This API was built with Django REST Framework; the database is PostgreSQL, and the content is served with Django.
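The hand-off from spider to API can be sketched as a Scrapy item pipeline. Scrapy pipelines are duck-typed (any class with a `process_item` method works), so the sketch below runs without Scrapy installed; the endpoint URL is a placeholder, not the client's real API, and a production pipeline would add authentication and retry handling.

```python
import json
import urllib.request

# Placeholder endpoint; the real DRF API URL is not part of this write-up.
API_ENDPOINT = "https://example.com/api/items/"


class ApiExportPipeline:
    """Posts each scraped item to the central Django REST Framework API."""

    def serialize(self, item):
        # Scrapy items behave like dicts, so dict(item) covers both
        # plain dicts and Item subclasses.
        return json.dumps(dict(item)).encode("utf-8")

    def process_item(self, item, spider):
        request = urllib.request.Request(
            API_ENDPOINT,
            data=self.serialize(item),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # A real pipeline would also send auth headers and retry on failure.
        urllib.request.urlopen(request)
        return item
```

Pushing items through an HTTP API, rather than writing to PostgreSQL directly, keeps the Lambda functions free of database credentials and lets the DRF serializers validate everything that enters the central database.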