Serverless Web Scraping Using AWS Lambda and S3 — Python

John Calabrese
Dec 28, 2018

The newest in tech buzzwords: serverless! Soon to be a victim of misuse by every corporate strategist across America, just as blockchain was!

But there’s really no denying the merits of serverless. I think it’s beyond a doubt the future of computing. Whether serverless comes from centralized AWS, or hopefully from Ethereum (once we get scaling) or tangential projects like Golem, microservices that can be charged per invocation are here to stay. Currently, AWS rules this space with Lambda and its sidecar offerings (see: https://hackernoon.com/the-hitchhikers-guide-to-serverless-ec5efb8075d6).

I wanted to share my quick experience with web scraping on Lambda, since there aren’t many simple, practical posts on the topic. Web scraping (which is well covered in many other posts) grabs info from a public site, manipulates the data, and posts its output to be analyzed. It’s very much self-contained and representative of a true ‘microservice’. And since it’s such a common task, it’s a great entry point into serverless.

Let’s start with some background on the project…

I started collecting stats from a site every day around the same time using Python and BeautifulSoup. The basic idea is:

  • Get data from a website and store it in a DataFrame. In this case, I’m looking for certain words of interest in news headlines.
  • Save it to a CSV.
  • Run this every day (using cron) to see interesting things in how the data changes over time.

It was super easy to set up. The basic code, which uses BeautifulSoup and Pandas, looks like this:
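
A minimal sketch of that code, assuming a placeholder URL, an <h2> headline selector, and an example keyword list rather than the site and words from the actual project:

import datetime
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

# Placeholder values: swap in the site and keywords you actually care about.
URL = "https://example.com/news"
KEYWORDS = ["economy", "rates", "earnings"]

def web_scrape():
    """Scrape headlines and count keyword mentions into a one-row DataFrame."""
    soup = BeautifulSoup(urlopen(URL).read(), "html.parser")

    # Assumes headlines live in <h2> tags; adjust the selector for your site.
    headlines = [h.get_text(strip=True) for h in soup.find_all("h2")]

    counts = {w: sum(w.lower() in h.lower() for h in headlines) for w in KEYWORDS}
    counts["date"] = datetime.date.today().isoformat()
    return pd.DataFrame([counts])

if __name__ == "__main__":
    # Locally the output just goes to a CSV; cron handled the daily run.
    web_scrape().to_csv("headline_counts.csv", index=False)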

This was working great locally, but I quickly realized there was no way my computer could always be on at the same time every day. So I set up an AWS EC2 instance to host the code, and that worked great for a month or so. Until you start seeing the usage bills for something so simple…


It wasn’t enough to break the bank, but clearly could use improvement — like free. Free is much better.

So I started investigating a solution. The most common one was AWS Lambda. Lambda is a great tool since you can set up a schedule for the function to run and not worry about starting and stopping the server yourself. It can be triggered on a cron schedule too, so it’s a similar setup to the cron job on my local Mac.

The problem is that using Lambda is tough, mainly because there is no persistent local storage like you have on EC2. This means Lambda is intended for data transformation, not data transportation/storage. For that you need to connect to another Amazon storage service — either DynamoDB or S3.

Another reason Lambda is tough: the documentation is tricky. There are so many tutorials and documents from Amazon on this… but all of them get pretty technical, pretty quickly. Take a look here: https://docs.aws.amazon.com/lambda/latest/dg/welcome.html. I just needed to store the generated CSV file in S3 on a schedule. It didn’t need to be that complicated. The handler(event, context) concepts are interesting and useful, but you don’t need to dig deep into them for something so simple.

So let’s take a look at the storage piece I used. With Amazon’s boto3 library, you can use your access key to place files into a given bucket. Note: to do this, you’ll need AWS credentials configured. It’s pretty simple after that: get the data using the scrape function, add the date to the file name (since this runs every day and I need to tell the files apart), and put it in S3 using boto3:
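
A sketch of that handler, assuming it sits in the same file as the web_scrape() function above; the bucket name and key prefix below are placeholders:

import datetime
import io

import boto3

def handler(event, context):
    """Lambda entry point: scrape, then drop a dated CSV into S3."""
    df = web_scrape()  # defined above in the same file

    # Date-stamp the key so each daily run produces its own file.
    key = "headline_counts_{}.csv".format(datetime.date.today().isoformat())

    # Lambda has no persistent disk, so write the CSV to an in-memory buffer.
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)

    # Credentials come from the Lambda execution role (or your local AWS config when testing).
    s3 = boto3.client("s3")
    s3.put_object(Bucket="my-scrape-bucket", Key=key, Body=buffer.getvalue().encode("utf-8"))

    return {"statusCode": 200, "body": key}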

Seems like it should work fine, right? Wrong.

This project already has a few dependencies: boto3, pandas, and bs4.

All of these dependencies need to be packaged along with the function. There are a few ways to do this, but it’s easiest to package everything together using the Serverless Framework. I would go into the details of Serverless here, but the post by Michael Lavers is a great resource and better than anything I could write on the topic.

Once Serverless is set up, we add this new function to the project created in that post: put web_scrape() and handler() together in one file, then update serverless.yml so the handler points to [named file].handler:

functions:
  scraper:
    handler: scraper.handler

Also change the requirements.txt file to include our packages:

boto3
pandas
bs4

and rerun:

sls deploy

This will add all of the dependency packages without us doing a thing! It’s honestly great, and saves so much time.
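
That works assuming the project uses the serverless-python-requirements plugin, which is the usual way the Serverless Framework bundles a Python requirements.txt. The relevant lines in serverless.yml are:

plugins:
  - serverless-python-requirements

If it isn’t already in the project, it can be added with npm install --save-dev serverless-python-requirements.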

Now that the function is up on Lambda, all we need to do is add a cron trigger from CloudWatch so it fires on a schedule.
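
The trigger can be clicked together in the CloudWatch console, or, if you’d rather keep everything in the project, declared as a schedule event in serverless.yml. A sketch, using an arbitrary daily run at 13:00 UTC:

functions:
  scraper:
    handler: scraper.handler
    events:
      # Fires once a day at 13:00 UTC; adjust the cron expression to taste.
      - schedule: cron(0 13 * * ? *)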

And test it out to see that the file gets added to S3. If the .csv was successfully added to the bucket, then you’re good to go. Now you have a serverless function that will scrape a webpage however often you’d like! Since Amazon’s free tier includes 1 million free requests per month, it hasn’t cost me anything to collect this data.
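
One easy way to do that test is to invoke the deployed function straight from the Serverless CLI and watch its logs:

sls invoke -f scraper --log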

Easy right?

Happy data gathering!
