Guide To Web Scraping




Web scraping is the automated process of extracting data from the web in a format of your choice. It has become so important for a simple reason: much of the data you can view on the internet isn’t available for download, yet you often need it saved locally and in a structured format.

As the digital economy expands, the role of web scraping becomes ever more important. Read on to learn what web scraping is, how it works, and why it’s so important for data analytics.

The amount of data in our lives is growing exponentially. With this surge, data analytics has become a hugely important part of the way organizations are run. And while data has many sources, its biggest repository is on the web. As the fields of big data analytics, artificial intelligence and machine learning grow, companies need data analysts who can scrape the web in increasingly sophisticated ways.

This beginner’s guide offers a complete introduction to web scraping: what it is, how it’s used, and what the process involves.

Before we get into the details, though, let’s start with the simple stuff…

1. What is web scraping?

Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed. If you’ve ever copied and pasted content from a website into an Excel spreadsheet, this is essentially what web scraping is, but on a very small scale.

However, when people refer to ‘web scrapers,’ they’re usually talking about software applications. Web scraping applications (or ‘bots’) are programmed to visit websites, grab the relevant pages and extract useful information. By automating this process, these bots can extract huge amounts of data in a very short time. This has obvious benefits in the digital age, when big data—which is constantly updating and changing—plays such a prominent role. You can learn more about the nature of big data in this post.


What kinds of data can you scrape from the web?

If there’s data on a website, then in theory, it’s scrapable! Common data types organizations collect include images, videos, text, product information, customer sentiments and reviews (on sites like Twitter, Yell, or Tripadvisor), and pricing from comparison websites. There are some legal rules about what types of information you can scrape, but we’ll cover these later on.

2. What is web scraping used for?

Web scraping has countless applications, especially within the field of data analytics. Market research companies use scrapers to pull data from social media or online forums for things like customer sentiment analysis. Others scrape data from product sites like Amazon or eBay to support competitor analysis.

Meanwhile, Google regularly uses web scraping to analyze, rank, and index web content. Web scraping also allows them to extract information from third-party websites for use on their own services (for instance, they scrape e-commerce sites to populate Google Shopping).

Many companies also carry out contact scraping, which is when they scrape the web for contact information to be used for marketing purposes. If you’ve ever granted a company access to your contacts in exchange for using their services, then you’ve given them permission to do just this.

There are few restrictions on how web scraping can be used. It’s essentially down to how creative you are and what your end goal is. From real estate listings, to weather data, to carrying out SEO audits, the list is pretty much endless!

However, it should be noted that web scraping also has a dark underbelly. Bad actors often scrape data like bank details or other personal information to conduct fraud, scams, intellectual property theft, and extortion. It’s good to be aware of these dangers before starting your own web scraping journey. Make sure you keep abreast of the legal rules around web scraping. We’ll cover these a bit more in section six.

3. How does a web scraper function?

So, we now know what web scraping is, and why different organizations use it. But how does a web scraper work? While the exact method differs depending on the software or tools you’re using, all web scraping bots follow three basic principles:

  • Step 1: Making an HTTP request to a server
  • Step 2: Extracting and parsing (or breaking down) the website’s code
  • Step 3: Saving the relevant data locally

Now let’s take a look at each of these in a little more detail.

Step 1: Making an HTTP request to a server

As an individual, when you visit a website via your browser, you send what’s called an HTTP request. This is basically the digital equivalent of knocking on the door, asking to come in. Once your request is approved, you can then access that site and all the information on it. Just like a person, a web scraper needs permission to access a site. Therefore, the first thing a web scraper does is send an HTTP request to the site it’s targeting.
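To make this concrete, here’s a minimal sketch of step 1 in Python using the popular requests library (the URL and User-Agent string are placeholders, not part of this guide):

```python
import requests

# Placeholder URL: substitute a site you have permission to scrape.
url = "https://example.com/books"

# Identify your bot politely with a User-Agent header.
response = requests.get(url, headers={"User-Agent": "my-scraper/0.1"}, timeout=10)

# A 200 status code means the server accepted the request.
if response.status_code == 200:
    html = response.text
    print(f"Fetched {len(html)} characters of HTML")
else:
    print(f"Request failed: HTTP {response.status_code}")
```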

Step 2: Extracting and parsing the website’s code

Once a website gives a scraper access, the bot can read and extract the site’s HTML or XML code. This code determines the website’s content structure. The scraper will then parse the code (which basically means breaking it down into its constituent parts) so that it can identify and extract elements or objects that have been predefined by whoever set the bot loose! These might include specific text, ratings, classes, tags, IDs, or other information.
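A sketch of what step 2 might look like with BeautifulSoup (covered in section five); the div and class names are hypothetical stand-ins for whatever elements you’ve predefined:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page (placeholder URL) and parse its HTML into a tree.
html = requests.get("https://example.com/books", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Extract the predefined elements -- here, hypothetical review blocks.
for review in soup.find_all("div", class_="review"):
    title = review.find("h3")
    rating = review.find("span", class_="rating")
    if title and rating:
        print(title.get_text(strip=True), rating.get_text(strip=True))
```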

Step 3: Saving the relevant data locally

Once the HTML or XML has been accessed, scraped, and parsed, the web scraper will then store the relevant data locally. As mentioned, the data extracted is predefined by you (having told the bot what you want it to collect). Data is usually stored in a structured format, often a spreadsheet-friendly file such as .csv or .xlsx.
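And step 3 can be as simple as writing the parsed rows to a .csv with Python’s built-in csv module (the example rows here are, of course, made up):

```python
import csv

# Hypothetical rows produced by the parsing step.
rows = [
    {"title": "Example Book", "author": "A. Writer", "rating": "4.5"},
]

# Write the rows to a local .csv file with a header row.
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "rating"])
    writer.writeheader()
    writer.writerows(rows)
```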

With these steps complete, you’re ready to start using the data for your intended purposes. Easy, eh? And it’s true…these three steps do make data scraping seem easy. In reality, though, the process isn’t carried out just once, but countless times. This comes with its own swathe of problems that need solving. For instance, badly coded scrapers may send too many HTTP requests, which can crash a site. Every website also has different rules for what bots can and can’t do. Executing web scraping code is just one part of a more involved process. Let’s look at that now.

4. How to scrape the web (step-by-step)

OK, so we understand what a web scraping bot does. But there’s more to it than simply executing code and hoping for the best! In this section, we’ll cover all the steps you need to follow. The exact method for carrying out these steps depends on the tools you’re using, so we’ll focus on the (non-technical) basics.

Step one: Find the URLs you want to scrape

It might sound obvious, but the first thing you need to do is to figure out which website(s) you want to scrape. If you’re investigating customer book reviews, for instance, you might want to scrape relevant data from sites like Amazon, Goodreads, or LibraryThing.

Step two: Inspect the page

Before coding your web scraper, you need to identify what it has to scrape. Right-clicking anywhere on a webpage gives you the option to ‘inspect element’ or ‘view page source.’ This reveals the site’s underlying HTML code, which is what the scraper will read.

Step three: Identify the data you want to extract

If you’re looking at book reviews on Amazon, you’ll need to identify where these are located in the underlying code. Most browsers automatically highlight selected frontend content with its corresponding source code. Your aim is to identify the unique tags that enclose (or ‘nest’) the relevant content (e.g. <div> tags).


Step four: Write the necessary code

Once you’ve found the appropriate nest tags, you’ll need to incorporate these into your preferred scraping software. This basically tells the bot where to look and what to extract. It’s commonly done using Python libraries, which do much of the heavy lifting. You need to specify exactly what data types you want the scraper to parse and store. For instance, if you’re looking for book reviews, you’ll want information such as the book title, author name, and rating.

Step five: Execute the code

Once you’ve written the code, the next step is to execute it. Now to play the waiting game! This is where the scraper requests site access, extracts the data, and parses it (as per the steps outlined in the previous section).

Step six: Store the data

After extracting, parsing, and collecting the relevant data, you’ll need to store it. You can instruct your algorithm to do this by adding extra lines to your code. Which format you choose is up to you, but as mentioned, spreadsheet formats are the most common. You can also run your output through Python’s re module (short for ‘regular expressions’) to extract a cleaner set of data that’s easier to read.
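Here’s a small sketch of that kind of clean-up with Python’s re module, using a made-up scraped string:

```python
import re

# A messy scraped string, and the numeric rating extracted from it.
raw = "Rating: 4.5 out of 5 stars (1,234 reviews)"
match = re.search(r"(\d+(?:\.\d+)?)\s*out of\s*5", raw)
if match:
    rating = float(match.group(1))
    print(rating)  # 4.5
```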

Now you’ve got the data you need, you’re free to play around with it. Of course, as we often learn in our explorations of the data analytics process, web scraping isn’t always as straightforward as it at first seems. It’s common to make mistakes and you may need to repeat some steps. But don’t worry, this is normal, and practice makes perfect!

5. What tools can you use to scrape the web?

We’ve covered the basics of how to scrape the web for data, but how does this work from a technical standpoint? Often, web scraping requires some knowledge of programming languages, the most popular for the task being Python. Luckily, Python comes with a huge number of open-source libraries that make web scraping much easier. These include:

BeautifulSoup

BeautifulSoup is a Python library commonly used to parse data from XML and HTML documents. Organizing this parsed content into more accessible trees, BeautifulSoup makes navigating and searching through large swathes of data much easier. It’s the go-to tool for many data analysts.
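For a feel of what that looks like, here’s a tiny self-contained sketch (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Books</h1><ul><li>Title A</li><li>Title B</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree directly by tag name...
print(soup.h1.string)  # Books

# ...or search it for all matching elements.
print([li.get_text() for li in soup.find_all("li")])  # ['Title A', 'Title B']
```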

Scrapy

Scrapy is a Python-based application framework that crawls and extracts structured data from the web. It’s commonly used for data mining, information processing, and archiving historical content. As well as web scraping (which it was specifically designed for), it can be used as a general-purpose web crawler, or to extract data through APIs.
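As a rough illustration, a minimal Scrapy spider might look like the sketch below; the start URL and CSS selectors are hypothetical:

```python
import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://example.com/books"]  # placeholder

    def parse(self, response):
        # Yield one item per matching element; adapt selectors to the target page.
        for book in response.css("div.book"):
            yield {
                "title": book.css("h2::text").get(),
                "rating": book.css("span.rating::text").get(),
            }
        # Follow pagination until no 'next' link remains.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved to a file, a spider like this can be run with `scrapy runspider spider.py -o books.csv`.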

Pandas

Pandas is another multi-purpose Python library used for data manipulation and indexing. It can be used to scrape the web in conjunction with BeautifulSoup. The main benefit of using pandas is that analysts can carry out the entire data analytics process using one language (avoiding the need to switch to other languages, such as R).
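For example, pandas can pull simple HTML tables from a page on its own, assuming the page uses real <table> markup (the URL below is a placeholder, and read_html needs lxml or html5lib installed):

```python
import pandas as pd

# read_html parses every <table> on the page into a list of DataFrames.
tables = pd.read_html("https://example.com/prices")
df = tables[0]

# Hand off to the rest of your pandas workflow, or save locally.
df.to_csv("prices.csv", index=False)
```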

Parsehub

A bonus tool, in case you’re not an experienced programmer! Parsehub is a free online tool (to be clear, this one’s not a Python library) that makes it easy to scrape online data. The only catch is that for full functionality you’ll need to pay. But the free tool is worth playing around with, and the company offers excellent customer support.

There are many other tools available, from general-purpose scraping tools to those designed for more sophisticated, niche tasks. The best thing to do is to explore which tools suit your interests and skill set, and then add the appropriate ones to your data analytics arsenal!

6. What else do you need to know about web scraping?

We already mentioned that web scraping isn’t always as simple as following a step-by-step process. Here’s a checklist of additional things to consider before scraping a website.

Have you refined your target data?

When you’re coding your web scraper, it’s important to be as specific as possible about what you want to collect. Keep things too vague and you’ll end up with far too much data (and a headache!) It’s best to invest some time upfront to produce a clear plan. This will save you lots of effort cleaning your data in the long run.

Have you checked the site’s robots.txt?

Each website has what’s called a robots.txt file. This must always be your first port of call. This file communicates with web scrapers, telling them which areas of the site are out of bounds. If a site’s robots.txt disallows scraping on certain (or all) pages then you should always abide by these instructions.
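Python’s standard library can do this check for you; here’s a minimal sketch with urllib.robotparser (the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Ask whether our bot may fetch a given page.
url = "https://example.com/books"
if rp.can_fetch("my-scraper", url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows fetching", url)
```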

Have you checked the site’s terms of service?

In addition to the robots.txt, you should review a website’s terms of service (TOS). While the two should align, this is sometimes overlooked. The TOS might have a formal clause outlining what you can and can’t do with the data on their site. You can get into legal trouble if you break these rules, so make sure you don’t!

Are you following data protection protocols?

Just because certain data is available doesn’t mean you’re allowed to scrape it, free from consequences. Be very careful about the laws in different jurisdictions, and follow each region’s data protection protocols. For instance, in the EU, the General Data Protection Regulation (GDPR) protects certain personal data from extraction, meaning it’s against the law to scrape it without people’s explicit consent.

Are you at risk of crashing a website?

Big websites, like Google or Amazon, are designed to handle high traffic. Smaller sites are not. It’s therefore important that you don’t overload a site with too many HTTP requests, which can slow it down, or even crash it completely. In fact, this is a technique often used by hackers. They flood sites with requests to bring them down, in what’s known as a ‘denial of service’ attack. Make sure you don’t carry one of these out by mistake! Don’t scrape too aggressively, either; include plenty of time intervals between requests, and avoid scraping a site during its peak hours.
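A simple way to build in those intervals is a fixed delay between requests, as in this sketch (placeholder URLs; the two-second pause is an arbitrary, conservative choice):

```python
import time
import requests

# Placeholder URLs; the fixed delay keeps the request rate gentle.
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests so the site isn't flooded
```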

Be mindful of all these considerations, be careful with your code, and you should be happily scraping the web in no time at all.

7. In summary

In this post, we’ve looked at what data scraping is, how it’s used, and what the process involves. Key takeaways include:

  • Web scraping can be used to collect all sorts of data types: From images to videos, text, numerical data, and more.
  • Web scraping has multiple uses: From contact scraping and trawling social media for brand mentions to carrying out SEO audits, the possibilities are endless.
  • Planning is important: Taking time to plan what you want to scrape beforehand will save you effort in the long run when it comes to cleaning your data.
  • Python is a popular tool for scraping the web: Python libraries like BeautifulSoup, Scrapy, and pandas are all common tools for scraping the web.
  • Don’t break the law: Before scraping the web, check the laws in various jurisdictions, and be mindful not to breach a site’s terms of service.
  • Etiquette is important, too: Consider factors such as a site’s resources—don’t overload them, or you’ll risk bringing them down. It’s nice to be nice!

Data scraping is just one of the steps involved in the broader data analytics process. To learn more about data analytics, why not check out our free, five-day data analytics short course?

Monday, October 21, 2019

After years of working in the web scraping industry and talking to users from all over the world, I’ve found that job data stands out as some of the most sought-after information on the web. I was honestly a bit overwhelmed until I came across Gallup's 2017 State of the American Workplace report, which stated that 51% of currently employed adults are searching for new jobs or looking for new work opportunities, and that 58% of job seekers look for jobs online. In other words, this market is huge. At the same time, I was also surprised to find out how many ways there are to utilize job data, just to name a few:

  1. Fueling job aggregator sites with fresh job data.
  2. Collecting data for analyzing job trends and the labor market.
  3. Tracking competitors' open positions, compensation, and benefits plans to get a leg up on the competition.
  4. Generating leads by pitching your services to companies that are hiring.
  5. Staffing agencies scrape job boards to keep their job databases up-to-date.
And trust me, these are only the tip of the iceberg. That said, scraping job postings isn't always the easiest thing to do.

Challenges for scraping job postings:

First and foremost, you'll need to decide where to extract this information. There are two main types of sources for job data:

  1. Major job aggregator sites like Indeed, Monster, Naukri, ZipRecruiter, Glassdoor, Craigslist, LinkedIn, SimplyHired, reed.co.uk, Jobster, Dice, and Facebook Jobs.
  2. Company career pages: every company, big or small, has a careers section on its website. Scraping those pages on a regular basis can give you the most up-to-date list of job openings.

[Further reading: 70 Amazing Free Data Sources You Should Know]


Next, you'll need a web scraper for any of the websites mentioned above. Large job portals can be extremely tricky to scrape because they almost always implement anti-scraping techniques to prevent bots from collecting information off them. Some of the more common blocks include IP blocking, tracking of suspicious browsing activity, honeypot traps, and Captchas that prevent excessive page visits. If you are interested, this article provides good insights into how to get around some of the most common anti-scraping blocks. By contrast, companies' career sections are usually easier to scrape. Yet, because each company has its own website, you'll need to set up a separate crawler for each one. Not only is the upfront cost high, but maintaining the crawlers is also challenging, as websites change their layouts quite often.

What are the options for scraping job data?

There are a few options for how you can scrape job listings from the web.

1. Hiring a web scraping service (DaaS)

These companies provide what is generally known as a 'managed service'. Some well-known web scraping vendors include Scrapinghub, Datahen, and Data Hero. They take in your requirements and set up whatever is needed to get the job done, such as the scripts, the servers, and the IP proxies. Data will be provided to you in the format and at the frequency you require. Scraping services usually charge based on the number of websites, the amount of data to fetch, and the frequency of the crawl. Some companies charge extra for the number of data fields and for data storage. Website complexity is, of course, a major factor that affects the final price. For every website setup, there's usually a one-off setup fee and a monthly maintenance fee.


Pros:


  1. No learning curve. Data is delivered to you directly.
  2. Highly customizable and tailored to your needs.
Cons:
  1. The cost can be high, especially if you have a lot of websites to scrape ($350 ~ $2500 per project + $60 ~ $500 monthly maintenance fee).
  2. Long-term maintenance costs can cause the budget to spiral out of control.
  3. Extended development time as each website will need to be set up in its entirety (3 to 10 business days per site).

2. In-house web scraping setup

Doing web scraping in-house with your own tech team and resources comes with its own perks and drawbacks.



Pros:

  1. Complete control over the crawling process.
  2. Fewer communication challenges, faster turnaround.
Cons:
  1. High cost. A team of tech talent costs a lot (as much as 20x more, from what I've heard).
  2. Less expertise. Web scraping is a niche process that requires a high level of technical skill, especially if you need to scrape some of the more popular websites or extract a large amount of data on a regular basis. Starting from scratch is tough even if you hire professionals, whereas data service providers and scraping tools are likely to be more experienced at tackling unanticipated obstacles.
  3. Loss of focus. Why not spend more time and energy on growing your business?
  4. Infrastructure requirements. Owning the crawling process also means you'll have to get the servers for running the scripts, data storage, and transfer. There's also a good chance you'll need a proxy service provider and a third-party Captcha solver. Getting all of these in place and maintaining them on a daily basis can be extremely tiring and inefficient.
  5. Maintenance headache. Scripts need to be updated or even rewritten all the time, as they break whenever websites update their layouts or code.
  6. Legal risks. Web scraping is legal in most cases, though there's a lot of debate, and the law hasn't explicitly come down on one side or the other. Generally speaking, public information is safe to scrape; if you want to be more cautious, check and avoid infringing the TOS (terms of service) of the website. That said, should this become a concern, hiring another company/person to do the job will reduce the level of risk associated with it.

3. Using a web scraping tool

Technology keeps advancing, and just like anything else, web scraping can now be automated. There are many web scraping tools designed for non-technical people to fetch data from the web. These so-called web scrapers or web extractors traverse the website and capture the designated data by deciphering the HTML structure of the webpage. You 'tell' the scraper what you need through drags and clicks. The program learns what you need through its built-in algorithm and performs the scraping automatically. Most scraping tools can be scheduled for regular extraction and can be integrated into your own system.

[Further reading: Top 30 Free Web Scraping Software]

Pros:

  1. Budget-friendly. Most web scraping tools support monthly payments ($60 ~ $200 per month) and some even offer free plans that are quite robust (such as the one I use).
  2. Non-coder friendly. Most of them are relatively easy to use and can be handled by people with little or no technical knowledge. If you want to save time, some vendors offer crawler setup services as well as training sessions.
  3. Scalable. Easily supports projects of all sizes, from one to thousands of websites. Scale up as you go.
  4. Fast turnaround. Depending on your efforts, a crawler can be built in 10 minutes.
  5. Complete control. Once you've learned the process, you can set up more crawlers or modify the existing ones without seeking help from the tech team or service provider.
  6. Low maintenance cost. As you won't need a troop of tech to fix the crawlers anymore, you can easily keep the maintenance cost in check.

Cons:

  1. Learning curve. Depending on the product you choose, it can take some time to learn the process. Visual scrapers such as import.io, dexi.io, and Octoparse are easier to learn.
  2. Compatibility. All web scraping tools claim to cover sites of all kinds but the truth is, there's never going to be 100% compatibility when you try to apply one tool to literally millions of websites.
  3. Captcha. Most of the web scraping tools out there cannot solve Captcha.

A real web scraping example...

In order to make this post more useful to you, I've decided to give you a little tutorial on how to scrape Indeed using my favorite scraping tool of all time, Octoparse. In this example, I will scrape some basic information about data scientist jobs in New York City.

Data to extract

  1. Job title
  2. Job location
  3. Employer name
  4. Job description
  5. Number of reviews
  6. Page URL

Prerequisites

Download and install Octoparse. It's best if you're already familiar with how Octoparse works in general. Check out Octoparse Scraping 101 if you are new to the tool.

Creating a scraping project

1. Launch Octoparse and create a new project by clicking on '+Task' under Advanced Mode.


2. Enter the target URL (https://www.indeed.com/jobs?q=Data%20Scientist&l=New%20York%20State&_ga=2.92303069.138961637.1571107168-1638621315.1571107168) into the URL box. This is the URL copied from Chrome upon searching for 'data scientists' near 'New York' on Indeed.com. Click 'Save URL' to proceed.

Tips:

Since I am using a 17-inch monitor, I always like to switch to full-screen mode by toggling the workflow button at the top. This gives me a better view of the webpage.

3. Click on the first job title. Then, click on the second job title (any other job title will do).


4. Follow the instructions provided on 'Action Tips', which now reads '10 elements selected'. I obviously want to click open each one of the selected titles, so it makes sense to select 'Loop click each element'.


Tips:

Whenever you have successfully built a list to loop through, a loop will be created and added to the workflow. Switch back to the workflow mode and see if this is the case for you.

5. Now that I am on the job page, I am going to extract the data I need by clicking on it. Click on the title of the job, the location, the number of reviews, the company name, and the job description.


6. Once done selecting the fields needed, click on 'Extract data' on the 'Action Tips'.

7. Next, I am going to capture the Page URL by adding a pre-defined field.

  • Access the task workflow by toggling the workflow button on the top.
  • With the 'Extract data' step of the workflow selected, click on 'Add pre-defined field'
  • Select 'Add current page information', then 'Web page URL'. This will fetch the page URL along with all the other data fields.
Octoparse will automatically generate field names for the captured data fields. If you need to rename the data fields, simply type over the current ones.

8. So far I've managed to extract all the jobs listed on the first page, but I'll definitely want to extract more pages. To do this, I'll set up pagination, i.e. have Octoparse crawl through the different page numbers.


  • Return to the search result page by clicking on the loop item of the workflow.
  • Scroll down the page and find the 'Next' button, click on it.
  • Select 'Loop click single element' on 'Action Tips'. Octoparse will click the 'Next' button until it reaches the last page (when 'Next' is no longer found on the page).

You can also specify the number of pages to extract. For example, if you want to extract only the first 3 pages, enter '2' for 'End loop when execution times reaches X'. This way, Octoparse will only paginate 2 times, ending when it reaches page 3.

9. As soon as I reached page 2, I noticed that the 'Next' element was no longer detected correctly: the auto-generated XPath now tracked the 'Previous' button instead. To solve this, I'll modify the XPath manually.


  • With the pagination loop selected, change the XPath of the single element to //SPAN[contains(text(), 'Next')].
  • Now we have the correct 'Next' button detected.
Tip:

Learn how to modify the XPath when the auto-generated XPath fails.
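If you'd like to sanity-check an XPath expression outside Octoparse, one option (not part of the original tutorial) is Python's lxml library. Note that lxml's HTML parser lowercases tag names, so Octoparse's //SPAN[...] is written with a lowercase 'span'; the markup below is a made-up miniature of the pagination bar:

```python
from lxml import html

# A miniature stand-in for Indeed's pagination markup.
page = html.fromstring("""
<div class="pagination">
  <span class="np">Previous</span>
  <span class="np">Next</span>
</div>
""")

# The same predicate as the Octoparse XPath, with a lowercase tag name.
matches = page.xpath("//span[contains(text(), 'Next')]")
print([el.text for el in matches])  # ['Next']
```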

10. That's it. You are done. Click on the 'Extract data' button on the top to run the task.


Tips:

Kindly note that if you want to try other recruitment websites (like glassdoor.com), simply check this post!

Final words

To sum up, there are going to be pros and cons with any option you choose. The right approach is the one that fits your specific requirements (timeline, budget, project size, etc.). Obviously, a solution that works well for a Fortune 500 business may not work for a college student. That said, weigh all the pros and cons of the various options, and most importantly, fully test the solution before committing to one.

Article in Spanish: Una guía completa para las publicaciones de trabajos de web scraping
You can also read web scraping articles on the official website.