Web scraper to get news article content

What you will practice

You'll practice using Python libraries to fetch web page content and to select the elements you need from a page. These are fundamental skills for building applications that scrape content from other websites.

Introduction

We want to build a simple web scraper that returns the content of a news article when given a specific URL. Real products that use similar techniques include price-tracking websites and SEO audit tools, which may scrape top search results. This project may take you around 4 to 8 hours to complete.

Requirements

Choose one news website - see article examples below for inspiration. Given a specific article URL from the website of your choice, return the title and content of the article to the user.

Example article URLs:

For an extra challenge: Parse out information such as the article title, updated date, and byline to return separately to the user.
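
One possible starting point for the extra challenge, offered as a sketch rather than a complete solution: many news sites publish the title, author, and modification time in `<meta>` tags in the page head, so those can be read without touching the article body. The tag names used here (`og:title`, `author`, `article:modified_time`) are assumptions about the site you pick, and the names `MetaCollector` and `extract_metadata` are hypothetical. This version uses Python's built-in `html.parser` instead of Beautiful Soup so that it runs without extra installs.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> values keyed by their `property` or `name` attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        value = attributes.get("content")
        if key and value:
            # Keep the first value seen for each key.
            self.meta.setdefault(key, value)


def extract_metadata(html):
    """Return title, updated date, and byline, or None where a tag is missing."""
    collector = MetaCollector()
    collector.feed(html)
    meta = collector.meta
    return {
        "title": meta.get("og:title"),                    # assumed tag name
        "updated": meta.get("article:modified_time"),     # assumed tag name
        "byline": meta.get("author"),                     # assumed tag name
    }
```

If the site you chose does not expose these tags, the values come back as `None` and you will need to select the visible elements on the page instead.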

Suggested Implementation

You can run the scraper from the command line with something like this:

> python scrape_newyorktimes.py news_url
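
A minimal sketch of how a script could accept the `news_url` argument shown above, using the standard library's `argparse`. The function names are hypothetical, and `scrape` is only a placeholder that returns stub data; the real fetching and parsing would replace it.

```python
import argparse


def build_parser():
    """Create a parser with a single positional argument, the article URL."""
    parser = argparse.ArgumentParser(
        description="Print the title and content of a news article."
    )
    parser.add_argument("news_url", help="URL of the article to scrape")
    return parser


def scrape(url):
    """Placeholder (hypothetical): replace with real fetching and parsing."""
    return {"title": "(title goes here)", "content": "(content of " + url + ")"}


def main(argv=None):
    # argv=None makes argparse read the arguments from sys.argv.
    args = build_parser().parse_args(argv)
    article = scrape(args.news_url)
    print(article["title"])
    print()
    print(article["content"])
    return 0
```

In the actual script file, you would end with the usual `if __name__ == "__main__":` guard and call `raise SystemExit(main())` inside it.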

We suggest using an HTTP library like Requests to get the raw HTML of the URL, then a parsing library like Beautiful Soup to parse the content. Alternatively, you can use a Python scraping tool like Scrapy.
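
The Requests and Beautiful Soup approach could look roughly like the sketch below. It assumes the headline is the first `<h1>` on the page and the body text sits in `<p>` tags inside an `<article>` element; real news sites differ, so expect to adjust the selectors. The function names and the User-Agent string are made up for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the raw HTML for a URL, raising an exception on HTTP errors."""
    response = requests.get(
        url,
        timeout=10,
        headers={"User-Agent": "news-scraper-practice/0.1"},  # hypothetical value
    )
    response.raise_for_status()
    return response.text


def parse_article(html):
    """Extract the title and paragraph text from an article page."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the headline is the first <h1> on the page.
    heading = soup.find("h1")
    title = heading.get_text(strip=True) if heading else ""

    # Assumption: the body lives inside <article>; otherwise use the whole page.
    container = soup.find("article") or soup
    paragraphs = [p.get_text(" ", strip=True) for p in container.find_all("p")]
    content = "\n\n".join(text for text in paragraphs if text)

    return {"title": title, "content": content}
```

Calling `parse_article(fetch_html(url))` chains the two steps. Keeping them separate lets you test the parsing against a saved HTML file without hitting the website each time.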

References

You can use XPath to select an element if it has no class or wrapping div to select it by.
Take note of the Python version you have installed!
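
As an illustration of the XPath tip, the sketch below selects elements by their position in the document tree instead of by class. It uses the third-party `lxml` library, which is one option among several and is not named in this project brief. The helper name `select_text` and the sample expressions are hypothetical.

```python
from lxml import html as lxml_html


def select_text(page_source, xpath):
    """Return the stripped text of every element matched by an XPath expression."""
    tree = lxml_html.fromstring(page_source)
    return [node.text_content().strip() for node in tree.xpath(xpath)]
```

For example, `select_text(page, "//section/p")` would return the text of every `<p>` that is a direct child of a `<section>`, with no class names involved.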

Suggested languages and frameworks

Python

Difficulty

easy

Contributed by

Sylvia Shen

Back-End Developer @ Codementor
