Web Scraping a Website with Puppeteer and Node.js

» What is Puppeteer?

  1. Puppeteer is a Node.js library developed by Google that provides a high-level API to control and automate the Chrome/Chromium browser, typically in headless mode. It allows developers to interact with web pages, simulate user actions, and extract data programmatically.

  2. The term “headless” in this context means that the browser runs without a graphical user interface, making it suitable for automated tasks like web scraping, automated testing, and taking screenshots of web pages.

  3. It can be used to run automated tests on web pages to ensure they work correctly and maintain the expected functionality.

  4. Puppeteer can capture screenshots of web pages or generate PDF files of the rendered pages (a minimal sketch follows this list).

  5. It allows you to fill out forms, click buttons, and interact with various elements on a web page as if a real user were doing it.
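
As a quick taste of points 4 and 5, here is a minimal sketch that takes a screenshot and generates a PDF (https://example.com is just a stand-in URL; note that page.pdf() only works in headless mode):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: "new" });
    const page = await browser.newPage();
    await page.goto('https://example.com');

    await page.screenshot({ path: 'page.png' }); // save a PNG screenshot
    await page.pdf({ path: 'page.pdf' });        // render the page to PDF

    await browser.close();
})();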

» Prerequisites

  1. Make sure that you have Node.js and npm installed by running the commands node -v and npm -v in the terminal.
node -v

npm -v

» Create a new Project

  1. Create a new, empty project named Web-scraper. Open the project in VS Code, and in the terminal let’s initialize it with npm by running the command
npm init -y

This will generate the package.json file, which will hold the project’s dependencies.

We also need Puppeteer, so let’s install it by running the command

npm install puppeteer 

Create a new file and name it index.js. Then open the package.json file and add a new script entry named start with the command to execute index.js:

"start": "node index.js"

» Start Scraping

» Importing required modules

a. We start by importing the necessary modules: fs (filesystem) and puppeteer.

b. The fs module is used for interacting with the file system, allowing the code to write data to a JSON file.

c. puppeteer is a headless browser automation library used for web scraping and automated browser tasks.

» Defining the async function scrapeTheURL()

We declare an asynchronous function named scrapeTheURL, which will be responsible for scraping data from a specific URL. Async functions allow the use of await, making it easier to handle asynchronous operations.

» Launching Puppeteer

a. Inside the scrapeTheURL function, we will launch Puppeteer by calling puppeteer.launch().

b. The headless: "new" option tells Puppeteer to run the browser in headless mode, meaning it won’t display a visible browser window (a debugging alternative is sketched below).
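
While this tutorial runs headless, during development it can help to watch the browser drive the page. A minimal sketch using two standard launch options:

// debugging variant: show the browser window and slow each step down
const browser = await puppeteer.launch({
    headless: false, // open a visible browser window
    slowMo: 100,     // pause 100 ms between Puppeteer operations
});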

» Opening a new page

After launching the browser, a new page is opened with await browser.newPage().

» Navigating to a URL

We use the page.goto() method to navigate to the URL https://books.toscrape.com/. This website is a sandbox built specifically for practicing scraping book information.
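
By default, page.goto() resolves once the page’s load event fires. For pages that keep loading content after that, you can pass Puppeteer’s waitUntil option to make navigation more patient; for example:

// continue only once there have been no more than 2 network connections for 500 ms
await page.goto('https://books.toscrape.com/', { waitUntil: 'networkidle2' });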

» Scraping book data

a. Puppeteer allows the code to interact with the webpage and extract information using page.evaluate().

b. The page.evaluate() function takes a function as an argument, which runs in the context of the webpage.

c. Inside the evaluation function, document.querySelectorAll('.product_pod') selects all elements with the class product_pod, which represent the book containers on the webpage.

d. The selected elements are then converted into an array using Array.from(), and each element is mapped to an object containing its title, price, and rating.

e. The innerText property is used to retrieve the text content of the selected elements (see the caveat after this list).

f. The scraped book data is then logged to the console.
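
One caveat worth knowing: on books.toscrape.com the visible link text is truncated with an ellipsis for longer titles, while the anchor’s title attribute appears to hold the full title (verify this in your browser’s dev tools). If you see truncated titles in your output, a small tweak like this should help:

// assumption: the title attribute carries the untruncated book title
title: e.querySelector('h3 a').getAttribute('title'),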

» Saving data to a JSON file

a. We will save the data in a JSON file named books.json using the fs.writeFile() method (a promise-based alternative is sketched after this list).

b. JSON.stringify(books) converts the books array into a JSON-formatted string before writing it to the file.

c. If any error occurs during the file writing process, it will be caught and logged to the console.
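
Note that the callback form of fs.writeFile() is asynchronous, so the browser.close() call on the next line can run before the file is actually written (Node still finishes the write before the process exits, but the ordering is loose). If you prefer to await the write explicitly, a sketch using Node’s built-in fs.promises API:

// await the write so the file is on disk before the browser closes;
// the null, 2 arguments pretty-print the JSON for readability
await fs.promises.writeFile('books.json', JSON.stringify(books, null, 2));
console.log('Data written to file');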

» Closing the browser

After we are done scraping and saving the data, the browser is closed to free up system resources using await browser.close().

» Invoking the scrapeTheURL() function

Finally, we call the scrapeTheURL() function to execute the entire scraping process. Here is the complete index.js:

const fs = require('fs'); // require the filesystem module

const puppeteer = require('puppeteer'); // import puppeteer

// declare an async function and call it scrapeTheURL 
async function scrapeTheURL() {

    // open the headless browser
    const browser = await puppeteer.launch({ headless: "new" });

    // open a new page
    const page = await browser.newPage();

    // enter url in page
    await page.goto('https://books.toscrape.com/');

    // scrape every book container on the page
    const books = await page.evaluate(() =>
        Array.from(document.querySelectorAll('.product_pod'), (e) => ({
            title: e.querySelector('h3 a').innerText, // visible (possibly truncated) title text
            price: e.querySelector('.price_color').innerText, // price text, e.g. "£51.77"
            rating: e.querySelector('.star-rating').classList.item(1), // second class name, e.g. "Three"
        })));
    console.log(books);

    // save data to a JSON file
    fs.writeFile('books.json', JSON.stringify(books), (err) => {
        if (err) throw err;
        console.log('Data written to file');
    });

    // close the browser
    await browser.close();
}

// call the function to execute
scrapeTheURL();
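
Since scrapeTheURL() is async, any failure (a network error, a missing selector) surfaces as a rejected promise. An optional guard logs errors instead of leaving an unhandled rejection:

// log any scraping error instead of letting the rejection go unhandled
scrapeTheURL().catch((err) => console.error('Scraping failed:', err));

For extra robustness, you could also wrap the function body in try...finally and call browser.close() in the finally block, so the browser closes even when a step throws.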

Run the command

npm start

And you will see a JSON file named books.json that holds the title, price, and rating of every book on the page.


And that’s it! You’ve successfully learned how to scrape a website with the help of Puppeteer and Node.js.

Thank You and Happy coding!😊