Link checker

Introduction

The Link Checker script helps you identify and manage broken links and images on a website. It automates checking the URLs within a specified domain and provides detailed reports on the status of each link and image.

Script location

Location: extension/Resources/Public/scripts
Script name: web_crawler.py

Usage

Configuration

Before using the Link Checker script, you need a configuration file, "conf.json", with the following fields:

  • "startUrl": The URL where the link checking will begin.
  • "login_url": The URL for logging in, if required. If empty, "startUrl" is used instead.
  • "username" and "password": Login credentials. Important: if you don't have login credentials, leave these fields empty.
  • "max_depth": The maximum depth to crawl links.
  • "target_path": The path to restrict link checking (e.g., /blog).
  • "target_string": A unique string to look for on each page.
  • "blacklist": URLs to exclude from checking.
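Putting these fields together, a minimal conf.json could be generated like this. Every value below is a placeholder for illustration, not a default shipped with the script:

```python
import json

# Example configuration; all values are placeholders, not script defaults.
config = {
    "startUrl": "https://www.example.com/",
    "login_url": "",       # falls back to startUrl when empty
    "username": "",        # leave empty if no login is required
    "password": "",
    "max_depth": 3,        # how many link levels to crawl
    "target_path": "/blog",  # restrict checking to this path
    "target_string": "",   # optional unique string to look for
    "blacklist": ["https://www.example.com/archive"],
}

# Write the configuration next to the script.
with open("conf.json", "w") as f:
    json.dump(config, f, indent=2)
```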

Ignore CSS class

The script also ignores links marked with the CSS class "link-checker-skip".
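The skip behaviour can be sketched with the standard-library HTML parser: anchors carrying the class are simply not collected. The class name comes from the script's documentation; the parser class and its names here are illustrative, not the script's actual implementation:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values, skipping anchors with class 'link-checker-skip'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Honour the skip class: do not record this link at all.
        if "link-checker-skip" in attrs.get("class", "").split():
            return
        href = attrs.get("href")
        if href:
            self.links.append(href)

parser = LinkCollector()
parser.feed('<a class="link-checker-skip" href="/ignored">x</a>'
            '<a href="/kept">y</a>')
```

Only `/kept` ends up in `parser.links`; the anchor with the skip class is dropped before any check would run.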

Running the Script

You can run the Link Checker script using the following command:

python web_crawler.py conf.json "all or <index>"

Result/Output

It generates detailed reports, which include:
  • Broken links and images with response codes.
  • Denied links with 403 Forbidden errors.
  • Redirects to the home page.
  • Successfully checked links.
The results are saved in two log files (detail.log and summary.log) and a CSV file containing the broken links.
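For post-processing, the broken-links CSV can be read with the standard csv module. The column names below ("url", "status") are assumptions for illustration; the actual header written by web_crawler.py may differ:

```python
import csv

# Write a small sample file so the reader below has something to parse;
# the real columns produced by web_crawler.py may differ.
with open("broken_links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "status"])
    writer.writerow(["https://www.example.com/missing", "404"])

def load_broken_links(path):
    """Return each CSV row as a dict keyed by the header line."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

rows = load_broken_links("broken_links.csv")
```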

Summary log:

[Example screenshots: one run with 0 errors and one run with 1 or more errors]

Detail log:

[Example screenshot]

Updated by Zhoujie Li 3 months ago · 11 revisions