Zhoujie Li, 22.01.2024 16:34
h1. Link checker

h2. Introduction

The Link Checker script helps you identify and manage broken links and images on a website. It automates checking the URLs within a specified domain and produces detailed reports on the status of each link and image.

h2. Script destination

Script location: extension/Resources/Public/scripts
Script name: web_crawler.py

h2. Usage

h3. Configuration

Before using the Link Checker script, you *need* a configuration file named "conf.json".

!clipboard-202309040834-symyc.png!

* "startUrl": The URL where link checking begins.
* "login_url": The URL of the login page, if a login is required. If empty, "startUrl" is used instead.
* "username and password": Login credentials. If you don't have login credentials, leave these fields *empty*. This is *important!*
* "max_depth": The maximum depth to which links are crawled.
* "target_path": The path to which link checking is restricted (e.g., /blog).
* "target_string": A unique string to look for.
* "blacklist": URLs to exclude from checking.

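As a rough illustration of the keys listed above, the following sketch builds a sample configuration and renders it as JSON. It assumes a single flat JSON object, separate "username" and "password" keys, and a list for "blacklist". Every value shown (the example.com URLs, the depth of 3) and the helper names @missing_keys@ and @effective_login_url@ are hypothetical. The authoritative layout is the one in the screenshot above and may differ.

```python
import json

# Hypothetical sample configuration. Key names follow the list above;
# the flat layout, the separate username/password keys and all values
# are assumptions made for illustration only.
sample_conf = {
    "startUrl": "https://www.example.com/",
    "login_url": "",      # empty: startUrl is used instead
    "username": "",       # leave empty when no login is needed
    "password": "",
    "max_depth": 3,
    "target_path": "/blog",
    "target_string": "",
    "blacklist": ["https://www.example.com/logout"],
}

EXPECTED_KEYS = {
    "startUrl", "login_url", "username", "password",
    "max_depth", "target_path", "target_string", "blacklist",
}


def missing_keys(conf):
    """Return the sorted list of expected keys that are absent from conf."""
    return sorted(EXPECTED_KEYS - set(conf))


def effective_login_url(conf):
    """Mirror the documented fallback: an empty login_url means startUrl."""
    return conf.get("login_url") or conf["startUrl"]


# JSON text that could be saved as conf.json and adapted by hand.
conf_text = json.dumps(sample_conf, indent=2)
```

Checking a draft configuration with @missing_keys@ before running the script is one way to catch a forgotten key early.
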
h3. Ignore CSS class

The script also ignores elements that carry the CSS class "link-checker-skip".

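The skip rule can be pictured with a minimal sketch built on Python's standard @html.parser@ module. This is not taken from web_crawler.py; which parsing library the real script uses is unknown. The class @LinkCollector@ and the function @collect_links@ are hypothetical names, and the sketch only looks at the class of the tag itself, not at parent elements.

```python
from html.parser import HTMLParser

SKIP_CLASS = "link-checker-skip"


class LinkCollector(HTMLParser):
    """Collect link and image targets, skipping tags that carry SKIP_CLASS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr.get("class") or "").split()
        if SKIP_CLASS in classes:
            return  # marked to be ignored by the checker
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "img" and attr.get("src"):
            self.links.append(attr["src"])


def collect_links(html):
    """Return the href/src values found in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

In page markup, a link would then be excluded by adding the class to it, for example @<a class="link-checker-skip" href="...">@.
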
h3. Running the Script

You can run the Link Checker script using the following command:

!clipboard-202309051340-pj0ak.png!

<pre>
python web_crawler.py conf.json "all or <index>"
</pre>

h2. Result/Output

The script generates detailed reports. These reports include:

* Broken links and images with their response codes.
* Denied links with 403 Forbidden errors.
* Redirects to the home page.
* Successfully checked links.

The results are saved in log files (detail.log and summary.log) and in a CSV file containing the broken links.

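To make the report categories above concrete, here is a minimal sketch of how one checked URL might be sorted into them and how the broken entries could be rendered as CSV. The rules are assumptions, not the script's actual logic: 403 counts as denied, any other status of 400 or higher counts as broken, and a redirect to the home page is detected by comparing host and path. The function names, the category labels and the CSV columns ("url", "status") are hypothetical.

```python
import csv
import io
from urllib.parse import urlsplit


def _norm(url):
    """Reduce a URL to (host, path) so that a trailing slash does not matter."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path.rstrip("/"))


def classify(status_code, requested_url, final_url, home_url):
    """Sort one checked URL into a report category (assumed rules)."""
    if status_code == 403:
        return "denied"
    if status_code >= 400:
        return "broken"
    landed_on_home = _norm(final_url) == _norm(home_url)
    asked_for_home = _norm(requested_url) == _norm(home_url)
    if landed_on_home and not asked_for_home:
        return "redirect_to_home"
    return "ok"


def broken_links_csv(results):
    """Render the broken entries of (url, status, category) tuples as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["url", "status"])  # assumed column layout
    for url, status, category in results:
        if category == "broken":
            writer.writerow([url, status])
    return buf.getvalue()
```

Keeping the classification separate from the output step means the same results could feed both the log files and the CSV file.
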
h3. Summary log:

*0 errors*

!clipboard-202309051417-p5lf8.png!

*1 or more errors*

!clipboard-202309051415-bmmau.png!

h3. Detail log:

!clipboard-202309051423-xa8x4.png!