URLs list text scraper
Reference: https://github.com/kitsuyui/scraper
To download the text content of multiple URLs from a list on Windows 11, we’ll create a PowerShell script that’s more robust and flexible than the previously suggested batch file. This approach leverages PowerShell’s strengths and provides better error handling and output formatting.
- First, ensure you have `scraper.exe` set up:
  - Download the latest Windows executable from https://github.com/kitsuyui/scraper/releases/latest
  - Rename it to `scraper.exe` and place it in a directory that's in your system PATH
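Before moving on, it can save debugging time to confirm that the executable actually resolves from your PATH. A quick check (assuming you named the binary `scraper.exe` as above) might look like:

```powershell
# Verify that scraper.exe resolves from the current PATH
if (Get-Command scraper -ErrorAction SilentlyContinue) {
    Write-Host "scraper.exe found: $((Get-Command scraper).Source)"
} else {
    Write-Warning "scraper.exe not found on PATH - check the installation directory."
}
```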
- Create a file named `scraper-config.json` with the following content:

  ```json
  [
    {"type": "xpath", "label": "BodyText", "query": "//body//text()"}
  ]
  ```
- Create a text file named `urls.txt` with one URL per line:

  ```text
  https://example.com
  https://another-example.com
  https://third-example.com
  ```
- Create a new file named `Scrape-Urls.ps1` with the following PowerShell script:

  ```powershell
  # Scrape-Urls.ps1
  param(
      [string]$UrlFile = "urls.txt",
      [string]$ConfigFile = "scraper-config.json",
      [string]$OutputDir = "scraped_content"
  )

  # Ensure the output directory exists
  New-Item -ItemType Directory -Force -Path $OutputDir | Out-Null

  # Read URLs from file
  $urls = Get-Content $UrlFile

  foreach ($url in $urls) {
      try {
          Write-Host "Processing: $url"

          # Generate a safe filename
          $filename = ($url -replace "https?://", "" -replace "[^a-zA-Z0-9]+", "_") + ".txt"
          $outputPath = Join-Path $OutputDir $filename

          # Download and scrape content
          $content = Invoke-WebRequest -Uri $url -UseBasicParsing | Select-Object -ExpandProperty Content
          $scrapedContent = $content | & scraper -c $ConfigFile | ConvertFrom-Json

          # Extract text from JSON and save
          $bodyText = $scrapedContent | Where-Object { $_.label -eq "BodyText" } | Select-Object -ExpandProperty results
          $bodyText -join " " | Out-File -FilePath $outputPath

          Write-Host "Saved to: $outputPath"
      }
      catch {
          Write-Host "Error processing $url : $_" -ForegroundColor Red
      }
      Write-Host
  }
  ```
Write-Host "All URLs processed." -ForegroundColor Green
- Open PowerShell and navigate to the directory containing your script and files.
- Run the script: `.\Scrape-Urls.ps1` (optionally overriding `-UrlFile`, `-ConfigFile`, or `-OutputDir`).
This improved solution offers several advantages:
- It uses PowerShell, which is more powerful and flexible than batch scripts on Windows.
- It includes error handling to manage issues with individual URLs without stopping the entire process.
- It creates a separate output directory for scraped content, keeping things organized.
- It generates safe filenames based on the URLs, avoiding potential naming conflicts or invalid characters.
- It extracts the actual text content from the JSON output, providing clean text files.
- It’s more customizable, allowing you to specify different input files, config files, or output directories.
Additional notes:
- This script respects rate limiting by processing URLs sequentially. For a large number of URLs, consider adding a delay between requests.
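One simple way to add such a delay is a `Start-Sleep` at the end of the loop body. The two-second interval below is an arbitrary example; tune it to the sites you're scraping:

```powershell
foreach ($url in $urls) {
    # ... download and scrape as in Scrape-Urls.ps1 ...
    Start-Sleep -Seconds 2  # polite pause between requests
}
```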
- Some websites may block or behave differently with automated requests. You might need to add a User-Agent header or other modifications for certain sites:

  ```powershell
  $headers = @{
      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
  }
  $content = Invoke-WebRequest -Uri $url -UseBasicParsing -Headers $headers | Select-Object -ExpandProperty Content
  ```
- Always ensure you have permission to scrape the websites you're targeting and that you're complying with their terms of service and robots.txt files.
- For very large lists of URLs, consider implementing parallel processing or breaking the list into smaller batches to improve efficiency.
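As a sketch of the parallel option: PowerShell 7 and later support parallel pipelines via `ForEach-Object -Parallel`. The throttle limit of 4 below is an arbitrary choice, and the scraping step itself is elided:

```powershell
# Requires PowerShell 7 or later
Get-Content "urls.txt" | ForEach-Object -Parallel {
    $content = Invoke-WebRequest -Uri $_ -UseBasicParsing |
        Select-Object -ExpandProperty Content
    # ... pipe $content through scraper as in the sequential script ...
} -ThrottleLimit 4  # cap concurrent downloads
```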
- You may want to add more robust URL validation and error checking, depending on your specific needs and the reliability of your URL list.
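For basic validation, the .NET `[System.Uri]::TryCreate` method can filter out malformed entries before the loop starts. This sketch keeps only absolute http/https URLs:

```powershell
# Keep only well-formed absolute http/https URLs from the list
$validUrls = Get-Content "urls.txt" | Where-Object {
    $uri = $null
    [System.Uri]::TryCreate($_, [System.UriKind]::Absolute, [ref]$uri) -and
        $uri.Scheme -in @('http', 'https')
}
```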