Web scraping toolkit.
apt update
apt full-upgrade
apt install python3-selenium
python -m pip install --upgrade pip
python -m pip install selenium
Here are a few tools which can be useful for tidying HTML content for evaluation and for storing scraped data in a Wikibase.
python -m pip install lxml
python -m pip install beautifulsoup4
python -m pip install "WikibaseIntegrator>=0.12"
python -m pip install python-dotenv
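The tidying libraries above can be combined to reduce a page to readable text before deciding what to scrape. A minimal sketch, assuming BeautifulSoup with the built-in html.parser backend; the sample HTML is invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented sample page; in practice this would be fetched HTML.
html = "<html><head><script>x=1</script></head><body><p>Hello <b>world</b></p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Remove non-content tags before extracting text.
for tag in soup(["script", "style"]):
    tag.decompose()

# Collapse the remaining markup to plain text.
text = soup.get_text(separator=" ", strip=True)
print(text)  # Hello world
```

Swapping "html.parser" for "lxml" uses the faster lxml backend installed above without changing the rest of the code.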
Chromium may be installed within WSL1 but cannot run there, so creating an alias to the Windows Python executable allows scripts launched from WSL1 to drive the Windows browser.
TCSH
alias py "/mnt/c/Users/username/AppData/Local/Programs/Python/Python313/python.exe"
bash
alias py="/mnt/c/Users/username/AppData/Local/Programs/Python/Python313/python.exe"
This test will open chrome.exe or chromium and visit a page.
#!/usr/bin/env python3
from selenium import webdriver

# CHROME
from selenium.webdriver.chrome.options import Options
# FIREFOX
#from selenium.webdriver.firefox.options import Options

options = Options()

# CHROME
options.add_argument("--incognito")
driver = webdriver.Chrome(options=options)
# FIREFOX
#options.add_argument("-private")
#driver = webdriver.Firefox(options=options)

driver.implicitly_wait(60)
driver.get("https://www.google.com/")
driver.quit()
Chrome incognito mode was found to be a requirement on a Linux host; otherwise Chromium would wait about 30 seconds before opening the URL.
sget is a simple tool to fetch a URL, strip various tags, and save the content to a file.
This can be used to inspect a web page prior to precise scraping.
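sget itself is not listed here; the idea can be sketched with only the standard library (urllib to fetch, html.parser to strip tags). The TagStripper class and strip_tags helper below are illustrative names for this sketch, not part of sget:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content while skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self.skip = 0       # depth inside script/style tags
        self.parts = []     # accumulated text fragments

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())


def strip_tags(html):
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TagStripper()
    parser.feed(html)
    return "\n".join(parser.parts)


if __name__ == "__main__":
    # Fetching and saving would look like this (network access assumed):
    # from urllib.request import urlopen
    # with urlopen("https://example.com/") as resp:
    #     html = resp.read().decode("utf-8", errors="replace")
    # with open("page.txt", "w") as out:
    #     out.write(strip_tags(html))
    print(strip_tags("<body><script>var x;</script><p>Hi</p></body>"))  # Hi
```

The saved text file can then be inspected to decide which elements to target with precise selectors.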