CPS Selenium GET
This is a simple tool to fetch a page using Python Selenium.
It has been developed to test scraping web pages and to determine the method required to fetch dynamically created content.
CPS
CPS tools have been written for and provided by the CPS (Computational Publishing Service) as part of NFDI4Culture hosted at Wikibase4Research at TIB.
Setup
Debian
sudo apt install build-essential python3-full python3-pip mercurial
hg clone https://hg.kewl.org/pub/cps_sget
cd cps_sget
make venv
TCSH
source ~/.venvs/cps_sget/bin/activate.csh
make install
rehash
BASH
source ~/.venvs/cps_sget/bin/activate
make install
PyPI
This project is also packaged on PyPI.
Use cps-sget as a dependency to include it.
Demo
Deckenmalerei's web pages first serve temporary placeholder content, which is replaced after a delay when the page reloads.
Even then, the page often contains only a subset of the full content and must be scrolled to reveal the rest.
Initial page load
Fetch the temporary page
sget "https://www.deckenmalerei.eu/e8a3bf28-6365-42fe-a4d3-41608ed870e8" d0.html
Reloaded page load
Fetch the reloaded page
sget "https://www.deckenmalerei.eu/e8a3bf28-6365-42fe-a4d3-41608ed870e8" -d10 d10.html
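The delayed fetch can be approximated in a few lines of Python Selenium. The following is only a sketch and is not taken from the cps_sget source: the function name fetch_after_delay is hypothetical, and it assumes that -d10 means "wait ten seconds before saving the page". The driver is passed in, so any object with get() and page_source works.

```python
import time


def fetch_after_delay(driver, url, delay=0.0):
    """Load url, wait `delay` seconds, then return the rendered HTML.

    `driver` is assumed to be a Selenium WebDriver; only its get()
    method and page_source attribute are used.
    """
    driver.get(url)
    if delay > 0:
        # Give client-side scripts time to replace the placeholder content.
        time.sleep(delay)
    return driver.page_source


# Possible usage with a real browser (needs selenium and a browser driver):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   try:
#       html = fetch_after_delay(
#           driver,
#           "https://www.deckenmalerei.eu/e8a3bf28-6365-42fe-a4d3-41608ed870e8",
#           delay=10,
#       )
#   finally:
#       driver.quit()
```

A fixed delay is simple but fragile: too short and the placeholder is saved, too long and time is wasted on every fetch.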
Comparison of initial and reloaded page
diff d0.html d10.html | head -30
9c9
< CbDD - lade e8a3bf28-6365-42fe-a4d3-41608ed870e8
---
> CbDD - Abtsgmünd, Gartenpavillon Schloss Hohenstadt [Text]
50,54c50,1386
< <div class="fullScreenCentered loadingIndicator">
< <div class="spinningCircleBasic spinningCircleTop spinningCircleRight spinningCircleBottom">
< </div>
< <div class="fullScreenText">
< lade Daten ...
---
> <div class="dataPage">
> <div class="document">
> <div style="margin-top: 0.5rem;">
> <div class="text-right">
> <div class="qrView">
> <div>
> <a href="/e8a3bf28-6365-42fe-a4d3-41608ed870e8">
> <span>
> QR Code
> </span>
> <span class="icon-align-big">
> <i class="material-icons md-36">
> arrow_drop_down
> </i>
> </span>
> </a>
> </div>
> </div>
> <div class="mapView">
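The same comparison can be scripted when diff is not at hand. This sketch uses Python's standard difflib module; the helper name html_diff and its limit parameter are hypothetical. Note that it emits unified diff format, not the classic format shown above.

```python
import difflib


def html_diff(before, after, limit=30):
    """Return the first `limit` lines of a unified diff of two HTML strings."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="d0.html",
        tofile="d10.html",
        lineterm="",
    )
    # zip() stops as soon as `limit` lines have been taken, like `head`.
    return [line for _, line in zip(range(limit), diff)]


# Possible usage with the two files fetched earlier:
#   with open("d0.html", encoding="utf-8") as f0, \
#        open("d10.html", encoding="utf-8") as f1:
#       print("\n".join(html_diff(f0.read(), f1.read())))
```

If the initial and reloaded pages are identical, the result is an empty list, which suggests the content is not dynamically created and a plain fetch may suffice.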
Full page load
Fetch the page again, but this time wait until the reloaded content is present, identified by a tag attribute, and scroll the page to reveal all of it.
sget "https://www.deckenmalerei.eu/e8a3bf28-6365-42fe-a4d3-41608ed870e8" -b XPATH -v "//div[@class='dataPage']" --scroll dataPage.html
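Waiting for an element and then scrolling can be sketched as follows. This is an illustration under assumptions, not the cps_sget implementation: the function names are hypothetical, and the behaviour attributed to -b XPATH -v and --scroll is inferred from the command line above. Selenium provides WebDriverWait with expected_conditions for the waiting step; a hand-rolled polling loop is used here instead so that the sketch only relies on the driver's find_elements() and execute_script() methods.

```python
import time


def wait_for_xpath(driver, xpath, timeout=30.0, poll=0.5):
    """Poll until an element matches xpath. Return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        # "xpath" is the string value of Selenium's By.XPATH strategy.
        if driver.find_elements("xpath", xpath):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the document height stops growing.

    Returns the last observed document height in pixels.
    """
    last = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);"
        )
        time.sleep(pause)  # let lazily loaded content arrive
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last:
            break  # nothing new was appended
        last = height
    return last


# Possible usage with a real Selenium driver:
#   if wait_for_xpath(driver, "//div[@class='dataPage']"):
#       scroll_to_bottom(driver)
#       html = driver.page_source
```

Waiting on a specific element avoids guessing a fixed delay, and max_rounds bounds the scrolling on pages that keep growing.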

