CPS Selenium GET

This is a simple tool to fetch a page using Python Selenium.

It has been developed to test scraping web pages and to determine the method required to fetch dynamically created content.

CPS

CPS tools have been written for and provided by the CPS (Computational Publishing Service) as part of NFDI4Culture hosted at Wikibase4Research at TIB.

Setup

Debian

sudo apt install build-essential python3-full python3-pip mercurial
hg clone https://hg.kewl.org/pub/cps_sget
cd cps_sget
make venv

TCSH

source ~/.venvs/cps_sget/bin/activate.csh
make install
rehash

BASH

source ~/.venvs/cps_sget/bin/activate
make install

PyPI

This project is also packaged on PyPI.

To use it in another project, declare cps-sget as a dependency.

Demo

Deckenmalerei's web pages initially consist of temporary content, which reloads the page after a delay.

Often the page only contains a subset of the full content and the page needs to be scrolled to reveal it all.
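
The two behaviours described above, a delayed reload and content that only appears after scrolling, can be sketched with plain Selenium calls. This is a minimal sketch and not the implementation of sget: the helper names fetch_after_delay, scroll_to_bottom and make_driver are hypothetical, and headless Firefox is an assumed browser choice.

```python
import time


def fetch_after_delay(driver, url, delay=10.0):
    """Load url, wait a fixed number of seconds, then return the page source."""
    driver.get(url)
    time.sleep(delay)  # crude: gives the page time to replace its placeholder
    return driver.page_source


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the document height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # let lazily loaded content arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new appeared, assume the page is complete
        last_height = new_height
    return max_rounds


def make_driver():
    """Create a headless Firefox driver (assumes selenium and Firefox are installed)."""
    from selenium import webdriver  # imported lazily: the helpers above need no browser

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


# Usage (needs a real browser, so it is left as a comment):
#   driver = make_driver()
#   try:
#       fetch_after_delay(driver, "https://www.deckenmalerei.eu/...", delay=10)
#       scroll_to_bottom(driver)
#       html = driver.page_source
#   finally:
#       driver.quit()
```

A fixed delay is simple but wastes time on fast pages and can be too short on slow ones, which is why waiting for a specific element is usually the more reliable method.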

Initial page load

Fetch the temporary page

sget "https://www.deckenmalerei.eu/e8a3bf28-6365-42fe-a4d3-41608ed870e8" d0.html

Reloaded page load

Fetch the reloaded page

sget "https://www.deckenmalerei.eu/e8a3bf28-6365-42fe-a4d3-41608ed870e8" -d10 d10.html

Comparison of initial and reloaded page

diff d0.html d10.html | head -30
9c9
<    CbDD - lade e8a3bf28-6365-42fe-a4d3-41608ed870e8
---
>    CbDD - Abtsgmünd, Gartenpavillon Schloss Hohenstadt [Text]
50,54c50,1386
<     <div class="fullScreenCentered loadingIndicator">
<      <div class="spinningCircleBasic spinningCircleTop spinningCircleRight spinningCircleBottom">
<      </div>
<      <div class="fullScreenText">
<       lade Daten ...
---
>     <div class="dataPage">
>      <div class="document">
>       <div style="margin-top: 0.5rem;">
>        <div class="text-right">
>         <div class="qrView">
>          <div>
>           <a href="/e8a3bf28-6365-42fe-a4d3-41608ed870e8">
>            <span>
>             QR Code
>            </span>
>            <span class="icon-align-big">
>             <i class="material-icons md-36">
>              arrow_drop_down
>             </i>
>            </span>
>           </a>
>          </div>
>         </div>
>         <div class="mapView">
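
The same comparison can be made from Python with the standard difflib module, which is convenient when the check is part of a script. This is a minimal sketch: the function names diff_pages and diff_files are hypothetical, and the output is in unified diff format rather than the classic format shown above.

```python
import difflib
import itertools
from pathlib import Path


def diff_pages(old_html, new_html, old_name="d0.html", new_name="d10.html", limit=30):
    """Return at most `limit` lines of a unified diff between two HTML strings."""
    diff = difflib.unified_diff(
        old_html.splitlines(),
        new_html.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",  # input lines carry no newline, so emit none either
    )
    return list(itertools.islice(diff, limit))


def diff_files(old_path, new_path, limit=30):
    """Read two saved pages and diff them, labelling the output with the paths."""
    old_html = Path(old_path).read_text(encoding="utf-8")
    new_html = Path(new_path).read_text(encoding="utf-8")
    return diff_pages(old_html, new_html, str(old_path), str(new_path), limit)


# Usage, assuming both files were saved by the commands above:
#   print("\n".join(diff_files("d0.html", "d10.html")))
```

An empty result means the two fetches returned identical content, which suggests that no delay is needed for that page.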

Full page load

Fetch the page again, but this time wait for the reloaded content, identified by a tag attribute, and scroll to reveal the full page.

sget "https://www.deckenmalerei.eu/e8a3bf28-6365-42fe-a4d3-41608ed870e8" -b XPATH -v "//div[@class='dataPage']" --scroll dataPage.html
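
Waiting for an element instead of a fixed delay can be sketched as a small polling loop. This is an illustration under assumptions, not the code behind sget: wait_for_element is a hypothetical helper, and the locator strategy is passed as the string "xpath", which is the value of Selenium's By.XPATH. In real Selenium code, WebDriverWait together with expected_conditions.presence_of_element_located is the usual way to do the same thing.

```python
import time


def wait_for_element(driver, by, value, timeout=30.0, poll=0.5):
    """Poll until at least one element matches the locator or the timeout expires.

    Returns True if a matching element appeared, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list when nothing matches
        if driver.find_elements(by, value):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


# Usage with a real Selenium driver (left as a comment, it needs a browser):
#   driver.get("https://www.deckenmalerei.eu/...")
#   if wait_for_element(driver, "xpath", "//div[@class='dataPage']"):
#       html = driver.page_source
```

The XPath //div[@class='dataPage'] matches the container that only appears in the reloaded page, as the diff above shows, so its presence is a reasonable signal that the real content has arrived.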