CKG Scrape to MediaWiki

Scrape demo using WGET.

The demo scrapes the content of the IPCC website and stores the data in MediaWiki and Wikibase.

This project depends on CPS Wikibase and uses the Wikibase Docker deployment.

CKG

The Climate Knowledge Graph (CKG) tools were written to make the reports of the IPCC sixth assessment cycle (AR6) accessible by storing them in MediaWiki and Wikibase.

Wikibase deployment

The Wikibase deployment needs some extra MediaWiki settings:

$wgEnableUploads = true; # Enable uploads

// Allow images up to 100MB
$wgMaxUploadSize = 104857600;

// Allow page content/wikitext up to 8MB (unit is KiB)
$wgMaxArticleSize = 8192;

$wgShowExceptionDetails = true;

$wgGenerateThumbnailOnParse = false;
$wgThumbnailScriptPath = "{$wgScriptPath}/thumb.php";
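The two size limits above use different units, which is easy to get wrong when changing them. As a small worked check (plain arithmetic, not part of the project), the values can be derived like this:

```python
# Worked arithmetic for the two size limits in the settings above.
MIB = 1024 * 1024  # bytes in one mebibyte

# $wgMaxUploadSize is expressed in bytes: 100 MiB.
max_upload_bytes = 100 * MIB

# $wgMaxArticleSize is expressed in KiB: 8 MiB is 8 * 1024 KiB.
max_article_kib = 8 * 1024

print(max_upload_bytes)  # 104857600
print(max_article_kib)   # 8192
```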

Debian

Setup

sudo apt install python3 python3-venv pandoc python3-magic mercurial
hg clone https://hg.kewl.org/pub/ckg_s2mw
cd ckg_s2mw

Create virtual environment

python3 -m venv ~/.venvs/ckg_s2mw

or

make venv

Activate virtual environment

TCSH

source ~/.venvs/ckg_s2mw/bin/activate.csh

BASH

source ~/.venvs/ckg_s2mw/bin/activate

Initial installation

make install

Update installation

hg pull -u
make

Scrape

A web scrape is a snapshot in time. There is no guarantee that this process will still work after the date it was written.

Setup

mkdir -p /var/www/htdocs/www.example.com/IPCC
cp etc/wget.sh /var/www/htdocs/www.example.com/IPCC
cp etc/urls*txt /var/www/htdocs/www.example.com/IPCC
cd /var/www/htdocs/www.example.com/IPCC

Run

./wget.sh

Env

After the scrape is complete, return to the project directory and set up the .env configuration.

cp dotenv .env
vi .env

This requires setting up a MediaWiki bot. For example:

MEDIAWIKI_URL="https://www.example.com/wiki/"
MEDIAWIKI_HOST="www.example.com"
MEDIAWIKI_USERNAME="User"
MEDIAWIKI_PASSWORD="mybot@xxx"
GATSBY_URL="https://www.ipcc.ch/report/ar6/"
GATSBY_ROOT="/var/www/htdocs/www.example.com/IPCC/Gatsby/www.ipcc.ch/report/ar6/"
WORDPRESS_URL="https://www.ipcc.ch/"
WORDPRESS_ROOT="/var/www/htdocs/www.example.com/IPCC/WordPress/www.ipcc.ch/"
ASSETS_URL="https://www.ipcc.ch/site/"
CACHE_DIR="/var/www/htdocs/www.example.com/IPCC/assets/"
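Before running the later steps it can help to confirm that .env is complete. The following is a minimal sketch, not the project's own loader: parse_env, missing_keys and REQUIRED_KEYS are hypothetical names, and the assumption that every key in the example is mandatory is mine.

```python
# Hypothetical helper: check that a .env file defines every key shown above.
# Assumes simple KEY="value" lines; this is not the project's own parser.
REQUIRED_KEYS = (
    "MEDIAWIKI_URL", "MEDIAWIKI_HOST",
    "MEDIAWIKI_USERNAME", "MEDIAWIKI_PASSWORD",
    "GATSBY_URL", "GATSBY_ROOT",
    "WORDPRESS_URL", "WORDPRESS_ROOT",
    "ASSETS_URL", "CACHE_DIR",
)


def parse_env(text):
    """Return a dict of KEY -> value, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of quotes.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Reading the file with open(".env") and passing its contents to parse_env, then printing missing_keys, gives a quick list of anything still to be filled in.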

Pandoc

Two scripts are provided to parse the Gatsby and WordPress CMS files and produce intermediate wikitext with Pandoc.

These processes contain various hacks, and the resulting output is far from perfect.

gatsby -l "etc/urls_gatsby.txt"
wordpress -l "etc/urls_wordpress.txt"
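How the gatsby and wordpress scripts call Pandoc is not described here. As a rough illustration of the underlying conversion only, a scraped HTML file can be turned into wikitext with Pandoc's html reader and mediawiki writer. The function names and file names below are hypothetical.

```python
# Sketch of an HTML to wikitext conversion with Pandoc.
# pandoc_command and convert are illustrative names, not project functions.
import subprocess


def pandoc_command(source_html, target_wiki):
    """Build the argument list for one HTML -> MediaWiki conversion."""
    return [
        "pandoc",
        "--from=html",
        "--to=mediawiki",
        "--output=" + target_wiki,
        source_html,
    ]


def convert(source_html, target_wiki):
    """Run Pandoc and raise CalledProcessError if it exits non-zero."""
    subprocess.run(pandoc_command(source_html, target_wiki), check=True)
```

For example, convert("index.html", "index.wiki") would write the wikitext next to the input, assuming pandoc is on the PATH.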

Wikitext

The wikitext clean-up has two stages.

Each stage can only be applied once.

Upload

Parse the wikitext, fetch the media files it references and upload them to MediaWiki.

wikitext -u

Clean

Process the wikitext to fix side effects of the Pandoc conversion, rewrite the media links and write the result to disk.

wikitext -w
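The exact rewriting rules used by the clean stage are not documented here. As a sketch of what rewriting a media link can mean, the example below reduces the target of each File or Image link to its bare file name, on the assumption that the uploaded file is stored under that name. rewrite_media_links and local_media_name are hypothetical names.

```python
# Sketch: point [[File:...]] links at the bare file name of the uploaded media.
# Assumption: uploads are named after the last path component of the source URL.
import posixpath
import re
from urllib.parse import unquote, urlsplit

# Matches the link prefix and its target, up to the first "|" or "]".
FILE_LINK = re.compile(r"\[\[(File|Image):([^|\]]+)")


def local_media_name(target):
    """Return the decoded last path component of a URL or relative path."""
    path = urlsplit(target.strip()).path
    return unquote(posixpath.basename(path))


def rewrite_media_links(wikitext):
    """Replace each media link target with its bare file name."""
    def replace(match):
        name = local_media_name(match.group(2))
        if not name:
            return match.group(0)  # leave links without a file name untouched
        return "[[File:" + name
    return FILE_LINK.sub(replace, wikitext)
```

With this sketch, a link such as [[File:https://www.ipcc.ch/site/assets/uploads/fig1.png|Figure 1]] becomes [[File:fig1.png|Figure 1]].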

MediaWiki

The final step takes the cleaned-up wikitext files from above and publishes them to MediaWiki.

mediawiki -l "etc/urls_gatsby.txt"
mediawiki -l "etc/urls_wordpress.txt"
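How the mediawiki script publishes pages is not shown here. For orientation, publishing a page through the MediaWiki Action API comes down to a POST with action=edit and a CSRF token obtained after the bot logs in. The sketch below only builds that request body; edit_payload and encode_body are hypothetical names, and the login and token requests are left out.

```python
# Sketch: build the form body for a MediaWiki Action API page edit.
# The helper names are illustrative; login and token retrieval are omitted.
from urllib.parse import urlencode


def edit_payload(title, wikitext, csrf_token, summary="Imported by bot"):
    """Return the form fields for an action=edit request."""
    return {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": wikitext,
        "summary": summary,
        "bot": "1",           # mark the change as a bot edit
        "token": csrf_token,  # kept last, as the API documentation recommends
    }


def encode_body(payload):
    """URL-encode the fields as an application/x-www-form-urlencoded body."""
    return urlencode(payload).encode("utf-8")
```

The resulting bytes would be sent to the wiki's api.php endpoint, whose location depends on the deployment.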

Wikibase

Create the Wikibase database.

wbset etc/work.xml

Upload Main_Page

work