CKG Scrape to MediaWiki
Scrape demo using WGET.
The demo scrapes the content of the IPCC website and stores the data in Mediawiki and Wikibase.
This project depends on CPS Wikibase and uses Wikibase docker deployment.
CKG
Climate Knowledge Graph tools have been written to make the IPCC report cycle 6 accessible by storing it in Mediawiki and Wikibase.
Wikibase deployment
Wikibase will need some extra settings for MediaWiki:
$wgEnableUploads = true; # Enable uploads
// Allow images up to 100MB
$wgMaxUploadSize = 104857600;
// Allow page content/wikitext up to 8MB (unit is KiB)
$wgMaxArticleSize = 8192;
$wgShowExceptionDetails = true;
$wgGenerateThumbnailOnParse = false;
$wgThumbnailScriptPath = "{$wgScriptPath}/thumb.php";
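The two size limits above use different units: `$wgMaxUploadSize` is given in bytes and `$wgMaxArticleSize` in KiB. The following is a minimal Python sketch of that arithmetic, which could be used to check a scraped file before attempting an upload. The function names are hypothetical and not part of this project, and measuring wikitext as UTF-8 bytes is an assumption.

```python
# Values copied from the settings above.
MAX_UPLOAD_BYTES = 104857600   # $wgMaxUploadSize, in bytes
MAX_ARTICLE_KIB = 8192         # $wgMaxArticleSize, in KiB


def upload_fits(size_bytes: int) -> bool:
    """Return True if a media file of this size is within the upload limit."""
    return size_bytes <= MAX_UPLOAD_BYTES


def wikitext_fits(wikitext: str) -> bool:
    """Return True if the wikitext is within the article limit.

    Assumption: the limit is compared against the UTF-8 encoded size.
    """
    return len(wikitext.encode("utf-8")) <= MAX_ARTICLE_KIB * 1024


if __name__ == "__main__":
    # 104857600 bytes is exactly 100 MiB.
    print(MAX_UPLOAD_BYTES == 100 * 1024 * 1024)
    print(wikitext_fits("== Heading ==\nSome text.\n"))
```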
Debian
Setup
sudo apt install python3 python3-venv pandoc python3-magic mercurial
hg clone https://hg.kewl.org/pub/ckg_s2mw
cd ckg_s2mw
Create virtual environment
python3 -m venv ~/.venvs/ckg_s2mw
or
make venv
Activate virtual environment
TCSH
source ~/.venvs/ckg_s2mw/bin/activate.csh
BASH
source ~/.venvs/ckg_s2mw/bin/activate
Initial installation
make install
Update installation
hg pull -u
make
Scrape
A web scrape is a snapshot in time. There is no guarantee that this process will remain repeatable after the date it was written.
Setup
mkdir -p /var/www/htdocs/www.example.com/IPCC
cp etc/wget.sh /var/www/htdocs/www.example.com/IPCC
cp etc/urls*txt /var/www/htdocs/www.example.com/IPCC
cd /var/www/htdocs/www.example.com/IPCC
Run
./wget.sh
Env
After the scrape is complete, return to the project directory and set up the .env configuration.
cp dotenv .env
vi .env
This requires setting up a MediaWiki bot.
For example:
MEDIAWIKI_URL="https://www.example.com/wiki/"
MEDIAWIKI_HOST="www.example.com"
MEDIAWIKI_USERNAME="User"
MEDIAWIKI_PASSWORD="mybot@xxx"
GATSBY_URL="https://www.ipcc.ch/report/ar6/"
GATSBY_ROOT="/var/www/htdocs/www.example.com/IPCC/Gatsby/www.ipcc.ch/report/ar6/"
WORDPRESS_URL="https://www.ipcc.ch/"
WORDPRESS_ROOT="/var/www/htdocs/www.example.com/IPCC/WordPress/www.ipcc.ch/"
ASSETS_URL="https://www.ipcc.ch/site/"
CACHE_DIR="/var/www/htdocs/www.example.com/IPCC/assets/"
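Before running the later stages it can help to confirm that every key from the example .env above is present. The following is a minimal sketch, assuming the file holds one simple `KEY="value"` pair per line. The parser and function names are hypothetical and are not how this project necessarily loads its configuration.

```python
# Keys taken from the example .env above.
REQUIRED_KEYS = [
    "MEDIAWIKI_URL", "MEDIAWIKI_HOST",
    "MEDIAWIKI_USERNAME", "MEDIAWIKI_PASSWORD",
    "GATSBY_URL", "GATSBY_ROOT",
    "WORDPRESS_URL", "WORDPRESS_ROOT",
    "ASSETS_URL", "CACHE_DIR",
]


def parse_env(text: str) -> dict:
    """Parse simple KEY="value" lines; blank lines and # comments are skipped."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"')
    return env


def missing_keys(env: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        missing = missing_keys(parse_env(handle.read()))
    print("missing keys:", missing)
```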
Pandoc
Two scripts are provided to parse the Gatsby and WordPress CMS files to produce intermediary wikitext with Pandoc.
These processes contain various hacks, and the resulting output is far from perfect.
gatsby -l "etc/urls_gatsby.txt"
wordpress -l "etc/urls_wordpress.txt"
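The internals of the `gatsby` and `wordpress` scripts are not shown here, so the following is only a sketch of the general idea under stated assumptions: the URL list holds one URL per line, each URL maps to a mirrored file under the matching root directory (with `index.html` for directory URLs, which is wget's default), and the conversion uses Pandoc's `html` reader and `mediawiki` writer. All function names are hypothetical.

```python
import posixpath


def read_url_list(text: str) -> list:
    """Return the non-empty, non-comment lines of a URL list."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.strip().startswith("#")]


def local_path(url: str, base_url: str, root: str) -> str:
    """Map a scraped URL to its assumed location in the local mirror."""
    if not url.startswith(base_url):
        raise ValueError(f"{url} is not under {base_url}")
    rel = url[len(base_url):]
    if rel == "" or rel.endswith("/"):
        rel += "index.html"  # wget's default file name for directory URLs
    return posixpath.join(root, rel)


def pandoc_command(html_path: str, out_path: str) -> list:
    """Build a Pandoc invocation that converts HTML to MediaWiki markup."""
    return ["pandoc", "--from", "html", "--to", "mediawiki",
            html_path, "--output", out_path]
```

The returned list could be passed to `subprocess.run(cmd, check=True)`. Any clean-up of the output would still be needed afterwards.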
Wikitext
The wikitext clean-up process has two stages.
These processes can only be applied once.
Upload
Parse the wikitext, obtain the media files, and upload them to MediaWiki.
wikitext -u
Clean
Process the wikitext to fix any side effects of the Pandoc conversion, rewrite any media links, and write the result to disk.
wikitext -w
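As an illustration of what rewriting media links can involve, here is a minimal sketch. It assumes that the intermediate wikitext contains `[[File:...]]` links whose target is still a source URL or path, and that uploaded files are named after the last path component. Both points are assumptions, and the function names are hypothetical rather than taken from the `wikitext` script.

```python
import posixpath
import re
from urllib.parse import unquote, urlparse

# Matches the target of a [[File:...]] link, up to the first "|" or "]".
FILE_LINK = re.compile(r"\[\[File:([^|\]]+)")


def media_name(target: str) -> str:
    """Reduce a URL or path to the bare file name assumed on the wiki."""
    path = urlparse(target).path
    return unquote(posixpath.basename(path))


def rewrite_media_links(wikitext: str) -> str:
    """Point every [[File:...]] link at the bare file name."""
    return FILE_LINK.sub(
        lambda match: "[[File:" + media_name(match.group(1)), wikitext)


if __name__ == "__main__":
    sample = "[[File:https://www.example.com/assets/figure_1.png|thumb|Caption]]"
    print(rewrite_media_links(sample))
```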
Mediawiki
The final wikitext process takes the cleaned up wikitext files from above and publishes them.
mediawiki -l "etc/urls_gatsby.txt"
mediawiki -l "etc/urls_wordpress.txt"
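How the `mediawiki` script publishes pages is not shown here. One common route is the MediaWiki Action API, where a page is written with `action=edit`. The sketch below only builds the POST fields for such a request. The function name and the default edit summary are hypothetical, and whether this project uses the Action API this way is an assumption.

```python
def build_edit_payload(title: str, wikitext: str, csrf_token: str,
                       summary: str = "Import scraped page") -> dict:
    """Build the POST fields for a MediaWiki Action API page edit."""
    return {
        "action": "edit",
        "title": title,
        "text": wikitext,
        "summary": summary,
        "bot": "1",           # mark the edit as a bot edit
        "token": csrf_token,  # CSRF token obtained after logging in
        "format": "json",
    }


if __name__ == "__main__":
    payload = build_edit_payload("Sandbox", "== Test ==\n", "TOKEN")
    print(sorted(payload))
```

The payload would be sent to the wiki's `api.php` in an authenticated session, after fetching a CSRF token with `action=query&meta=tokens`.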
Wikibase
Create the wikibase database.
wbset etc/work.xml
Upload Main_Page
work

