- pl_from -> Id of the "from" page of this link
- (pl_namespace) -> We keep the row only if it equals 0 (= namespace of the "from" page of this link)
- pl_target_id -> Target of this link (foreign key to linktarget)
- lt_id -> Id of this link (index)
- (lt_ns) -> We keep the row only if it equals 0 (= namespace of the targeted page)
- lt_title -> Title of the targeted page
- page_id -> Id of the page
- (page_namespace) -> We keep the row only if it equals 0 (= namespace of this page)
- page_title -> Title of this page
- page_is_redirect -> Boolean: whether this page is a redirect
- Ignore the eight following columns
- rd_from -> Id of the page from which we are redirected
- (rd_namespace) -> We keep the row only if it equals 0 (= namespace of the page we are redirected to)
- rd_title -> Title of the page we are redirected to
- Ignore the two following columns
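The four extracts above all apply the same namespace filter. A minimal sketch in Python of that filtering step (the tab-separated layout and column positions are assumptions for illustration, not the exact dump format):

```python
def keep_main_namespace(rows, ns_index):
    """Yield only rows whose namespace column equals 0, dropping that column."""
    for row in rows:
        if row[ns_index] == "0":
            yield row[:ns_index] + row[ns_index + 1:]

# Illustrative page rows: (page_id, page_namespace, page_title, page_is_redirect)
pages = [
    ("12", "0", "Anarchism", "0"),
    ("13", "1", "Talk_page", "0"),   # namespace 1 -> filtered out
]
print(list(keep_main_namespace(pages, 1)))  # [('12', 'Anarchism', '0')]
```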
Replaces, for each redirection, rd_title with the targeted page_id by matching on page_title.
The targeted page_id is then itself resolved through further redirects, recursively, until we reach a "final" page.

- rd_from -> Id of the page we are redirected from
- page_id -> Id of the page we reach after following redirections recursively
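The recursive resolution can be sketched as follows, assuming the redirects have been loaded into a dict mapping each page id to its immediate target (an illustrative shape; the real pipeline works on files). A cycle guard is included because redirect chains can loop:

```python
def resolve(redirects):
    """Follow each redirect chain until a page that is not itself a redirect."""
    final = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        while target in redirects and target not in seen:
            seen.add(target)            # guard against redirect cycles
            target = redirects[target]
        final[start] = target
    return final

# A -> B -> C, where C is a "final" (non-redirect) page:
print(resolve({"A": "B", "B": "C"}))  # {'A': 'C', 'B': 'C'}
```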
Replaces, for each linktarget, lt_title with the targeted page_id by matching on page_title.
We then compute the "final" page reached from this page by following redirections, using the file redirects.with_ids.txt.

- lt_id -> Id of this link
- page_id -> Id of the page this link points to, after following all redirections
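A minimal sketch of this step, assuming a title-to-id dict built from the pages file and a dict of already-resolved redirects (all names here are illustrative):

```python
def resolve_link_targets(linktargets, title_to_id, redirect_of):
    """linktargets: {lt_id: lt_title}; returns {lt_id: final page_id}."""
    out = {}
    for lt_id, title in linktargets.items():
        if title in title_to_id:                     # drop titles with no matching page
            pid = title_to_id[title]
            out[lt_id] = redirect_of.get(pid, pid)   # follow the redirect if there is one
    return out

print(resolve_link_targets(
    {"1": "Foo", "2": "Missing"},
    {"Foo": "10"},
    {"10": "42"},      # page 10 redirects (ultimately) to page 42
))  # {'1': '42'}
```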
Replaces, for each pagelink, lt_id with the targeted page_id by joining with links.with_ids.txt.

- pl_from -> Id of the "from" page, after following all redirections
- page_id -> Id of the "to" page, after following all redirections
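The join can be sketched like so (illustrative names; pagelinks whose target id was dropped in the previous step are discarded, and the "from" side follows redirects too):

```python
def resolve_pagelinks(pagelinks, target_of, redirect_of):
    """pagelinks: iterable of (pl_from, lt_id); yields (from_id, to_id) pairs."""
    for pl_from, lt_id in pagelinks:
        if lt_id in target_of:   # skip links whose target was pruned
            yield (redirect_of.get(pl_from, pl_from), target_of[lt_id])

print(list(resolve_pagelinks([("5", "1"), ("6", "9")],
                             {"1": "42"}, {})))  # [('5', '42')]
```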
Prunes the pages file by removing pages that are marked as redirects but have no corresponding entry in the redirects file.
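A sketch of the pruning, assuming an in-memory set of the page ids present in the resolved redirects file (the real pipeline streams the files):

```python
def prune_pages(pages, redirect_ids):
    """pages: iterable of (page_id, title, is_redirect). Keep non-redirect
    pages, and redirect pages only if they actually resolved."""
    for page_id, title, is_redirect in pages:
        if is_redirect == "0" or page_id in redirect_ids:
            yield (page_id, title, is_redirect)

pages = [("1", "A", "0"), ("2", "B", "1"), ("3", "C", "1")]
print(list(prune_pages(pages, {"2"})))  # keeps pages 1 and 2, drops 3
```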
Then we sort links.with_ids.txt by the first ("source") id into
the file links.sorted_by_source_id.txt, and by the second ("target") id
into the file links.sorted_by_target_id.txt.
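An in-memory illustration of the two sort orders (the real pipeline does this on disk with the sort command, as the files are far too large to load):

```python
links = [("3", "1"), ("1", "2"), ("2", "1")]   # (source_id, target_id) pairs

by_source = sorted(links, key=lambda pair: int(pair[0]))
by_target = sorted(links, key=lambda pair: int(pair[1]))

print(by_source)  # [('1', '2'), ('2', '1'), ('3', '1')]
print(by_target)  # [('3', '1'), ('2', '1'), ('1', '2')]
```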
Then we use those two files to GROUP BY the links, by source and by target.

The file links.grouped_by_source_id.txt looks like this:

- pl_from -> Id of the "from" page
- targets -> A "|"-separated string of the ids of the pages the "from" page targets

The file links.grouped_by_target_id.txt looks like this:

- froms -> A "|"-separated string of the ids of the pages targeting the "target" page
- pl_target -> Id of the "target" page
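Since each input file is already sorted by its grouping key, the GROUP BY is a single linear pass; a sketch using itertools.groupby (the pair shape stands in for the actual line format):

```python
from itertools import groupby

def group_by_source(sorted_pairs):
    """Yield (source_id, "|"-joined target ids) from pairs sorted by source."""
    for source, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield source, "|".join(target for _, target in group)

pairs = [("1", "2"), ("1", "3"), ("2", "3")]
print(list(group_by_source(pairs)))  # [('1', '2|3'), ('2', '3')]
```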
We stream-merge the two files links.grouped_by_source_id.txt.gz and links.grouped_by_target_id.txt.gz (i.e. we never load either file entirely into memory) into a single sorted links file, links.with_counts.txt.
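Because both inputs are sorted by page id, the merge only needs to advance two cursors in lockstep; a sketch of such a stream merge (record shapes are assumptions based on the file descriptions above):

```python
def merge_links(by_source, by_target):
    """Full outer merge of two streams sorted by page id.
    by_source yields (page_id, targets); by_target yields (page_id, froms).
    Yields (page_id, targets, froms), with "" where one side is absent."""
    s, t = iter(by_source), iter(by_target)
    a, b = next(s, None), next(t, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a[0] < b[0]):
            yield (a[0], a[1], "")          # page has outgoing links only
            a = next(s, None)
        elif a is None or b[0] < a[0]:
            yield (b[0], "", b[1])          # page has incoming links only
            b = next(t, None)
        else:
            yield (a[0], a[1], b[1])        # page has both
            a, b = next(s, None), next(t, None)

out = list(merge_links([(1, "2|3"), (4, "5")], [(2, "1"), (4, "1")]))
print(out)  # [(1, '2|3', ''), (2, '', '1'), (4, '5', '1')]
```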
We create three tables (redirects, pages, links) from the files redirects.with_ids.txt.gz, pages.pruned.txt.gz, and links.with_counts.txt.gz.
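A minimal sketch of loading one of the files into an SQLite table (the column names and types here are assumptions for illustration, not the exact sdow schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE redirects (source_id INTEGER PRIMARY KEY, target_id INTEGER)"
)
# In the real build these rows would be streamed from redirects.with_ids.txt.gz.
rows = [(1, 42), (2, 42)]
conn.executemany("INSERT INTO redirects VALUES (?, ?)", rows)
conn.commit()

print(conn.execute(
    "SELECT target_id FROM redirects WHERE source_id = 1"
).fetchone()[0])  # 42
```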
Depending on your hardware, you might need to change some values in the buildDatabase.sh script, most notably how much memory the sort command is allowed to use; the value can be given as a percentage of RAM or in gigabytes. On a DigitalOcean droplet with 16 GB of RAM, giving the sort command 4 GB worked best. The whole database generation took about two hours (14 GB final size for the sdow.sqlite file).
If the script keeps exiting with no error message, it's possible that the system is killing it for using too many resources. You can investigate by running: dmesg | grep -i "killed process"
After creating the database, you can inspect its contents using sqlitebrowser.