Creating usable sample data for WordPress

This is a side-step to my series of notes on How to preserve a WordPress site using the WP REST API

To have actual (but faked) content I can share in examples, and also allow anyone to run the API queries against for testing and examining the result, I set up a sample site at lab.webit.nu.

What I wasn’t prepared for was that it’s hard to find useful test content to fill the site with (the official “Theme Unit Test” was more or less of no use for my purpose). I finally found two candidates and decided on the second one which I describe below.

wp-cli-fixtures seems really flexible, but refused to run because of conflicts in the code. I managed to “fix” these conflicts, but I still couldn’t have it connect images to the posts. I also tested Faker which ‘wp-cli-fixtures’ is based on, but it hasn’t been updated for many years, and failed because the use of the flickr API has changed.

test-content-generator acts both as a plugin and as an extension to wp-cli. It has options to generate taxonomies (categories and tags, but not to specify which one separately), users (specify user role or randomize), add images (from picsum.photos) in a specified size, create posts and comments to them, and (from the WP backend) create pages.

Creating fake content for the test site

As mentioned (and true for all three alternatives for creating fake content) wp-cli is required since the data-creators are extensions to it.
Simple enough to install according to the instructions on the wp-cli homepage:

## download and test
curl -O https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar
php wp-cli.phar --info
## then if it works
chmod +x wp-cli.phar
sudo mv wp-cli.phar /usr/local/bin/wp

Install the ‘test-content-generator’ plugin as described:

wp plugin install test-content-generator --activate

To have the data itself created in the right order (and to not have to type it over again if I wipe the database), I created a script to do it:

#!/bin/sh
wp test users --amount=5 --role_keys=editor,author
wp test users --amount=15 --role_keys=subscriber
wp test users --amount=3 --role_keys=contributor
wp test users --amount=10
wp test taxonomies --amount=50
wp test images --amount=20 --image_width=1500 --image_height=410
wp test images --amount=20 --image_width=1500 --image_height=410
wp test images --amount=20 --image_width=1500 --image_height=410
wp test posts --amount=60
wp test comments --amount=100
wp test comments --amount=33

I wanted more comments than the maximum of 100 for each command run, so this is the reason for two of this command. The limit for images is 20, so I ran this three times to create enough different images for each post. The four ‘wp test users’ commands in the beginning is to create a set amount of one kind of users, then add 10 more users with randomized roles.

I also uploaded 10 images from stocksnap.io the normal way through wp-admin to have more images to examine the database content for. Five of these will be attached to each of the test pages I create using ‘test-content-generator’ from within wp-admin.

My next post will be the continuation of the series on how to preserve that WordPress site…

Preserving a WordPress site using the WP REST API

This is not a tutorial for someone who likes to copy/paste stuff. It’s my notes on how to recreate a WordPress site where the admin login has been lost and the site is still running

Do not use the methods described here on any site you do not own or have permission to dig through. Doing intense wp-json queries might have you banned or give the site problems (bandwidth or technical).

I have blocked wp-json access to this site (tech.webit.nu) because of my posts about how to collect content. I have however set up another site, lab.webit.nu, which you are allowed to try out some fetching commands on.

The only requirement is that (at least part of) the WP REST API (wp-json) is available on the site. This will let you access most of the content visible to those who visit the site using a web browser.

I came across a site that needs to be recovered/preserved which had all its users deleted (probably including all admins as well), and access to the post comments were not possible using the API. This will later be parsed out from the saved rendered posts of the site.

The focus is on preserving, not cloning. There are plugins available for cloning sites to a new location or domain, but those requires admin access on both locations.

The WordPress REST API

Read someone else’s tutorial on this, there are a couple out there. I will only go into details on what parts of the json output belongs to what table in the WordPress database and how to get the content back where it belongs.
A few pages I stumbled on doing my research for this post:
This is a very short introduction to the API:
https://jalalnasser.com/wordpress-rest-api-endpoints/

Also, I found an article about the WordPress API on SitePoint:
https://www.sitepoint.com/wordpress-json-rest-api/

Another cloning/backup plugin (WP Migrate) claims to have the Ultimate Developer’s Guide to the WordPress Database

The WordPress REST API LinkedIn Course was probably the best resource I found to get started:
https://www.linkedin.com/learning/wordpress-rest-api-2
What I found confusing is how Morten used the term “Endpoint” for the METHOD and “Route” (which is correct) for the URL-part following “wp-json” on the URL. According to me with limited knowledge about this, the GET/POST/DELETE is what I will call “method” and I will only use “GET”. I will use the term “Endpoint” or “Route” for the part of the URL after “wp-json”.

Begin digging

The most useful endpoints are, besides “posts” and “media”, “taxonomies” and “types” which will give you all the taxonomies and post types to retrieve and parse for the parts that will be put back into a new database.
For a WordPress site without any custom post types or taxonomies, “taxonomies” will only be “categories” and “tags”, and “types” of interest will be “pages”, “posts” and “media” (“attachment”). If the site has a WooCommerce shop there are specific endpoints for product categories and tags.

Step 1: Post index

Luckily enough the site I was going to preserve had a (more or less) complete index of the public posts (probably auto-generated by the theme template), so I was able to download the rendered HTML of each post as well as the json for each of them. I didn’t really need to save json for each post, but the code I used for parsing the HTML pages will be used later when I go on recreating the comments.
At this point I had html and json for each post (but no related files or content to them)

Step 2: Get taxonomies (terms)

Taxonomies are as I described earlier the tags and categories. These can be fetched all at once and saved down to one file per type.
These can be easily inserted into the WordPress database.
There are two tables of interest in this step:
‘wp_terms’ (the words) and ‘wp_term_taxonomy’ (connecting each term to a taxonomy, and contains the description and setting for ‘parent’ for categories). A third table connecting the terms with the posts (‘wp_term_relationships’) will come in use when the posts are imported. Lastly, the table ‘wp_termmeta’ optionally contains more information for the terms (meta_key and meta_value added by plugins)

Step 3: Get json for the posts

Although I already had these as separate json files, I now reworked my script to fetch the posts in batches, so I got them in batches of 10 and 100. The sets of 100 posts per fetch is a complete set, and the files with 10 posts each will be used for testing further routines.
The API endpoint /posts is just for the post type of ‘post’.
As the ‘wp_posts’ table also contains the pages and media file information (post type “attachment”), these will have to be fetched in the next step.

Step 4: Get json for pages

As step 2, but now I get the pages. As the pages are a small amount on most sites, I decided to get these as one item per file. This to reduce the risk for parsing errors.

Step 5: Get json for entries in the media library

As the other steps for getting posts, as the media items also are a post type (‘attachment’) with some special fields (source URLs for the files). Media items were grabbed in batches of 100, as they are most likely to be problem free with the limited content of the entries.

Parsing time

Now things get more complicated when we start to parse the data we got. This will be described in part 2 of this series of notes.
Part 2: The WordPress database and parsing taxonomies

Preserving a WordPress site using the WP REST API – the WordPress database and parsing taxonomies

To make the most out of my notes, you should have your own sites (source and destination) set up for testing and analyzing the WordPress database content.

If you’re intend to try out something from my notes, you should have your own site to try things out against, or at least have access to a site you can dig into without any legal issues. You will also need to set up a destination site somewhere, and I recommend that you do it on a virtual machine with shell access somewhere (Oracle Cloud Free Tier is a forever-free alternative – and no, I’m not getting paid for recommending them, I just use and like their services – for stability AVOID their old AMD machines and only create machines on the Ampere A1 platform).
For just trying out some GET requests, I have set up lab.webit.nu and populated it with random content using wp-cli. This site might break at any time, so do not rely on it being available.

Some commands I give as examples has to be run on a Linux/Unix machine. This might also be possible in windows using microsoft ubuntu.

The WordPress database tables

It’s now a good time to examine the WordPress database tables used to store the terms. The descriptions below are my own findings in my own words. I later found a resource on the WP Staging plugin home page:
https://wp-staging.com/docs/the-wordpress-database-structure/

wp_terms
Terms such as categories, tags and menu names

+------------+-----------------+------+-----+---------+----------------+
| Field      | Type            | Null | Key | Default | Extra          |
+------------+-----------------+------+-----+---------+----------------+
| term_id    | bigint unsigned | NO   | PRI | NULL    | auto_increment |
| name       | varchar(200)    | NO   | MUL |         |                |
| slug       | varchar(200)    | NO   | MUL |         |                |
| term_group | bigint          | NO   |     | 0       |                |
+------------+-----------------+------+-----+---------+----------------+

Note: in the database from a larger site with about 3500 terms I manage, I have not seen any other value than 0 (zero) in the ‘term_group’ field.

wp_term_taxonomy
Connects the terms (tags, categories with their names) to the taxonomy they belong to. This table also holds the ‘description’ and ‘parent’ fields for the term.
“category” and “post_tag” are the most used ones.
Every term should have the corresponding entry in this table.

+------------------+-----------------+------+-----+---------+----------------+
| Field            | Type            | Null | Key | Default | Extra          |
+------------------+-----------------+------+-----+---------+----------------+
| term_taxonomy_id | bigint unsigned | NO   | PRI | NULL    | auto_increment |
| term_id          | bigint unsigned | NO   | MUL | 0       |                |
| taxonomy         | varchar(32)     | NO   | MUL |         |                |
| description      | longtext        | NO   |     | NULL    |                |
| parent           | bigint unsigned | NO   |     | 0       |                |
| count            | bigint          | NO   |     | 0       |                |
+------------------+-----------------+------+-----+---------+----------------+

wp_term_relationships
Connect page or post with term
object_id: post, page etc id (any object that supports tags or categories)
term_taxonomy_id: id in wp_term_taxonomy

+------------------+-----------------+------+-----+---------+-------+
| Field            | Type            | Null | Key | Default | Extra |
+------------------+-----------------+------+-----+---------+-------+
| object_id        | bigint unsigned | NO   | PRI | 0       |       |
| term_taxonomy_id | bigint unsigned | NO   | PRI | 0       |       |
| term_order       | int             | NO   |     | 0       |       |
+------------------+-----------------+------+-----+---------+-------+

Note: in the database from a larger site with about 3500 terms I manage, I have not seen any other value than 0 (zero) in the ‘term_order’ field.

wp_termmeta
Additional data for term items. This table is used by plugins.

+------------+-----------------+------+-----+---------+----------------+
| Field      | Type            | Null | Key | Default | Extra          |
+------------+-----------------+------+-----+---------+----------------+
| meta_id    | bigint unsigned | NO   | PRI | NULL    | auto_increment |
| term_id    | bigint unsigned | NO   | MUL | 0       |                |
| meta_key   | varchar(255)    | YES  | MUL | NULL    |                |
| meta_value | longtext        | YES  |     | NULL    |                |
+------------+-----------------+------+-----+---------+----------------+

Adding the terms to the WordPress database

At this time, only the wp_terms and wp_term_taxonomy tables are to be populated. As I later will parse objects from the media library, I will convert the json response to an associative array for easier manipulation of the meta value for images (more on that later).
PHP has the function json_decode() that has the option to return an array instead of an object.
Below is incomplete code of my fully working code, as you read this, I assume you are able to put things together from my hints.

$db = mysqli_connect(your db connection details here);
$file = "your-file-with-10-categories-or-tags.json";
$jsondata = json_decode(file_get_contents($file),true);
print count($jsondata) . " items\n";
foreach($jsondata as $post)
{
  if (!empty($post['taxonomy']))
  {
    $parent = !empty($post['parent']) ? $post['parent'] : 0;
    $name = mysqli_real_escape_string($db,$post['name']);
    $desc = mysqli_real_escape_string($db,$post['description']);
    $sql1 = <<<EOM
INSERT IGNORE INTO wp_terms(term_id,name,slug)
 VALUES({$post['id']},"{$name}","{$post['slug']}");
EOM;
  print "$sql1\n";

    $sql2 = <<<EOM
INSERT IGNORE INTO wp_term_taxonomy(term_taxonomy_id,term_id,taxonomy,description,parent)
 VALUES ({$post['id']},{$post['id']},"{$post['taxonomy']}","{$desc}",{$parent});
EOM;
    print "$sql2\n";
  }
}

After this step, you will be able to see the categories in the wp-admin backend of the destination site.