Preserving a WordPress site using the WP REST API

This is not a tutorial for someone who likes to copy/paste stuff. These are my notes on how to recreate a WordPress site where the admin login has been lost but the site is still running.

Do not use the methods described here on any site you do not own or have permission to dig through. Intense wp-json querying can get you banned or cause the site problems (bandwidth or technical).

I have blocked wp-json access to this site (tech.webit.nu) because of my posts about how to collect content. I have, however, set up another site, lab.webit.nu, on which you are welcome to try out some fetching commands.

The only requirement is that (at least part of) the WP REST API (wp-json) is available on the site. This will let you access most of the content visible to those who visit the site using a web browser.

I came across a site that needed to be recovered/preserved where all the users had been deleted (probably including all admins), and access to the post comments was not possible through the API. The comments will later be parsed out from the saved rendered posts of the site.

The focus is on preserving, not cloning. There are plugins available for cloning sites to a new location or domain, but those require admin access on both locations.

The WordPress REST API

Read someone else’s tutorial on this; there are a couple out there. I will only go into detail on which parts of the json output belong to which table in the WordPress database, and how to get the content back where it belongs.
A few pages I stumbled on doing my research for this post:
This is a very short introduction to the API:
https://jalalnasser.com/wordpress-rest-api-endpoints/

Also, I found an article about the WordPress API on SitePoint:
https://www.sitepoint.com/wordpress-json-rest-api/

Another cloning/backup plugin (WP Migrate) claims to have the Ultimate Developer’s Guide to the WordPress Database

The WordPress REST API LinkedIn Course was probably the best resource I found to get started:
https://www.linkedin.com/learning/wordpress-rest-api-2
What I found confusing is that Morten used the term “Endpoint” for the METHOD and “Route” (which is correct) for the part of the URL following “wp-json”. With my limited knowledge of this, I will call GET/POST/DELETE the “method” (and I will only use GET), and I will use the terms “Endpoint” or “Route” for the part of the URL after “wp-json”.

Begin digging

The most useful endpoints are, besides “posts” and “media”, “taxonomies” and “types”, which give you all the taxonomies and post types to retrieve and parse for the parts that will be put back into a new database.
For a WordPress site without any custom post types or taxonomies, “taxonomies” will only list “categories” and “tags”, and the “types” of interest will be “pages”, “posts” and “media” (“attachment”). If the site has a WooCommerce shop there are specific endpoints for product categories and tags.
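As a sketch of what this discovery step looks like (the helper function and sample data are my own), each entry returned by /wp-json/wp/v2/types and /wp-json/wp/v2/taxonomies carries a “rest_base” field naming the endpoint you can fetch that content from:

```python
# Hypothetical helper: given the decoded JSON from /wp-json/wp/v2/types
# or /wp-json/wp/v2/taxonomies, list the REST bases you can fetch from.
def rest_bases(index: dict) -> list[str]:
    # Each entry has a "rest_base" field naming its endpoint
    # (e.g. "posts", "pages", "media", "categories", "tags").
    return sorted(entry["rest_base"] for entry in index.values())

# Trimmed-down example of what /wp-json/wp/v2/taxonomies returns:
taxonomies = {
    "category": {"name": "Categories", "rest_base": "categories"},
    "post_tag": {"name": "Tags", "rest_base": "tags"},
}
print(rest_bases(taxonomies))  # ['categories', 'tags']
```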

Step 1: Post index

Luckily enough the site I was going to preserve had a (more or less) complete index of the public posts (probably auto-generated by the theme template), so I was able to download the rendered HTML of each post as well as the json for each of them. I didn’t really need to save json for each post, but the code I used for parsing the HTML pages will be used later when I go on recreating the comments.
At this point I had HTML and JSON for each post (but none of the related files or content).
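For each post URL collected from the index, there is a matching JSON representation. A post’s JSON lives under /wp-json/wp/v2/posts/<id>, and when you only know the slug from the URL, the standard ?slug= filter on the posts collection finds it. A minimal sketch (the function name is my own):

```python
# Sketch: derive the wp-json URL for a post when all you have is
# its slug from the rendered permalink. The ?slug= filter is a
# standard WP REST API collection parameter.
def json_url_for_slug(base: str, slug: str) -> str:
    return f"{base}/wp-json/wp/v2/posts?slug={slug}"

print(json_url_for_slug("https://lab.webit.nu", "hello-world"))
# https://lab.webit.nu/wp-json/wp/v2/posts?slug=hello-world
```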

Step 2: Get taxonomies (terms)

Taxonomies are, as I described earlier, the tags and categories. These can be fetched all at once and saved to one file per type.
These can easily be inserted into the WordPress database.
There are two tables of interest in this step:
‘wp_terms’ (the words) and ‘wp_term_taxonomy’ (which connects each term to a taxonomy, and holds the description and the ‘parent’ setting for categories). A third table connecting the terms with the posts (‘wp_term_relationships’) will come into use when the posts are imported. Lastly, the table ‘wp_termmeta’ optionally holds more information for the terms (meta_key and meta_value pairs added by plugins).
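A sketch of how one term object from, say, /wp-json/wp/v2/categories splits across those first two tables (the function and field selection are mine; the JSON fields “id”, “name”, “slug”, “description”, “parent” and “count” are what the API actually returns for a term):

```python
# Split a single term object from the wp-json taxonomy endpoints into
# a row for wp_terms and a row for wp_term_taxonomy.
def term_rows(term: dict, taxonomy: str):
    wp_terms = {
        "term_id": term["id"],
        "name": term["name"],
        "slug": term["slug"],
    }
    wp_term_taxonomy = {
        "term_id": term["id"],
        "taxonomy": taxonomy,             # 'category' or 'post_tag'
        "description": term.get("description", ""),
        "parent": term.get("parent", 0),  # only categories have parents
        "count": term.get("count", 0),
    }
    return wp_terms, wp_term_taxonomy

cat = {"id": 7, "name": "News", "slug": "news",
       "description": "", "parent": 0, "count": 12}
terms_row, tax_row = term_rows(cat, "category")
print(terms_row["slug"], tax_row["taxonomy"])  # news category
```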

Step 3: Get json for the posts

Although I already had these as separate json files, I now reworked my script to fetch the posts in batches of 10 and 100. The set of 100 posts per fetch is the complete set; the files with 10 posts each will be used for testing further routines.
The API endpoint /posts only covers the post type ‘post’.
As the ‘wp_posts’ table also contains the pages and the media file information (post type “attachment”), these will have to be fetched in the next steps.
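The batching itself is just the standard per_page and page query parameters (the API caps per_page at 100, and reports the totals in the X-WP-Total and X-WP-TotalPages response headers, so a real script knows when to stop). Only the URL construction is sketched here; the loop and download are left out:

```python
# Sketch of the batch-fetch URLs. A real script would loop over page
# numbers until X-WP-TotalPages is reached or an empty page comes back.
def batch_url(base: str, endpoint: str, page: int, per_page: int = 100) -> str:
    return f"{base}/wp-json/wp/v2/{endpoint}?per_page={per_page}&page={page}"

print(batch_url("https://lab.webit.nu", "posts", 1))
# https://lab.webit.nu/wp-json/wp/v2/posts?per_page=100&page=1
```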

Step 4: Get json for pages

As in step 3, but now I get the pages. Since most sites only have a small number of pages, I decided to save these as one item per file, to reduce the risk of parsing errors.
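The one-item-per-file idea can be sketched like this (function and file naming are my own): write each page object from an already-fetched batch to its own file, named by post ID, so a single malformed entry cannot break the parsing of all the others.

```python
import json
import pathlib

# Sketch: save each page object to its own JSON file, named by ID,
# so parsing failures stay isolated to one file.
def save_items(items: list[dict], outdir: str) -> list[str]:
    out = pathlib.Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        path = out / f"page-{item['id']}.json"
        path.write_text(json.dumps(item, ensure_ascii=False, indent=2))
        written.append(path.name)
    return written

print(save_items([{"id": 2, "title": {"rendered": "About"}}], "pages"))
# ['page-2.json']
```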

Step 5: Get json for entries in the media library

Same as the other post-fetching steps, since the media items are also a post type (‘attachment’) with some special fields (source URLs for the files). Media items were grabbed in batches of 100, as their limited content makes them the least likely to cause problems.
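The field that matters most here is “source_url”, which the /wp-json/wp/v2/media endpoint really does return for each attachment; it points at the original uploaded file so the files themselves can be downloaded later. A minimal sketch (the helper is mine):

```python
# Sketch: pull the original-file URL out of each media item so the
# actual files can be downloaded in a later pass.
def file_urls(media_items: list[dict]) -> list[str]:
    return [m["source_url"] for m in media_items if "source_url" in m]

media = [
    {"id": 42,
     "source_url": "https://lab.webit.nu/wp-content/uploads/2024/01/photo.jpg"},
]
print(file_urls(media))
# ['https://lab.webit.nu/wp-content/uploads/2024/01/photo.jpg']
```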

Parsing time

Now things get more complicated when we start to parse the data we got. This will be described in part 2 of this series of notes.
Part 2: The WordPress database and parsing taxonomies
