WordCamp Europe 2020 (online)

Abstract

‘You cannot make an omelette without breaking a few eggs.’ Things are inevitably going to break. And that’s stressful. If your triaging process involves manually seeking through log files or you only find out something is broken after a customer calls you, then you’re probably not fixing problems efficiently.

In this lightning talk I will discuss how to prevent, proactively detect and quickly react to issues. He will provide a short overview of tools that help with testing and preventing issues. The talk will cover logging frameworks and platforms that will actively alert you and help find out what’s going wrong. Finally, I will advise on adding metrics to your site in order to prevent future issues.

Slides

Resources

Video

Transscript

Hi Everyone, Thanks for joining! My name is Niels de Blaauw and I’m the lead developer at Level Level, a WordPress agency from the Netherlands.

Today we are going to explore this statement ‘everything is broken’, and more importantly, how to prepare for things breaking down. First let’s explore this statement.

In computer development, the industry average for defects is around 15 errors per thousand lines of code. Even in multi billion dollar companies, defects can not be completely eliminated. Which is the reason even huge companies need error pages, like this one GitHub….

How does this translate to WordPress where contributions are made by thousands of volunteers…

WordPress currently tracks a lot of known bugs. The oldest open ticket was created 14 years ago. Many of these are mild annoyances at best.

But there are about 200 major issues currently. And of course that’s just looking at WordPress Core ranging from crashes on pages with many revisions(https://core.trac.wordpress.org/ticket/34560) , to infinite redirects when changing post urls (https://core.trac.wordpress.org/ticket/23074).

You’re probably using a lot more software to run your website. Most of which you don’t directly control. And almost every part is vital for your website.

Furthermore, an increasing number of businesses rely on WordPress and it’s ecosystem for their income.

That’s pretty scary, right?

But we still love WordPress for it’s amazing ecosystem of plugins, volunteers, and events.

The bad news is: You can’t prevent mistakes from happening. You won’t be able to control and check every single part of your site, every single time.

However, You can prepare for mistakes happening. And now we’ll explore how!

For now we are going to look at 4 milestones that will help with this preperation. This way you will be able to identify issues sooner, and fix them quicker.

I will show some implementation examples during these slides. If you are not able to see these, or you want more details, don’t worry. I will publish these slides, more detailed code examples and additional resources. I’ll share the link in the final slides.

Let’s start with logging exceptions. You might know logging from the PHP error log, or the access log in your webserver.

For the first milestone, we want to take this a step further and implement logging on the application level.

A Log event consists of four vital parts:

The message is the most direct way of explaining what is happening, like an undefined function call..
A log event should also contain some context which answers questions like when did this happen?, who triggered it?, where did this happen
It helps you figure out if a certain page is generating errors, or that an issue is happening only during the WordPress CRON.
The level indicates how bad things are.
There are different types of levels, ranging from debugging and informational all the way up to emergencies, indicating that a system is unusable.
Depending on the level we specify we might choose different ways we want to output this data. As an example: all warnings and more severe issues can be saved to a logfile, While all emergencies are directly sent to your phone in a text message.

Now, lets see what this might look like in practice.

First, we create our logger and add an output for our log events to Slack.

Next, we’ll add the WebProcessor to add the URL, referrer, hostname and more to any logged message.

This helps us understand the context.

As the last part of our setup, we add a Default Error Handler, to capture any fatal errors and forward these events to our logger.

Finally, we call an undefined function, to test whether our setup works.

Now that we can catch every fatal error in our website we move on to the next milestone.

We are going to focus on using tools that help us find and prevent undefined behaviour early.

For this talk, we’ll limit this milestone to 3 major technologies, that each fill different gaps and assist in writing flawless code.

First up are coding standards.

Coding standards can do much more than just assist with whitespace.

There are a lot of rules that catch errors in themes and plugins as well. For instance, the WordPress standards prohibit you from accidentally using reserved post types.

Integration tests make sure all parts of your website are correctly working together.

In an integration test you specify a scenario and compare it to a fixed outcome.

It’s best to start with processes that are vital for your website.

In my scenario, we set up a 15 euro product and then we create an order containing 4 of these.

After calculating the total we expect the result to be 60.

If we now install or update a plugin that somehow changes this calculation, we know immediately.

Third: static analysis tools can assist in finding invalid code.

Where a code sniffer will only look at files in isolation, static analysis will look at your code in its context. What functions are called? What values do these return and how are the values used.

In the example on screen now, We see one happy code path that leads to valid execution of our process.

However, a static analysis tool will find all possible code paths.

In this case, If register_post_type fails, it can return a WordPress_Error Object. Calling get_rest_controller on this object, will result in a fatal error.

So now that we can identify undefined behaviour early, we can move on to the next milestone: turning exceptions into defined behaviour.

Let’s take a look at this exception. “The XML document is well formed but the document is not valid”.

Can we really understand what’s happening when we read this message? It can be improved by answering two main questions.

What is the situation? Explaining what should be happening, and what happened instead.

What can we do to resolve the situation? Help someone else, or your future self to understand what can be done to fix this.

Logging provides asynchronous communication between creator and the maintainer.

Let’s go back to the example I showed when I was explaining static analysis.

Before defining bahaviour for this error, we get a cryptic error message. Something like ‘get_rest_controller does not exist on WordPress Error’.

By creating a new codepath in case register_post_type fails we can improve this.

We can add a severity level, custom message and some context, like the post type slug we tried to register.

Instead of ‘get_rest_controller does not exist’ we can now provide a much more descriptive and actionable message.

Finally, some issues may be caused by things outside of the application.

Also, understanding how the complete system is working gives greater insight in the impact of failures..

In any given system, a lot of data and logs are collected about individual parts that are running the website. However, we want to combine this data so we can get insights in how different parts of the system are influencing each other.

For instance, high CPU usage might result in 504 response codes in the access log and lead to less orders on your webshop.

By combining these data sources in a centralised data store, we can start to gather information on a higher level.

This information can be used to create insightful dashboards, that allow you to understand the system as a whole.

Additionally, you might set up alerts, which trigger when some data point exceeds or drops below your expected values. An example of this would be disk space usage exceeding 80%.

An example setup would include stats.D to provide application level data, such as the number of orders.

A datastore like graphite can be used to collect and query timeseries data.

Grafana can then be utilized to create an insightful dashboard.

To recap:

You can’t prevent mistakes from happening. You can prepare for mistakes happening.

And we explored 4 milestones:

Log exceptions to find out whats happening on the server,
Catch undefined behaviour early in order to prevent the majority of defects,
Turn exceptions into defined behaviour, so we can help each other debug faster
Monitor the rest, so we can predict and see the impact of failures when they do happen.

You can review the code examples and find many more detailed implementations and resources on my Twitter account right now.

If you have any feedback or questions about the topics I’ve talked about today, don’t hesitate to reach out. Stay safe and take care of each other. Thanks for watching. Lets keep in touch.

Event data

Event date	June 5th, 2020
Event page	https://2020.europe.wordcam...
Format	Lightning Talk