I remember in 1991 an ad that appeared in the weekly news-magazine, popular at the time, Information Technology (IT) Weekly. The ad showed some poor schmuck laboring away at a terminal with reams of untidy green-bar printouts coiled around him, opened manuals everywhere. Head in hands, the man just stared at the screen. The caption read: Where do you go for help when you're the expert? Obviously I took that to heart. Many, many times over the course of my career I have been the expert in some system I developed, or designed, purchased, configured, or all of the above. Pride of accomplishment shared equal space and time with sheer raving terror over the eventuality of being called to task in the event of utter failure of a critical system. Hearing those dreaded and hated words: The system is down.
At one assignment, for three years, I was not allowed to take vacation, or attend off-site training, without bringing a laptop with me so that I could dial in to the system every morning to check the results of the previous night's run. I've had my share of 2:00am calls, and Sunday afternoon calls, and calls at pretty much the weirdest and most incredible events you can imagine.
The time I spent working tech support only honed my diagnostic skills - think about it: you're fixing a system you can't see, based mostly on the panicked observations of the remote "expert". I came to truly appreciate the depth of the term "mission critical system" because I learned that there were a lot of techs out there that had it a lot worse than I did. (Worst case scenario: a PPP link had gone down, affecting currency trading at a large, well-known banking company between their New York and Zurich offices and costing them approximately $300,000 per hour in down time.)
So, to the point, I've learned a few things about debugging and diagnostics over the years. I'm currently working on an assignment where I've coded a back-end data system that's 100% scalable, vertically and horizontally, and uses open-source services such as mongo, mysql, rabbitmq and memcache. The framework is 100% PHP and written by... me. We recently hired another programmer to join our team and work the code on each side of the message broker, beginning with the front-end. He's a good, intelligent man -- straight out of college on his first assignment. (ahhh, memories.)
Seeing the system, my system, through his eyes has been, to say the least, educational. Watching an intelligent person who writes amazing code completely spin up when stuff starts to break - also educational. I'm happy to report that after so many decades as an alumnus of the computer sciences, colleges and universities still appear to be lacking a comprehensive program in software (or system) diagnosis and debugging.
Which, in turn, got me thinking. I'm certainly not what I would consider to be an "expert" in terms of software development especially, strictly speaking, nor am I in the practice of following any sort of formalized SDLC. I still believe test-driven-development is niche -- and in most real-world environments with ever-increasing time-to-market constraints, a luxury. Formalized spec is something you see in the movies or at behemoth companies that take years to develop a single, usually internal, product. Feature-creep is standard, tolerated and even encouraged. Managers ignore time estimates and develop deadlines based on the "Rule of divisible by 5 or 10" rather than anything based on logic.
Debugging issues across diverse systems, integrated open-source systems, or within your own code is, in my opinion, a learned art. Over the last few decades, I've compiled an informal list of diagnostic techniques which I would be honored to share with you today.
So, in order of "importance", or probability if you will, here's my list of Debugging "Rules" for programming:
Rule 1 -- The last change you implemented/installed broke it.
One thing I learned in the time I spent doing telephone technical support, pre-Web, was that if you listen long enough, the customer will tell you what the solution is to their problem. In most cases when a system or piece of software suddenly stops working, it's because you introduced some change into the environment, to the configuration, to the code, that adversely impacted the stability of the system.
If you back out the change, roll back the code, recover the old database, un-install the package, will the system/software work again? If the answer is "yes", then you know what caused the system breakage and you have a start-point for determining why the system broke.
Seriously -- you'd be surprised how often this rule is overlooked by seasoned programmers or admins. Software, today, is pretty stable. It usually doesn't mutate on its own. If it suddenly stops running, then something happened to cause it to fail and you need to determine what that something was and why it caused the failure. Chances are you introduced the change. If not, then you need to question the team and find out who staged the last patch, who su'd over to root and accidentally nuked a file permission, etc.
Look at the last thing changed first.
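When nobody on the team owns up to a change, the file system often will. The following is a minimal sketch, not part of my framework: the function name recently_changed() and the time window are assumptions made up for illustration. It walks a directory tree and lists the files modified within the last few hours, newest first.

```php
<?php
// Hypothetical helper: list files under $root that were modified within the
// last $hours, newest first. The name and parameters are illustrative only.
function recently_changed(string $root, float $hours, ?int $now = null): array
{
    $now    = $now ?? time();
    $cutoff = $now - (int) round($hours * 3600);
    $found  = [];

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && $file->getMTime() >= $cutoff) {
            $found[$file->getPathname()] = $file->getMTime();
        }
    }

    arsort($found); // sort by modification time, newest first
    return $found;  // path => modification timestamp
}
```

Pointing this at a configuration or deployment directory with a window of, say, 24 hours gives you a short list of suspects to ask the team about. It only sees file changes, of course; a changed database row or a restarted service will not show up here.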
Rule 2 -- Can I reproduce it?
Specific to software development, when diagnosing an issue, your very first task should be to reproduce the issue or problem. This may mean working directly with the end-user to ascertain the exact sequence of actions they took to generate the failure.
As a software engineer, it's really hard to fix code when you don't know what's broken. You can't know what is broken (or whether anything is) if you can't reproduce the problem.
If the same actions produce inconsistent results, e.g.: it works one time, but not the next, works the third, fourth and fifth time, but fails the sixth, then your problem is probably in the data and you need to examine run-time configuration options, payloads, inputs and results. The code is not the issue. The code is processing the same way every time.
Of course, there is an exception to this generalization - but the exception has become more scarce over the years as computers have evolved practices of protecting memory. At one point, decades ago, I was giving a class on kernel crash analysis. The system was stable until I hit a certain mouse button sequence. I had re-coded the mouse driver to, on receiving that sequence, tromp through memory assigned to the printer driver. When a print command was issued - core dump!
If you can reproduce the problem, consistently, then you're most likely dealing with a code issue given consistent inputs. Reproducing the problem, on demand, should generate diagnostic output of sufficient quality that you can immediately narrow down the suspect method or routine.
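The consistent-versus-intermittent distinction above can be sketched as a small harness. Everything here is hypothetical: reproduce() and classify() are names invented for the example, and "failure" is assumed to mean that the operation throws.

```php
<?php
// Hypothetical harness: run the same operation repeatedly with identical
// input and tally the outcomes. A "failure" here means the callable throws.
function reproduce(callable $operation, $input, int $attempts): array
{
    $tally = ['pass' => 0, 'fail' => 0, 'errors' => []];
    for ($i = 1; $i <= $attempts; $i++) {
        try {
            $operation($input);
            $tally['pass']++;
        } catch (Throwable $e) {
            $tally['fail']++;
            $tally['errors'][$i] = $e->getMessage(); // keyed by attempt number
        }
    }
    return $tally;
}

// Apply the rule of thumb from the text to the tally.
function classify(array $tally): string
{
    if ($tally['fail'] === 0) {
        return 'not reproduced';
    }
    if ($tally['pass'] === 0) {
        return 'consistent failure: suspect the code';
    }
    return 'intermittent failure: suspect data, configuration or environment';
}
```

You would wrap the suspect call in a closure, feed it the exact payload the user reported, and run it a few dozen times. The classification is only a starting hypothesis, in the same spirit as the rule itself.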
Rule 3 -- Treat the Cause - Not the Symptom
Because systems are so interconnected, it's a common mistake to be led down the wrong path when diagnosing a problem. Usually called the "ripple effect", problems in software, or in a system, can manifest in a distant location whose only redeeming quality is its visibility as a problem. When a problem first surfaces, you need to logically address the question: am I dealing with a problem or with a symptom?
In one example, MySQL logins suddenly started failing on a system when no code, or code configuration changes, had been made. The error logs for the application showed that 'user'@'localhost' was not authorized to use the system. Existing user permissions in the db showed correct configurations for 'user'@'%' - a wildcard domain setting. Nothing had changed, as mentioned, in the software stack, but the sysadmin had just finished updating the /etc/hosts file, correcting some entries as the server was temporarily assigned a static IP. Changing the machine aliases from the localhost entry of 127.0.0.1 to the assigned static IP caused MySQL to hiccup on authentication, because the software configuration was pointing to MySQL using a domain name instead of an IP or, better, the localhost domain. Changing the configuration to point the stack at localhost instead of the FQDN solved the login issue, since port 3306 is blocked on this (cloud-based) server.
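One way to make this kind of environmental change visible is to check what the configured database host actually resolves to on the machine in question. This is a rough sketch under assumptions: is_loopback() and describe_db_host() are hypothetical names, and the check only covers IPv4 resolution through PHP's gethostbyname(), which returns its input unchanged when the lookup fails.

```php
<?php
// Hypothetical check: does an address belong to the loopback range?
function is_loopback(string $ip): bool
{
    return $ip === '::1' || strpos($ip, '127.') === 0;
}

// Hypothetical report: resolve the host named in the application's database
// configuration and say whether the connection would stay on this machine.
function describe_db_host(string $configuredHost): string
{
    $ip = gethostbyname($configuredHost); // input comes back unchanged on failure
    if ($ip === $configuredHost && filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return $configuredHost . ' does not resolve';
    }
    if (is_loopback($ip)) {
        return $configuredHost . ' resolves to ' . $ip . ' (local)';
    }
    return $configuredHost . ' resolves to ' . $ip . ' (external)';
}
```

Run before and after an /etc/hosts edit, a check like this would show the configured name moving from a local address to an external one, which points at the environment rather than at the application code.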
The take-away is that sometimes, when dealing with a problem that just leaves you mentally wandering about in circles, you have to stop and simply determine whether the problem you're dealing with is symptomatic or causal.
It's imperative that you view your system holistically instead of focusing (or obsessing) on one particular piece of the whole.
Rule 4 -- Have Faith
I used to teach beginning programming to non-traditional students, some of whom were absolutely possessed of the opinion that computers were intelligent, malevolent, plotting, scheming, insidious, prank-pulling boxes of hell on a stick.
From the movie Short Circuit:
It's a machine, Schroeder. It doesn't get pissed off, it doesn't get happy, it doesn't get sad, it doesn't laugh at your jokes...IT JUST RUNS PROGRAMS!
People tend to lose sight of this. Software that runs perfectly one time should run perfectly all the time given the same inputs and a stable environment. If I can complete a test run on two boxes and it fails on a third, I shouldn't waste time scanning lines of code for the cause as the problem isn't in the software - it's somewhere in the environment configuration, or possibly the data inputs, on the third machine.
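When a run passes on two boxes and fails on a third, comparing the environments is usually quicker than reading code. Here is a minimal sketch, assuming PHP is the runtime on each box; the function names and the handful of settings captured are illustrative choices, not a complete inventory.

```php
<?php
// Hypothetical snapshot of a few environment facts worth comparing.
function environment_fingerprint(): array
{
    $extensions = get_loaded_extensions();
    sort($extensions);
    return [
        'php_version'  => PHP_VERSION,
        'os'           => PHP_OS,
        'timezone'     => date_default_timezone_get(),
        'memory_limit' => ini_get('memory_limit'),
        'extensions'   => implode(',', $extensions),
    ];
}

// Report every key whose value differs between a working and a failing box.
function fingerprint_diff(array $good, array $bad): array
{
    $diff = [];
    $keys = array_unique(array_merge(array_keys($good), array_keys($bad)));
    foreach ($keys as $key) {
        $a = $good[$key] ?? null;
        $b = $bad[$key] ?? null;
        if ($a !== $b) {
            $diff[$key] = ['good' => $a, 'bad' => $b];
        }
    }
    return $diff;
}
```

In practice you might dump json_encode(environment_fingerprint()) on each machine, copy the results to one place, and feed the decoded arrays to fingerprint_diff(). An empty result does not prove the environments match; it only says these particular settings do.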
Unfortunately, the less, ahem, technically astute, will always perceive the problem to be software based because that's pretty much all they know.
"The server's broken." "The software's broken."
That is their version of in-depth problem analysis. You can't be distracted, misled, or swayed by these types of commentaries. Especially if you're the author/developer/sysadmin responsible for configuring/writing the system or software. You have to have faith that the code, the platform, the database will behave the same way, every time, given consistent inputs and a stable environment.
From outward appearances, sure, the software or system is "broken". But that doesn't mean you should just dive into the relevant code chunk and start churning. Analyse the scope of the problem, be aware of your pre-conditions, the integrity of your data feeds/sources, and the quality of your output. If all things remain consistent, as they should, then you need to look outside the box, sorry, for the source of the issue.
Rule 5 -- Log, Log and More Log.
You simply cannot overdo logging in a development environment. By logging, of course, I am talking about diagnostic or informative messages embedded in the code base that give me a first pass at problem determination. Log often and log liberally. No one cares about transaction performance in a development environment. Generate 100 log messages for every query if you have to.
Log messages not only provide you with insights into trouble areas in your code, they also provide you with confidence that things are working/processing/calculating as they should be. Logs work to provide you with confirmation that this query, this calculation, or this method was successful.
I, personally, always hard-code two levels of logging as routine into my code. I always code a trace output at the entry point to every method, and I code a debug conditional where critical decision points occur. I can then use my log output as a means of tracing program flow while keeping an eye on critical values. Because these are run-time methods, a single boolean in the configuration controls their execution so that the messaging is squelched for non-development environments.
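The two-level scheme just described might look something like the sketch below. To be clear, this is not the logger from my framework: the DevLog class, its $enabled flag and the apply_discount() function are all invented for the example, and the messages are kept in memory only to keep the sketch self-contained.

```php
<?php
// Hypothetical two-level logger controlled by a single boolean.
final class DevLog
{
    public static $enabled = false; // would normally be read from configuration
    public static $lines = [];      // in-memory sink, for illustration only

    // Level 1: trace output at the entry point of a method.
    public static function trace(string $method): void
    {
        self::write('TRACE', 'entering ' . $method);
    }

    // Level 2: debug output at a critical decision point.
    public static function debug(string $message): void
    {
        self::write('DEBUG', $message);
    }

    private static function write(string $level, string $message): void
    {
        if (!self::$enabled) {
            return; // squelched outside development
        }
        self::$lines[] = $level . ' ' . $message;
    }
}

function apply_discount(float $total): float
{
    DevLog::trace(__FUNCTION__);
    if ($total > 100.0) {
        DevLog::debug('total ' . $total . ' exceeds 100, applying 10% discount');
        return $total * 0.9;
    }
    return $total;
}
```

With DevLog::$enabled set to true, a call to apply_discount() leaves a trace line for the entry point and, when the branch is taken, a debug line recording why. With the flag off, the same calls cost one boolean test each and record nothing.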
Additionally, log everything. PHP, for example, has magic constants such as __FILE__, __LINE__, __METHOD__, and __FUNCTION__ that give you a template for robust log messages. You could write a log message like:
log::record('line 1200 -- key: ' . $key . ', with value: ' . $value);
As you continue developing, your hard-coded line number is going to quickly become meaningless as anything other than a search key. Code a log message that uses all the information and remains relevant for as long as the code is active:
log::record('(' . __FILE__ . ')[' . __METHOD__ . ']@' . __LINE__ . ': ' . $msg);
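If typing the constants at every call site gets tedious, one alternative is to let a helper look up the caller's location with PHP's debug_backtrace(). This is a sketch of that alternative, not what log::record does: format_log_line() and load_user() are hypothetical names. Note that the magic constants cannot simply be moved inside the helper, because they would then report the helper's own file and line.

```php
<?php
// Hypothetical helper: builds the same "(file)[method]@line: message" format,
// but finds the caller's location itself so call sites only pass the message.
function format_log_line(string $msg): string
{
    // Frame 0 carries the file and line where format_log_line() was called.
    // Frame 1, if present, names the function or method that made the call.
    $frames = debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 2);
    $file   = $frames[0]['file'] ?? 'unknown';
    $line   = $frames[0]['line'] ?? 0;
    $caller = $frames[1]['function'] ?? '{main}';
    if (isset($frames[1]['class'])) {
        $caller = $frames[1]['class'] . '::' . $caller;
    }
    return '(' . $file . ')[' . $caller . ']@' . $line . ': ' . $msg;
}

function load_user(int $id): string
{
    return format_log_line('loading user ' . $id);
}
```

The trade-off is a small run-time cost for the backtrace on every message, which matters little in development and is another reason to keep such logging behind a configuration switch.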
Differentiate between log levels -- I strongly urge you to read this article on the 10-Commandments of Logging. You may not agree with everything in the article, but I can almost guarantee you'll walk away with at least one resolution to change how you log.
Bonus Rule - Debuggers Save Your Butt
Part of the bonus of growing old in this relatively young field is the ability to reflect on the wonderful tools available today to programmers as opposed to what was commonly used twenty (or more) years ago. My first debugging tools were printf() and (sh) more. Then, when printers became wide-spread, dumping a 50,000-line C-code listing to fan-fold paper, a red felt-tip pen, and lots of corridor space became my debugger.
Modern debuggers are your friend, your lifesaving grace, the tool that convinces your boss you're a genius by allowing you to step through your code and detect problems while still in the development stage. (Before test-driven development for you purists.) Being able to step-through code, attach watches to variables, jump back in the stack, just makes things so much easier, it's almost like you're cheating.
If you don't develop using a debugger, for whatever reason, please take the time necessary to install and configure a debugging package into your IDE. The time you invest will be repaid quickly, and it only takes a couple of uses for you to start wondering how you ever coded without one.
Debugging is a learned skill. All the points I've mentioned (advocated) above won't help you if you don't know your system, your application, or basic rules of computer science. You can't approach the problem any other way than head-on, full-bore, with the intent of absolutely crushing the problem out of existence.
Arthur C. Clarke said it: Any sufficiently advanced technology is indistinguishable from magic.
Watching a good developer or system administrator solve a complex problem is just that - magic - to those of lesser skills. However, that feeling should serve you as a challenge to learn those skills necessary to join the wizardry ranks.
I've attempted to outline a few things that I've learned, some of which are ways of thinking and some of which are ways of doing, but when combined, hopefully they'll serve you as well as they serve me almost every day.