Saturday, 25 August 2012

Errors – I made a mistake


(25th August 2012 Signing on at 10:54)
In “The Morning Pages” yesterday, I started writing something about errors in computer systems. In this instance I’m thinking about data errors: faults that get saved to the files or the database. This is something which (perhaps unfortunately) I know something about. At the time I wrote it, I thought that it was interesting. I’m going to continue with it here. I’m not referring to what I wrote yesterday; this is written from the start.

At the beginning I suppose that it is useful to define errors for this context and then divide them into groups so that I can write about each group independently. After that (if I can be bothered, or get that far) I can compare and contrast the different groups. The world can be divided into two sorts of people: those who divide it into parts and those who do not!

What is an error? In this context it is a value in the database (let’s assume we are using databases) which is incorrect and causes some kind of problem with subsequent processing. Although for the purposes of “stream of consciousness” I don’t usually use bullets, I am going to here. There are no rules here, so I’m not breaking my non-rule, just moving away from a convention which I have followed up till now. By “error” I am not including cases where a value has been entered into the database which complies with all the rules, but which does not actually match reality. This is an interesting case, which I think I will deal with separately.

What are the possible sources of these errors?
  • Hardware faults: A sector on the disk is damaged, so what is written there is not what is subsequently retrieved.
  • Outside events: Power spikes, radiation or other external events produce changes to the data as it is recorded.
  • Software errors: A fault in a programme results in something other than what was intended being recorded.
  • Imports: Data has been imported which by-passed normal validation.

Let’s look at the errors themselves. What is an initial classification? Some ways of looking at it are to consider how widely you will have to look in order to detect the error, at what level the error could be detected, and what you could do to correct the error once it has been detected.

First of all, there are errors which occur at the level of the value of a single field, in a single record. Of course all of these may be repeated and be caused by any of the root cause sources above. The simplest is where the value in the storage is something which is incompatible with the definition of the field. At its most extreme this should be detected by the operating system or the database manager. One or other of these will throw some kind of exception. Rectification is going to require work at a low level.

Next level up is when the value which is stored in the field is compatible with the definition which is used by the database manager, but is not valid according to the application which is using the data. An example of this is where a database contains an “integer” field which is used to hold an enumerated value (let’s say “1,2,3,4” are valid, but anything else is not). Of course someone will say that the database should be designed with field level validation to prevent this sort of thing. Yes, it probably should (although that introduces its own problems) but many are not.
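As a minimal sketch of what hunting for such a value might look like (the customer table, the status column and the valid values 1–4 are all invented for the illustration), something like this would do the finding:

import sqlite3

VALID_STATUSES = (1, 2, 3, 4)   # the enumerated values the application expects

def find_invalid_statuses(conn):
    # Return rows whose 'status' is outside the set the application accepts.
    placeholders = ",".join("?" * len(VALID_STATUSES))
    sql = ("SELECT id, status FROM customer "
           "WHERE status IS NULL OR status NOT IN (%s)" % placeholders)
    return conn.execute(sql, VALID_STATUSES).fetchall()

# Tiny in-memory demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO customer (id, status) VALUES (?, ?)",
                 [(1, 2), (2, 7), (3, None)])    # 7 and NULL are 'impossible' values
print(find_invalid_statuses(conn))               # -> [(2, 7), (3, None)]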

Supposing such data has got into the database, how will we find out? If we are lucky, then the application will have practised defensive programming and when the data is encountered something will throw an error. (Pause at 11:21 resumed 11:36) If we are not so lucky, then the application “will carry on regardless” and it is quite difficult to know what will happen. Let’s stick with the case that we can understand: the application, or the database, or the operating system throws an error which we catch. What do we do then?

The first thing to do is to get someone who understands the application to identify which module (term used very loosely) is throwing the error and identify what is causing the error to be thrown (if you see what I mean). The objective of this is to find a “pattern” which can be used to identify the offending data. At this stage, we are not really sure whether the problem is “data” or “program”. We could have a piece of faulty data, or we could have a program which is not handling a valid (but possibly unusual) piece of data.

Let’s assume for the time being that what we have is invalid data. This is something that the application should really be rejecting. There are now two things to be done: first, identify how widespread this problem is, and second, identify how the data came to be corrupt in the first place. Both of these objectives are addressed by finding all the records which suffer from this particular problem. This is done by using the “pattern” which describes the offending data to create a query to identify the affected records. Examining these records will tell us a number of things: is the problem widespread? Is there a pattern to the affected records? By pattern, I mean: do they all come from the same place? Are they all in the same state? Were they all updated at about the same time? The background information accumulated may help us to identify the root cause.
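A sketch of that “examine the affected records” step might look like the following; the field names (source, state, updated_at) and the sample rows are made up purely to show the shape of the exercise.

from collections import Counter

def profile_affected(rows):
    # Summarise where the offending records cluster: same source? same state?
    # updated on the same day?
    print(len(rows), "affected records")
    for label, key in (("source", lambda r: r["source"]),
                       ("state", lambda r: r["state"]),
                       ("updated on", lambda r: r["updated_at"][:10])):
        print(label, Counter(key(r) for r in rows).most_common(3))

# Hypothetical rows already pulled out by the detection query
profile_affected([
    {"source": "migration-2011", "state": "closed", "updated_at": "2011-06-01T10:02"},
    {"source": "migration-2011", "state": "closed", "updated_at": "2011-06-01T10:05"},
    {"source": "web",            "state": "open",   "updated_at": "2012-03-14T09:00"},
])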

The number of records affected is a vital piece of information. How are we going to fix the values? First of all: is it possible to infer the correct value? This can be difficult. At this stage it is well worth taking a moment to remember that with any “data correction” we are going to be “changing the record of reality”. If some action is taken which modifies data, then there should be a procedure for keeping a record of what changes were made (this often takes the form of a log in the database). If it is possible to identify “correct” values, then that is the preferred option. The next possible option is to set the data value to some kind of innocuous default, especially one which will prompt for an update the next time the record is opened. The last options are to mark the field, or maybe the whole record, as faulty.
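The logging discipline might be sketched like this; the correction_log table, its columns and the choice of 0 as the innocuous default are assumptions made up for the illustration, not a prescription.

import sqlite3
from datetime import datetime

def logged_correction(conn, table, row_id, column, new_value, reason, who):
    # Apply a manual data correction and log the before/after values.
    # (table and column are trusted names typed by the person doing the fix,
    #  never user input, so building the SQL by substitution is tolerable here.)
    old = conn.execute("SELECT %s FROM %s WHERE id = ?" % (column, table),
                       (row_id,)).fetchone()[0]
    conn.execute("UPDATE %s SET %s = ? WHERE id = ?" % (table, column),
                 (new_value, row_id))
    conn.execute("INSERT INTO correction_log VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                 (table, row_id, column, old, new_value, reason, who,
                  datetime.utcnow().isoformat()))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, status INTEGER)")
conn.execute("""CREATE TABLE correction_log
                (tbl, row_id, col, old_value, new_value, reason, who, at)""")
conn.execute("INSERT INTO customer VALUES (2, 7)")      # the 'impossible' value again
logged_correction(conn, "customer", 2, "status", 0,
                  "reset to innocuous default pending re-entry", "data fix team")
print(conn.execute("SELECT * FROM correction_log").fetchall())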

So much for faults with individual fields; what about faults which need a wider scope to detect them? An example of this is where the values of two or more fields are not compatible. This is usually associated with denormalisation in the database.

Let’s take two cases: simple denormalisation (the same field is repeated in more than one record) and derived attribute (where one attribute summarises other attributes).
Take the first one first. Someone has decided to repeat the same attribute in more than one place. Let’s not argue about whether this should have been done. Someone has decided to do it! First confirm that the values are always supposed to be the same. If they are, which is what I would expect, then the pattern to use for detecting the error is “where are they different?”
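As a sketch, suppose (purely hypothetically) that a customer’s postcode is repeated on each of their orders; the “where are they different?” pattern is then just a join that compares the two copies:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, postcode TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, postcode TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'AB1 2CD')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 'AB1 2CD'), (11, 1, 'XY9 9ZZ')])   # the second copy has drifted

# The 'pattern' is simply: where do the two copies disagree?
divergent = conn.execute("""
    SELECT o.id, o.postcode AS copy, c.postcode AS master
    FROM orders o JOIN customer c ON c.id = o.customer_id
    WHERE o.postcode <> c.postcode
""").fetchall()
print(divergent)   # -> [(11, 'XY9 9ZZ', 'AB1 2CD')]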

Now the pattern for handling this is pretty much the same as for an error in a single field. Find out how many records are affected, examine the affected records looking for any pattern, and try to identify the root cause (in my experience, a root cause which is sometimes overlooked is data which has been migrated into the system).

Now the question is “what do we do about it?” Well, obviously (?) correct the root cause. As for correcting the data, is there one of the records which can be regarded as the “master”? For denormalisation, there should be. If there isn’t, then this indicates a (possibly serious) fault in the application design. If there is, then update the denormalised copy to match the master. Remember that all these data changes which are done outside the application should be logged.
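Continuing the hypothetical postcode example, and assuming the same correction_log table as in the earlier logging sketch, the repair itself might look something like this:

from datetime import datetime

def repair_from_master(conn, who="data fix team"):
    # Copy the master value back over every drifted denormalised copy,
    # logging each change so the edit to the "record of reality" is traceable.
    drifted = conn.execute("""
        SELECT o.id, o.postcode, c.postcode
        FROM orders o JOIN customer c ON c.id = o.customer_id
        WHERE o.postcode <> c.postcode""").fetchall()
    for order_id, bad, good in drifted:
        conn.execute("UPDATE orders SET postcode = ? WHERE id = ?", (good, order_id))
        conn.execute("INSERT INTO correction_log VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                     ("orders", order_id, "postcode", bad, good,
                      "denormalised copy resynchronised from customer master",
                      who, datetime.utcnow().isoformat()))
    conn.commit()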

Now let’s look at the case of a field which represents a derived value. A typical example of this would be where we had a transaction which consisted of individual line items. The transaction record contains an attribute which holds the total value of the individual line items. It might also contain the number of line items. An error would occur when the total in the “header” did not have the same numerical value as the sum of the individual “line items”.

Before we go any further, it is worth pointing out that there is a special case root cause which may be at work here. That is the possibility of “rounding” or other “arithmetic” related errors.
The pattern to use (breaking off 12:23, resuming 12:42 after lunch) for detecting this kind of error is that the value calculated from the individual items is not equal to the value in the header record. Now there is another implicit trap here: if we use the database manager to calculate the total of the individual items, then we are probably not using the same code as the application, so be careful if “rounding” or similar problems are involved.
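A sketch of that comparison, using Decimal and a small tolerance so that “rounding” noise does not drown out the genuine faults; the one-penny tolerance is an assumption for the illustration, not a recommendation:

from decimal import Decimal

def totals_disagree(header_total, line_amounts, tolerance=Decimal("0.01")):
    # Flag a header whose stored total differs from the sum of its line items
    # by more than a small tolerance (to allow for historic rounding).
    recalculated = sum((Decimal(str(a)) for a in line_amounts), Decimal("0"))
    return abs(Decimal(str(header_total)) - recalculated) > tolerance

# Hypothetical header totals and line-item amounts pulled from the database
print(totals_disagree("10.00", ["3.34", "3.33", "3.33"]))   # False: they agree
print(totals_disagree("10.00", ["3.34", "3.33"]))           # True: something is wrong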

Having identified the records involved, the next steps are as before: examine the records, identify the root cause and develop a strategy for correcting the affected records. Someone is almost certain to suggest that all totals should be reset to equal the sum of their parts. I know I would consider this as a correction strategy, but it may not be correct! Why not? Consider the following two possibilities: a line item was never saved into the items table, so the header total is correct but a detail record is missing; or a record has been saved twice into the individual items table, so the detail is inflated. In either case recalculating the header from the items would make things worse. Both these problems would be insidious, and difficult to fix reliably. Possible strategies for dealing with them involve careful examination of the records and of the code doing the work, in parallel.
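The “saved twice” case in particular is fairly easy to sniff out with a grouping query, or as here with a few lines of Python; the field names are invented for the illustration:

from collections import Counter

def suspicious_duplicates(line_items):
    # Line items that appear more than once with identical business keys -
    # possibly saved twice, rather than being a genuine repeated order line.
    counts = Counter((item["order_id"], item["product"], item["amount"])
                     for item in line_items)
    return [key for key, n in counts.items() if n > 1]

lines = [
    {"order_id": 1, "product": "widget", "amount": 250},
    {"order_id": 1, "product": "widget", "amount": 250},   # the same line twice
    {"order_id": 1, "product": "gadget", "amount": 175},
]
print(suspicious_duplicates(lines))   # -> [(1, 'widget', 250)]
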
Now let’s cast the scope wider: are there cases where we have data errors which are even more widespread? Well, obviously there are, but (until I can think of a better way) I would handle them as variants of the “derived attribute” class. The patterns used to detect them become progressively more complicated and the correction strategies become harder.

Can “human error” be detected? Possibly it can be detected, but not always reliably. What we would have to do is develop a pattern for what constituted “human error”. This might be supported by other factors which are not specifically related to the data itself like “particular operators”, “particular locations” and “particular times”.  

Now, this is where it gets interesting for me. I’ve come to the end of what I wanted to say, but I want to write more. This is where the free association comes in. The concept of data can be extended to “memory” and system to the mind, or data can be extended to DNA and system can be extended to living things. Does the speculation above have any validity? I think it does; moreover, in a strange metaphysical kind of way what I am trying to do at the moment is perform exercises which reshape thinking and habit. I want to get into the habit of writing a certain amount at a sitting. I’m very good at planning what I write, but I don’t always get round to doing the actual writing. Here I am forcing myself to perform the act of writing (well, typing if you must) with the intent of developing the habit. Repetition is at the heart of all habit. Habit implies repetition and repetition trains habit. What are the factors which determine strength of learning? Primacy, Recency, Repetition and I think there is something else as well. Of course we have to bring in conditioning factors like reward and punishment and triggers as well.

Back to data errors (see what I mean about rambling?) Programs are data too! I can extend the thinking to other forms of data which are not exclusively record based, such as some files and, of course, programs.
Let’s look at files first. What about a word processing file: is it possible to detect (and correct) errors in one? Well of course it is! I do it all the time, and at least some of the time Word helps me. A very small portion of the time, Word hinders me, but that is another matter. Remember, right now, my objective is to train myself to write a certain amount. I’m good at planning, I’m good at editing, but I want to get better at writing.

How do I detect errors in a word processing file? In different ways, at different levels: I use Word to tell me when words are spelled incorrectly. Where a word is underlined in red, I can use Word to tell me what it thinks the correct spelling is, look it up in the dictionary, or completely ignore what Word is telling me. Word and I are working together. Word does the checking, but I am in control. This is similar to the “single field” class of errors which can be found in a database. Of course, we know in advance what the root cause of these errors is: my typing is a bit dodgy, as is my spelling.

A similar, in fact almost identical, approach applies when we extend the scope outwards to sentences and paragraphs. This is a more complex situation which does not really have a corresponding level in the database record model. Word contains a model of English grammar. I actually have no idea how good Word’s model of grammar is. I am sure it is better than mine! Where Word flags a sentence as “needing its grammar looked at” I can look at what Word suggests. Most of the time I need to make the corrections myself. These can be simple things, like adjusting the ends of sentences, but they can be more complex. In either case Word is helping me to write. I am looking at the screen and concentrating on what I am writing. Writing in this way I can write much more quickly.

One of my objectives here is to reach a “Flow state”. That implies that I try to reduce the amount of interruption. After all, almost by definition, interruption interrupts flow!

What Word cannot do is help me with the sense of the sentences or especially the paragraphs. What I write is sense or non-sense because of what I think, not because of the way it is structured.

What about programs? Is it possible to detect errors in a program? Well of course it is; I do it all the time. It is what de-bugging is all about. But what about in the general sense, as data? Here I think we are into more awkward territory. Some kinds of error are detectable. For example, these days syntactic errors are detected by the parsing editors. If they are not detected during editing, then they are detected immediately by the compiler. Data errors and references to the wrong objects are likewise detected, but what about “the sense”? I am much more doubtful about that. Sense can only be checked by comparing the results of the program with the original specification.

If I extend that idea to writing for entertainment or even technical writing, then I get the idea that there should be a “specification”. Now I suppose ultimately, there is a debate about whether the specification can be right or wrong. I guess it just “is”, in the sense that it exists. For correctness to come into play, we have to have more than one expression of the same idea. In coding this takes the form (or perhaps the forms) of: the specification, the code itself, and the tests. This is the reason why many people argue for using the tests as the specification. Apart from efficiency (why have three artefacts when you only need two?), there is the point that tests are generally expressed more precisely than specifications (at least in coding).
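To make the “tests as specification” point concrete, here is a toy example; the function, the amounts and the tests themselves are all invented, and the tests are the only written statement of what the function is supposed to do.

def total_of(line_items):
    # Return the order total, summing integer line-item amounts (in pence).
    return sum(amount for _, amount in line_items)

def test_total_equals_sum_of_line_items():
    assert total_of([("widget", 250), ("gadget", 175)]) == 425

def test_empty_order_totals_zero():
    assert total_of([]) == 0

if __name__ == "__main__":
    test_total_equals_sum_of_line_items()
    test_empty_order_totals_zero()
    print("specification satisfied")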

Hmm. I have to be careful that I don’t talk myself round in circles here. I observed something interesting just then. One was an instance where Word could not tell which word I intended to use. Even from context, it did not seem to be able to tell whether “two” or “tow” was the word that I intended. Of course to a human reader that sort of typographical error is fairly obvious. It is also a case where over-reliance on a word-processor spell checker can lead you astray if you allow the machine to make corrections itself.
I remember when I used to exploit the features of the terminal hardware so that I could read spelling corrections (“highlighted”) out of context of the rest of the script (which was in normal brightness). That allowed me to work very quickly and accurately. Strange that using a monochrome terminal was sometimes preferable to using a colour terminal.

When I reach the end of the 5th page of text I’m just going to stop. Remember, this is primarily an exercise in training. I’m not deliberately trying to write nonsense but the content is much less important than the activity. There is a reason that I am keeping this stuff, and that is that I want to be able to mine the “free association” aspect in the future. I’m not sure what that will yield but it seems like a potentially interesting experiment. Having the text online means that I may be able to use it in some unexpected and unplanned (at least for the time being) way.

I am going to separate this “rambling” stuff from the “reflective” stuff. I’m not sure I care who reads the ramblings (although I may change my mind) but I feel that I care about the reflection. Today seems to have been quite successful. You may find this strange but I do not expect to read what I have written for quite some time.  

And that seems like quite an appropriate place to end. I’ve produced five pages of ramblings. I started with a topic and then went on from there. For a while I am going to take that approach, but I may develop it in different ways. I expect to experiment with all sorts of different things. One of the things I will probably steer away from is formatting. Put simply: “I don’t care”. This is about writing, not reading! (stopped at 13:49)
