(25th August 2012 Signing on at 10:54)
In “The Morning Pages” yesterday, I started writing
something about errors in computer systems. In this instance I’m thinking about
data errors, faults that get saved to the files or the database. This is
something which (perhaps unfortunately) I know something about. At the time I
wrote it, I thought that it was interesting. I’m going to continue with it
here. I’m not referring back to what I wrote yesterday; this is written from the start.
At the beginning I suppose that it is useful to define
errors for this context and then divide them into groups so that I can write
about each group independently. After that (if I can be bothered, or get that
far) I can compare and contrast the different groups. The world can be divided
into two sorts of people: those who divide it into parts and those who do not!
What is an error? In this context it is a value in the
database (let’s assume we are using databases) which is incorrect and causes
some kind of problem with subsequent processing. Although for the purposes of
“stream of consciousness” I don’t usually use bullets, I am going to here.
There are no rules here, so I’m not breaking my non-rule, just moving away from
a convention which I have followed up till now. In “error” I am not including cases where a value has been entered into the database which complies with all the rules, but which does not actually match reality. This is an interesting case, which I think I will deal with separately.
What are the possible sources of these errors?
- Hardware faults: A sector on the disk is damaged, so what is written there is not what is subsequently retrieved.
- Outside events: Power spikes or radiation produce changes to the data as it is recorded.
- Software errors: A fault in a program results in something other than what was intended being recorded.
- Imports: Data has been imported which bypassed normal validation.
Let’s look at the errors themselves. What is an initial
classification? One way of looking at it is to consider how widely you will have to look in order to detect the error, at what level the error could be detected, and what you could do to correct the error once it has been detected.
First of all there are errors which occur at the level of
the value of a single field, in a single record. Of course all of these may be repeated and may be caused by any of the root-cause sources above. The simplest is
where the value in the storage is something which is incompatible with the
definition of the field. At its most extreme this should be detected by the
operating system or the database manager. One or other of these will throw some
kind of exception. Rectification is going to require work at a low level.
Next level up is when the value which is stored in the field
is compatible with the definition which is used by the database manager, but is
not valid according to the application which is using the data. An example of
this is where a database contains an “integer” field which is used to hold an
enumerated value (let’s say “1,2,3,4” are valid, but anything else is not). Of
course someone will say that the database should be designed with field level
validation to prevent this sort of thing. Yes, it probably should (although
that introduces its own problems) but many are not.
Supposing such data has got into the database, how will we
find out? If we are lucky, then the application will have practiced defensive
programming and when the data is encountered something will throw an error.
(Pause at 11:21, resumed 11:36) If we are not so lucky, then the application
“will carry on regardless” and it is quite difficult to know what will happen.
Let’s stick with the case that we can understand: the application, or the
database or the operating system throws an error which we catch. What do we do
then?
The first thing to do is to get someone who understands the application
to identify which module (term used very loosely) is throwing the error and
identify what is causing the error to be thrown (if you see what I mean). The
objective of this is to find a “pattern” which can be used to identify the
offending data. At this stage, we are not really sure whether the problem is “data”
or “program”. We could have a piece of faulty data, or we could have a program
which is not handling a valid (but possibly unusual) piece of data.
Let’s assume for the time being that what we have is invalid
data. This is something that the application should really be rejecting. There
are now two things to be done: first, identify how widespread this problem is,
and second, identify how the data came to be corrupt in the first place. Both of
these objectives are addressed by finding all the records which suffer from this
particular problem. This is done by using the “pattern” which describes the
offending data to create a query to identify the affected records. Examining
these records will tell us a number of things: is the problem widespread? Is
there a pattern to the affected records? By pattern, I mean: do they all come
from the same place? Are they all in the same state? Are they updated about the
same time? The background information accumulated may help us to identify the
root cause.
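To make that concrete, here is a small sketch using the earlier enumerated-value example. The table and column names are invented purely for illustration, and I’m using SQLite from Python only because it is convenient; the point is that the “pattern” becomes a query you can run against the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A hypothetical table where an integer column holds an enumerated value
# (1, 2, 3 and 4 are valid).  There is no CHECK constraint, which is why the
# bad data could get in; "CHECK (status IN (1, 2, 3, 4))" would have been
# the field-level validation mentioned above.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status INTEGER NOT NULL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 2), (2, 4), (3, 9)])

# The detection pattern: every record whose value lies outside the valid set.
bad_rows = conn.execute(
    "SELECT id, status FROM orders WHERE status NOT IN (1, 2, 3, 4)"
).fetchall()
print(bad_rows)   # [(3, 9)]
```

Examining those rows (who created them, when, and from where) is what gives you the background information mentioned above.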
The number of records affected is a vital piece of
information. How are we going to fix the values? First of all: Is it possible
to infer the correct value? This can be difficult. At this stage it is well
worth taking a moment to remember that with any “data correction” we are going
to be “changing the record of reality”. If some action is taken which modifies
data, then there should be a procedure for keeping a record of what changes were
made (this often takes the form of a log in the database). If it is possible to
identify “correct” values, then that is the preferred option. The next possible
option is to set the data value to some kind of innocuous default, especially
one which will prompt for an update the next time the record is opened. The
last options are to mark the field or maybe the whole record as faulty.
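If the “innocuous default” route is taken, the essential part is the log. Here is a minimal sketch of what I mean; the fix-log table, the default value of 0 and the table names are all my own inventions for illustration, not a prescription.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status INTEGER NOT NULL)")
conn.execute("""
    CREATE TABLE data_fix_log (
        fixed_at TEXT, table_name TEXT, record_id INTEGER,
        field TEXT, old_value TEXT, new_value TEXT, reason TEXT
    )
""")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 2), (3, 9)])

UNKNOWN = 0   # an innocuous default the application treats as "please re-enter"

# Log the old value before changing anything, then apply the correction.
for record_id, old_status in conn.execute(
        "SELECT id, status FROM orders WHERE status NOT IN (1, 2, 3, 4)").fetchall():
    conn.execute(
        "INSERT INTO data_fix_log VALUES (?, ?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), "orders", record_id,
         "status", str(old_status), str(UNKNOWN), "invalid enumerated value"))
    conn.execute("UPDATE orders SET status = ? WHERE id = ?", (UNKNOWN, record_id))
conn.commit()
```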
So much for faults with individual fields; what about faults
which need a wider scope to detect them? An example of this is where the values
of two or more fields are not compatible. This is usually associated with
denormalisation in the database.
Let’s take two cases: simple denormalisation (the same field
is repeated in more than one record) and derived attribute (where one attribute
summarises other attributes).
Take the first one first. Someone has decided to repeat the
same attribute in more than one place. Let’s not argue about whether this
should have been done. Someone has decided to do it! First confirm that the
values are always supposed to be the same. If they are, which is what I would
expect, then the pattern to use for detecting the error is “where are they
different?”
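A sketch of that detection pattern, with a made-up pair of tables in which a customer name has been copied onto each order (the names and structure are purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "customers" is the master; "customer_name" has been denormalised onto orders.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, customer_name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Smith"), (2, "Jones")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, "Smith"), (11, 2, "Jonse"), (12, 1, "Smith")])

# "Where are they different?" -- the copy does not match the master.
mismatches = conn.execute("""
    SELECT o.id, o.customer_name, c.name
    FROM orders o JOIN customers c ON c.id = o.customer_id
    WHERE o.customer_name <> c.name
""").fetchall()
print(mismatches)   # [(11, 'Jonse', 'Jones')]
```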
Now the pattern for handling this is pretty much the same as
an error in a single field. Find out how many records are affected, examine the
affected records looking for any pattern and try and identify the root cause
(in my experience, a root cause which is sometimes overlooked is data which has
been migrated into the system).
Now the question is “what do we do about it?” Well,
obviously (?) correct the root cause. As for correcting the data, is there one
of the records which can be regarded as the “master”? For denormalisation, there should be. If there isn’t, then this indicates a (possibly serious) fault in the application design. If there is, then update the denormalised copy to
match the master. Remember that all these data changes which are done outside
the application should be logged.
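A sketch of that correction, using the same invented tables as above. In practice every row touched would also be written to the fix log first, as in the earlier sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, customer_name TEXT)")
conn.execute("INSERT INTO customers VALUES (2, 'Jones')")
conn.execute("INSERT INTO orders VALUES (11, 2, 'Jonse')")

# Bring each out-of-step copy back into line with its master record.
conn.execute("""
    UPDATE orders
    SET customer_name = (SELECT name FROM customers c WHERE c.id = orders.customer_id)
    WHERE customer_name <> (SELECT name FROM customers c WHERE c.id = orders.customer_id)
""")
conn.commit()
print(conn.execute("SELECT * FROM orders").fetchall())   # [(11, 2, 'Jones')]
```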
Now let’s look at the case of a field which represents a derived value. A typical example of this would be where we had a transaction which consisted of individual line items. The transaction record contains an attribute which holds the total value of the individual line items. It might also contain the number of line items. An error would occur when the total in the “header” did not have the same numerical value as the sum of the individual “line items”.
Before we go any further, it is worth pointing out that
there is a special case root cause which may be at work here. That is the
possibility of “rounding” or other “arithmetic” related errors.
The pattern to use (breaking off 12:23, resuming 12:42 after lunch) for detecting this kind of error is that the value calculated from the individual items is not equal to the value in the header record. Now there is another implicit trap here: if we use the database manager to calculate the total of the individual items, then we are probably not using the same code as the
application, so be careful if “rounding” or similar problems are involved.
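Here is a sketch of the detection pattern, with invented header and line-item tables; the tolerance is there precisely because of the “rounding” trap, so that tiny arithmetic differences are not reported as corruption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn_header (id INTEGER PRIMARY KEY, total REAL, item_count INTEGER)")
conn.execute("CREATE TABLE txn_item (id INTEGER PRIMARY KEY, txn_id INTEGER, amount REAL)")
conn.execute("INSERT INTO txn_header VALUES (1, 30.00, 2)")
conn.executemany("INSERT INTO txn_item VALUES (?, ?, ?)",
                 [(1, 1, 10.00), (2, 1, 10.00)])   # the items only add up to 20.00

TOLERANCE = 0.005   # differences smaller than this are treated as rounding, not corruption
suspects = conn.execute("""
    SELECT h.id, h.total, COALESCE(SUM(i.amount), 0) AS item_total
    FROM txn_header h LEFT JOIN txn_item i ON i.txn_id = h.id
    GROUP BY h.id, h.total
    HAVING ABS(h.total - COALESCE(SUM(i.amount), 0)) > ?
""", (TOLERANCE,)).fetchall()
print(suspects)   # [(1, 30.0, 20.0)]
```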
Having identified the records involved, the next steps are
as before: examine the records, identify the root cause and develop a strategy
for correcting the affected records. Someone is almost certain to suggest that
all totals should equal the sum of their parts. I know I would consider this as
a correction strategy, but it may not be correct! Why not? Consider the following two possibilities: a line item has failed to be saved, so the header total is actually correct but an item record is missing; or a line-item record has been saved twice, so recalculating the header from the items would double-count it. Both these problems would be insidious,
and difficult to fix reliably. Possible strategies for dealing with them
involve careful examination of the records and the code doing the work in
parallel.
Now let’s cast the
scope wider: are there cases where we have data errors which are even more widespread? Well, obviously there are, but (until I can think of a better way) I would
handle them as variants of the “derived attribute” class. The patterns used to
detect them become progressively more complicated and correction strategies
become harder.
Can “human error” be detected? Possibly it can be detected,
but not always reliably. What we would have to do is develop a pattern for what
constituted “human error”. This might be supported by other factors which are
not specifically related to the data itself, like “particular operators”, “particular
locations” and “particular times”.
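As a sketch of what that might look like (the “suspect_records” table here is just a stand-in for whatever the detection query produced, and the operators and locations are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE suspect_records (
    id INTEGER PRIMARY KEY, operator TEXT, location TEXT, updated_at TEXT)""")
conn.executemany("INSERT INTO suspect_records VALUES (?, ?, ?, ?)", [
    (1, "op17", "Leeds",  "2012-08-20 14:05"),
    (2, "op17", "Leeds",  "2012-08-20 14:40"),
    (3, "op03", "London", "2012-08-21 09:12"),
])

# Do the suspect records cluster around particular operators, locations or hours?
clusters = conn.execute("""
    SELECT operator, location, substr(updated_at, 1, 13) AS hour, COUNT(*) AS n
    FROM suspect_records
    GROUP BY operator, location, hour
    ORDER BY n DESC
""").fetchall()
print(clusters)
```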
Now, this is where it gets interesting for me. I’ve come to
the end of what I wanted to say, but I want to write more. This is where the
free association comes in. The concept of data can be extended to “memory” and
system to the mind, or data can be extended to DNA and system can be extended to living things. Does the speculation above have any validity? I think it
does, moreover, in a strange metaphysical kind of way what I am trying to do at
the moment is perform exercises which reshape thinking and habit. I want to get
into the habit of writing a certain amount at a sitting. I’m very good at
planning what I write, but I don’t always get round to doing the actual
writing. Here I am forcing myself to perform the act of writing (well typing if
you must) with the intent of developing the habit. Repetition is at the heart
of all habit. Habit implies repetition and repetition trains habit. What are the
factors which determine strength of learning? Primacy, Recency, Repetition and
I think there is something else as well. Of course we have to bring in
conditioning factors like reward and punishment and triggers as well.
Back to data errors (see what I mean about rambling?)
Programs are data too! I can extend the thinking to other forms of data which
are not exclusively record based, such as some files and, of course, programs.
Let’s look at files first. What about a word processing file: is it possible to detect (and correct) errors in a word processing file? Well
of course it is! I do it all the time, and at least some of the time Word helps
me. A very small portion of the time, Word hinders me, but that is another
matter. Remember, right now, my objective is to train myself to write a certain
amount. I’m good at planning, I’m good at editing but I want to get better at
writing.
How do I detect errors in a word processing file? In
different ways, at different levels: I use Word to tell me when words are
spelled incorrectly. Where a word is underlined in red, I can use Word to tell me what it thinks the correct spelling is, I can look it up in the dictionary, or I can completely ignore what Word is telling me. Word and I are
working together. Word does the checking, but I am in control. This is similar
to the “single field” class of errors which can be found in a database. Of
course, we know in advance what the root cause of these errors is: my typing is
a bit dodgy, as is my spelling.
A similar, in fact almost identical approach applies when we
extend the scope outwards to sentences and paragraphs. This is a more complex
situation which does not really have a corresponding level in the database
record model. Word contains a model of English grammar. I actually have no idea
how good Word’s model of grammar is. I am sure it is better than mine! Where Word flags a sentence as “needing its grammar looked at”, I can look at what
Word suggests. Most of the time I need to make the corrections myself. These
can be simple things, like adjusting the ends of sentences, but can be more
complex. In either case Word is helping me to write. I am looking at the screen
and concentrating on what I am writing. Writing in this way I can write much
more quickly.
One of my objectives here is to reach a “Flow state”. That
implies that I try and reduce the amount of interruption. After all, almost by definition, interruption interrupts flow!
What Word cannot do is help me with the sense of the
sentences or especially the paragraphs. What I write is sense or non-sense
because of what I think, not because of the way it is structured.
What about programs? Is it possible to detect errors in a
program? Well of course it is, I do it all the time. It is what debugging is all about. But what about in the general sense, as data? Here I think we are
into more awkward territory. Some kinds of error are detectable. For example,
these days syntactic errors are detected by the parsing editors. If they are not detected during editing, then
they are detected immediately by the compiler. Data errors and references to the wrong objects are likewise detected, but what about “the sense”? I am much more
doubtful about that. Sense can only be checked by comparing the results of the
program with the original specification.
If I extend that idea to writing for entertainment or even
technical writing, then I get the idea that there should be a “specification”.
Now I suppose ultimately, there is a debate about whether the specification can
be right or wrong. I guess it just “is”, in the sense that it exists. For
correctness to come into play, we have to have more than one expression of the
same idea. In coding this takes the form (or perhaps the forms) of: the
specification, the code itself, and the tests. This is the reason why many
people argue for using the tests as the specification. Apart from efficiency
(why have three artefacts when you only need two?), there is the point that
tests are generally expressed more precisely than specifications (at least in
coding).
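A tiny example of what I mean by a test being more precise than a prose specification. The function and the values are invented; the point is only that the test pins down rounding and the empty case in a way the sentence does not.

```python
import unittest

def invoice_total(line_items):
    # Prose spec: "the total is the sum of the line items, to two decimal places".
    return round(sum(line_items), 2)

class InvoiceTotalSpec(unittest.TestCase):
    def test_total_is_rounded_sum_of_items(self):
        self.assertEqual(invoice_total([1.10, 2.20, 3.30]), 6.60)

    def test_empty_invoice_totals_zero(self):
        self.assertEqual(invoice_total([]), 0)

if __name__ == "__main__":
    unittest.main()
```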
Hmm. I have to be careful that I don’t talk myself round in
circles here. I observed something interesting just then: an instance where Word could not tell which word I intended to use. Even from context, it did
not seem to be able to tell whether “two” or “tow” was the word that I intended
to use. Of course to a human reader that sort of typographical error is fairly
obvious. It is also a case where over-reliance on a word-processor spell
checker can lead you astray if you allow the machine to make corrections
itself.
I remember when I used to exploit the features of the
terminal hardware so that I could read spelling corrections (“highlighted”) out
of context of the rest of the script (which was in normal brightness). That
allowed me to work very quickly and accurately. Strange that using a monochrome
terminal was sometimes preferable to using a colour terminal.
When I reach the end of the 5th page of text I’m
just going to stop. Remember, this is primarily an exercise in training. I’m
not deliberately trying to write nonsense but the content is much less
important than the activity. There is a reason that I am keeping this stuff,
and that is that I want to be able to mine the “free association” aspect in the
future. I’m not sure what that will yield but it seems like a potentially
interesting experiment. Having the text online means that I may be able to use
it in some unexpected and unplanned (at least for the time being) way.
I am going to separate this “rambling” stuff from the “reflective”
stuff. I’m not sure I care who reads the ramblings (although I may change my
mind) but I feel that I care about the reflection. Today seems to have been
quite successful. You may find this strange but I do not expect to read what I
have written for quite some time.
And that seems like quite an appropriate place to end. I’ve
produced five pages of ramblings. I started with a topic and then went on from
there. For a while I am going to take that approach, but I may develop it in
different ways. I expect to experiment with all sorts of different things. One
of the things I will probably steer away from is formatting. Put simply “I don’t care”. This is about writing, not reading! (stopped
at 13:49)