Improving Usability for Data Entry

Data analysis results are worthless unless the data is correctly collected and entered without error.

This page describes a number of best practices to improve the accuracy of data collection and entry into a computerized system (spreadsheet, web data entry, database or other computer system).These tips are in three groups below: Verification of Data Entered, More Complex Strategies, and Avoiding Manual Data Entry. Most add logic in the data entry system to detect probable errors. This can be as simple as additional formulas in a spreadsheet or more formal programming for a database or web entry.

To have a positive influence on data quality, the data entry verification and validation has to be designed to support the person, not fight against them. In short, do not force the person to guess about missing data or semi-legible entries. ...

Guiding Principles
  • It is important to implement any data validation in a manner that is useful for the person entering the data. If that person is directly entering data into a computer while making the actual observation, he/she is likely to be able to respond to a warning about data being out of range and correct the typographical error or re-observe the phenomena in question if the data itself is unreasonable.
  • On the other hand, someone entering data from data sheets prepared by someone else is less likely to be able to "fix" the problem during data during entry. The important thing in designing the data entry validation logic is to allow the person to save the data, not force the person to fix it immediately.
  • If the data sheet is meaningless with the problematic data, then it makes sense to put that sheet aside for later correction and entry. If the data is not critical, the suspicious value should be able to be saved in the system. The criteria used for data entry verification should be included in the QA/QC processing of the data after entry, so the suspect data would be identified at that time, presumably allowing resolution by people familiar with the observation and overall research/data collection protocol.
see more

Best Expert Tip

It is easier, faster, and less error prone if the layout of the data sheet corresponds to the layout of the data entry screen.


Verification of data entered

It is fastest to detect and correct errors as close to the source as possible. The following strategies can help the person entering the data to notice errors. These can be both data entry errors (typographical errors and other mistakes) as well as apparent problems with the source data (field notes, data forms, etc.).

Required Data

Check that data that is truly required is not missing. To achieve a higher level of data integrity, values entered for required field can be compared to “unknown” or similar terms entered in an attempt to circumvent the prohibition on missing or blank values. ... Only data that you are sure will always be known should be "required" in the sense of not allowing the rest of the data to be saved without the required data being entered. Typically data that meets this high standard for being classified as required is information such as the unique identifier for the observation. This can include the thing being observed (bird band identifier, water quality sampling location ID, transect ID) and whatever distinguishes this particular observation of that thing from all others. This distinguishing characteristic is typically includes date and time, or a unique, probably sequential, number assigned either when the observation was made or when it is entered into the system. see more

Important Data

Less critical but important information should just provide a reminder to enter the data instead of preventing the data from being saved.

Expected Format

Any data collection form and the corresponding data entry screen should clearly indicate the format that is expected. A simple example is that month, day, and year are in the expected sequence. Is time expressed in 24-hour format or are "AM" and "PM" used? A more elaborate example is geographic locations. ... At first glance, space to record latitude and longitude would seem sufficient. However, these are often represented as degrees-minutes-seconds, degrees-decimal_minutes, and decimal degrees. There are simple mathematical formulas to precisely convert between these, but which format the data is in must be known in order for later processing to be performed. Handling of east/west and north/south indications is also needed (for example, is west longitude indicated with a W before the numbers, a W after the numbers, or a negative value). For data that is going to be used for small scale GIS work, knowing the latitude/longitude is not sufficient without also knowing the datum that is the basis for the lat/lon. (A simple explanation of datum, is these are variations on approximating an a non-spherical Earth to spherical coordinates.) Conversion between datums is possible, but either the data must be entered only using one datum (generally preferred) or the datum used must also be included with each lat/lon. Just in case this is not enough to consider in storing location data, there are other systems for measuring location other than latitude and longitude. In the United States, "state plane" and Universal Transverse Mercator (UTM) are commonly used. Both are essentially distances north and east from a pre-defined origin. UTM distances are measured in meters but various distance units are used for state plane. Again, the geographic system must be consistent or the particular one recorded along with the location data. see more

List of Values

In many cases, data values will be from a pre-defined list of values. Examples include codes for location or site names (or identifiers), name (or ID) of person(s) making the observation, genus and species of living things observed, or US state abbreviations. It is better to avoid making the user have to type the code, risking typographical errors. ...

Allow them to choose from a list of allowable values. If the list is very long, forcing the person to scroll through the list is not efficient. The list could be subdivided (for example, choose animal or plant before getting the species list), or the list can automatically scroll as the person types the starting letters.

In a database system, it is generally best to keep the list of allowable entries in a table for each type of entry (e.g. a table for weather codes, another table for observers, etc.). Those tables (often called "look up tables" since they are frequently read and rarely changed) are easier to update (e.g. adding a new observer) than if the values are just written in the validation logic (so called, "hard coded").

Also modern databases (the "relational database model") can be used to ensure that values to be saved are already in the table of allowed values. In database terminology this concept could be stated as "enforcing referential integrity" for key values (unique identifies) in a table with the corresponding look up table.

see more

Range Checking - Strict

Range checks verify that values are between minimum and maximum valid values. Strict limits can be enforced for defined values, such as 60 minutes per hour. ... Seconds must be between 0 and 60, hours between 1 and 12 (or 0 and 23). Hours and minutes must be integers but seconds could have decimal places. Geographic degrees, minutes and seconds have similar range limits. Some range checks can be more complicated, such as validating dates which may require handling leap year in determining the last valid day for each month. Since the example ranges are defined values, there can never be a correct value that is not within the limits. Be careful to avoid setting strict range limits that even very rarely are exceeded. A perhaps extreme example and not generally applicable to environmental data is that a "leap second" can result in a minute with more than 60 seconds. see more

Range Checking - Moderate

This is range checking for values that are not rigidly defined, but are defined by reasonable ranges. For example, you can expect the temperature of liquid water to be between 0 and 100 degrees Celsius since it would normally not be liquid outside that range. However, if impurities and air pressure are considered, liquid water can exist outside of the 0 to 100 range.

Range Checking - Relaxed

This is the most common category. You can set reasonable ranges and then provide a notification to the user during data entry when a value is outside that range. ... For example, a unexpectedly high or low barometric pressure could be the result of transposing digits. You could warn when lat/lon values are outside the geographic range of the data collection area even if the values meet the degree, minute, and seconds each meet range checks limits. see more

Range Checking - Statistical

The range limits can be computed as some number of standard deviations from the mean value of prior observations. ... The standard deviation could be computed periodically or each time data is entered. For example, the user could be notified if the weight of something was more than 3 standard deviations from the mean. HOWEVER, there are still some correct observations that fall outside of 3 standard deviations of the mean. It is important to allow those values to be saved, so 3 standard deviations should not be used to prevent saving the observation record. To be more sophisticated, the statistics could be computed for different values of another observation.
A good example is outdoor temperature where using the annual mean and standard deviation to detect data entry errors is not as good as using 12 monthly means and standard deviations. Hot summer days and cold winter nights are more likely to be beyond 3 standard deviations based on the annual data, than 3 standard deviations for that month of the year.
see more

Consistency Checking

Consistency checking compares the values entered for more than one data field. The benefit is these help verify both data items in a way that provides more benefits than checking either value alone. ... Examples include testing that the ending time is after the starting time, northern geographic boundary is truly north of the southern boundary (not vice versa). Other physical relationships can be used for consistency checking. Air temperature, dew point and relative humidity can be calculated from the other other two values. Dissolved oxygen measurements in percent saturation and mg/liter are also related but require temperatures and barometric pressure to compute accurately. see more


More Complex Strategies

Check Digits

"Check digits" can be used to reduce errors when entering numeric unique identifies. Check digits are used in UPC (Universal Product Codes) on consumer products, credit card numbers, and the ISBN number for books. ... The theory behind check digits is that the value of the check digit is computed from the remaining digits. When the number along with the check digit is entered, the validity of the value can be tested by comparing the newly computed check digit with the value entered. If it matches, the number is valid, if not, there was a data entry error. Various implementation strategies are possible including printing a list of numbers with check digits embedded to be used to assign the next number and crossing the number off the list as it is used. Another strategy is to pre-print labels that have the numbers with check digit. see more

Double entry

A more labor intensive data entry technique is to have two people independently enter the same data and have the computer compare the data. This is based on the assumption that both people are not likely to make the same typographical error. It also provides a second opinion for interpretation of cryptic handwriting.


Avoiding manual data entry into the computer

These strategies reduce the need for a person to enter (or re-enter data).

Direct Download

Directly download data from the collection instruments. For example, locations stored in a GPS, values in a water quality meter, data-logger content. This requires planning to be able to match the electronic data with a manual observation. For example, the name (or ID number) for the GPS location ("waypoint" or "point of interest (POI)" needs to be recorded on the data sheet along with the rest of the observations.

Scanners

Reduce manual typing by using scanners reading bar-codes, RFID (radio frequency identification) tags, PIT (Passive Integrated Transponder) tags or other similar approaches to label samples, data sheets, etc. Bar codes can be pre-printed on adhesive labels and used as needed. Be sure the adhesive and ink are resistant to field/lab conditions. Alcohol-based preservatives are notorious for dissolving pen ink and adhesives.