Chemometrics* is an umbrella term for a set of mathematical techniques applied to information that is of chemical interest. Similar techniques are applied to other fields of study, and the names reflect the interest (e.g., econometrics, sociometrics, psychometrics, etc.).
There’s a fairly wide range of application of the mathematical methods, although some of them have restricted utility. For example, Paul Gemperline, a professor at East Carolina University, uses mathematics to extract reaction rate constants from data measured from complicated mixtures of multiple successive reactions. However, the vast majority of interest and applications comes in the area of analytical chemistry, where the mathematical techniques are applied to data from analytical instruments, notably spectrometers.

*Editor’s Note: This is the first in a series of articles, Demystifying Chemometrics, that will address this subject.
Spectroscopy is a premier method of performing both qualitative and quantitative analysis, but it runs into difficulties when the sample is a complex mixture of ingredients. The ingredients typically have different spectra, and when measured individually, each is simple to recognize. When an unknown number of them are mixed together in unknown amounts, however, it can be difficult to determine which ingredients are present and how much of each.
Historically, this problem was solved by chemists doing laboratory separations. Using the chemical properties of the known or suspected materials to selectively cause reactions, chemists could weed out the other materials until only the desired analyte was left. Various forms of measurement of the quantity were used, but even here, sometimes the final measurement was made using spectroscopic methods.
More recently, various types of chromatography were used. The physical properties of the different ingredients were used to separate the analyte from the interfering materials, allowing identification and measurement. Again, various types of measurement detectors were used: thermal conductivity, flame detection, etc. Often, a spectrometer was incorporated into the chromatograph to measure the quantity of the analyte as it emerged from the chromatographic process.
Both wet chemical and chromatographic methods have limitations. They are tedious and time-consuming, require an expert chemist, and necessitate a continual supply of solvents and possibly other chemicals for the analysis. Along with the supply of chemicals comes a concomitant requirement for the disposal of those chemicals – and the time is long past when it was legally and socially permissible to just dump them down the drain!
Modern Spectroscopic Analysis
Modern spectroscopic methods have been developed to eliminate the need to physically separate the analyte from the other constituents in samples. By doing so, they eliminate the need for obtaining and disposing of the chemicals, provide for a much speedier analytical method that has a lower cost per analysis and can be performed by less-highly trained personnel. In addition, the analysis is non-destructive to the sample, and the amounts or concentrations of multiple ingredients can be measured using a single spectrum from a single sample. Since a computer is available to perform the required manipulations of the data, it can also serve as part of a process-control system: results can be sent immediately and directly to the process-control computer, or the analyzer’s computer can be programmed to perform the control functions itself.
In order to achieve these benefits, a good amount of “upfront” effort is needed. The chemometric algorithms must relate the properties of interest in the sample to the measured spectra.
Instead of separating the analyte from the matrix using a chemical or physical method and then measuring spectra of the relatively “clean” material, the spectrum of “dirty” material – that is, the analyte in its natural form and in its natural “environment” (the sample) – is measured, and then the spectral information needed to analyze the sample is extracted mathematically. These mathematical methods are “chemometrics.”
The chemometric methods are based on various computer algorithms, used to perform either quantitative or qualitative analysis. Using chemometric methods with spectroscopic measurements results in a fast analytical method that is simple enough for technicians to perform. Basically, the user “calibrates” or “trains” the instrument to perform the analysis. This training step should be done by an expert to ensure accuracy.
“Training the chemometrics” requires that the user obtain samples representative of all the future variations of those types of samples that the instrument will analyze during routine operation. Choosing the appropriate samples, and enough of them, is one of the key steps in creating a good calibration model, i.e., the mathematical model describing the samples that the computer will use in the future to do the actual analyses. Training the spectrometer requires measuring the spectra from all the samples and then saving them to a computer file.
If quantitative analysis is the goal (i.e., quantifying the amount of analyte in that type of material), then the actual values (sometimes called “true values” or “reference values”) of the analyte in each of those samples also must be measured. Typically, this measurement is made using the wet-chemistry or chromatographic methods that were already being used for the analysis.
Ensuring the accuracy of the values obtained by applying the reference method to the samples is also critical to the accuracy of the chemometric method. The computer only knows what it’s been “told,” so if erroneous reference values are used, then the chemometric method will produce those erroneous values for its answer.
One simple way to check the reference values is to have each reference sample measured twice by the reference laboratory, preferably under conditions where the reference lab doesn’t know which of the samples it gets comprise the repeat readings. Then, the values of the repeat pairs can be compared; if they agree then they are most likely accurate.
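The repeat-pair check might be sketched in a few lines of code. This is a minimal illustration only; the sample values, the tolerance, and the function name are invented for the example:

```python
import numpy as np

def check_repeat_pairs(first_reads, second_reads, tolerance):
    """Compare blind duplicate reference readings pair by pair.

    first_reads, second_reads: reference values for the same samples,
    each measured twice by the reference laboratory.
    tolerance: largest acceptable difference within a repeat pair.
    Returns the indices of samples whose repeats disagree.
    """
    first = np.asarray(first_reads, dtype=float)
    second = np.asarray(second_reads, dtype=float)
    differences = np.abs(first - second)
    return np.flatnonzero(differences > tolerance)

# Hypothetical protein reference values (%) for five samples, read twice.
run_1 = [12.1, 10.4, 11.8, 13.0, 9.7]
run_2 = [12.0, 10.5, 12.9, 13.1, 9.6]  # sample 3's repeats disagree

suspect = check_repeat_pairs(run_1, run_2, tolerance=0.5)
print(suspect)  # index 2 is flagged for re-analysis
```

Samples flagged this way would be sent back to the reference laboratory before being allowed into the calibration set.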
Very often, the measured spectra are not used “as-is,” but are subjected to various types of data transformations before being introduced into any calibration algorithm. The purpose of these transformations is to reduce or eliminate extraneous variations. Typical transformations include smoothing (averaging sets of adjacent spectral data points), computing a derivative (the first or second derivative of the spectrum with respect to wavelength is common), and any of several more specialized transformations of the data, such as applying the Kubelka-Munk function or performing Multiplicative Scatter Correction.
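Two of these transformations can be sketched compactly. The sketch below is a simplified illustration under stated assumptions: a plain boxcar average stands in for smoothing (commercial software typically uses Savitzky-Golay smoothing and derivatives instead), and Multiplicative Scatter Correction is written in its basic form, regressing each spectrum against the mean spectrum:

```python
import numpy as np

def smooth(spectrum, window=5):
    """Boxcar smoothing: replace each point by the average of its
    neighbors (edge points are distorted by the zero padding)."""
    kernel = np.ones(window) / window
    return np.convolve(spectrum, kernel, mode="same")

def msc(spectra):
    """Multiplicative Scatter Correction against the mean spectrum.

    Each spectrum x is fitted to the mean spectrum m as x ~ a + b*m,
    then corrected as (x - a) / b, removing the offset and slope
    variations attributed to scatter.
    """
    spectra = np.asarray(spectra, dtype=float)
    mean_spec = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(mean_spec, x, deg=1)  # slope, intercept
        corrected[i] = (x - a) / b
    return corrected
```

A first derivative can be approximated with `np.gradient` on the smoothed spectrum; the derivative removes additive baseline offsets much as MSC removes multiplicative ones.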
Typically, a careful analyst will inspect the spectral data both before and after performing any transformation to ensure that the data form a coherent whole, and that no spectra or samples appear to be outliers.
Performing the Calibration
This is the point where we get to what might be considered the “hard core” of chemometrics: the algorithms that relate the measured spectral data to the samples’ properties. There are numerous algorithms available for both qualitative and quantitative analysis, commonly described by an alphabet soup of abbreviations and acronyms. Some of the most common ones are:
Multiple Linear Regression (MLR, also called the “Inverse Beer’s Law” or “P-matrix” method) – This algorithm, initiated by Karl Norris in the early 1970s, is the one first applied to quantitative analysis using modern methods, and is the forerunner of all chemometric algorithms in use today. It is based on the “least squares” concept: data at a relatively small number of wavelengths selected from the spectrum are fitted to the reference values using mathematics that guarantee that the error (the difference between the values calculated by the chemometric model and the reference values, over a set of samples) is the smallest that can be produced from those data. However, in recent times it has fallen into disuse, partly because of difficulties in selecting the wavelengths to use for the calibration, and partly because of the perceived lack of “pizzazz” of this algorithm compared with other calibration methods.
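The inverse Beer’s law idea can be illustrated with synthetic data. Everything below is invented for the example – the three “selected wavelengths,” their sensitivities, and the noise level – but the fitting step is the genuine least-squares calculation the text describes:

```python
import numpy as np

# Synthetic Beer's-law data: absorbance at 3 selected wavelengths is a
# linear function of analyte concentration, plus a small intercept.
rng = np.random.default_rng(0)
n_samples = 30
concentration = rng.uniform(5.0, 25.0, n_samples)      # reference values
sensitivities = np.array([0.021, 0.013, 0.034])        # per wavelength
absorbance = np.outer(concentration, sensitivities) + 0.05
absorbance += rng.normal(0.0, 1e-4, absorbance.shape)  # instrument noise

# "Inverse" Beer's law: regress concentration ON absorbance (plus an
# intercept column), minimizing the summed squared calibration error.
design = np.column_stack([np.ones(n_samples), absorbance])
coeffs, *_ = np.linalg.lstsq(design, concentration, rcond=None)

predicted = design @ coeffs
rms_error = np.sqrt(np.mean((predicted - concentration) ** 2))
```

Once `coeffs` is in hand, predicting a new sample is a single dot product with its absorbances – which is why MLR calibrations were practical even on the modest computers of the 1970s.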
Principal Component Analysis (PCA) – This is an algorithm, also for quantitative analysis, that was developed to circumvent the difficulties encountered in selecting wavelengths when using MLR. It uses the entire measured spectrum of the samples, and tries to break that down into fundamental underlying spectral structures that can be related to the constituent composition, instead of data at individual, specified wavelengths. An analogy would be the use of the spectra of the ingredients of a mixture to recreate the spectrum of the mixture, except that in PCA, the “spectra” that are extracted from the data are abstract mathematical structures related to the spectra, as opposed to actual spectra of the ingredients. These abstract mathematical structures are themselves the result of a least-squares calculation, but they are designed to recreate the data spectra in a least-squares sense, rather than the constituent composition.
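A compact way to see this at work is to build synthetic mixtures of two “ingredient” spectra and decompose them. This is an illustration, not production code – the Gaussian “ingredient” bands and mixture fractions are invented, and real spectra would contain noise and more ingredients:

```python
import numpy as np

def pca(spectra, n_components):
    """Decompose mean-centered spectra into scores and loadings.

    The loadings are the abstract 'spectra' described above; the
    scores say how much of each loading is present in each sample.
    """
    spectra = np.asarray(spectra, dtype=float)
    mean_spec = spectra.mean(axis=0)
    centered = spectra - mean_spec
    # The SVD gives the least-squares-optimal rank-k reconstruction.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]
    return scores, loadings, mean_spec

# Mixtures of two Gaussian 'ingredient' bands: two components suffice.
wavelengths = np.linspace(0.0, 1.0, 50)
ingredient_a = np.exp(-((wavelengths - 0.3) ** 2) / 0.01)
ingredient_b = np.exp(-((wavelengths - 0.7) ** 2) / 0.02)
rng = np.random.default_rng(1)
fractions = rng.uniform(0.0, 1.0, (20, 2))
mixtures = fractions @ np.vstack([ingredient_a, ingredient_b])

scores, loadings, mean_spec = pca(mixtures, n_components=2)
reconstructed = scores @ loadings + mean_spec  # recreates the mixtures
```

Note that the two loadings are not the ingredient spectra themselves – exactly the “abstract mathematical structures” the text describes – yet together they recreate every mixture spectrum.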
Partial Least Squares (PLS) – This algorithm is actually very similar to PCA, the main difference being that during the least-squares calculations the ingredient composition is used, so that the resulting abstract “spectra” comprise a better way to relate to the composition than do the PCA abstract spectra.
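One common way to compute PLS is the NIPALS algorithm. The sketch below is a bare-bones single-analyte version (often called PLS1), written for illustration under the usual assumptions; the key difference from PCA is visible in the first line of the loop, where the reference values `y` steer the choice of each weight vector:

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """PLS1 via the NIPALS algorithm (one analyte).

    Unlike PCA, each weight vector w is chosen using the reference
    values y, so the extracted factors are tuned to predict y.
    Returns regression coefficients plus the means needed to
    predict new samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc                   # direction of covariance with y
        w /= np.linalg.norm(w)
        t = Xc @ w                      # scores for this factor
        p = Xc.T @ t / (t @ t)          # spectral loading
        qk = yc @ t / (t @ t)           # regression of y on the scores
        Xc = Xc - np.outer(t, p)        # deflate and repeat
        yc = yc - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, x_mean, y_mean

def pls1_predict(X, beta, x_mean, y_mean):
    return (np.asarray(X, dtype=float) - x_mean) @ beta + y_mean
```

The number of factors, `n_components`, plays the role that wavelength selection played in MLR, and choosing it (usually by cross-validation) is the main judgment call in building a PLS calibration.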
Mahalanobis Distance (MD) – This was the first general-purpose algorithm developed for performing qualitative analysis using spectroscopy. It provides a way to define a “distance” between spectra of various materials in such a way that materials that are similar will have similar spectra and will be closer together (in terms of that distance) than materials that are different. Thus, by training the algorithm to recognize several materials, an unknown sample can be identified as being one of the known materials by virtue of it being “close” to that material and “far away” from others. If a material is measured that the algorithm was not trained to recognize, then it will be “far away” from all of the known materials.
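The distance itself is straightforward to compute. This is a minimal sketch; in practice the calculation is usually carried out on a handful of principal-component scores rather than the raw spectra, because the covariance matrix of full spectra is singular and cannot be inverted:

```python
import numpy as np

def mahalanobis_distance(x, training_spectra):
    """Distance of a spectrum x from a group of training spectra.

    Unlike plain Euclidean distance, the inverse covariance matrix
    scales each direction by how much the training group naturally
    varies along it, so 'far away' means far relative to the
    group's own spread.
    """
    T = np.asarray(training_spectra, dtype=float)
    mean = T.mean(axis=0)
    cov = np.cov(T, rowvar=False)
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

To identify an unknown, one computes its distance to each trained material and takes the smallest – provided that smallest distance is still below a preset threshold; otherwise the sample is reported as not matching any known material.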
The basic Mahalanobis Distance algorithm can also be applied to the abstract spectra created by a Principal Component Analysis. The combination of these two algorithms is called the SIMCA algorithm, the acronym for “Soft Independent Modeling of Class Analogies.”
The previous discussion of chemometrics and its application to spectroscopic analysis is very general and could be applied (and indeed, is applied) to any industry. What distinguishes the pharmaceutical industry from most others is the fact that this industry is regulated by a government agency. This regulatory environment imposes strict requirements on the ability of the industry to show that everything it does is “suitable for the intended purpose,” and that includes the methods of chemical analysis that it uses to certify that its products remain safe and effective.
Therefore, no discussion of chemometrics could be complete without at least some mention regarding the validation of these methods. Chemometric analysis applied to NIR data was reported in the pharmaceutical industry as early as 1984 by John Rose of Squibb, and several other reports, notably by Rich Whitfield of Upjohn and Emil Ciurczak, when he worked for Sandoz, followed in short order. These studies were largely for qualitative analysis; a common application was for identifying raw ingredients on the loading dock.
Quantitative applications lagged until the FDA’s recent PAT initiative gave impetus to the development of these methods. Gary Ritchie, Emil Ciurczak and colleagues published a two-part article describing a recommended protocol for the validation of methods for quantitative analysis in J. Pharm. Biomed. Anal. in 2001 and 2003, when they worked for Purdue Pharma. It is expected that several companies will be validating and submitting these methods for FDA approval as part of their NDA applications in the near future.
Chemometric algorithms, in combination with modern computer technologies and rapid spectroscopic analysis, provide the basis for the modern-day development of methods of chemical analysis that are fast, inexpensive, simple to use, amenable to inclusion in and for control of automated processes, and environmentally friendly. What more could anyone ask?
About the Author
Howard Mark, Ph.D., heads up Mark Electronics, Suffern, N.Y. Founder of The NIR Research Corporation, he won the 2003 Eastern Analytical Symposium’s award for Achievement in NIR Spectroscopy. Mark created the first universal calibration, performed the first application of NIR to many industries, designed the first extended-range filter instruments and internal-reference double-beam monochromator instruments, and created the first qualitative analysis software for NIR.