Demystifying Multivariate Analysis
In order to identify sources of process variability, you need to integrate information ranging from raw material and intermediate measurements to processing and environmental data. Multivariate analysis techniques such as principal component analysis (PCA) and partial least squares (PLS) can be highly effective. Chris McCready, an engineering specialist with the software firm Umetrics, shows how MVA methods can be applied to improve your process IQ.
By Chris McCready, PAT Program Director, Umetrics
The FDA has recognized that significant regulatory barriers have inhibited pharmaceutical manufacturers from adopting state-of-the-art manufacturing practices. The Agency’s new risk-based approach to Current Good Manufacturing Practices (cGMPs) seeks to address the problem by modernizing the regulation of pharmaceutical manufacturing to enhance product quality and allow manufacturers to implement continuous improvement, leading to lower production costs.
This initiative’s premise is that, if manufacturers demonstrate that they understand their processes, they will reduce the risk of producing bad product. They can then implement improvements within the boundaries of their knowledge without the need for regulatory review and will become a low priority for inspections.
The FDA defines process understanding as:
- identification of critical sources of variability
- management of this variability by the manufacturing process
- ability to predict quality attributes accurately and reliably.
We must now ask ourselves what methods and tools are available to generate this process understanding. The analysis method(s) must be capable of integrating the various types of information generated during a full production cycle, including:
- raw material measurements
- processing data from various unit operations
- intermediate quality measurements
- environmental data
These data can then be used to identify which parameters have a critical impact on product quality.
Multivariate (MV) methods such as principal component analysis (PCA) and partial least squares (PLS) are ideal for analysis of the various types of data generated from pharmaceutical manufacturing processes.
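As a minimal illustration of the PCA side of this toolkit, the sketch below builds a tiny synthetic "process" dataset (the variable names and values are invented for illustration, not taken from any real process) and extracts principal components with the singular value decomposition:

```python
import numpy as np

# Synthetic process data: temperature and pressure are correlated,
# pH varies independently. All names/values are illustrative only.
rng = np.random.default_rng(0)
n = 50
temp = rng.normal(70, 2, n)
pressure = 0.5 * temp + rng.normal(0, 0.5, n)   # correlated with temp
ph = rng.normal(7, 0.1, n)
X = np.column_stack([temp, pressure, ph])

# Center and scale each column to unit variance (standard MVA pretreatment)
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD yields the principal components: scores T = U*S, loadings P = V^T
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = U * S
explained = S**2 / np.sum(S**2)    # fraction of variance per component

print(explained)  # the first component captures the temp/pressure correlation
```

Because two of the three variables move together, most of the variability collapses onto the first component; that compression of correlated variables is what makes PCA useful for visualizing high-dimensional process data.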
This article describes practical concepts for identifying critical sources of variability, calibrating robust and reliable predictive models, applying those models in real-time production and, ultimately, using the information generated to improve manufacturing processes.

The Challenge
The greatest hurdle involved in almost any analysis is generation, integration and organization of data. This is particularly true for the pharmaceutical industry, where data are often stored in vast warehouses but rarely, if ever, retrieved.
Past regulatory environments provided little incentive to analyze manufacturing processes, because implementing improvements required revalidation; the current condition of pharmaceutical data infrastructures reflects this.
As a result, a great deal of effort is required to assemble meaningful datasets. This challenge is further complicated by the fact that laboratory and production data are usually scattered in various disconnected databases.
Since product quality is influenced by all stages of production, including variability of the raw materials, one can only develop process understanding by uncovering the cumulative effect of all processing steps and their interactions. Integrating, synchronizing and aligning data from all relevant sources is therefore a prerequisite before analysis can begin.
Once datasets are available for analysis, mathematical tools are needed to interpret what the data are telling us. Multivariate techniques including PCA and PLS are efficient methods for visualizing the variability of data, modeling the correlation structure of operational data, calibrating predictive models of quality metrics and identifying critical sources of variability.
These MV methods are particularly useful for analysis of production data that contain large numbers of variables of different formats. For instance, production datasets will typically contain a set of raw material measurements, a number of processing steps, intermediate quality measurements and final product quality.
The raw, intermediate and final product quality measurements are sampled once per batch, while the production processing steps produce a table of time-series measurements sampled during production. The complete dataset for a number of batches is shown in Figure 1, where each row in an operating step represents the data generated for each batch.

Figure 1 -- A typical pharmaceutical production dataset including raw, in-process and final product quality measurements.
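One common way to assemble a dataset of this shape is "batch-wise unfolding": each batch's time-series measurements are flattened into a single row, which is then placed alongside the batch's static (once-per-batch) measurements. The sketch below shows the idea on hypothetical dimensions (3 batches, 4 time points, 2 process variables, 1 raw-material measurement); all sizes and values are illustrative:

```python
import numpy as np

# Hypothetical batch dataset: 3 batches, each a time series of 4 samples
# over 2 process variables, plus one raw-material measurement per batch.
n_batches, n_time, n_vars = 3, 4, 2
rng = np.random.default_rng(1)
process = rng.normal(size=(n_batches, n_time, n_vars))   # time-series data
raw_material = rng.normal(size=(n_batches, 1))           # static, per batch

# Batch-wise unfolding: flatten each batch's time series into one row,
# then append the per-batch static measurements alongside it.
unfolded = process.reshape(n_batches, n_time * n_vars)
batch_table = np.hstack([raw_material, unfolded])

print(batch_table.shape)  # one row per batch: (3, 1 + 4*2) = (3, 9)
```

The resulting table has one row per batch, which is the layout that batch-level PCA/PLS models operate on.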
Review of the types of data generated throughout the complete production cycle of a product yields a new set of challenges. These include a mix of real-time measurements, both univariate (temperature, pressure, pH) and multivariate (NIR or other spectroscopic methods), sampled during processing, as well as static data sampled from raw materials, intermediates and finished product.
MV methods are an excellent choice for analysis for many reasons. The greatest strength of PCA and PLS methods is their ability to extract information from large, highly correlated sets of data with many variables and relatively few observations. Models generated for prediction of quality attributes also provide information on which of the potentially thousands of variables are most highly correlated with quality. This is an important property for identification of critical parameters.
Other strengths include robustness to data with significant noise and missing values, and the ability to model not only the relationships between the X space (raw materials and in-process data) and the Y space (quality metrics), but also the internal correlation structure of X.
The ability to model the internal structure of the X space is of fundamental importance, because a prediction method is only valid within the range of its calibration. Modeling the X space provides a means for recognizing whether a new set of data is similar or different from the training set the model was built on.
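One standard way to recognize whether a new observation lies inside the calibration range is the Hotelling T-squared statistic on the PCA scores of the training (X) space. The sketch below fits a two-component PCA model to synthetic training data and compares a training-like observation with one far outside the calibration range; the data, limits and the ten-sigma "outlier" are illustrative assumptions:

```python
import numpy as np

# Fit a PCA model of the training X space (synthetic, with a built-in
# correlation between variables 0 and 1), then score new observations.
rng = np.random.default_rng(3)
n, p, k = 100, 4, 2                   # observations, variables, components kept
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)

mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
Xs = (X - mu) / sd
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
P = Vt[:k].T                          # loadings (p x k)
lam = (S[:k] ** 2) / (n - 1)          # variance of each score

def hotelling_t2(x_new):
    """Hotelling T^2 of a new observation against the training PCA model."""
    t = ((x_new - mu) / sd) @ P       # project into the model plane
    return float(np.sum(t**2 / lam))

inlier = X[0]                         # drawn from the training data
outlier = mu + 10 * sd                # far outside the calibration range
print(hotelling_t2(inlier), hotelling_t2(outlier))
```

In practice the T-squared value is compared against a statistical control limit derived from the training set; an observation exceeding the limit signals that the prediction model is being asked to extrapolate and its output should not be trusted.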