Data Mining Procedures
Step One: Preprocessing Techniques
Of all the data mining procedures, preprocessing is widely regarded as the most important and most difficult part.
Handling and processing different kinds of metabolomic data
Many kinds of metabolomic data were mentioned above. Processing techniques are therefore needed that handle each kind carefully, taking the nature of the data into account. Sennsichip Bioinformatics Analysis Platform, a software platform developed by our company that includes metabolomic data analysis, supports various data formats.
Normalization of data
Noise and background signal can arise when electrospray is used to ionize samples eluting from chromatography, so noise reduction and baseline correction techniques are needed. Many effective algorithms have been developed to handle these problems and to adjust different kinds of data appropriately, such as the LOWESS-based normalization technique.
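The two steps named above, noise reduction and baseline correction, can be illustrated with a minimal sketch. It is a deliberately simple stand-in (a moving average followed by rolling-minimum baseline subtraction), not LOWESS normalization and not the platform's own algorithm. The function names and window sizes are assumptions chosen for illustration.

```python
import numpy as np

def moving_average(signal, window=5):
    """Reduce high-frequency noise with a centered moving average (odd window)."""
    kernel = np.ones(window) / window
    # Pad with edge values so the output has the same length as the input.
    padded = np.pad(signal, window // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def rolling_min_baseline(signal, window=51):
    """Estimate a slowly varying baseline as the minimum in a sliding window."""
    half = window // 2
    padded = np.pad(signal, half, mode="edge")
    return np.array([padded[i:i + window].min() for i in range(len(signal))])

def preprocess(signal, smooth_window=5, baseline_window=51):
    """Smooth the trace, then subtract the estimated baseline."""
    smoothed = moving_average(np.asarray(signal, dtype=float), smooth_window)
    baseline = rolling_min_baseline(smoothed, baseline_window)
    return smoothed - baseline
```

The baseline window should be wider than the widest expected peak; otherwise the rolling minimum follows the peak itself and removes real signal.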
Identification and quantification of metabolites
Once the noise and background described above have been removed, peak alignment is needed to correct peak shifts caused by variation in the arrival time of compounds across multiple samples. Sennsichip Bioinformatics Analysis Platform has constructed a novel peak alignment algorithm. As an alternative approach, an algorithm that performs the alignment by clustering the retention times of the peaks corresponding to each compound has also been proposed. In addition, chromatography results can contain overlapping chromatographic peaks, and an algorithm is needed to identify each peak among them.
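The retention-time clustering idea can be sketched as follows. This is a minimal illustration, assuming a simple gap-based grouping with a fixed tolerance; it is not the platform's alignment algorithm, and the function name `align_peaks` and the `tolerance` parameter are hypothetical.

```python
def align_peaks(samples, tolerance=0.1):
    """Group peaks from several samples by retention time.

    samples: list of lists of retention times, one list per sample.
    Returns a list of clusters, each a dict holding the member peaks as
    (sample_index, retention_time) pairs and their mean retention time.
    """
    # Flatten to (retention_time, sample_index) pairs sorted by retention time.
    peaks = sorted((rt, i) for i, sample in enumerate(samples) for rt in sample)
    clusters = []
    for rt, sample_idx in peaks:
        # Extend the current cluster while the gap to the previous peak is small;
        # otherwise start a new cluster.
        if clusters and rt - clusters[-1]["members"][-1][1] <= tolerance:
            clusters[-1]["members"].append((sample_idx, rt))
        else:
            clusters.append({"members": [(sample_idx, rt)]})
    for cluster in clusters:
        times = [rt for _, rt in cluster["members"]]
        cluster["rt"] = sum(times) / len(times)
    return clusters
```

Each cluster's mean retention time can then serve as the aligned position for that compound. Because the grouping chains neighboring peaks, a tolerance that is too large can merge distinct compounds.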
Dimension Reduction Techniques
Once metabolic profile data have been obtained after proper preprocessing, the dimension of the data must be reduced to 2 or 3 so that the data can be viewed directly. A representative method for this purpose is PCA (principal component analysis), an unsupervised method; supervised methods are also available. Sennsichip Bioinformatics Analysis Platform uses PCA for dimension reduction and data visualization.
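As a minimal sketch of how PCA produces the 2-D coordinates for a scores plot, the following computes the principal components from the singular value decomposition of the mean-centered data. The function name `pca_scores` is an assumption for illustration, and the data matrix is assumed to have one row per sample and one column per metabolite feature.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples onto the first principal components.

    Returns the scores (n_samples x n_components) and the fraction of
    total variance explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    # Center each feature so the components describe variance around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]
```

Plotting the first score column against the second gives a scores plot. In practice, features are often scaled before PCA so that high-intensity metabolites do not dominate the components.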

Figure 3 PCA scores plot discriminating specimens from normal specimens based on marker metabolites.
Feature Analysis and Selection Techniques
A main characteristic of metabolomic data is the large number of features, so techniques for analyzing features and selecting among them are needed. Feature selection is also essential to avoid over-fitting to the given data and to preserve the generality of the classifiers that are generated. In addition, because feature selection can identify the group of metabolites most strongly associated with a particular research question (e.g. a disease), the findings can be used as biomarkers and applied in practice. Sennsichip Bioinformatics Analysis Platform has also worked on developing a new feature selection method based on a genetic algorithm, designed with the nature of metabolomic data in mind.
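To make the idea of genetic-algorithm feature selection concrete, here is a small generic sketch. It is not the platform's method: the nearest-centroid fitness, the penalty on subset size, and all names and parameter values are assumptions for illustration, and labels are assumed to be 0 or 1. Each individual is a boolean mask over the features.

```python
import numpy as np

def centroid_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier on the selected features."""
    Xs = X[:, mask]
    centroids = np.stack([Xs[y == c].mean(axis=0) for c in (0, 1)])
    dists = ((Xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == y).mean())

def fitness(X, y, mask, penalty=0.01):
    """Accuracy minus a small penalty that favors smaller feature subsets."""
    if not mask.any():
        return -1.0  # an empty feature set is never acceptable
    return centroid_accuracy(X, y, mask) - penalty * mask.mean()

def ga_select(X, y, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pop = rng.random((pop_size, n_features)) < 0.5
    pop[0] = True  # include the full feature set as a baseline individual
    best, best_fit = None, -np.inf
    for _ in range(generations):
        fits = np.array([fitness(X, y, ind) for ind in pop])
        if fits.max() > best_fit:
            best_fit, best = fits.max(), pop[fits.argmax()].copy()
        children = [best.copy()]  # elitism: carry the best individual forward
        while len(children) < pop_size:
            # Tournament selection: the fitter of two random individuals wins.
            a, b = rng.integers(pop_size, size=2)
            p1 = pop[a] if fits[a] >= fits[b] else pop[b]
            a, b = rng.integers(pop_size, size=2)
            p2 = pop[a] if fits[a] >= fits[b] else pop[b]
            # Uniform crossover followed by bit-flip mutation.
            child = np.where(rng.random(n_features) < 0.5, p1, p2)
            child ^= rng.random(n_features) < mutation_rate
            children.append(child)
        pop = np.array(children)
    return best, best_fit
```

The fitness here uses training accuracy only to keep the sketch short. To guard against the over-fitting discussed above, a real application would score each mask on held-out data, for example by cross-validation.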

Figure 4 compounds from multiple samples
Classification Techniques
From given metabolic data, we can generate diagnosis models using classification techniques, and then diagnose patients by applying their data to the generated models. There are a variety of classification algorithms, and in our view receiver operating characteristic (ROC) curves are a suitable way to assess and compare them.
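An ROC curve is built from the scores a classifier assigns to samples: each threshold on the score yields one pair of true positive rate and false positive rate. The sketch below computes those points and the area under the curve (AUC). The function names are assumptions for illustration; labels are assumed to be 0 (normal) or 1 (positive), with at least one sample of each.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """Return false and true positive rates for every distinct score threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Candidate thresholds: each distinct score, highest first.
    thresholds = np.unique(scores)[::-1]
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        predicted_pos = scores >= t
        tpr.append(np.sum(predicted_pos & (y_true == 1)) / n_pos)
        fpr.append(np.sum(predicted_pos & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

fpr, tpr = roc_curve_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc(fpr, tpr))  # prints 0.75
```

An AUC of 1.0 means the scores separate the two groups perfectly, while 0.5 corresponds to random ranking.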

Figure 5 the classification accuracies for sample data by ROC curve classifier