
Machine learning is the creation and development of algorithms that allow a machine to learn on its own, gradually improving its behavior over time. This learning is the more effective, the more representative the features used to describe the problem are of the underlying dataset. An important objective is therefore the correct selection (and, where possible, reduction of the number) of the most relevant features, which is typically carried out through dimensionality reduction tools such as Principal Component Analysis (PCA), which in the most general case is non-linear. In this work, an approach to the computation of the reduced PCA space is proposed through the definition and implementation of suitable artificial neural network models, which yields an accurate and at the same time flexible reduction of the dimensionality of the problem.

The term machine learning [

Today machine learning technologies [

When it comes to machine learning, you don’t necessarily have to think of robotics, autonomous driving or the games won by DeepMind [

The effective use of machine learning techniques depends strongly on the correct modelling of the problem by the researcher, who must be able to capture the fundamental characteristics that allow an effective implementation of the predictive model. If the selected features are excessive with respect to the available cases, the “power’’ [

Therefore, it is of fundamental importance to reduce the number of features, obviously without losing the model’s informative capacity too much. This reduction is usually made through the use of mathematical techniques to reduce the size of the problem such as the Principal Component Analysis (PCA) [

In this paper an approach to feature reduction by non-linear PCA is presented, this being the most general case. In our approach, the determination of the reduced space of components is done by setting up appropriate artificial neural network models, of hierarchical or symmetrical type, so as to arrive at the calculation of the main components through the progressive self-learning typical of a neural network.

In the following sections, the general principles of artificial neural networks and PCA are briefly discussed; then the method of calculating PCA through appropriate neural network models is presented, and the results are discussed, as well as the conclusions and future directions of research.

Learning by example plays a fundamental role in human understanding (in newborns, for example, learning proceeds by imitation and by trial and error): the learner learns from specific cases, not from general theories. In essence, learning from examples is a process of reasoning that leads to the identification of general rules from observations of specific cases (inductive inference).

There are two typical characteristics of the process of learning from examples: first, the knowledge learnt is more compact than the equivalent form with explicit examples, and therefore requires less memory capacity; second, the knowledge learnt contains more information than the examples observed (being a generalization, it is applicable also to cases never observed). In inductive inference, however, starting from a set of true or false facts, we arrive at the formulation of a general rule that is not necessarily always correct: in fact, a single false assertion is sufficient to exclude a rule. An inductive system therefore offers the possibility of automatically generating knowledge that can be false. The frequency of errors depends strongly on how the set of examples on which the system is trained was chosen, and on how representative it is of the universe of possible cases.

Artificial neural networks (ANNs) [

Neural networks are non-linear structures that can simulate complex relationships between inputs and outputs that other analytical functions cannot represent. External signals are received and processed by a set of input nodes, in turn connected to multiple internal nodes (organized into layers): each node processes the signals it receives and transmits the result to the following nodes. As the network is trained on data, connections between neurons are strengthened and the output gradually forms well-defined patterns that the machine can use to make decisions.

Largely abandoned during the winter of artificial intelligence, neural networks are now at the centre of most projects focused on artificial intelligence and machine learning in particular [

The functioning of a neural network can be schematically outlined in two phases: the “training’’ (learning) phase and the “testing’’ (recognition) phase. In the learning phase the network is instructed on a sample of data taken from the set of data that will then be processed; in the testing phase, which is then the normal operating phase, the network is used to process the input data based on the configuration reached in the previous phase.

As for implementations, even though neural networks could be realized as autonomous hardware structures, computer simulations are generally used, since they allow even substantial modifications in a short time and at limited cost. However, the first neural chips [

Pattern recognition is currently the area of greatest use of neural networks. It consists in the classification of objects of the most varied nature in classes defined a priori or automatically created by the application based on the similarities between the objects in input (in this case we speak of clustering).

To perform classification tasks through a computer, real objects must be represented in numerical form and this is done by performing, in an appropriate way, a modeling of reality that associates each object with a pattern (vector of numerical attributes) that identifies it. This first phase is called feature extraction [

More formally, suppose we need to classify a pattern $p = (p_1, \dots, p_n)$ into a class belonging to the set $C = \{c_1, \dots, c_k\}$. Given the input pattern $p$, the classifier outputs the binary vector $z = (z_1, \dots, z_k)$, where $z_i = 1$ if the pattern belongs to class $c_i$ and $z_i = 0$ otherwise.
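As a minimal illustration of this encoding (our sketch in NumPy; the number of classes and the assigned index are arbitrary), the output vector $z$ for a pattern assigned to class $c_2$ out of $k = 4$ classes would be:

```python
import numpy as np

# Hypothetical example: k = 4 classes, pattern assigned to class c_2 (index 1)
k = 4
assigned = 1
z = np.zeros(k, dtype=int)
z[assigned] = 1
print(z)  # [0 1 0 0]
```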

Neural networks can be effectively used as classifiers thanks to their ability to learn from examples and generalize. The idea is to let the neural network learn (through special training algorithms) the correct classification of a representative sample of patterns, and then make the same network work on the set of all possible patterns. At this point we distinguish two different types of learning: supervised and unsupervised.

In “supervised learning’’, the set of patterns on which the network must learn (the training set) is accompanied by a set of labels giving the correct classification of each pattern. The network then adjusts its structure and internal parameters (connection weights and thresholds) until it classifies the training patterns correctly. Given the generalization capabilities mentioned above, the network will also work correctly on patterns outside of, and independent from, the training set, provided that the training set itself is sufficiently representative.

In “unsupervised learning’’, a set of labels cannot be associated with the training set. This can happen for various reasons: the corresponding classes can be simply unknown and not obtainable manually or only inaccurately or slowly or, again, the a-priori knowledge could be ambiguous (the same pattern could be labeled differently by different experts). In this type of learning, the network tries to organize the patterns of the training set into subgroups called clusters [

These two different approaches to learning give rise to the different types of neural networks [

The features obtained during the extraction phase are rarely used directly as input for classification; some transformation is usually necessary to facilitate the classification of the patterns. One of the most frequent problems to solve is reducing the pattern dimensionality (the number of features) in order to make machine learning algorithms more efficient and faster.

Increasing the number of features measured on the objects to be classified generally improves network performance because, intuitively, more information is available on which to base learning. In reality this is true only up to a point, after which the performance of the network tends to decrease (more misclassifications are obtained). This is because we are forced to work with a limited set of data: increasing the dimensionality of the pattern space thins out our training set, which becomes a poor representation of the distribution. Larger sets would be needed (the growth must be exponential in the dimension), slowing down the training process for only marginal improvements. This problem is known in the literature as the curse of dimensionality. A network with few inputs is preferable because it has fewer adaptive parameters to determine, so even small training sets suffice; the result is a faster network with greater generalization capacity. The problem then is to choose, among the available features, those to keep and those to discard, losing as little information as possible. PCA helps us here.

Principal Component Analysis is a statistical technique whose aim is to reduce the dimensionality of patterns; it is based on the selection of the most significant features, i.e. those that carry the most information. It is used in many fields and under different names: Karhunen-Loève expansion, Hotelling transformation, signal subspace approach, etc.

Given a statistical distribution of data in an M-dimensional space, this technique examines the properties of the distribution and determines the components that maximize the variance or, equivalently, minimize the representation error. These components are called “principal components’’ and are linear combinations of the random variables, with variances given by the eigenvalues (and directions by the eigenvectors) of the covariance matrix of the distribution. For example, the first principal component is a normalized linear combination with maximum variance, the second principal component has the second largest variance, and so on. Geometrically, PCA corresponds to a rotation of the coordinate axes into a new coordinate system such that the projection of the points on the first axis has maximum variance, the projection on the second axis the second largest variance, and so on (see

Mathematically, PCA is defined as follows. Consider an M-dimensional vector $p$ drawn from some distribution centered around the mean $E(p) = 0$, and let $X = E(p p^T)$ be the distribution’s covariance matrix. The i-th principal component of $p$ is defined as $v_i^T p$, where $v_i$ is the normalized eigenvector of $X$ corresponding to the i-th largest eigenvalue $\lambda_i$. The subspace spanned by the eigenvectors $v_1, \dots, v_L$, with $L < M$, is called the PCA subspace of dimension L. We now examine the mathematical and neural methods for computing the principal components of a distribution.

Consider N points in an M-dimensional space (which could be a feature space), denoting each of them by $p_i = [p_{i1}, \dots, p_{iM}]^T$, $i = 1, \dots, N$, and suppose, without loss of generality, that $E(p_i) = 0$. We can represent the generic vector $p_i$ as a linear combination of a set of M vectors:

$$p_i = \sum_{j=1}^{M} a_{ij} v_j \qquad (1)$$

where the $a_{ij}$ are coefficients such that $E(a_{ij}) = 0$ as $i$ varies, while the $v_j$ are orthonormal vectors ($v_i^T v_j = \delta_{ij}$) with $v_j = [v_{j1}, \dots, v_{jM}]^T$, $j = 1, \dots, M$. If we define $a_i = [a_{i1}, \dots, a_{iM}]^T$, $i = 1, \dots, N$, then Equation (1) can be expressed in matrix form as $p_i = V a_i$, where $V$ is the matrix whose columns are the vectors $v_j$. In this case the explicit expression for the vectors $a_i$ is $a_i = V^T p_i$.
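This basis expansion can be verified numerically. The sketch below (our illustration, assuming NumPy; the basis is an arbitrary orthonormal matrix obtained from a QR factorization) checks that $a_i = V^T p_i$ exactly reconstructs $p_i = V a_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5

# Any orthonormal basis V (columns v_j), here from a QR factorization
V, _ = np.linalg.qr(rng.standard_normal((M, M)))
assert np.allclose(V.T @ V, np.eye(M))   # v_i^T v_j = delta_ij

p = rng.standard_normal(M)               # a generic data vector
a = V.T @ p                              # coefficients a_i = V^T p_i
p_rec = V @ a                            # reconstruction p_i = V a_i
assert np.allclose(p_rec, p)
```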

Suppose we want to reduce the dimension of the space from M to L, with $L < M$, losing as little information as possible. The first step is to rewrite Equation (1) as:

$$p_i = \sum_{j=1}^{L} a_{ij} v_j + \sum_{j=L+1}^{M} a_{ij} v_j \qquad (2)$$

and then replace all the $a_{ij}$ (for $j = L+1, \dots, M$) with constants $k_j$, so that each initial vector $p_i$ can be approximated by a new vector $\bar{p}_i$ defined as:

$$\bar{p}_i = \sum_{j=1}^{L} a_{ij} v_j + \sum_{j=L+1}^{M} k_j v_j \qquad (3)$$

In this way we obtain a dimensionality reduction, since the second sum is constant and therefore each M-dimensional vector $p_i$ can be approximately expressed by an L-dimensional vector $a_i$. Let us now see how to find the basis vectors $v_j$ and the coefficients $k_j$ that minimize the loss of information. The error on $p_i$ resulting from the dimensionality reduction is:

$$p_i - \bar{p}_i = \sum_{j=L+1}^{M} (a_{ij} - k_j) \, v_j \qquad (4)$$

We can then define a function $E_L$ that computes the sum of the squared errors as follows:

$$E_L = \frac{1}{2} \sum_{i=1}^{N} \| p_i - \bar{p}_i \|^2 = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=L+1}^{M} (a_{ij} - k_j)^2 \qquad (5)$$

where we used the orthonormality relation. Setting the derivative of $E_L$ with respect to $k_j$ equal to zero, we obtain:

$$k_j = \frac{1}{N} \sum_{i=1}^{N} a_{ij} = 0 \qquad (6)$$

since we assumed the $a_{ij}$ satisfy $E(a_{ij}) = 0$. The error function can then be rewritten as:

$$E_L = \frac{1}{2} \sum_{j=L+1}^{M} \sum_{i=1}^{N} a_{ij}^2 = \frac{1}{2} \sum_{j=L+1}^{M} v_j^T \left[ \sum_{i=1}^{N} p_i p_i^T \right] v_j = \frac{1}{2} \sum_{j=L+1}^{M} v_j^T X v_j \qquad (7)$$

where the first step follows from $a_{ij} = v_j^T p_i$, and $X$ is the covariance matrix of the distribution, defined as:

$$X = \sum_{i=1}^{N} (p_i - E(p_i))(p_i - E(p_i))^T = \sum_{i=1}^{N} p_i p_i^T \qquad (8)$$

having set $E(p_i) = 0$. It now remains to minimize the function $E_L$ with respect to the choice of the basis vectors $v_j$.

The minimum is attained when the basis vectors satisfy the condition $X v_j = \lambda_j v_j$, i.e. when they are eigenvectors of the matrix $X$ with eigenvalues $\lambda_j$. Note also that, since the covariance matrix is real and symmetric, its eigenvectors can be chosen orthonormal, as required. Returning to the analysis of the error function, we find:

$$E_L = \frac{1}{2} \sum_{j=L+1}^{M} \lambda_j \qquad (9)$$

so the minimum error is obtained by discarding the $M - L$ smallest eigenvalues and their corresponding eigenvectors, and keeping the L largest, whose normalized eigenvectors build the matrix V. The procedure for computing the principal components is shown in
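The derivation above can be checked numerically. The sketch below (our illustration, assuming NumPy; data and dimensions are arbitrary) computes the eigendecomposition of $X = \sum_i p_i p_i^T$, projects onto the L principal eigenvectors, and verifies that the error of Equation (5) equals half the sum of the discarded eigenvalues, as in Equation (9):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, L = 200, 6, 2

# Correlated synthetic data, centered so that E(p_i) = 0
P = rng.standard_normal((N, M)) @ rng.standard_normal((M, M))
P -= P.mean(axis=0)

X = P.T @ P                               # covariance matrix, Equation (8)
lam, V = np.linalg.eigh(X)                # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]            # reorder to descending

A = P @ V[:, :L]                          # L-dimensional coefficients a_i
P_bar = A @ V[:, :L].T                    # approximated vectors p_bar_i

E_L = 0.5 * np.sum((P - P_bar) ** 2)      # error function, Equation (5)
assert np.isclose(E_L, 0.5 * lam[L:].sum())   # Equation (9)
```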

Principal Component Analysis can also be carried out through a neural network in which the weight vectors of the neurons converge, during the learning phase, to the principal eigenvectors $v_j$ ($j = 1, \dots, L$). Such networks use Hebbian learning: the value of a synaptic connection entering a neuron is increased if and only if the input and the output of the neuron are active simultaneously. They consist of a layer of M input neurons whose sole task is to pass the inputs on to the next layer, and a layer of L output neurons fully connected to the previous one. The weights of each output neuron form an M-dimensional weight vector representing an eigenvector. Feedback connections exist during learning: if the output of each neuron reaches all output neurons indiscriminately as input, we have a symmetric network; if instead the neurons are ordered so that each neuron sends its output to itself and to the neurons with higher indices, we have a hierarchical network (see

This network carries out Hebbian (and therefore unsupervised) learning. The synaptic modification law, however, cannot simply be the standard Hebbian rule:

$$w_j(t+1) = w_j(t) + \eta \, z(t) \, p_j(t) \qquad (10)$$

where $p_j(t)$, $w_j(t)$ and $z(t)$ are, respectively, the value of the j-th input, the j-th weight and the output of the network at time t (the network is assumed to consist of a single neuron), while $\eta$ is the learning rate. Direct application of this rule would make the network unstable. Oja [

$$w_j(t+1) = w_j(t) + \eta \, z(t) \left[ p_j(t) - z(t) \, w_j(t) \right] \qquad (11)$$

where $\eta \, z(t) \, p_j(t)$ is the usual Hebbian increment, while $-z(t) \, w_j(t)$ is the stabilizing term that keeps the sum

$$\sum_{j=1}^{M} \left( w_j(t) \right)^2 \qquad (12)$$

bounded and close to 1 without any explicit normalisation. The Oja rule can be generalized to networks with multiple output neurons, obtaining the two algorithms in
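Oja's single-neuron rule (Equation (11)) can be sketched as follows (our illustration, assuming NumPy and synthetic zero-mean data whose dominant direction is the first axis; the learning rate and sample size are arbitrary). After one pass, the weight vector aligns, up to sign, with the principal eigenvector and keeps an approximately unit norm without explicit normalization:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, eta = 4, 4000, 0.01

# Synthetic zero-mean data whose principal eigenvector is the first axis
scales = np.array([3.0, 1.0, 0.5, 0.2])
P = rng.standard_normal((N, M)) * scales

w = rng.standard_normal(M)
w /= np.linalg.norm(w)

for p in P:                       # one pass of Oja's rule, Equation (11)
    z = w @ p                     # single-neuron output z = w^T p
    w += eta * z * (p - z * w)    # Hebbian increment plus stabilizing term

# w is now close to +/- e_1, with an approximately unit norm
```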

PCA emerges as an excellent solution to several problems of information representation, including:

• Maximization of the variances of linear transformations, or of the outputs of a linear network, under orthonormality constraints;

• Minimization of the mean square error when the input data are approximated using a lower-dimensional linear subspace;

• Decorrelation of the outputs after an orthonormal transformation;

• Minimization of the entropy of the representation.

At the same time, the PCA network has some limitations that make it less attractive:

• The network can only realize linear input-output mappings;

• The eigenvectors can be computed much more efficiently using standard numerical techniques;

• The principal components take into account only the covariances of the data, which completely characterize only Gaussian distributions;

• The network is not able to separate independent subsignals from their linear combinations.

For these reasons, it is interesting to study non-linear generalizations of PCA, or learning algorithms derived from generalizing the optimization problem of standard PCA. They can be divided into two classes: robust PCA algorithms (paragraphs 6.1 and 6.2) and non-linear PCA algorithms in the strict sense (Section 7). In the former, the criterion to be optimized is characterized by a function that grows more slowly than the quadratic function, and the constraints are the same as those of the standard PCA (the neuron weight vectors must be mutually orthonormal); non-linearity appears only at certain points. In non-linear PCA algorithms, instead, all neuron outputs are non-linear functions of their input. It is also interesting to note that, while standard PCA needs some form of hierarchy to obtain the principal components themselves (the symmetric algorithm obtains only linear combinations of the principal components), in the non-linear generalizations the hierarchy is less important, since the non-linear function breaks the symmetry during the learning phase [

The standard quadratic problem leading to a PCA solution can also be posed as maximizing the output variances $E[z_i^2] = E[w_i^T p p^T w_i] = w_i^T X w_i$ of a linear network under orthonormality constraints. This problem is not well defined until the M-dimensional weight vectors of the neurons are constrained in some way. In the absence of a priori knowledge, orthonormality constraints are the most natural, because they measure variances along directions that differ maximally from each other.

In hierarchical networks, the i-th weight vector $w_i$ is constrained to have unit norm and to be orthogonal to the vectors $w_j$ ($j = 1, \dots, i-1$). Mathematically, this is expressed as $w_i^T w_j = \delta_{ij}$ for $j \le i$. The optimal vector $w_i$ is then the i-th principal eigenvector $v_i$ of the covariance matrix $X$, and the outputs of the PCA network become the principal components of the data vectors. The same problem can be solved with symmetric networks by imposing the constraint $w_i^T w_j = \delta_{ij}$ for all $i$ and $j$; in matrix form, $W^T W = I$, where $W = [w_1, \dots, w_L]$ and $I$ is the identity matrix. If we now consider the output $z$ of the linear PCA network, the problem can be expressed compactly as maximization of:

$$E(\| z \|^2) = \mathrm{tr}(W^T X W) \qquad (13)$$

The optimal solution in this case is any orthonormal basis of the PCA subspace; it is therefore not unique. The problem of maximizing the variance under symmetric orthonormality constraints thus leads to symmetric networks, the so-called PCA subspace networks.

Let us now consider the generalization of the variance maximization problem that leads to robust PCA. Instead of the quadratic criterion used above, we maximize a more general average:

$$E[c(p^T w_i)] \qquad (14)$$

The function $c(t)$ must be a valid cost function that grows more slowly than the square, at least for large values of t. In particular, we assume that $c(t)$ is even, non-negative, almost everywhere continuous and differentiable, and that $c(t) \le t^2/2$ for large values of $|t|$. In addition, its only minimum is attained at $t = 0$, and $c(t_1) \le c(t_2)$ if $|t_1| \le |t_2|$. Valid cost functions are $\ln(\cosh(\theta t))$ and $\tanh^2(\theta t)$, where $\theta$ is a scaling factor that depends on the range of the input values. The criterion to maximize for each weight vector $w_i$ is then:

$$G(w_i) = E[c(p^T w_i)] + \sum_{j=1}^{l(i)} \lambda_{ij} \left[ w_i^T w_j - \delta_{ij} \right] \qquad (15)$$

In the summation, the Lagrange coefficients $\lambda_{ij}$ impose the necessary orthonormality constraints.
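The requirements on $c(t)$ listed above can be checked numerically for the first of the two cost functions mentioned, $\ln(\cosh(\theta t))$. A minimal sketch, assuming NumPy and $\theta = 1$:

```python
import numpy as np

theta = 1.0
c = lambda t: np.log(np.cosh(theta * t))       # candidate robust cost function
f = lambda t: theta * np.tanh(theta * t)       # its derivative

t = np.linspace(-10.0, 10.0, 2001)
vals = c(t)

assert np.allclose(vals, c(-t))                # even
assert c(0.0) == 0.0 and np.all(vals >= 0)     # unique minimum at t = 0
assert np.all(vals <= t ** 2 / 2 + 1e-12)      # grows more slowly than t^2/2
assert np.allclose(f(t), -f(-t))               # derivative is odd ...
assert np.all(np.diff(f(t)) >= 0)              # ... and non-decreasing
```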

Both the hierarchical and the symmetric problem can be treated under the general criterion G. In the standard symmetric case the upper limit of the summation index is $l(i) = L$; in the hierarchical case it is $l(i) = i$. The optimal weight vector of the i-th neuron then defines the robust counterpart of the i-th principal eigenvector $v_i$. The gradient of $G(w_i)$ with respect to $w_i$ is:

$$d(i) = \frac{\partial G(w_i)}{\partial w_i} = E[p \, e(p^T w_i)] + 2 \lambda_{ii} w_i + \sum_{j=1, j \ne i}^{l(i)} \lambda_{ij} w_j \qquad (16)$$

where $e(t)$ is the derivative $\partial c(t) / \partial t$ of $c(t)$. At the optimum, the gradient must vanish for $i = 1, \dots, L$. In addition, differentiation with respect to the Lagrange coefficients yields the orthonormality constraints $w_i^T w_j = \delta_{ij}$.

A gradient descent algorithm maximizing Equation (14) is obtained by inserting the estimate $d(i)$ of the gradient vector (Equation (16)) into the weight update step, which becomes:

$$w_i(t+1) = w_i(t) + \eta \, d_i(t) \qquad (17)$$

To obtain standard instantaneous gradient estimates, the expectations are simply omitted and the instantaneous values of the quantities in question are used instead. The update then becomes:

$$w_i(t+1) = w_i(t) + \eta \left[ I - \sum_{j=1}^{l(i)} w_j(t) w_j(t)^T \right] p(t) \, f[p(t)^T w_i(t)] \qquad (18)$$

Returning to the cost function, the assumptions made about it imply that its derivative $f(t)$ (written $e(t)$ in Equation (16)) is a non-decreasing odd function of t. For stability it is required that, at least, $f(t) \le 0$ for $t < 0$ and $f(t) \ge 0$ for $t > 0$. If we define the instantaneous representation error vector as:

$$e_i(t) = p(t) - \sum_{j=1}^{l(i)} \left( p(t)^T w_j(t) \right) w_j(t) = p(t) - \sum_{j=1}^{l(i)} z_j(t) \, w_j(t) \qquad (19)$$

we can write the weight update step (Equation (18)) compactly as in

$$W(t+1) = W(t) + \eta \left( I - W(t) W(t)^T \right) p(t) \, f(p(t)^T W(t)) = W(t) + \eta \, e(t) \, f(z(t)^T) \qquad (20)$$

It is interesting to note that the optimal solution of the robust criterion does not in general coincide with the standard solution, but is very close to it. For example, if we consider $c(t) = |t|$, the directions $w_i$ that maximize $E[|p^T w_i|]$ are, for an arbitrary non-symmetric distribution, different from the directions that maximize the variance $E[(p^T w_i)^2]$ under orthonormality conditions.
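The symmetric form of the robust update, Equation (20), can be sketched as follows (our illustration, assuming NumPy, synthetic Gaussian data with two dominant directions, and $f = \tanh$, the derivative of $\ln\cosh$; the learning rate and sample size are arbitrary). The columns of W converge towards the dominant two-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(3)
M, L, eta = 6, 2, 0.01
f = np.tanh                                    # derivative of ln(cosh(t))

# Synthetic zero-mean data: the first two axes carry most of the variance
scales = np.array([3.0, 2.0, 0.3, 0.2, 0.1, 0.1])
P = rng.standard_normal((5000, M)) * scales

W, _ = np.linalg.qr(rng.standard_normal((M, L)))   # initial orthonormal weights

for p in P:                                    # symmetric robust PCA, Equation (20)
    z = W.T @ p                                # linear outputs z = W^T p
    e = p - W @ z                              # representation error, Equation (19)
    W += eta * np.outer(e, f(z))               # W(t+1) = W(t) + eta e(t) f(z(t))^T

# fraction of the weights' energy lying in the dominant subspace (axes 1 and 2)
subspace_energy = np.sum(W[:2, :] ** 2) / np.sum(W ** 2)
```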

Let us consider the linear approximation $\bar{p}$ of the vectors $p$ in terms of a set of vectors $w_j$, $j = 1, \dots, l(i)$. Since the number $l(i)$ of basis vectors $w_j$ is usually smaller than the dimension M of the data vectors, there will be some error, called the instantaneous representation error, $e_i(t) = p(t) - \bar{p}(t)$, for each vector $p(t)$. The standard PCA solutions are obtained by minimizing the square of this error, namely the quantity $E[\|e_i\|^2] = E[\|p - \bar{p}\|^2]$. Let us now see how to carry out the robust generalization of the mean square representation error. Robust PCA algorithms can be obtained by minimizing the criterion:

$$G(e_i) = \mathbf{1}^T E[c(e_i)] \qquad (21)$$

where $\mathbf{1}$ is the M-dimensional vector of ones and $c(\cdot)$, applied componentwise to $e_i$, satisfies the assumptions above. Minimizing (21) with respect to W yields the gradient descent algorithm shown in

$$W(t+1) = W(t) + \eta \left( p(t) \, f(e(t)^T) \, W(t) + f(e(t)) \, p(t)^T \, W(t) \right) \qquad (22)$$

The first term, $\left( w_j(t)^T f(e(t)) \right) p(t)$, in the equation of

Comparing the algorithms of

Let us now consider the non-linear version of PCA. One heuristic way to obtain it is to require that the neuron outputs always be non-linear, i.e. of the form $f(z_i) = f(w_i^T p)$. Applying this to the equation in

$$w_j(t+1) = w_j(t) + \eta \, f(z_j(t)) \, k_j(t) \qquad (23)$$

which is similar to the previous one except that now the error vector is defined as:

$$k_j(t) = p(t) - \sum_{i=1}^{l(j)} f(z_i(t)) \, w_i(t) \qquad (24)$$

All this is summarised in

$$W(t+1) = W(t) + \eta \, k(t) \, f(z(t)^T) \qquad (25)$$

The biggest advantage of this network seems to be that the non-linearity implicitly takes into account statistical information of order higher than two, so that the outputs become more independent than in standard PCA networks.
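A corresponding sketch of the non-linear PCA update of Equations (24) and (25) (again our illustration, assuming NumPy, $f = \tanh$, and the same synthetic data as before; parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
M, L, eta = 6, 2, 0.01
f = np.tanh

# Synthetic zero-mean data with two dominant directions
scales = np.array([3.0, 2.0, 0.3, 0.2, 0.1, 0.1])
P = rng.standard_normal((5000, M)) * scales

W, _ = np.linalg.qr(rng.standard_normal((M, L)))   # initial orthonormal weights

for p in P:                           # non-linear PCA, symmetric case
    z = W.T @ p
    k = p - W @ f(z)                  # error vector with non-linear outputs, Equation (24)
    W += eta * np.outer(k, f(z))      # W(t+1) = W(t) + eta k(t) f(z(t))^T, Equation (25)

# the weight columns again concentrate on the two dominant directions
subspace_energy = np.sum(W[:2, :] ** 2) / np.sum(W ** 2)
```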

In this context, let us now illustrate the experiments carried out to identify the type of PCA with the best performance.

Different types of images were selected for performance testing from the CIFAR-100 dataset in the Wolfram Data Repository^{1}. After fixing some significant regions in each of them, we ran the PCA algorithms of the previous sections to obtain the principal components. The average convergence speed for a training set of 10,000 images and a learning threshold of 0.00001 is shown in

In order to choose the most effective method, the PCA networks were divided into two main classes: networks with a linear input-output mapping and networks with a non-linear one. Networks of the first (linear) type identify in the images only very bright objects and/or objects with a clearly distinct outline, confusing the faintest objects with the background. Networks of the second type instead allow, under certain conditions, the identification of less well-defined objects as well. The condition for obtaining this result is the use, in the algorithms of

The aim of this paper is to construct an algorithm capable of implementing both standard (linear) and non-linear Principal Component Analysis (PCA) through the use of artificial neural network models. PCA is mainly used to reduce the dimensionality (number of features) of a problem but, in the traditional approach, the determination of the principal components most representative of the phenomenon has the following limitations:

1) The computation is algebraic (matrix-based) in nature and, for a high number of variables, can involve long processing times;

^{1}https://datarepository.wolframcloud.com/resources/CIFAR-100

2) The standard PCA is suitable only for problems with linear relationships between the variables.

The approach presented is an algorithm for calculating the principal components for both standard and non-linear problems. The algorithm makes use of

Network type | Average convergence speed (epochs/sec) |
---|---|
Nonlinear PCA (hierarchical and symmetrical) | 0.722 |
Linear PCA (GHA and Oja subspace) | 0.703 |
Variance maximization (robust hierarchical and symmetrical) | 0.696 |
Approximate algorithm (robust hierarchical and symmetrical) | 0.455 |
Error minimization (robust hierarchical and symmetrical) | 0.382 |

artificial neural network models, with an iterative processing given by the “convergence’’ of the network towards the optimal weights, which correspond to the final solution of the problem. The neural network models proposed in the algorithm make use of multiple layers of neurons (see Section 6) with the application of the hyperbolic tangent function to the PCA output.

The performance of the proposed approach has been evaluated in a test implementation, and can be further improved both in the definition phase of the neural network architecture (number of hidden layers and neurons) and in the learning and validation phase (e.g. through the introduction of cross-validation or leave-one-out depending on the size of the input dataset).


The performance of the proposed approach, while very good, can be further improved at both the detection and classification stages:

• In order to improve the recognition accuracy, it would be desirable to create an algorithm capable of automatically recognising and eliminating only the spurious objects present on a plate;

• In order to speed up the learning of the supervised networks used for classification, one can consider a hybrid training scheme that exploits the strengths of several algorithms at the same time.

As for the second point, it has been noted that, using the scaled conjugate gradient learning algorithm, the average error quickly decreases to a certain level, after which it tends to stabilise. Newton’s method has much slower iterations but, on the other hand, manages to reach lower values of the average error. One can therefore think of hybridizing the two algorithms: use the first until the average error stops decreasing, then apply the second starting from the final weight configuration of the first. In this way it is possible to reach a lower value of the error function without an excessive loss of time.

The authors declare no conflicts of interest regarding the publication of this paper.

Gallo, C. and Capozzi, V. (2019) Feature Selection with Non Linear PCA: A Neural Network Approach. Journal of Applied Mathematics and Physics, 7, 2537-2554. https://doi.org/10.4236/jamp.2019.710173