The Xfuzzy 3 development environment

The supervised learning tool - Xfsl

xfsl is a tool that allows the user to apply supervised learning algorithms to tune fuzzy systems within the design flow of Xfuzzy 3. The tool can be executed in graphical mode or in command mode. The graphical mode is used when the tool is executed from the main window of the environment (using the option "Supervised learning" in the Tuning menu). The command mode is used when the tool is executed from the command line with the expression "xfsl file.xfl [file.cfg]", where the first file contains the system definition in XFL3 format and the second one contains the configuration of the learning process (see the section Configuration file below).
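
As an illustration of the command mode, the following invocations run the tool without and with a configuration file (the file names are merely illustrative):

 xfsl controller.xfl
 xfsl controller.xfl learning.cfg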

The figure above illustrates the main window of xfsl. This window is divided into four parts. The upper left area is used to configure the learning process. The state of the learning process is shown in the upper right part. The central area illustrates the evolution of the learning, and the bottom part contains several control buttons to run or stop the process, to save the results, and to exit.

In order to configure the learning process, the first step is to select a training file that contains the input/output data of the desired behavior. A test file, whose data are used to check the generalization of the learning, can also be selected. The format of these two pattern files is simply an enumeration of numeric values that are assigned to the input and output variables in the same order in which they appear in the definition of the system module in the XFL3 description. This is an example of a pattern file for a fuzzy system with two inputs and one output:

 0.00 0.00 0.5
 0.00 0.05 0.622459
 0.00 0.10 0.731059 
 ...  

The log file allows the learning evolution to be saved to an external file. The selection of this file is optional.

The next step in the configuration of the tuning process is the selection of the learning algorithm. xfsl admits many learning algorithms (see the section Algorithms below). Regarding gradient-descent algorithms, it admits Backpropagation, Backpropagation with Momentum, Adaptive Learning Rate, Adaptive Step Size, Manhattan, QuickProp and RProp. Among conjugate gradient algorithms, the following are included: Steepest Descent, Polak-Ribiere, Fletcher-Reeves, Hestenes-Stiefel, One-step Secant and Scaled Conjugate Gradient. The second-order algorithms included are Broyden-Fletcher-Goldfarb-Shanno, Davidon-Fletcher-Powell, Gauss-Newton and Marquardt-Levenberg. Regarding algorithms without derivatives, the Downhill Simplex and Powell's method can be applied. Finally, the statistical algorithms included are Blind Search and Simulated Annealing (with linear, exponential, classic, fast, and adaptive annealing schemes).

Once the algorithm is selected, an error function must be chosen. The tool offers several error functions that can be used to express the deviation between the actual and the desired behavior (see the section Error function below). By default, the Mean Square Error is selected.

xfsl contains two processing algorithms to simplify the designed fuzzy system. The first algorithm prunes the rules and membership functions that do not reach a significant activation or membership degree. There are three versions of this algorithm: pruning all rules whose activation never exceeds a certain threshold, pruning the N worst rules, and pruning all rules except the N best ones. The second algorithm clusters the membership functions of the output variables. The number of clusters can be fixed to a certain quantity or computed automatically. These two processing algorithms can be applied to the system before the tuning process (preprocessing option) or after it (postprocessing option).
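
These simplification steps can also be requested with the pre- and postprocessing directives of a configuration file (see the section Configuration file below). The following lines are a sketch in which the activation threshold and the number of clusters are illustrative values:

 xfsl_preprocessing(prune_threshold, 0.05)
 xfsl_postprocessing(output_clustering, 5)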

An end condition has to be specified to finish the learning process. This condition is a limit imposed on the number of iterations, on the error goal, or on the maximum absolute or relative variation of the error (considering either the training or the test error).
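
Expressed as configuration-file directives (see the section Configuration file below), an end condition could look like one of the following lines, where the condition names are those listed in that section and the numeric values are merely illustrative:

 xfsl_endcondition(epoch, 1000)
 xfsl_endcondition(training_variation, 0.0001)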

The tool allows the user to choose the parameters to be tuned. The following window is used to enable or disable the tuning of the parameters. The three upper lists are used to select a parameter, or a set of parameters, by selecting the variable type, the membership function of that type, and the parameter index within that membership function. The lower list shows the current settings. These settings are interpreted in the order in which they appear in the list. In this example, all the parameters are first disabled, and then the parameters of the type Tout are enabled, so only the parameters of the Tout type are going to be tuned.
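
The same selection can be written with the enable/disable directives of a configuration file (see the section Configuration file below). Assuming the syntax described there, the example just mentioned would be expressed as:

 xfsl_disable(ANY.ANY.ANY)
 xfsl_enable(Tout.ANY.ANY)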

A complete learning configuration can be saved to an external file that will be available for subsequent processes. The format of this file is described in the section Configuration file.

xfsl can be applied to any fuzzy system described by the XFL3 language, even systems that employ particular functions defined by the user. It must be considered, however, that the features of the system may impose limitations on the learning algorithms that can be applied (for instance, a non-differentiable system cannot be tuned by a gradient-descent algorithm).

Algorithms

Since the objective of supervised learning algorithms is to minimize an error function that summarizes the deviation between the actual and the desired system behavior, they can be regarded as function optimization algorithms. xfsl contains many different supervised learning algorithms, which are briefly described in the following.

A) Gradient Descent Algorithms

The equivalence between fuzzy systems and neural networks led to the application of neural learning processes to fuzzy inference systems. In this sense, a well-known algorithm employed in fuzzy systems is the BackPropagation algorithm, which modifies the parameter values proportionally to the gradient of the error function in order to reach a local minimum. Since the convergence speed of this algorithm is slow, several modifications have been proposed, such as using a different learning rate for each parameter or heuristically adapting the control variables of the algorithm. An interesting modification that greatly improves the convergence speed is to take into account the gradient values of two successive iterations, because this provides information about the curvature of the error function. The QuickProp and RProp algorithms follow this idea.

xfsl admits Backpropagation, Backpropagation with Momentum, Adaptive Learning Rate, Adaptive Step Size, Manhattan, QuickProp and RProp.

B) Conjugate Gradient Algorithms

The gradient-descent algorithms generate a change step in the parameter values that is a function of the gradient value at each iteration (and possibly at previous iterations). Since the gradient indicates the direction of maximum function variation, it may be convenient to generate not just one step but several steps that minimize the error function in that direction. This idea, which is the basis of the steepest-descent algorithm, has the drawback of producing a zig-zag advance, because the optimization in one direction may deteriorate previous optimizations. The solution is to advance along conjugate directions that do not interfere with each other. The several conjugate gradient algorithms reported in the literature differ in the equations used to generate the conjugate directions.

The main drawback of the conjugate gradient algorithms is the implementation of a linear search in each direction, which may be costly in terms of function evaluations. The linear search can be avoided by using second-order information, that is, by approximating the second derivative with two close first derivatives. The scaled conjugate gradient algorithm is based on this idea.

Among conjugate gradient algorithms, the following are included in xfsl: Steepest Descent, Polak-Ribiere, Fletcher-Reeves, Hestenes-Stiefel, One-step Secant and Scaled Conjugate Gradient.

C) Second-Order Algorithms

A further step towards speeding up the convergence of learning algorithms is to make use of second-order information of the error function, that is, of its second derivatives or, in matrix form, of its Hessian. Since the calculation of the second derivatives is complex, one solution is to approximate the Hessian by means of the gradient values of successive iterations. This is the idea of the Broyden-Fletcher-Goldfarb-Shanno and Davidon-Fletcher-Powell algorithms.

A particular case arises when the function to minimize is a quadratic error, because the Hessian can then be approximated using only the first derivatives of the system outputs, as done by the Gauss-Newton algorithm. Since this algorithm can lead to instability when the approximated Hessian is not positive definite, the Marquardt-Levenberg algorithm solves this problem by introducing an adaptive term.

The second-order algorithms included in the tool are: Broyden-Fletcher-Goldfarb-Shanno, Davidon-Fletcher-Powell, Gauss-Newton and Marquardt-Levenberg.

D) Algorithms Without Derivatives

The gradient of the error function cannot always be calculated, because it may be too costly to compute or not defined. In these cases, optimization algorithms without derivatives can be employed. An example is the Downhill Simplex algorithm, which considers a set of function evaluations to decide a parameter change. Another example is Powell's method, which implements linear searches along a set of directions that evolve to become conjugate. Algorithms of this kind are much slower than the previous ones. A better solution can be to estimate the derivatives from secants, or to employ not the derivative value but its sign (as RProp does), which can be estimated from small perturbations of the parameters.

All the algorithms described above reach a local minimum of the error function, not necessarily the global one. The statistical algorithms can discover the global minimum because they generate different system configurations that spread over the search space. One way of broadening the explored space is to generate random configurations and choose the best one. This is done by the Blind Search algorithm, whose convergence speed is extremely slow. Another way is to perform small perturbations of the parameters to find a better configuration, as done by the algorithm of iterative improvements. A better solution is to employ Simulated Annealing algorithms. These are based on an analogy between the learning process, which is intended to minimize the error function, and the evolution of a physical system, which tends to lower its energy as its temperature decreases. Simulated annealing provides good results when the number of parameters to adjust is low. When it is high, the convergence speed can be so slow that it may be preferable to generate random configurations, apply gradient-descent algorithms to them, and select the best solution.

Regarding algorithms without derivatives, the Downhill Simplex and Powell's method can be applied. The statistical algorithms included are Blind Search and Simulated Annealing (with linear, exponential, classic, fast, and adaptive annealing schemes).

When optimizing a differentiable system, the Broyden-Fletcher-Goldfarb-Shanno and Marquardt-Levenberg algorithms are the most adequate. When using BFGS, the control values (0.1, 10) may be a good choice. For the Marquardt-Levenberg algorithm, the control values (0.1, 10, 0.1) are a good initial option. If it is not possible to compute the system derivatives, as in hierarchical fuzzy systems, the best choice is to use these algorithms with the option of estimating the derivatives. Simulated Annealing is only recommended when there are few parameters to tune and the second-order algorithms drive the system to a non-optimal minimum.
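
As a sketch of how these recommendations translate into a configuration file (see the section Configuration file below), one of the following lines would pass the control values mentioned above to the xfsl_algorithm directive; the algorithm identifiers are illustrative and must match the names accepted by the tool:

 xfsl_algorithm(BroydenFletcherGoldfarbShanno, 0.1, 10)
 xfsl_algorithm(MarquardtLevenberg, 0.1, 10, 0.1)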

Error function

The error function expresses the deviation between the actual behavior of the fuzzy system and the desired one by comparing the input/output patterns with the output of the system for those input values. xfsl defines seven error functions: mean_square_error (MSE), weighted_mean_square_error (WMSE), mean_absolute_error (MAE), weighted_mean_absolute_error (WMAE), classification_error (CE), advanced_classification_error (ACE), and classification_square_error (CSE).

All these functions are normalized by the number of patterns, the number of output variables, and the range of each output variable, so that the range of the error function goes from 0 to 1. The first four functions are adequate for systems with continuous output variables, while the last three functions are dedicated to classification systems. These are the equations for the first four functions:

MSE = Sum( ((Y-y)/range)**2 )/(num_pattern*num_output)
WMSE = Sum( w * ((Y-y)/range)**2 )/(num_pattern*Sum(w))
MAE = Sum( |((Y-y)/range)| )/(num_pattern*num_output)
WMAE = Sum( w * |((Y-y)/range)| )/(num_pattern*Sum(w))

The output of a fuzzy classification system is the linguistic label that has the greatest activation degree. A common way of expressing the deviation of these systems is the number of classification failures (classification_error, CE). This is not a very good choice for tuning because many system configurations produce the same number of failures. A useful modification is to add a term that measures the distance of the selected label to the desired one (advanced_classification_error, ACE). These two error functions are not differentiable, so they cannot be used with derivative-based learning algorithms (which are the fastest). A better choice is to consider the activation degree of each linguistic label as the actual output and the desired output as 1 for the correct label and 0 for the others. The error function is computed as the square error of this system (classification_square_error, CSE), which is differentiable and can be used with derivative-based learning algorithms.
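
For a classification system tuned with a derivative-based algorithm, the error function could therefore be selected in a configuration file (see the section Configuration file below) with a line such as:

 xfsl_errorfunction(classification_square_error)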

Configuration file

The configuration of a tuning process can be saved to and loaded from an external file. The content of this file is formed by the following directives:

 xfsl_training("file_name")
 xfsl_test("file_name")
 xfsl_log("file_name")
 xfsl_output("file_name")
 xfsl_algorithm(algorithm_name, value, value, ...)
 xfsl_option(option_name, value, value, ...)
 xfsl_errorfunction(function_name, value, value, ...)
 xfsl_preprocessing(process_name, value, value, ...)
 xfsl_postprocessing(process_name, value, value, ...)
 xfsl_endcondition(condition_name, value, value, ...)
 xfsl_enable(type.mf.number)
 xfsl_disable(type.mf.number)

The directives xfsl_training and xfsl_test select the pattern files for training and testing the system. The log file for saving the learning evolution is selected by the directive xfsl_log. The directive xfsl_output contains the name of the XFL3 file to which the tuned system is saved. By default, this file is "xfsl_out.xfl".

The learning algorithm is set by the directive xfsl_algorithm. The values refer to the control variables of the algorithm. Once the algorithm has been chosen, any algorithm option can be selected by the directive xfsl_option.

The error function selection is made by the directive xfsl_errorfunction. The values contain the weights of the output variables for weighted error functions.

The directives xfsl_preprocessing and xfsl_postprocessing specify any process that has to be applied before or after the system tuning. The available options are: prune_threshold, prune_worst, prune_except, and output_clustering. When the output_clustering option contains a value, it refers to the number of clusters to be created; otherwise, the number of clusters is computed automatically.

The end condition, selected by xfsl_endcondition, can be one of the following: epoch, training_error, training_RMSE, training_MXAE, training_variation, test_error, test_RMSE, test_MXAE, and test_variation.

The selection of the parameters to be tuned is made by the directives xfsl_enable and xfsl_disable. The fields type, mf, and number specify the variable type, the membership function, and the index of the parameter. These fields can also contain the expression "ANY".
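
The following is a sketch of a complete configuration file built from the directives above; the file names, the algorithm identifier, and the numeric values are illustrative and should be replaced by those of the actual design:

 xfsl_training("training.pat")
 xfsl_test("test.pat")
 xfsl_log("learning.log")
 xfsl_output("tuned.xfl")
 xfsl_algorithm(MarquardtLevenberg, 0.1, 10, 0.1)
 xfsl_errorfunction(mean_square_error)
 xfsl_endcondition(epoch, 1000)
 xfsl_postprocessing(output_clustering)
 xfsl_disable(ANY.ANY.ANY)
 xfsl_enable(Tout.ANY.ANY)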

Example

The examples folder in the Xfuzzy distribution contains different examples of tuning processes. The initial system configuration is specified in an XFL3 file, which defines a fuzzy system with two input variables and one output variable. The membership functions of the output variable are identical, so the input/output behavior of this initial specification corresponds to a flat surface.

The following table shows the results obtained in one of the cases, which used a training file with patterns that describe the surface given by the expression

 z = 1 / ( 1 + exp(10*(x-y)) ) 

after applying the Marquardt-Levenberg learning algorithm and the clustering post-processing technique to reduce the number of membership functions.

