This file describes the learn and predict algorithms, and in particular their initialization, which can look convoluted: many steps and verifications are needed before an actual computation can be run.
A run of Kernalytics is composed of the following steps:

1. Data reading. This step is not method-specific: every numerical method reads the data the same way, and the objective is to generate the Gram matrix. Only two files describe this step, `Learn` and `Predict`. Upon parsing the method name, a specific method is called that reads the parameters and runs the numerical method on the data.
2. Parameter parsing. This step is similar but differs across numerical methods: what differs is the description of the parameters. Each method (regression, segmentation, ...) has its own set of parameters, described in the learn and predict directories. Once the parameters have been parsed, the main function of the numerical method is called with the data and parameters, and the result of the analysis is returned.
3. Method run. The run itself is described in the overview.
4. Result writing. The last step is to write all the results as csv files in the root folder of the analysis case.
You can find examples for various algorithms in `Examples.scala`. They all follow the same structure: a root folder is defined (located in `exec`), and either `Learn` or `Predict` is called.
Kernalytics only works with structured data, in the form of a collection of csv files. Even the rscala package is just a thin wrapper that transmits the location of the root folder, which is where all the csv files should be found.
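As an illustration, a root folder for a learn run might look like the sketch below. The file names are those mentioned on this page; the layout and the annotations are an informal illustration, not a prescribed structure.

```
caseStudy/            <- root folder passed to Learn
├── algo.csv          <- algorithm name and global options (e.g. gramOpti)
├── desc.csv          <- kernel descriptions, one per column
├── learnData.csv     <- the data, one observation per row
└── ...               <- results are written back here as csv files
```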
Learn is described first; the differences with Predict are then highlighted.
- `readAndParseFile`: read the `algo.csv` file, check the content and generate a map `key => value` with an entry for each column.
- `readAndParseVars`: read the `learnData.csv` or `learnPredict.csv` file and generate a tuple `(Array[ParsedVar], Index)`. The first element contains the parsed variables, the second the number of observations.
- `cacheGram`: parse the `gramOpti` option in `algo.csv`.
- `readAndParseParam`: read the `desc.csv` file, parse each column and generate an `Array[ParsedParam]`.
- `generateGlobalKerEval`: from the `Array[ParsedVar]` and `Array[ParsedParam]` previously generated, generate the `KerEval` object.
    - `linkParamToData`: link each kernel to a variable in a `KerEvalFuncDescription`, which merges kernel and data information.
    - `multivariateKerEval`: from the list of `KerEvalFuncDescription`, generate the kernel function `(Index, Index) => Real`.
        - `generateKernelFromParamData`: for each `KerEvalFuncDescription`, generate a function `(Index, Index) => Real`.
        - `linearCombKerEvalFunc`: aggregate the individual `(Index, Index) => Real` into a global `(Index, Index) => Real`.
    - Then use the `gramOpti` option to generate a `KerEval` object which uses the cache method provided by the user in `algo.csv`.
- `callAlgo`: parse the `algo` entry in `algo.csv` to launch the corresponding numerical method, for example KMeans:
    - `main`: read and write the parameters and data that are specific to the algorithm. For KMeans, these are the number of classes asked for and the number of iterations the algorithm must be run. Then call the main function of the numerical method.
        - `getNClass`: number of classes asked for.
        - `getNIteration`: number of iterations.
        - `runKMeans`: code of the main algorithm. It assumes that all the data provided has been read and validated.
        - `writeResults`: write the results on the disk.
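The kernel-generation steps above (`generateKernelFromParamData` and `linearCombKerEvalFunc`) can be sketched as follows. This is a minimal sketch with simplified types: `Index` and `Real` stand in for the library's aliases, the kernel choices are arbitrary examples, and the function names mirror, but are not, the actual Kernalytics implementation.

```scala
object KernelSketch {
  type Index = Int
  type Real  = Double
  type KerEvalFunc = (Index, Index) => Real

  // generateKernelFromParamData (sketch): build one kernel function from a
  // variable (here, a column of reals) and a kernel with its parameters.
  def gaussianKernel(data: Array[Real], sd: Real): KerEvalFunc =
    (i, j) => math.exp(-math.pow(data(i) - data(j), 2) / (2.0 * sd * sd))

  def linearKernel(data: Array[Real]): KerEvalFunc =
    (i, j) => data(i) * data(j)

  // linearCombKerEvalFunc (sketch): aggregate weighted individual kernels
  // into one global (Index, Index) => Real.
  def linearCombKerEvalFunc(ks: Array[(Real, KerEvalFunc)]): KerEvalFunc =
    (i, j) => ks.map { case (w, k) => w * k(i, j) }.sum

  def main(args: Array[String]): Unit = {
    val x = Array(0.0, 1.0, 2.0)
    val global = linearCombKerEvalFunc(Array(
      (0.5, linearKernel(x)),
      (0.5, gaussianKernel(x, sd = 1.0))))
    println(global(1, 2)) // 0.5 * (1 * 2) + 0.5 * exp(-0.5)
  }
}
```

The global function is what a `KerEval` would then evaluate, either on demand or through the caching strategy selected by `gramOpti`.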
Essentially similar to exec.Learn.main. The main differences are:
- the `gramOpti` parameter in `algo.csv` is ignored; `Direct()` is always used instead:
    - in prediction, the Gram matrix has dimension `(nObsLearn + nObsPredict) x (nObsLearn + nObsPredict)`, and usually each coefficient is used only once.
- `readAndParseVars2Files` is called instead of `readAndParseVars`, in order to generate a Gram matrix that combines learn and prediction data.
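The combined learn/predict Gram matrix can be pictured as follows. This is a sketch, not the actual `KerEval` implementation: the learn observations occupy indices `0 until nObsLearn`, the predict observations the remaining indices, so one kernel evaluated on the concatenated data yields the whole `(nObsLearn + nObsPredict) x (nObsLearn + nObsPredict)` matrix.

```scala
object CombinedGramSketch {
  type Real = Double

  // Concatenate learn and predict observations of one variable, then
  // evaluate the kernel on every pair of indices.
  def combinedGram(learn: Array[Real],
                   predict: Array[Real],
                   kernel: (Real, Real) => Real): Array[Array[Real]] = {
    val all = learn ++ predict // learn observations first, predict after
    Array.tabulate(all.length, all.length)((i, j) => kernel(all(i), all(j)))
  }

  def main(args: Array[String]): Unit = {
    // 2 learn observations, 1 predict observation, linear kernel.
    val gram = combinedGram(Array(1.0, 2.0), Array(3.0), (x, y) => x * y)
    // gram is 3 x 3; gram(0)(2) mixes a learn and a predict observation.
    println(gram.map(_.mkString(" ")).mkString("\n"))
  }
}
```

Since most coefficients of this matrix are read only once during prediction, computing them directly (`Direct()`) rather than caching the Gram matrix is a reasonable default, which is consistent with `gramOpti` being ignored.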