Geeks With Blogs
Josh Reuben



PMML Overview

  • An XML standard managed by the Data Mining Group (www.dmg.org), whose members include IBM, Microsoft, Oracle, SAS, SPSS, NCR, SAP, KXEN, Magnify, MINEit & StatSoft.

  • Predictive Model Markup Language (PMML) is an XML markup language to describe statistical and data mining models.

  • PMML describes the inputs to data mining models, the transformations used to prepare data prior to mining, and the parameters which define the models themselves.

  • It is the most widely deployed data mining standard.

  • PMML is complementary to many other data mining standards. Its XML interchange format is supported by several other standards, such as XML for Analysis.

  • PMML provides a way for applications to define statistical and data mining models and to share models between PMML compliant applications.

  • PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications.

  • It allows users to develop models within one vendor's application and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. The exchange of models between compliant applications is now straightforward.

  • One or more mining models can be contained in a PMML XML document.

  • The PMML statistics subset provides a basic framework for representing univariate statistics, such as mean, min, max, counts, standard deviation & frequency

  • PMML is a standard for XML documents which express trained instances of analytic models.

  • PMML supports the following Model classes:

    • Association Rules

    • Decision Trees

    • Center-Based & Distribution-Based Clustering

    • Regression

    • General Regression

    • Neural Networks

    • Naive Bayes

    • Sequences

PMML document structure:

<?xml version="1.0"?>

<!DOCTYPE PMML PUBLIC "PMML 2.0"

"http://www.dmg.org/v2-0/pmml_v2_0.dtd">

<PMML version="2.0">

...

</PMML>

  • The root element of a PMML document must have type PMML.

  • A PMML document can contain zero or more models. A document without a model can be used to carry the initial metadata before an actual model is computed, but it is not meant to be useful for a PMML consumer.

  • The element <MiningBuildTask> can contain any XML value describing the configuration of the training run that produced the model instance - the natural container for task specifications as defined by other mining standards, e.g., in SQL or .NET.

  • The fields in the <DataDictionary> and in the <TransformationDictionary> elements are identified by unique names. Other elements in the models can refer to these fields by name, so that multiple models in one PMML document can share the same fields defined in these dictionary elements.

  • Certain types of PMML models such as neural networks or logistic regression can be used for different purposes - some instances implement prediction of numeric values, while others can be used for classification according to the functionName attribute which specifies the mining function.

  • A Model element has the following attributes:

    • modelName - identifies the model with a unique name in the context of the PMML file.

    • functionName and algorithmName provide informational descriptions of the nature of the mining model, e.g., whether it is intended to be used for clustering or for classification.

  • Basic data types and entities: NUMBER, INT-NUMBER, REAL-NUMBER, PROB-NUMBER (a real number between 0.0 & 1.0) & PERCENTAGE-NUMBER

  • The types <Array>, <NUM-ARRAY>, <REAL-ARRAY> & <STRING-ARRAY> are defined as container structures which implement arrays of numbers and strings in a fairly compact way:

<Array n="3" type="int">

1 22 3

</Array>

<Array n="3" type="string">

ab "a b" "with \"quotes\" "

</Array>
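
The array content above is a blank-separated list in which string values may be wrapped in double quotes, with embedded quotes escaped by a backslash. A minimal Python sketch of how a consumer might read such content follows; the function name parse_array and its signature are hypothetical, not part of PMML.

```python
import re

# A token is either a double-quoted string (backslash escapes allowed)
# or a run of non-blank characters.
_TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(\S+)')


def parse_array(text, n, type_):
    """Parse the text content of an <Array> element (hypothetical helper).

    text  -- the element content, e.g. '1 22 3'
    n     -- the value of the n attribute (expected number of entries)
    type_ -- the value of the type attribute: 'int', 'real' or 'string'
    """
    values = []
    for match in _TOKEN.finditer(text):
        if match.group(1) is not None:
            # Quoted value: drop the quotes and undo backslash escapes.
            values.append(re.sub(r'\\(.)', r'\1', match.group(1)))
        else:
            values.append(match.group(2))
    if type_ == "int":
        values = [int(v) for v in values]
    elif type_ == "real":
        values = [float(v) for v in values]
    if len(values) != n:
        raise ValueError("expected %d entries, found %d" % (n, len(values)))
    return values


ints = parse_array("1 22 3", 3, "int")
strings = parse_array('ab "a b" "with \\"quotes\\" "', 3, "string")
```

The length check mirrors the role of the n attribute as a consistency check on the content.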

PMML Header Information

  • Header: The top level tag that marks the beginning of the header information.

  • copyright: This attribute contains the copyright information for this model.

  • description: A free-text description of the model.

  • Application: This element describes the software application that generated the model.

  • name: The name of the application that generated the model.

  • version: The version of the application that generated this model.

  • Annotation: Document modification history is embedded here.

  • Timestamp: This element records the model creation timestamp.
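
As a rough illustration of the header items listed above, the following Python sketch reads them with the standard library's ElementTree. The header content (company, application name, timestamp text) is invented for the example, and read_header is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical header; all values are made up for illustration.
HEADER_XML = """
<Header copyright="Example Corp" description="Hypothetical churn model">
  <Application name="ExampleMiner" version="1.0"/>
  <Annotation>Retrained on refreshed data.</Annotation>
  <Timestamp>2014-10-09 04:44:00</Timestamp>
</Header>
"""


def read_header(xml_text):
    """Collect the header attributes and child elements into a dict."""
    header = ET.fromstring(xml_text)
    app = header.find("Application")
    return {
        "copyright": header.get("copyright"),
        "description": header.get("description"),
        "application": None if app is None else (app.get("name"), app.get("version")),
        "annotations": [a.text for a in header.findall("Annotation")],
        "timestamp": header.findtext("Timestamp"),
    }


info = read_header(HEADER_XML)
```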

PMML Data Dictionary

  • The data dictionary contains definitions for fields as used in mining models. It specifies the types and value ranges. These definitions are assumed to be independent of specific data sets as used for training or scoring a specific model.

  • A data dictionary can be shared by multiple models; statistics and other information related to the training set are stored within a model.

  • The value numberOfFields is the number of fields which are defined in the content of <DataDictionary>; this number can be added for consistency checks. The name of a data field must be unique in the data dictionary. The displayName is a string which may be used by applications to refer to that field.

  • The fields are separated into different types depending on which operations are defined on the values; this is defined by the attribute optype. Categorical fields have the operator "=", ordinal fields have an additional "<", and continuous fields also have arithmetic operators. Cyclic fields have a distance measure which takes into account that the maximal value and minimal value are close together.

  • The optional attribute 'taxonomy' refers to a hierarchy of values and is only applicable to categorical fields.

  • The content of a DataField defines the set of values which are considered to be valid – the mining model will categorize a value as valid, invalid or missing

  • If a categorical or ordinal field contains at least one Value element where the value of property is 'valid' or unspecified, then the set of Value elements completely defines the set of valid values. Otherwise any value is valid by default.

  • The element Interval defines a range of numeric values - The attributes leftMargin and rightMargin are optional but at least one value must be defined. If a margin is missing, then +/- infinity is assumed.
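
The validity rules above can be sketched in Python as follows. This is an illustration under assumptions: classify and in_interval are hypothetical names, a missing input is represented by None, and the closure argument follows the four closure values of the PMML Interval element (openOpen, openClosed, closedOpen, closedClosed), which the text above does not spell out.

```python
def classify(value, declared):
    """Classify a categorical or ordinal value as valid, invalid or missing.

    declared -- list of (value, property) pairs taken from the Value
                elements, with property 'valid', 'invalid' or 'missing'
                (an unspecified property counts as 'valid').
    """
    if value is None:
        return "missing"
    for declared_value, prop in declared:
        if declared_value == value and prop != "valid":
            return prop
    valid = {v for v, prop in declared if prop == "valid"}
    # With no valid Value elements at all, any value is valid by default.
    if not valid or value in valid:
        return "valid"
    return "invalid"


def in_interval(x, left=None, right=None, closure="closedClosed"):
    """Check x against an Interval; a missing margin means +/- infinity."""
    if left is not None:
        left_closed = closure.startswith("closed")
        if x < left or (x == left and not left_closed):
            return False
    if right is not None:
        right_closed = closure.endswith("Closed")
        if x > right or (x == right and not right_closed):
            return False
    return True


gender_values = [("female", "valid"), ("male", "valid"), ("N/A", "missing")]
status = classify("male", gender_values)
```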

PMML Mining Schema

  • Each model contains one mining schema which lists fields as used in that model - This is a subset of the fields as defined in the data dictionary.

  • While the mining schema contains information that is specific to a certain model, the data dictionary contains data definitions which do not vary per model.

  • The main purpose of the mining schema is to list the fields which a user has to provide in order to apply the model.

  • The usageType attribute can have the following values:

    • active: field used as input (independent field).

    • predicted: field whose value is predicted by the model.

    • supplementary: field holding additional descriptive information.

  • The outliers attribute can have the following values:

    • asIs: field values treated at face value.

    • asMissingValues: outlier values are treated as if they were missing.

    • asExtremeValues: outlier values are changed to a specific high or low value defined in MiningField.

  • name: symbolic name of field, must refer to a field in the data dictionary.

  • highValue and lowValue: the thresholds used for outlier treatment.

  • missingValueReplacement: If this attribute is specified then a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value were found in the original input.

  • missingValueTreatment: informational only.
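
A minimal Python sketch of how a consumer might apply the outliers and missingValueReplacement attributes to one input value follows. The function prepare_value is hypothetical, a missing input is represented by None, and handling outliers before missing values is an assumption of this sketch rather than something stated above.

```python
def prepare_value(value, outliers="asIs", low_value=None, high_value=None,
                  missing_value_replacement=None):
    """Apply MiningField-style outlier and missing-value handling."""
    if value is not None and outliers != "asIs":
        too_low = low_value is not None and value < low_value
        too_high = high_value is not None and value > high_value
        if outliers == "asMissingValues" and (too_low or too_high):
            # The outlier is treated as if it were missing.
            value = None
        elif outliers == "asExtremeValues":
            # The outlier is clipped to the nearest boundary.
            if too_low:
                value = low_value
            elif too_high:
                value = high_value
    if value is None and missing_value_replacement is not None:
        value = missing_value_replacement
    return value


clipped = prepare_value(150, outliers="asExtremeValues", low_value=0, high_value=100)
```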

PMML Data flow

  • PMML defines a variety of specific mining models such as for tree classification, neural networks, regression, etc.

  • There are also definitions which are common to all models; they describe the input data itself and generic transformations which can be applied to the input data before the model itself is evaluated.

  • The <DataDictionary> element describes the raw input data 'as is'. It refers to the original data and defines how the mining model interprets the data, e.g., as categorical or numerical.

  • The <MiningSchema> element defines an interface to the user of PMML models, listing all fields which are used as input to the computations in the mining model. The MiningSchema also defines which values are regarded as outliers, which weighting is applied to a field, e.g., for clustering. Input fields as specified in the MiningSchema refer to fields in the data dictionary but not to derived fields because a user of a model is not required to perform the normalizations.

  • Various transformations are defined, such as normalization of numbers to a range [0..1] or discretization of continuous fields, which convert the original values to internal values as they are required by the mining model, such as an input neuron of a network model. The mining model may internally require further derived values that depend on the input values defined in the transformations block. The transformations cover expressions that were generated by a mining technique. A complete mining project usually needs many other preprocessing steps which may have to be defined manually, and PMML does not provide a complete language for this full preprocessing. These data preparation steps must be performed before feeding the values into a PMML consumer.

  • If a PMML document contains multiple models then sharing definitions of normalizations could save space in the document. That's the same idea as for having a common data dictionary. Note, the normalizations may still differ between models, i.e., different models may refer to different sets of derived fields.

  • A derived value, defined by a normalization, can be input for another transformation. E.g. a neural network model could have a linear normalization defined on a log-transformed input field 'income'.

  • The specific definitions of models such as tree classification or neural network may refer to fields listed in the MiningSchema or to derived fields which can be computed from the MiningSchema-fields (incl. transitive closure).

  • The statistics and the specific model can refer to fields in the MiningSchema but also to transformed fields. If there is a replacement value defined for missing values, the statistics refer to the values before the missing values are replaced.

  • The output of a model always depends on the specific kind of model; the final result, such as a predicted class and a probability, is computed from the output of the model.

  • If a neural network is used for predicting numeric values then the output value of the network usually needs to be denormalized into the original domain of values, which can use the same kind of transformation types - The PMML consumer system will automatically compute the inverse mapping.

PMML Transformation Dictionary & Derived Values

  • At various places the mining models use simple functions in order to map user data to values that are easier to use in the specific model. For example, neural networks internally work with numbers, usually in the range from 0 to 1: numeric input data are mapped to the range [0..1], and categorical fields are mapped to series of 0/1 indicators.

  • PMML defines 4 kinds of simple data transformations:

    • Normalization: map values to numbers, the input can be continuous or discrete.

    • Discretization: map continuous values to discrete values.

    • Value mapping: map discrete values to discrete values.

    • Aggregation: summarize or collect groups of values, e.g. compute average.

  • The transformations in PMML do not cover the full set of preprocessing functions which may be needed to collect and prepare the data for mining, as there are too many variations of preprocessing expressions - Instead, the PMML transformations represent expressions that are created automatically by a mining system
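
The four kinds of transformation listed above can be pictured with the following Python sketch. All function names and the sample data are hypothetical; the sketch shows the idea behind each kind rather than the PMML elements that encode them.

```python
def norm_discrete(value, categories):
    """Normalization of a categorical value: one 0/1 indicator per category."""
    return [1.0 if value == c else 0.0 for c in categories]


def discretize(x, bins):
    """Discretization: map a continuous value to a bin label.

    bins -- list of (left, right, label); left is inclusive, right is
            exclusive, and None stands for an unbounded margin.
    """
    for left, right, label in bins:
        if (left is None or x >= left) and (right is None or x < right):
            return label
    return None


def map_value(value, table, default=None):
    """Value mapping: map a discrete value to another discrete value."""
    return table.get(value, default)


def aggregate_average(rows, group_field, value_field):
    """Aggregation: average of value_field per distinct group_field value."""
    totals = {}
    for row in rows:
        total, count = totals.get(row[group_field], (0.0, 0))
        totals[row[group_field]] = (total + row[value_field], count + 1)
    return {key: total / count for key, (total, count) in totals.items()}


indicators = norm_discrete("urban", ["suburban", "urban", "rural"])
age_class = discretize(5.0, [(None, 3.0, "new"), (3.0, None, "old")])
```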

PMML Conformance

  • PMML intends to enable application portability, sharing, and reuse of analytic models produced by a variety of tools.

  • Conformance must therefore be specified from both producer and consumer perspectives.

  • Applications need ways to specify what kinds of analytic models they can use, and modeling tools need ways to specify what kinds of analytic models they produce.

  • A PMML document is what gets produced by a modeling tool to specify a trained analytic model and is what an application uses to deploy that model.

  • Satisfying conformance rules ensures that a model definition document is syntactically correct and consistent with the specification, and that such a model will be applied in ways which are valid.

PMML Regression

  • A RegressionModel defines three types of regression models: linear, polynomial, and logistic regression. The modelType attribute indicates the type of regression used.

  • Linear and stepwise-polynomial regression are designed for numeric dependent variables having a continuous spectrum of values. These models should contain exactly one regression table. The attributes normalizationMethod and targetCategory are not used in that case.

  • Logistic regression is designed for categorical dependent variables. These models should contain exactly one regression table for each targetCategory. The normalizationMethod describes whether/how the prediction is converted into a probability.

  • The predicted value p is normally interpreted as the confidence or the probability of an individual belonging to the category of interest, as defined by targetCategory. There can be multiple regression equations, one per category; a confidence value for a category j can then be computed by the softmax or simplemax functions.

  • the <RegressionModel> element is the root element of an XML regression model, and contains the following attributes:

    • modelName: This is a unique identifier specifying the name of the regression model.

    • functionName: Can be regression or classification.

    • algorithmName: Can be any string describing the algorithm that was used while creating the model.

    • modelType: Specifies the type of a regression model. This information is used to select the appropriate mathematical formulas during the scoring phase. The supported regression algorithms are linearRegression, stepwisePolynomialRegression & logisticRegression.

    • targetFieldName: The name of the target field (also called response variable).

  • The <RegressionTable> element represents a table that lists the values of all predictors or independent variables. If the model is used to predict a numerical field, then there is only one RegressionTable and the attribute targetCategory may be missing. If the model is used to predict a categorical field, then there are two or more RegressionTables and each one must have the attribute targetCategory defined with a unique value.

  • The <NumericPredictor> subelement defines a numeric independent variable. The list of valid attributes comprises the name of the variable, the exponent to be used, and the coefficient by which the values of this variable must be multiplied. If the independent variable contains missing values, the mean attribute is used to replace the missing values with the mean value.

  • The <CategoricalPredictor> subelement defines a categorical independent variable. The list of attributes comprises the name of the variable, the value attribute, and the coefficient by which the values of this variable must be multiplied.

  • To do a regression analysis with categorical values, some means must be applied to enable calculations. If the specified value of an independent variable occurs, the term variable_name(value) is replaced with 1. Thus the coefficient is multiplied by 1.

  • If the value does not occur, the term variable_name(value) is replaced with 0 so that the product coefficient × variable_name(value) yields 0. Consequently, the product is ignored in the ongoing analysis. If the input value is missing then variable_name(v) yields 0 for any 'v'.

  • E.g. a linear regression analysis PMML model:

<RegressionModel

functionName="regression"

modelName="Sample for linear regression"

modelType="linearRegression"

targetFieldName="number of claims">


<RegressionTable intercept="132.37">

<NumericPredictor name="age"

exponent="1" coefficient="7.1"/>

<NumericPredictor name="salary"

exponent="1" coefficient="0.01"/>

<CategoricalPredictor name="car location"

value="carpark" coefficient="41.1"/>

<CategoricalPredictor name="car location"

value="street" coefficient="325.03"/>

</RegressionTable>


</RegressionModel>
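
To make the scoring rule concrete, here is a Python sketch that evaluates the regression table above for one input record, following the description in this section: intercept plus coefficient times value for numeric predictors, and coefficient times a 0/1 indicator for categorical predictors. The helper names and the sample record are hypothetical. The softmax and simplemax helpers use the usual definitions, exp(y_j) / sum of exp(y_i) and y_j / sum of y_i, which are assumed here because the formulas are not given in the text.

```python
import math
import xml.etree.ElementTree as ET

MODEL_XML = """
<RegressionModel functionName="regression"
                 modelName="Sample for linear regression"
                 modelType="linearRegression"
                 targetFieldName="number of claims">
  <RegressionTable intercept="132.37">
    <NumericPredictor name="age" exponent="1" coefficient="7.1"/>
    <NumericPredictor name="salary" exponent="1" coefficient="0.01"/>
    <CategoricalPredictor name="car location" value="carpark" coefficient="41.1"/>
    <CategoricalPredictor name="car location" value="street" coefficient="325.03"/>
  </RegressionTable>
</RegressionModel>
"""


def score_table(table, record):
    """Evaluate one RegressionTable for a record (dict of field -> value)."""
    y = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        x = record.get(pred.get("name"))
        if x is None:
            # Missing numeric input: fall back to the mean attribute.
            if pred.get("mean") is None:
                raise ValueError("missing value for " + pred.get("name"))
            x = float(pred.get("mean"))
        y += float(pred.get("coefficient")) * x ** int(pred.get("exponent", "1"))
    for pred in table.findall("CategoricalPredictor"):
        # The indicator is 1 if the value occurs, 0 otherwise (also if missing).
        hit = record.get(pred.get("name")) == pred.get("value")
        y += float(pred.get("coefficient")) * (1.0 if hit else 0.0)
    return y


def softmax(ys):
    """Confidence per category from the outputs of several tables."""
    top = max(ys)
    exps = [math.exp(y - top) for y in ys]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def simplemax(ys):
    """Simple normalization of the table outputs by their sum."""
    total = sum(ys)
    return [y / total for y in ys]


table = ET.fromstring(MODEL_XML).find("RegressionTable")
claims = score_table(table, {"age": 30, "salary": 50000, "car location": "carpark"})
```

For this record the terms are 132.37 + 7.1 × 30 + 0.01 × 50000 + 41.1, so the prediction is approximately 886.47.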



PMML Neural Network Models for Backpropagation

  • PMML models each neuron as receiving one or more input values, each coming via a network connection, and sending only one output value. All incoming connections for a certain neuron are contained in the corresponding <Neuron> element. Each connection <Con> stores the ID of the node it comes from and the weight. A bias weight coefficient may be stored as an attribute of the <Neuron> element.

  • All neurons in the network are assumed to have the same (default) activation function, although each individual neuron may have its own activation and threshold that override the default.

  • NeuralInput defines how input fields are normalized so that the values can be processed in the neural network. For example, string values must be encoded as numeric values.

  • NeuralOutput defines how the output of the neural network must be interpreted.

  • NN-NEURON-ID is a string which uniquely identifies a neuron within a model (not within a document).

  • An input neuron represents the normalized value for an input field using the normalization elements <NormContinuous> and <NormDiscrete>. A numeric input field is usually mapped to a single input neuron while a categorical input field is usually mapped to a set of input neurons using some fan-out function.

  • Restrictions: A numeric input field, or a pair consisting of a categorical input field and an input value, must not appear more than once in the input layer.

  • Neuron contains an identifier which must be unique in all layers. Its attribute threshold has default value 0. If no activationFunction is given then the default activationFunction of the NeuralNetwork element applies.

  • The attribute 'bias' implicitly defines a connection to a bias unit where the unit's value is 1.0 and the weight is the value of 'bias'

  • Weighted connections between neural net nodes are represented by Con elements, which are always part of a Neuron and define the connections coming into that parent element.

  • The neuron identified by 'from' may be part of any layer.

  • NN-NEURON-IDs of all nodes must be unique across the combined set of NeuralInput and Neuron nodes. The 'from' attributes of connections and NeuralOutputs refer to these identifiers.

  • In parallel to input neurons, there are output neurons which are connected to target fields via some normalization.

  • While the activation of an input neuron is defined by the value of the corresponding input field, the activation of an output neuron is computed by the activation function, and thus an output neuron is defined by a 'Neuron'.

  • In networks with supervised learning the computed activation of the output neurons is compared with the normalized values of the corresponding target fields

  • The difference between the neuron's activation and the normalized target field determines the prediction error.

  • For scoring, the normalization for the target field is used to denormalize the predicted value in the output neuron. Therefore, each instance of 'Neuron' which represents an output neuron is additionally connected to a normalized field. Note that the scoring procedure must apply the inverse of the normalization in order to map the neuron activation to a value in the original domain.

  • For neural value prediction with backpropagation, the output layer contains a single neuron; its activation is denormalized to give the predicted value.

  • For neural classification with backpropagation, the output layer contains one or more neurons. The neuron with maximal activation determines the predicted class label. If there is no unique neuron with maximal activation then the predicted value is undefined.

  • Backward connections from level N to level M with M <= N, connections between non-adjacent layers, and variable values for activationFunction per Neuron require extensions.

  • E.g. a neural network PMML model:

<?xml version="1.0" ?>

<PMML version="2.0">

<Header copyright="DMG.org"/>

<DataDictionary numberOfFields="5">

<DataField name="gender" optype="categorical">

<Value value="female"/>

<Value value="male"/>

</DataField>

<DataField name="no of claims" optype="categorical">

<Value value="0"/>

<Value value="1"/>

<Value value="3"/>

<Value value="&gt; 3"/>

<Value value="2"/>

</DataField>

<DataField name="domicile" optype="categorical">

<Value value="suburban"/>

<Value value="urban"/>

<Value value="rural"/>

</DataField>

<DataField name="age of car" optype="continuous"/>

<DataField name="amount of claims" optype="continuous"/>

</DataDictionary>

<NeuralNetwork modelName="Neural Insurance"

functionName="regression"

activationFunction="logistic">

<MiningSchema>

<MiningField name="gender"/>

<MiningField name="no of claims"/>

<MiningField name="domicile"/>

<MiningField name="age of car"/>

<MiningField name="amount of claims" usageType="predicted"/>

</MiningSchema>

<NeuralInputs>

<NeuralInput id="0">

<DerivedField>

<NormContinuous field="age of car">

<LinearNorm orig="0.01" norm="0"/>

<LinearNorm orig="3.07897" norm="0.5"/>

<LinearNorm orig="11.44" norm="1"/>

</NormContinuous>

</DerivedField>

</NeuralInput>

<NeuralInput id="1">

<DerivedField>

<NormDiscrete field="gender" value="male"/>

</DerivedField>

</NeuralInput>

<NeuralInput id="2">

<DerivedField>

<NormDiscrete field="no of claims" value="0"/>

</DerivedField>

</NeuralInput>

<NeuralInput id="3">

<DerivedField>

<NormDiscrete field="no of claims" value="1"/>

</DerivedField>

</NeuralInput>

<NeuralInput id="4">

<DerivedField>

<NormDiscrete field="no of claims" value="3"/>

</DerivedField>

</NeuralInput>

<NeuralInput id="5">

<DerivedField>

<NormDiscrete field="no of claims" value="&gt; 3"/>

</DerivedField>

</NeuralInput>

<NeuralInput id="6">

<DerivedField>

<NormDiscrete field="no of claims" value="2"/>

</DerivedField>

</NeuralInput>

<NeuralInput id="7">

<DerivedField>

<NormDiscrete field="domicile" value="suburban"/>

</DerivedField>

</NeuralInput>

<NeuralInput id="8">

<DerivedField>

<NormDiscrete field="domicile" value="urban"/>

</DerivedField>

</NeuralInput>

<NeuralInput id="9">

<DerivedField>

<NormDiscrete field="domicile" value="rural"/>

</DerivedField>

</NeuralInput>

</NeuralInputs>

<NeuralLayer>

<Neuron id="10">

<Con from="0" weight="-2.08148"/>

<Con from="1" weight="3.69657"/>

<Con from="2" weight="-1.89986"/>

<Con from="3" weight="5.61779"/>

<Con from="4" weight="0.427558"/>

<Con from="5" weight="-1.25971"/>

<Con from="6" weight="-6.55549"/>

<Con from="7" weight="-4.62773"/>

<Con from="8" weight="1.97525"/>

<Con from="9" weight="-1.0962"/>

</Neuron>

<Neuron id="11">

<Con from="0" weight="-0.698997"/>

<Con from="1" weight="-3.54943"/>

<Con from="2" weight="-3.29632"/>

<Con from="3" weight="-1.20931"/>

<Con from="4" weight="1.00497"/>

<Con from="5" weight="0.033502"/>

<Con from="6" weight="1.12016"/>

<Con from="7" weight="0.523197"/>

<Con from="8" weight="-2.96135"/>

<Con from="9" weight="-0.398626"/>

</Neuron>

<Neuron id="12">

<Con from="0" weight="0.904057"/>

<Con from="1" weight="1.75084"/>

<Con from="2" weight="2.51658"/>

<Con from="3" weight="-0.151895"/>

<Con from="4" weight="-2.88008"/>

<Con from="5" weight="0.920063"/>

<Con from="6" weight="-3.30742"/>

<Con from="7" weight="-1.72251"/>

<Con from="8" weight="-1.13156"/>

<Con from="9" weight="-0.758563"/>

</Neuron>

</NeuralLayer>

<NeuralLayer>

<Neuron id="13">

<Con from="10" weight="0.76617"/>

<Con from="11" weight="-1.5065"/>

<Con from="12" weight="0.999797"/>

</Neuron>

</NeuralLayer>

<NeuralOutputs>

<NeuralOutput outputNeuron="13">

<DerivedField>

<NormContinuous field="amount of claims">

<LinearNorm orig="0" norm="0.1"/>

<LinearNorm orig="1291.68" norm="0.5"/>

<LinearNorm orig="5327.26" norm="0.9"/>

</NormContinuous>

</DerivedField>

</NeuralOutput>

</NeuralOutputs>

</NeuralNetwork>

</PMML>
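
The scoring procedure described in this section can be sketched in Python for the example network. The weights and LinearNorm points are copied from the model; everything else is an assumption of the sketch: the helper names, the sample record, the logistic activation taken as 1 / (1 + exp(-z)) with z the weighted sum of incoming activations (no bias, since the example defines none), linear extrapolation outside the LinearNorm range, and input 5 taken to encode the category "> 3" from the data dictionary.

```python
import math

# LinearNorm points from the example, as (orig, norm) pairs.
AGE_NORM = [(0.01, 0.0), (3.07897, 0.5), (11.44, 1.0)]
CLAIM_NORM = [(0.0, 0.1), (1291.68, 0.5), (5327.26, 0.9)]

# (field, value) pairs behind NeuralInput ids 1 to 9, in id order.
INDICATORS = [
    ("gender", "male"),
    ("no of claims", "0"), ("no of claims", "1"), ("no of claims", "3"),
    ("no of claims", "> 3"), ("no of claims", "2"),
    ("domicile", "suburban"), ("domicile", "urban"), ("domicile", "rural"),
]

# Incoming weights of hidden neurons 10, 11 and 12 (from inputs 0 to 9).
HIDDEN_WEIGHTS = [
    [-2.08148, 3.69657, -1.89986, 5.61779, 0.427558,
     -1.25971, -6.55549, -4.62773, 1.97525, -1.0962],
    [-0.698997, -3.54943, -3.29632, -1.20931, 1.00497,
     0.033502, 1.12016, 0.523197, -2.96135, -0.398626],
    [0.904057, 1.75084, 2.51658, -0.151895, -2.88008,
     0.920063, -3.30742, -1.72251, -1.13156, -0.758563],
]
# Incoming weights of output neuron 13 (from neurons 10, 11 and 12).
OUTPUT_WEIGHTS = [0.76617, -1.5065, 0.999797]


def piecewise_linear(x, points):
    """Interpolate through sorted (x, y) points; extrapolate at the ends."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            break
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(record):
    """Build the ten input activations from a raw record."""
    inputs = [piecewise_linear(record["age of car"], AGE_NORM)]
    inputs += [1.0 if record[f] == v else 0.0 for f, v in INDICATORS]
    return inputs


def forward(inputs):
    """Activation of output neuron 13 for the given input activations."""
    hidden = [logistic(sum(w * x for w, x in zip(ws, inputs)))
              for ws in HIDDEN_WEIGHTS]
    return logistic(sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)))


def predict(record):
    """Denormalize the output activation into an amount of claims."""
    inverse = [(norm, orig) for orig, norm in CLAIM_NORM]
    return piecewise_linear(forward(encode(record)), inverse)


amount = predict({"age of car": 5.0, "gender": "male",
                  "no of claims": "1", "domicile": "urban"})
```

Swapping the (orig, norm) pairs is all that is needed for the inverse mapping here, because the LinearNorm points are strictly increasing in both coordinates.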


Posted on Thursday, October 9, 2014 4:44 AM

