This exam is administered by SAS and Pearson VUE.

60 scored multiple-choice and short-answer questions.

(Must achieve a score of 68 percent correct to pass)

In addition to the 60 scored items, there may be up to five unscored items.

Two hours to complete exam.

Use exam ID A00-240; required when registering with Pearson VUE.

ANOVA - 10%

Verify the assumptions of ANOVA

Analyze differences between population means using the GLM and TTEST procedures

Perform ANOVA post hoc test to evaluate treatment effect

Detect and analyze interactions between factors

Linear Regression - 20%

Fit a multiple linear regression model using the REG and GLM procedures

Analyze the output of the REG, PLM, and GLM procedures for multiple linear regression models

Use the REG or GLMSELECT procedure to perform model selection

Assess the validity of a given regression model through the use of diagnostic and residual analysis

Logistic Regression - 25%

Perform logistic regression with the LOGISTIC procedure

Optimize model performance through input selection

Interpret the output of the LOGISTIC procedure

Score new data sets using the LOGISTIC and PLM procedures

Prepare Inputs for Predictive Model Performance - 20%

Identify the potential challenges when preparing input data for a model

Use the DATA step to manipulate data with loops, arrays, conditional statements and functions

Improve the predictive power of categorical inputs

Screen variables for irrelevance and non-linear association using the CORR procedure

Screen variables for non-linearity using empirical logit plots

Measure Model Performance - 25%

Apply the principles of honest assessment to model performance measurement

Assess classifier performance using the confusion matrix

Model selection and validation using training and validation data

Create and interpret graphs (ROC, lift, and gains charts) for model comparison and selection

Establish effective decision cut-off values for scoring

Verify the assumptions of ANOVA

=> Explain the central limit theorem and when it must be applied

=> Examine the distribution of continuous variables (histogram, box-whisker, Q-Q plots)

=> Describe the effect of skewness on the normal distribution

=> Define H0, H1, Type I/II error, statistical power, p-value

=> Describe the effect of sample size on p-value and power

=> Interpret the results of hypothesis testing

=> Interpret histograms and normal probability charts

=> Draw conclusions about your data from histogram, box-whisker, and Q-Q plots

=> Identify the kinds of problems that may be present in the data (biased sample, outliers, extreme values)

=> For a given experiment, verify that the observations are independent

=> For a given experiment, verify the errors are normally distributed

=> Use the UNIVARIATE procedure to examine residuals

=> For a given experiment, verify all groups have equal response variance

=> Use the HOVTEST option of the MEANS statement in PROC GLM to assess response variance
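
As a study aid, the equal-variance check named in the last objective can be sketched as below. The data set work.trial and the variables group and response are hypothetical names chosen for illustration; they are not part of the exam outline.

```sas
/* Sketch only: work.trial, group, and response are assumed names. */
proc glm data=work.trial;
   class group;                      /* grouping (treatment) factor         */
   model response = group;           /* one-way ANOVA model                 */
   means group / hovtest=levene;     /* Levene's test for equal variances   */
run;
quit;
```

A small p-value for Levene's test suggests that the equal-variance assumption may not hold for these groups.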

Analyze differences between population means using the GLM and TTEST procedures

=> Use the GLM Procedure to perform ANOVA

o CLASS statement

o MODEL statement

o MEANS statement

o OUTPUT statement

=> Evaluate the null hypothesis using the output of the GLM procedure

=> Interpret the statistical output of the GLM procedure (variance derived from MSE, F value, p-value, R**2, Levene's test)

=> Interpret the graphical output of the GLM procedure

=> Use the TTEST Procedure to compare means

Perform ANOVA post hoc test to evaluate treatment effect

=> Use the LSMEANS statement in the GLM or PLM procedure to perform pairwise comparisons

=> Use PDIFF option of LSMEANS statement

=> Use ADJUST option of the LSMEANS statement (TUKEY and DUNNETT)

=> Interpret diffograms to evaluate pairwise comparisons

=> Interpret control plots to evaluate pairwise comparisons

=> Compare/Contrast use of pairwise T-Tests, Tukey and Dunnett comparison methods

Detect and analyze interactions between factors

=> Use the GLM procedure to produce reports that will help determine the significance of the interaction between factors

o MODEL statement

o LSMEANS with SLICE= option (also using PROC PLM)

o ODS SELECT

=> Interpret the output of the GLM procedure to identify interaction between factors:

o p-value

o F Value

o R Squared

o TYPE I SS

o TYPE III SS

Linear Regression - 20%

Fit a multiple linear regression model using the REG and GLM procedures

=> Use the REG procedure to fit a multiple linear regression model

=> Use the GLM procedure to fit a multiple linear regression model

Analyze the output of the REG, PLM, and GLM procedures for multiple linear regression models

=> Interpret REG or GLM procedure output for a multiple linear regression model:

=> Convert models to algebraic expressions

=> Identify missing degrees of freedom

=> Identify variance due to model/error, and total variance

=> Calculate a missing F value

=> Identify variable with largest impact to model

=> For output from two models, identify which model is better

=> Identify how much of the variation in the dependent variable is explained by the model

=> Conclusions that can be drawn from REG, GLM, or PLM output: (about H0, model quality, graphics)

Use the REG or GLMSELECT procedure to perform model selection

=> Use the SELECTION option of the MODEL statement in the GLMSELECT procedure

=> Compare the different model selection methods (STEPWISE, FORWARD, BACKWARD)

=> Enable ODS graphics to display graphs from the REG or GLMSELECT procedure

=> Identify best models by examining the graphical output (fit criterion from the REG or GLMSELECT procedure)

=> Assign names to models in the REG procedure (multiple model statements)

Assess the validity of a given regression model through the use of diagnostic and residual analysis

=> Explain the assumptions for linear regression

=> From a set of residual plots, assess which assumption about the error terms has been violated

=> Use REG procedure MODEL statement options to identify influential observations (Student Residuals, Cook's D, DFFITS, DFBETAS)

=> Explain options for handling influential observations

=> Identify collinearity problems by examining REG procedure output

=> Use MODEL statement options to diagnose collinearity problems (VIF, COLLIN, COLLINOINT)
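
The collinearity diagnostics listed in the last objective are requested as MODEL statement options in PROC REG. The following is a minimal sketch; the data set work.fitness and its variables are hypothetical names used only for illustration.

```sas
/* Sketch only: work.fitness and the variable names are assumed. */
proc reg data=work.fitness;
   /* VIF        = variance inflation factors                       */
   /* COLLIN     = collinearity diagnostics including the intercept */
   /* COLLINOINT = the same diagnostics with the intercept removed  */
   model oxygen = runtime age weight / vif collin collinoint;
run;
quit;
```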

Logistic Regression - 25%

Perform logistic regression with the LOGISTIC procedure

=> Identify experiments that require analysis via logistic regression

=> Identify logistic regression assumptions

=> Explain logistic regression concepts (log odds, logit transformation, sigmoidal relationship between p and X)

=> Use the LOGISTIC procedure to fit a binary logistic regression model (MODEL and CLASS statements)

Optimize model performance through input selection

=> Use the LOGISTIC procedure to fit a multiple logistic regression model

=> LOGISTIC procedure SELECTION=SCORE option

=> Perform Model Selection (STEPWISE, FORWARD, BACKWARD) within the LOGISTIC procedure

Interpret the output of the LOGISTIC procedure

=> Interpret the output from the LOGISTIC procedure for binary logistic regression models:

o Model Convergence section

o Testing Global Null Hypothesis table

o Type 3 Analysis of Effects table

o Analysis of Maximum Likelihood Estimates table

o Association of Predicted Probabilities and Observed Responses

Score new data sets using the LOGISTIC and PLM procedures

=> Use the SCORE statement in the PLM procedure to score new cases

=> Use the CODE statement in PROC LOGISTIC to score new data

=> Describe when you would use the SCORE statement vs the CODE statement in PROC LOGISTIC

=> Use the INMODEL/OUTMODEL options in PROC LOGISTIC

=> Explain how to score new data when you have developed a model from a biased sample

Prepare Inputs for Predictive Model Performance - 20%

Identify the potential challenges when preparing input data for a model

=> Identify problems that missing values can cause in creating predictive models and scoring new data sets

=> Identify limitations of Complete Case Analysis

=> Explain problems caused by categorical variables with numerous levels

=> Discuss the problem of redundant variables

=> Discuss the problem of irrelevant and redundant variables

=> Discuss the non-linearities and the problems they create in predictive models

=> Discuss outliers and the problems they create in predictive models

=> Describe quasi-complete separation

=> Discuss the effect of interactions

=> Determine when it is necessary to oversample data

Use the DATA step to manipulate data with loops, arrays, conditional statements and functions

=> Use ARRAYs to create missing indicators

=> Use ARRAYS, LOOP, IF, and explicit OUTPUT statements
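
One possible way to build missing-value indicators with arrays and a DO loop is sketched below. The data set work.inputs and the input names income, age, and balance are hypothetical; the indicator names are likewise invented for the example.

```sas
/* Sketch only: work.inputs and all variable names are assumed. */
data work.inputs_mi;
   set work.inputs;
   array x{*}  income age balance;            /* numeric inputs to check   */
   array mi{*} mi_income mi_age mi_balance;   /* one indicator per input   */
   do i = 1 to dim(x);
      mi{i} = missing(x{i});                  /* 1 if missing, 0 otherwise */
   end;
   drop i;
run;
```

The two arrays must list their elements in the same order so that each indicator lines up with its input.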

Improve the predictive power of categorical inputs

=> Reduce the number of levels of a categorical variable

=> Explain thresholding

=> Explain Greenacre's method

=> Cluster the levels of a categorical variable via Greenacre's method using the CLUSTER procedure

o METHOD=WARD option

o FREQ, VAR, ID statement

o Use of ODS output to create an output data set

=> Convert categorical variables to continuous using smooth weight of evidence

Screen variables for irrelevance and non-linear association using the CORR procedure

=> Explain how Hoeffding's D and Spearman statistics can be used to find irrelevant variables and non-linear associations

=> Produce Spearman and Hoeffding's D statistic using the CORR procedure (VAR, WITH statement)

=> Interpret a scatter plot of Hoeffding's D and Spearman statistic to identify irrelevant variables and non-linear associations

Screen variables for non-linearity using empirical logit plots

=> Use the RANK procedure to bin continuous input variables (GROUPS=, OUT= option; VAR, RANK statements)

=> Interpret RANK procedure output

=> Use the MEANS procedure to calculate the sum and means for the target cases and total events (NWAY option; CLASS, VAR, OUTPUT statements)

=> Create empirical logit plots with the SGPLOT procedure

=> Interpret empirical logit plots
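
The RANK, MEANS, and SGPLOT steps in this block can be chained roughly as follows. This is a sketch under stated assumptions: work.develop is a hypothetical data set with a 0/1 target named target and a continuous input named balance, and the small-sample adjustment sqrt(_FREQ_)/2 is one common choice rather than the only one.

```sas
/* Sketch only: work.develop, target, and balance are assumed names. */
proc rank data=work.develop groups=100 out=work.ranked;
   var balance;
   ranks bin;                          /* bin number for each observation */
run;

proc means data=work.ranked noprint nway;
   class bin;
   var target balance;
   output out=work.bins sum(target)=events mean(balance)=balance_mean;
run;

data work.bins;
   set work.bins;
   /* empirical logit with an adjustment that avoids log(0) */
   elogit = log((events + sqrt(_freq_)/2) /
                (_freq_ - events + sqrt(_freq_)/2));
run;

proc sgplot data=work.bins;
   series y=elogit x=balance_mean;
run;
```

A roughly straight line suggests the input can enter the logistic model as is; a curved or kinked line suggests a transformation or binning may help.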

Measure Model Performance - 25%

Apply the principles of honest assessment to model performance measurement

=> Explain techniques to honestly assess classifier performance

=> Explain overfitting

=> Explain differences between validation and test data

=> Identify the impact of performing data preparation before data is split

Assess classifier performance using the confusion matrix

=> Explain the confusion matrix

=> Define: Accuracy, Error Rate, Sensitivity, Specificity, PV+, PV-

=> Explain the effect of oversampling on the confusion matrix

=> Adjust the confusion matrix for oversampling

Model selection and validation using training and validation data

=> Divide data into training and validation data sets using the SURVEYSELECT procedure

=> Discuss the subset selection methods available in PROC LOGISTIC

=> Discuss methods to determine interactions (forward selection, with bar and @ notation)

=> Create interaction plot with the results from PROC LOGISTIC

=> Select the model with fit statistics (BIC, AIC, KS, Brier score)
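
A stratified training/validation split with PROC SURVEYSELECT might look like the sketch below. The data set work.develop, the target variable target, the two-thirds sampling rate, and the seed are all assumptions made for illustration.

```sas
/* Sketch only: data set names, the rate, and the seed are assumed. */
proc sort data=work.develop out=work.develop_sorted;
   by target;                          /* STRATA requires sorted input */
run;

proc surveyselect data=work.develop_sorted out=work.sampled
                  samprate=0.6667 seed=44444 outall;
   strata target;                      /* keep the event rate similar in both parts */
run;

data work.train work.valid;
   set work.sampled;
   if selected then output work.train; /* OUTALL adds the Selected flag */
   else output work.valid;
run;
```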

Create and interpret graphs (ROC, lift, and gains charts) for model comparison and selection

=> Explain and interpret charts (ROC, Lift, Gains)

=> Create a ROC curve (OUTROC option of the SCORE statement in the LOGISTIC procedure)

=> Use the ROC and ROCCONTRAST statements to create an overlay plot of ROC curves for two or more models

=> Explain the concept of depth as it relates to the gains chart

Establish effective decision cut-off values for scoring

=> Illustrate a decision rule that maximizes the expected profit

=> Explain the profit matrix and how to use it to estimate the profit per scored customer

=> Calculate decision cutoffs using Bayes rule, given a profit matrix

=> Determine optimum cutoff values from profit plots

=> Given a profit matrix, and model results, determine the model with the highest average profit
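
The Bayes-rule cutoff for a given profit matrix can be worked out in a short DATA step. The profit values below are invented for illustration. The rule classifies a case as an event when the expected profit of doing so exceeds the expected profit of not doing so.

```sas
/* Sketch only: the four profit values are made-up illustrations. */
data _null_;
   profit_tp = 99;   /* predicted event,    actual event    */
   profit_fp = -1;   /* predicted event,    actual nonevent */
   profit_fn = 0;    /* predicted nonevent, actual event    */
   profit_tn = 0;    /* predicted nonevent, actual nonevent */
   /* Predict event when
      p*profit_tp + (1-p)*profit_fp > p*profit_fn + (1-p)*profit_tn,
      which rearranges to p > cutoff as computed here. */
   cutoff = (profit_tn - profit_fp)
          / ((profit_tn - profit_fp) + (profit_tp - profit_fn));
   put cutoff=;
run;
```

With these assumed values the cutoff works out to 1/(1 + 99), so any case with a predicted probability above 0.01 would be classified as an event.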

Principal Component Analysis

One important type of analysis performed by the FACTOR procedure is principal component analysis. The statements

   proc factor;
   run;

result in a principal component analysis. The output includes all of the eigenvalues and the pattern matrix for eigenvalues greater than one.

Most applications require additional output. For example, you may want to compute principal component scores for use in subsequent analyses or obtain a graphical aid to help decide how many components to keep. You can save the results of the analysis in a permanent SAS data library by using the OUTSTAT= option. (Refer to the SAS Language Reference: Dictionary for more information on permanent SAS data libraries and librefs.) Assuming that your SAS data library has the libref save and that the data are in a SAS data set called raw, you could do a principal component analysis as follows:

   proc factor data=raw method=principal scree mineigen=0 score
               outstat=save.fact_all;
   run;

The SCREE option produces a plot of the eigenvalues that is helpful in deciding how many components to use. The MINEIGEN=0 option causes all components with variance greater than zero to be retained. The SCORE option requests that scoring coefficients be computed. The OUTSTAT= option saves the results in a specially structured SAS data set. The name of the data set, in this case fact_all, is arbitrary. To compute principal component scores, use the SCORE procedure.

   proc score data=raw score=save.fact_all out=save.scores;
   run;

The SCORE procedure uses the data and the scoring coefficients that are saved in save.fact_all to compute principal component scores. The component scores are placed in variables named Factor1, Factor2, ... , Factorn and are saved in the data set save.scores. If you know ahead of time how many principal components you want to use, you can obtain the scores directly from PROC FACTOR by specifying the NFACTORS= and OUT= options. To get scores from three principal components, specify

   proc factor data=raw method=principal
               nfactors=3 out=save.scores;
   run;

To plot the scores for the first three components, use the PLOT procedure.

   proc plot;
      plot factor2*factor1 factor3*factor1 factor3*factor2;
   run;

Principal Factor Analysis

The simplest and computationally most efficient method of common factor analysis is principal factor analysis, which is obtained the same way as principal component analysis except for the use of the PRIORS= option. The usual form of the initial analysis is

   proc factor data=raw method=principal scree
               mineigen=0 priors=smc outstat=save.fact_all;
   run;

The squared multiple correlations (SMC) of each variable with all the other variables are used as the prior communality estimates. If your correlation matrix is singular, you should specify PRIORS=MAX instead of PRIORS=SMC. The SCREE and MINEIGEN= options serve the same purpose as in the preceding principal component analysis. Saving the results with the OUTSTAT= option enables you to examine the eigenvalues and scree plot before deciding how many factors to rotate and to try several different rotations without re-extracting the factors. The OUTSTAT= data set is automatically marked TYPE=FACTOR, so the FACTOR procedure realizes that it contains statistics from a previous analysis instead of raw data.

After looking at the eigenvalues to estimate the number of factors, you can try some rotations. Two and three factors can be rotated with the following statements:

   proc factor data=save.fact_all method=principal n=2
               rotate=promax reorder score outstat=save.fact_2;
   run;
   proc factor data=save.fact_all method=principal n=3
               rotate=promax reorder score outstat=save.fact_3;
   run;

The output data set from the previous run is used as input for these analyses. The options N=2 and N=3 specify the number of factors to be rotated. The specification ROTATE=PROMAX requests a promax rotation, which has the advantage of providing both orthogonal and oblique rotations with only one invocation of PROC FACTOR. The REORDER option causes the variables to be reordered in the output so that variables associated with the same factor appear next to each other.

You can now compute and plot factor scores for the two-factor promax-rotated solution as follows:

   proc score data=raw score=save.fact_2 out=save.scores;
   run;
   proc plot;
      plot factor2*factor1;
   run;

Maximum-Likelihood Factor Analysis

Although principal factor analysis is perhaps the most commonly used method of common factor analysis, most statisticians prefer maximum-likelihood (ML) factor analysis (Lawley and Maxwell 1971). The ML method of estimation has desirable asymptotic properties (Bickel and Doksum 1977) and produces better estimates than principal factor analysis in large samples. You can test hypotheses about the number of common factors using the ML method.

The ML solution is equivalent to Rao's (1955) canonical factor solution and Howe's solution maximizing the determinant of the partial correlation matrix (Morrison 1976). Thus, as a descriptive method, ML factor analysis does not require a multivariate normal distribution. The validity of Bartlett's test for the number of factors does require approximate normality plus additional regularity conditions that are usually satisfied in practice (Geweke and Singleton 1980).

The ML method is more computationally demanding than principal factor analysis for two reasons. First, the communalities are estimated iteratively, and each iteration takes about as much computer time as principal factor analysis. The number of iterations typically ranges from about 5 to 20. Second, if you want to extract different numbers of factors, as is often the case, you must run the FACTOR procedure once for each number of factors. Therefore, an ML analysis can take 100 times as long as a principal factor analysis.

You can use principal factor analysis to get a rough idea of the number of factors before doing an ML analysis. If you think that there are between one and three factors, you can use the following statements for the ML analysis:

   proc factor data=raw method=ml n=1
               outstat=save.fact1;
   run;
   proc factor data=raw method=ml n=2 rotate=promax
               outstat=save.fact2;
   run;
   proc factor data=raw method=ml n=3 rotate=promax
               outstat=save.fact3;
   run;

The output data sets can be used for trying different rotations, computing scoring coefficients, or restarting the procedure in case it does not converge within the allotted number of iterations.

The ML method cannot be used with a singular correlation matrix, and it is especially prone to Heywood cases. (See the section "Heywood Cases and Other Anomalies" for a discussion of Heywood cases.) If you have problems with ML, the best alternative is to use the METHOD=ULS option for unweighted least-squares factor analysis.
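
As a rough illustration of that fallback, an unweighted least-squares run could look like the following. It reuses the raw data set and the save libref from the earlier examples; the choice of N=2, the promax rotation, and the output name fact_uls are assumptions made for the sketch.

```sas
/* Sketch only: N=2, ROTATE=PROMAX, and fact_uls are assumed choices. */
proc factor data=raw method=uls n=2 rotate=promax
            outstat=save.fact_uls;
run;
```

As with the ML runs, the OUTSTAT= data set can then be fed back into PROC FACTOR to try other rotations without re-extracting the factors.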

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.

