Category Archives: SPSS

3 Steps to Identify the Analytics Training You Need

You know your keynotes at conferences have a positive impact when they raise awareness. My keynotes raise awareness not just of the science of analytics and its applications, but also of the need to become truly proficient at it. I know, because one of the most common post-keynote questions I get is: “I’m very interested in furthering my knowledge in analytics. Given my background, what kind of analytics training should I look for?”

The past few years have borne witness to a boom in analytics education – be it an Analytics major in a multi-year Master’s degree, software tool training, multi-day workshops, or concise online tutorials. The multitude of offerings, while all relevant, makes selecting the appropriate program arduous for professionals. Additionally, there is not enough clarity on pertinence, process, and practice to answer the one key question: what is truly needed to succeed in analytics?

If you have been looking to get trained in analytics and wondering how to choose, I recommend following these 3 steps to find out what you need, based on your own background and where you want to go.

STEP 1: Identify what you want to do

What current or future role are you going for? Are you, or do you want to be, an analyst or data scientist? Or are you a business professional looking to leverage analytics in your day-to-day workflow?

STEP 2: Identify the skills gap you have based on what you want to do

As you can imagine, the skills business professionals in functions like Marketing and Product need to leverage data effectively are somewhat different from those of a data scientist. Data scientists need deeper technical skills as well as the ability to work effectively with business professionals. The 6 key analytics skills used by successful analysts/data scientists are:

  1. DTD framework: Understanding and hands-on experience of the basic “Data to Decisions” framework
  2. SQL skills: Ability to pull data from multiple sources and collate it: experience writing SQL queries and exposure to tools like Teradata, Oracle, etc. Some understanding of Big Data tools built on Hadoop is also helpful (see the PROC SQL sketch after this list).
  3. Basic “applied” stat techniques: Hands-on experience with basic statistical techniques: Profiling, Correlation analysis, Trend analysis, Sizing/Estimation, Segmentation (RFM, product migration etc.)
  4. Working effectively with business side: Ability to work effectively with stakeholders by building alignment, effective communication and influencing
  5. Advanced “applied” stat techniques (hands-on): Hands-on comfort with advanced techniques: Time Series, Predictive Analytics – Regression and Decision Trees, Segmentation (K-means clustering) and Text Analytics (optional)
  6. Stat tools: Experience with one or more statistical tools like SAS, R, SPSS, KNIME, or others.
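
For the SQL skills above, here is a minimal PROC SQL sketch in SAS. The table and column names (customers, transactions, customer_id, segment, amount) are hypothetical, purely to illustrate pulling data from two sources and collating it into one summary table.

    /* Hypothetical tables: collate transaction amounts per customer. */
    proc sql;
      create table customer_summary as
      select c.customer_id,
             c.segment,
             sum(t.amount) as total_spend        /* one row per customer */
      from customers as c
           left join transactions as t
             on c.customer_id = t.customer_id
      group by c.customer_id, c.segment;
    quit;

The same pattern applies when the source tables live in Teradata or Oracle and are reached through a SAS/ACCESS library instead of WORK.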

On the other hand, business professionals need easy access to data through a tool like BusinessObjects, MicroStrategy, etc., basic analysis skills, and the ability to work effectively with data scientists and analysts. The 4 key analytics skills needed by business professionals are:

  1. DTD framework: Understanding and hands-on experience of the basic “Data to Decisions” framework
  2. Basic “applied” stat techniques: Hands-on experience with basic statistical techniques: Profiling, Correlation analysis, Trend analysis, Sizing/Estimation, Basic Segmentation
  3. Working effectively with analysts: Ability to work effectively with data scientists/analysts
  4. Advanced “applied” stat techniques (intro): High-level understanding of advanced techniques: Time Series, Predictive Analytics – Regression and Decision Trees, Segmentation

STEP 3: Based on the skills gap you identified, choose the most appropriate training option

Given what you want to do, figure out the skills gap you have and fill out the chart below. Depending on the gaps, there are 3 major options to get the analytics training you need:

Analytics Training Skills Chart

  1. Master’s degree in Analytics: Several universities offer a Master’s degree in Analytics, often by combining courses from their Statistics, Computer Science, and Management departments. In my experience, this option is most useful for individuals with no quantitative background who are looking for future data scientist/analyst roles. These programs are fairly comprehensive but, as a result, time-consuming and often not appropriate for working professionals. Some universities do offer online options, making them more accessible.
  2. Semester courses at local universities: Most universities offer semester/quarterly courses from their Statistics and Computer Science departments, often as part of a continuing education program. These courses are most appropriate for data scientists/analysts with some quantitative background who are looking to pick up incremental skills for their current analytics role – e.g., if you are in an analytics role and have never used R, you can take a semester course like “Programming in R”.
  3. Professional workshops: Many consulting companies, like Analytic Square and others, offer short analytics training most appropriate for working professionals. Depending on their area of focus, these short courses best serve business professionals looking to leverage data to make better decisions and analysts looking to pick up incremental skills. Their most valuable aspect is that they are geared towards business and often taught by analytics professionals who have seen analytics in action as applied to business. The downside is that they are not comprehensive and often don’t cover all the statistical concepts. But being short in duration, they are very accessible to most working professionals. Statistical tool companies like SAS, SPSS, etc. are good places to get the respective tool training.

But in the end, do your own due diligence and be sure to match the gaps you have identified with the courses you choose to take.

 

Online SPSS Training for Beginners – Session 1

 

What is in this workshop

  • SPSS interface: data view and variable view
  • How to enter data in SPSS
  • How to import external data into SPSS
  • How to clean and edit data
  • How to transform variables
  • How to sort and select cases
  • How to get descriptive statistics

Data used in the workshop

  • We use the 2009 Youth Risk Behavior Surveillance System (YRBSS, CDC) as an example.
  • YRBSS monitors priority health-risk behaviors and the prevalence of obesity and asthma among youth and young adults.
  • The target population is high school students.
  • Multiple health behaviors are covered, including drinking, smoking, exercise, and eating habits.
 
Data view

  • The place to enter data
  • Columns: variables
  • Rows: records

Variable view

  • The place to enter variables
  • List of all variables
  • Characteristics of all variables

 

Before the data entry
 
  • You need a code book/scoring guide.
  • If you use a paper survey, give an ID number to each case (NOT the real identification numbers of your subjects).
  • If you use an online survey, you need something to identify your cases.
  • You can also use Excel to do data entry.

Example of a code book

[Screenshot: example code book]

Enter data in SPSS 19.0

[Screenshot]

Enter variables

[Screenshot]

[Screenshot]

Enter cases

[Screenshot]

Keep watching this page…

Engineers who implement process control can use analytics to think outside of the box. Better yet, they can use analytics to help solve the issues and risks associated with being inside or outside the box in the first place. Read on to learn what box I’m referring to.

Understanding advanced process control

Basic and advanced process control (APC) are terms typically associated with process industries such as chemicals, petrochemicals, oil and gas, and power generation. These industries deal with many continuous processes and fluid processing. It is interesting to me that even though the oil and gas and power generation industries embrace and implement APC, in general these same industries don’t necessarily embrace using advanced analytics along with it.

First, let’s see how APC may force an engineer inside a rectangle, or box. APC tends to use known values or safe ranges (plus or minus some percentage off these known values, such as an average) as inputs that impact a specific process or a target within a process.

Real-world processes are better represented by some type of ellipse, which can be modeled using a more realistic distribution of input values instead of these “known” ranges. By choosing “known safe ranges” as inputs, you are limiting yourself and your process to fitting inside a rectangle, or box.

A picture is worth a thousand words, and especially in this case: the graph below makes all of this much easier to understand and explain.

In this simple example, the points represent the real-world values. APC represents this process as the black box and only takes into account the values that fall into that box, while advanced analytics represents the process as the red ellipse. If you use APC alone, then you have issues or risks that fall into these two categories (see the sketch after this list):

  1. Points that fall within the box, but outside the ellipse.
  2. Points that fall outside the box, but inside the ellipse.
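
To make these two categories concrete, here is a minimal, hypothetical SAS sketch: it simulates a correlated process, flags each point as inside or outside an assumed safe-range box, and compares that with membership in a 95% ellipse computed from the Mahalanobis distance. All means, limits, and the correlation are illustrative assumptions, not values read off the graph.

    data risk_points;
      call streaminit(7);
      rho = 0.8;                                      /* assumed x-y correlation */
      do i = 1 to 1000;
        x = rand('normal', 50, 10);                   /* simulated input         */
        y = 40 + rho*(x - 50) + rand('normal', 0, 6); /* correlated response     */
        zx = (x - 50) / 10;                           /* standardized x          */
        zy = (y - 40) / 10;                           /* standardized y          */
        /* squared Mahalanobis distance for standardized bivariate data */
        d2 = (zx**2 - 2*rho*zx*zy + zy**2) / (1 - rho**2);
        in_box     = (40 <= x <= 60) and (32 <= y <= 48); /* assumed safe ranges */
        in_ellipse = (d2 <= 5.99);                    /* 95% chi-square cutoff, 2 df */
        output;
      end;
    run;

    /* Cross-tabulate the two views: the off-diagonal cells are exactly
       the two risk categories listed above. */
    proc freq data=risk_points;
      tables in_box*in_ellipse / norow nocol nopercent;
    run;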

You may be wondering, who cares? Let’s add some background information to our example. The process is being monitored and there are consequences when the process is stopped or restarted, or if the process fails to stop when it gets too high or low out of range. More specifically:

  • Whenever the process is stopped and restarted, it costs our company some number of dollars (lost money!).
  • If the process gets too high out of range and isn’t stopped, it can cause an explosion (major safety issue!).
  • If the process gets too low out of range, the resulting output will not meet the required specifications and will need to be disposed of instead of sold (even more money lost!).

Now who cares? Everyone at this company should, especially given the safety issue.

Safety issue: The points in the upper right-hand corner of the box, but outside the ellipse. In this case the process being monitored appears to be within the proper range, but in the real world the values are too high out of range, which eventually results in an explosion.

Money issue: The points above and below the box, but inside the ellipse, show situations where the real-world process is actually fine, but our monitoring has us stop and restart the process, which results in money being lost.

Money issue: For the points in the lower left-hand corner of the box, but outside the ellipse, the process being monitored appears to be within the proper range, but the end product will be out of specification, which results in money being lost.

Where can this lesson be applied?

I’ve seen advanced analytics used to enhance process-controlled systems for improved safety and overall production in refining oil, generating power, and producing beer, chemicals, pharmaceuticals, and food products, as well as in other processes across a variety of industries. For example, when generating power and monitoring a turbine, one can easily identify speed, cooling, and heating as three processes impacted by a variety of measures that can be monitored and improved this way.

What type of processes do you have in your business today that could be improved by applying advanced analytics in this way?

One other issue for the engineer and data scientist to potentially argue over is whether the X and Y variables in this example are correlated or not. You might think these two would agree; however, if they both calculate Pearson’s r value (a common statistical measure of correlation), they may come to opposite conclusions. Once again you may wonder why. It goes back to being inside the box. Many times when using APC, someone only looks at the points that fall inside the box, and as a result they are not using all the data available to make the proper decision. A minimal sketch of this effect follows.
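
Here is a hypothetical SAS sketch of that disagreement: Pearson’s r computed on all of the data versus on only the points that fall inside the box. The data, ranges, and correlation are simulated assumptions, not values from the example above.

    data process;
      call streaminit(42);
      do i = 1 to 1000;
        x = rand('normal', 50, 10);                   /* simulated input     */
        y = 40 + 0.8*(x - 50) + rand('normal', 0, 6); /* correlated response */
        in_box = (40 <= x <= 60) and (32 <= y <= 48); /* assumed safe ranges */
        output;
      end;
    run;

    proc corr data=process;                  /* all the data: r comes out near 0.8 */
      var x y;
    run;

    proc corr data=process(where=(in_box));  /* box only: r is noticeably weaker */
      var x y;
    run;

Because the box truncates the range of both variables, the in-box correlation is attenuated: the engineer looking only inside the box may conclude the variables are barely related, while the data scientist using all the points sees a strong relationship.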

SAS Interview Questions and Answers

  1. What has been your most common programming mistake?
  2. What is your favorite programming language and why?
  3. What is your favorite operating system? Why?
  4. Do you observe any coding standards? What is your opinion of them?
  5. What percent of your program code is usually original and what percent copied and modified?
  6. Have you ever had to follow SOPs or programming guidelines?
  7. Which is worse: not testing your programs or not commenting your programs?
  8. Name several ways to achieve efficiency in your program. Explain trade-offs.
  9. What other SAS products have you used and consider yourself proficient in using?
  10. How do you make use of functions?
  11. When looking for data contained in a character string of 150 bytes, which function is the best to locate that data: SCAN, INDEX, or INDEXC?
  12. What is the significance of the ‘OF’ in X=SUM(OF a1-a4, a6, a9);?
  13. What do the PUT and INPUT functions do?
  14. Which date function advances a date, time or date/time value by a given interval?
  15. What do the MOD and INT functions do?
  16. How might you use MOD and INT on numerics to mimic SUBSTR on character strings? (see the sketch after this list)
  17. In ARRAY processing, what does the DIM function do?
  18. How would you determine the number of missing or nonmissing values in computations?
  19. What is the difference between: x=a+b+c+d; and x=SUM(a,b,c,d);? (see the sketch after this list)
  20. There is a field containing a date. It needs to be displayed in the format “ddmonyy” if it’s before 1975, “dd mon ccyy” if it’s after 1985, and as ‘Disco Years’ if it’s between 1975 and 1985. How would you accomplish this in DATA step code? And how would you do it using only PROC FORMAT?
  21. In the following DATA step, what is needed for ‘fraction’ to print to the log? data _null_; x=1/3; if x=.3333 then put ‘fraction’; run;
  22. What is the difference between calculating the ‘mean’ using the mean function and PROC MEANS?
  23. Have you ever used “Proc Merge”? (be prepared for surprising answers..)
  24. If you were given several SAS data sets you were unfamiliar with, how would you find out the variable names and formats of each dataset?
  25. What SAS PROCs have you used and consider yourself proficient in using?
  26. How would you keep SAS from overlaying a SAS data set with its sorted version?
  27. In PROC PRINT, can you print only variables that begin with the letter “A”?
  28. What are some differences between PROC SUMMARY and PROC MEANS?
  29. Code the tables statement for a single-level (most common) frequency.
  30. Code the tables statement to produce a multi-level frequency.
  31. Name the option to produce frequency line items rather than a table.
  32. Produce output from a frequency. Restrict the printing of the table.
  33. Code a PROC MEANS that shows both summed and averaged output of the data.
  34. Code the option that allows PROC MEANS to include missing numeric data in the report.
  35. Code the MEANS to produce output to be used later.
  36. Do you use PROC REPORT or PROC TABULATE? Which do you prefer? Explain.
  37. What happens in a one-to-one merge? When would you use one?
  38. How would you combine 3 or more tables with different structures?
  39. What is a problem with merging two data sets that have variables with the same name but different data?
  40. When would you choose to MERGE two data sets together and when would you SET two data sets?
  41. Which data set is the controlling data set in the MERGE statement?
  42. How do the IN= variables improve the capability of a MERGE?
  43. Explain the message “MERGE HAS ONE OR MORE DATASETS WITH REPEATS OF BY VARIABLES”.
  44. How would you generate 1000 observations from a normal distribution with a mean of 50 and a standard deviation of 20? How would you use PROC CHART to look at the distribution? Describe the shape of the distribution.
  45. How do you generate random samples?
  46. What is the purpose of the statement DATA _NULL_ ;?
  47. What is the pound sign used for in the DATA _NULL_?
  48. What would you use the trailing @ sign for?
  49. For what purpose(s) would you use the RETURN statement?
  50. How would you determine how far down on a page you have printed in order to print out footnotes?
  51. What is the purpose of using the N=PS option?
  52. What system options would you use to help debug a macro?
  53. Describe how you would create a macro variable.
  54. How do you identify a macro variable?
  55. How do you define the end of a macro?
  56. How do you assign a macro variable to a SAS variable?
  57. For what purposes have you used SAS macros?
  58. What is the difference between %LOCAL and %GLOBAL?
  59. How long can a macro variable be? A token?
  60. If you use a SYMPUT in a DATA step, when and where can you use the macro variable?
  61. What do you code to create a macro? End one?
  62. Describe how you would pass data to a macro.
  63. You have five data sets that need to be processed identically; how would you simplify that processing with a macro?
  64. How would you code a macro statement to produce information on the SAS log? This statement can be coded anywhere.
  65. How do you add a number to a macro variable?
  66. If you need the value of a variable rather than the variable itself, what would you use to load the value to a macro variable?
  67. Can you execute a macro within a macro? Describe.
  68. Can you nest a macro within another macro? If so, how would SAS know where the current macro ended and the new one began?
  69. How are parameters passed to a macro?
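
As a quick, hypothetical illustration for questions 16 and 19 above (all values made up):

    /* Question 19: the + operator propagates missing values; SUM() skips them. */
    data _null_;
      a = 1; b = 2; c = .; d = 4;
      plus_result = a + b + c + d;   /* missing, because c is missing       */
      sum_result  = sum(a, b, c, d); /* 7: SUM ignores the missing argument */
      put plus_result= sum_result=;

      /* Question 16: MOD and INT can pick apart the digits of a numeric
         value much as SUBSTR picks apart a character string. */
      x = 20240115;                  /* numeric value in yyyymmdd style     */
      year = int(x / 10000);         /* 2024, like substr(ch, 1, 4)         */
      mmdd = mod(x, 10000);          /* 115, i.e. the "0115" part           */
      put year= mmdd=;
    run;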

 

Career in Business Intelligence and Analytics

The economy might still be wobbly, but for job seekers there are pockets of promise. Out of the many options available, Analytics and Business Intelligence is at the top of the charts, according to an editorial in the Economic Times. As per the Economic Times, there are a few points that every student and young professional should consider before making a decision about their career.

Reasons

“Recent developments in hardware and networking technologies have made it cheap to not only gather large volumes of data, but also to store it and retrieve it with ease. The ability to analyze the data and make business sense out of it is a skill that is fast gaining prominence,” says Ajit Isaac, MD and CEO at Ikya Human Capital Solutions, who projects 12,000-15,000 openings in 2014.

Sectors

Banks, consumer goods, retail, IT & IT consulting, business consulting, & e-commerce/online.

Skills

Training/experience in statistics or financial analysis; familiarity with statistical techniques and software such as SAS and SPSS.

Pay

Rs 4.5-8 lakh p.a. (entry level, Graduate/PG)

Rs 8-12 lakh p.a. (five years of experience)

Rs 15 lakh p.a. for IITs & premier schools