Category Archives: Analytics

SAS Projects- Assignment 1

Dear reader, we are going to post practice assignments on different tools and techniques, and we encourage you to work through them.

Assignment-1

In an advertisement, a company claimed that taking its product Down-Drowsy reduced the time needed to fall asleep by 46% compared with the time needed without pills. The company, Able, based this claim on a sleep study. Participants were asked to record how long they took to fall asleep (`sleep latency') in minutes, and their average for a week was computed. The next week the same participants received Down-Drowsy and again recorded their sleep latency. The following link gives the average sleep latency for each of the 73 participants, first for the week without pills and then for the week with pills.

Click Assignment 1 to access the data.

Problem 1: Put the data above into a SAS data set containing 3 variables, using Patient, Week 1 and Week 2 as labels. Refer to Data Step Basics, SAS Variables, and Input Statement (List) for assistance. (Use the Windows clipboard to transfer the data from the help file into the SAS program window, following the DATALINES statement.) Use an INPUT statement with the @@ modifier.

Problem 2: Use PROC SORT to arrange the data in increasing order of patient number, then print the sorted data set using PROC PRINT.

Problem 3: Use PROC MEANS to calculate the mean and standard deviation of the sleep latency times for each week.

Problem 4: How would you statistically determine whether the pill actually reduces sleep latency? Choose an appropriate statistical procedure, run it, and interpret the results. (A sketch of the underlying idea appears below.)
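Hint for Problem 4: since each person is measured in both weeks, a paired comparison is a natural candidate. The assignment itself asks for SAS, but the following minimal Python sketch shows the idea, using invented numbers rather than the assignment data:

```python
# Minimal sketch of a paired comparison (hypothetical numbers, not the
# assignment's 73-person data set; the assignment itself should use SAS).
from scipy import stats

week1 = [32, 45, 28, 50, 41]  # sleep latency (minutes) without pills
week2 = [25, 40, 30, 38, 35]  # sleep latency (minutes) with Down-Drowsy

# Paired t-test: is the mean within-person difference nonzero?
t_stat, p_value = stats.ttest_rel(week1, week2)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p-value would suggest the pill changes mean sleep latency.
```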

Please submit your answers, or email us at analyticsquare@gmail.com if you need assistance with the problems above.

NoSQL- Introduction


Introduction
What NoSQL is
Types of NoSQL databases

Introduction: Today's market is full of new buzzwords: NoSQL, Big Data, Cloud and so on. In this post we will discuss the latest of these, NoSQL, which I sometimes think is the most confusing buzzword of all. For decades we have been using databases to store data electronically (Oracle Version 1 was launched in 1978; source: Wikipedia). In these databases, data is stored in tables, and tables have rows and columns. Tables may have relationships with other tables, in which case they are often called parent and child tables; such systems are called RDBMS (relational database management systems). An RDBMS must follow the ACID rules: Atomicity, Consistency, Isolation and Durability. An RDBMS works perfectly when we work with structured, organized data: each row has a fixed set of columns, so we have to decide a table's structure (schema) before inserting data into it. For example, a Car table might have the predefined columns ModelNo, Color and Make.

In today's world, however, the nature of data is changing rapidly. Almost anything can be treated as data: my email ID, my likes, my posts, my pictures, my browser history, my call details, and in fact even my geographic location. As the volume of data grows, the required storage capacity grows with it, and storing and managing unstructured data becomes a big problem. Because the nature of the data keeps changing, it is not easy to maintain it in traditional ways: more data means more space and more data-management work, and scalability can become an issue with an RDBMS.

Given these limitations, companies want a solution that supports a non-relational environment, is not tied to a fixed schema (i.e., supports a dynamic schema), is easy to maintain, and is scalable not only vertically but also horizontally. To overcome the limitations of the RDBMS, a new term was introduced: NoSQL.

What NoSQL is: The term NoSQL came into the picture in early 2009 (although Carlo Strozzi had already used the name NoSQL in 1998, pronouncing it “no-sequel”). Today, however, NoSQL is read as “Not Only SQL”. NoSQL is a concept that departs from the traditional RDBMS: the distinctive features of NoSQL databases are that they do not use the relational model, they do not rely on SQL to query data, they support dynamic schemas, and they are scalable while still guaranteeing data availability. Where an RDBMS supports the ACID properties, NoSQL databases support the BASE properties: Basically Available, Soft state, Eventual consistency.

Types of NoSQL databases: Based on the data model, query model and consistency model they use, NoSQL databases can be divided into four categories:

  • Document Databases
  • Key-Value Stores
  • Graph Databases
  • Wide-Column Stores

Document Model: Where relational databases store data in rows and columns, document databases store data in documents made up of fields. These documents typically use a JSON (JavaScript Object Notation) structure. A document contains one or more fields, and each field holds a value of a specific data type such as a string, date or array; a document can even contain arrays or nested documents. Whereas in an RDBMS a record's data is spread across multiple rows and columns in several tables, in the document model each record and its associated data are typically stored in a single document, which makes data access very simple. In a document database the schema is dynamic: each document can contain different fields (the opposite of fixed columns in every row of a single table in an RDBMS). This makes life easier for developers, database programmers and database administrators when new fields need to be added to documents later.

Examples: MongoDB, CouchDB
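As a quick illustration, here is a minimal sketch of the document model using MongoDB's Python driver, PyMongo. It assumes a MongoDB server running locally, and the database, collection and field names are invented for the example:

```python
# Minimal document-model sketch with PyMongo (assumes MongoDB on localhost;
# all names below are invented examples).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
cars = client["demo"]["cars"]  # database "demo", collection "cars"

# One record = one document; fields may hold arrays or nested documents,
# and no schema has to be declared up front.
cars.insert_one({
    "model_no": "X100",
    "color": "Red",
    "make": "Acme",
    "previous_owners": ["Deepak", "Sachin"],        # array field
    "registration": {"state": "DL", "year": 2014},  # nested document
})

print(cars.find_one({"model_no": "X100"}))
```

Note that a later document could add a completely new field, say mileage, without any schema change.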

Key-Value Stores: Key-value stores are the simplest NoSQL databases and are quite similar to document stores. Every value is stored against a key, and as with document stores there is no schema to define. A key-value store requires a key before storing data, and that same key must be known to retrieve the record. Internally, this model typically keeps the key-value pairs in a hash table.

Examples: Riak, Redis, FoundationDB
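As an illustration, here is a minimal key-value sketch using the Python client for Redis, one of the stores listed above. It assumes a Redis server on localhost, and the key names are invented:

```python
# Minimal key-value sketch with redis-py (assumes Redis on localhost;
# key names below are invented examples).
import redis

r = redis.Redis(host="localhost", port=6379)

# Every value is stored against a key...
r.set("employee:E001:name", "Deepak Sharma")

# ...and the same key must be known to get the value back.
print(r.get("employee:E001:name"))  # b'Deepak Sharma'
```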

Graph Databases: This type of NoSQL database is based on graph theory: data is stored and represented as a graph made up of nodes, edges and properties. Nodes are similar to entities in an E-R diagram, such as people, companies and departments; an edge is the connection or relationship between two nodes, and both nodes and relationships can carry properties. The most useful property of this type of database is index-free adjacency, which means every value links directly to its associated values, so no index lookups are needed to traverse relationships.

Examples: Neo4J, Titan, Infinite Graph
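The index-free adjacency idea is easy to sketch in plain Python. The toy graph below is not a real graph database, just an illustration of nodes that hold direct references to their neighbours, so traversal needs no index lookups:

```python
# Toy in-memory graph: nodes carry properties and direct references to
# their neighbours (index-free adjacency). Not a real graph database.
class Node:
    def __init__(self, label, **properties):
        self.label = label
        self.properties = properties
        self.edges = []  # list of (relationship, neighbour) pairs

    def connect(self, relationship, other):
        self.edges.append((relationship, other))

deepak = Node("Person", name="Deepak")
acme = Node("Company", name="Acme")
deepak.connect("WORKS_AT", acme)

# Traversal follows pointers from node to node; no index is consulted.
for rel, neighbour in deepak.edges:
    print(deepak.properties["name"], rel, neighbour.properties["name"])
```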

Wide-Column Stores: In an RDBMS data is stored in two dimensions, rows and columns, whereas a wide-column store organizes data by column. Columns can be nested, forming what are called super columns. For example, in an RDBMS an Employee table could have the structure below:

Employee_ID | First_Name | Last_Name | Salary
E001        | Deepak     | Sharma    | 10,000
E002        | Sachin     | Sharma    | 12,000

In a column-store database, the same data can be stored column by column:

Employee_ID: E001, E002
First_Name:  Deepak, Sachin
Last_Name:   Sharma, Sharma
Salary:      10,000; 12,000

Examples: Cassandra, HBase (part of the Hadoop ecosystem), Cloudata.
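One rough way to picture the columnar layout in code is a dictionary of columns, as in the plain-Python toy below (not an actual wide-column store such as Cassandra):

```python
# Toy columnar layout of the Employee data above (plain Python, not a
# real wide-column store).
employees = {
    "Employee_ID": ["E001", "E002"],
    "First_Name":  ["Deepak", "Sachin"],
    "Last_Name":   ["Sharma", "Sharma"],
    "Salary":      [10000, 12000],
}

# Reading one column touches only that column's data...
print(employees["Salary"])  # [10000, 12000]

# ...while reconstructing a "row" means taking index i from every column.
row0 = {col: values[0] for col, values in employees.items()}
print(row0)
```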

Conclusion: RDBMSs remain the most stable and trusted way to store data, and NoSQL databases are still maturing. However, stable releases of the major NoSQL databases are available and companies have started to use them. Finally, I would say that NoSQL is not a replacement for SQL; it is an alternative to it.


3 Steps to Identify the Analytics Training You Need

You know your keynotes at conferences are having an impact when they raise awareness. Mine raise awareness not just of the science of analytics and its applications, but also of the need to become truly proficient at it. I know, because one of the questions I am asked most often after a keynote is: “I'm very interested in furthering my knowledge of analytics. Given my background, what kind of analytics training should I look for?”

The past few years have seen a boom in analytics education, whether an analytics major in a multi-year Master's degree, software tool training, multi-day workshops or concise online tutorials. The multitude of offerings, while all relevant, makes selecting the appropriate program arduous for professionals. Additionally, there is not enough clarity on pertinence, process and practice to answer the one key question: what is truly needed to succeed in analytics?

If you have been looking to get trained in analytics and wondering how to choose, I recommend following these 3 steps to find out what you need, based on your own background and where you want to go.

STEP 1: Identify what you want to do

What current or future role are you aiming for: are you, or do you want to be, an analyst or data scientist? Or are you a business professional looking to leverage analytics in your day-to-day workflow?

STEP 2: Identify the skills gap you have based on what you want to do

As you can imagine, the skills business professionals in Marketing, Product and similar functions need to leverage data effectively are somewhat different from those of a data scientist: data scientists need deeper technical skills, plus the ability to work effectively with business professionals. The 6 key analytics skills used by successful analysts/data scientists are:

  1. DTD framework: Understanding and hands-on experience of the basic “Data to Decisions” framework
  2. SQL skills: Ability to pull data from multiple sources and collate it: experience writing SQL queries and exposure to tools like Teradata, Oracle, etc. Some understanding of Big Data tools built on Hadoop is also helpful. (A small sketch of this kind of query follows this list.)
  3. Basic “applied” stat techniques: Hands-on experience with basic statistical techniques: Profiling, Correlation analysis, Trend analysis, Sizing/Estimation, Segmentation (RFM, product migration etc.)
  4. Working effectively with business side: Ability to work effectively with stakeholders by building alignment, effective communication and influencing
  5. Advanced “applied” stat techniques (hands-on): Hands-on comfort with advanced techniques: Time Series, Predictive Analytics (Regression and Decision Trees), Segmentation (K-means clustering) and Text Analytics (optional)
  6. Stat Tools: Experience with one or more statistical tools such as SAS, R, SPSS, KNIME or others.
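To make skill 2 concrete, here is a minimal sketch of pulling and collating data from two tables with SQL. It uses Python's built-in sqlite3 module so it runs anywhere; the table and column names are invented for the example:

```python
# A small illustration of "pull data from multiple sources and collate":
# joining two tables with SQL. Uses Python's built-in sqlite3 module;
# table and column names are invented examples.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders    (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Deepak'), (2, 'Sachin');
    INSERT INTO orders    VALUES (1, 120.0), (1, 80.0), (2, 45.0);
""")

# Collate: total order amount per customer.
for name, total in con.execute("""
        SELECT c.name, SUM(o.amount)
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """):
    print(name, total)
```

The same join-and-aggregate pattern carries over to Teradata, Oracle or Hive; only the connection layer changes.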

On the other hand, business professionals need easy access to data through a tool such as BusinessObjects or MicroStrategy, basic analysis skills, and the ability to work effectively with data scientists and analysts. The 4 key analytics skills needed by business professionals are:

  1. DTD framework: Understanding and hands-on experience of the basic “Data to Decisions” framework
  2. Basic “applied” stat techniques: Hands-on experience with basic statistical techniques: Profiling, Correlation analysis, Trend analysis, Sizing/Estimation, Basic Segmentation
  3. Working effectively with analysts: Ability to work effectively with data scientists/analysts
  4. Advanced “applied” stat techniques (intro): High-level understanding of advanced techniques: Time Series, Predictive Analytics (Regression and Decision Trees), Segmentation

STEP 3: Based on the skills gap you identified, choose the most appropriate training option

Given what you want to do, figure out the skills gaps you have and fill out the chart below. Depending on the gaps, there are 3 major options for getting the analytics training you need:

[Chart: Analytics Training Skills]

  1. Master’s degree in Analytics: Several universities offer a Master’s degree in Analytics, often by combining courses from their Statistics, Computer Science and Management departments. In my experience, these programs are most useful for individuals with no quantitative background who are aiming for future data scientist/analyst roles. They are fairly comprehensive but, as a result, time-consuming and often not practical for working professionals, although some universities offer online options that make them more accessible.
  2. Semester courses at local universities: Most universities offer semester/quarter courses through their statistics and computer science departments, often as part of a continuing-education program. These courses are most appropriate for data scientists/analysts with some quantitative background who want to pick up incremental skills for their current analytics role; for example, if you are in an analytics role and have never used R, you can take a semester course such as “Programming in R”.
  3. Professional workshops: Many consulting companies, Analytic Square among them, offer short analytics training courses that are most appropriate for working professionals. Depending on their area of focus, these courses suit business professionals looking to leverage data to make better decisions and analysts looking to pick up incremental skills. Their most valuable aspect is that they are geared towards business and are often taught by analytics professionals who have seen analytics applied in practice. The downside is that they are not comprehensive and often do not cover all the statistical concepts, but their short duration makes them accessible to most working professionals. Statistical tool vendors such as SAS and SPSS are good places to get the respective tool training.

But in the end, do your own due diligence and be sure to match the gaps you have identified with the courses you choose to take.

 

Predictive Analytics and Social Media- Predicting the Unpredictable

University researchers have discovered a new way to predict what topics on Twitter will be popular hours before they are identified as trending topics, offering a novel method to analyze information that changes over time.

MIT professor Devavrat Shah and his student Stanislav Nikolov have developed a new algorithm that they say can, with 95% accuracy, predict the Twitter topics that trend, or suddenly explode in volume, reflecting their popularity.

Twitter determines the trending topics based on its own algorithm that analyzes the number of Tweets and those that have recently grown in volume, according to an MIT report on the research.

Shah notes that his research differs from the standard approach to machine learning in which researchers develop a general hypothesis about a pattern and specifics about that pattern need to be inferred.

“You’d say, ‘Series of trending things . . . remain small for some time and then there is a step,’” Shah says in the MIT article. “This is a very simplistic model. Now, based on the data, you try to train for when the jump happens, and how much of a jump happens. The problem with this is, I don’t know that things that trend have a step function. There are a thousand things that could happen.”

With the method that he’s developed, the data decides, he adds.

Shah and Nikolov compare how the number of Tweets about a new topic changes over time to a sample set of historical data. Samples whose statistics are similar to those of the new topic are given more weight in predicting whether the topic will become a trend or fade away.

In essence, the comparison to the sample data set allows the sample set to “vote” as to the likelihood that the topic will trend on Twitter. The method can be applied to any sequence of measurements that’s performed at regular intervals such as ticket sales for movies or stock prices, according to MIT.
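To make the idea concrete, here is a toy sketch of similarity-weighted voting over time series. This captures only the flavor of the approach described above, not Shah and Nikolov's actual algorithm, and all the series and numbers are invented:

```python
# Toy sketch of similarity-weighted voting over time series -- the flavor
# of the approach described above, not the researchers' actual algorithm.
import math

def distance(a, b):
    # Squared Euclidean distance between two equal-length series.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trend_score(new_series, examples):
    """examples: list of (series, trended: bool) pairs from the sample set."""
    votes = {True: 0.0, False: 0.0}
    for series, trended in examples:
        weight = math.exp(-distance(new_series, series))  # closer = louder vote
        votes[trended] += weight
    return votes[True] / (votes[True] + votes[False])

examples = [
    ([1, 2, 4, 9, 20], True),   # grew sharply and trended
    ([3, 3, 4, 4, 5], False),   # stayed flat
]
print(trend_score([1, 3, 5, 8, 18], examples))  # near 1 => likely to trend
```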

This is not the first time researchers have used predictive analytics to tap social media data to predict seemingly unpredictable trends.

A professor at the University of California, Riverside (UCR) and other researchers have created a model that uses Twitter data collected on a particular day to help predict how often a stock will be traded the following day, and at what price.

A trading strategy that’s based on the researchers’ model, “outperformed other baseline strategies by between 1.4 percent and nearly 11 percent and also did better than the Dow Jones Industrial Average during a four-month simulation,” according to UCR Today.

“These findings have the potential to have a big impact on market investors,” says Vagelis Hristidis, an associate professor at the Bourns College of Engineering, who has helped to develop the new model. “With so much data available from social media, many investors are looking to sort it out and profit from it.”

The researchers have found that stock price correlates with the number of connected Tweets about a company – those Tweets about distinct topics that relate to one company.

Facebook has also been targeted by data scientists attempting to use predictive analytics to predict the fluctuating stock market. Arthur J. O’Connor, who has worked in risk management on Wall Street for a couple of decades, has developed a method that uses data analysis to examine whether likes on Facebook affect consumer brand stock prices.

“My theory was, you know, it’s like in high school,” he says in a NPR report. “Does being really popular help you win friends [or] help you enhance your performance? And it turns out that, yeah, popularity does seem to help brands.”

O’Connor spent a year tracking the likes of the 30 brands with the most followers on Facebook, while also tracking their daily share prices.

“So, 99.95 percent of the change could be explained by the change in fan counts,” he adds.

The admiration a company gets on social media seems to be a good predictor of its stock market performance.

Online SPSS Training for Beginners-Session 1

 

What is in this workshop

  • SPSS interface: data view and variable view
  • How to enter data in SPSS
  • How to import external data into SPSS
  • How to clean and edit data
  • How to transform variables
  • How to sort and select cases
  • How to get descriptive statistics

 Data used in the workshop

  • We use the 2009 Youth Risk Behavior Surveillance System (YRBSS, CDC) as an example.
  • YRBSS monitors priority health-risk behaviors and the prevalence of obesity and asthma among youth and young adults.
  • The target population is high school students.
  • The health behaviors covered include drinking, smoking, exercise, eating habits, etc.
 
Data view
  • The place to enter data
  • Columns: variables
  • Rows: records

Variable view
  • The place to enter variables
  • List of all variables
  • Characteristics of all variables

 

Before the data entry
 
  • You need a code book/scoring guide.
  • If you use a paper survey, assign an ID number to each case (NOT the real identification numbers of your subjects).
  • If you use an online survey, you need something else to identify your cases.
  • You can also use Excel for data entry.

Example of a code book

[screenshot]

Enter data in SPSS 19.0

[screenshot]

Enter variables

[screenshots]

Enter cases

[screenshot]

Keep watching this page…