The World Happiness Report was recommended to us as a good starting point for gauging worldwide bliss. Although its data points certainly helped us throughout our analysis, by the end we came to understand that the variables in this report alone are perhaps insufficient to accurately measure the happiness of a country, since "happiness" is very relative in nature.
The dataset provides the following measurements per country. The first six are the factors used to explain the World Happiness Index:
GDP per Capita - Gross Domestic Product per capita for the countries
Family - Satisfaction Rank of Family
Life Expectancy - Avg. expected years to live
Freedom - Perception of freedom quantified
Generosity - Numerical value estimated based on the perception of Generosity experienced by poll takers in their country.
Trust/Government Corruption - A quantification of the people's perceived trust in their governments.
Dystopia Score - Score based on a comparison with a hypothetical saddest country in the world.
Dystopia Residual - The Dystopia score plus the part of a country's happiness score that the six factors above leave unexplained
The Happiness Score calculated in the report is actually an average of the responses to the main life evaluation question asked in the Gallup World Poll (GWP), which uses the Cantril Ladder. On the Cantril Ladder, respondents are asked to picture the best possible life they can imagine and, with that as the benchmark, to score their current life.
Given the data available per country to gauge the Happiness Index, our aim is to:
Part A - Analyze and understand which factors affect the Happiness Index Score of countries
Part B - Analyze and understand the relationship between Terror Attacks and Happiness Index
Part C - Create a Model to predict the Happiness Index of a Country
Part D - See how much Health contributes to the Happiness Index and, with the current pandemic at hand, predict COVID-19 cases in the coming days for countries
Part E - Create a Dashboard for viewing COVID-19 Predictions
To analyze and understand which factors affect the Happiness Index Score of countries
Exploratory Data Analysis
Our objective here is to look through the datasets and perform some basic analysis to gain insights.
A look into Correlation
Spearman's rank correlation coefficient is used to measure the strength of the link between two sets of data.
The Spearman rank correlation coefficient, ρ, considers the ranks of the values for the two variables. ρ is always a value between -1 and 1.
The further away ρ is from zero, the stronger the relationship between the two variables. The sign of ρ corresponds to the direction of the relationship. If it is positive, then as one variable increases, the other tends to increase. If it is negative, then as one variable increases, the other tends to decrease.
Spearman's correlation is useful when the data have a non-linear relationship (like an exponential relationship) or contain one or more outliers. However, it is only appropriate if the relationship between the variables is monotonic.
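As a minimal sketch of the idea (the data and helper names here are hypothetical and not taken from our notebook), Spearman's ρ can be computed as the Pearson correlation of the ranks. On an exponential relationship, which is monotonic but non-linear, ρ reaches 1 while Pearson's r stays below it:

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson's r: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y grows exponentially with x (monotonic, non-linear).
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]

rho = spearman(x, y)
r = pearson(x, y)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

Because only the ordering of the values matters, a single extreme value changes ρ far less than it changes r.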
Inference: From the above matrices, Health, GDP per capita, and Freedom appear to be the top three factors that correlate with the happiness index.
Univariate analysis uses a single variable. It does not deal with causes or relationships; its main purpose is to describe the data and find patterns that exist within it.
Bivariate analysis involves two different variables. It deals with causes and relationships, and it is done to find out the relationship between the two variables.
From the above plot, we can infer that there seems to be a:
Linear relationship between happiness score and GDP per capita, happiness score and health, and happiness score and freedom
Non-linear relationship between happiness score and generosity, and happiness score and government trust
Performing ANOVA test between predictors and response variable to gauge how significantly it affects the scoring
Analysis of Variance (ANOVA) is a statistical method used to check whether the means of two or more groups differ significantly from each other. Its hypotheses are: H0: The means of all groups are equal. H1: At least one group mean is different.
If the distributions overlap or are close together, the grand mean will be similar to the individual means, whereas if the distributions are far apart, the grand mean and the individual means differ by a larger distance.
In ANOVA, we compare between-group variability to within-group variability using an F-test.
If there is no significant difference between the groups, the F-ratio from ANOVA will be close to 1.
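The F-ratio described above can be sketched in a few lines. This is an illustration only: the group values are hypothetical happiness scores, and the function name is ours rather than a library call.

```python
def one_way_anova_f(groups):
    """Return the F-ratio: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group variability: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variability: how far each value sits from its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical scores for three groups of countries.
overlapping = [[5.0, 5.2, 4.8, 5.1], [5.1, 4.9, 5.0, 5.2], [4.9, 5.1, 5.0, 5.0]]
separated = [[3.0, 3.2, 2.8, 3.1], [5.1, 4.9, 5.0, 5.2], [6.9, 7.1, 7.0, 7.0]]

f_overlapping = one_way_anova_f(overlapping)
f_separated = one_way_anova_f(separated)
print(f"overlapping groups: F = {f_overlapping:.2f}")
print(f"separated groups:   F = {f_separated:.2f}")
```

When the group means are nearly identical, the F-ratio stays small. When the means are far apart relative to the spread inside each group, it becomes very large, which is the evidence against H0.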
The best predictors of Happiness Index are: GDP per capita, government trust, health, and family
Two of the predictors identified by the ANOVA test, GDP per capita and health, also appeared in our correlation inference. Apart from that, it seems that government trust and family also play quite a significant role in determining the happiness score.
Looking at all countries and their ranks in Happiness Index Score
Inference: Norway is the top country in the Happiness Index. This is not surprising, since European countries tend to have better living conditions.
Happiness with regards to Generosity and Economy
Inference: The bubbles farther to the right are mostly countries in Europe, which clearly have better GDP per capita. Surprisingly, European countries score average on Generosity (Asian countries have the highest generosity) but hold the highest Happiness Score rankings.
Happiness with regards to Health and Economy
Inference: The bubbles farther to the right are mostly countries in Europe. Clearly, they have better health scores as well, since they sit near the top of the plot. The lowest health scores mostly occur in African and Asian countries.
Happiness with regards to Family and Economy
Inference: The bubbles farther to the right are mostly countries in Europe. Clearly, they have better Family ratings. The lowest family satisfaction rankings belong to a mixture of mostly African, South American, and Asian countries, plus a few European and North American countries.
Happiness with regards to Govt Trust and Economy
Most countries rank low on government trust, which shows that most of the world's population doesn't necessarily trust its governments despite the overarching push for democracy to be adopted. Countries that place a lot of trust in their government include Rwanda and more expected countries such as Singapore, New Zealand, and Finland.
World-wide View of Countries with regards to Generosity
Trend of Happiness Over Time
From the chart, we can see that Europe has a high GDP per capita score compared to other continents. Countries in the Australian region contribute the least to global GDP.
To analyze and understand the relationship between Terror Attacks and Happiness Index
One of the things that intrigued us is terrorism across the world. With wars and conflicts happening on a day-to-day basis, we wanted to understand to what extent terrorism plays a role in the happiness index. For this, we combined two datasets: the Happiness datasets and the World Terrorism dataset from the Global Terrorism Database (GTD). From the terrorism data we took only the count of terror attacks and no other information, such as text-based data describing the context of what happened or the names of the weapons used, since that would delve into NLP. Our future scope of work includes using NLP to analyze these datasets in order to better gauge the relationship between happiness and terrorism.
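A minimal sketch of this combination step is shown here, using only the Python standard library. The rows, country names, and field names (`country`, `year`, `happiness_score`, `attack_count`) are hypothetical placeholders, not the real schema of either dataset:

```python
from collections import Counter

# Hypothetical terrorism rows: one row per recorded attack.
attacks = [
    {"country": "Country A", "year": 2016},
    {"country": "Country A", "year": 2016},
    {"country": "Country B", "year": 2016},
    {"country": "Country A", "year": 2017},
]

# Hypothetical happiness rows: one row per country and year.
happiness = [
    {"country": "Country A", "year": 2016, "happiness_score": 4.5},
    {"country": "Country B", "year": 2016, "happiness_score": 6.1},
    {"country": "Country C", "year": 2016, "happiness_score": 7.2},
]

# Keep only the number of attacks per (country, year) and drop everything else.
attack_counts = Counter((row["country"], row["year"]) for row in attacks)

# Left-join onto the happiness rows; countries with no recorded attacks get 0.
combined = [
    {**row, "attack_count": attack_counts[(row["country"], row["year"])]}
    for row in happiness
]

for row in combined:
    print(row)
```

Joining on both country and year keeps each attack count aligned with the happiness score of the same year.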
Processing the Datasets
Now that we have performed EDA on the Happiness Index, we wondered: what about terror attacks? Clearly, the factors mentioned above are not sufficient to explain true happiness. So we decided to see how terror attacks combine with the happiness index to answer the question: is there a correlation?
Note: the cells below take time to execute because the dataset is large.
Exploratory Data Analysis on Combining Dataset with Terrorism
We can see that there are some countries which face a lot of terrorist attacks.
There seems to be a:
Linear relationship between happiness score and GDP per capita, happiness score and health, and happiness score and freedom
Non-linear relationship between happiness score and generosity, and happiness score and government trust
With the data that we have, there doesn't seem to be much correlation between terror attacks and happiness index. We would need more data to come to a significant conclusion as to how terrorism really affects the happiness index. Perhaps another factor that would allow us to further understand the happiness index would be war conditions. Countries like Syria and Palestine are in critical war zones which would make their living conditions poor and hence affect the happiness index.
To create a Model to Predict Happiness Index.
Predicting Happiness Index
We used Lasso regression on polynomial features of degree 6 (polynomial Lasso regression) to predict the Happiness Score.
Our MSE value for Lasso Regression is 0.25 and our R² score is 0.82, which is pretty satisfactory.
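For reference, the two metrics quoted above can be computed as follows. This is a sketch on hypothetical values, not our actual predictions, and the function names are illustrative:

```python
def mse(y_true, y_pred):
    """Mean squared error: the average squared gap between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """R squared: 1 minus the residual sum of squares over the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical happiness scores and model predictions.
y_true = [3.0, 4.0, 5.0, 6.0, 7.0]
y_pred = [3.5, 4.0, 4.5, 6.0, 7.5]

print(f"MSE = {mse(y_true, y_pred):.3f}")
print(f"R2  = {r2_score(y_true, y_pred):.3f}")
```

An R² of 1 means perfect predictions, while a model that always predicts the mean of the observed scores gets an R² of 0.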
Why did we use Lasso Regression?
We understood that Lasso tends to do well if there are a small number of significant parameters and the others are close to zero (i.e., when only a few predictors actually influence the response). This was our case: since most of the parameters were relatively small, this seemed like a good approach. Ridge works well if there are many large parameters of about the same value (i.e., when most predictors impact the response).
Lasso, or Least Absolute Shrinkage and Selection Operator, is conceptually quite similar to ridge regression. It adds a penalty for non-zero coefficients. However, unlike ridge regression which penalizes sum of squared coefficients (the so-called L2 penalty), lasso penalizes the sum of their absolute values (L1 penalty). As a result, for high values of λ, many coefficients are exactly zeroed under lasso.
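The difference between the two penalties can be sketched for a single coefficient. The sketch assumes orthonormal predictors and the objectives ½·RSS + λ·|β| for lasso and ½·RSS + (λ/2)·β² for ridge; under those assumptions both solutions have a closed form in terms of the least-squares coefficient b. The coefficient values and function names below are hypothetical:

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: move b toward zero by lam, and clip to exactly zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_shrink(b, lam):
    """Ridge rescales b toward zero but never reaches zero for nonzero b."""
    return b / (1.0 + lam)

# Hypothetical least-squares coefficients: two large, two small.
ols_coefficients = [2.0, -1.5, 0.3, -0.1]
lam = 0.5

lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge = [ridge_shrink(b, lam) for b in ols_coefficients]

print("lasso:", lasso)
print("ridge:", [round(b, 3) for b in ridge])
```

The small coefficients are set exactly to zero by the L1 penalty, whereas the L2 penalty only scales every coefficient down. This is why lasso also acts as a variable selector.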
What did we do in MLP Regressor?
We chose multiple layers here to capture non-linearity in the model. Multiple layers with non-linear activations allow the model to fit non-linear relationships, but an excessive number of layers may lead to overfitting.
After experimenting with multiple combinations of layers and neurons, three layers with the depicted neuron counts turned out to be suitable for our model.
Also, we used the default activation function, ReLU, because ours is a regression problem and ReLU suited it well.
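To illustrate where the non-linearity comes from, here is a toy forward pass through one hidden layer with two units. The weights are hypothetical, and this is not our MLP Regressor. With ReLU the network computes |x|, while with an identity activation the same weights collapse to a linear function:

```python
def relu(z):
    """Rectified linear unit: pass positive values through, clip negatives to zero."""
    return max(0.0, z)

def identity(z):
    """No activation at all, so the layer stays purely linear."""
    return z

def forward(x, activation):
    """One hidden layer with two units, followed by a linear output unit."""
    hidden_weights = [1.0, -1.0]  # hypothetical weights, biases omitted
    output_weights = [1.0, 1.0]
    hidden = [activation(w * x) for w in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, forward(x, relu), forward(x, identity))
```

Without a non-linear activation, stacking more layers still yields a linear model, so the activation function is what lets additional layers add expressive power.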
Our MSE value for MLP Regressor is 0.26 and our R² score is 0.82, which is pretty much the same as Lasso Regression.
Predicting Terrorist attacks
We also experimented with the variables we had from the happiness dataset to see if we could satisfactorily predict the number of terrorist attacks likely to happen.
Of course the model does not have the best performance because we understand that there are more factors that affect the outcome.
Our future work here is to gather more external factors relating to what sparks terror attacks and create a model that allows for better risk handling.
Clearly our model is not performing well here.
To see how much health contributes to the Happiness Index. With the pandemic at hand, predicting COVID-19 cases in the coming days for countries.
From Part A, we realized that Health plays a major role in a country's happiness score. With the pandemic at hand, we were motivated to look at COVID-19 cases and forecast the upcoming cases. We wanted to compare the COVID-19 data with the happiness index data. However, we felt that this would not give the right results, since the 2020 happiness index data is from January and February, when COVID-19 was not yet the health crisis that it is now.
However, in pursuit of excitement and interest, we decided to go ahead and build a basic forecasting model on the COVID-19 dataset using fbprophet.
What and Why Prophet?
Prophet is Facebook's open-source time series forecasting library. It decomposes a time series into trend, seasonality, and holiday components, and it has intuitive hyperparameters that are easy to tune.
Prophet time series = Trend + Seasonality + Holiday + error
Trend models non-periodic changes in the value of the time series. Seasonality captures periodic changes, such as daily, weekly, or yearly seasonality. The holiday effect occurs on irregular schedules over a day or a period of days. The error term is what is not explained by the model. We believe that the advantages of using Prophet are:
It accommodates seasonality with multiple periods
Prophet is resilient to missing values
Outliers can be handled by simply removing them, since Prophet copes with the resulting missing values
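The additive structure "Trend + Seasonality + Holiday + error" can be illustrated with a toy series. This sketch does not use Prophet or reproduce how it fits its components; every function, constant, and day index below is a hypothetical stand-in:

```python
import math

def trend(t):
    """Non-periodic growth: a hypothetical straight line."""
    return 100.0 + 5.0 * t

def seasonality(t):
    """Weekly cycle: the pattern repeats every 7 days."""
    return 10.0 * math.sin(2 * math.pi * t / 7)

HOLIDAYS = {10, 11}  # hypothetical day indices with an irregular effect

def holiday(t):
    """A one-off dip on the holiday days, zero everywhere else."""
    return -20.0 if t in HOLIDAYS else 0.0

def additive_series(days):
    """Sum the components; the error term is omitted to keep the toy deterministic."""
    return [trend(t) + seasonality(t) + holiday(t) for t in range(days)]

series = additive_series(14)
for t, value in enumerate(series):
    print(t, round(value, 1))
```

Forecasting with such a model amounts to estimating each component from past data and then extending the sum into future dates.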
Our main motivation here was to be able to learn how to best provide the model outcomes to the audience.
You can see the code in our file under the name: Covid-pred
The factors used for calculating the Happiness Index of the countries are not holistic and inclusive; other factors also need to be considered. GDP per capita is itself a skewed figure, and its limitations are highly likely to bias the happiness score.
We did not find much correlation between the number of terror attacks and happiness index of a country. However, we believe we need to consider more factors and influences pertaining to terrorism for us to properly see the relationship.
For COVID-19 forecasts, we performed univariate analysis on our historical data, which made us realize that historical data alone might not be sufficient for the prediction. Still, it is certainly one of the main predictors, and it can be used together with other predictors to create a more powerful model.
Improvement: Figure out another way to calculate the Happiness Index of a country which includes more holistic and inclusive factors
Based on our observations, we believe that factors apart from the 6 that were selected need to be considered in order to make the happiness index scoring accurate. A possible improvement would be to research an alternative way to calculate the index without using GDP per capita as a score.
Improvement: Move to using NLP & Decision Trees for analyzing Terrorism Data
Most of the factors in the Terrorism Dataset were text based. Hence, using NLP here will be best for us to understand the influences of the predictor on the response. To improve model prediction, we believe models pertaining to Decision Trees will help.
Improvement: Move to Multivariate Analysis
We forecasted COVID-19 cases using only past data. However, we are aware that historical data alone is not enough to make accurate forecasts. There are many other external factors – our intention was to more or less look at the trend and observe how this trend will move in the future.