Prior knowledge of programming and statistics is not required. This is only an introduction to Advanced Data Analysis, and we will continue every Saturday with more advanced tasks. The main goal is to develop an interest in mathematical programming among students from Africa. However, anyone who is interested is welcome to join and learn. This class will be a good introduction to data analysis with advanced statistical software like R.
For general information about the class, please visit the link below:
https://piazza.com/clements_school_on_data_analysis/fall2020/ss401/info
For the personal profile of the instructor/tutor, click below:
https://f36d003f-d71d-49cc-9fea-cbb325732ca8.filesusr.com/ugd/779116_8b81a5793af94ccdadec45b62b49cfcd.pdf
Personal website: https://twumasiclement.wixsite.com/website
Below is a summary of the content of this YouTube video (third lesson):
DISCLAIMER: This is a continuation of the previous YouTube videos, so watch them before this one. This video wraps up everything we have done in our previous meetings before we start the Inferential Statistical Analyses (i.e. statistical tests and models) next week. The script (and a Jupyter HTML file) for this video has been uploaded to the Piazza platform so that you can practice with it at your own pace afterwards. The primary goal of this third meeting/YouTube video is to solve some assignment tasks that consolidate what has been learnt so far. Below are the assignment tasks (from Saturday's class dated October 31, 2020):
1. Assign names to the levels of the categorical variable Residence, where 1 means South, 2 means North, 3 means East and 4 means West.
2. Create a new numerical variable called Total_points, obtained by adding the Enrolment_Points and Mathematics points for each student, and add this new variable (Total_points) to the original data.
3. Convert the Mathematics points to a categorical variable called Mathematics_grade with categories: i) Low class if the Mathematics point is less than 40%, ii) Medium class if the Mathematics point is from 40% to 70% and iii) High class if the Mathematics point is greater than 70%.
4. Save the new data as a CSV file directly into your working directory/DataAnalysis_results_R folder.
5. Create your own function to compute descriptive/summary statistics (just update what I already created in the second meeting dated October 24, 2020). The summary statistics function should: i) First determine the type of the variable. ii) If it is numeric, find and return the mean, median, mode, variance, standard deviation, maximum value, minimum value, standard error, skewness, kurtosis and 95% quantile, rounded to 2 decimal places, as well as a histogram of the numeric variable coloured “green”. iii) Else, if it is categorical, find and return the percentages for all categories/levels (to 1 decimal place) together with the names of the categories as a data frame, and plot a pie chart showing the percentage for each category of the variable in a different colour.
6. For each variable in the data created above, find the summary statistics.
7. Depict at least 8 figures/plots (it can be more than 8) using only the ggplot2 package to describe the data as well as you can as a Data Scientist (examples of plots are line graphs, scatter plots, bar plots, boxplots, histograms of numeric variables across different categories, bar charts of categorical variables distinguishing between the categories of other variables, etc.). Figures should be neat and informative, as expected of a professional Data Scientist.
8. Produce 5 different plots that also summarize the data as well as you can without using the ggplot2 package.
9. Produce correlation matrix plots for all the numerical variables (Enrolment_Points, Mathematics points and Total_points) using the "PerformanceAnalytics" R package.
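As a starting point for tasks 1 to 4, here is a minimal sketch in base R. It is not the script uploaded on Piazza. The toy data values, the data frame name `students`, the column name `Mathematics` and the file name `students_updated.csv` are all assumptions made for illustration; replace them with the names used in the class data.

```r
# Toy data standing in for the class data set (values are made up for illustration).
students <- data.frame(
  Residence        = c(1, 2, 3, 4, 2, 1),
  Enrolment_Points = c(55, 62, 48, 71, 39, 80),
  Mathematics      = c(35, 40, 70, 85, 66, 72)  # assumed column name
)

# Task 1: give names to the numeric Residence codes.
students$Residence <- factor(
  students$Residence,
  levels = 1:4,
  labels = c("South", "North", "East", "West")
)

# Task 2: add the total points for each student to the data.
students$Total_points <- students$Enrolment_Points + students$Mathematics

# Task 3: Low if below 40, Medium if from 40 to 70 (inclusive), High if above 70.
students$Mathematics_grade <- factor(
  ifelse(students$Mathematics < 40, "Low",
         ifelse(students$Mathematics <= 70, "Medium", "High")),
  levels = c("Low", "Medium", "High")
)

# Task 4: save the new data as a CSV file in the results folder
# inside the current working directory (the file name is hypothetical).
out_dir <- "DataAnalysis_results_R"
dir.create(out_dir, showWarnings = FALSE)
write.csv(students, file.path(out_dir, "students_updated.csv"), row.names = FALSE)

print(students)
```

Nested `ifelse()` is used here instead of `cut()` so that both boundary values (40 and 70) fall in the Medium class, exactly as the task states.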