1 PISA Data

The 2022 PISA Data is available on the OECD website.

2 Getting Started

Library	Description
tidyverse, janitor	For data preparation, wrangling, and exploration.
haven	To enable R to read and write various data formats such as SAS and SPSS.
knitr, DT, kableExtra	For dynamic report generation.
labelled	For reading and manipulating variable labels.
gtsummary	For summary and analytical tables.

The following code chunk uses p_load() of the pacman package to check whether the listed packages are installed on the computer. Any that are missing are installed first, and all of them are then loaded into R.

pacman::p_load(tidyverse, haven, knitr, DT,
               labelled, janitor, gtsummary)

3 Reading Data into R

For PISA 2022, SAS data sets (.sas7bdat) are available for each respondent type, with all countries combined in a single file.

The code chunk below imports the 2022 Student Questionnaire dataset, downloaded from OECD’s PISA Database, using read_sas() from the haven package.

stu <- read_sas("data/cy08msp_stu_qqq.sas7bdat")

The dataset is in a tibble dataframe, containing 613,744 observations (rows) across 1,279 variables (columns). Each observation corresponds to an entry from a student who participated in the 2022 PISA survey for students, and the variables correspond to information from students on various aspects of their home, family, and school background.

CNT holds the country code of each response, so we can use it to keep only the Singapore responses (CNT == "SGP") for our analysis. filter() of the dplyr package performs this extraction.

stu_SG <- stu %>%
  filter(CNT == "SGP")

The resulting data contains 6,606 rows/observations across 1,279 columns/variables.
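
As a minimal sketch of how this row filter behaves, the example below applies the same filter() call to a made-up four-row table. The toy_stu and toy_SG names and all values are hypothetical and are not taken from the PISA data.

```r
library(dplyr)

# Hypothetical miniature of the student table (made-up values)
toy_stu <- tibble(
  CNT   = c("SGP", "JPN", "SGP", "FIN"),
  score = c(575, 536, 560, 484)
)

# Keep only the rows whose country code equals "SGP"
toy_SG <- toy_stu %>%
  filter(CNT == "SGP")

nrow(toy_SG)  # 2 rows remain
```

filter() keeps all columns and only changes the number of rows, which is why the Singapore subset still has the same 1,279 variables as the full file.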

The .rds file format is usually smaller than its SAS counterpart and therefore takes up less storage space. It also preserves data types and classes such as factors and dates, eliminating the need to redefine them after loading the file. For fast, space-efficient storage, data can be exported to RDS with write_rds() and re-imported into R with read_rds().

write_rds(stu_SG, "data/stu_SG.rds")

stu_SG <- read_rds("data/stu_SG.rds")
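
The claim that RDS preserves column classes can be checked with a small round trip. This is only a sketch on a made-up data frame written to a temporary file; toy, path, and toy_back are hypothetical names.

```r
library(readr)

# Toy data frame with a factor and a date column (made-up values)
toy <- data.frame(
  level = factor(c("low", "high", "low"), levels = c("low", "high")),
  day   = as.Date(c("2022-01-01", "2022-01-02", "2022-01-03"))
)

# Write to a temporary .rds file and read it back
path <- tempfile(fileext = ".rds")
write_rds(toy, path)
toy_back <- read_rds(path)

# The factor levels and the Date class survive the round trip
round_trip_ok <- identical(toy, toy_back)
class(toy_back$level)
class(toy_back$day)
```

A CSV round trip of the same data frame would return both columns as plain text unless the column types are specified again on import.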

4 Data Wrangling

The chart below provides an overview of the categories the team focuses on to understand their impact on student scores.

flowchart TD

    A[2022 PISA Survey Student Questionnaire]-.-> A11[Gender]
    A -.-> A12[Socio-economic]
    A -.-> A13[Wellbeing]
    A -.-> A14[Attitude]   
    A -.-> A15[Environment] 
    A -.-> A16[Schools]

4.1 Filtering for required dataset

After perusing the Codebook and Technical Report, the team narrowed down the survey questions that would yield insightful results. The names of the corresponding columns are stored in a vector named colname. To subset the raw dataset to these columns, we use select() of the dplyr package together with all_of(), which picks every variable listed in the colname vector.

colname <- c("CNTSCHID", "ST034Q06TA", "ST265Q03JA", "ST270Q03JA", "ST004D01T", "ST296Q01JA", "ST296Q02JA", "ST296Q03JA", "STRATUM", "HISCED", "IMMIG", "ST022Q01TA", "ST230Q01JA", "ST250D06JA", "ST250D07JA", "ST251Q01JA", "ST255Q01JA", "EXERPRAC", "ST250Q01JA", "WORKHOME", "ST268Q01JA", "ST268Q02JA", "ST268Q03JA")

The following code chunk does three things:

- select() retains the following columns:
  - the variables listed in colname, and
  - columns that start with “PV” and contain “MATH”, “SCIE”, or “READ”, which hold the plausible values of the Mathematics, Science, and Reading scores. This is done by combining starts_with() and contains():
    - starts_with() matches column names beginning with “PV”, and
    - contains() matches column names containing any of the three subject strings.
- mutate() creates three new variables that store each student's mean plausible value per subject, using rowMeans() and across().
- A final select() drops the individual plausible-value columns.

stu_SG_filtered <- 
  stu_SG %>% 

  # Retains desired variables
  select(all_of(colname), starts_with("PV") & contains(c("MATH", "READ", "SCIE"))) %>% 

  # Calculates the mean of plausible values for each subject per student
  mutate(Math = rowMeans(across(starts_with("PV") & contains("MATH")), na.rm = TRUE),
         Reading = rowMeans(across(starts_with("PV") & contains("READ")), na.rm = TRUE),
         Science = rowMeans(across(starts_with("PV") & contains("SCIE")), na.rm = TRUE),
         ) %>% 
  
  # Drops Plausible Values columns
  select(-starts_with("PV"))

stu_SG_filtered contains 6,606 observations across 26 variables, since no rows have been removed at this stage.
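
To make the column selection and row-wise averaging concrete, here is a minimal sketch on a made-up two-student table with two plausible values per subject. The column PVOTHER and all values are hypothetical; the sketch only illustrates how starts_with(), contains(), across(), and rowMeans() combine.

```r
library(dplyr)

# Made-up table: two students, two plausible values per subject
toy <- tibble(
  CNTSCHID = c(1, 2),
  PV1MATH = c(500, 600), PV2MATH = c(520, 580),
  PV1READ = c(480, 610), PV2READ = c(500, 590),
  PVOTHER = c(1, 2)  # starts with "PV" but names no subject, so it is not selected
)

keep <- c("CNTSCHID")

toy_scores <- toy %>%
  # Keep the listed variables plus PV columns that name a subject
  select(all_of(keep), starts_with("PV") & contains(c("MATH", "READ"))) %>%
  # Row-wise mean of the plausible values for each subject
  mutate(
    Math    = rowMeans(across(starts_with("PV") & contains("MATH")), na.rm = TRUE),
    Reading = rowMeans(across(starts_with("PV") & contains("READ")), na.rm = TRUE)
  ) %>%
  # Drop the individual plausible-value columns
  select(-starts_with("PV"))

toy_scores$Math     # 510 590
toy_scores$Reading  # 490 600
```

Passing a character vector to contains() matches a column if it contains any of the strings, and the & operator restricts that match to columns that also start with “PV”.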

stu_SG_filtered %>%  
  generate_dictionary() %>% 
  kable() %>% 
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"),
                            fixed_thead = TRUE)

pos	variable	label	col_type	missing	levels	value_labels
1	CNTSCHID	Intl. School ID	dbl	0	NULL	NULL
2	ST034Q06TA	Agree/disagree: I feel lonely at school.	dbl	1147	NULL	NULL
3	ST265Q03JA	Agree/disagree: I feel safe in my classrooms at school.	dbl	44	NULL	NULL
4	ST270Q03JA	How often: The teacher helps students with their learning.	dbl	68	NULL	NULL
5	ST004D01T	Student (Standardized) Gender	dbl	0	NULL	NULL
6	ST296Q01JA	How much time spent on homework in: Mathematics homework	dbl	70	NULL	NULL
7	ST296Q02JA	How much time spent on homework in: [Test language] homework	dbl	77	NULL	NULL
8	ST296Q03JA	How much time spent on homework in: [Science] homework	dbl	87	NULL	NULL
9	STRATUM	Stratum ID 5-character (cnt + original stratum ID)	chr	0	NULL	NULL
10	HISCED	Highest level of education of parents (ISCED)	dbl	57	NULL	NULL
11	IMMIG	Index on immigrant background (OECD definition)	dbl	236	NULL	NULL
12	ST022Q01TA	What language do you speak at home most of the time?	dbl	42	NULL	NULL
13	ST230Q01JA	How many siblings (including brothers, sisters, step-brothers, and step-sisters) do you have?	dbl	43	NULL	NULL
14	ST250D06JA	Which of the following are in your home? <Country-specific item 1>	chr	0	NULL	NULL
15	ST250D07JA	Which of the following are in your home? <Country-specific item 2>	chr	0	NULL	NULL
16	ST251Q01JA	How many of these items are there at your [home]: Cars, vans, or trucks	dbl	47	NULL	NULL
17	ST255Q01JA	How many books are there in your [home]?	dbl	44	NULL	NULL
18	EXERPRAC	Exercise or practice a sport before or after school	dbl	47	NULL	NULL
19	ST250Q01JA	Which of the following are in your [home]: A room of your own	dbl	66	NULL	NULL
20	WORKHOME	Working in household/take care of family members before or after school	dbl	51	NULL	NULL
21	ST268Q01JA	Agree/disagree: Mathematics is one of my favourite subjects.	dbl	69	NULL	NULL
22	ST268Q02JA	Agree/disagree: [Test language] is one of my favourite subjects.	dbl	74	NULL	NULL
23	ST268Q03JA	Agree/disagree: [Science] is one of my favourite subjects.	dbl	79	NULL	NULL
24	Math	NA	dbl	0	NULL	NULL
25	Reading	NA	dbl	0	NULL	NULL
26	Science	NA	dbl	0	NULL	NULL

4.2 Renaming Columns

The variable codes are replaced with descriptive names using rename() of the dplyr package.

stu_SG_filtered <-
  stu_SG_filtered %>% 
  dplyr::rename(
    "SchoolID" = "CNTSCHID",    
    "Loneliness" = "ST034Q06TA",
    "ClassroomSafety" = "ST265Q03JA",
    "TeacherSupport" = "ST270Q03JA",
    "Gender" = "ST004D01T",
    "Homework_Math" = "ST296Q01JA",
    "Homework_Reading" = "ST296Q02JA",
    "Homework_Science" = "ST296Q03JA",
    "SchoolType" = "STRATUM",
    "ParentsEducation" = "HISCED",
    "Immigration" = "IMMIG",
    "HomeLanguage" = "ST022Q01TA",
    "Sibling" = "ST230Q01JA",
    "Aircon" = "ST250D06JA",
    "Helper" = "ST250D07JA",
    "Vehicle" = "ST251Q01JA",
    "Books" = "ST255Q01JA",
    "Exercise" = "EXERPRAC",
    "OwnRoom" = "ST250Q01JA",
    "FamilyCommitment" = "WORKHOME",
    "Preference_Math" = "ST268Q01JA",
    "Preference_Reading" = "ST268Q02JA",
    "Preference_Science" = "ST268Q03JA"
  )

Rows: 6,606
Columns: 26
$ SchoolID           <dbl> 70200052, 70200134, 70200112, 70200004, 70200152, 7…
$ Loneliness         <dbl> 3, 3, 3, NA, 4, 4, 3, 4, 3, 3, 3, 4, NA, NA, 3, 4, …
$ ClassroomSafety    <dbl> 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, …
$ TeacherSupport     <dbl> 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 3, 2, 1, 1, 2, …
$ Gender             <dbl> 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, …
$ Homework_Math      <dbl> 1, 3, 2, 3, 4, 1, 1, 2, 1, 3, 3, 4, 3, 1, 2, 1, 3, …
$ Homework_Reading   <dbl> 1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 3, 2, 1, 4, 1, 3, …
$ Homework_Science   <dbl> 2, 3, 3, 2, 4, 1, 1, 2, 1, 2, 3, 4, 3, 1, 3, 1, 3, …
$ SchoolType         <chr> "SGP01", "SGP01", "SGP01", "SGP01", "SGP01", "SGP01…
$ ParentsEducation   <dbl> 8, 7, 4, 6, 7, 9, 6, 9, 8, 8, 4, 9, 10, 9, 6, 9, 9,…
$ Immigration        <dbl> 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 1, 1, 1, 3, 1, 2, 3, …
$ HomeLanguage       <dbl> 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, …
$ Sibling            <dbl> 4, 4, 2, 4, 4, 3, 2, 2, 3, 4, 1, 3, 4, 1, 4, 3, 2, …
$ Aircon             <chr> "7020002", "7020001", "7020001", "7020002", "702000…
$ Helper             <chr> "7020002", "7020001", "7020002", "7020002", "702000…
$ Vehicle            <dbl> 2, 1, 2, 1, 2, 2, 2, 1, 3, 3, 1, 2, 2, 1, 2, 2, 1, …
$ Books              <dbl> 7, 4, 4, 3, 2, 2, 4, 5, 7, 4, 3, 7, 4, 4, 2, 4, 5, …
$ Exercise           <dbl> 1, 4, 2, 5, 9, 1, 2, 0, 3, 5, 1, 2, 5, 2, 4, 0, 2, …
$ OwnRoom            <dbl> 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, …
$ FamilyCommitment   <dbl> 10, 2, 0, 10, 5, 5, 7, 0, 0, 4, 2, 2, 10, 0, 10, 0,…
$ Preference_Math    <dbl> 2, 4, 3, 2, 3, 3, 4, 4, 2, 3, 3, 4, 3, 4, 1, 3, 2, …
$ Preference_Reading <dbl> 3, 3, 2, 3, 4, 3, 3, 2, 2, 2, 2, 3, 2, 2, 4, 3, 2, …
$ Preference_Science <dbl> 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 4, 2, 4, 2, 2, 2, …
$ Math               <dbl> 605.2533, 689.9528, 676.7768, 401.0528, 436.1151, 5…
$ Reading            <dbl> 667.4296, 627.6078, 582.9252, 361.3969, 475.6763, 4…
$ Science            <dbl> 639.7873, 672.0703, 660.0384, 343.6425, 479.2390, 4…

4.3 Dropping Invalid Responses

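Some responses in our data are marked as invalid or missing. In the Helper and Aircon variables, these are coded as “9999999”. In this step, we convert these codes to NA with na_if() and then drop incomplete rows with na.omit(). Note that na.omit() removes every row with a missing value in any of the 26 columns, not only in Helper and Aircon.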

stu_SG_filtered <- stu_SG_filtered %>% 
  mutate(Aircon = na_if(Aircon, "9999999"),
         Helper = na_if(Helper, "9999999")) %>% 
  na.omit()

stu_SG_filtered now contains 5,158 observations across 26 variables.
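
The effect of combining na_if() with na.omit() can be seen on a small made-up table. This is a sketch only; the Books column stands in for any other variable that happens to contain a missing value.

```r
library(dplyr)

# Made-up responses: "9999999" marks an invalid answer
toy <- tibble(
  Aircon = c("7020001", "9999999", "7020002", "7020001"),
  Helper = c("7020002", "7020001", "7020002", "9999999"),
  Books  = c(3, 4, NA, 2)
)

toy_clean <- toy %>%
  # Turn the invalid code into NA
  mutate(Aircon = na_if(Aircon, "9999999"),
         Helper = na_if(Helper, "9999999")) %>%
  # Drop every row that has an NA in any column
  na.omit()

nrow(toy_clean)  # 1: only the first row is complete
```

The third row is dropped because of the missing Books value even though its Aircon and Helper answers are valid. If only specific columns should trigger removal, drop_na() of the tidyr package accepts column names as an alternative.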

4.4 Recoding and Ranking Questionnaire Responses

The Student Questionnaire uses several types of response scales. We store the response levels for each question in a separate named vector and then combine these vectors into a global dictionary named dicts.

Books <- c('1' = "0",
               '2' = "1 - 10",
               '3' = "11 - 25",
               '4' = "26 - 100",
               '5' = "101 - 200",
               '6' = "201-500",
               '7' = ">500")

HomeLanguage <- c('1' = "English",
             '2' = "Others")

# Likert Scales: Strongly Disagree to Strongly Agree
Preference_Math <- c('1' = "Strongly Disagree",
           '2' = "Disagree",
           '3' = "Agree",
           '4' = "Strongly Agree")

Preference_Reading <- c('1' = "Strongly Disagree",
           '2' = "Disagree",
           '3' = "Agree",
           '4' = "Strongly Agree")

Preference_Science <- c('1' = "Strongly Disagree",
           '2' = "Disagree",
           '3' = "Agree",
           '4' = "Strongly Agree")

# Likert Scales: Strongly Agree to Strongly Disagree
Loneliness <- c('1' = "Strongly Agree",
           '2' = "Agree",
           '3' = "Disagree",
           '4' = "Strongly Disagree")

ClassroomSafety <- c('1' = "Strongly Agree",
           '2' = "Agree",
           '3' = "Disagree",
           '4' = "Strongly Disagree")

# Binary
SchoolType <- c('SGP01' = "Public",
           'SGP03' = "Private")

OwnRoom <- c('1' = "Yes", 
                '2' = "No")

Aircon <- c('7020001' = "Yes",
            '7020002' = "No")

Helper <- c('7020001' = "Yes",
            '7020002' = "No")

# Frequency responses
Exercise <- c('0' = "0",
          '1' = "1", 
          '2' = "2",
          '3' = "3",
          '4' = "4",
          '5' = "5",
          '6' = "6",
          '7' = "7",
          '8' = "8",
          '9' = "9",
          '10' = "10")

FamilyCommitment <- c('0' = "0",
          '1' = "1", 
          '2' = "2",
          '3' = "3",
          '4' = "4",
          '5' = "5",
          '6' = "6",
          '7' = "7",
          '8' = "8",
          '9' = "9",
          '10' = "10")

# Time Periods
Homework_Math <- c('1' = "≤ 0.5hr",
                '2' = "0.5hr - 1hr",
                '3' = "1hr - 2hr",
                '4' = "2hr - 3hr",
                '5' = "3 - 4 hr",
                '6' = "> 4hr")

Homework_Reading <- c('1' = "≤ 0.5hr",
                '2' = "0.5hr - 1hr",
                '3' = "1hr - 2hr",
                '4' = "2hr - 3hr",
                '5' = "3 - 4 hr",
                '6' = "> 4hr")

Homework_Science <- c('1' = "≤ 0.5hr",
                '2' = "0.5hr - 1hr",
                '3' = "1hr - 2hr",
                '4' = "2hr - 3hr",
                '5' = "3 - 4 hr",
                '6' = "> 4hr")

# Gender
Gender <- c('1' = "Female",
            '2' = "Male")


# Immigrant Background (note: the PISA codebook defines IMMIG code 3 as first-generation)
Immigration <- c('1' = "Native",
           '2' = "2nd Generation",
           '3' = "3rd Generation")

# Education Level
ParentsEducation <- c('1'="Pre-Primary",   
         '2'="Primary", 
         '3'="Secondary",
         '4'='Secondary',
         '6'="Post-Secondary",
         '7'="Post-Secondary",
         '8'="Tertiary",
         '9'="Tertiary",
         '10'="Tertiary")

# Possessions
Vehicle <- c('1' = "0",
            '2' = "1",
            '3' = "2",
            '4' = "≥3")

Sibling <- c('1' = "0",
            '2' = "1",
            '3' = "2",
            '4' = "≥3")

# Support
TeacherSupport <- c('1' = "Every lesson",
            '2' = "Most lesson",
            '3' = "Some lessons",
            '4' = "Never or almost never")

# Global Dictionary
dicts <- list(
  "Loneliness" = Loneliness,
  "ClassroomSafety" = ClassroomSafety,
  "TeacherSupport" = TeacherSupport,
  "Gender" = Gender,
  "Homework_Math" = Homework_Math,
  "Homework_Reading" = Homework_Reading,
  "Homework_Science" = Homework_Science,
  "SchoolType" = SchoolType,
  "ParentsEducation" = ParentsEducation,
  "Immigration" = Immigration,
  "HomeLanguage" = HomeLanguage,
  "Sibling" = Sibling,
  "Aircon" = Aircon,
  "Helper" = Helper,
  "Vehicle" = Vehicle,
  "Books" = Books,
  "Exercise" = Exercise,
  "OwnRoom" = OwnRoom,
  "FamilyCommitment" = FamilyCommitment,
  "Preference_Math" = Preference_Math,
  "Preference_Reading" = Preference_Reading,
  "Preference_Science" = Preference_Science
)

The helper function below recodes a column based on the global recode dictionary, dicts, using functions from base R, dplyr, and rlang:

- names(x) retrieves the column name of the one-column input dataframe.
- recode() of the dplyr package replaces the values in the column using the matching entry in dicts.
- !!sym(x_nm) converts the column name into a symbol and unquotes it so that it refers to the column, while !!!dicts[[x_nm]] unquotes and splices the recoding vector for that column into the arguments of recode().

rcd <- function(x) {
  x_nm <- names(x)
  mutate(x, !! x_nm := recode(!! sym(x_nm), !!! dicts[[x_nm]]))
}

lmap_at() of the purrr package applies the helper function to each column of the dataframe whose name matches a key in dicts, and leaves the other columns unchanged.

stu_SG_rcd <- lmap_at(stu_SG_filtered, 
        names(dicts),
        rcd)
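
The dictionary-driven recoding can be tried on a small made-up table. The sketch below mirrors the helper above with a two-entry dictionary; toy_dicts, rcd_toy, and the values are hypothetical and exist only for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical two-entry dictionary, keyed by column name
toy_dicts <- list(
  Gender  = c('1' = "Female", '2' = "Male"),
  OwnRoom = c('1' = "Yes", '2' = "No")
)

# Made-up data: two coded columns and one score column
toy <- tibble(
  Gender  = c(1, 2, 2),
  OwnRoom = c(2, 1, 1),
  Math    = c(510, 590, 640)
)

# Same pattern as rcd(): x arrives as a one-column data frame
rcd_toy <- function(x) {
  x_nm <- names(x)
  mutate(x, !!x_nm := recode(!!sym(x_nm), !!!toy_dicts[[x_nm]]))
}

# Recode only the columns named in the dictionary
toy_rcd <- lmap_at(toy, names(toy_dicts), rcd_toy)

toy_rcd$Gender   # "Female" "Male" "Male"
toy_rcd$OwnRoom  # "No" "Yes" "Yes"
```

Math is not a key in toy_dicts, so lmap_at() passes it through untouched. Splicing with !!! turns each named element of the vector into a separate old = new argument of recode().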

mutate() of the dplyr package and fct_relevel() of the forcats package are then used to convert the recoded columns to factors and to set the level order of the ordinal variables.

stu_SG_rcd <- stu_SG_rcd %>%
  mutate(across(where(is.character), as.factor)) %>% 
  mutate(SchoolID = factor(SchoolID)) %>% 
  mutate(Books = fct_relevel(Books, 
                             "0",
                             "1 - 10",
                             "11 - 25",
                             "26 - 100",
                             "101 - 200",
                             "201-500",
                             ">500"),
         Preference_Math = fct_relevel(Preference_Math,
                                       "Strongly Disagree",
                                       "Disagree",
                                       "Agree",
                                       "Strongly Agree"),
         Preference_Reading = fct_relevel(Preference_Reading,
                                          "Strongly Disagree",
                                          "Disagree",
                                          "Agree",
                                          "Strongly Agree"),
         Preference_Science = fct_relevel(Preference_Science,
                                          "Strongly Disagree",
                                          "Disagree",
                                          "Agree",
                                          "Strongly Agree"),
         Loneliness = fct_relevel(Loneliness,
                                  "Strongly Disagree",
                                  "Disagree",
                                  "Agree",
                                  "Strongly Agree"),
         ClassroomSafety = fct_relevel(ClassroomSafety,
                                       "Strongly Disagree",
                                       "Disagree",
                                       "Agree",
                                       "Strongly Agree"),
         Exercise = fct_relevel(Exercise,
                                "0",
                                "1", 
                                "2",
                                "3",
                                "4",
                                "5",
                                "6",
                                "7",
                                "8",
                                "9",
                                "10"),
         FamilyCommitment = fct_relevel(FamilyCommitment,
                                        "0",
                                        "1",
                                        "2",
                                        "3",
                                        "4",
                                        "5",
                                        "6",
                                        "7",
                                        "8",
                                        "9",
                                        "10"),
         Homework_Math = fct_relevel(Homework_Math,
                                     "≤ 0.5hr",
                                     "0.5hr - 1hr",
                                     "1hr - 2hr",
                                     "2hr - 3hr",
                                     "3 - 4 hr",
                                     "> 4hr"),
         Homework_Reading = fct_relevel(Homework_Reading,
                                        "≤ 0.5hr",
                                        "0.5hr - 1hr",
                                        "1hr - 2hr",
                                        "2hr - 3hr",
                                        "3 - 4 hr",
                                        "> 4hr"),
         Homework_Science = fct_relevel(Homework_Science,
                                        "≤ 0.5hr",
                                        "0.5hr - 1hr",
                                        "1hr - 2hr",
                                        "2hr - 3hr",
                                        "3 - 4 hr",
                                        "> 4hr"),
         Immigration = fct_relevel(Immigration,
                                   "Native",
                                   "2nd Generation",
                                   "3rd Generation"),
         ParentsEducation = fct_relevel(ParentsEducation,
                                        "Pre-Primary",
                                        "Primary", 
                                        "Secondary",
                                        "Post-Secondary",
                                        "Tertiary"),
         Vehicle = fct_relevel(Vehicle,
                               "0",
                               "1",
                               "2",
                               "≥3"),
         Sibling = fct_relevel(Sibling,
                               "0",
                               "1",
                               "2",
                               "≥3"),
         TeacherSupport = fct_relevel(TeacherSupport,
                                      "Never or almost never",
                                      "Some lessons",
                                      "Most lesson",
                                      "Every lesson"))

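A short sketch of why the releveling step matters: as.factor() orders levels alphabetically, which does not match the natural order of a Likert scale. The answers vector below is made up for illustration.

```r
library(forcats)

# Made-up Likert responses converted to a factor
answers <- factor(c("Agree", "Strongly Disagree", "Disagree",
                    "Strongly Agree", "Agree"))

# Default level order is alphabetical
default_levels <- levels(answers)

# Move the levels into their natural order
ordered_answers <- fct_relevel(answers,
                               "Strongly Disagree",
                               "Disagree",
                               "Agree",
                               "Strongly Agree")

levels(ordered_answers)
# "Strongly Disagree" "Disagree" "Agree" "Strongly Agree"
```

fct_relevel() only changes the order of the levels, which controls how categories are arranged in tables and plots. It does not change the underlying values and does not turn the column into an ordered factor.
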
**Table of Variable Summary**
Characteristic	N	N = 5,158¹
SchoolID	5,158
70200001		50 (1.0%)
70200002		29 (0.6%)
70200003		31 (0.6%)
70200004		41 (0.8%)
70200005		30 (0.6%)
70200006		25 (0.5%)
70200007		27 (0.5%)
70200008		30 (0.6%)
70200009		30 (0.6%)
70200010		30 (0.6%)
70200011		51 (1.0%)
70200012		41 (0.8%)
70200013		49 (0.9%)
70200014		30 (0.6%)
70200015		22 (0.4%)
70200016		30 (0.6%)
70200017		29 (0.6%)
70200018		41 (0.8%)
70200019		29 (0.6%)
70200020		48 (0.9%)
70200021		35 (0.7%)
70200022		26 (0.5%)
70200023		29 (0.6%)
70200024		27 (0.5%)
70200025		29 (0.6%)
70200026		45 (0.9%)
70200027		44 (0.9%)
70200028		20 (0.4%)
70200029		33 (0.6%)
70200030		26 (0.5%)
70200031		42 (0.8%)
70200032		31 (0.6%)
70200033		24 (0.5%)
70200034		26 (0.5%)
70200035		42 (0.8%)
70200036		26 (0.5%)
70200037		32 (0.6%)
70200038		32 (0.6%)
70200039		31 (0.6%)
70200040		44 (0.9%)
70200041		4 (<0.1%)
70200042		30 (0.6%)
70200043		40 (0.8%)
70200044		44 (0.9%)
70200045		47 (0.9%)
70200046		26 (0.5%)
70200047		32 (0.6%)
70200048		27 (0.5%)
70200049		43 (0.8%)
70200050		22 (0.4%)
70200051		29 (0.6%)
70200052		44 (0.9%)
70200053		35 (0.7%)
70200054		27 (0.5%)
70200055		28 (0.5%)
70200056		22 (0.4%)
70200057		20 (0.4%)
70200058		28 (0.5%)
70200059		33 (0.6%)
70200060		24 (0.5%)
70200061		30 (0.6%)
70200062		42 (0.8%)
70200063		32 (0.6%)
70200064		27 (0.5%)
70200065		29 (0.6%)
70200066		47 (0.9%)
70200067		41 (0.8%)
70200068		30 (0.6%)
70200069		28 (0.5%)
70200070		29 (0.6%)
70200071		48 (0.9%)
70200072		27 (0.5%)
70200073		30 (0.6%)
70200074		30 (0.6%)
70200075		49 (0.9%)
70200076		29 (0.6%)
70200077		30 (0.6%)
70200078		16 (0.3%)
70200079		20 (0.4%)
70200080		28 (0.5%)
70200081		29 (0.6%)
70200082		46 (0.9%)
70200083		28 (0.5%)
70200084		29 (0.6%)
70200085		28 (0.5%)
70200086		32 (0.6%)
70200087		28 (0.5%)
70200088		32 (0.6%)
70200089		29 (0.6%)
70200090		31 (0.6%)
70200091		26 (0.5%)
70200092		30 (0.6%)
70200093		33 (0.6%)
70200094		41 (0.8%)
70200095		21 (0.4%)
70200096		28 (0.5%)
70200097		28 (0.5%)
70200098		25 (0.5%)
70200099		23 (0.4%)
70200100		27 (0.5%)
70200101		31 (0.6%)
70200102		32 (0.6%)
70200103		24 (0.5%)
70200104		29 (0.6%)
70200105		45 (0.9%)
70200106		31 (0.6%)
70200107		28 (0.5%)
70200108		27 (0.5%)
70200109		27 (0.5%)
70200110		45 (0.9%)
70200111		47 (0.9%)
70200112		32 (0.6%)
70200113		28 (0.5%)
70200114		37 (0.7%)
70200115		14 (0.3%)
70200116		29 (0.6%)
70200117		29 (0.6%)
70200118		44 (0.9%)
70200119		40 (0.8%)
70200120		28 (0.5%)
70200121		23 (0.4%)
70200122		26 (0.5%)
70200123		28 (0.5%)
70200124		26 (0.5%)
70200125		30 (0.6%)
70200126		34 (0.7%)
70200127		29 (0.6%)
70200128		30 (0.6%)
70200129		28 (0.5%)
70200130		46 (0.9%)
70200131		26 (0.5%)
70200132		46 (0.9%)
70200133		28 (0.5%)
70200134		32 (0.6%)
70200135		31 (0.6%)
70200136		26 (0.5%)
70200137		31 (0.6%)
70200138		14 (0.3%)
70200139		46 (0.9%)
70200140		28 (0.5%)
70200141		35 (0.7%)
70200142		44 (0.9%)
70200143		32 (0.6%)
70200144		32 (0.6%)
70200145		43 (0.8%)
70200146		31 (0.6%)
70200147		20 (0.4%)
70200148		23 (0.4%)
70200149		18 (0.3%)
70200151		33 (0.6%)
70200152		25 (0.5%)
70200153		28 (0.5%)
70200154		30 (0.6%)
70200155		44 (0.9%)
70200156		28 (0.5%)
70200157		28 (0.5%)
70200158		24 (0.5%)
70200159		48 (0.9%)
70200160		26 (0.5%)
70200161		27 (0.5%)
70200162		34 (0.7%)
70200163		27 (0.5%)
70200164		16 (0.3%)
70200165		29 (0.6%)
Loneliness	5,158
Strongly Disagree		1,412 (27%)
Disagree		2,747 (53%)
Agree		783 (15%)
Strongly Agree		216 (4.2%)
ClassroomSafety	5,158
Strongly Disagree		72 (1.4%)
Disagree		149 (2.9%)
Agree		2,261 (44%)
Strongly Agree		2,676 (52%)
TeacherSupport	5,158
Never or almost never		80 (1.6%)
Some lessons		568 (11%)
Most lesson		1,799 (35%)
Every lesson		2,711 (53%)
Gender	5,158
Female		2,542 (49%)
Male		2,616 (51%)
Homework_Math	5,158
≤ 0.5hr		1,185 (23%)
0.5hr - 1hr		1,671 (32%)
1hr - 2hr		1,580 (31%)
2hr - 3hr		525 (10%)
3 - 4 hr		133 (2.6%)
> 4hr		64 (1.2%)
Homework_Reading	5,158
≤ 0.5hr		2,008 (39%)
0.5hr - 1hr		1,777 (34%)
1hr - 2hr		1,085 (21%)
2hr - 3hr		219 (4.2%)
3 - 4 hr		40 (0.8%)
> 4hr		29 (0.6%)
Homework_Science	5,158
≤ 0.5hr		1,150 (22%)
0.5hr - 1hr		1,564 (30%)
1hr - 2hr		1,684 (33%)
2hr - 3hr		578 (11%)
3 - 4 hr		128 (2.5%)
> 4hr		54 (1.0%)
SchoolType	5,158
Private		354 (6.9%)
Public		4,804 (93%)
ParentsEducation	5,158
Pre-Primary		7 (0.1%)
Primary		49 (0.9%)
Secondary		637 (12%)
Post-Secondary		1,559 (30%)
Tertiary		2,906 (56%)
Immigration	5,158
Native		3,742 (73%)
2nd Generation		573 (11%)
3rd Generation		843 (16%)
HomeLanguage	5,158
English		3,229 (63%)
Others		1,929 (37%)
Sibling	5,158
0		643 (12%)
1		2,397 (46%)
2		1,287 (25%)
≥3		831 (16%)
Aircon	5,158	4,543 (88%)
Helper	5,158	1,276 (25%)
Vehicle	5,158
0		2,033 (39%)
1		2,605 (51%)
2		423 (8.2%)
≥3		97 (1.9%)
Books	5,158
0		160 (3.1%)
1 - 10		727 (14%)
11 - 25		909 (18%)
26 - 100		1,880 (36%)
101 - 200		826 (16%)
201-500		482 (9.3%)
>500		174 (3.4%)
Exercise	5,158
0		1,340 (26%)
1		466 (9.0%)
2		795 (15%)
3		663 (13%)
4		507 (9.8%)
5		414 (8.0%)
6		312 (6.0%)
7		96 (1.9%)
8		150 (2.9%)
9		43 (0.8%)
10		372 (7.2%)
OwnRoom	5,158	3,214 (62%)
FamilyCommitment	5,158
0		1,917 (37%)
1		371 (7.2%)
2		529 (10%)
3		400 (7.8%)
4		319 (6.2%)
5		584 (11%)
6		198 (3.8%)
7		104 (2.0%)
8		140 (2.7%)
9		57 (1.1%)
10		539 (10%)
Preference_Math	5,158
Strongly Disagree		561 (11%)
Disagree		1,186 (23%)
Agree		2,010 (39%)
Strongly Agree		1,401 (27%)
Preference_Reading	5,158
Strongly Disagree		530 (10%)
Disagree		1,902 (37%)
Agree		2,121 (41%)
Strongly Agree		605 (12%)
Preference_Science	5,158
Strongly Disagree		384 (7.4%)
Disagree		1,184 (23%)
Agree		2,339 (45%)
Strongly Agree		1,251 (24%)
¹ n (%)

4.5 Data Health

get_dupes() of the janitor package is used to hunt for duplicate records. The results show that there are no duplicated rows.

get_dupes(stu_SG_rcd)

 [1] SchoolID           Loneliness         ClassroomSafety    TeacherSupport    
 [5] Gender             Homework_Math      Homework_Reading   Homework_Science  
 [9] SchoolType         ParentsEducation   Immigration        HomeLanguage      
[13] Sibling            Aircon             Helper             Vehicle           
[17] Books              Exercise           OwnRoom            FamilyCommitment  
[21] Preference_Math    Preference_Reading Preference_Science Math              
[25] Reading            Science            dupe_count        
<0 rows> (or 0-length row.names)
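
For comparison, the sketch below shows what get_dupes() returns when duplicates do exist. The toy data frame is made up, with its first two rows deliberately identical.

```r
library(janitor)

# Made-up data: rows 1 and 2 are exact duplicates
toy <- data.frame(
  SchoolID = c("A", "A", "B", "B"),
  Gender   = c("Female", "Female", "Male", "Female"),
  Math     = c(510, 510, 590, 640)
)

# With no columns specified, all columns are used to detect duplicates
dupes <- get_dupes(toy)

nrow(dupes)       # 2: both copies of the duplicated record are returned
dupes$dupe_count  # 2 2
```

An empty result with only the column names and dupe_count, as in the output above, therefore means that no row appears more than once.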

5 Our Final Dataset

The cleaned dataset is saved in both CSV and RDS formats using write_csv() and write_rds() of the readr package. colSums(is.na()) then confirms that no missing values remain in any column.

write_csv(stu_SG_rcd, "data/stu_SG_rcd.csv")
write_rds(stu_SG_rcd, "data/stu_SG_rcd.rds")

colSums(is.na(stu_SG_rcd))

          SchoolID         Loneliness    ClassroomSafety     TeacherSupport 
                 0                  0                  0                  0 
            Gender      Homework_Math   Homework_Reading   Homework_Science 
                 0                  0                  0                  0 
        SchoolType   ParentsEducation        Immigration       HomeLanguage 
                 0                  0                  0                  0 
           Sibling             Aircon             Helper            Vehicle 
                 0                  0                  0                  0 
             Books           Exercise            OwnRoom   FamilyCommitment 
                 0                  0                  0                  0 
   Preference_Math Preference_Reading Preference_Science               Math 
                 0                  0                  0                  0 
           Reading            Science 
                 0                  0
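
As a minimal illustration of the missing-value check, the base R sketch below runs the same colSums(is.na()) call on a made-up data frame that does contain missing values.

```r
# Made-up data frame with one NA in column a and one in column b
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA),
  c = c(TRUE, FALSE, TRUE)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
# a b c
# 1 1 0
```

A result of all zeros, as for stu_SG_rcd above, indicates that every column is complete.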