Intro to Conjoint Experiments | Lab - 1

1 Lab sessions: why and how?

  1. Apply theoretical knowledge
  2. Increase understanding by interacting with data
  3. Learn to use some packages in R
  4. How:
    • Relatively unstructured
    • Go at your own pace and try to do the exercises yourself (do yourself a favour and do not just copy, paste, and run the solutions)
    • “There is never time to do it right, but there is always time to do it over”

1.1 Software used: R

- This is not an R course!
- We will learn some R as we go along
- I will use RStudio
- Many packages or libraries exist to do specific analyses

2 Setting up R

2.1 Why R

  • Free and open source (think of science in developing countries)
  • Good online documentation (much better than some commercial software such as Mplus)
  • Lively community of users (forums etc.)
  • Visualization capabilities (ggplot …)
  • Cooperates with other programs and programming languages (e.g. Python)
  • Popularity (See popularity statistics on books, blogs, forums)
  • RStudio as powerful integrated development environment (IDE) for R
  • Is evolving into a scientific work suite that optimizes workflow (replication, reproducibility etc.)
  • Institutions/people (Gary King, Andrew Gelman etc.)

2.2 Where to learn R

  • If you haven’t used R so far, you will need to learn some of the basics.
  • Data Camp (free trial but it is commercial)
  • Try R: A short interactive intro to the language can be found here: http://tryr.codeschool.com/
  • Swirl: Learn R interactively within R itself: http://swirlstats.com/

2.3 Install R on your machine

Below are some notes on the installation and setup of R and relevant packages on your own computer:

  • Install Rtools for Windows machines from CRAN (https://cran.r-project.org/bin/windows/Rtools/).
  • If you are using OS X, you will need to install XCode, available for free from the App Store. This will install a compiler (if you don’t have one installed), which is needed when installing packages from GitHub that require compilation from C++ source code.
  • If you are using OS X, you can also use Homebrew to install R (https://mac.install.guide/homebrew/index.html).
  • Install the latest version of R from CRAN (https://cran.r-project.org/).
  • Install the latest version of RStudio (https://www.rstudio.com/products/RStudio/). RStudio is the editor, i.e. you’ll write code in RStudio which is subsequently sent to and run within R.
  • Install the latest versions of the packages that we need (listed under Environment preparation). You can also update your installed packages by running “update.packages(ask = FALSE)” in the R console.

3 What we are going to cover

  • Load data
    • From Qualtrics
  • Data exploration
    • Attributes
    • Levels
  • Checks and data manipulation
    • Validity checks
    • Removing NAs

4 Dataset used

The data set used throughout is the CEU Experimental Political Science data set from January 2019. We will restrict the analysis to a few specific variables. Each row in the data set represents a respondent recruited through Qualtrics from a representative panel of the US population.

Codebook:

  1. Q578 Q579 Q580: Choice CJ Task
  2. F-*-*: Conjoint features
  3. Q78: Employment status
  4. Q77: Race
  5. Q76: Education
  6. Q75: Gender
  7. Q74: Age
  8. Q581: Religiosity

5 Environment preparation

# ### Data import ###
# install.packages("readr")     # read datasets
# install.packages("qualtRics") # read qualtrics datasets
# install.packages("here")      # file paths relative to the project root
# ### Data manipulation ###
# install.packages("dplyr")     # pipes and data manipulation
# ### Visualization ###
# install.packages("ggplot2")    # graphing capabilities
# ### Estimation ###
# install.packages("cjoint")    # base amce package
# install.packages("cregg")     # amce and mm 
# install.packages("factorEx")  # amce with non-uniform distribution

## Custom build functions 
# library(devtools)
# install_github("albertostefanelli/cjoint") # fixes some problem with cjoint

### Data import ###
library("readr")     
library("qualtRics") 
library("here")
### Data manipulation ###
library("dplyr")     
### Visualization ###
library("ggplot2")    
### Estimation ###
library("cjoint")   
library("cregg")     
library("factorEx")  
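
Before calling library(), it can help to check which of these packages are actually installed. The following is a minimal sketch and not part of the original lab: find_missing() is a hypothetical helper name, and it relies only on base R's requireNamespace().

```r
# Hypothetical helper (not from the lab): return the packages in `pkgs`
# that cannot be loaded on this machine.
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# the packages loaded in this lab
lab_packages <- c("readr", "qualtRics", "here", "dplyr",
                  "ggplot2", "cjoint", "cregg", "factorEx")

missing_packages <- find_missing(lab_packages)
print(missing_packages)

# Uncomment to install whatever is missing from CRAN:
# if (length(missing_packages) > 0) install.packages(missing_packages)
```

An empty result means that every library() call above should succeed.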

6 Load the data: Qualtrics

  • Let’s take a look at how to deal with the Qualtrics format
    • Qualtrics exports include HTML tags that are used for diagnostics
    • We need to get rid of them
    • We are going to use the qualtRics package to remove the tags
df_base <- qualtRics::read_survey(here("data","experimental_political_science_2019.csv") , # name of our data file
    legacy=FALSE,     # new or old version of Qualtrics 
    strip_html=TRUE   # remove the html tag
)

head(df_base)
# A tibble: 6 × 45
  `Duration (in seconds)` Finished ResponseId  Q578  Q579  Q580  Q271_1 Q271_2 Q271_3 Q271_4 Q271_5 Q271_6 Q78   Q77   Q76   Q75     Q74 Q581  `F-1-1` `F-1-1-1` `F-1-2` `F-1-1-2` `F-1-3` `F-1-1-3` `F-1-2-1` `F-1-2-2` `F-1-2-3` `F-2-1` `F-2-1-1` `F-2-2` `F-2-1-2` `F-2-3` `F-2-1-3` `F-2-2-1` `F-2-2-2`
                    <dbl> <lgl>    <chr>       <chr> <chr> <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> <chr> <chr> <chr> <dbl> <chr> <chr>   <chr>     <chr>   <chr>     <chr>   <chr>     <chr>     <chr>     <chr>     <chr>   <chr>     <chr>   <chr>     <chr>   <chr>     <chr>     <chr>    
1                     227 TRUE     R_2rzostKY… Cand… Cand… Cand… 4 - N… 6      3      5      3      5      Empl… Hisp… Mast… Fema…    25 Regu… Penal … The cand… Past P… Approxim… Policy… Acceptan… The cand… Approxim… Collecti… Penal … No proce… Past P… None      Policy… Collecti… No proce… Approxim…
2                     195 TRUE     R_5dluK6Rw… Cand… Cand… Cand… 5      3      2      4 - N… 2      5      Unem… Cauc… Some… Male     28 Never Policy… Collecti… Penal … The cand… Past P… Approxim… Welcome … The cand… None      Policy… Acceptan… Penal … The cand… Past P… Approxim… Welcome … The cand…
3                     295 TRUE     R_3QMq6Ezo… Cand… Cand… Cand… 6      6      6      6      3      5      Stud… Cauc… Some… Male     22 Never Past P… None      Policy… Welcome … Penal … The cand… Approxim… Acceptan… No proce… Past P… Approxim… Policy… Acceptan… Penal … The cand… Approxim… Welcome …
4                     358 TRUE     R_2ZWYdunD… Cand… Cand… Cand… 6      3      3      6      1 - S… 7 - S… Unem… Cauc… Some… Male     32 Never Past P… Approxim… Penal … No proce… Policy… Welcome … Approxim… The cand… Acceptan… Past P… Approxim… Penal … No proce… Policy… Acceptan… Approxim… The cand…
5                     310 TRUE     R_3GEedrAF… Cand… Cand… Cand… 6      3      6      5      3      6      Stud… Asian Some… Male     21 Never Penal … The cand… Past P… None      Policy… Acceptan… No proce… None      Acceptan… Penal … The cand… Past P… Approxim… Policy… Collecti… The cand… Approxim…
6                     365 TRUE     R_emv0H4Gn… Cand… Cand… Cand… 6      5      5      4 - N… 5      5      Empl… Cauc… Some… Male     30 Regu… Past P… None      Policy… Acceptan… Penal … The cand… None      Collecti… No proce… Past P… Approxim… Policy… Acceptan… Penal … No proce… None      Collecti…
# … with 10 more variables: `F-2-2-3` <chr>, `F-3-1` <chr>, `F-3-1-1` <chr>, `F-3-2` <chr>, `F-3-1-2` <chr>, `F-3-3` <chr>, `F-3-1-3` <chr>, `F-3-2-1` <chr>, `F-3-2-2` <chr>, `F-3-2-3` <chr>

If you first need to download the data from the GitHub repository, run the following before loading the file (the "data" folder must already exist in your project):

githubURL <- "https://github.com/albertostefanelli/conjoint_class/raw/master/data/experimental_political_science_2019.csv"

download.file(githubURL, 
    destfile = here("data","experimental_political_science_2019.csv"))
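
download.file() will typically fail when the target folder does not exist, and it downloads the file again on every run. A minimal sketch of a guard using only base R follows; download_if_missing() is a hypothetical helper name, not part of the lab.

```r
# Hypothetical helper (not from the lab): download `url` to `destfile`
# only when the file is not already there; returns TRUE if it downloaded.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)
  }
  # create the target folder (e.g. "data") if needed
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  # mode = "wb" writes the file byte for byte, which matters on Windows
  download.file(url, destfile = destfile, mode = "wb")
  TRUE
}

# Intended use with the objects defined above (left commented to avoid a
# network call here):
# download_if_missing(githubURL,
#                     here("data", "experimental_political_science_2019.csv"))
```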

7 Understanding the data

The F-*-* columns are our conjoint features, that is, the experimental manipulations. We need to make sense of them and be extra careful to handle them correctly.

df_base |> dplyr::select(dplyr::starts_with("F-")) 
# A tibble: 222 × 27
   `F-1-1`                   `F-1-1-1`                             `F-1-2` `F-1-1-2` `F-1-3` `F-1-1-3` `F-1-2-1` `F-1-2-2` `F-1-2-3` `F-2-1` `F-2-1-1` `F-2-2` `F-2-1-2` `F-2-3` `F-2-1-3` `F-2-2-1` `F-2-2-2` `F-2-2-3` `F-3-1` `F-3-1-1` `F-3-2` `F-3-1-2` `F-3-3` `F-3-1-3` `F-3-2-1` `F-3-2-2` `F-3-2-3`
   <chr>                     <chr>                                 <chr>   <chr>     <chr>   <chr>     <chr>     <chr>     <chr>     <chr>   <chr>     <chr>   <chr>     <chr>   <chr>     <chr>     <chr>     <chr>     <chr>   <chr>     <chr>   <chr>     <chr>   <chr>     <chr>     <chr>     <chr>    
 1 Penal proceedings         The candidate has been convicted of … Past P… Approxim… Policy… Acceptan… The cand… Approxim… Collecti… Penal … No proce… Past P… None      Policy… Collecti… No proce… Approxim… Welcome … Penal … The cand… Past P… Approxim… Policy… Acceptan… No proce… Approxim… Welcome …
 2 Policy Proposal           Collective expulsion of immigrants a… Penal … The cand… Past P… Approxim… Welcome … The cand… None      Policy… Acceptan… Penal … The cand… Past P… Approxim… Welcome … The cand… Approxim… Policy… Collecti… Penal … No proce… Past P… Approxim… Acceptan… The cand… None     
 3 Past Political Experience None                                  Policy… Welcome … Penal … The cand… Approxim… Acceptan… No proce… Past P… Approxim… Policy… Acceptan… Penal … The cand… Approxim… Welcome … No proce… Past P… Approxim… Policy… Collecti… Penal … No proce… Approxim… Welcome … The cand…
 4 Past Political Experience Approximately 10 years                Penal … No proce… Policy… Welcome … Approxim… The cand… Acceptan… Past P… Approxim… Penal … No proce… Policy… Acceptan… Approxim… The cand… Welcome … Past P… Approxim… Penal … No proce… Policy… Acceptan… Approxim… No proce… Welcome …
 5 Penal proceedings         The candidate is under investigation… Past P… None      Policy… Acceptan… No proce… None      Acceptan… Penal … The cand… Past P… Approxim… Policy… Collecti… The cand… Approxim… Collecti… Penal … The cand… Past P… None      Policy… Welcome … No proce… None      Collecti…
 6 Past Political Experience None                                  Policy… Acceptan… Penal … The cand… None      Collecti… No proce… Past P… Approxim… Policy… Acceptan… Penal … No proce… None      Collecti… The cand… Past P… Approxim… Policy… Welcome … Penal … The cand… Approxim… Collecti… The cand…
 7 Policy Proposal           Welcome immigrants and organise huma… Past P… Approxim… Penal … The cand… Welcome … Approxim… The cand… Policy… Welcome … Past P… None      Penal … The cand… Acceptan… Approxim… The cand… Policy… Acceptan… Past P… None      Penal … The cand… Welcome … None      The cand…
 8 Policy Proposal           Collective expulsion of immigrants a… Penal … The cand… Past P… Approxim… Welcome … The cand… None      Policy… Acceptan… Penal … No proce… Past P… None      Welcome … The cand… None      Policy… Collecti… Penal … The cand… Past P… Approxim… Welcome … The cand… Approxim…
 9 Penal proceedings         No proceedings                        Past P… None      Policy… Acceptan… The cand… None      Acceptan… Penal … The cand… Past P… None      Policy… Collecti… No proce… Approxim… Collecti… Penal … No proce… Past P… None      Policy… Acceptan… The cand… Approxim… Welcome …
10 Policy Proposal           Acceptance of immigrants conditional… Past P… Approxim… Penal … The cand… Collecti… None      The cand… Policy… Collecti… Past P… Approxim… Penal … The cand… Welcome … Approxim… The cand… Policy… Acceptan… Past P… Approxim… Penal … No proce… Collecti… Approxim… The cand…
# … with 212 more rows

Let’s dive deeper and try to figure out which columns hold the attributes (i.e. the experimental characteristics/manipulations) of the experiment

df_base |> dplyr::select("F-1-1") |> table() 

Past Political Experience         Penal proceedings           Policy Proposal 
                       75                        69                        78 
df_base |> dplyr::select("F-1-1-1") |> table() 

  Acceptance of immigrants conditional on certain requisites                                       Approximately 10 years                                       Approximately 20 years Collective expulsion of immigrants and closure of the border 
                                                          18                                                           26                                                           22                                                           31 
                                              No proceedings                                                         None               The candidate has been convicted of corruption          The candidate is under investigation for corruption 
                                                          26                                                           27                                                           25                                                           18 
              Welcome immigrants and organise human corridor 
                                                          29 
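
The two tables above suggest a naming pattern: a column whose name has three parts (F-1-1) holds attribute names, while a column whose name has four parts (F-1-1-1) holds the levels shown for those attributes. Reading the parts as task, profile, and attribute position is an assumption that should be checked against the survey design. Under that assumption, the two kinds of columns can be separated by counting the parts of each name. The toy column names below stand in for names(df_base).

```r
# Toy column names mimicking the conjoint columns of df_base
cols <- c("ResponseId", "F-1-1", "F-1-1-1", "F-1-2", "F-1-1-2",
          "F-1-2-1", "F-1-2-2")

# keep only the conjoint columns
cj_cols <- cols[startsWith(cols, "F-")]

# number of hyphen-separated parts in each name ("F-1-1" has 3)
n_parts <- lengths(strsplit(cj_cols, "-", fixed = TRUE))

attribute_cols <- cj_cols[n_parts == 3]  # assumed: attribute names
level_cols     <- cj_cols[n_parts == 4]  # assumed: attribute levels

print(attribute_cols)
print(level_cols)
```

With the real data, lapply(df_base[attribute_cols], table) would tabulate every attribute column at once instead of one column at a time.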

8 Checks and data manipulation

8.1 Validity Checks

  • We are going to try to detect abnormal patterns in the survey duration
    • Exclude respondents who did not complete the survey or were too fast in completing it
    • Exclude respondents who took too long to complete the survey
# duration of the survey in seconds
df_base$"Duration (in seconds)" <- as.numeric(df_base$"Duration (in seconds)") 
# let's transform it in minutes 
df_base$duration_mins <-  df_base$"Duration (in seconds)"/60
# plot the density 
ggplot(df_base, aes(x=duration_mins)) + 
  geom_density()
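
A numeric summary can complement the density plot. The sketch below uses a made-up vector of durations, not the lab data; with the real data, df_base$duration_mins would take its place. The reference window of 8 to 15 minutes is the expected completion time reported for this survey.

```r
# Toy durations in minutes; with the lab data use df_base$duration_mins
duration_mins <- c(1.5, 3.2, 6.0, 8.5, 9.1, 10.4, 12.7, 14.9, 22.3, 95.0)

# selected quantiles show where the bulk of the distribution lies
print(quantile(duration_mins, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)))

# % of respondents inside the reference window (8 to 15 minutes)
share_in_window <- mean(duration_mins >= 8 & duration_mins <= 15) * 100
print(share_in_window)
```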

  1. The survey has been tested and it was estimated to take between 8 and 15 mins to be completed
  2. Q: What can you see from the graph above?
  3. We are going to exclude outliers
    1. Less than 4 mins
    2. More than 3 standard deviations above the mean
# calculate the % of respondents that completed the survey in less than 4 minutes
(sum(df_base$duration_mins < 4)/nrow(df_base)*100)
[1] 11.71171
# flag the respondents who took more than 4 minutes to complete the survey
lower_bound <- df_base$duration_mins > 4
# subset the data, keeping only those respondents
df_greater_base <- df_base[lower_bound,]
# for the upper bound we resort to another approach:
# scale the durations to z-scores (as.numeric() drops the matrix structure returned by scale())
df_greater_base$scaled_duration_mins <- as.numeric(scale(df_greater_base$duration_mins))
# calculate the % of respondents that are 3 sd or more from the mean  
(sum(df_greater_base$scaled_duration_mins > 3)/nrow(df_greater_base)*100)
[1] 1.530612
# upper bound 
upper_bound <- df_greater_base$scaled_duration_mins < 3
# subset the data, keeping only the respondents who are less than 3 sd above the mean
df_greater_base <- df_greater_base[upper_bound,]
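
The two exclusion steps above can be wrapped in a single function so that the cutoffs are easy to change. This is a minimal sketch on made-up data: filter_duration() and its arguments are hypothetical names, and the defaults simply mirror the 4-minute and 3-sd cutoffs used above.

```r
# Hypothetical helper (not from the lab): apply both duration filters.
filter_duration <- function(df, duration_col, min_mins = 4, sd_cutoff = 3) {
  mins <- df[[duration_col]]
  # lower bound: keep respondents above `min_mins` (missing durations are dropped)
  df <- df[!is.na(mins) & mins > min_mins, , drop = FALSE]
  # upper bound: z-scores computed on the remaining respondents only
  z <- as.numeric(scale(df[[duration_col]]))
  # keep rows below the cutoff; NA z-scores (zero variance) are kept
  df[is.na(z) | z < sd_cutoff, , drop = FALSE]
}

# Toy data: two speeders, thirty typical respondents, one extreme value
toy <- data.frame(id = 1:33,
                  duration_mins = c(2, 3, rep(10, 30), 500))

toy_clean <- filter_duration(toy, "duration_mins")
print(nrow(toy_clean))
```

As in the lab code, the z-scores are computed after the fast respondents have been removed, so the lower cutoff affects the mean and standard deviation used for the upper one.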

8.2 Removing NA on the DV

  • Calculate how many respondents did not answer the CJ questions
(sum(is.na(df_greater_base$Q578))/nrow(df_greater_base)*100)
[1] 4.145078
(sum(is.na(df_greater_base$Q579))/nrow(df_greater_base)*100)
[1] 4.145078
(sum(is.na(df_greater_base$Q580))/nrow(df_greater_base)*100)
[1] 3.626943
  • The similarity in the percentages suggests that one specific subset of respondents skipped the CJ tasks
  • That would be good news, but let’s check whether this is the case
  • Let’s investigate which respondents did not answer the CJ questions
df_greater_base[is.na(df_greater_base$Q578),"ResponseId"]
# A tibble: 8 × 1
  ResponseId       
  <chr>            
1 R_3dEN4IvpNRlrYGZ
2 R_24GD2YvqjePlgR8
3 R_9QSDhesusl6rm8N
4 R_2QMWfW6np5x4tOU
5 R_1r8icC6TonV8OPo
6 R_2S3ngM3YxFabDkc
7 R_3nDDFSDtufMIwpL
8 R_3QPGHPBx2jnXRIQ
df_greater_base[is.na(df_greater_base$Q579),"ResponseId"]
# A tibble: 8 × 1
  ResponseId       
  <chr>            
1 R_3dEN4IvpNRlrYGZ
2 R_24GD2YvqjePlgR8
3 R_9QSDhesusl6rm8N
4 R_2QMWfW6np5x4tOU
5 R_1r8icC6TonV8OPo
6 R_2S3ngM3YxFabDkc
7 R_3nDDFSDtufMIwpL
8 R_3QPGHPBx2jnXRIQ
df_greater_base[is.na(df_greater_base$Q580),"ResponseId"]
# A tibble: 7 × 1
  ResponseId       
  <chr>            
1 R_3dEN4IvpNRlrYGZ
2 R_24GD2YvqjePlgR8
3 R_9QSDhesusl6rm8N
4 R_2QMWfW6np5x4tOU
5 R_1r8icC6TonV8OPo
6 R_3nDDFSDtufMIwpL
7 R_3QPGHPBx2jnXRIQ
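
The three listings above can be cross-checked in one step by counting how many CJ tasks each respondent skipped. The sketch below uses a made-up data frame with placeholder answers ("A", "B"); with the lab data, df_greater_base would take the place of toy.

```r
# Toy data: NA marks a skipped conjoint task
toy <- data.frame(
  ResponseId = c("R_a", "R_b", "R_c", "R_d"),
  Q578 = c("A", NA, NA, "B"),
  Q579 = c("B", NA, NA, "A"),
  Q580 = c("A", NA, "B", "B"),
  stringsAsFactors = FALSE
)

cj_items <- c("Q578", "Q579", "Q580")

# number of skipped CJ tasks per respondent
toy$n_missing_cj <- rowSums(is.na(toy[, cj_items]))

# how many respondents skipped 0, 1, 2 or 3 tasks
print(table(toy$n_missing_cj))

# who skipped at least one task
print(toy[toy$n_missing_cj > 0, c("ResponseId", "n_missing_cj")])
```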
  • It seems that the same respondents skipped the CJ tasks: seven of them skipped all three, and one (R_2S3ngM3YxFabDkc) skipped Q578 and Q579 but answered Q580
    • We can safely remove them from our sample
    • Let’s now exclude the missing values
    • NB: If more than 5% of the answers are missing, something may have gone wrong and further investigation is needed
df_greater_base_wo_NA <- df_greater_base[!is.na(df_greater_base$Q578),]
df_greater_base_wo_NA <- df_greater_base_wo_NA[!is.na(df_greater_base_wo_NA$Q579),]
df_greater_base_wo_NA <- df_greater_base_wo_NA[!is.na(df_greater_base_wo_NA$Q580),]
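
The same filter can be written in one step with base R's complete.cases(), which flags the rows that have no NA in the selected columns. A minimal sketch on made-up data with placeholder answers ("A", "B"); with the lab data, df_greater_base would take the place of toy.

```r
# Toy data: NA marks a skipped conjoint task
toy <- data.frame(
  ResponseId = c("R_a", "R_b", "R_c", "R_d"),
  Q578 = c("A", NA, NA, "B"),
  Q579 = c("B", NA, NA, "A"),
  Q580 = c("A", NA, "B", "B"),
  stringsAsFactors = FALSE
)

cj_items <- c("Q578", "Q579", "Q580")

# keep only the respondents who answered all three CJ tasks
toy_wo_NA <- toy[complete.cases(toy[, cj_items]), ]
print(toy_wo_NA$ResponseId)
```

Listing the columns in one vector also makes it harder to forget one of the three tasks when the filter is reused.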