Estudo do Pacote Flexmix

1 O pacote

O trabalho desenvolvido teve como objetivo estudar o pacote Flexmix e compara-lo com demais técnicas de clusterização.

O pacote trás como sua principal funcionalidade a capacidade de ajustar diferentes distribuições para as misturas, se caracterizando como um framework de misturas de modelos lineares generalizados, semi paramétrico e não paramétrico

O pacote já está disponível no cran, através de https://cran.r-project.org/web/packages/flexmix/index.html

O autor também oferece diversos artigos em forma de vignetes

2 Aplicação em Dados Reais

A primeira etapa do trabalho se baseia na aplicação de diferentes métodos de agrupamento

Os dados utilizados foram retirados do e se referem a medidas de pinguins adultos perto da Estação Palmer, Antártida (Palmer Station)

df_pengu = palmerpenguins::penguins %>%
  filter(complete.cases(.)) |>
  select(-year)

df_pengu |>
  rmarkdown::paged_table()

O conjunto de dados possui as seguintes variáveis

species
- Um fator com as espécies de pinguim (Adelie, Gentoo e Chinstrap)
island
- Um fator com cada ilha do Arquipélago Palmer, na Antártida (Biscoe, Dream, Togersen)
bill_length_mm
- Um número inteiro que indica o comprimento do bico (em milímetros)
bill_depth_mm
- Um número inteiro que indica a profundidade do bico (em milímetros)
flipper_length_mm
- Um número inteiro que indica o comprimento da nadadeira (em milímetros)
body_mass_g
- Um número inteiro que indica a massa corporal (em gramas)
sex
- Um fator que indica o sexo do(a) pinguim (macho, fêmea)

As 3 variáveis categóricas podem se mostrar de interesse para construção de grupos

df_pengu |>
  group_by(
    across(
      where(is.factor)
      )
    ) |>
  summarise(
    across(
      where(is.numeric),
      ~mean(.)
      )
    ) |>
  rmarkdown::paged_table()

`summarise()` has grouped output by 'species', 'island'. You can override using
the `.groups` argument.

df_pengu  |>
    tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  ungroup() |>
  mutate(species_sex = glue::glue("{species}_{sex}") |>
           as.character(),
         .keep = 'unused') |>
  ggplot(aes(x = species_sex, y = value, fill = species_sex)) +
  geom_violin(drop = F) +
  facet_wrap(~name, scales = 'free') +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Vemos que a variável de espécie e sexo são aquelas a apresentarem maiores divisões entre os pinguins e assim desejamos ver que essa separação seja capturada pelos algoritmos de agrupamento

É importante destacar que tais variáveis categóricas serão ‘escondidas’ para algoritmos de agrupamento.

3 diferentes algoritmos foram utilizados: AGNES, K-Means e Modelos de Misturas

2.1 AGNES

AGNES(AGglomerative NESting) é um método de agrupamento hierárquico aglomerativo.

df_pengu_scl_num = df_pengu |>
  select(
    where(is.numeric)
    ) |>
  mutate(
    across(
      where(is.numeric), 
    ~scale(.)
    )
  )


agnes_cluter = df_pengu_scl_num |>
  factoextra::get_dist() |>
  hclust(method = 'complete')

fviz_nbclust(df_pengu_scl_num, FUN = hcut, method = "wss")

fviz_nbclust(df_pengu_scl_num, FUN = hcut, method = "silhouette")

fviz_nbclust(df_pengu_scl_num, FUN = hcut, method = "gap_stat")

plot(agnes_cluter)
rect.hclust(agnes_cluter, k=5)

fviz_cluster(
  list(data = df_pengu_scl_num, 
       cluster = cutree(agnes_cluter, k = 5)),
  ellipse = TRUE,
  ellipse.type = "norm"
  )

2.2 K-Means

A clusterização via K-means (MacQueen 1967) é um dos algoritmos de aprendizado de máquina não supervisionado mais comumente usados para particionar um determinado conjunto de dados em um conjunto de k grupos (ou seja, k clusters), onde k representa o número de grupos pré-especificados pelo analista.

Assim como no método hierárquico, foi definido um número de grupos igual a 5

kmeans_cluter = df_pengu_scl_num |> 
  kmeans(centers = 5, nstart = 35)

fviz_cluster(kmeans_cluter, 
             data = df_pengu_scl_num,
             ellipse = TRUE,
             ellipse.type = "norm"
  )

2.3 Modelo de Misturas

Para a aplicação de modelos de misturas, o pacote flexmix, já introduzido, foi utilizado

gmm_pengu = df_pengu_scl_num %>%
  flexmix::flexmix(bill_length_mm + bill_depth_mm + 
                     flipper_length_mm + body_mass_g ~ 1, data = ., k = 5) 

gmm_pengu|>
  summary()


Call:
flexmix::flexmix(formula = bill_length_mm + bill_depth_mm + flipper_length_mm + 
    body_mass_g ~ 1, data = ., k = 5)

       prior size post>0 ratio
Comp.1 0.259  105    257 0.409
Comp.2 0.198  111    333 0.333
Comp.3 0.179    1    333 0.003
Comp.4 0.171  108    262 0.412
Comp.5 0.192    8    333 0.024

'log Lik.' -754.4234 (df=14)
AIC: 1536.847   BIC: 1590.161

gmm_pengu |>
  plot()

3 Etapa de Simulação

A segunda etapa do trabalho buscou estudar a capacidade de agrupamento do algoritmo utilizado no pacote via estudo de simulação

Estudos de grupos menos e mais semelhantes entre si foi realizado, onde para isso foram definidos 3 simulações bases

Grupos diferentes apenas na média
- Foram simulados grupos que se diferem em 10, 5, e 1 unidade de média, com o desvio-padrão fixado em 1

mix_mean_change = 
  list(near = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 1, sd = 1),
           c3 = rnorm(100, mean = 2, sd = 1)),
       between = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 5, sd = 1),
           c3 = rnorm(100, mean = 10, sd = 1)),
       far = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 10, sd = 1),
           c3 = rnorm(100, mean = 20, sd = 1))
       )

mix_mean_change |> 
  as.data.frame() |> 
  tidyr::pivot_longer(where(is.numeric)) |>
  dplyr::mutate(name = name |>
           stringr::str_remove_all("\\..*$")) |>
  ggplot(aes(x = value, fill = name)) +
  geom_histogram() +
  facet_wrap(~name, ncol = 1, scales = 'free')

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Grupos diferentes apenas no desvio-padrão
- Foram simulados grupos que se diferem em 10, 5, e 1 unidade de desvio-padrão, com a média fixada em 0

mix_sd_change = 
  list(near = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 0, sd = 2),
           c3 = rnorm(100, mean = 0, sd = 3)),
       between = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 0, sd = 5),
           c3 = rnorm(100, mean = 0, sd = 10)),
       far = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 0, sd = 10),
           c3 = rnorm(100, mean = 0, sd = 20))
       )

mix_sd_change |> 
  as.data.frame() |> 
  tidyr::pivot_longer(where(is.numeric)) |>
  dplyr::mutate(name = name |>
           stringr::str_remove_all("\\..*$")) |>
  ggplot(aes(x = value, fill = name)) +
  geom_histogram() +
  facet_wrap(~name, ncol = 1, scales = 'free')

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Grupos diferentes em média e desvio-padrão
- Foram simulados grupos que se diferem em 5, e 1 unidade de média e desvio-padrão. Buscando a complexidade do trabalho, as seguintes combinações foram realizadas

1 unidades de distância na média

1, 2 e 5 unidades de distância no desvio padrão

3 unidades de distância na média

1, 2 e 5 unidades de distância no desvio padrão

mix_mean_sd_change1 = 
  list(near = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 3, sd = 2),
           c3 = rnorm(100, mean = 6, sd = 3)),
       between = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 3, sd = 3),
           c3 = rnorm(100, mean = 6, sd = 5)),
       far = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 3, sd = 5),
           c3 = rnorm(100, mean = 6, sd = 10))
       )

mix_mean_sd_change2 = 
  list(near = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 1, sd = 2),
           c3 = rnorm(100, mean = 2, sd = 3)),
       between = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 1, sd = 3),
           c3 = rnorm(100, mean = 2, sd = 5)),
       far = 
         tibble::tibble(
           c1 = rnorm(100, mean = 0, sd = 1),
           c2 = rnorm(100, mean = 1, sd = 5),
           c3 = rnorm(100, mean = 2, sd = 10))
       )

mix_mean_sd_change1 |> 
  as.data.frame() |> 
  tidyr::pivot_longer(where(is.numeric)) |>
  dplyr::mutate(name = name |>
           stringr::str_remove_all("\\..*$")) |>
  ggplot(aes(x = value, fill = name)) +
  geom_histogram() +
  facet_wrap(~name, ncol = 1, scales = 'free')

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mix_mean_sd_change2 |> 
  as.data.frame() |> 
  tidyr::pivot_longer(where(is.numeric)) |>
  dplyr::mutate(name = name |>
           stringr::str_remove_all("\\..*$")) |>
  ggplot(aes(x = value, fill = name)) +
  geom_histogram() +
  facet_wrap(~name, ncol = 1, scales = 'free')

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Para os 3 casos, foram simulados grupos a partir de uma distribuição Normal com tamanho amostral 300.

3.1 AGNES

3.1.1 Variando a Média

Primeiramente o algoritmo foi aplicado nos grupos com seperação de 10 unidade na média

agnes_far_cluster = mix_mean_change$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_far_cluster)
rect.hclust(agnes_far_cluster, k=3)

A segunda etapa foi a aplicação do método nos dados com grupos se distanciando em 5 unidades na média

agnes_between_cluster = mix_mean_change$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_between_cluster)
rect.hclust(agnes_between_cluster, k=3)

A útlima etapa foi aplicação do método nos dados que possuiam os grupos mais próximos, com apenas 1 unidade de distancia na média

agnes_near_cluster = mix_mean_change$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_near_cluster)
rect.hclust(agnes_near_cluster, k=3)

3.1.2 Variando o Desvio Padrão

agnes_far_cluster = mix_sd_change$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_far_cluster)
rect.hclust(agnes_far_cluster, k=3)

agnes_between_cluster = mix_sd_change$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_between_cluster)
rect.hclust(agnes_between_cluster, k=3)

agnes_near_cluster = mix_sd_change$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_near_cluster)
rect.hclust(agnes_near_cluster, k=3)

3.1.3 Variando a Média e o Desvio Padrão

agnes_far_cluster = mix_mean_sd_change1$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_far_cluster)
rect.hclust(agnes_far_cluster, k=3)

agnes_between_cluster = mix_mean_sd_change1$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_between_cluster)
rect.hclust(agnes_between_cluster, k=3)

agnes_near_cluster = mix_mean_sd_change1$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_near_cluster)
rect.hclust(agnes_near_cluster, k=3)

agnes_far_cluster = mix_mean_sd_change2$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_far_cluster)
rect.hclust(agnes_far_cluster, k=3)

agnes_between_cluster = mix_mean_sd_change2$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_between_cluster)
rect.hclust(agnes_between_cluster, k=3)

agnes_near_cluster = mix_mean_sd_change2$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |>
  factoextra::get_dist() |>
  hclust(method = 'complete') 


plot(agnes_near_cluster)
rect.hclust(agnes_near_cluster, k=3)

3.2 K-Means

3.2.1 Variando a Média

kmeans_far_cluster = mix_mean_change$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

mix_mean_change$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_far_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

kmeans_between_cluster = mix_mean_change$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

kmeans_between_cluster

K-means clustering with 3 clusters of sizes 101, 100, 99

Cluster means:
        [,1]
1 5.11407444
2 9.95674088
3 0.02254053

Clustering vector:
  [1] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 2 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
 [38] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
 [75] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[112] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
[149] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[186] 2 3 1 2 3 1 1 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[223] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 1 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
[260] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[297] 2 3 1 2

Within cluster sum of squares by cluster:
[1]  96.52853 111.47700  98.30881
 (between_SS / total_SS =  94.1 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

mix_mean_change$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_between_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

kmeans_near_cluster = mix_mean_change$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

kmeans_near_cluster

K-means clustering with 3 clusters of sizes 91, 83, 126

Cluster means:
        [,1]
1 -0.4942491
2  2.6004717
3  1.0827474

Clustering vector:
  [1] 1 3 3 1 3 3 3 3 2 1 3 2 3 2 2 1 2 3 3 3 2 1 2 3 1 2 3 1 2 2 3 3 2 1 3 2 1
 [38] 3 2 1 3 2 3 3 3 3 2 2 1 3 2 1 2 2 3 2 2 1 3 2 1 3 3 3 1 3 1 2 2 2 3 2 3 1
 [75] 3 1 1 2 1 2 3 3 2 2 1 2 3 1 1 2 3 3 2 1 2 2 3 3 3 1 1 1 1 3 3 3 1 3 1 3 2
[112] 1 1 3 1 2 3 1 1 2 3 3 2 1 3 2 1 2 3 3 1 2 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3
[149] 1 3 3 1 2 1 2 3 1 3 3 3 1 3 1 2 3 3 3 2 1 2 2 1 1 2 3 1 3 3 2 3 3 2 3 1 1
[186] 3 1 1 3 1 3 3 1 3 3 1 3 2 3 2 3 1 2 2 3 1 2 1 3 2 1 1 3 1 1 1 1 3 2 3 1 2
[223] 3 2 1 1 3 2 1 3 2 1 3 2 1 1 3 1 2 2 1 3 2 3 3 2 1 1 3 3 2 2 3 1 3 1 3 2 1
[260] 1 2 1 3 3 3 2 3 3 3 2 1 3 2 1 2 2 1 1 3 1 3 1 1 3 3 3 3 2 1 1 2 1 2 2 3 3
[297] 2 1 3 3

Within cluster sum of squares by cluster:
[1] 40.28134 34.35659 25.18619
 (between_SS / total_SS =  80.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

mix_mean_change$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_near_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

3.2.2 Variando o Desvio Padrão

kmeans_far_cluster = mix_sd_change$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

kmeans_far_cluster

K-means clustering with 3 clusters of sizes 210, 49, 41

Cluster means:
         [,1]
1   0.9986902
2 -19.9670116
3  24.7030709

Clustering vector:
  [1] 1 1 2 1 1 1 1 1 3 1 1 2 1 3 3 1 1 1 1 1 3 1 1 3 1 1 1 1 1 2 1 1 1 1 2 2 1
 [38] 2 1 1 1 3 1 1 3 1 1 2 1 2 2 1 1 1 1 1 3 1 1 3 1 1 2 1 2 1 1 1 2 1 1 2 1 2
 [75] 2 1 1 1 1 2 1 1 1 2 1 3 2 1 1 1 1 1 1 1 1 2 1 1 3 1 1 1 1 1 3 1 1 2 1 2 3
[112] 1 1 1 1 2 3 1 2 1 1 1 1 1 1 1 1 2 2 1 3 3 1 1 2 1 1 1 1 1 2 1 1 3 1 1 1 1
[149] 1 1 1 1 2 1 1 3 1 3 3 1 1 3 1 2 1 1 2 3 1 1 1 1 3 3 1 1 1 1 1 2 1 3 1 1 2
[186] 2 1 1 1 1 3 2 1 3 1 1 3 2 1 1 2 1 1 3 1 1 3 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2
[223] 1 1 2 1 2 1 1 1 3 1 1 1 1 1 2 1 1 3 1 1 2 1 1 1 1 1 3 1 3 2 1 3 1 1 1 2 1
[260] 1 1 1 2 2 1 1 3 1 1 3 1 3 1 1 1 1 1 1 1 1 1 2 1 3 1 1 1 3 1 1 1 1 1 1 1 1
[297] 2 1 1 3

Within cluster sum of squares by cluster:
[1] 4248.265 4756.318 6918.339
 (between_SS / total_SS =  73.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

mix_mean_change$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_far_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

kmeans_between_cluster = mix_sd_change$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

kmeans_between_cluster

K-means clustering with 3 clusters of sizes 50, 204, 46

Cluster means:
        [,1]
1 -8.9656265
2  0.3308617
3 12.1292901

Clustering vector:
  [1] 2 1 3 2 2 2 2 2 3 2 2 1 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 2 3 2 2 1 2 1 1 2
 [38] 3 3 2 3 3 2 3 3 2 2 3 2 2 2 2 1 2 2 2 3 2 3 3 2 2 3 2 2 2 2 2 1 2 3 3 2 1
 [75] 3 2 2 1 2 2 3 2 1 2 2 2 1 2 2 1 2 2 3 2 2 3 2 1 1 2 2 3 2 2 1 2 2 1 2 3 2
[112] 2 2 2 2 2 1 2 2 3 2 1 2 2 3 2 2 2 2 2 3 3 2 2 3 2 2 1 2 2 1 2 1 2 2 2 2 2
[149] 2 3 2 2 2 2 2 1 2 2 1 2 2 1 2 1 2 2 1 2 2 2 1 2 2 2 2 2 3 2 1 1 2 1 2 2 3
[186] 3 2 2 1 2 3 2 2 3 3 2 2 2 2 3 1 2 2 3 2 2 1 2 2 2 2 2 2 2 2 3 2 2 2 2 1 1
[223] 2 2 2 2 2 1 2 1 3 2 2 3 2 3 3 2 2 3 2 3 2 2 2 2 2 2 1 2 2 3 2 1 2 2 2 1 2
[260] 2 1 2 1 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 3 2 2 1 2 2 3 2 2 1 2 2
[297] 2 2 2 2

Within cluster sum of squares by cluster:
[1]  981.1611  970.5043 1395.8758
 (between_SS / total_SS =  76.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

mix_mean_change$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_between_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

kmeans_near_cluster = mix_sd_change$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

kmeans_between_cluster

K-means clustering with 3 clusters of sizes 50, 204, 46

Cluster means:
        [,1]
1 -8.9656265
2  0.3308617
3 12.1292901

Clustering vector:
  [1] 2 1 3 2 2 2 2 2 3 2 2 1 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 2 3 2 2 1 2 1 1 2
 [38] 3 3 2 3 3 2 3 3 2 2 3 2 2 2 2 1 2 2 2 3 2 3 3 2 2 3 2 2 2 2 2 1 2 3 3 2 1
 [75] 3 2 2 1 2 2 3 2 1 2 2 2 1 2 2 1 2 2 3 2 2 3 2 1 1 2 2 3 2 2 1 2 2 1 2 3 2
[112] 2 2 2 2 2 1 2 2 3 2 1 2 2 3 2 2 2 2 2 3 3 2 2 3 2 2 1 2 2 1 2 1 2 2 2 2 2
[149] 2 3 2 2 2 2 2 1 2 2 1 2 2 1 2 1 2 2 1 2 2 2 1 2 2 2 2 2 3 2 1 1 2 1 2 2 3
[186] 3 2 2 1 2 3 2 2 3 3 2 2 2 2 3 1 2 2 3 2 2 1 2 2 2 2 2 2 2 2 3 2 2 2 2 1 1
[223] 2 2 2 2 2 1 2 1 3 2 2 3 2 3 3 2 2 3 2 3 2 2 2 2 2 2 1 2 2 3 2 1 2 2 2 1 2
[260] 2 1 2 1 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 3 2 2 1 2 2 3 2 2 1 2 2
[297] 2 2 2 2

Within cluster sum of squares by cluster:
[1]  981.1611  970.5043 1395.8758
 (between_SS / total_SS =  76.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

mix_sd_change$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_near_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

3.2.3 Variando a Média e o Desvio Padrão

kmeans_far_cluster = mix_mean_sd_change1$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

mix_mean_sd_change1$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_far_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

kmeans_between_cluster = mix_mean_sd_change1$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

kmeans_between_cluster

K-means clustering with 3 clusters of sizes 102, 31, 167

Cluster means:
       [,1]
1  4.550842
2 12.084501
3 -0.314631

Clustering vector:
  [1] 3 3 1 3 3 2 3 3 3 3 1 3 3 3 3 3 3 1 3 1 2 3 1 1 3 1 1 3 3 1 3 3 3 3 1 1 3
 [38] 3 1 3 3 1 3 3 3 3 3 1 3 2 1 3 1 1 3 1 2 3 1 1 3 1 1 3 1 1 3 1 3 3 3 3 3 3
 [75] 1 3 1 3 3 1 1 3 3 1 3 1 2 3 1 1 3 1 2 3 3 2 3 1 2 3 3 3 3 1 1 3 3 1 3 1 1
[112] 3 1 1 3 1 1 3 3 2 3 1 1 3 3 2 3 1 2 3 1 3 3 1 1 3 1 2 3 1 3 3 1 3 3 1 2 3
[149] 3 3 3 1 2 3 3 1 3 1 2 3 3 1 3 3 2 3 1 1 3 1 1 3 3 2 3 1 2 3 1 3 3 3 1 3 3
[186] 3 3 1 1 3 3 2 3 1 3 3 1 1 3 3 2 3 3 1 3 1 2 3 1 3 3 1 3 3 1 2 3 3 1 3 3 1
[223] 3 3 3 3 3 3 3 1 2 3 1 2 3 3 3 1 1 1 3 1 2 3 1 3 3 1 1 3 3 2 3 1 1 3 1 3 3
[260] 1 3 3 3 1 3 2 1 3 3 3 3 1 1 3 2 3 3 1 3 3 1 3 3 1 2 3 3 1 3 3 1 3 1 2 3 3
[297] 1 3 1 2

Within cluster sum of squares by cluster:
[1] 290.9344 276.8449 336.0580
 (between_SS / total_SS =  83.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

mix_mean_sd_change1$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_between_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

kmeans_near_cluster = mix_mean_sd_change1$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  dplyr::pull(value) |> 
  kmeans(centers = 3, nstart = 35)

kmeans_near_cluster

K-means clustering with 3 clusters of sizes 41, 155, 104

Cluster means:
      [,1]
1 9.115956
2 0.615215
3 4.785996

Clustering vector:
  [1] 2 3 2 2 2 2 2 3 3 2 3 1 2 3 3 2 3 2 2 2 3 2 2 3 2 3 1 2 3 1 2 3 2 2 3 3 2
 [38] 3 1 2 3 1 2 2 3 2 3 1 2 2 3 2 2 3 2 3 1 2 3 1 2 2 3 2 1 3 2 2 2 2 2 3 2 2
 [75] 1 2 2 3 2 3 1 2 3 1 2 3 3 2 3 3 2 3 3 2 2 3 2 3 1 2 3 3 2 3 1 2 2 3 2 2 3
[112] 2 2 1 2 3 3 2 3 3 2 3 1 2 2 3 2 3 1 2 3 1 2 2 2 2 3 1 2 2 1 2 3 1 2 2 2 2
[149] 2 1 2 3 3 2 3 1 2 3 3 2 3 3 2 2 2 2 2 3 2 2 3 2 3 3 2 3 1 2 2 1 2 2 3 2 3
[186] 2 2 3 3 2 2 1 2 3 3 2 2 1 2 3 3 2 1 3 2 2 3 2 3 3 2 2 3 2 2 1 2 2 3 2 2 2
[223] 2 3 1 2 3 1 2 3 2 2 2 1 2 3 1 2 2 1 2 2 3 2 3 3 2 3 3 2 3 1 2 2 3 2 2 1 2
[260] 3 3 2 3 1 2 3 3 2 3 3 2 2 3 2 3 3 2 2 3 2 2 3 2 2 3 2 3 1 2 2 3 2 2 1 2 3
[297] 1 2 2 1

Within cluster sum of squares by cluster:
[1] 150.0171 236.2913 143.7327
 (between_SS / total_SS =  83.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

mix_mean_sd_change1$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = kmeans_near_cluster$cluster) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

3.3 Modelos de Misturas

3.3.1 Variando a Média

gmm_far = mix_mean_change$far |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .)

gmm_far|>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.348   96    300 0.320
Comp.2 0.309  100    300 0.333
Comp.3 0.343  104    300 0.347

'log Lik.' -1055.212 (df=8)
AIC: 2126.424   BIC: 2156.054

mix_mean_change$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_far |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_between = mix_mean_change$between |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .)

gmm_between|>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.343  169    300 0.563
Comp.2 0.316   97    300 0.323
Comp.3 0.341   34    300 0.113

'log Lik.' -852.9401 (df=8)
AIC: 1721.88   BIC: 1751.511

mix_mean_change$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_between |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_near = mix_mean_change$near |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .) 

gmm_far |>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.348   96    300 0.320
Comp.2 0.309  100    300 0.333
Comp.3 0.343  104    300 0.347

'log Lik.' -1055.212 (df=8)
AIC: 2126.424   BIC: 2156.054

mix_mean_change$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_near |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

3.3.2 Variando o Desvio Padrão

gmm_far = mix_sd_change$far |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .)

gmm_far|>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.418  120    300 0.400
Comp.2 0.335  125    164 0.762
Comp.3 0.247   55    296 0.186

'log Lik.' -1075.781 (df=8)
AIC: 2167.561   BIC: 2197.192

mix_sd_change$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_far |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_between = mix_sd_change$between |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .)

gmm_between|>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.565  155    300 0.517
Comp.2 0.210  102    170 0.600
Comp.3 0.225   43    246 0.175

'log Lik.' -948.516 (df=8)
AIC: 1913.032   BIC: 1942.662

mix_sd_change$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_between |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_near = mix_sd_change$near |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .) 

gmm_far |>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.418  120    300 0.400
Comp.2 0.335  125    164 0.762
Comp.3 0.247   55    296 0.186

'log Lik.' -1075.781 (df=8)
AIC: 2167.561   BIC: 2197.192

mix_sd_change$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_near |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

3.3.3 Variando a Média e o Desvio Padrão

gmm_far = mix_mean_sd_change1$far |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .) 

gmm_far |>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0  ratio
Comp.1 0.298  114    162 0.7037
Comp.2 0.551  170    300 0.5667
Comp.3 0.151   16    300 0.0533

'log Lik.' -950.5372 (df=8)
AIC: 1917.074   BIC: 1946.705

mix_mean_sd_change1$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_far |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_between = mix_mean_sd_change1$between |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .) 

gmm_between |>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0  ratio
Comp.1 0.099   17    300 0.0567
Comp.2 0.317  120    194 0.6186
Comp.3 0.584  163    300 0.5433

'log Lik.' -798.5202 (df=8)
AIC: 1613.04   BIC: 1642.671

mix_mean_sd_change1$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_between |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_near = mix_mean_sd_change1$near |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .) 

gmm_near |>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.255   59    300 0.197
Comp.2 0.322  109    294 0.371
Comp.3 0.423  132    201 0.657

'log Lik.' -748.44 (df=8)
AIC: 1512.88   BIC: 1542.51

mix_mean_sd_change1$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_between |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_far = mix_mean_sd_change2$far |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .) 

gmm_far |>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0  ratio
Comp.1 0.138   14    300 0.0467
Comp.2 0.497  150    300 0.5000
Comp.3 0.365  136    198 0.6869

'log Lik.' -913.3017 (df=8)
AIC: 1842.603   BIC: 1872.234

mix_mean_sd_change2$far |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_far |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_between = mix_mean_sd_change2$between |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .) 

gmm_between |>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.449  196    269 0.729
Comp.2 0.320  104    300 0.347
Comp.3 0.231    0    300 0.000

'log Lik.' -751.798 (df=8)
AIC: 1519.596   BIC: 1549.226

mix_mean_sd_change2$between |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_between |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

gmm_near = mix_mean_sd_change2$near |> 
  tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) %>%
  flexmix::flexmix(value ~ 1, k = 3, data = .) 

gmm_near |>
  summary()


Call:
flexmix::flexmix(formula = value ~ 1, data = ., k = 3)

       prior size post>0 ratio
Comp.1 0.357  155    277 0.560
Comp.2 0.307   50    300 0.167
Comp.3 0.336   95    298 0.319

'log Lik.' -648.8469 (df=8)
AIC: 1313.694   BIC: 1343.324

mix_mean_sd_change2$near |>
   tidyr::pivot_longer(
    where(
      is.numeric
      )
    ) |>
  mutate(fitted = gmm_between |> clusters()) |>
  rename(actual = name) |>
  mutate(actual = actual |>
           stringr::str_remove("[^0-9]") |> 
           as.numeric() ) |>
  mutate(correct = 
           case_when(actual == fitted ~ T,
                             .default = F),
         actual = 
           actual |> 
           as.character()) |>
  group_by(actual) |>
  summarise(media = mean(correct)) |>
  ggplot(aes(x = actual, y = media, fill = actual)) +
  geom_col() +
  labs(title = 'Proporção de Acertos por Grupo', x = 'Grupo', y = 'Proporção Acertos') +
  theme_minimal()

4 Conclusão

O pacote estudado apresentou um desempenho abaixo dos demais métodos para clusterização tanto dos dados reais tanto como para os dados simulados

Alguns pontos devem ser destacados:

O pacote foi construído de uma maneira confusa
- Ao se construir um pacote no R, os autores devem fazer a distinção de funções internas e funções disponíveis ao público, chamada de funções exportadas. Tal diferenciação é importante pois permite aos autores a construção de funções auxiliares, onde essas possuem o objetivo de auxiliar as funções exportadas. Apesar de tal definição ser extremamente crucial na construção de pacotes no R, os autores não fizeram tal diferenciação, onde todas as funções construidos foram exportadas, ou seja, funções internas que não deveriam ser acessadas pelo usuário podem ser utilizadas.
Pacote possui problemas na definição de argumentos
- Não há como definir o número de maneira iterações máximos de maneira rápida, onde esse vem configurada como 200 iterações máximas, na qual esse mostra muitas vezes insuficientes. Para mudar isso, devemos utilizar uma função de configuração do pacote, onde essa defini um número máximo para todas os modelos utilizados, ou seja, a personalização individual de cada modelo se torna lenta por conta dessa abordagem. Não há como definir o máximo do método numérico utilizado, esse vem definido como padrão 10E-6. Diversas funções possuem o parâmetro reticências, “…”, que deveria ser utilizado para definição de algum parãmetro não listado usualmente na função, porém apesar do reticências ser declarado, ele não possui utilização
O foco apresentado nos vignetes
- Os artigos disponíveis como vignetes do pacote apresentam seu foco em MLGs e modelos semi e não paramétricos. Assim, a adição de covariáveis se mostra importantíssima para algoritmo utilizado se mostrar mais preciso