
Throughout the experiments on synthetic data, the state-of-the-art models outperformed the proposed methods. This suggests that a more complex model architecture using self-attention is probably needed to solve such problems. On the real-world Newspaper dataset, we achieved the best results using our proposed model MILPMA (see Table 7.8).

...


Model      ARI              NMI
DAC        0.7989±0.0554    0.8607±0.0371
ABCADD     0.8581±0.0076    0.9047±0.0116
ABCMULTI   0.7779±0.1289    0.8435±0.1061
MILMEAN    0.8692±0.0531    0.9152±0.0398
MILPMA     0.9094±0.0125    0.9420±0.0076
PEMMAX     0.5908±0.0497    0.6937±0.0389
PEMMEAN    0.5821±0.0406    0.6798±0.0331

Table 7.8: Comparison of the models on the Newspaper dataset. A box-plot representation of these results is shown in Figure 7.7. We report ARI and NMI averaged over 5 runs.
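For reference, both metrics can be computed with standard scikit-learn functions. The sketch below only illustrates the evaluation protocol described above (per-set scores averaged within a run, then mean and standard deviation over 5 runs); the helper name score_sets and the label-array format are our assumptions, not the thesis code.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def score_sets(true_labels, pred_labels):
    """Average ARI and NMI over all sets of one evaluation run.

    true_labels / pred_labels: lists with one 1-D label array per set.
    """
    ari = np.mean([adjusted_rand_score(t, p) for t, p in zip(true_labels, pred_labels)])
    nmi = np.mean([normalized_mutual_info_score(t, p) for t, p in zip(true_labels, pred_labels)])
    return float(ari), float(nmi)

# Both metrics are invariant to how clusters are numbered, e.g. this prints (1.0, 1.0):
print(score_sets([[0, 0, 1, 1]], [[1, 1, 0, 0]]))

# Table 7.8 style aggregation (mean ± std over 5 independent runs):
# scores = np.array([score_sets(t, p) for t, p in runs])  # runs: list of 5 (true, pred) pairs
# print(scores.mean(axis=0), scores.std(axis=0))
```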


We showed that DAC is the only feasible model for problems with a large number of elements per set, owing to the time consumed by training and clustering; on datasets with smaller sets, however, the MIL models (on the Newspaper dataset) and the ABC models (on the synthetic datasets) performed better.


...

Figure 7.7: Graphical comparison of results from Table 7.8 (box plots of ARI for DAC, ABCADD, ABCMULTI, MILMEAN, MILPMA, PEMMAX and PEMMEAN).

Chapter 8

Conclusion

In this thesis, we focused on the supervised clustering problem and on a comparison of models that take different approaches to representing the input and output when solving the problem with neural networks. We explored the current literature and chose two state-of-the-art architectures, DAC and ABC, both relying on the self-attention mechanism as a fundamental building block. Another two models, MIL and PEM, were proposed to explore the clustering ability of simpler architectures. The basic ideas and methods behind the examined models were organized and described in Chapter 2, followed by a detailed description of the models in Chapters 3 and 4.
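For context, the self-attention operation both architectures rely on can be summarized in a few lines. The NumPy sketch below is a generic single-head scaled dot-product attention with random placeholder weights; it is meant only as an illustration and is not the implementation used by DAC or ABC, which embed attention inside larger learned blocks.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a set X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project elements to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise interactions between set elements
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # each output mixes information from the whole set

# Toy example: a set of 5 elements with 4-dimensional features and random projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 4)
```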

The implementations of DAC and ABC were modified to be compatible with all of the datasets used. Further modifications were needed to fix bugs and to optimize the computation of attention in the ABC model. Together with our implementation of the proposed methods, all models form a unified software project that allows us to run experiments easily, store the results, and quickly integrate new datasets on which to test the models.

In the experimentation phase, we compared the models on one real-world and two synthetic datasets. This included searching for optimal parameters for each model on each dataset and running a large number of experiments.

First, we focused on the intra-cluster dependencies of the Circles dataset. Next, the scalability of the models to a growing input size was explored using the MoG dataset. Finally, the Newspaper dataset was used as a real-world example.

8.1 Future work

All models but DAC use spectral clustering (another unsupervised clustering method could be used instead), which also forces us to use some method for estimating the number of clusters. In future work, we want to focus more on methods that infer the clusters directly and are trained end-to-end, such as DAC. As shown in the experiments on the MoG dataset, a model's independence of an unsupervised clustering method greatly improves its scalability to a growing input size.
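To make this dependency concrete, the sketch below shows one common way to pair spectral clustering with a cluster-count estimate (the eigengap heuristic). The affinity matrix A is assumed to be the pairwise similarity produced by a trained model, and the heuristic is only one possible choice, not necessarily the method used in our experiments.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import SpectralClustering

def estimate_k(A, k_max=10):
    """Eigengap heuristic: choose k at the largest gap among the smallest
    eigenvalues of the normalized graph Laplacian of the affinity matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    vals = eigh(L, eigvals_only=True)[: k_max + 1]   # eigenvalues in ascending order
    gaps = np.diff(vals)
    return int(np.argmax(gaps[1:]) + 2)              # skip the trivial first gap, so k >= 2

def cluster_with_estimated_k(A):
    """Spectral clustering on a precomputed (model-produced) affinity matrix."""
    k = estimate_k(A)
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(A)
```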

We also plan to use supervised clustering models to attempt to solve some of the problems discussed in Section 1.1, such as fast and cheap data annotation, text summarization and topic detection.


Appendix A

Mixture of Gaussian experiments - additional plots

Additional figures for Section 7.2 are displayed here.

In Figures A.2 and A.3, results are displayed using a "two-dimensional box plot". The plus sign marks the position of the medians of both measured values (time and ARI) for the particular experiment. The rectangle around the median shows the Q1 and Q3 positions (described in the introduction of Chapter 7) for the values displayed on the y-axis (ARI) and the x-axis (time), analogously to a standard box plot (see Figure 7.1). Outliers are not displayed. The number next to the plus sign denotes the number of elements nX used in that particular experiment. Models are distinguished by colour.
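A minimal matplotlib sketch of how a single such marker could be drawn is given below; the function name, colour and random sample data are illustrative assumptions, with the repeated (time, ARI) measurements of one experiment assumed to be available as arrays.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def two_dim_box(ax, times, aris, n_x, color):
    """Draw one 'two-dimensional box plot' marker: a plus sign at the medians and
    a rectangle spanning Q1..Q3 of time (x) and ARI (y); outliers are omitted."""
    tq1, tmed, tq3 = np.percentile(times, [25, 50, 75])
    aq1, amed, aq3 = np.percentile(aris, [25, 50, 75])
    ax.add_patch(Rectangle((tq1, aq1), tq3 - tq1, aq3 - aq1, fill=False, edgecolor=color))
    ax.plot(tmed, amed, "+", color=color, markersize=10)
    ax.annotate(str(n_x), (tmed, amed), textcoords="offset points", xytext=(4, 4))

# Toy usage with random measurements for a single experiment (nX = 100 elements).
fig, ax = plt.subplots()
rng = np.random.default_rng(0)
two_dim_box(ax, rng.normal(30, 5, 20), rng.normal(0.95, 0.01, 20), n_x=100, color="C0")
ax.set_xlabel("time to train [min]")
ax.set_ylabel("ARI")
plt.show()
```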


...

Figure A.1: ARI achieved on the test dataset vs. the time to train the model on the MoG dataset, with the number of clusters sampled randomly from the range [2, 6] for each set. The numbers of elements nX in each set for the corresponding measurements are displayed next to the plus signs. Mean values are displayed.

...


Figure A.2: ARI achieved on the test dataset vs. the time to train the model on the MoG dataset, with the number of clusters sampled randomly from the range [2, 6] for each set. The numbers of elements nX in each set for the corresponding measurements are displayed next to the plus signs. Median values with two-dimensional box plots are displayed.


...

Figure A.3: ARI achieved on the test dataset vs. the time to cluster 1000 sets of the MoG dataset. The numbers of elements nX in each set for the corresponding measurements are displayed next to the plus signs. Median values with two-dimensional box plots are displayed.