
4.3.1 Sentence selection

For each target document, the 15 most important words are predicted using the LTR algorithm (the fullpage-f1-rfr model is used for the following experiments). Sentences are then extracted from the body text of each source document in the context of the target. This is done using the Punkt sentence tokenizer module from NLTK, which contains a pre-trained model for English sentences. Of these sentences, only those that contain at least three important words are considered as candidates for the snippet.

Sentences with more than three verbs are discarded, and the remaining ones are ranked according to the sum of predicted tf-idf scores. The snippet is then generated using Algorithm 3.5.
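The filtering and ranking steps described above can be sketched as follows. This is a minimal illustration and not the code from the thesis repository: the function names are hypothetical, a naive regular expression stands in for the Punkt tokenizer, the verb counter is passed in as a parameter (in practice it could be backed by a part-of-speech tagger), and a sentence score is assumed to be the sum of the predicted scores of the distinct important words it contains.

```python
import heapq
import re
from typing import Callable, Dict, List


def top_words(predicted: Dict[str, float], k: int = 15) -> Dict[str, float]:
    """Keep the k words with the highest predicted score."""
    best = heapq.nlargest(k, predicted.items(), key=lambda item: item[1])
    return dict(best)


def split_sentences(text: str) -> List[str]:
    """Naive regex splitter; a stand-in for a trained tokenizer such as Punkt."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_sentences(
    text: str,
    important: Dict[str, float],
    count_verbs: Callable[[str], int],
    min_important: int = 3,
    max_verbs: int = 3,
) -> List[str]:
    """Filter candidate sentences and rank them by summed word scores."""
    ranked = []
    for sentence in split_sentences(text):
        # Distinct lowercase tokens of the sentence.
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        hits = tokens.intersection(important)
        # Keep only sentences with at least `min_important` important words.
        if len(hits) < min_important:
            continue
        # Discard sentences with more than `max_verbs` verbs.
        if count_verbs(sentence) > max_verbs:
            continue
        # Assumption: the score is summed over distinct important words.
        ranked.append((sum(important[w] for w in hits), sentence))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in ranked]
```

The returned list is ordered from the highest to the lowest score, so a snippet builder can take sentences from the front until a length limit is reached. Passing the verb counter as a callable keeps the sketch independent of any particular tagger.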

4.3.2 Sample results

A selection of sample results was made by hand to present some of the capabilities of the model. The best results appear to be achieved on target pages that represent a single entity, as can be seen in the example in Table 4.6. Wikipedia entities, items from e-commerce sites and articles focused on one topic belong to the group of target pages for which the snippet generation is successful. More examples can be found in Table A.3 in the appendix.

Table 4.6: Selected title and snippet generated by the fullpage-f1-rfr model.

URL: http://en.wikipedia.org/wiki/YLE

Titles: film, channel, eurofighter, including, funding

Generated snippet: Number 3 was added later, when the channel was allocated the third nation-wide television channel and it generally became known as “Channel Three” – Finnish Broadcasting Company’s Yle TV1 and Yle TV2 being the first two – and also to distinguishing it from the later MTV Finland. YLE, Finland’s public broadcasting station, operates five television channels and thirteen radio channels in both national languages.

CHAPTER 4. EXPERIMENTS 21

4.4 Implementation

The source code of the implementation of the proposed method, together with a framework for running the experiments, is available on GitHub [3]. The code is written mainly in Python and makes use of various libraries, of which I would like to mention gensim [25], which was used for the pre-processing tasks and corpus representation; scikit-learn [22], which contains implementations of the random forest algorithms; and NLTK [6], which contains the models used for sentence tokenization and part-of-speech tagging. I also appreciate the IPython [23] notebook as an interactive Python environment.

Chapter 5 Conclusion

In this thesis, various approaches to the problem of context-based labeling and snippet generation were surveyed. A combination of two approaches was selected and implemented. A framework for the task was proposed, and a comparison of two algorithms and various preprocessing tasks was performed, resulting in six different models. The influence of the amount of data gathered from each web page’s context was examined, and the effects of different features on the selection of the most discriminative terms were studied.

The performance of these models was then measured and, using the best model, exemplary snippets were generated. Due to the lack of labeled data that could act as a gold standard for the generated snippets, the evaluation of snippet quality will have to be done manually. Moreover, with labeled data, the use of a learning model for snippet generation would also be possible.

Context-based content generation is a complex problem and, as the topic is valued by Seznam.cz, further research on it could be done. The learning model for snippet generation was already mentioned, but other parts of the process can also be explored more deeply, e.g. semantic expansion for better coverage in title generation, improved feature engineering and more.

Appendix A

Additional tables and figures

Table A.1: Feature importances of partpage p3-f3-rfr

Feature              Importance
tf [1.1]             0.29
tf-idf [1.8]         0.16
# docs [1.9]         0.12
tf/anchors [1.7]     0.11
# docs [1.10]        0.08
tf/meta [1.5]        0.07
tf/query [1.4]       0.04
tf/url [1.3]         0.04
tf/titles [1.2]      0.03
tf/keywords [1.6]    0.02
doc tf-idf [2.2]     0.02
doc tf [2.1]         0.01
body size [3.2]      0.00
pagerank [3.1]       0.00
is hp [3.3]          0.00

Table A.2: Feature importances of partpage p3-f3-rfc

Feature              Importance
# docs [1.9]         0.20
# docs [1.10]        0.19
tf [1.1]             0.14
tf-idf [1.8]         0.13
doc tf-idf [2.2]     0.09
tf/titles [1.2]      0.05
tf/keywords [1.6]    0.04
tf/query [1.4]       0.04
doc tf [2.1]         0.04
tf/url [1.3]         0.04
tf/meta [1.5]        0.03
body size [3.2]      0.01
pagerank [3.1]       0.01
tf/anchors [1.7]     0.00
is hp [3.3]          0.00


Figure A.1: Effect of the size of the forest on the performance of the ranking model, shown for the fullpage-f1-rfr model.


Table A.3: Selected titles and snippets generated by the fullpage-f1-rfr model.

URL: http://wn.com/Nevada Senate

Titles: committee, arizona, hawaii, american, judiciary

Generated snippet: 2014 Nevada State Senate District 4 Election Interview by Asian American Group Michele Fiore (Incumbent) http://www.VoteFiore.com This is a must see for all voters!!

URL: http://therecipedaily.in/category/party-food/

Titles: sausage, biscotti, pantry, bark, breakfast

Generated snippet: Recipe: Southwestern Butternut Squash Soup — Recipes from The Kitchn. SPONSORED POST: Recipe: Crunchy Chicken Salad with Grape-Nuts and Cranberries — Recipes from The Kitchn Sponsored by Grape-Nuts

URL: http://www.masonlec.org/

Titles: institute, papers, including, infrastructure, homepage

Generated snippet: Program Description: The Global Antitrust Institute (GAI) at the Law & Economics Center at George Mason University School of Law is a leading international platform for research and education that focuses on the legal and economic analysis of key antitrust issues confronting competition agencies and courts around the world. Moderator: James C. Cooper, Director, Research and Policy, Law & Economics Center and Lecturer in Law, George Mason University School of Law

URL: http://www.dx.com/p/sewor-m113-2-men-s-pu-leather-band-self-winding-mechanical-analog-wristwatch-bla

Titles: resin, keychain, leather, women, cree

Generated snippet: Sewor M113-2 Men’s PU Leather Band Self-winding Mechanical Analog Wristwatch - Black + Silver

Bibliography

[1] Massih R Amini, Nicolas Usunier, and Patrick Gallinari. Automatic text summarization based on word-clusters and ranking algorithms. In Advances in Information Retrieval, pages 142–156. Springer, 2005.

[2] Einat Amitay and Cécile Paris. Automatically summarising web sites: is there a way around it? In Proceedings of the ninth international conference on Information and knowledge management, pages 173–179. ACM, 2000.

[3] Jonáš Amrich. Repository for the code from this thesis on GitHub. https://github.com/JonasAmrich/SyntheticContent.

[4] Giuseppe Attardi, Antonio Gulli, and Fabrizio Sebastiani. Automatic web page categorization by link and context analysis.

[5] Adam L Berger and Vibhu O Mittal. Ocelot: a system for summarizing web pages. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 144–151. ACM, 2000.

[6] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O’Reilly Media, Inc., 2009.

[7] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[8] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC press, 1984.

[9] Brian D Davison. Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 272–279. ACM, 2000.

[10] J-Y Delort, Bernadette Bouchon-Meunier, and Maria Rifqi. Enhanced web document summarization using hyperlinks. In Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, pages 208–215. ACM, 2003.

[11] John Gantz and David Reinsel. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. 2012.

[12] Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, and Gary W. Flake. Using web structure for classifying and describing web pages. In Proceedings of the 11th International Conference on World Wide Web, WWW ’02, pages 562–569, New York, NY, USA, 2002. ACM.

[13] Google. Protocol buffers. https://developers.google.com/protocol-buffers/.

[14] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 68–73. ACM, 1995.

[15] Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

[16] Annie Louis and Ani Nenkova. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300, 2013.

[17] Hans Peter Luhn. The automatic creation of literature abstracts. IBM Journal of research and development, 2(2):159–165, 1958.

[18] Kathleen R McKeown, Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Judith L Klavans, Ani Nenkova, Carl Sable, Barry Schiffman, and Sergey Sigelman. Tracking and summarizing news on a daily basis with columbia’s newsblaster. In Proceedings of the second international conference on Human Language Technology Research, pages 280–285. Morgan Kaufmann Publishers Inc., 2002.

[19] Ani Nenkova, Sameer Maskey, and Yang Liu. Automatic summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts of ACL 2011, page 3. Association for Computational Linguistics, 2011.

[20] You Ouyang, Wenjie Li, Sujian Li, and Qin Lu. Applying regression models to query-focused multi-document summarization. Information Processing & Management, 47(2):227–237, 2011.

[21] Jaehui Park, Tomohiro Fukuhara, Ikki Ohmukai, Hideaki Takeda, and Sang-goo Lee. Web content summarization using social bookmarks: A new approach for social summarization. In Proceedings of the 10th ACM workshop on Web information and data management, pages 103–110. ACM, 2008.

[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[23] Fernando Pérez and Brian E. Granger. IPython: a system for interactive scientific computing. Computing in Science and Engineering, 9(3):21–29, May 2007.

[24] Yves Petinot, Kathleen McKeown, and Kapil Thadani. Cluster-based web summarization. 2013.

[25] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

[26] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

[27] Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, and Zheng Chen. Web-page summarization using clickthrough data. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 194–201. ACM, 2005.

[28] Anastasios Tombros and Mark Sanderson. Advantages of query biased summaries in information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 2–10. ACM, 1998.

[29] Andrew Turpin, Yohannes Tsegay, David Hawking, and Hugh E Williams. Fast generation of result snippets in web search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 127–134. ACM, 2007.

Study programme: Open Informatics. Specialisation: Computer and Information Science. Supervisor: Ing. Jan Šedivý, CSc. Prague, May 2015. Jonáš Amrich. Bachelor thesis. Automatic Name and Snippet Generation of Webpages with Unknown Content. Czech Technical University in Prague, Faculty of Ele