About the project
This Spring MVC project implements a component-based web crawler and search engine using the Spring Framework, Apache HttpClient, the Jerry HTML parser, and the Lucene search engine. It also provides a web-based user interface, built with Spring MVC and Thymeleaf, that sends queries to the Lucene index and displays the results, each with a link to the original web page.
1. Class introduction (Spring MVC.zip)
.\Spring MVC\src\main\java\spring\webcrawler\
│
├─bean
│    CrawlerBean.java -> Stores page information: URI, depth, and search key
│    UrlBean.java -> Stores a search result: URI, title, and content
│
├─common
│    SpiderConstant.java -> Constants for this spider project
│
├─controller
│    SpiderController.java -> The main controller (entry point) of the project; it provides the two main functions, crawling and searching
│
└─service
    ├─crawler
    │    CrawlerByHttpClientService.java -> Crawler based on HttpClient; implements the crawler interface
    │    CrawlerByJsoupService.java -> Crawler based on Jsoup; implements the crawler interface
    │    ICrawlerService.java -> The crawler interface, which gets the page information from a URI
    │
    ├─index
    │    IIndexCreator.java -> The index creator interface
    │    IndexCreatorService.java -> Index creator based on Lucene; implements the index creator interface
    │
    ├─parser
    │    IParserService.java -> The parser interface, which parses all links, titles, and contents of a page and calls the index creator to build the index
    │    ParserByHtmlParserService.java -> Parser based on HtmlParser; implements the parser interface
    │    ParserByJerryService.java -> Parser based on Jerry; implements the parser interface
    │    ParserByJsoupService.java -> Parser based on Jsoup; implements the parser interface
    │
    ├─searcher
    │    ISearcherService.java -> The searcher interface, which searches the Lucene indexes
    │    SearcherByLuceneService.java -> Searcher based on Lucene; implements the searcher interface
    │
    └─spider
         ISpiderService.java -> The spider interface
         SpiderService.java -> The main spider logic, which calls the crawler and the parser through a main recursive function
.\Spring MVC\src\main\java\
│
└─spring.xml -> The core configuration of the whole project, which is based on Spring
.\Spring MVC\src\main\resources\
└─templates
     searcher.html -> The user interface for the web searcher
     spider.html -> The user interface for the web spider
.\Spring MVC\logs -> Stores the logs
.\Spring MVC\lucene_dat -> Stores the Lucene index data
.\Spring MVC\
└─bin -> The configuration files for logging
     application.properties
     logback.xml
     If these two files do not exist, please copy them from \Spring MVC\
2. Start-up
2.1. Run SpiderController.java as a Spring Boot App in Eclipse.
2.2. Launch a web browser (for example, Internet Explorer).
2.3. Enter [http://localhost:8080/spider] to open the spider interface.
Enter the URI and the crawl depth, then click the crawl button. This crawls the website down to the given depth and creates the index. When crawling finishes, the page switches to the searcher interface automatically.
2.4. On the searcher interface, enter the search keys and click the search button. The search runs and the results are displayed on the same page.
2.5. The searcher interface can also be opened directly at [http://localhost:8080/searcher], because the crawling process only needs to run once.
3. Core technology introduction
3.1. This project uses the Spring MVC framework and an XML file (spring.xml) for dependency injection.
3.2. It also contains the AOP (Aspect-Oriented Programming) concept.
Because the services are injected through their interfaces, an implementation can be changed by modifying only the XML file, for example, changing the crawler from Jsoup to HttpClient:
<bean id="crawler" class="spring.webcrawler.service.crawler.CrawlerByJsoupService">
->
<bean id="crawler" class="spring.webcrawler.service.crawler.CrawlerByHttpClientService">
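The effect of such a swap can be illustrated in plain Java, without Spring. This is only a minimal sketch, not the project's code: PageFetcher, JsoupStyleFetcher, HttpClientStyleFetcher, and Spider are hypothetical stand-ins for ICrawlerService, two of its implementations, and SpiderService, and the fetch methods return placeholder strings instead of making network calls.

```java
// Hypothetical stand-in for ICrawlerService: fetches the content of a URI.
interface PageFetcher {
    String fetch(String uri);
}

// Stand-in for CrawlerByJsoupService (no real Jsoup call is made here).
class JsoupStyleFetcher implements PageFetcher {
    @Override
    public String fetch(String uri) {
        return "jsoup:" + uri;
    }
}

// Stand-in for CrawlerByHttpClientService (no real HttpClient call is made here).
class HttpClientStyleFetcher implements PageFetcher {
    @Override
    public String fetch(String uri) {
        return "httpclient:" + uri;
    }
}

// Stand-in for SpiderService: it only knows the interface, so the
// implementation can be replaced from outside.
class Spider {
    private final PageFetcher fetcher;

    Spider(PageFetcher fetcher) {
        this.fetcher = fetcher;
    }

    String crawl(String uri) {
        return fetcher.fetch(uri);
    }
}

class WiringDemo {
    public static void main(String[] args) {
        // This line plays the role of the <bean> definition: changing the
        // class named here changes the implementation without touching Spider.
        Spider spider = new Spider(new JsoupStyleFetcher());
        System.out.println(spider.crawl("http://example.com"));
    }
}
```

In the real project the Spring container performs the construction step in main, reading the class name from spring.xml.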
3.3. The main calling function is recursive: it traverses the website by following the links found on each page.
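A depth-limited recursive traversal of this kind might look like the sketch below. It is an illustration under assumptions, not the code of SpiderService: the class RecursiveCrawlSketch and its methods are hypothetical, an in-memory link map replaces real HTTP requests and parsing, and a visited set is assumed so that link cycles terminate. It needs Java 9 or later for List.of and Map.of.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RecursiveCrawlSketch {
    // In-memory link graph used instead of real HTTP requests (assumption).
    private final Map<String, List<String>> links;
    private final Set<String> visited = new HashSet<>();
    private final List<String> order = new ArrayList<>();

    RecursiveCrawlSketch(Map<String, List<String>> links) {
        this.links = links;
    }

    // Visits uri, then recurses into its links while depth remains.
    void crawl(String uri, int remainingDepth) {
        if (remainingDepth < 0 || !visited.add(uri)) {
            return; // depth exhausted, or page already visited
        }
        order.add(uri); // the real project would parse and index the page here
        for (String next : links.getOrDefault(uri, List.of())) {
            crawl(next, remainingDepth - 1);
        }
    }

    List<String> visitedInOrder() {
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "d"),
                "d", List.of("e"));
        RecursiveCrawlSketch sketch = new RecursiveCrawlSketch(graph);
        sketch.crawl("a", 2);
        // "e" is three links away from "a", so it is not visited.
        System.out.println(sketch.visitedInOrder()); // prints [a, b, d, c]
    }
}
```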
3.4. During crawling, links that cannot be read, as well as repeated links, are recorded in the logs.
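The bookkeeping for skipped links could be sketched as follows. The names LinkLogSketch, Fetcher, and fetchOnce are hypothetical, and java.util.logging is used here only to keep the example free of dependencies; the project itself is configured for Logback.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

class LinkLogSketch {
    private static final Logger LOG = Logger.getLogger(LinkLogSketch.class.getName());

    // Hypothetical fetcher abstraction; the real project would call its crawler service.
    interface Fetcher {
        String fetch(String uri) throws IOException;
    }

    private final Set<String> seen = new HashSet<>();
    final List<String> repeated = new ArrayList<>();
    final List<String> unreadable = new ArrayList<>();

    // Returns the page content, or null when the link is skipped.
    String fetchOnce(String uri, Fetcher fetcher) {
        if (!seen.add(uri)) {
            repeated.add(uri);
            LOG.info("Repeated link skipped: " + uri);
            return null;
        }
        try {
            return fetcher.fetch(uri);
        } catch (IOException e) {
            unreadable.add(uri);
            LOG.warning("Link cannot be read: " + uri + " (" + e.getMessage() + ")");
            return null;
        }
    }

    public static void main(String[] args) {
        LinkLogSketch sketch = new LinkLogSketch();
        // Simulated fetcher: any URI containing "broken" fails.
        Fetcher fetcher = uri -> {
            if (uri.contains("broken")) {
                throw new IOException("simulated failure");
            }
            return "content of " + uri;
        };
        for (String uri : List.of("http://example.com/a",
                                  "http://example.com/broken",
                                  "http://example.com/a")) {
            sketch.fetchOnce(uri, fetcher);
        }
        System.out.println("repeated=" + sketch.repeated + ", unreadable=" + sketch.unreadable);
    }
}
```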
3.5. It uses Apache HttpClient, the Jerry HTML parser, and the Lucene search engine.
Alternative implementations based on HtmlParser and Jsoup are also included.
3.6. The user interface uses Thymeleaf, which is integrated with Spring MVC.
3.7. The search results use a highlighting technique: the matched terms are highlighted and a part of the page content is displayed.
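The idea behind highlighting can be shown with a small, dependency-free sketch. It is a simplified substitute, not the project's implementation, which presumably relies on Lucene's own highlighting support: HighlightSketch and its highlight method are hypothetical, only the first match is handled, and the content is not HTML-escaped.

```java
class HighlightSketch {
    // Case-insensitive search for term in text; returns -1 when absent.
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns a fragment of content around the first match of term, with the
    // match wrapped in <b> tags, or null when the term is empty or absent.
    static String highlight(String content, String term, int context) {
        if (term.isEmpty()) {
            return null;
        }
        int start = indexOfIgnoreCase(content, term);
        if (start < 0) {
            return null;
        }
        int end = start + term.length();
        int from = Math.max(0, start - context);
        int to = Math.min(content.length(), end + context);
        return (from > 0 ? "..." : "")
                + content.substring(from, start)
                + "<b>" + content.substring(start, end) + "</b>"
                + content.substring(end, to)
                + (to < content.length() ? "..." : "");
    }

    public static void main(String[] args) {
        String page = "Spring MVC is a web framework built on the Servlet API";
        System.out.println(highlight(page, "web", 10));
    }
}
```

The context parameter controls how many characters of the page are shown on each side of the match, which corresponds to displaying only a part of the content.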
The link below contains the project source code.