Web Page Crawling with Java - A Simple Example

Web page crawling, closely related to web scraping, is a powerful technique for extracting data from websites. In this tutorial, we’ll explore web page crawling using Java’s HttpClient. We’ll create a web crawler class named Crawler that supports both synchronous and asynchronous crawling. You can configure the crawling depth, extract links from pages, and store the crawled files in a directory. Additionally, we’ll include JUnit 5 tests to ensure the reliability of our web crawler. Let’s dive into the world of web page crawling with Java!

Prerequisites

If you don’t already have Maven installed, you can download it from the official Maven website https://maven.apache.org/download.cgi or install it through SDKMAN https://sdkman.io/sdks#maven

You can clone the https://github.com/dmakariev/examples repository.

git clone https://github.com/dmakariev/examples.git
cd examples/java-core/crawler

Creating a Maven Project

Let’s create our project:

  1. Open your terminal and navigate to the directory where you want to create your project.
  2. Run the following command to generate a new Maven project:
    mvn archetype:generate -DgroupId=com.makariev.examples.core -DartifactId=crawler \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DarchetypeVersion=1.4 \
    -DinteractiveMode=false 
    

    This command generates a basic Maven project structure with a sample Java class, using the group ID and artifact ID passed on the command line; the resulting layout is shown below.
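For reference, the generated project should look roughly like this (the nested package directories are collapsed to com/makariev/examples/core for brevity):

crawler
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── com/makariev/examples/core
    │           └── App.java
    └── test
        └── java
            └── com/makariev/examples/core
                └── AppTest.java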

Deleting Initial Files and Updating Dependencies

To clean up the initial files generated by the Maven archetype and update dependencies, follow these steps:

  1. Delete the src/main/java/com/makariev/examples/core/App.java file.
  2. Delete the src/test/java/com/makariev/examples/core/AppTest.java file.
  3. Open the pom.xml file and delete the JUnit 4 dependency (junit:junit).
  4. Add the JUnit 5 and AssertJ dependencies to the pom.xml file:
<dependencies>
    <!-- JUnit 5 -->
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter-api</artifactId>
        <version>5.10.0</version> <!-- Use the latest version -->
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter-engine</artifactId>
        <version>5.10.0</version> <!-- Use the latest version -->
        <scope>test</scope>
    </dependency>
    <!-- AssertJ -->
    <dependency>
        <groupId>org.assertj</groupId>
        <artifactId>assertj-core</artifactId>
        <version>3.24.2</version> <!-- Use the latest version -->
        <scope>test</scope>
    </dependency>
</dependencies>
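A practical note on running JUnit 5 with Maven: the tests are executed by the Maven Surefire plugin, which must be at least version 2.22.0 to pick up JUnit 5. The quickstart archetype used above should already manage a compatible version in its pluginManagement section, but if your tests are ever silently skipped, you can declare the plugin explicitly in the build section (the version below is only an example of a recent release):

<build>
    <plugins>
        <!-- Executes the JUnit 5 tests; any version >= 2.22.0 works -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>3.1.2</version>
        </plugin>
    </plugins>
</build>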

Creating the Crawler Class

Now, let’s create the Crawler class, which will be responsible for crawling web pages synchronously and asynchronously. The class will allow configuring the crawling depth, extracting links, and saving files to a directory.

package com.makariev.examples.core;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;

public class Crawler {

    private static final HttpClient httpClient = HttpClient.newHttpClient();
    private static final Path CRAWL_DIR = Path.of("crawled-pages");
    private final LinkExtractor linkExtractor;

    public Crawler(LinkExtractor linkExtractor) {
        this.linkExtractor = linkExtractor;
    }

    public void crawlSynchronously(URI startUrl, int depth) {
        crawlRecursivelySynchronously(startUrl, depth);
    }

    public void crawlAsynchronously(URI startUrl, int depth) {
        crawlRecursivelyAsynchronously(startUrl, depth);
    }

    private void crawlRecursivelySynchronously(URI url, int depth) {
        if (depth <= 0) {
            return;
        }

        try {
            final HttpRequest request = HttpRequest.newBuilder()
                    .uri(url)
                    .build();
            final HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());

            if (response.statusCode() == 200) {
                final List<URI> links = extractLinks(response.body());
                savePageContent(url, response.body());

                for (URI link : links) {
                    crawlRecursivelySynchronously(link, depth - 1);
                }
            }
        } catch (IOException | InterruptedException | IllegalArgumentException e) {
            e.printStackTrace();
        }
    }

    private void crawlRecursivelyAsynchronously(URI url, int depth) {
        if (depth <= 0) {
            return;
        }

        final HttpRequest request = HttpRequest.newBuilder()
                .uri(url)
                .build();

        final CompletableFuture<HttpResponse<String>> responseFuture
                = httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString());

        responseFuture.thenAccept(response -> {
            if (response.statusCode() == 200) {
                final List<URI> links = extractLinks(response.body());
                savePageContent(url, response.body());

                for (URI link : links) {
                    crawlRecursivelyAsynchronously(link, depth - 1);
                }
            }
        });
    }

    private List<URI> extractLinks(String pageContent) {
        return linkExtractor.extractLinks(pageContent)
                .stream()
                .map(link -> {
                    try {
                        return URI.create(link);
                    } catch (Exception e) {
                        // ignore links that cannot be parsed as URIs
                    }
                    return null;
                })
                .filter(Objects::nonNull)
                .toList();
    }

    private void savePageContent(URI url, String content) {
        // Build a file name from the host and path; a trailing "/" becomes "index.html"
        final String fileName = (url.getHost() + url.getPath() + ".html").replace("/.html", "/index.html");
        final Path filePath = CRAWL_DIR.resolve(fileName);
        try {
            Files.createDirectories(filePath.getParent());
            Files.write(filePath, content.getBytes());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
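Before moving on, here is a minimal, hypothetical usage sketch showing how the class is meant to be wired together. The CrawlerDemo class and the depth value are illustrative only and are not part of the project; PlainLinkExtractor is implemented a bit further down in this tutorial.

package com.makariev.examples.core;

import java.net.URI;

// Hypothetical demo class for trying out the crawler by hand (not generated by the archetype).
public class CrawlerDemo {

    public static void main(String[] args) {
        // PlainLinkExtractor is defined later in this tutorial; JsoupLinkExtractor works the same way.
        final Crawler crawler = new Crawler(new PlainLinkExtractor());

        // Crawl https://example.com and the pages it links to (depth 2),
        // saving the downloaded pages under the crawled-pages/ directory.
        crawler.crawlSynchronously(URI.create("https://example.com"), 2);
    }
}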

Creating the LinkExtractor Interface

Implementing the link extraction logic is an essential part of web page crawling.

package com.makariev.examples.core;

import java.util.List;

public interface LinkExtractor {

    public List<String> extractLinks(String htmlContent);

}

Creating the PlainLinkExtractor Class

Below is a basic example of how you can extract links from HTML content using regular expressions in Java. Please note that this example is for educational purposes and may not cover all possible cases.

package com.makariev.examples.core;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlainLinkExtractor implements LinkExtractor {

    @Override
    public List<String> extractLinks(String htmlContent) {
        final List<String> links = new ArrayList<>();

        // Regular expression to find HTML anchor tags
        final String regex = "<a\\s+href\\s*=\\s*\"([^\"]+)\"[^>]*>";

        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(htmlContent);

        while (matcher.find()) {
            final String link = matcher.group(1);
            links.add(link);
        }

        return links;
    }

}

Please note that web page structures can be complex, and HTML parsing libraries like Jsoup are often used for more robust link extraction. This example provides a simple starting point; for real-world applications, consider using a dedicated HTML parsing library.

Creating the JsoupLinkExtractor Class

Using Jsoup, a popular Java library for parsing HTML, you can simplify the link extraction logic.

Add the jsoup dependency to the project’s pom.xml:

<!-- jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version> <!-- Use the latest version -->
</dependency>

Here’s how you can implement the LinkExtractor interface using Jsoup:

package com.makariev.examples.core;

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLinkExtractor implements LinkExtractor {

    @Override
    public List<String> extractLinks(String htmlContent) {
        final List<String> links = new ArrayList<>();

        final Document document = Jsoup.parse(htmlContent);

        final Elements anchorTags = document.select("a[href]");

        for (Element anchorTag : anchorTags) {
            final String link = anchorTag.attr("href");
            links.add(link);
        }

        return links;
    }

}

Creating Unit Tests

JUnit 5 test for the Crawler class

Now, let’s create a JUnit 5 test class called CrawlerTest.java in the src/test/java/com/makariev/examples/core directory:

package com.makariev.examples.core;

import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.jupiter.api.Test;

import static org.assertj.core.api.Assertions.assertThat;

public class CrawlerTest {

    private static final Path CRAWL_DIR = Path.of("crawled-pages");

    @Test
    void testCrawlWebPagesSynchronouslyPlain() throws IOException {
        final Crawler crawler = new Crawler(new PlainLinkExtractor());
        crawler.crawlSynchronously(URI.create("https://example.com"), 2);

        // Verify that crawled pages are saved in the directory
        assertThat(Files.list(CRAWL_DIR).count()).isGreaterThanOrEqualTo(1);
    }

    @Test
    void testCrawlWebPagesAsynchronouslyPlain() throws IOException {
        final Crawler crawler = new Crawler(new PlainLinkExtractor());
        crawler.crawlAsynchronously(URI.create("https://example.com"), 2);

        // Verify that crawled pages are saved in the directory
        assertThat(Files.list(CRAWL_DIR).count()).isGreaterThanOrEqualTo(1);
    }

    @Test
    void testCrawlWebPagesSynchronouslyJSoup() throws IOException {
        final Crawler crawler = new Crawler(new JsoupLinkExtractor());
        crawler.crawlSynchronously(URI.create("https://example.com"), 2);

        // Verify that crawled pages are saved in the directory
        assertThat(Files.list(CRAWL_DIR).count()).isGreaterThanOrEqualTo(1);
    }

    @Test
    void testCrawlWebPagesAsynchronouslyJSoup() throws IOException {
        final Crawler crawler = new Crawler(new JsoupLinkExtractor());
        crawler.crawlAsynchronously(URI.create("https://example.com"), 2);

        // Verify that crawled pages are saved in the directory
        assertThat(Files.list(CRAWL_DIR).count()).isGreaterThanOrEqualTo(1);
    }
}

JUnit 5 test for the PlainLinkExtractor class

Here’s how you can test the extractLinks method:

package com.makariev.examples.core;

import org.junit.jupiter.api.Test;

import java.util.List;

import static org.assertj.core.api.Assertions.assertThat;

public class PlainLinkExtractorTest {

    @Test
    void testExtractLinks() {
        final String htmlContent = """
            <a href=\"https://example.com\">Example</a> 
            <a href=\"https://example.org\">Another Example</a>
        """;

        final List<String> extractedLinks = new PlainLinkExtractor().extractLinks(htmlContent);

        // Verify that two links are extracted
        assertThat(extractedLinks).hasSize(2);

        // Verify the extracted links
        assertThat(extractedLinks).containsExactly("https://example.com", "https://example.org");
    }

    @Test
    void testExtractLinksNoLinks() {
        final String htmlContent = "This is a sample text without any links.";

        final List<String> extractedLinks = new PlainLinkExtractor().extractLinks(htmlContent);

        // Verify that no links are extracted
        assertThat(extractedLinks).isEmpty();
    }
}

JUnit 5 test for the JsoupLinkExtractor class

Here’s how you can test the extractLinks method:

package com.makariev.examples.core;

import org.junit.jupiter.api.Test;

import java.util.List;

import static org.assertj.core.api.Assertions.assertThat;

public class JsoupLinkExtractorTest {

    @Test
    void testExtractLinks() {
        final String htmlContent = """
            <a href=\"https://example.com\">Example</a> 
            <a href=\"https://example.org\">Another Example</a>
        """;

        final List<String> extractedLinks = new JsoupLinkExtractor().extractLinks(htmlContent);

        // Verify that two links are extracted
        assertThat(extractedLinks).hasSize(2);

        // Verify the extracted links
        assertThat(extractedLinks).containsExactly("https://example.com", "https://example.org");
    }

    @Test
    void testExtractLinksNoLinks() {
        final String htmlContent = "This is a sample text without any links.";

        final List<String> extractedLinks = new JsoupLinkExtractor().extractLinks(htmlContent);

        // Verify that no links are extracted
        assertThat(extractedLinks).isEmpty();
    }
}

Running the Test

To run the test, execute the following command in the project’s root directory:

mvn test

Maven will execute the tests with JUnit 5, and the AssertJ assertions will report whether each test passed or failed.
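If you want to run only a single test class, Surefire’s -Dtest parameter lets you select it by name, for example:

mvn test -Dtest=CrawlerTest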

Conclusion

In this tutorial, we explored web page crawling with Java’s HttpClient. We created a versatile Crawler class that can crawl web pages synchronously and asynchronously, allowing you to configure the crawling depth, extract links, and store crawled files in a directory. Web page crawling is a valuable skill for various applications.


Happy coding!
