[java, selenium] web crawling

2022. 4. 26. 14:27


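I recently spent quite a while working on web crawling.
The crawl covers a wide range of sites, so the work dragged on. Now that things are settling down, I want to write up what I learned.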

For reference, I run ubuntu + tomcat + selenium inside docker.
If you need details on the docker environment, see my earlier posts.
* Earlier posts on docker
1. Running tomcat in a docker container : https://deonggi.tistory.com/159
2. Configuring the docker container environment : https://deonggi.tistory.com/160

* Environment
java 1.8
spring boot 2.4.3
Host OS : ubuntu 20.04.4 LTS
Inside docker : ubuntu focal, tomcat 8.5.77, selenium-java 3.141.59

1. I decided to use selenium.
At first I planned to use Jsoup.
A basic test worked, but as I searched further I kept seeing selenium mentioned,
and since it was said to suit dynamically rendered sites, I switched.

Reference: "[크롤링] Selenium을 이용한 JAVA 크롤러 (2) - Jsoup과 비교 (With. Twitter)" (heodolf.tistory.com)

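2. Choose a browser.
Selenium scrapes what the browser renders, so the server needs both a browser
and the matching webdriver installed.
Naturally I picked Chrome. It is widely used in Korea and I am familiar with it, so it seemed the obvious choice.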

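But I had to change that decision when I deployed to the server, because Chrome could not be installed there.
So if you do not want to be caught off guard like I was,
first check whether the browser you plan to use can actually be installed on your server.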

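My server runs ubuntu. (My local environment is windows 10. I am Korean after all, ha.)
If you want to use Chrome, check whether your server is 32-bit or 64-bit, and which CPU architecture it uses.
On 64-bit x86 (amd64) it should install. (I have read that it can be installed on 32-bit too, but it did not work for me, so I cannot say...)
In my case the architecture is aarch64.
When I searched for how to install Chrome on ubuntu aarch64, all I found was advice to install Chromium instead.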
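
The architecture check can also be done from Java itself. The following is a minimal sketch, not part of the original crawler: `ArchCheck` and `classify` are hypothetical names, and the list of architecture strings is an assumption covering common values only. Note that `os.arch` reports the architecture of the running JVM, which can differ from the OS when a 32-bit JVM runs on a 64-bit system.

```java
import java.util.Locale;

class ArchCheck {

    // Rough classification of the JVM-reported architecture string.
    // The names handled here are an assumption, not an exhaustive list.
    static String classify(String osArch) {
        String arch = osArch.toLowerCase(Locale.ROOT);
        if (arch.equals("aarch64") || arch.equals("arm64")) {
            return "arm64";
        }
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            return "x86_64";
        }
        if (arch.equals("x86") || arch.equals("i386") || arch.equals("i686")) {
            return "x86_32";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String osArch = System.getProperty("os.arch");
        System.out.println("os.name = " + System.getProperty("os.name"));
        System.out.println("os.arch = " + osArch + " -> " + classify(osArch));
    }
}
```

If this reports arm64, it is worth checking up front that both the browser and its webdriver are available for that architecture.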

Reference: "Install Chrome on ubuntu/debian with arm64" (askubuntu.com)

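(Unable to let go, I tried many ways to install Chrome anyway, and none of them worked.
If you are on aarch64, save yourself the trouble and switch...)
Since I could not judge how the unfamiliar Chromium would behave, I decided to use the firefox that was already installed on the ubuntu server.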

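3. Now let's write some code.
Having settled on firefox, you of course need the firefox browser installed.

1) Then download the webdriver that matches your browser version from https://github.com/mozilla/geckodriver/releases .
The firefox on my PC is version 98.0.2, and the webdriver I downloaded is version 0.30.0.

2) Unzip the downloaded webdriver and put it somewhere convenient on your PC.

3) The crawling source is shown below.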


import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;
 
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.TimeoutException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
 
public class Selenium {
 
    public static void main(String[] args) {
        try {
            String URL = "https://www.tistory.com/category/life";
            runSelenium(URL);
        } catch ( Exception e ) {
            e.printStackTrace();
        }
    }
    
    private static void runSelenium(String URL) throws Exception {
        System.out.println("#### START ####");
        
        // 1. Set the WebDriver path
        Path path = Paths.get("C:\\java/driver/geckodriver.exe");
        System.setProperty("webdriver.gecko.driver", path.toString());
        
        // 2. Set WebDriver options (apart from --headless these are Chromium switches and probably have no effect on Firefox)
        FirefoxOptions options = new FirefoxOptions();
        options.addArguments("--start-maximized");          // maximize the window
        options.addArguments("--headless");                 // run without showing a browser window
        options.addArguments("--disable-gpu");              // do not use the GPU; often recommended for headless Chrome on Linux
        options.addArguments("--no-sandbox");               // do not use the sandbox process; often recommended for headless Chrome on Linux
        options.addArguments("--disable-popup-blocking");    // ignore popups
        options.addArguments("--disable-default-apps");     // do not use default apps
        
        // 3. Create the WebDriver object
        WebDriver driver = new FirefoxDriver( options );
                
        try {
            
            // 4. Request the web page
            driver.get(URL);
            
            // 5. Implicit wait: element lookups wait up to 5 seconds
            driver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
            
            // 6. Look up the element, waiting up to 10 seconds for it to load
            WebDriverWait wait = new WebDriverWait(driver, 10);
            String byFunKey = "CSSSELECTOR";
            String selectString = "div#mArticle";
//            String byFunKey = "XPATH";
//            String selectString = "//*[@id=\"mArticle\"]/div[2]/ul/li[3]/a";
            WebElement parent = wait.until(ExpectedConditions.presenceOfElementLocated( 
                    byFunKey.equals("XPATH") ? By.xpath(selectString) : By.cssSelector(selectString) ));
//            System.out.println("#### innerHTML : \n" + parent.getAttribute("innerHTML"));
            
            // 7. Read the contents
            List<WebElement> bestContests = parent.findElements(By.cssSelector("div.section_best > ul > li > a"));
            System.out.println( "best 콘텐츠 수 : " + bestContests.size() );
            if (bestContests.size() > 0) {
                for (WebElement best : bestContests) {
                    String title = best.findElement(By.cssSelector("div.wrap_cont > strong > span")).getText();
                    String name = best.findElement(By.cssSelector("div.info_g > span.txt_id")).getText();
                    System.out.println("Best title / blog name : " + title + " / " + name);
                    System.out.println("Best url : " + best.getAttribute("href"));
                    System.out.println();
                }
            }
            
            System.out.println("########################################");
            
            List<WebElement> contents = parent.findElements(By.cssSelector("div.section_list > ul > li > a"));
            System.out.println( "조회된 콘텐츠 수 : " + contents.size() );
            if( contents.size() > 0 ) {
                for(WebElement content : contents ) {
                    String title = content.findElement(By.cssSelector("div.wrap_cont > strong > span")).getText();
                    String name = content.findElement(By.cssSelector("div.info_g > span.txt_id")).getText();
                    System.out.println("콘텐츠 title / blog name : " + title + " / " + name);
                    System.out.println("콘텐츠 url : " + content.getAttribute("href"));
                    System.out.println();
                }
            }
            
        } catch ( TimeoutException e ) {
            e.printStackTrace();
            System.out.println(e.toString());            
        } catch ( Exception e ) {
            e.printStackTrace();
            System.out.println(e.toString());         
        }
        
        // 8. Shut down the WebDriver
        driver.quit();
        
        System.out.println("#### END ####");
    }
}
 

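Note that in the source above a TimeoutException can occur at line 49 (`driver.get(URL)`) and line 60 (`wait.until(...)`).

(1) Put the path of the webdriver file you placed in step 2) on line 31 (`Paths.get(...)`), and on line 32 (`System.setProperty(...)`) give the property key for that webdriver (`webdriver.gecko.driver` for firefox).

(2) Line 60 uses a ternary operator. Choose the branch according to how you want to select the tag (CSS selector or XPath).
There are other ways to select, such as by tag name or class name, but these two are more general-purpose.
So I pick whichever is easier for me to understand and use in each situation.

(3) The driver.quit() on line 100 is a must. If you leave it out, browser and driver processes stay alive after every crawl, and the server can eventually go down.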
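
One way to make sure quit() always runs, even when an exception or error escapes the crawl, is try/finally. The following is a minimal sketch under assumptions: `QuitAlways` and `withResource` are hypothetical names, and a plain string stands in for the WebDriver so that the example runs without Selenium.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

class QuitAlways {

    // Opens a resource, runs the body, and always runs the closer,
    // even when the body throws.
    static <R, T> T withResource(Supplier<R> opener, Consumer<R> closer, Function<R, T> body) {
        R resource = opener.get();
        try {
            return body.apply(resource);
        } finally {
            closer.accept(resource);
        }
    }

    public static void main(String[] args) {
        // A string stands in for the WebDriver (assumption for illustration).
        StringBuilder log = new StringBuilder();
        try {
            withResource(
                    () -> "fake-driver",
                    d -> log.append("quit;"),
                    d -> { throw new IllegalStateException("crawl failed"); });
        } catch (IllegalStateException e) {
            log.append("caught;");
        }
        System.out.println(log); // prints "quit;caught;"
    }
}
```

With Selenium, the same shape would look roughly like `withResource(() -> new FirefoxDriver(options), WebDriver::quit, driver -> { driver.get(URL); return driver.getTitle(); })`. Moving `driver.quit()` into a `finally` block of the existing method achieves the same thing.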

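If you have trouble finding the value for selectString, you can get it as follows.
① Open the developer tools in the Chrome browser. (F12)
② In Elements, click the cursor icon (shortcut Ctrl+Shift+C) and pick the tag you want on the page.
③ Right-click the selected tag in Elements -> Copy -> choose the item you want (for example Copy selector or Copy XPath).

(3) What you want from crawling is presumably not the tag itself.
A tag carries several kinds of information, so decide whether you want its text, its link, and so on, and extract that.
(See lines 69 and 72 of the source: `getText()` and `getAttribute("href")`.)

(4) To sum up, the overall structure of the code is as follows.
① Choose the driver and options, then load the page at the URL.
② First grab the large area of the page that contains the crawl targets. (line 60)
③ Crawl the detailed contents. (lines 65 to 89)

(4) Execution result (crawler and browser log output removed; the labels are printed in Korean, where "콘텐츠 수" means "number of contents" and "조회된" means "retrieved")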


#### START ####
best 콘텐츠 수 : 3
Best title / blog name : 뭐 어쩌라고!? 감자가 감자한 오늘. / 만두감자 랜선집사 in 일본
Best url : https://dumplingj.tistory.com/1048
 
Best title / blog name : 순대볶음 레시피, 들깻가루 이젠 더 이상 필요없어요 / 미자하우스
Best url : https://mija0515.tistory.com/52
 
Best title / blog name : 강아지 중성화 수술 시기, 장단점 모두 알아보고 수술여부 결정하자! / 곤냥마마
Best url : https://gonnyangmama.tistory.com/829
 
########################################
조회된 콘텐츠 수 : 9
콘텐츠 title / blog name : 뿔논병아리 육아 일기 / 행복한 사진 이야기
콘텐츠 url : https://dslrclub.tistory.com/254
 
콘텐츠 title / blog name : 2022년 기본형 공익직불금 신청하기 / 연서의여행
콘텐츠 url : https://lover6126.tistory.com/46
 
콘텐츠 title / blog name : [집밥] 초간단 해장용 된장라면밥 만들기 / 밥집(Bapzip)
콘텐츠 url : https://babzip.tistory.com/1571
 
콘텐츠 title / blog name : [걷기]두루누비 '2022 코리아둘레길 원정대' 및 걷기여행작가 모집해요. / 라온PD와 함께 디지털 노마드
콘텐츠 url : https://raonpd.tistory.com/56
 
콘텐츠 title / blog name : 송도 해수욕장 / 제이제로 다정한 일상
콘텐츠 url : https://wpdlwpfh.tistory.com/330
 
콘텐츠 title / blog name : 떼아와 함께한 커피 타임과 오페라 케이크 / 밀리멜리
콘텐츠 url : https://milymely.tistory.com/570
 
콘텐츠 title / blog name : 딸기쏙우유 찹쌀떡 리뷰 / 현토리의 일상
콘텐츠 url : https://hyuntori83.tistory.com/29
 
콘텐츠 title / blog name : 고창 / stealthily
콘텐츠 url : https://waterpolo.tistory.com/43
 
콘텐츠 title / blog name : 앱테크 추천 개이득과 유사한 출석체크형 앱테크 앱 모음 & 총정리 (1인 개발자 mrtest) / 생활의꿀팁tv
콘텐츠 url : https://livehoneytv.tistory.com/121
#### END ####

(5) Appendix - scroll down (revised)


public class Selenium {
 
    /* omitted */
    
    private static final Integer WAIT_SEC = 1;
    private static final Integer SCROLL_COUNT = 2;
    
    private static void scrollDown(WebDriver driver) throws InterruptedException {
//        driver.manage().timeouts().implicitlyWait(WAIT_SEC, TimeUnit.SECONDS);
        Thread.sleep(WAIT_SEC*1000);
        if (SCROLL_COUNT > 0) {            
            for (int i=0; i<SCROLL_COUNT; i++) {
                JavascriptExecutor js = ((JavascriptExecutor)driver);
                Long beforeHeight = (Long) js.executeScript("return document.body.scrollHeight;");
                System.out.println("page height (before) : " + beforeHeight);
                // Repeat the scroll until the height after scrolling exceeds the height before (loops forever if the page never grows)
                Long afterHeight = 0L;
                while(beforeHeight >= afterHeight) {
//                    driver.manage().timeouts().implicitlyWait(WAIT_SEC, TimeUnit.SECONDS);
                    Thread.sleep(WAIT_SEC*1000);
                    js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
                    afterHeight = (Long)js.executeScript("return document.body.scrollHeight;");
                    System.out.println("page height (after) : " + afterHeight);
                }
            }
        }
    }
}
 

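This is useful when you need to crawl items that load only after scrolling down. (Set lines 5 and 6, WAIT_SEC and SCROLL_COUNT, to suit your situation.)
Add a call to this function right after line 52 of the first crawling source (the `implicitlyWait(...)` call).

(1) Lines 13 and 14 run JavaScript to get the page height.
(2) Lines 18 to 24 compare the height before and after scrolling. When the height after is larger, the scroll is considered successful and the outer loop moves on to the next scroll count.

(3) Addendum - the implicitlyWait on lines 9 and 19 only sets how long element lookups may wait and does not pause execution, so it was not suitable here. I replaced it with Thread.sleep.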
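
One caveat with the loop above: if the page has reached its end and no longer grows, the while condition never becomes false. The following is a minimal sketch of a bounded variant, under assumptions: `BoundedScroll`, `scrollOnce` and `maxAttempts` are hypothetical names, and the page height and scroll action are passed in as functions so that the example runs without Selenium.

```java
import java.util.function.LongSupplier;

class BoundedScroll {

    // Scrolls until the page height grows or maxAttempts is used up.
    // Returns true if the height increased, false otherwise.
    static boolean scrollOnce(LongSupplier pageHeight, Runnable scrollToBottom,
                              int maxAttempts, long waitMillis) {
        long before = pageHeight.getAsLong();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            scrollToBottom.run();
            try {
                Thread.sleep(waitMillis); // give lazy-loaded content time to appear
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            if (pageHeight.getAsLong() > before) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Fake page: grows by 500px on the first two scrolls, then stops growing.
        long[] height = {1000};
        int[] scrolls = {0};
        Runnable scroll = () -> { if (scrolls[0]++ < 2) height[0] += 500; };
        for (int i = 0; i < 4; i++) {
            boolean grew = scrollOnce(() -> height[0], scroll, 3, 10);
            System.out.println("round " + i + ": grew=" + grew + ", height=" + height[0]);
        }
    }
}
```

With Selenium, the two functions could be wired to a `JavascriptExecutor js` as `() -> ((Number) js.executeScript("return document.body.scrollHeight;")).longValue()` and `() -> js.executeScript("window.scrollTo(0, document.body.scrollHeight);")`. When `scrollOnce` returns false, the caller can stop scrolling instead of waiting forever.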

* Articles I referred to
- [Java] 크롤링 crawling, 셀레니움 Selenium (heekng.tistory.com)
- [크롤링] Selenium을 이용한 JAVA 크롤러 (1) - HTML 파싱 (heodolf.tistory.com)
- [Python] Selenium 웹페이지 스크롤하기 scrollTo, Scroll down (hello-bryan.tistory.com)

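4. Errors when deploying the crawler to the server
The first time I deployed, I naively assumed that, just as on my local windows machine, I could download the webdriver file from the site, drop it wherever I liked,
and point the code at that file. This misconception was one of the main causes of my suffering.

So, as on windows, I downloaded the webdriver file and put it where I wanted.
The server started fine, but crawling produced the errors below.
* Note : /drivers/firefox/geckodriver is where I placed the webdriver file

1) With gradle selenium-java version 4.1.2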


/drivers/firefox/geckodriver: 1: ELF: not found
/drivers/firefox/geckodriver: 2: Syntax error: ")" unexpected
20220317 17:47:13.083 [http-nio-8091-exec-6] ERROR c.d.b.c.a.ExceptionControllerAdvice - 
org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.NoClassDefFoundError: org/openqa/selenium/internal/Require
.
.
.

2) With gradle selenium-java version 3.141.59


/drivers/firefox/geckodriver: 1: ELF: not found
/drivers/firefox/geckodriver: 2: Syntax error: ")" unexpected
20220317 17:59:14.320 [http-nio-8091-exec-6] ERROR c.d.b.c.a.ExceptionControllerAdvice - 
org.openqa.selenium.WebDriverException: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:3165
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:17:03'
System info: host: 'server02', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'aarch64', os.version: '5.13.0-1021-oracle', java.version: '11.0.14'
Driver info: driver.version: FirefoxDriver
.
.
.

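Googling only turned up advice like "downgrade the version" or "match the version to the browser"...
In my case the fundamental problem came before any version question: I had not installed the firefox webdriver on ubuntu.
(The `ELF: not found` and `Syntax error` lines mean the shell fell back to reading the binary as a script, which typically happens when a binary was built for a different CPU architecture than the server's, here aarch64.)
By sheer luck, when I was about to give up, I found the answer: install the webdriver as a package and point the code at its location.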


# install the firefox webdriver
apt-get install firefox-geckodriver
 
# location of the installed webdriver : /usr/bin/geckodriver


* Source : https://ubuntu.pkgs.org/20.04/ubuntu-updates-universe-arm64/firefox-geckodriver_99.0+build2-0ubuntu0.20.04.2_arm64.deb.html
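
One plausible reading of the `ELF: not found` errors is that the manually downloaded geckodriver was built for a different CPU architecture than the aarch64 server. The post does not confirm this, so treat it as an assumption. The following minimal sketch reads the ELF header of a binary and reports its target architecture. `ElfArch` and `describe` are hypothetical names, and only a few common machine types are recognized.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ElfArch {

    // Describes the target architecture from the leading bytes of a file.
    static String describe(byte[] header) {
        if (header.length < 20
                || header[0] != 0x7F || header[1] != 'E' || header[2] != 'L' || header[3] != 'F') {
            return "not an ELF binary";
        }
        boolean littleEndian = header[5] == 1;           // EI_DATA: 1 = little endian, 2 = big endian
        int lo = header[littleEndian ? 18 : 19] & 0xFF;  // e_machine is a 16-bit field at offset 18
        int hi = header[littleEndian ? 19 : 18] & 0xFF;
        int machine = (hi << 8) | lo;
        switch (machine) {
            case 0x03: return "x86 (32-bit)";
            case 0x28: return "ARM (32-bit)";
            case 0x3E: return "x86_64";
            case 0xB7: return "aarch64";
            default:   return "other ELF machine type " + machine;
        }
    }

    public static void main(String[] args) throws IOException {
        // Default path is the one used in the post; pass another path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "/drivers/firefox/geckodriver");
        if (!Files.isRegularFile(file)) {
            System.out.println(file + " : file not found");
            return;
        }
        // Reads the whole file for simplicity; only the first 20 bytes are inspected.
        System.out.println(file + " : " + describe(Files.readAllBytes(file)));
        System.out.println("JVM os.arch : " + System.getProperty("os.arch"));
    }
}
```

If the binary's architecture and `os.arch` disagree, the driver cannot start on that server no matter which Selenium version is used.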

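After deploying again, the error from 1) above, nested exception is java.lang.NoClassDefFoundError: org/openqa/selenium/internal/Require, appeared again (this time without the errors on lines 1 and 2 of the log),
and, using what I had learned from all that searching, I switched gradle selenium-java to version 3.141.59, after which the crawl ran.
(A likely explanation for that NoClassDefFoundError is mixed Selenium versions on the classpath: Spring Boot's dependency management probably kept some Selenium modules at 3.141.59 while selenium-java itself was 4.1.2.)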
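
Since the driver lives in different places on the local windows PC and on the ubuntu server, the path can be chosen at runtime instead of being hard-coded. The following is a minimal sketch under assumptions: `DriverPath` and `geckoDriverPath` are hypothetical names, and the two paths are the ones mentioned in this post.

```java
import java.util.Locale;

class DriverPath {

    // Picks the geckodriver location from the OS name.
    // Paths are the ones used in this post; adjust them to your own setup.
    static String geckoDriverPath(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "C:\\java\\driver\\geckodriver.exe" : "/usr/bin/geckodriver";
    }

    public static void main(String[] args) {
        String path = geckoDriverPath(System.getProperty("os.name"));
        // Selenium reads this system property when a FirefoxDriver is created.
        System.setProperty("webdriver.gecko.driver", path);
        System.out.println("webdriver.gecko.driver = " + path);
    }
}
```

In the crawler, the result would replace the hard-coded `Paths.get(...)` value passed to `System.setProperty("webdriver.gecko.driver", ...)`.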

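That is everything I went through while using selenium.
I am no expert either. This is pieced together from bits found here and there.
I hope those who come after me are spared the same wasted effort... '') (gazing into the distance)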
