[Python] 27. [Scraping] Web Scraping Basics, Handling Korean Text, Installing BeautifulSoup, Basic Tree Navigation, Using Regular Expressions

밍글링글링 2017. 8. 5.

[01] Web Scraper

1. Reading a web page's source

- http://www.pythonscraping.com/exercises/exercise1.html

[Output]

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h....

▷ basic.basicExample.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

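# Import the urlopen() function from the request module of the urllib package
from urllib.request import urlopen

# Retrieve the HTML from the URL.
# read() returns a bytes object, which is why the output starts with b'...';
# non-ASCII text such as Korean shows up as escape sequences until it is decoded.
html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
print(html.read())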
-------------------------------------------------------------------------------------

2. The urlopen function and handling Korean text

[Output]

utf-8

.....

▷ basic.urlopen.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from urllib.parse import quote  # percent-encodes non-ASCII characters (e.g. Korean) for use in a URL

# English-language site
# html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon").read()
# print(html)

# import sys
# print(sys.getdefaultencoding()) # utf-8

# Printing the raw response: b'...' marks a bytes object, so Korean text is not readable
# https://ko.wikipedia.org/wiki/%EC%BC%80%EB%B9%88_%EB%B2%A0%EC%9D%B4%EC%BB%A8
print('------------------------------------------------------------------')
html = urlopen("https://ko.wikipedia.org/wiki/" + quote("케빈_베이컨")).read()
print(html[:300]) # first 300 bytes only (indexes 0-299); Korean appears as \x.. escapes
print('------------------------------------------------------------------')

print('Decoded as UTF-8')
html = urlopen("https://ko.wikipedia.org/wiki/" + quote("케빈_베이컨")).read()
print(str(html, "utf-8")[:300]) # first 300 characters of the decoded string; Korean prints correctly

print('------------------------------------------------------------------')
-------------------------------------------------------------------------------------
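The two steps above can be tried without a network connection. The following is a minimal sketch using only the standard library: quote() produces the percent-encoded form seen in the commented URL, and decoding bytes as UTF-8 restores readable text. The variable names are illustrative, not part of the original script.

```python
from urllib.parse import quote, unquote

title = "케빈_베이컨"

# quote() encodes the string as UTF-8 and percent-encodes each non-ASCII byte.
encoded = quote(title)
print(encoded)

# unquote() reverses the encoding.
print(unquote(encoded))

# A response body arrives as bytes; printing it shows b'...' with \x escapes.
raw = title.encode("utf-8")
print(raw)

# Decoding with the right charset turns the bytes back into readable text.
print(raw.decode("utf-8"))
```

If a page is not served as UTF-8, the charset passed to decode() has to match the page's actual encoding.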

[02] BeautifulSoup

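- Parses an HTML (or XML) document into a tree of Python objects, which makes searching for tags convenient.

- You can search HTML with regular expressions alone, but because HTML structure is complex, using regular expressions together with BeautifulSoup is recommended.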

- https://www.crummy.com/software/BeautifulSoup/

1) BeautifulSoup object structure

html → <html> .... </html>
    head → <head> .... </head>
        title → <title>A Useful Page</title>
    body → <body> ..... </body>
        h1 → <h1>An Interesting Title</h1>
        div → <div> .... </div>
        .....
.....
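The object structure above can be explored offline. This is a small sketch that parses an inline HTML string instead of fetching the page; the markup is a hypothetical stand-in (only the title and h1 text come from the page above), and it assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for exercise1.html.
sample_html = """<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Some body text.</div>
</body>
</html>"""

soup = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns the first matching tag anywhere below the node,
# so all four paths reach the same <h1>.
paths = [soup.html.body.h1, soup.body.h1, soup.html.h1, soup.h1]
for tag in paths:
    print(tag)

print(soup.title.get_text())
```

The shorter paths are convenient, but spelling out the full path makes it clearer which part of the tree a tag is expected to live in.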

1. Installing BeautifulSoup

1) Install with pip

C:\Users\soldesk>pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.5.3-py3-none-any.whl (85kB)
    100% ■■■■■■■■■■■■■■■■■■■■ 92kB 347kB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.3

2) Eclipse normally detects the new package without a restart; if it does not, restart Eclipse.

2. Running BeautifulSoup

- http://www.pythonscraping.com/exercises/exercise1.html

[Output]

<h1>An Interesting Title</h1>

▷ basic.beautifulSoup.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
bsObj = BeautifulSoup(html.read(), "html.parser") # create the BeautifulSoup object; naming the parser avoids a warning
print(bsObj.html.body.h1) # <html><body><h1> ~ </h1>
print(bsObj.body.h1)      # same <h1> tag
print(bsObj.html.h1)      # same <h1> tag
print(bsObj.h1)           # same <h1> tag
-------------------------------------------------------------------------------------

3. Exception handling for a URL that does not exist

[Output]

HTTP Error 404: Not Found

Finished processing.

The requested page does not exist.

▷ basic.exceptionHandling.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except HTTPError as e:
        print(e)
        return None    # the request failed (for example, 404 Not Found)
    else:
        return title   # the page was fetched and parsed normally
    finally:
        print('Finished processing.')  # runs in both cases, before the function returns

# This URL does not exist, so the server responds with 404
title = getTitle("http://www.pythonscraping.com/exercises/exercise1000.html")
# With the valid URL below, the script prints <h1>An Interesting Title</h1> instead
# title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")

if title is None:
    print("The requested page does not exist.")
else:
    print(title)
-------------------------------------------------------------------------------------
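HTTPError only covers the case where the server answers with an error status. If the server cannot be reached at all, urlopen raises URLError, which is the parent class of HTTPError. The sketch below shows one way to handle both. The safe_fetch and fake_404 names and the opener parameter are hypothetical, introduced here only so the example runs offline.

```python
import io
from urllib.error import HTTPError, URLError

def safe_fetch(url, opener):
    """Return the response body as bytes, or None if the request fails.

    `opener` is any callable that behaves like urllib.request.urlopen.
    """
    try:
        response = opener(url)
    except HTTPError as e:   # the server answered with an error status (404, 500, ...)
        print(e)
        return None
    except URLError as e:    # the server could not be reached (bad host name, no network, ...)
        print("Could not reach the server:", e.reason)
        return None
    return response.read()

def fake_404(url):
    # Simulates the 404 response from the listing above.
    raise HTTPError(url, 404, "Not Found", None, None)

missing = safe_fetch("http://example.invalid/missing.html", fake_404)
found = safe_fetch("http://example.invalid/ok.html", lambda url: io.BytesIO(b"<h1>ok</h1>"))
print(missing, found)
```

Because HTTPError is a subclass of URLError, the HTTPError clause has to come first; otherwise the URLError clause would catch both. In a real script, urlopen itself would be passed as the opener.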

4. Selecting tags by CSS class attribute

- http://www.pythonscraping.com/pages/warandpeace.html

[Output]

Anna

Pavlovna Scherer

Empress Marya

Fedorovna

Prince Vasili Kuragin

.....

▷ basic.selectByClass.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
# <span class="green">...
nameList = bsObj.findAll("span", {"class":"green"})

# <class 'bs4.element.ResultSet'>
print(type(nameList)) 

for name in nameList:
    # type(name): <class 'bs4.element.Tag'> 
    print('type(name): ' + str(type(name)))
    print(name.get_text()) # line breaks in the page source are kept, so a name can span two lines
-------------------------------------------------------------------------------------
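Class-based selection can also be checked on a small inline document. This sketch assumes beautifulsoup4 is installed; the markup is a hypothetical fragment in the style of the War and Peace page. It uses find_all(), the current name for findAll() (the camelCase form is a legacy alias), and shows the class_ keyword as an alternative to the attribute dictionary.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: quotations in red spans, character names in green spans.
sample_html = """
<p><span class="red">Well, Prince</span>, said
<span class="green">Anna Pavlovna</span> to
<span class="green">Prince Vasili Kuragin</span>.</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Two equivalent ways to select <span class="green"> tags.
by_dict = soup.find_all("span", {"class": "green"})
by_keyword = soup.find_all("span", class_="green")

names = [tag.get_text() for tag in by_dict]
print(names)
```

The keyword is spelled class_ because class is a reserved word in Python.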

5. get_text()

- Strips the tags and returns only the text.

- http://www.pythonscraping.com/pages/warandpeace.html

[Output]

"Well, Prince, so Genoa and Lucca are now just family estates of the

Buonapartes. But I warn you, if you don't tell me that this means war,

if you still try to defend the infamies and horrors perpetrated by

that Antichrist- I really believe he is Antichrist- I will have

nothing more to do with you and you are no longer my friend, no longer

my 'faithful slave,' as you call yourself! But how do you do? I see

I have frightened you- sit down and tell me all the news."

.....

▷ basic.selectByAttribute.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

# The [Output] above comes from the War and Peace page (<div id="text">):
# html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")

# allText = bsObj.findAll(id="text") # <div id="text"> ... </div>
# With the Wikipedia page, this matches <h1 id="firstHeading" ...>Kevin Bacon</h1>
allText = bsObj.findAll(id="firstHeading")

print(allText[0].get_text())

------------------------------------------------------------------------------------
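To see exactly what get_text() removes, compare it with the tag's string form. This is a minimal sketch, assuming beautifulsoup4 is installed; the nested <i> tag and the table cell are hypothetical additions that make the effect visible.

```python
from bs4 import BeautifulSoup

heading_html = '<h1 id="firstHeading" class="firstHeading" lang="en">Kevin <i>Bacon</i></h1>'
heading = BeautifulSoup(heading_html, "html.parser").find(id="firstHeading")

print(str(heading))        # the tag with its markup
print(heading.get_text())  # the text only, nested tags removed

# Whitespace inside the tag is part of the text unless strip=True is passed.
cell = BeautifulSoup("<td>\n$15.00\n</td>", "html.parser").td
raw_text = cell.get_text()
clean_text = cell.get_text(strip=True)
print(repr(raw_text), repr(clean_text))
```

Since get_text() discards the markup, it is usually best called last, after every tag-based search is done.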

* Search practice

https://en.wikipedia.org/wiki/Kevin_Bacon

<h1 id="firstHeading" class="firstHeading" lang="en">Kevin Bacon</h1>

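6. find() and findAll() functions

- find("table",{"id":"giftList"}): searches for a single tag; returns the first <table> tag whose id attribute is 'giftList'.
- findAll(id="firstHeading"): searches for multiple tags and returns a ResultSet (a list of tags); the ResultSet itself has no children attribute.

- children: an iterator (list_iterator) over a tag's direct child nodes, so the children can be read one at a time. Use descendants to walk every nested level.

- http://www.pythonscraping.com/pages/page3.html

[Output]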

<class 'bs4.element.Tag'>

<class 'list_iterator'>

<tr><th>

Item Title

</th><th>

Description

</th><th>

Cost

</th><th>

Image

</th></tr>

.....

<tr class="gift" id="gift5"><td>

Mystery Box

</td><td>

If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>

</td><td>

$1.50

</td><td>

<img src="../img/gifts/img6.jpg">

</img></td></tr>

▷ basic.findDescendants.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# <class 'bs4.element.Tag'>
print(type(bsObj.find("table",{"id":"giftList"})))
print('==================================')

# <class 'list_iterator'>
print(type(bsObj.find("table",{"id":"giftList"}).children)) # iterator over the direct children
print('==================================')

for child in bsObj.find("table",{"id":"giftList"}).children: # print each direct child of the table
    print(child)
    print('--------------------------------------')

print('==================================')
for row in bsObj.findAll("tr",{"class":"gift"}): # print every <tr class="gift"> row
    print(row)
    print('--------------------------------------')
-------------------------------------------------------------------------------------
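The difference between children and descendants is easiest to see by counting nodes. This sketch assumes beautifulsoup4 is installed and uses a hypothetical, whitespace-free reduction of the gift table with only two columns; the item names and prices are taken from the outputs in this post.

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the giftList table, written without whitespace between tags.
sample_html = (
    '<table id="giftList">'
    '<tr><th>Item Title</th><th>Cost</th></tr>'
    '<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
    '<tr class="gift" id="gift5"><td>Mystery Box</td><td>$1.50</td></tr>'
    '</table>'
)

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# children: only the nodes directly under <table>, here the three rows.
children = list(table.children)

# descendants: every nested node, i.e. rows, cells, and the text inside each cell.
descendants = list(table.descendants)

print(len(children), len(descendants))
print([child.name for child in children])
```

On the real page the counts are higher, because whitespace between tags is also stored as text nodes.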

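7. Working with sibling tags

- find("table",{"id":"giftList"}).tr.next_siblings: finds the <table> tag whose id attribute is 'giftList', then yields every sibling that follows its first <tr>, i.e. the rows from the second <tr> onward. Any whitespace between rows is yielded too, as text nodes.

[Output]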

<tr class="gift" id="gift1"><td>

Vegetable Basket

</td><td>

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!

<span class="excitingNote">Now with super-colorful bell peppers!</span>

</td><td>

$15.00

</td><td>

▷ basic.findSiblings.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# tr.next_siblings: skip the first <tr> tag and print everything after it
# The first <tr> usually holds the column headings rather than data
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)
-------------------------------------------------------------------------------------
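Because next_siblings yields text nodes as well as tags, a script that needs only the data rows has to filter them. Here is one possible sketch, assuming beautifulsoup4 is installed; the table is a hypothetical two-column reduction of the gift list, with newlines between rows.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample_html = """<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift" id="gift5"><td>Mystery Box</td><td>$1.50</td></tr>
</table>"""

soup = BeautifulSoup(sample_html, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# Everything after the header row, including the newline text nodes.
all_siblings = list(header_row.next_siblings)

# Keep only real tags, which leaves the data rows.
data_rows = [node for node in all_siblings if isinstance(node, Tag)]

for row in data_rows:
    print(row["id"], row.get_text(" ", strip=True))
```

find_next_siblings("tr") is an alternative that returns only matching tags and makes the isinstance check unnecessary.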

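8. Working with parent tags

- print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text()):

. Find the img tag whose src attribute is '../img/gifts/img1.jpg'

. Move to its parent tag (the <td> that contains the image)

. Move to the parent's previous sibling (the <td> that holds the price) and read its text

<td>

$15.00

</td>

<td>

<img src="../img/gifts/img1.jpg">

</td>

[Output]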

$15.00

▷ basic.findParents.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

'''
<tr>
<td>
    $15.00
</td>
<td>
    <img src="../img/gifts/img1.jpg">
</td>
'''
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())



-------------------------------------------------------------------------------------
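The same parent-then-previous-sibling walk can be reproduced on an inline fragment. This is a sketch under the assumption that beautifulsoup4 is installed; the row is a hypothetical reduction of the first gift row.

```python
from bs4 import BeautifulSoup

sample_html = (
    '<table><tr>'
    '<td>Vegetable Basket</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td>'
    '</tr></table>'
)

soup = BeautifulSoup(sample_html, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
image_cell = img.parent                    # the <td> that wraps the image
price_cell = image_cell.previous_sibling   # the <td> just before it
price = price_cell.get_text()
print(price)

# find_previous_sibling("td") skips any text nodes between the cells.
safer_price = image_cell.find_previous_sibling("td").get_text()
print(safer_price)
```

previous_sibling returns whatever node comes immediately before, so if the markup had whitespace between the two cells it would return that text node instead of the price cell; find_previous_sibling avoids that.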

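9. Regular expressions

- images = bsObj.findAll("img", {"src":re.compile(r"\.\./img/gifts/img.*\.jpg")})

- re.compile(""): compiles the regular expression into a pattern object before use, so it can be reused efficiently.
- A '.' preceded by '\' loses its special meaning and matches a literal dot.

- r"\.\./img/gifts/img.*\.jpg": matches src values that contain ../img/gifts/img, followed by any run of characters (.*), followed by .jpg. The r prefix makes it a raw string, so Python passes the backslashes to the regex engine unchanged.

- .*: '.' matches any single character except a newline, and '*' repeats the preceding element zero or more times; the match is greedy, so it takes the longest string possible.

[Output]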

../img/gifts/img1.jpg

../img/gifts/img2.jpg

../img/gifts/img3.jpg

../img/gifts/img4.jpg

../img/gifts/img6.jpg

▷ basic.regularExpressions.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

images = bsObj.findAll("img", {"src":re.compile(r"\.\./img/gifts/img.*\.jpg")})
# print(str(type(images)))

for image in images: 
    # print(str(type(image)))
    print(image["src"])
     

-------------------------------------------------------------------------------------
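The pattern itself can be tested with the standard re module before it is handed to BeautifulSoup, which applies a compiled pattern through its search() method. In the sketch below, the candidate paths other than the img*.jpg ones are hypothetical examples, chosen to show what the pattern rejects.

```python
import re

# Raw string: the backslashes reach the regex engine unchanged.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

candidates = [
    "../img/gifts/img1.jpg",   # matches
    "../img/gifts/img6.jpg",   # matches
    "../img/gifts/logo.jpg",   # rejected: file name does not start with "img"
    "../img/gifts/img3.png",   # rejected: wrong extension
    "xx/img/gifts/img2.jpg",   # rejected: "\.\." requires two literal dots
]

matched = [src for src in candidates if pattern.search(src)]
print(matched)

# search() is unanchored, so trailing text after ".jpg" is still accepted.
print(bool(pattern.search("../img/gifts/img1.jpg.bak")))
```

Adding ^ at the start and $ at the end of the pattern would restrict it to values that consist of the path and nothing else.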

[Practice 1] Connect to the 'http://www.kma.go.kr/index.jsp' page and
write a script that fetches the temperature for 'Seoul/Gyeonggi' (서울.경기).

[Output]
Seoul/Gyeonggi temperature: 20.1

▷ /basic/kma.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.kma.go.kr/index.jsp")
bsObj = BeautifulSoup(html, "html.parser")

# As a list of tags:
# temperature = bsObj.find_all("div", "weather_area_0930 ML22")[0]

# As a single tag (the class names reflect the page layout when this was written)
temperature = bsObj.find("div", "weather_area_0930 ML22")
dls = temperature.select('dl')
print(type(dls))  # a list of tags
print('---------------------------------------------------')
for item in dls:  # print every dl tag
    print(item)
    print()
print('---------------------------------------------------')

val = dls[2].dd.p.getText()  # text of the third dl tag

print("Seoul/Gyeonggi temperature: " + str(val))
-------------------------------------------------------------------------------------

[Practice 2] Go to 'http://www.daum.net -> News (뉴스) -> Ranking (랭킹) -> Most Viewed (많이 본) -> All (종합)' and
write a script that fetches the titles of the 50 news articles.

[Output]

1. [단독] 며느리가 구박한다고 살해한 시아버지

2. [단독]"성주 사드 기지에 드론 침투"..10여차례 총성

3. "석방된 장시호, 7개월만에 잠든 아들 보고 펑펑 울었다"

4. '검찰의 별' 검사장 하룻만에 4명 단칼 정리.."무섭다"
.....

▷ /basic/daum.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://media.daum.net/ranking/popular")
bsObj = BeautifulSoup(html, "html.parser")

# Calling the BeautifulSoup object directly is shorthand for find_all():
# this collects every <a class="link_txt"> tag
tags = bsObj('a', {'class': 'link_txt'})
print(type(tags))  # <class 'bs4.element.ResultSet'>

for index, tag in enumerate(tags[:50], start=1): # at most the first 50 titles
    print(str(index) + ". " + tag.getText())
-------------------------------------------------------------------------------------
