[Python] Speeding up BeautifulSoup crawling with multiprocessing :: 마이자몽

by 🌻♚ 2018. 12. 31.

Paging through Naver news search results

In the previous post, I crawled the article titles from a single page of Naver news search results.

In practice, though, data analysis needs results collected from many pages, not just one.

The approach is simple: send the URL request repeatedly in a loop.

Naver's search URL currently does not take a page number. Instead, a parameter named start gives the index of the first result to show.
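
As a rough sketch of what this means for paging, the snippet below builds the start offsets and the matching URLs without sending any requests. It assumes ten results per page (the step of 10 used in the code in this post), and build_urls is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.naver.com/search.naver"


def build_urls(query, last, per_page=10):
    """Return one search URL per page, covering results 1 through `last`."""
    urls = []
    for start in range(1, last + 1, per_page):
        # `start` is the index of the first result on the page, not a page number
        params = {"where": "news", "sm": "tab_jum", "query": query, "start": start}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = build_urls("삼성", 300)
print(len(urls))  # 30
print(urls[0])    # first page, ends with start=1
print(urls[-1])   # last page, ends with start=291
```

Using urlencode also percent-encodes the Korean search word, so the URL is safe to pass to any HTTP client.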


Crawling Naver news without multiprocessing

The script fetches results 1 through 300 in 30 requests in total.
The time module is used to measure how long the fetching takes.
from bs4 import BeautifulSoup
import requests
import time

search_word = "삼성"  # search keyword ("Samsung")
start = 1
end = 300  # index of the last news result to fetch
title_list = []

if __name__ == '__main__':
    start_time = time.time()
    while start <= end:
        print(start)

        url = 'https://search.naver.com/search.naver?where=news&sm=tab_jum&query={}&start={}'.format(search_word, start)
        req = requests.get(url)

        # continue only if the request succeeded
        if req.ok:
            html = req.text
            soup = BeautifulSoup(html, 'html.parser')

            # select the news title links
            titles = soup.select(
                'ul.type01 > li > dl > dt > a'
            )

            # collect each title in the list
            for title in titles:
                title_list.append(title['title'])
        start += 10
    print(title_list)
    print("Elapsed time: %s seconds" % (time.time() - start_time))

Result without multiprocessing

Fetching about 300 items over 30 page requests took about 21 seconds.
That is too slow.
Real data work involves tens or hundreds of thousands of items, so spending 21 seconds on a mere 300 is a waste of time.

Crawling Naver news with multiprocessing

Below is the code that uses multiprocessing.
It uses Pool and Manager from the multiprocessing module.

Pool
- Distributes the input values across worker processes and runs the function on them in parallel.
- Its map method takes arguments in the form (function, values to pass to the function).

Manager
- A Manager is used to share a value across processes.
- With a plain list instead of a Manager list, the final output would be an empty list. This happens because each worker process works on its own copy of the variable rather than sharing one with the others.
- The Manager's list() method creates a list that every process sees as shared.
from functools import partial
from multiprocessing import Pool, Manager
import time

from bs4 import BeautifulSoup
import requests

search_word = "삼성"  # search keyword ("Samsung")
end = 300  # index of the last news result to fetch

def title_to_list(title_list, start):
    # build the search URL for the page that begins at `start`
    url = 'https://search.naver.com/search.naver?where=news&sm=tab_jum&query={}&start={}'.format(search_word, start)
    req = requests.get(url)

    # continue only if the request succeeded
    if req.ok:
        html = req.text
        soup = BeautifulSoup(html, 'html.parser')

        # select the news title links
        titles = soup.select(
            'ul.type01 > li > dl > dt > a'
        )

        # append each title to the shared list
        for title in titles:
            title_list.append(title['title'])

if __name__ == '__main__':
    start_time = time.time()
    # Create the Manager inside the main guard so that worker processes
    # do not start a Manager of their own when they import this module.
    with Manager() as manager:
        title_list = manager.list()  # list shared between processes
        with Pool(processes=4) as pool:  # run 4 worker processes at once
            # call title_to_list with start = 1, 11, 21, ... up to end
            pool.map(partial(title_to_list, title_list), range(1, end, 10))
        print(list(title_list))
    print("Elapsed time: %s seconds" % (time.time() - start_time))
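
The empty-list behavior described for plain lists can be reproduced without any crawling. The following is a minimal sketch, not the crawler itself: record_square and run_demo are hypothetical names, and the list is handed to the worker with functools.partial rather than read as a global. That is an assumption made so the example behaves the same whether worker processes are forked or spawned.

```python
from functools import partial
from multiprocessing import Manager, Pool


def record_square(target, n):
    """Append n squared to target (a plain list or a Manager list proxy)."""
    target.append(n * n)


def run_demo():
    """Return (plain, shared) after pool workers append squares to each."""
    plain = []
    with Manager() as manager, Pool(processes=4) as pool:
        shared = manager.list()
        # Each task receives a pickled copy of `plain`, so its appends are lost.
        pool.map(partial(record_square, plain), range(5))
        # The proxy forwards every append to the manager process instead.
        pool.map(partial(record_square, shared), range(5))
        # Sort because workers may append in any order.
        return plain, sorted(list(shared))


if __name__ == "__main__":
    plain, shared = run_demo()
    print(plain)   # [] : the parent's plain list was never modified
    print(shared)  # [0, 1, 4, 9, 16]
```

Running the script prints an empty list for the plain list and the five squares for the Manager list.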

Result with multiprocessing

Crawling with 4 processes, fetching the 300 items took about 6 seconds.
About 21 seconds -> 6 seconds: roughly 3.5 times faster.
