Posts in the 'Programming Languages/Python' category


hel 2023.04.18
#2 - Naver News Crawling (python) 2023.04.10

hel

느리게가는시계 2023. 4. 18. 01:46


#2 - Naver News Crawling (python)

느리게가는시계 2023. 4. 10. 19:54


# 1. Import packages
import os
import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

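# 2. Save the current time (e.g. '2023-04-10_19시54분'; 시 = hour, 분 = minute)
date = str(datetime.now())
date = date[:date.rfind(':')].replace(' ', '_')  # drop the seconds, then replace the space
date = date.replace(':', '시') + '분'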
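The slicing in step 2 depends on the exact layout of `str(datetime.now())`. As a minimal sketch of an alternative, the same label can be built from format specifiers. The helper name `timestamp_label` is hypothetical and not part of the original script.

```python
from datetime import datetime


def timestamp_label(now: datetime) -> str:
    """Return a label such as '2023-04-10_19시54분' (시 = hour, 분 = minute)."""
    # Format the ASCII parts with format specifiers and add the Korean
    # suffixes separately, so no string slicing is involved.
    return f"{now:%Y-%m-%d_%H}시{now:%M}분"


date = timestamp_label(datetime.now())
```

Taking the time as a parameter also makes the label easy to check against a fixed `datetime` value.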

# 3. Read user input
query = input('Enter a search keyword: ')
query = query.replace(' ', '+')
news_num = int(input('Enter the total number of news articles needed (digits only): '))

# 4. Build the request URL and send the request
news_url = 'https://search.naver.com/search.naver?where=news&sm=tab_jum&query={}'
req = requests.get(news_url.format(query))
soup = BeautifulSoup(req.text, 'html.parser')
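Formatting the raw keyword into the URL works for simple queries, but a character such as `&` or `#` in the keyword would change the meaning of the URL. The following is a minimal sketch, assuming the same endpoint and parameters as step 4, that lets the standard library do the escaping. `build_search_url` and `SEARCH_ENDPOINT` are hypothetical names, not part of the original script.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://search.naver.com/search.naver'


def build_search_url(keyword: str) -> str:
    """Build the news search URL with the keyword percent-encoded."""
    # urlencode turns spaces into '+' and escapes reserved characters,
    # so the manual replace(' ', '+') step is not needed here.
    params = {'where': 'news', 'sm': 'tab_jum', 'query': keyword}
    return SEARCH_ENDPOINT + '?' + urlencode(params)


print(build_search_url('python crawling'))
```

The helper expects the keyword exactly as the user typed it, before any `+` substitution, and the returned string can be passed to `requests.get` directly.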

# 5. Create variables to hold the results (dictionary)
news_dict = {}
idx = 0
cur_page = 1

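# 6. Search the parsed HTML for the wanted information (news article title, URL)
print()
print('Crawling...')

while idx < news_num:
    ### Tag names and class attribute values were updated after the Naver News page layout changed (20210126) ###
    table = soup.find('ul', {'class': 'list_news'})
    if table is None:  # no result list on this page: stop instead of raising AttributeError
        break
    li_list = table.find_all('li', {'id': re.compile('sp_nws.*')})
    area_list = [li.find('div', {'class': 'news_area'}) for li in li_list]
    a_list = [area.find('a', {'class': 'news_tit'}) for area in area_list if area is not None]

    # Take only as many links as are still needed
    for n in a_list[:news_num - idx]:
        if n is None:  # skip items without a title link
            continue
        news_dict[idx] = {'title': n.get('title'),
                          'url': n.get('href')}
        idx += 1

    if idx >= news_num:  # enough articles collected: no need to fetch another page
        break

    # Follow the pagination link whose text matches the next page number
    cur_page += 1
    pages = soup.find('div', {'class': 'sc_page_inner'})
    next_links = [p for p in pages.find_all('a') if p.text == str(cur_page)] if pages is not None else []
    if not next_links:  # no further result pages
        break
    req = requests.get('https://search.naver.com/search.naver' + next_links[0].get('href'))
    soup = BeautifulSoup(req.text, 'html.parser')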
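The loop in step 6 mixes three concerns: fetching a page, extracting links, and deciding when to stop. The stopping logic can be sketched on its own, under the assumption that page fetching is wrapped in a callable. `collect_articles`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the real `requests` and `BeautifulSoup` calls.

```python
from typing import Callable, Dict, List

Article = Dict[str, str]


def collect_articles(fetch_page: Callable[[int], List[Article]],
                     limit: int) -> Dict[int, Article]:
    """Collect up to `limit` articles, stopping early when pages run out."""
    collected: Dict[int, Article] = {}
    page_no = 1
    while len(collected) < limit:
        articles = fetch_page(page_no)
        if not articles:  # an empty page means there is nothing left to read
            break
        # Take only as many items as are still needed.
        for article in articles[:limit - len(collected)]:
            collected[len(collected)] = article
        page_no += 1
    return collected


# Fake fetcher for illustration only: two pages of results, then nothing.
FAKE_PAGES = {
    1: [{'title': 'A', 'url': 'https://example.com/a'},
        {'title': 'B', 'url': 'https://example.com/b'}],
    2: [{'title': 'C', 'url': 'https://example.com/c'}],
}


def fake_fetch(page_no: int) -> List[Article]:
    return FAKE_PAGES.get(page_no, [])


result = collect_articles(fake_fetch, limit=5)
print(len(result))
```

Because the result has the same `{index: {'title': ..., 'url': ...}}` shape as `news_dict`, it can be handed to `DataFrame(...).T` unchanged.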

# 7. Convert to a DataFrame and save
print('Crawling finished')
print('Converting to DataFrame')
news_df = DataFrame(news_dict).T
folder_path = os.getcwd()
xlsx_file_name = '네이버뉴스_{}_{}.xlsx'.format(query, date)  # 네이버뉴스 = "Naver News"
news_df.to_excel(xlsx_file_name)
print('Excel file saved v.01')
#print('Excel file saved | path: {}\\{}'.format(folder_path, xlsx_file_name))
# os.startfile(folder_path)
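The file name in step 7 embeds the search keyword directly, so a keyword containing a character such as `/` or `?` could produce a name the file system rejects. The following is a minimal sketch of one way to guard against that. `safe_file_name` is a hypothetical helper, and the set of replaced characters is an assumption based on the characters Windows disallows in file names.

```python
import re

# Characters that Windows does not allow in file names (assumed target platform).
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_file_name(keyword: str, timestamp: str) -> str:
    """Build the xlsx file name, replacing characters that are invalid in file names."""
    cleaned = _INVALID_CHARS.sub('_', keyword)
    return '네이버뉴스_{}_{}.xlsx'.format(cleaned, timestamp)


print(safe_file_name('AI/ML?', '2023-04-10_19시54분'))
```

The result can be used in place of `xlsx_file_name` when calling `news_df.to_excel`.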

https://everyday-tech.tistory.com/entry/%EC%89%BD%EA%B2%8C-%EB%94%B0%EB%9D%BC%ED%95%98%EB%8A%94-%EB%84%A4%EC%9D%B4%EB%B2%84-%EB%89%B4%EC%8A%A4-%ED%81%AC%EB%A1%A4%EB%A7%81python-2%ED%83%84

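[Part 2] Easy-to-Follow Naver News Crawling (python) - Getting the title and URL

"This post is Part 2, which writes the actual python code for Naver web crawling. If you would like to review the previous step, the execution plan, please see the link below (Part 1) :)" The Naver web page layout has changed, so the content,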
