데이터 정리 - 파싱parsing

Notice

Recent Posts

Recent Comments

Link

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

새로새록

데이터 정리 - 파싱parsing 본문

소프트웨어융합/코드잇 정리.py

데이터 정리 - 파싱parsing

류지나 2021. 11. 24. 04:21

import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

# 빈 리스트 생성
records = []

# 시작 페이지 지정
page_num = 1

while True:
    # HTML 코드 받아오기
    response = requests.get("http://www.ssg.com/search.ssg?target=all&query=nintendo&page=" + str(page_num))

    # BeautifulSoup 타입으로 변형하기
    soup = BeautifulSoup(response.text, 'html.parser')

    # "prodName" 클래스가 있을 때만 상품 정보 가져오기
    if len(soup.select('.csrch_tip')) == 0:
        product_names = soup.select('.cunit_info > div.cunit_md.notranslate > div > a > em.tx_ko')
        product_prices = soup.select('.cunit_info > div.cunit_price.notranslate > div.opt_price > em')
        product_urls = soup.select('.cunit_prod > div.thmb > a > img')
        page_num += 1
        time.sleep(3)
        
        # 상품의 정보를 하나의 레코드로 만들고, 리스트에 순서대로 추가하기
        for i in range(len(product_names)):
            record = []
            record.append(product_names[i].text)
            record.append(product_prices[i].text.strip())
            record.append("https://www.ssg.com" + product_urls[i].get('src'))
            records.append(record)
    else:
        break

# DataFrame 만들기
df = pd.DataFrame(data = records, columns = ["이름", "가격", "이미지 주소"])

# DataFrame 출력
df.head()

import pandas as pd
import requests
from bs4 import BeautifulSoup

# 기간 지정
years = range(2010, 2013)
months = range(1, 13)
weeks = range(0, 5)

# 페이지를 담는 빈 리스트 생성
rating_pages = []

for year in years:
    for month in months:
        for week in weeks:
            # HTML 코드 받아오기
            url = "https://workey.codeit.kr/ratings/index?year={}&month={}&weekIndex={}".format(year, month, week)
            response = requests.get(url)

            # BeautifulSoup 타입으로 변환하기
            soup = BeautifulSoup(response.text, 'html.parser')

            # "row" 클래스가 1개를 넘는 경우만 페이지를 리스트에 추가
            if len(soup.select('.row')) > 1:
                rating_pages.append(soup)

# 레코드를 담는 빈 리스트 만들기
records = []

# 각 페이지 파싱해서 정보 얻기
for page in rating_pages:
    date = page.select('option[selected=selected]')[2].text
    ranks = page.select('.row .rank')[1:]
    channels = page.select('.row .channel')[1:]
    programs = page.select('.row .program')[1:]
    percents = page.select('.row .percent')[1:]

    # 페이지에 있는 10개의 레코드를 리스트에 추가
    for i in range(10):
        record = []
        record.append(date)
        record.append(ranks[i].text)
        record.append(channels[i].text)
        record.append(programs[i].text)
        record.append(percents[i].text)
        records.append(record)

# DataFrame 만들기
df = pd.DataFrame(data=records, columns=['period', 'rank', 'channel', 'program', 'rating'])

# 결과 출력
df.head()

저작자표시 (새창열림)

'소프트웨어융합 > 코드잇 정리.py' 카테고리의 다른 글

bs4 -메소드 체이닝 & 엑셀/csv저장 (0)	2021.12.17
html css 정리하기 (0)	2021.12.01
머신러닝 기초 완성! (0)	2021.11.22
머신러닝 기본학습알고리즘 - 다항회귀 / 로지스틱회귀 파이썬코드 (0)	2021.11.17
java (0)	2021.11.16

'소프트웨어융합/코드잇 정리.py' Related Articles

새로새록

데이터 정리 - 파싱parsing 본문

데이터 정리 - 파싱parsing

'소프트웨어융합 > 코드잇 정리.py' 카테고리의 다른 글

티스토리툴바