
[Python ๊ฐ์ •๋ถ„์„] ์˜์–ด ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ :: ๋งˆ์ด์ž๋ชฝ

by ๐ŸŒปโ™š 2019. 2. 9.

Python ๊ฐ์ • ๋ถ„์„

Sentiment analysis is the task of extracting subjective opinions from text.
Basically, you analyze the train data and then apply the result to the test data.
It usually returns a binary positive/negative answer, judging sentiment as a value of 1/0 or 1/-1.
The way the data is preprocessed differs slightly for each language.
 
 

์˜์–ด ๋ฐ์ดํ„ฐ ์…‹ ๊ตฌํ•˜๊ธฐ

๋ฐ์ดํ„ฐ ์…‹์„ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด ์ฐธ์กฐํ•˜๊ธฐ ์ข‹์€ ์‚ฌ์ดํŠธ์— ๋Œ€ํ•œ ๊ธ€์ด๋‹ค.
 
์ด๋ฒˆ ๊ธ€์—์„œ๋Š” ์บ๊ธ€์—์„œ ํŠธ์œ„ํ„ฐ ๊ฐ์ •๋ถ„์„์„ ์œ„ํ•œ ๋ฐ์ดํ„ฐ ์…‹์„ ๋‹ค์šด๋ฐ›์•„ ์ „์ฒ˜๋ฆฌ๋ฅผ ํ•ด๋ณผ๊ฑฐ๋‹ค.

 

ํ•ด๋‹น ์‚ฌ์ดํŠธ์—์„œ sentiment๋ฅผ ๊ฒ€์ƒ‰ํ•˜๋ฉด ๊ฐ์ข… ๊ฐ์ •๋ถ„์„์„ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ์ œ๊ณตํ•ด์ค€๋‹ค.

Download the train and test data for the tweets.
 
 

์˜์–ด ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

header=0: the first row of the file contains the column names
delimiter: the field separator (a tab for .tsv files)
quoting=3: ignore double quotes (3 corresponds to csv.QUOTE_NONE)
import pandas as pd

train = pd.read_csv(
    '/Users/Jamong/Desktop/data/twitter_sentiment/train.tsv',
    header=0,
    delimiter='\t',
    quoting=3
)

test = pd.read_csv(
    '/Users/Jamong/Desktop/data/twitter_sentiment/test.tsv',
    header=0,
    delimiter='\t',
    quoting=3
)
 
 
To end up with meaningful English text, four to five processing steps are needed.
Below is a portion of the raw train data.
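The raw rows themselves aren't reproduced here, but you can eyeball them after loading with pandas' head() (the full code at the end does the same in check_basic_info):

# Peek at the first few raw rows and the tweet text
print(train.head())
print(train['tweet'][0:5])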

1. Remove HTML (only if present)

ํ•ด๋‹น ๋ฐ์ดํ„ฐ์—์„œ๋Š” htmlํƒœ๊ทธ๊ฐ€ ์—†๋‹ค.
HTML์ œ๊ฑฐ ์ž‘์—…์ด ํ•„์š” ์—†์ง€๋งŒ, ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ  BeautifulSoup๋ฅผ ์ด์šฉํ•ด์„œ ์ œ๊ฑฐ์ž‘์—…์„ ํ•ด์ฃผ๋ฉด ๋œ๋‹ค.
from bs4 import BeautifulSoup

# Strip HTML tags, keeping only the text content
raw_tweet = BeautifulSoup(raw_tweet, 'html.parser').get_text()
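As a quick illustration on a made-up tweet (not from this dataset), get_text() drops the tags, and html.parser also unescapes HTML entities:

# Hypothetical example, not part of the actual pipeline
sample = 'I <b>love</b> this phone &amp; camera'
print(BeautifulSoup(sample, 'html.parser').get_text())
# -> I love this phone & camera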
 
 

2. Replace non-English characters with spaces

Remove non-English characters so that only words remain.

A regular expression handles the removal.

import re

# Replace anything that is not an English letter with a space
only_english = re.sub('[^a-zA-Z]', ' ', data)
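For example, on a made-up tweet, mentions, hashtags, digits, and punctuation all turn into spaces, so a later split() yields only alphabetic tokens:

# Hypothetical input, not from the dataset
sample = '@user Great day!! #mood 100%'
print(re.sub('[^a-zA-Z]', ' ', sample).split())
# -> ['user', 'Great', 'day', 'mood']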
 
 

3. ๋Œ€๋ฌธ์ž๋Š” ์†Œ๋ฌธ์ž๋กœ ๋ณ€ํ™˜

์˜์–ด์˜ ๊ฒฝ์šฐ ๋ฌธ์žฅ์˜ ์‹œ์ž‘์ด๋‚˜ ๊ณ ์œ ๋ช…์‚ฌ๋Š” ๋Œ€๋ฌธ์ž๋กœ ์‹œ์ž‘ํ•˜์—ฌ ๋ถ„์„ํ• ๋•Œ "Apple"๊ณผ "apple"์„ ๋‹ค๋ฅธ ๋‹จ์–ด ์ทจ๊ธ‰ํ•˜๊ฒŒ๋œ๋‹ค.

๋ชจ๋“  ๋‹จ์–ด๋ฅผ ์†Œ๋ฌธ์ž๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

# Convert to lowercase and split into a word list
no_capitals = only_english.lower().split()
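A one-line check of the idea, on hypothetical input:

# 'Apple' and 'apple' collapse to the same token after lowercasing
print('An Apple is an apple'.lower().split())
# -> ['an', 'apple', 'is', 'an', 'apple']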
 
 

4. Remove stopwords

ํ•™์Šต ๋ชจ๋ธ์—์„œ ์˜ˆ์ธก์ด๋‚˜ ํ•™์Šต์— ์‹ค์ œ๋กœ ๊ธฐ์—ฌํ•˜์ง€ ์•Š๋Š” ํ…์ŠคํŠธ๋ฅผ ๋ถˆ์šฉ์–ด๋ผ๊ณ ํ•œ๋‹ค.
I, that, is, the, a  ๋“ฑ๊ณผ ๊ฐ™์ด ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด์ด์ง€๋งŒ ์‹ค์ œ๋กœ ์˜๋ฏธ๋ฅผ ์ฐพ๋Š”๋ฐ ๊ธฐ์—ฌํ•˜์ง€ ์•Š๋Š” ๋‹จ์–ด๋“ค์„ ์ œ๊ฑฐํ•˜๋Š” ์ž‘์—…์ด ํ•„์š”ํ•˜๋‹ค.
from nltk.corpus import stopwords

# Remove stopwords using NLTK's English stopword list
stops = set(stopwords.words('english'))
no_stops = [word for word in no_capitals if word not in stops]
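A sketch of the effect on a hypothetical token list:

tokens = ['i', 'think', 'that', 'the', 'movie', 'is', 'great']
print([word for word in tokens if word not in stops])
# -> ['think', 'movie', 'great']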
 
 
If an nltk stopwords-related error occurs, download the corpus in code:

import nltk
nltk.download('stopwords')
 

5. Stemming

see, saw, seen

run, running, ran

As in the examples above, this step tries to treat the different inflected forms of a word as a single token. Strictly speaking, a stemmer only strips suffixes, so regular forms like run/running are unified, while fully irregular forms like saw/seen usually require lemmatization instead.

nltk์—์„œ ์ œ๊ณตํ•˜๋Š” ํ˜•ํƒœ์†Œ ๋ถ„์„๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ์–ด๊ฐ„์ถ”์ถœ ์•Œ๊ณ ๋ฆฌ์ฆ˜(Porter, Lancaster, Snowball ๋“ฑ๋“ฑ)์ด ์กด์žฌํ•œ๋‹ค.

import nltk

# Stemming with the Snowball algorithm
stemmer = nltk.stem.SnowballStemmer('english')
stemmer_words = [stemmer.stem(word) for word in no_stops]
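A quick look at what the stemmer does and does not unify (hypothetical inputs):

print([stemmer.stem(word) for word in ['run', 'running', 'runs', 'ran', 'seen']])
# -> ['run', 'run', 'run', 'ran', 'seen']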
 
 

Full Preprocessing Function

def data_text_cleaning(data):

    # Replace anything that is not an English letter with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)

    # Convert to lowercase and split into a word list
    no_capitals = only_english.lower().split()

    # Remove stopwords
    stops = set(stopwords.words('english'))
    no_stops = [word for word in no_capitals if word not in stops]

    # Stemming
    stemmer = nltk.stem.SnowballStemmer('english')
    stemmer_words = [stemmer.stem(word) for word in no_stops]

    # Join back into a space-separated string and return
    return ' '.join(stemmer_words)
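A quick end-to-end check with a hypothetical tweet:

# Hypothetical input, not from the dataset
print(data_text_cleaning('I loved the running scenes!!!'))
# -> love run scene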
 

 

Multiprocessing

ํ•ด๋‹น ๋ฐ์ดํ„ฐ๋Š” ์ „์ฒ˜๋ฆฌ ์ž‘์—…ํ•˜๋Š”๋ฐ ์˜ค๋ž˜ ๊ฑธ๋ฆฌ์ง€ ์•Š๋Š”๋‹ค.
๋ฐ์ดํ„ฐ ์ˆ˜, ๋ฐ์ดํ„ฐ ๊ธธ์ด๊ฐ€ ๋Š˜์–ด๋‚ ์ˆ˜๋ก PC์‚ฌ์–‘์— ๋”ฐ๋ผ ์ „์ฒ˜๋ฆฌ ์ž‘์—… ์‹œ๊ฐ„ ์ฐจ์ด๊ฐ€ ๋งŽ์•„๋‚œ๋‹ค.
์ „์ฒ˜๋ฆฌ ์ž‘์—…๋งŒ ์ง„ํ–‰ํ•˜๋Š”๋ฐ ๋ช‡๋ถ„์”ฉ ๊ฑธ๋ฆด ์ˆ˜ ์žˆ๋‹ค.
์ด๋Ÿฌํ•œ ์ž‘์—…์„ ๋น ๋ฅด๊ฒŒ ์ง„ํ–‰ํ•ด์ฃผ๊ธฐ ์œ„ํ•ด์„œ ๋ฉ€ํ‹ฐํ”„๋กœ์„ธ์‹ฑ์œผ๋กœ ์ž‘์—…์„ ์ง„ํ–‰ํ•œ๋‹ค.
from multiprocessing import Pool

def use_multiprocess(func, iterable, workers):
    # Spread the cleaning work across `workers` processes
    pool = Pool(processes=workers)
    result = pool.map(func, iterable)
    pool.close()
    pool.join()
    return result
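One caveat: on platforms that spawn worker processes (Windows, and macOS since Python 3.8), the Pool must be created under an if __name__ == '__main__': guard, as the driver code below does; otherwise the workers re-import and re-execute the module.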
 

 

import time

if __name__ == '__main__':
    start_time = time.time()
    train = pd.read_csv('/Users/Jamong/Desktop/data/twitter_sentiment/train.tsv',
                        header=0, delimiter='\t', quoting=3)

    test = pd.read_csv('/Users/Jamong/Desktop/data/twitter_sentiment/test.tsv',
                       header=0, delimiter='\t', quoting=3)

    clean_processed_tweet = use_multiprocess(data_text_cleaning, train['tweet'], 3)
    print('Elapsed time:', (time.time() - start_time))
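The worker count of 3 here is just a hand-picked value; a common alternative is to size the pool to the machine with multiprocessing.cpu_count() or os.cpu_count().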

 

 

Checking Tweet Word Counts

seaborn์„ ์‚ฌ์šฉํ•ด์„œ tweet๋ณ„ ๋‹จ์–ด์ˆ˜ ์™€ ๊ณ ์œ ๋‹จ์–ด ์ˆ˜๋ฅผ ์‹œ๊ฐํ™”ํ•ด์„œ ํ™•์ธํ•ด๋ณผ๊ฑฐ๋‹ค.
import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns

def show_tweet_word_count_stat(data):
    num_word = []
    num_unique_words = []
    for item in data:
        num_word.append(len(str(item).split()))
        num_unique_words.append(len(set(str(item).split())))

    # Word count per tweet
    train['num_words'] = pd.Series(num_word)
    # Unique word count per tweet
    train['num_unique_words'] = pd.Series(num_unique_words)

    # Sanity check: word count of the first tweet
    x = data[0]
    x = str(x).split()
    print(len(x))

    # Font setting (AppleGothic is a macOS font)
    rc('font', family='AppleGothic')

    fig, axes = plt.subplots(ncols=2)
    fig.set_size_inches(18, 6)
    print('Mean words per tweet:', train['num_words'].mean())
    print('Median words per tweet:', train['num_words'].median())
    sns.distplot(train['num_words'], bins=100, ax=axes[0])
    axes[0].axvline(train['num_words'].median(), linestyle='dashed')
    axes[0].set_title('Distribution of words per tweet')

    print('Mean unique words per tweet:', train['num_unique_words'].mean())
    print('Median unique words per tweet:', train['num_unique_words'].median())
    sns.distplot(train['num_unique_words'], bins=100, color='g', ax=axes[1])
    axes[1].axvline(train['num_unique_words'].median(), linestyle='dashed')
    axes[1].set_title('Distribution of unique words per tweet')

    plt.show()

show_tweet_word_count_stat(clean_processed_tweet)
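Note: newer seaborn releases (0.11 and later) deprecate distplot; if you run this on a current environment, histplot is the replacement to move toward.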
 

 

 

Results

๊ฐ Tweet๋ณ„ ๋‹จ์–ด ์ตœ๋Œ€ ์ตœ์†Œ ์ˆ˜์™€ ์ค‘๊ฐ„, ํ‰๊ท  ๊ฐ’์„ ์‹œ๊ฐํ™”ํ•˜์—ฌ ํ™•์ธ

Full Code

from multiprocessing import Pool
import pandas as pd
import re
import time
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns


def use_multiprocess(func, iterable, workers):
    # Spread the cleaning work across `workers` processes
    pool = Pool(processes=workers)
    result = pool.map(func, iterable)
    pool.close()
    pool.join()
    return result


def check_basic_info():
    print("-----train-----")
    print(train.head())
    print(train.info())
    print(train['tweet'][0:10])

    print("\n\n-----test-----")
    print(test.head())
    print(test.info())


def data_text_cleaning(data):

    # Replace anything that is not an English letter with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)

    # Convert to lowercase and split into a word list
    no_capitals = only_english.lower().split()

    # Remove stopwords
    stops = set(stopwords.words('english'))
    no_stops = [word for word in no_capitals if word not in stops]

    # Stemming
    stemmer = nltk.stem.SnowballStemmer('english')
    stemmer_words = [stemmer.stem(word) for word in no_stops]

    # Join back into a space-separated string and return
    return ' '.join(stemmer_words)


def show_tweet_word_count_stat(data):
    num_word = []
    num_unique_words = []
    for item in data:
        num_word.append(len(str(item).split()))
        num_unique_words.append(len(set(str(item).split())))

    # Word count per tweet
    train['num_words'] = pd.Series(num_word)
    # Unique word count per tweet
    train['num_unique_words'] = pd.Series(num_unique_words)

    # Sanity check: word count of the first tweet
    x = data[0]
    x = str(x).split()
    print(len(x))

    # Font setting (AppleGothic is a macOS font)
    rc('font', family='AppleGothic')

    fig, axes = plt.subplots(ncols=2)
    fig.set_size_inches(18, 6)
    print('Mean words per tweet:', train['num_words'].mean())
    print('Median words per tweet:', train['num_words'].median())
    sns.distplot(train['num_words'], bins=100, ax=axes[0])
    axes[0].axvline(train['num_words'].median(), linestyle='dashed')
    axes[0].set_title('Distribution of words per tweet')

    print('Mean unique words per tweet:', train['num_unique_words'].mean())
    print('Median unique words per tweet:', train['num_unique_words'].median())
    sns.distplot(train['num_unique_words'], bins=100, color='g', ax=axes[1])
    axes[1].axvline(train['num_unique_words'].median(), linestyle='dashed')
    axes[1].set_title('Distribution of unique words per tweet')

    plt.show()


if __name__ == '__main__':
    start_time = time.time()
    train = pd.read_csv('/Users/Jamong/Desktop/data/twitter_sentiment/train.tsv',
                        header=0, delimiter='\t', quoting=3)

    test = pd.read_csv('/Users/Jamong/Desktop/data/twitter_sentiment/test.tsv',
                       header=0, delimiter='\t', quoting=3)

    check_basic_info()
    clean_processed_tweet = use_multiprocess(data_text_cleaning, train['tweet'], 3)
    print('Elapsed time:', (time.time() - start_time))

    show_tweet_word_count_stat(clean_processed_tweet)

 
