[Project] A Veteran's Steam Game Recommendations #3: Data Exploration

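The data collected earlier can be fetched from MongoDB as shown below.

This is my first time using NoSQL, and on the surface it feels a lot like using BeautifulSoup when crawling.

Many entries in the collected review data have no reviews at all, so I excluded the ones with an empty list, as shown below.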

docs = steam_appid.find({'reviews': {'$ne': []}})
docs

For reference, the comparison operators are as follows.

$lte : less than or equal to
$lt : less than
$eq : equal to
$gte : greater than or equal to
$gt : greater than
$ne : not equal to
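To make the semantics of these operators concrete, here is a minimal plain-Python sketch that mimics them on ordinary dicts. The `MONGO_OPS` table, the `matches` helper and the sample documents are made up for illustration; this is not how MongoDB evaluates queries internally, and it ignores MongoDB's array-matching and type-ordering rules.

```python
import operator

# Map each MongoDB comparison operator to its Python equivalent.
MONGO_OPS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
    "$gte": operator.ge,
    "$gt": operator.gt,
    "$ne": operator.ne,
}

def matches(doc, query):
    """Return True if doc satisfies every {field: {op: value}} condition."""
    for field, conditions in query.items():
        for op, expected in conditions.items():
            if not MONGO_OPS[op](doc.get(field), expected):
                return False
    return True

# Hypothetical documents: one without reviews, one with a single review.
sample_docs = [
    {"appid": 10, "reviews": []},
    {"appid": 20, "reviews": [{"recommendationid": "1"}]},
]
kept = [d["appid"] for d in sample_docs if matches(d, {"reviews": {"$ne": []}})]
print(kept)
```

Only the document with a non-empty `reviews` list survives, which is the same effect the `$ne` filter has in the `find` call above.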

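Next, looking at the structure of reviews, author is nested one level deeper. It contains data I need, so I cannot just drop it. Instead I want to flatten it (I am not sure that is the exact term), that is, lift its fields up to the same depth as recommendationid.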

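import pandas as pd
from tqdm import tqdm

reviews2_df_list = []

for doc in tqdm(docs):
    try:
        # One column per author field, taken from the first review's author dict
        author_df = pd.DataFrame(columns=doc['reviews'][0]['author'].keys())
        for i in range(len(doc['reviews'])):
            author_df.loc[i] = list(doc['reviews'][i]['author'].values())

        reviews_df = pd.DataFrame(doc['reviews'])
        reviews_df['appid'] = doc['appid']
        # Place the author fields next to the review fields, row by row
        reviews2_df = pd.concat([reviews_df, author_df], axis=1)
        reviews2_df_list.append(reviews2_df)
    except Exception:
        # Skip documents whose reviews do not fit the expected structure
        continue

flattened_pdf = pd.concat(reviews2_df_list)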
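The same flattening idea can be sketched on plain dicts, without a DataFrame. The `flatten_review` helper and the sample document below are hypothetical; the document only imitates the shape of what is stored in MongoDB. Unlike the loop above, this sketch drops the nested author key right away instead of removing it later.

```python
def flatten_review(review, appid):
    """Lift the nested author fields to the top level of one review dict."""
    flat = {k: v for k, v in review.items() if k != "author"}
    flat.update(review.get("author", {}))
    flat["appid"] = appid
    return flat

# Hypothetical document shaped like the ones stored in MongoDB.
doc = {
    "appid": 108600,
    "reviews": [
        {
            "recommendationid": "1",
            "voted_up": True,
            "author": {"steamid": "76561190000000000", "playtime_at_review": 400},
        }
    ],
}
rows = [flatten_review(r, doc["appid"]) for r in doc["reviews"]]
print(rows[0])
```

A list of such flat dicts can be passed straight to `pd.DataFrame(rows)`, and a review whose author dict lacks a field simply ends up with a missing value in that column.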

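Now let's look at the collected data in a bit more detail.

Calling info(), the most basic step in data exploration, shows the overall counts and the data types as follows.

There are about 150,000 rows, and some of the columns have the object dtype.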

Data columns (total 26 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   recommendationid             155336 non-null  object 
 1   author                       155336 non-null  object 
 2   language                     155336 non-null  object 
 3   review                       155336 non-null  object 
 4   timestamp_created            155336 non-null  int64  
 5   timestamp_updated            155336 non-null  int64  
 6   voted_up                     155336 non-null  bool   
 7   votes_up                     155336 non-null  int64  
 8   votes_funny                  155336 non-null  int64  
 9   weighted_vote_score          155336 non-null  object 
 10  comment_count                155336 non-null  int64  
 11  steam_purchase               155336 non-null  bool   
 12  received_for_free            155336 non-null  bool   
 13  written_during_early_access  155336 non-null  bool   
 14  hidden_in_steam_china        155336 non-null  bool   
 15  steam_china_location         155336 non-null  object 
 16  appid                        155336 non-null  int64  
 17  steamid                      155336 non-null  object 
 18  num_games_owned              155336 non-null  int64  
 19  num_reviews                  155336 non-null  int64  
 20  playtime_forever             155332 non-null  object 
 21  playtime_last_two_weeks      155332 non-null  object 
 22  playtime_at_review           135700 non-null  float64
 23  last_played                  155332 non-null  object 
 24  timestamp_dev_responded      1386 non-null    float64
 25  developer_response           1386 non-null    object 
dtypes: bool(5), float64(2), int64(8), object(11)
memory usage: 26.8+ MB

As a convenient first pass at fixing the types, I used convert_dtypes().

flattened_pdf = flattened_pdf.convert_dtypes()

flattened_pdf.dtypes

recommendationid                string
author                          object
language                        string
review                          string
timestamp_created                Int64
timestamp_updated                Int64
voted_up                       boolean
votes_up                         Int64
votes_funny                      Int64
weighted_vote_score             object
comment_count                    Int64
steam_purchase                 boolean
received_for_free              boolean
written_during_early_access    boolean
hidden_in_steam_china          boolean
steam_china_location            string
appid                            Int64
steamid                         string
num_games_owned                  Int64
num_reviews                      Int64
playtime_forever                 Int64
playtime_last_two_weeks          Int64
playtime_at_review               Int64
last_played                      Int64
timestamp_dev_responded          Int64
developer_response              string
dtype: object

The columns that are still of type object are author and weighted_vote_score.

author was already flattened above (the same kind of leveling work I did plenty of in the army).

weighted_vote_score holds values between 0 and 1, so I converted it to float.

flattened_pdf['weighted_vote_score']=flattened_pdf['weighted_vote_score'].astype('float')

Let's drop every row that has an NA in any of the three key features below, which will be used for training.

flattened_pdf = flattened_pdf.dropna(subset=['playtime_at_review', 'playtime_forever','weighted_vote_score'], how='any', axis=0)
flattened_pdf.count()

recommendationid               135696
author                         135696
language                       135696
review                         135696
timestamp_created              135696
timestamp_updated              135696
voted_up                       135696
votes_up                       135696
votes_funny                    135696
weighted_vote_score            135696
comment_count                  135696
steam_purchase                 135696
received_for_free              135696
written_during_early_access    135696
hidden_in_steam_china          135696
steam_china_location           135696
appid                          135696
steamid                        135696
num_games_owned                135696
num_reviews                    135696
playtime_forever               135696
playtime_last_two_weeks        135696
playtime_at_review             135696
last_played                    135696
timestamp_dev_responded          1345
developer_response               1345

Let's also drop the columns we do not need.

flattened_pdf2=flattened_pdf.drop(['timestamp_dev_responded', 'developer_response','author'], axis=1)

Lastly, there are duplicate rows. I ran my collection tests against the same collection, so duplicates got in by mistake.

Let's remove them.

flattened_pdf3=flattened_pdf2.drop_duplicates()
flattened_pdf3.count()

recommendationid               105933
language                       105933
review                         105933
timestamp_created              105933
timestamp_updated              105933
voted_up                       105933
votes_up                       105933
votes_funny                    105933
weighted_vote_score            105933
comment_count                  105933
steam_purchase                 105933
received_for_free              105933
written_during_early_access    105933
hidden_in_steam_china          105933
steam_china_location           105933
appid                          105933
steamid                        105933
num_games_owned                105933
num_reviews                    105933
playtime_forever               105933
playtime_last_two_weeks        105933
playtime_at_review             105933
last_played                    105933

Every column now has the same count, so hopefully no errors will come up during model training later.
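One thing worth keeping in mind about this step: drop_duplicates() with no arguments only removes rows that are identical in every column. The plain-Python sketch below imitates that behavior and compares it with deduplicating on a key. The `drop_duplicate_rows` helper and the sample rows are made up, and it is only an assumption that recommendationid uniquely identifies a review and that a re-collected review can come back with changed vote counts.

```python
def drop_duplicate_rows(rows, subset=None):
    """Keep the first occurrence of each row, compared on `subset` columns
    (or on every column when subset is None)."""
    seen = set()
    unique = []
    for row in rows:
        columns = subset if subset is not None else sorted(row)
        key = tuple((c, row[c]) for c in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical rows: the same review collected three times.
rows = [
    {"recommendationid": "1", "appid": 10, "votes_up": 3},
    {"recommendationid": "1", "appid": 10, "votes_up": 3},  # exact duplicate
    {"recommendationid": "1", "appid": 10, "votes_up": 5},  # collected again later
    {"recommendationid": "2", "appid": 10, "votes_up": 0},
]
print(len(drop_duplicate_rows(rows)))
print(len(drop_duplicate_rows(rows, subset=["recommendationid"])))
```

In pandas the key-based variant would be drop_duplicates(subset=['recommendationid']). Whether it changes the row count for this dataset is something I have not checked.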

Now that the data is reasonably clean, let's convert it to a Spark DataFrame as before and save it to HDFS.

df = spark.createDataFrame(flattened_pdf3)

df.write.format("parquet")\
    .mode('overwrite')\
    .save("/dahy/steam/data/app_reviews")

Let's load the review data and the appid data saved in HDFS and cache them.

import pyspark

# Parquet files carry their own schema, so no "header" option is needed
app_reviews = spark.read.format("parquet")\
    .load("/dahy/steam/data/app_reviews")
app_reviews.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)

app_ids = spark.read.format("parquet")\
    .load("/dahy/steam/data/app_ids")
app_ids.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)

# Available storage levels:
# DISK_ONLY
# DISK_ONLY_2
# DISK_ONLY_3
# MEMORY_AND_DISK
# MEMORY_AND_DISK_2
# MEMORY_AND_DISK_DESER
# MEMORY_ONLY
# MEMORY_ONLY_2
# OFF_HEAP

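There are many storage-level options for caching, such as DISK and MEMORY, and it is best to pick one that suits the situation.

One nice thing about Zeppelin is that z.show makes data much easier to look at. Besides basic sorting and the table view, you can also switch to a chart with drag and drop.

Next, as mentioned earlier, the helpfulness score ranges from 0 to 1, and the closer it is to 1 the more helpful the review, so I kept only the rows with a score of 0.5 or higher.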

from pyspark.sql.functions import col
app_reviews=app_reviews.filter( col("weighted_vote_score") >= 0.5 )

Calling printSchema on a Spark DataFrame shows the table's schema.

app_reviews.printSchema()

root
 |-- recommendationid: string (nullable = true)
 |-- language: string (nullable = true)
 |-- review: string (nullable = true)
 |-- timestamp_created: long (nullable = true)
 |-- timestamp_updated: long (nullable = true)
 |-- voted_up: boolean (nullable = true)
 |-- votes_up: long (nullable = true)
 |-- votes_funny: long (nullable = true)
 |-- weighted_vote_score: double (nullable = true)
 |-- comment_count: long (nullable = true)
 |-- steam_purchase: boolean (nullable = true)
 |-- received_for_free: boolean (nullable = true)
 |-- written_during_early_access: boolean (nullable = true)
 |-- hidden_in_steam_china: boolean (nullable = true)
 |-- steam_china_location: string (nullable = true)
 |-- appid: long (nullable = true)
 |-- steamid: long (nullable = true)
 |-- num_games_owned: long (nullable = true)
 |-- num_reviews: long (nullable = true)
 |-- playtime_forever: long (nullable = true)
 |-- playtime_last_two_weeks: long (nullable = true)
 |-- playtime_at_review: long (nullable = true)
 |-- last_played: long (nullable = true)

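Fortunately, nothing looks wrong.

I visualized the distribution of the playtime_at_review column, which will be the key input for the ALS algorithm.

(As mentioned above, this uses z.show. It is similar to df['playtime_at_review'].plot() in pandas.)

The playtimes look odd.

Let's remove the outliers.

I dropped the fourth quartile. (Later, when I evaluated the model on data with and without the outliers removed, the difference was large.)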
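The code for this step is not shown, and "dropping the fourth quartile" can be read in more than one way. Assuming it means discarding every value above the third quartile (Q3), a minimal sketch with the standard library looks like this. The `drop_top_quartile` helper and the sample playtimes are made up for illustration.

```python
import statistics

def drop_top_quartile(values):
    """Keep only the values at or below the third quartile (Q3)."""
    _q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [v for v in values if v <= q3]

# Hypothetical playtimes with one extreme value at the end.
playtimes = [5, 20, 60, 120, 300, 800, 2500, 90000]
print(drop_top_quartile(playtimes))
```

On a Spark DataFrame the cutoff could be estimated with DataFrame.approxQuantile and then applied with filter, which avoids pulling the whole column to the driver.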

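Even after removing outliers this way, though, I was not convinced by the result. The distribution showed almost no values above 10,000, so I capped playtime at 10,000 hours. In addition, a review written after 1 hour of play and one written after 100 hours differ a lot in how trustworthy they are, so I kept only the reviews written after at least 100 hours of play.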

data=data.filter((col("playtime_at_review") <= 10000) & (col("playtime_at_review") >= 100) )
z.show(data)

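The data that will finally be used to train the ALS model is shown below.

The row count dropped sharply, from roughly 130,000 to 150,000 down to about 40,000, and the 100 to 10,000 hour limit on playtime is clearly visible.

Let's look up the appid of Project Zomboid, a game I like.

On a game's Steam store page, the appid appears in the URL right after app/.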

https://store.steampowered.com/app/108600/Project_Zomboid/

The appid is 108600.
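Reading the appid off the URL by eye is fine for one game, but the same rule (the number right after app/) is easy to automate. A small sketch, where the `extract_appid` helper is a made-up name:

```python
import re

def extract_appid(store_url):
    """Return the numeric appid that follows 'app/' in a Steam store URL, or None."""
    match = re.search(r"/app/(\d+)", store_url)
    return int(match.group(1)) if match else None

url = "https://store.steampowered.com/app/108600/Project_Zomboid/"
print(extract_appid(url))
```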

Let's check that it actually exists in the app_ids DataFrame.

It does. Now let's look for its review data in the preprocessed data DataFrame.

Nothing shows up. It looks like I will need to re-collect the data later.

For now, I add review data for myself. To find your own steamid, turn on the option in the Steam app's settings that shows URLs, then open your account profile. The steamid appears in the URL.
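If you would rather not copy the number by hand, the steamid can be pulled out of the profile URL with a few lines. This sketch assumes the profile URL has the form steamcommunity.com/profiles/<steamid>. Profiles that use a custom URL (the /id/<name> form) do not contain the numeric ID, so the helper returns None for them. The `extract_steamid` name is made up.

```python
from urllib.parse import urlparse

def extract_steamid(profile_url):
    """Return the numeric steamid from a /profiles/<steamid> URL, or None."""
    parts = [p for p in urlparse(profile_url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "profiles" and parts[1].isdigit():
        return int(parts[1])
    return None

print(extract_steamid("https://steamcommunity.com/profiles/76561198212429999/"))
print(extract_steamid("https://steamcommunity.com/id/some_custom_name/"))
```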

For appid I used the Project Zomboid ID mentioned above, and I bumped the playtime up to 400 hours. I then built a DataFrame with a single row and merged it into the existing data.

%spark_yarn.pyspark
from pyspark.sql.functions import col

my_steamid = 76561198212429999
my_appid = 108600
my_playtime_at_review = 400

# union() matches columns by position, so the column order must match data's
new_row_df = spark.createDataFrame(
    [(my_steamid, my_appid, my_playtime_at_review)],
    ["steamids", "appid", "playtime_at_review"],
)
data = data.union(new_row_df)
data.filter(col("steamids") == my_steamid).show()

Fortunately, the row went in fine.

+-----------------+------+------------------+
|         steamids| appid|playtime_at_review|
+-----------------+------+------------------+
|76561198212429999|108600|               400|
+-----------------+------+------------------+

2023.01.25 - [Project] - [Project] A Veteran's Steam Game Recommendations #4: Modeling
