1. Data loading

2. Preprocessing

3. Cross-tabulation (chi-square) test, mean-difference tests, and correlation analysis

또오겠습니다

Statistical analysis: gender, marital status, skin type, price, repurchase, etc.

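1) Gender

2) Marital status

3) Analyzing skewness and kurtosis

4) Detecting and removing outliers, and comparing the distribution before and after

1) Frequency analysis of 'skin type' by 'purchase propensity'

2) Testing the difference in means between two groups (independent samples)

3) Testing the difference in means between two groups (paired samples)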

import tensorflow as tf  # not used in the rest of this analysis


# Widen the notebook container to the full browser width
from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>"))

import pandas as pd
import seaborn as sns
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt

%matplotlib inline

df = pd.read_csv('cosmetics_.csv', encoding='utf-8')
df.head()

# Gender: count how often each code occurs
from collections import Counter
Counter(df['gender'])

Counter({1: 132, 2: 115})

# Count by gender and visualize as a pie chart.
# Sorting by code keeps the counts aligned with the labels (1 = man, 2 = woman).
group = df['gender'].value_counts().sort_index()
group_names = ['man', 'woman']
group_colors = ['yellowgreen', 'lightcoral']

plt.pie(group,
        labels=group_names,
        colors=group_colors,
        autopct='%1.2f%%')

[Figure: pie chart of gender shares: man 53.44%, woman 46.56%]

# Marital status: count how often each code occurs
Counter(df['marriage'])

Counter({1: 71, 2: 176})

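# Count by marital status and visualize as a pie chart.
# Sorting by code keeps the counts aligned with the labels (1 = single, 2 = married).
group = df['marriage'].value_counts().sort_index()
group_names = ['single', 'married']
group_colors = ['yellowgreen', 'lightcoral']

plt.pie(group,
        labels=group_names,
        colors=group_colors,
        autopct='%1.2f%%')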

[Figure: pie chart of marital status shares: single 28.74%, married 71.26%]
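The percentages in both pie charts are just the counts divided by the 247 respondents (for example, 71 / 247 = 28.74% single). As a minimal sketch, the same shares can be computed with the labels attached to the codes before counting, so the label order never has to be matched by hand. The toy Series and the labels dictionary below are hypothetical stand-ins, not the survey data.

```python
import pandas as pd

# Hypothetical coded answers (1 = single, 2 = married), not the survey data.
marriage = pd.Series([2, 1, 2, 2, 1, 2])
labels = {1: 'single', 2: 'married'}

# Map codes to labels first, then count, so each count carries its own label.
counts = marriage.map(labels).value_counts()

# Share of each category in percent, rounded like the pie chart annotations.
shares = (counts / counts.sum() * 100).round(2)

print(counts.to_dict())
print(shares.to_dict())
```

The resulting counts can be passed to plt.pie with labels=counts.index, which avoids a separate hand-written list of group names.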

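# Skewness: Series.skew() returns the bias-corrected sample skewness
# (the scipy.stats import below is not needed for this call)
from scipy.stats import skew
df['amount'].skew()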

8.727245406515182

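# Kurtosis: Series.kurtosis() returns the bias-corrected excess kurtosis (0 for a normal distribution)
# (the scipy.stats import below is not needed for this call)
from scipy.stats import kurtosis
df['amount'].kurtosis()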

94.95150601199587
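A skewness of about 8.7 and a kurtosis of about 95 indicate a long right tail with a few very large purchase amounts. The sketch below is only an illustration on synthetic right-skewed data (an assumption, not the survey data). It shows that the pandas methods used above agree with scipy.stats.skew and scipy.stats.kurtosis once bias=False is passed, which is worth knowing because the SciPy defaults (bias=True) give slightly different numbers.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed values standing in for df['amount'] (not the survey data).
rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=11, sigma=1, size=247))

# pandas: bias-corrected sample skewness and excess kurtosis.
pd_skew = amount.skew()
pd_kurt = amount.kurtosis()

# SciPy: bias=False applies the same corrections; fisher=True gives excess kurtosis.
sp_skew = stats.skew(amount, bias=False)
sp_kurt = stats.kurtosis(amount, fisher=True, bias=False)

print(f"skewness: pandas={pd_skew:.4f}, scipy={sp_skew:.4f}")
print(f"kurtosis: pandas={pd_kurt:.4f}, scipy={sp_kurt:.4f}")
```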

# Box plot of the amount column to check how the outliers are distributed
plt.boxplot(df['amount'])

[Figure: box plot of amount before outlier removal]

# Summary statistics of the column
df['amount'].describe()

count    2.470000e+02
mean     1.539393e+05
std      3.980750e+05
min      3.000000e+03
25%      3.000000e+04
50%      5.200000e+04
75%      1.000000e+05
max      5.000000e+06
Name: amount, dtype: float64

# Check the quantile: quantile() defaults to q=0.5, so this returns the median (the second quartile)
df['amount'].quantile()

52000.0

# Check the median
df['amount'].median()

52000.0

# Interquartile range (IQR) = 75th percentile - 25th percentile
Q1 = df['amount'].quantile(q = 0.25)
Q3 = df['amount'].quantile(q = 0.75)

IQR = Q3 - Q1
IQR

70000.0

# Upper fence: 75th percentile + 1.5 * IQR
# Lower fence: 25th percentile - 1.5 * IQR
# Values beyond these fences can be treated as outliers

upper = Q3 + IQR * 1.5
lower = Q1 - IQR * 1.5

mask_upper = df['amount'] > upper
mask_lower = df['amount'] < lower

new_df = df[~(mask_upper | mask_lower)]

# Box plot of the data after removing the outliers
plt.boxplot(new_df['amount'])

[Figure: box plot of amount after outlier removal]
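With the numbers from this post, Q1 = 30,000 and Q3 = 100,000, so IQR = 70,000, the upper fence is 100,000 + 1.5 * 70,000 = 205,000 and the lower fence is 30,000 - 1.5 * 70,000 = -75,000. Since the minimum amount is 3,000, only the upper fence actually removes rows. The filtering steps can be wrapped into small helpers, sketched below on hypothetical toy values; the function names iqr_bounds and drop_iqr_outliers are made up for this illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(frame, column, k=1.5):
    """Return a copy of frame keeping only rows whose column value lies within the fences.

    Note: rows with a missing value in column are dropped as well,
    because between() evaluates to False for NaN.
    """
    lower, upper = iqr_bounds(frame[column], k)
    return frame[frame[column].between(lower, upper)].copy()

# Hypothetical toy values, not the survey data.
toy = pd.DataFrame({'amount': [10, 20, 30, 40, 50, 1000]})
lower, upper = iqr_bounds(toy['amount'])
cleaned = drop_iqr_outliers(toy, 'amount')

print(lower, upper)
print(cleaned['amount'].tolist())
```

Returning an explicit copy means later column assignments on the filtered frame do not trigger pandas' SettingWithCopyWarning.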

propensity : purchase propensity (relatively inexpensive products, mid-range products, relatively expensive products)
skin : skin type (dry, sensitive, neutral, oily/acne-prone, combination)

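# Frequency analysis of purchase propensity vs. skin type
result = pd.crosstab(new_df.propensity, new_df.skin, margins=True)
# Labels follow the code lists above: 4 = oily/acne-prone, 5 = combination ('complex skin')
result.rename(columns={1: 'dry skin', 2: 'sensitive skin', 3: 'neutral skin', 4: 'oily skin', 5: 'complex skin', 'All': 'Total'},
              index={1: 'low cost', 2: 'middle cost', 3: 'high cost', 'All': 'Total'})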

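# p-value for purchase propensity vs. skin type.
# Note: chisquare() is a goodness-of-fit test that treats its arguments as observed and
# expected frequencies. Here it receives the raw response codes rather than the crosstab
# counts, so the result below is not a test of independence between the two variables.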
sp.stats.chisquare(new_df.propensity, f_exp=new_df.skin)

Power_divergenceResult(statistic=265.81666666666666, pvalue=0.011742407950613324)

sp.stats.power_divergence(new_df.propensity, f_exp=new_df.skin)

Power_divergenceResult(statistic=265.81666666666666, pvalue=0.011742407950613324)
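Both calls above pass the raw response codes to a goodness-of-fit test, so the p-value does not answer whether skin type and purchase propensity are related. A common way to test that relationship is a chi-square test of independence on the contingency table, for which SciPy provides chi2_contingency. The sketch below uses a small hypothetical data set (not the survey data), so its numbers say nothing about the real result.

```python
import pandas as pd
from scipy import stats

# Hypothetical coded responses (not the survey data): propensity 1-3, skin 1-5.
toy = pd.DataFrame({
    'propensity': [1, 1, 2, 2, 3, 3, 1, 2, 3, 2, 1, 3, 2, 2, 1],
    'skin':       [1, 2, 3, 3, 5, 4, 1, 3, 5, 2, 2, 4, 3, 1, 1],
})

# Observed counts per (propensity, skin) cell. Margins are left out on purpose:
# the test expects only the inner cells of the table.
observed = pd.crosstab(toy['propensity'], toy['skin'])

# chi2_contingency derives the expected counts from the row and column totals.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
# With so few rows the expected counts are small, so this toy p-value is
# only a demonstration of the call, not a reliable test.
```

On the real data the same call would take pd.crosstab(new_df.propensity, new_df.skin) without margins=True.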

# Replace the numeric codes with readable labels.
# Working on an explicit copy avoids pandas' SettingWithCopyWarning,
# which is raised when assigning to a filtered slice of df.
new_df = new_df.copy()
new_df['propensity'] = new_df['propensity'].replace({1: 'low cost', 2: 'middle cost', 3: 'high cost'})
new_df['skin'] = new_df['skin'].replace({1: 'dry skin', 2: 'sensitive skin', 3: 'neutral skin', 4: 'oily skin', 5: 'complex skin'})

# Rebuild the crosstab from the relabeled columns and plot it.
# No rename is needed here: the codes were already replaced with text labels above.
result = pd.crosstab(new_df.propensity, new_df.skin, margins=True)
result.plot(kind='bar')

[Figure: grouped bar chart of skin type counts by purchase propensity]

result.plot(kind='bar', stacked=True)

[Figure: stacked bar chart of skin type counts by purchase propensity]

Independent-samples t-test: analysis and visualization

df = pd.read_csv('cosmetics_.csv', encoding='utf-8')
df.head()

# Compare overall satisfaction (satisf_al) between the two gender groups
tmp = df.loc[:,['satisf_al', 'gender']]
#tmp.gender = tmp.gender.replace(1,'man').replace(2,'woman')
tmp.groupby('gender').describe()

man = tmp.loc[tmp.gender == 1, 'satisf_al'].values
woman = tmp.loc[tmp.gender == 2, 'satisf_al'].values

# Independent two-sample t-test (ttest_ind assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(man, woman)
print("The T-statistic is %.3f and the p-value is %.3f" % (t_stat, p_value))

The T-statistic is -0.495 and the p-value is 0.621
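A p-value of 0.621 gives no evidence, at the usual 5% level, that overall satisfaction differs between the two gender groups. By default ttest_ind assumes that both groups share the same variance. A minimal sketch of checking that assumption with Levene's test and comparing it with Welch's t-test (equal_var=False) follows. It runs on synthetic 5-point scores with the same group sizes as in this post (132 and 115), not on the survey data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic 5-point scores (values 1..5), not the survey data.
man = rng.integers(1, 6, size=132)
woman = rng.integers(1, 6, size=115)

# Levene's test: a small p-value suggests the group variances differ.
levene_stat, levene_p = stats.levene(man, woman)

# Student's t-test assumes equal variances; Welch's t-test does not.
student = stats.ttest_ind(man, woman, equal_var=True)
welch = stats.ttest_ind(man, woman, equal_var=False)

print(f"Levene:  stat={levene_stat:.3f}, p={levene_p:.3f}")
print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.3f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.3f}")
```

When the group sizes and standard deviations are similar, as they are here (0.71 vs. 0.80 in the describe() output), the two variants usually give nearly the same result.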

# Visualize overall satisfaction (satisf_al) for the two gender groups
tmp.boxplot(column='satisf_al', by='gender')

[Figure: box plots of satisf_al grouped by gender]

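# kde=False hides the kernel density estimate, leaving only the histogram and the fitted curve
# fit=stats.norm overlays the normal distribution fitted to the data (centered on the sample mean)
# hist_kws and fit_kws set the styling of the histogram and of the fitted line
# Note: distplot is deprecated as of seaborn 0.11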


sns.distplot(man, kde=False, fit=stats.norm,
            hist_kws={'color': 'r', 'alpha': 0.2}, fit_kws={'color': 'r'})

sns.distplot(woman, kde=False, fit=stats.norm, 
             hist_kws={'color': 'g', 'alpha': 0.2}, fit_kws={'color': 'g'})

[Figure: overlaid histograms of satisf_al with fitted normal curves, man in red and woman in green]
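Because sns.distplot has been deprecated since seaborn 0.11, a similar picture can be drawn with matplotlib and SciPy directly. The sketch below is one possible replacement, not the method used in this post. The helper name hist_with_normal_fit is made up, and the scores are synthetic stand-ins for the man and woman arrays.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def hist_with_normal_fit(ax, values, color):
    """Draw a density histogram of values and the normal PDF fitted to them."""
    mu, sigma = stats.norm.fit(values)  # maximum-likelihood mean and std
    # One bin per point on the 5-point scale, normalized so it is comparable to the PDF.
    ax.hist(values, bins=np.arange(0.5, 6.5, 1.0), density=True, color=color, alpha=0.2)
    grid = np.linspace(0.5, 5.5, 200)
    ax.plot(grid, stats.norm.pdf(grid, mu, sigma), color=color)
    return mu, sigma

rng = np.random.default_rng(2)
# Synthetic 5-point scores, not the survey data.
man = rng.integers(1, 6, size=132)
woman = rng.integers(1, 6, size=115)

fig, ax = plt.subplots()
mu_m, sigma_m = hist_with_normal_fit(ax, man, 'r')
mu_w, sigma_w = hist_with_normal_fit(ax, woman, 'g')
ax.set_xlabel('satisf_al')
```

The fitted curve is centered on the sample mean, which is why the two curves nearly overlap when the group means are close.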

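Paired-samples t-test, analysis and visualization: compares the means of two variables drawn from the same population

satisf_b : satisfaction with the purchase price (5-point scale)
satisf_i : satisfaction with purchase inquiries (5-point scale)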

# Summary statistics of the two columns
x = df[['satisf_b', 'satisf_i']]
x.describe()

x

# Paired t-test
stats.ttest_rel(x["satisf_b"], x["satisf_i"])

Ttest_relResult(statistic=-7.155916401026872, pvalue=9.518854506666397e-12)
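The negative statistic matches the summary statistics: the mean of satisf_b (2.89) is lower than the mean of satisf_i (3.40), and the very small p-value indicates that this difference is statistically significant. A paired t-test is equivalent to a one-sample t-test on the per-respondent differences, which the sketch below illustrates on synthetic paired scores (not the survey data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic paired 5-point scores for 247 respondents, not the survey data.
satisf_b = rng.integers(1, 6, size=247)
satisf_i = np.clip(satisf_b + rng.integers(-1, 3, size=247), 1, 5)

# Paired t-test on the two columns.
paired = stats.ttest_rel(satisf_b, satisf_i)

# Equivalent formulation: test whether the mean per-respondent difference is 0.
diff = satisf_b - satisf_i
one_sample = stats.ttest_1samp(diff, popmean=0)

print(f"paired:     t={paired.statistic:.3f}, p={paired.pvalue:.3g}")
print(f"one-sample: t={one_sample.statistic:.3f}, p={one_sample.pvalue:.3g}")
```

This also makes clear why the test needs both answers from the same respondent in the same row: the pairing is what defines the differences.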

sns.distplot(x["satisf_b"], kde=False, fit=stats.norm,
            hist_kws={'color': 'r', 'alpha': 0.2}, fit_kws={'color': 'r'})

sns.distplot(x["satisf_i"], kde=False, fit=stats.norm, 
             hist_kws={'color': 'g', 'alpha': 0.2}, fit_kws={'color': 'g'})

[Figure: overlaid histograms of satisf_b (red) and satisf_i (green) with fitted normal curves]

Output of df.head():

   gender  marriage  edu  job  mincome  aware  count  amount  decision  propensity  skin  promo  location  satisf_b  satisf_i  satisf_al  repurchase
0       1         1    4    1        2      2      1   11000         2           1     1      1         2         5         2          2           2
1       2         1    4    9        2      1      4   30000         1           1     3      2         3         2         3          3           4
2       2         2    4    4        3      1      6  100000         3           2     3      2         2         4         5          4           4
3       2         2    4    7        5      2      6   65000         3           2     5      2         3         3         4          4           4
4       1         2    6    6        5      2      2   50000         2           2     3      2         3         3         3          3           3

Output of x:

     satisf_b  satisf_i
0           5         2
1           2         3
2           4         5
3           3         4
4           3         3
...       ...       ...
242         2         1
243         3         4
244         2         5
245         4         3
246         2         3


Output of tmp.groupby('gender').describe() for satisf_al:

gender  count      mean       std  min  25%  50%  75%  max
1       132.0  3.439394  0.712565  1.0  3.0  3.0  4.0  5.0
2       115.0  3.486957  0.798740  1.0  3.0  4.0  4.0  5.0

Output of x.describe():

         satisf_b    satisf_i
count  247.000000  247.000000
mean     2.890688    3.404858
std      0.780995    0.830110
min      1.000000    1.000000
25%      2.000000    3.000000
50%      3.000000    3.000000
75%      3.000000    4.000000
max      5.000000    5.000000

