Python - Pandas Basics (1)

June 30, 2021 (2 min read)

pandas - a library for working with structured data

import numpy as np
import pandas as pd

1. Creating a Series

s = pd.Series([1, 3, 5, 6, 8])  # an index is assigned automatically
s

0    1
1    3
2    5
3    6
4    8
dtype: int64

s.index
# RangeIndex(start=0, stop=5, step=1): an index that starts at 0 and increases by 1, stopping before 5

RangeIndex(start=0, stop=5, step=1)

s.values

array([1, 3, 5, 6, 8], dtype=int64)
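The index does not have to be the automatic 0..n-1 range. As a minimal sketch (the labels 'a' to 'e' and the variable names are made up for illustration), a Series can be given explicit labels at creation time, and `to_numpy()` returns the same underlying array that `.values` shows above:

```python
import pandas as pd

# Same values as above, but with explicit (hypothetical) string labels
labeled = pd.Series([1, 3, 5, 6, 8], index=['a', 'b', 'c', 'd', 'e'])

labels = list(labeled.index)  # the labels passed in, in order
arr = labeled.to_numpy()      # NumPy array holding the values

print(labels)
print(arr)
```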

s2 = pd.Series([1,2,3,'a','b','c'])  # stores elements of different types, so the dtype of s2 becomes object
s2

0    1
1    2
2    3
3    a
4    b
5    c
dtype: object

# when a Series is created from a dictionary, the keys are used as the index
s3 = pd.Series({'name':'aaa', 'tel':'111', 'addr':'asdfasd'})
s3

name        aaa
tel         111
addr    asdfasd
dtype: object

s4 = pd.Series({'kor':65, 'eng':78, 'math':89})
s4

kor     65
eng     78
math    89
dtype: int64

s4['math']
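The lookup above is label-based. The sketch below rebuilds the same data under a different name (`scores`, chosen here so nothing above is overwritten) and shows a few equivalent ways to pull values out, using `loc` for labels and `iloc` for positions:

```python
import pandas as pd

scores = pd.Series({'kor': 65, 'eng': 78, 'math': 89})

by_label = scores['math']          # label-based lookup, as in the post
by_loc = scores.loc['math']        # explicit label-based lookup
by_pos = scores.iloc[2]            # explicit position-based lookup (third element)
subset = scores[['kor', 'math']]   # a list of labels returns a new Series

print(by_label, by_loc, by_pos)
print(subset)
```

All three single lookups return the same score, 89. Being explicit with `loc` and `iloc` tends to matter most when the index itself is made of integers.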

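idx = []   # index labels to use
vals = []  # values for the Series

# list of student names
names = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']

for i in range(0, 5):  # generate index labels and scores automatically
    label = 'student' + str(i + 1)  # build the index label (named label so the Series s above is not overwritten)
    idx.append(label)  # store the label in idx

    # put one student's name and scores into a single list
    val = [names[i], np.random.randint(0, 100, (3))]
    vals.append(val)

stu = pd.Series(vals, index=idx)  # create the Series from the values and index labels
stu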
stu

student1    [aaa, [92, 58, 85]]
student2    [bbb, [88, 72, 73]]
student3    [ccc, [80, 28, 28]]
student4    [ddd, [94, 52, 94]]
student5     [eee, [64, 44, 3]]
dtype: object
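Storing a nested list in every element works, but the resulting object Series is hard to compute on. One possible alternative, sketched here and not part of the original post, is to keep the same index labels and put the name and scores into DataFrame columns. The column names `score1` to `score3`, the fixed seed, and the use of `np.random.default_rng` are all assumptions made for this example:

```python
import numpy as np
import pandas as pd

names = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
idx = [f'student{i + 1}' for i in range(len(names))]

rng = np.random.default_rng(0)  # fixed seed (assumption) so reruns give the same numbers
scores = rng.integers(0, 100, size=(len(names), 3))  # three scores per student in [0, 100)

# Hypothetical column names; the post does not name the three scores
stu_df = pd.DataFrame(scores, index=idx, columns=['score1', 'score2', 'score3'])
stu_df.insert(0, 'name', names)  # put the name column first

print(stu_df)
```

With this layout, a call such as `stu_df[['score1', 'score2', 'score3']].mean(axis=1)` gives each student's average directly.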

2. Creating a DataFrame

* df = pd.DataFrame(data[, index, columns])

data = [[1,2,3],[4,5,6],[7,8,9]]
d1 = pd.DataFrame(data)
d1

	0	1	2
0	1	2	3
1	4	5	6
2	7	8	9

my_index = ['row1', 'row2', 'row3']
my_col = ['col1', 'col2', 'col3']
d2 = pd.DataFrame(data, index=my_index, columns=my_col)
d2

	col1	col2	col3
row1	1	2	3
row2	4	5	6
row3	7	8	9

d2.index

Index(['row1', 'row2', 'row3'], dtype='object')

d2.columns

Index(['col1', 'col2', 'col3'], dtype='object')

d2.values

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]], dtype=int64)
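Once the rows and columns have labels, individual pieces can be selected by label with `loc` or by position with `iloc`. A short sketch that rebuilds the same labeled DataFrame (the variable names `col`, `row`, `cell`, and `cell_pos` are only for illustration):

```python
import pandas as pd

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
d2 = pd.DataFrame(data,
                  index=['row1', 'row2', 'row3'],
                  columns=['col1', 'col2', 'col3'])

col = d2['col2']               # one column, returned as a Series
row = d2.loc['row2']           # one row by label, returned as a Series
cell = d2.loc['row3', 'col1']  # a single value by row and column label
cell_pos = d2.iloc[0, 2]       # a single value by integer position

print(col.tolist(), row.tolist(), cell, cell_pos)
```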

3. Data Operations

a=pd.Series([1,2,3,4])
b=pd.Series([5,6,7,8])
a+b

0     6
1     8
2    10
3    12
dtype: int64

a-b

0   -4
1   -4
2   -4
3   -4
dtype: int64

a*b

0     5
1    12
2    21
3    32
dtype: int64

b/a

0    5.000000
1    3.000000
2    2.333333
3    2.000000
dtype: float64

a=pd.Series([1,2,3])
b=pd.Series([5,6,7,8])  # where an index label has no match in the other Series, the result is NaN
a+b

0     6.0
1     8.0
2    10.0
3     NaN
dtype: float64
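If NaN is not the desired result for unmatched labels, the method form of the operator accepts a `fill_value`. This is a sketch of one option, not something the post itself uses. The missing side is treated as 0 before adding:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([5, 6, 7, 8])

plain = a + b                    # index 3 exists only in b, so the result there is NaN
filled = a.add(b, fill_value=0)  # treat the missing value in a as 0 instead

print(plain.isna().tolist())
print(filled.tolist())
```

The result is still float64, because the alignment step introduces a missing value before the fill is applied.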

d1 = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
d1

	A	B	C
0	1	4	7
1	2	5	8
2	3	6	9

d2 = pd.DataFrame({'A':[11,22], 'B':[33,44], 'C':[55,66]})
d2

	A	B	C
0	11	33	55
1	22	44	66

d1+d2

	A	B	C
0	12.0	37.0	62.0
1	24.0	49.0	74.0
2	NaN	NaN	NaN
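The same alignment rule applies to DataFrames: row 2 exists only in d1, so every column is NaN there and the whole result becomes float. One way to keep only the rows present in both frames, shown as a sketch (the name `both` is made up here), is to drop the incomplete rows and cast back to integers:

```python
import pandas as pd

d1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
d2 = pd.DataFrame({'A': [11, 22], 'B': [33, 44], 'C': [55, 66]})

total = d1 + d2                    # row 2 is NaN in every column
both = total.dropna().astype(int)  # keep fully matched rows, restore an integer dtype

print(both)
```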

names = ['aaa', 'bbb', 'ccc']
# columns are subjects: 국어 (Korean), 영어 (English), 수학 (math), 사회 (social studies), 과학 (science)
d = {'국어':[54,65,76], '영어':[67,56,45], '수학':[98,78,76], '사회':[98,76,45], '과학':[89,97,56]}
d3 = pd.DataFrame(d, index=names)
d3

	국어	영어	수학	사회	과학
aaa	54	67	98	98	89
bbb	65	56	78	76	97
ccc	76	45	76	45	56

4. Statistical Functions

sum(): sum
mean(): mean
std(): standard deviation
var(): variance
min(): minimum
max(): maximum
cumsum(): cumulative sum
cumprod(): cumulative product
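The cumulative functions and the spread functions from the list above are easiest to see on a tiny Series. A small sketch follows (the variable names are chosen for this example). Note that pandas computes `std()` and `var()` as sample statistics by default (`ddof=1`), and `ddof=0` gives the population version:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

running_sum = s.cumsum()    # 1, 1+2, 1+2+3, 1+2+3+4
running_prod = s.cumprod()  # 1, 1*2, 1*2*3, 1*2*3*4

sample_var = s.var()            # divides by n - 1 (default ddof=1)
population_var = s.var(ddof=0)  # divides by n

print(running_sum.tolist())
print(running_prod.tolist())
print(sample_var, population_var)
```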

d3.sum()

국어    195
영어    168
수학    252
사회    219
과학    242
dtype: int64

d3.mean()  # mean of each column

국어    65.000000
영어    56.000000
수학    84.000000
사회    73.000000
과학    80.666667
dtype: float64

d3.mean(axis=1)  # mean of each row

aaa    81.2
bbb    74.4
ccc    59.6
dtype: float64
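Row-wise results like these are often attached back to the table. As a sketch, `assign` returns a new DataFrame with extra columns. The column names `total` and `average` are assumptions made for this example, and d3 itself is left unchanged:

```python
import pandas as pd

names = ['aaa', 'bbb', 'ccc']
d = {'국어': [54, 65, 76], '영어': [67, 56, 45], '수학': [98, 78, 76],
     '사회': [98, 76, 45], '과학': [89, 97, 56]}
d3 = pd.DataFrame(d, index=names)

# Both new columns are computed from the original five subject columns
report = d3.assign(total=d3.sum(axis=1), average=d3.mean(axis=1))

print(report)
```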

d3.describe()

	국어	영어	수학	사회	과학
count	3.0	3.0	3.000000	3.000000	3.000000
mean	65.0	56.0	84.000000	73.000000	80.666667
std	11.0	11.0	12.165525	26.627054	21.733231
min	54.0	45.0	76.000000	45.000000	56.000000
25%	59.5	50.5	77.000000	60.500000	72.500000
50%	65.0	56.0	78.000000	76.000000	89.000000
75%	70.5	61.5	88.000000	87.000000	93.000000
max	76.0	67.0	98.000000	98.000000	97.000000
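`describe()` always returns the same fixed set of rows. When only a few statistics are needed, one option is `agg` with a list of function names, sketched below with `summary` as an illustrative variable name:

```python
import pandas as pd

names = ['aaa', 'bbb', 'ccc']
d = {'국어': [54, 65, 76], '영어': [67, 56, 45], '수학': [98, 78, 76],
     '사회': [98, 76, 45], '과학': [89, 97, 56]}
d3 = pd.DataFrame(d, index=names)

# One row per requested statistic, one column per subject
summary = d3.agg(['min', 'max', 'mean'])

print(summary)
```

The values should agree with the matching rows of the `describe()` table above.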

Changmin Lucas Lee
