# pandas

2025-07-21

pandas is an open-source library that makes working with tabular data easy.
- numpy: good at arrays/matrix operations (e.g. all float values)
- pandas: good at heterogeneous data (e.g. some float, some character, etc.)

## pandas Data Structures

We'll always use the `pd` prefix for pandas. We'll also import numpy with its standard prefix, `np`.
## Series

pandas has a similar object to np.array – the Series:
0    4
1    7
2   -5
3    3
dtype: int64
The index allows us more descriptive access to values:
np.int64(-5)
We can still use boolean indexing without removing the index link:
d    6
b    7
c    3
dtype: int64
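The outputs above can be reproduced as follows (labels and values are taken from the output shown):

```python
import pandas as pd

# A Series with an explicit index
obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj['a']       # -5: access a value by its label
obj['d'] = 6   # assignment by label works too
obj[obj > 0]   # boolean indexing keeps the index: d, b, c survive
```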
Note that the index is still there, and both are accessible as Series attributes (.array and .index):
<NumpyExtensionArray>
[np.int64(6), np.int64(7), np.int64(3)]
Length: 3, dtype: int64
## Series from dict

Since a Series is very similar to a dict, it's logical that we can create one from a dict:
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
We can pass an index directly in the creation of a Series object:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
Why do we have a NaN (not a number)? Where is Utah’s entry?
Use the pd.isna function to identify these values:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
We often want to look at arrays that have null values removed:
Ohio      35000.0
Oregon    16000.0
Texas     71000.0
dtype: float64
NaN values are ignored by default in pandas aggregation methods (the opposite of R's default):
np.float64(122000.0)
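Putting the missing-data pieces together (values taken from the outputs shown):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj = pd.Series(sdata, index=states)  # Utah dropped; California has no value -> NaN

pd.isna(obj)   # True only for California
obj.dropna()   # removes the missing entry
obj.sum()      # 122000.0 -- NaN is skipped by default
```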
## DataFrame

The power of pandas is in the DataFrame object. There are many ways to construct a DataFrame, but a common one is passing a dict:
|   | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 | 
| 1 | Ohio | 2001 | 1.7 | 
| 2 | Ohio | 2002 | 3.6 | 
| 3 | Nevada | 2001 | 2.4 | 
| 4 | Nevada | 2002 | 2.9 | 
| 5 | Nevada | 2003 | 3.2 | 
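The DataFrame above can be built like so (values taken from the table shown):

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)  # dict keys become column names
frame.head(2)               # peek at the first two rows
```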
## head and tail

|   | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 | 
| 1 | Ohio | 2001 | 1.7 | 
## numpy Array as DataFrame

Look up pd.DataFrame's documentation and create a \(1000 \times 26\) DataFrame where each column has a letter of the alphabet as its header, and its rows are 1000 Poisson random variables.
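One possible solution sketch (the seed and the Poisson rate lam=1 are arbitrary choices, not specified by the exercise):

```python
import string
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed is arbitrary
df = pd.DataFrame(rng.poisson(lam=1, size=(1000, 26)),
                  columns=list(string.ascii_lowercase))
df.head(3)
```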
|   | a | b | c | d | e | f | g | h | i | j | ... | q | r | s | t | u | v | w | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 4 | 1 | 1 | 1 | 0 | 2 | 1 | 0 | 
| 1 | 1 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 4 | 2 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 
| 2 | 1 | 0 | 2 | 0 | 1 | 2 | 0 | 3 | 0 | 0 | ... | 0 | 2 | 3 | 0 | 0 | 2 | 1 | 1 | 0 | 2 | 
3 rows × 26 columns
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
Bracket indexing (frame['state']) is the safer of the two – attribute access (frame.state) will not work with column names like "my var".
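A minimal illustration of the two access styles (the column names here are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'my var': [1, 2]})
frame['state']    # bracket indexing always works
frame.state       # attribute access works for valid identifiers...
frame['my var']   # ...but only brackets can handle the space in 'my var'
```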
|   | state | year |
|---|---|---|
| 0 | Ohio | 2000 | 
| 1 | Ohio | 2001 | 
| 2 | Ohio | 2002 | 
| 3 | Nevada | 2001 | 
| 4 | Nevada | 2002 | 
| 5 | Nevada | 2003 | 
Note that this is a DataFrame, while the previous objects were Series
|   | state |
|---|---|
| 0 | Ohio | 
| 1 | Ohio | 
| 2 | Ohio | 
| 3 | Nevada | 
| 4 | Nevada | 
| 5 | Nevada | 
|   | state | year | pop | debt | debt2 |
|---|---|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 | 1.5 | 0 | 
| 1 | Ohio | 2001 | 1.7 | 1.5 | 1 | 
| 2 | Ohio | 2002 | 3.6 | 1.5 | 2 | 
| 3 | Nevada | 2001 | 2.4 | 1.5 | 3 | 
| 4 | Nevada | 2002 | 2.9 | 1.5 | 4 | 
| 5 | Nevada | 2003 | 3.2 | 1.5 | 5 | 
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
Suppose we want to set a new index on this Series. What happens?
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
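The reindexing above can be reproduced as follows (values taken from the output shown):

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])  # 'e' has no existing value -> NaN
```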
Why is a NaN induced?
## reindex for Economic Data

|   | price | quantity |
|---|---|---|
| 1 | 0.703893 | 0.824737 | 
| 3 | 0.398582 | 0.100284 | 
| 5 | 0.423146 | 0.603552 | 
## reindex Behavior

What should we do if we want one observation per day?
## ffill

How should we interpolate day 2's data?
## pandas Indexing (numpy Notation)

We can filter data via the same notation as in numpy:
|   | price | quantity |
|---|---|---|
| 1 | 0.703893 | 0.824737 | 
| 2 | 0.703893 | 0.824737 | 
| 3 | 0.398582 | 0.100284 | 
| 4 | 0.398582 | 0.100284 | 
| 5 | 0.423146 | 0.603552 | 
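A sketch of the forward-fill reindex (the price/quantity numbers here are hypothetical, not the slide's random draws):

```python
import pandas as pd

# Data observed only on days 1, 3, and 5
prices = pd.DataFrame({'price': [0.70, 0.40, 0.42],
                       'quantity': [0.82, 0.10, 0.60]},
                      index=[1, 3, 5])
daily = prices.reindex(range(1, 6), method='ffill')  # carry day 1 forward into day 2, etc.
```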
## pandas Indexing (.loc)

The preferred method for subsetting data in pandas is the .loc method:
|   | price | quantity |
|---|---|---|
| 1 | 0.703893 | 0.824737 | 
| 2 | 0.703893 | 0.824737 | 
| 3 | 0.398582 | 0.100284 | 
| 4 | 0.398582 | 0.100284 | 
| 5 | 0.423146 | 0.603552 | 
Note that we were previously using boolean indexing – we can, of course, also index using the index itself:
|   | price | quantity |
|---|---|---|
| 1 | 0.703893 | 0.824737 | 
| 3 | 0.398582 | 0.100284 | 
| 5 | 0.423146 | 0.603552 | 
1    0.824737
2    0.824737
3    0.100284
4    0.100284
5    0.603552
Name: quantity, dtype: float64
|   | price | quantity |
|---|---|---|
| 0 | 0.000000 | 0.000000 | 
| 1 | 0.703893 | 0.824737 | 
| 2 | 0.703893 | 0.824737 | 
| 3 | 0.398582 | 0.100284 | 
| 4 | 0.398582 | 0.100284 | 
| 5 | 0.423146 | 0.603552 | 
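The .loc patterns above, in one place (hypothetical numbers standing in for the slide's random draws):

```python
import pandas as pd

daily = pd.DataFrame({'price': [0.70, 0.70, 0.40, 0.40, 0.42],
                      'quantity': [0.82, 0.82, 0.10, 0.10, 0.60]},
                     index=[1, 2, 3, 4, 5])
daily.loc[daily['price'] > 0.5]   # boolean indexing
daily.loc[[1, 3, 5]]              # indexing by index labels
daily.loc[:, 'quantity']          # a single column as a Series
daily.loc[0] = 0.0                # create a new row of zeros by label
```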
## ufuncs with pandas

|   | b | d | e |
|---|---|---|---|
| Utah | -0.175078 | 0.083681 | -1.221911 | 
| Ohio | 1.166406 | 1.269834 | 0.864876 | 
| Texas | 0.083402 | -0.828317 | -0.793863 | 
| Oregon | -0.154846 | -0.468383 | -0.709768 | 
## np.abs

|   | b | d | e |
|---|---|---|---|
| Utah | 0.175078 | 0.083681 | 1.221911 | 
| Ohio | 1.166406 | 1.269834 | 0.864876 | 
| Texas | 0.083402 | 0.828317 | 0.793863 | 
| Oregon | 0.154846 | 0.468383 | 0.709768 | 
## apply

b    1.341484
d    2.098151
e    2.086787
dtype: float64
## apply along rows

Utah      1.305592
Ohio      0.404958
Texas     0.911719
Oregon    0.554922
dtype: float64
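A sketch of the ufunc and apply examples above (index and column labels from the tables shown; the data is a fresh random draw, so values will differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
np.abs(frame)                                                 # ufuncs work elementwise
col_range = frame.apply(lambda x: x.max() - x.min())          # one value per column
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)  # one value per row
```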
Use apply to take the logarithm of a column in your DataFrame (use np.log). See the .map method for how you could do this element-wise if you wanted to. Then take your \(1000 \times 26\) DataFrame and return the sample mean and variance for each column. Use apply with numpy for one, and only numpy for the other. Is there a time difference?
(26,)
CPU times: user 939 μs, sys: 80 μs, total: 1.02 ms
Wall time: 967 μs
## Sorting in pandas

d    0
a    1
b    2
c    3
dtype: int64
We also want to sort data by specific variables that are not the index:
d    0
a    1
b    2
c    3
dtype: int64
As we’ve seen, we can call aggregating functions on DataFrame objects to get some summary stats
|   | one | two |
|---|---|---|
| a | 1.40 | NaN | 
| b | 7.10 | -4.5 | 
| c | NaN | NaN | 
| d | 0.75 | -1.3 | 
one    9.25
two   -5.80
dtype: float64
What are these methods doing with NaN values?
A fuller list of summary statistics methods for pandas is available here.
## DataFrame.sum versus np.sum

Note that np.sum will have different default behavior depending on the object that it's called on:
np.int64(25866)
## describe

We can also use the built-in describe method:
|   | one | two |
|---|---|---|
| count | 3.000000 | 2.000000 | 
| mean | 3.083333 | -2.900000 | 
| std | 3.493685 | 2.262742 | 
| min | 0.750000 | -4.500000 | 
| 25% | 1.075000 | -3.700000 | 
| 50% | 1.400000 | -2.900000 | 
| 75% | 4.250000 | -2.100000 | 
| max | 7.100000 | -1.300000 | 
## describe with Non-numeric Data

count     16
unique     3
top        a
freq       8
dtype: object
## unique

We often want to know the unique (or distinct) values in a Series (equivalently, a column in our DataFrame):
array(['c', 'a', 'd', 'b'], dtype=object)
## isin

We also can check which elements of a Series object match certain values:
0    c
5    b
6    b
7    c
8    c
dtype: object
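Both methods, with data matching the outputs shown:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj.unique()               # distinct values, in order of appearance
obj[obj.isin(['b', 'c'])]  # keep only elements equal to 'b' or 'c'
```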
## Reading Data

A strength of pandas is that it has methods for reading an assortment of stored data into a DataFrame.

## pd.read_csv

Reads data from a comma-separated value (CSV) file. The data looks like this:
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | europe | 
| 1 | 3 | 5 | asia | 
| 2 | -2 | 6 | north america | 
## pd.read_csv Arguments

- sep: what the delimiter in the file is ("," and "|" are common)
- na_values: list of strings to coerce to NaN (e.g. "NULL", "N/A", "NA")
- names: pass your own column names instead of inferring from the data
- skiprows: tell pandas to not read in certain rows of the file
- index_col: tell pandas which column to use as the index

## pd.read_excel

Reads in data from a specific Excel sheet (remember you can have multiple sheets in a workbook):
|   | FY 2014 | FY 2015 |
|---|---|---|
| 0 | 80540.0 | 78570.00 | 
| 1 | 1813246.0 | 1813246.00 | 
| 2 | 39560.0 | 19995.65 | 
| 3 | 465407.0 | 460904.00 | 
| 4 | 223206.0 | 208056.00 | 
## pd.read_excel Arguments

- sheet_name: if a workbook has multiple sheets, access the one that you want
- names: pass your own column names instead of inferring from the data
- thousands/decimal: tell pandas what the separator for thousands or decimals is if you want to parse a text field to numeric
- index_col: tell pandas which column to use as the index

## pd.read_html

If a webpage contains tables, we can tell pandas to read them directly into a list of DataFrame objects; use the match argument to get the table you want.

|   | Course | Course Title (click for details) | SLN | Instructor | Meeting Time |
|---|---|---|---|---|---|
| 0 | ECON 200 A | Introduction to Microeconomics | 14207 | Melissa Knox | TTh 10:00am - 11:20am | 
| 1 | ECON 200 B | Introduction to Microeconomics | 14218 | Yael Midnight | TTh 8:30am - 9:50am | 
| 2 | ECON 200 C | Introduction to Microeconomics | 14229 | NaN | TTh 4:30pm - 6:20pm | 
## pd.read_html Arguments

Many arguments are shared with pd.read_csv and pd.read_excel.

- attrs: a dictionary of attributes to identify a table in the webpage's HTML
Recall the .unique() method for getting the unique values from a Series – similar questions come up for whole DataFrames:

|   | k1 | k2 |
|---|---|---|
| 0 | one | 1 | 
| 1 | two | 1 | 
| 2 | one | 2 | 
| 3 | two | 3 | 
| 4 | one | 3 | 
| 5 | two | 4 | 
| 6 | two | 4 | 
If we want to bin data, it can be surprisingly tricky to code – luckily, pandas has a built-in method for us:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
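The binning above can be reproduced as follows (the ages are hypothetical values consistent with the output shown):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)  # right-closed intervals like (18, 25] by default
```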
Let’s return to our Econ department data
0     ECON 200 A
1     ECON 200 B
2     ECON 200 C
3     ECON 200 D
4     ECON 200 E
5     ECON 200 F
6     ECON 201 A
7     ECON 201 B
8     ECON 300 A
9     ECON 300 B
10    ECON 300 C
11    ECON 300 D
12    ECON 300 E
13    ECON 301 A
14    ECON 301 B
15    ECON 345 A
16    ECON 382 A
17    ECON 399 A
18    ECON 400 A
19    ECON 400 B
20    ECON 409 A
21    ECON 411 A
22    ECON 422 A
23    ECON 425 A
24    ECON 425 B
25    ECON 436 A
26    ECON 447 A
27    ECON 448 A
28    ECON 450 A
29    ECON 471 A
30    ECON 481 A
31    ECON 483 A
32    ECON 483 B
33    ECON 485 A
34    ECON 487 A
35    ECON 496 A
36    ECON 499 A
37    ECON 500 A
38    ECON 502 A
39    ECON 516 A
40    ECON 580 A
41    ECON 583 A
42    ECON 587 A
43    ECON 590 A
44    ECON 592 A
45    ECON 600 A
46    ECON 601 A
47    ECON 602 A
48    ECON 800 A
Name: Course, dtype: object
How could we classify courses as 200-level, 300-level, 400-level, and graduate level?
## str.slice

|   | Course | course_no |
|---|---|---|
| 0 | ECON 200 A | 200 | 
| 1 | ECON 200 B | 200 | 
| 2 | ECON 200 C | 200 | 
| 3 | ECON 200 D | 200 | 
How can we check?
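The slice-and-check pattern, on a small hypothetical subset of the course data:

```python
import pandas as pd

df = pd.DataFrame({'Course': ['ECON 200 A', 'ECON 200 B', 'ECON 201 A', 'ECON 800 A']})
df['course_no'] = df['Course'].str.slice(5, 8)  # characters 5-7 hold the course number
df['course_no'].value_counts()                  # check: how often does each number appear?
```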
course_no
200    6
300    5
301    2
483    2
201    2
400    2
425    2
583    1
500    1
502    1
516    1
580    1
587    1
496    1
590    1
592    1
600    1
601    1
602    1
499    1
481    1
487    1
485    1
471    1
450    1
448    1
447    1
436    1
422    1
411    1
409    1
399    1
382    1
345    1
800    1
Name: count, dtype: int64
When would this approach fail?
## np.where

|   | Course | course_no | classification |
|---|---|---|---|
| 6 | ECON 201 A | 201 | 200 level | 
| 3 | ECON 200 D | 200 | 200 level | 
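One way to build the classification column – nested np.where is one option (np.select is an alternative); the course numbers here are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'course_no': ['200', '201', '300', '425', '502', '800']})
no = df['course_no'].astype(int)
df['classification'] = np.where(no < 300, '200 level',
                        np.where(no < 400, '300 level',
                         np.where(no < 500, '400 level', 'graduate level')))
```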
## Regular Expressions

We can use a regular expression with pandas to grab the 3-digit substring within any larger string: \d{3} (match three digits). Regular expressions are extremely useful in working with string data in pandas.
|   | Course | course_no |
|---|---|---|
| 0 | ECON 200 A | 200 | 
| 1 | ECON 200 B | 200 | 
| 2 | ECON 200 C | 200 | 
| 3 | ECON 200 D | 200 | 
Some useful regex building blocks:

- [A-Z], [a-z], [A-Za-z]: character classes
- ., *, +: any character; zero or more repetitions; one or more repetitions
- \s or a literal space: whitespace
- (?<=a), (?=b): lookbehind and lookahead
- |: alternation ("or")

Write a regular expression that will match each element in its entirety in these arrays:
(As always, assume the np and pd prefixes.) A useful engine for practice: regex101.
Hint: check out what ? does.
0    285a
1    2a86
2     44b
3     abc
Name: 0, dtype: object
0      1 + 1 = 2
1    10 + 2 = 12
2        3+11=14
Name: 0, dtype: object
## Combining Data

When we want to link data in two DataFrames by keys, we will join or merge them:
(  key  data1
 0   b      0
 1   b      1
 2   a      2
 3   c      3
 4   a      4
 5   a      5
 6   b      6,
   key  data2
 0   a      0
 1   b      1
 2   d      2)
## pd.merge

|   | key | data1 | data2 |
|---|---|---|---|
| 0 | b | 0 | 1 | 
| 1 | b | 1 | 1 | 
| 2 | a | 2 | 0 | 
| 3 | a | 4 | 0 | 
| 4 | a | 5 | 0 | 
| 5 | b | 6 | 1 | 
How does it know how to join? Do we have all of our original observations?
A best practice is to explicitly state the columns you want to join on
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | b | 0 | 1 | 
| 1 | b | 1 | 1 | 
| 2 | a | 2 | 0 | 
| 3 | a | 4 | 0 | 
| 4 | a | 5 | 0 | 
| 5 | b | 6 | 1 | 
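The merge above, with the keys from the data shown:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
merged = pd.merge(df1, df2, on='key')  # inner join by default: unmatched 'c' and 'd' drop out
```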
|   | key1 | data1 | key2 | data2 |
|---|---|---|---|---|
| 0 | b | 0 | b | 1 | 
| 1 | b | 1 | b | 1 | 
| 2 | a | 2 | a | 0 | 
| 3 | a | 4 | a | 0 | 
| 4 | a | 5 | a | 0 | 
| 5 | b | 6 | b | 1 | 
|   | key1 | data1 | key2 | data2 |
|---|---|---|---|---|
| 0 | a | 2 | a | 0 | 
| 1 | a | 4 | a | 0 | 
| 2 | a | 5 | a | 0 | 
| 3 | b | 0 | b | 1 | 
| 4 | b | 1 | b | 1 | 
| 5 | b | 6 | b | 1 | 
| 6 | c | 3 | NaN | <NA> | 
| 7 | NaN | <NA> | d | 2 | 
|   | key1 | data1 | key2 | data2 |
|---|---|---|---|---|
| 0 | b | 0 | b | 1 | 
| 1 | b | 1 | b | 1 | 
| 2 | a | 2 | a | 0 | 
| 3 | c | 3 | NaN | <NA> | 
| 4 | a | 4 | a | 0 | 
| 5 | a | 5 | a | 0 | 
| 6 | b | 6 | b | 1 | 
|   | key1 | data1 | key2 | data2 |
|---|---|---|---|---|
| 0 | a | 2 | a | 0 | 
| 1 | a | 4 | a | 0 | 
| 2 | a | 5 | a | 0 | 
| 3 | b | 0 | b | 1 | 
| 4 | b | 1 | b | 1 | 
| 5 | b | 6 | b | 1 | 
| 6 | NaN | <NA> | d | 2 | 
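The differently-named-key merges above, with the how argument controlling which rows survive:

```python
import pandas as pd

df3 = pd.DataFrame({'key1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'key2': ['a', 'b', 'd'], 'data2': range(3)})
outer = pd.merge(df3, df4, left_on='key1', right_on='key2', how='outer')  # keep 'c' and 'd'
left = pd.merge(df3, df4, left_on='key1', right_on='key2', how='left')    # keep all of df3
```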
(  key  data1
 0   b      0
 1   b      1
 2   a      2
 3   c      3
 4   a      4
 5   b      5,
   key  data2
 0   a      0
 1   b      1
 2   a      2
 3   b      3
 4   d      4)
   key  data1  data2
0    b      0      1
1    b      0      3
2    b      1      1
3    b      1      3
4    a      2      0
5    a      2      2
6    c      3   <NA>
7    a      4      0
8    a      4      2
9    b      5      1
10   b      5      3
  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   a      2      0
5   a      2      2
6   a      4      0
7   a      4      2
8   b      5      1
9   b      5      3
## join

To merge on the index of a DataFrame, we can use join:
(   Ohio  Nevada
 a     1       2
 c     3       4
 e     5       6,
    Missouri  Alabama
 b         7        8
 c         9       10
 d        11       12
 e        13       14)
   Ohio  Nevada  Missouri  Alabama
c     3       4         9       10
e     5       6        13       14
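The index-based join above, with the data shown:

```python
import pandas as pd

left = pd.DataFrame({'Ohio': [1, 3, 5], 'Nevada': [2, 4, 6]}, index=['a', 'c', 'e'])
right = pd.DataFrame({'Missouri': [7, 9, 11, 13], 'Alabama': [8, 10, 12, 14]},
                     index=['b', 'c', 'd', 'e'])
joined = left.join(right, how='inner')  # match on the index: only 'c' and 'e' overlap
```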
## Exercise

Create a DataFrame that has every combination of the integers from 1-10 and their names as strings (i.e. "one", "two", "three", etc.) – you should have 100 combinations.

(100, 2)
(30, 4)
     ints  chars  n_chars  exp
key                           
1       3    one        3   27
1       3    two        3   27
1       3  three        5  243
1       3   four        4   81
1       3   five        4   81
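A sketch of the combinations part of this exercise (a cross merge is one way to get the cartesian product; column names here match the output shown):

```python
import pandas as pd

names = ['one', 'two', 'three', 'four', 'five',
         'six', 'seven', 'eight', 'nine', 'ten']
ints = pd.DataFrame({'ints': range(1, 11)})
chars = pd.DataFrame({'chars': names})
combos = pd.merge(ints, chars, how='cross')  # cartesian product: 100 rows
combos.shape  # (100, 2)
```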
We often read in multiple files with pd.read_csv(...) and then combine them.

## pd.concat

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: Int64
Note that we passed a list to pd.concat
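A minimal concat, split to match the output above (how the pieces were split is an assumption):

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'], dtype='Int64')
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'], dtype='Int64')
s3 = pd.Series([5, 6], index=['f', 'g'], dtype='Int64')
stacked = pd.concat([s1, s2, s3])  # the argument is a list of objects, not separate arguments
```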
## Exercise

Build a DataFrame of all the Economics courses UW offered in 2023, using pd.read_html and pd.concat:

def download_econ_courses(year: int, quarter: str) -> pd.DataFrame:
    """
    Download Economics Course Offerings for `year` and `quarter`
    """
    url = f'https://econ.washington.edu/courses/{year}/{quarter}/all'
    df = pd.read_html(url, match = 'ECON Courses')[0]
    df['quarter'] = quarter
    return df
quarters = ['winter', 'spring', 'summer', 'autumn']
data_2023 = pd.concat([download_econ_courses(2023, x) for x in quarters])
print(data_2023.head(2))

       Course Course Title (click for details)    SLN     Instructor  \
0  ECON 200 A   Introduction to Microeconomics  13928   Melissa Knox   
1  ECON 200 C   Introduction to Microeconomics  13942  Mahtab Karimi   
          Meeting Time quarter  
0  TTh 8:30am - 9:50am  winter  
1  MW 8:30am - 10:20am  winter  
        Course Course Title (click for details)    SLN         Instructor  \
43  ECON 594 A                  Economic Growth  14241  Stephen Turnovsky   
44  ECON 600 A    Independent Study or Research  14242                NaN   
45  ECON 601 A                       Internship  14243                NaN   
46  ECON 602 A  Teaching Introductory Economics  14244      Yael Midnight   
47  ECON 800 A            Doctoral Dissertation  14245                NaN   
            Meeting Time quarter  
43  MW 10:30am - 11:50am  autumn  
44        to be arranged  autumn  
45        to be arranged  autumn  
46    W 9:30am - 11:20am  autumn  
47        to be arranged  autumn  
What if we wanted to know courses taught by a professor over the year (400 level)?
        Course            Course Title (click for details)    SLN  \
21  ECON 400 A                     Advanced Microeconomics  13971   
35  ECON 485 A  Game Theory with Applications to Economics  13986   
33  ECON 448 A                  Population and Development  13683   
      Instructor          Meeting Time quarter  
21  Melissa Knox   TTh 1:30pm - 3:20pm  winter  
35  Fahad Khalil  MW 10:30am - 12:20pm  winter  
33  Melissa Knox  TTh 8:30am - 10:20am  spring  
| Instructor | spring | winter |
|---|---|---|
| Fahad Khalil | NaN | ECON 485 A |
| Melissa Knox | ECON 448 A | ECON 400 A |
Will this work for the full dataset?
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: Index contains duplicate entries, cannot reshape
        Instructor quarter      Course
33   Alan Griffith  autumn  ECON 496 A
41   Alan Griffith  autumn  ECON 587 A
23   Alan Griffith  winter  ECON 410 A
47   Alan Griffith  winter  ECON 590 A
43   Brian Greaney  spring  ECON 509 A
..             ...     ...         ...
53      Yanqin Fan  spring  ECON 599 A
23    Yu-chin Chen  autumn  ECON 426 A
36    Yu-chin Chen  autumn  ECON 502 A
46  Yuya Takahashi  spring  ECON 534 A
52  Yuya Takahashi  spring  ECON 596 B
[67 rows x 3 columns]
To unpivot (melt), we need the key variable to not be the index:
quarter    Instructor      spring      winter
0        Fahad Khalil         NaN  ECON 485 A
1        Melissa Knox  ECON 448 A  ECON 400 A
Then we can melt:
     Instructor quarter       value
0  Fahad Khalil  spring         NaN
1  Melissa Knox  spring  ECON 448 A
2  Fahad Khalil  winter  ECON 485 A
3  Melissa Knox  winter  ECON 400 A
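The unpivot above can be sketched as follows (the wide table is rebuilt from the output shown):

```python
import pandas as pd

wide = pd.DataFrame({'Instructor': ['Fahad Khalil', 'Melissa Knox'],
                     'spring': [None, 'ECON 448 A'],
                     'winter': ['ECON 485 A', 'ECON 400 A']})
long = wide.melt(id_vars='Instructor', var_name='quarter')  # one row per (instructor, quarter)
```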
## groupby

The first step is telling pandas how to split (which groups to use):
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1476d8590>
This is our dataset, but with instructions about how to group
We can just call some methods directly on the grouped data (this will be fastest)
            Course Title (click for details)  SLN  Instructor  Meeting Time  \
Course                                                                        
ECON 200 A                                 4    4           4             4   
ECON 200 B                                 3    3           3             3   
ECON 200 C                                 2    2           2             2   
ECON 200 D                                 2    2           2             2   
ECON 200 E                                 2    2           2             2   
ECON 200 F                                 2    2           2             2   
ECON 201 A                                 4    4           4             4   
ECON 201 B                                 4    4           4             4   
ECON 201 C                                 2    2           2             2   
ECON 201 D                                 2    2           2             2   
            quarter  
Course               
ECON 200 A        4  
ECON 200 B        3  
ECON 200 C        2  
ECON 200 D        2  
ECON 200 E        2  
ECON 200 F        2  
ECON 201 A        4  
ECON 201 B        4  
ECON 201 C        2  
ECON 201 D        2  
Course
ECON 200 A    4
ECON 200 B    3
ECON 200 C    2
ECON 200 D    2
ECON 200 E    2
Name: Instructor, dtype: int64
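The split-then-aggregate pattern, on a hypothetical miniature of the course data:

```python
import pandas as pd

courses = pd.DataFrame({'Course': ['ECON 200 A'] * 4 + ['ECON 200 B'] * 3,
                        'Instructor': ['Knox'] * 4 + ['Midnight'] * 3})
grouped = courses.groupby('Course')      # only grouping instructions; no computation yet
counts = grouped['Instructor'].count()   # one value per group
```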
                           0
time                        
2024-04-05 13:30:00  169.540
2024-04-05 13:31:00  169.505
2024-04-05 13:32:00  169.300
2024-04-05 13:33:00  169.132
2024-04-05 13:34:00  169.030
                 time    price        date
0 2024-04-05 13:30:00  169.540  2024-04-05
1 2024-04-05 13:31:00  169.505  2024-04-05
2 2024-04-05 13:32:00  169.300  2024-04-05
3 2024-04-05 13:33:00  169.132  2024-04-05
4 2024-04-05 13:34:00  169.030  2024-04-05
              open    high     low   close
date                                      
2024-04-05  169.54  170.38  169.03  169.47
If we’re using agg, we often don’t like the names it spits out – it’s easy to change them, we just need to pass a tuple:
|   | date | price: average_price | price: sd |
|---|---|---|---|
| 0 | 2024-04-05 | 169.681324 | 0.281085 |
Note that the columns here are now hierarchical
MultiIndex([( 'date',              ''),
            ('price', 'average_price'),
            ('price',            'sd')],
           )
If we want to get columns from a level, we can do so:
Index(['', 'average_price', 'sd'], dtype='object')
We can also just flatten the index:
Index([('date', ''), ('price', 'average_price'), ('price', 'sd')], dtype='object')
              ohlc                                mean     min     max
              open    high     low   close       price   price   price
date                                                                  
2024-04-05  169.54  170.38  169.03  169.47  169.681324  169.03  170.38
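A sketch of named aggregation and column flattening (the four prices are taken from the minute-level output shown earlier):

```python
import pandas as pd

prices = pd.DataFrame({'date': ['2024-04-05'] * 4,
                       'price': [169.540, 170.380, 169.030, 169.470]})

# (name, function) tuples rename the aggregates and keep the columns flat
out = prices.groupby('date')['price'].agg([('average_price', 'mean'), ('sd', 'std')])

# with a dict of lists we get hierarchical columns, which we can then flatten
flat = prices.groupby('date').agg({'price': ['mean', 'std']})
flat.columns = flat.columns.to_flat_index()  # MultiIndex -> Index of tuples
```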
This gives you the flavor and use cases that I’ve most encountered, but if you want to get really in-depth (pandas can do some crazy stuff), check out McKinney’s chapter.