0

I am trying to assign each game in the NFL a value for the week in which they occur. Example for the 2008 season all the games that occur in the range between the 4th and 10th of September occur in week 1

i = 0
week = 1
start_date = df2008['date'].iloc[0]
end_date = df2008['date'].iloc[-1]
week_range = pd.interval_range(start=start_date, end=end_date, freq='7D', closed='left')
for row in df2008['date']:
   row = row.date()
   if row in week_range[i]:
        df2008['week'] = week
   else:
       week += 1

However, this is updating all of the games to week 1

           date  week
1601 2008-09-04     1
1602 2008-09-07     1
1603 2008-09-07     1
1604 2008-09-07     1
1605 2008-09-07     1
...         ...   ...
1863 2009-01-11     1
1864 2009-01-11     1
1865 2009-01-18     1
1866 2009-01-18     1
1867 2009-02-01     1

I have tried using print statements to debug and these are my results. "In Range" are games that occur in week 1 and are returning as expected.

In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
Not In Range
Not In Range
Not In Range
Not In Range
Not In Range
Not In Range

df_sample:

    display(df2008[['date', 'home', 'away', 'week']])

    date    home    away    week
1601    2008-09-04  Giants  Redskins    1
1602    2008-09-07  Falcons Lions   1
1603    2008-09-07  Bills   Seahawks    1
1604    2008-09-07  Titans  Jaguars 1
1605    2008-09-07  Dolphins    Jets    1
... ... ... ... ...
1863    2009-01-11  Giants  Eagles  1
1864    2009-01-11  Steelers    Chargers    1
1865    2009-01-18  Cardinals   Eagles  1
1866    2009-01-18  Steelers    Ravens  1
1867    2009-02-01  Cardinals   Steelers    1

Can anyone point out where I am going wrong?

  • 1
    In your loop, `row = row.date()` I don't think that line in necessary. What does your df look like? – jpf5046 Dec 06 '19 at 18:36

2 Answers2

0

Consider avoiding any looping and use pandas.Series.dt.week on datetime fields which returns week in year. Then, subtract from the first week. However, a wrinkle occurs when considering the new year so must handle conditionally by adding difference of end of year and then weeks of new year. Fortunately, weeks start on Monday (so Thursday - Sunday maintain same week number).

first_week = pd.Series(pd.to_datetime(['2008-09-04'])).dt.week.values

# FIND LAST SUNDAY OF YEAR (NOT NECESSARILY DEC 31)
end_year_week = pd.Series(pd.to_datetime(['2008-12-28'])).dt.week.values   

new_year_week = pd.Series(pd.to_datetime(['2009-01-01'])).dt.week.values

# CONDITIONALLY ASSIGN    
df2008['week'] = np.where(df2008['date'] < '2009-01-01', 
                          (df2008['date'].dt.week - first_week) + 1,
                          ((end_year_week - first_week) + ((df2008['date'].dt.week - new_year_week) + 1))
                          )

To demonstrate with random seeded data (including new year dates). Will replace for OP's reproducible sample.

Data

import numpy as np
import pandas as pd

### DATA BUILD
np.random.seed(120619)
df2008 = pd.DataFrame({'group': np.random.choice(['sas', 'stata', 'spss', 'python', 'r', 'julia'], 500),
                       'int': np.random.randint(1, 10, 500),
                       'num': np.random.randn(500),
                       'char': [''.join(np.random.choice(list('ABC123'), 3)) for _ in range(500)],
                       'bool': np.random.choice([True, False], 500),
                       'date': np.random.choice(pd.date_range('2008-09-04', '2009-01-06'), 500)
                      })

Calculation

first_week = pd.Series(pd.to_datetime(['2008-09-04'])).dt.week.values

end_year_week = pd.Series(pd.to_datetime(['2008-12-28'])).dt.week.values

new_year_week = pd.Series(pd.to_datetime(['2009-01-01'])).dt.week.values

df2008['week'] = np.where(df2008['date'] < '2008-12-28', 
                          (df2008['date'].dt.week - first_week) + 1,
                          ((end_year_week - first_week) + ((df2008['date'].dt.week - new_year_week) + 1))
                          )

df2008 = df2008.sort_values('date').reset_index(drop=True)

print(df2008.head(10))
#     group  int       num char   bool       date  week
# 0     sas    2  0.099927  A2C  False 2008-09-04     1
# 1  python    3  0.241393  2CB  False 2008-09-04     1
# 2  python    8  0.516716  ABC  False 2008-09-04     1
# 3    spss    2  0.974715  3CB  False 2008-09-04     1
# 4   stata    9 -1.582096  CAA   True 2008-09-04     1
# 5     sas    3  0.070347  1BB  False 2008-09-04     1
# 6       r    5 -0.419936  1CA   True 2008-09-05     1
# 7  python    6  0.628749  1AB   True 2008-09-05     1
# 8  python    3  0.713695  CA1  False 2008-09-05     1
# 9  python    1 -0.686137  3AA  False 2008-09-05     1

print(df2008.tail(10))    
#       group  int       num char   bool       date  week
# 490    spss    5 -0.548257  3CC   True 2009-01-04    17
# 491   julia    8 -0.176858  AA2  False 2009-01-05    18
# 492   julia    5 -1.422237  A1B   True 2009-01-05    18
# 493   stata    2 -1.710138  BB2   True 2009-01-05    18
# 494  python    4 -0.285249  1B1   True 2009-01-05    18
# 495    spss    3  0.918428  C23   True 2009-01-06    18
# 496       r    5 -1.347936  1AC  False 2009-01-06    18
# 497   stata    3  0.883093  1C3  False 2009-01-06    18
# 498  python    9  0.448237  12A   True 2009-01-06    18
# 499    spss    3  1.459097  2A1  False 2009-01-06    18
Parfait
  • 104,375
  • 17
  • 94
  • 125
0

OP's original question was: "Can anyone point out where I am going wrong?", so - though as Parfait pointed out using pandas.Series.dt.week is a fine pandas solution - to help him to find answer to it, I followed OP's original code logic, with some fixing:

import pandas as pd

i = 0
week = 1

df2008 = pd.DataFrame({"date": [pd.Timestamp("2008-09-04"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2009-01-11"), pd.Timestamp("2009-01-11"), pd.Timestamp("2009-01-18"), pd.Timestamp("2009-01-18"), pd.Timestamp("2009-02-01")],
"home": ["Giants", "Falcon", "Bills", "Titans", "Dolphins", "Giants", "Steelers", "Cardinals", "Steelers", "Cardinals"],
"away": ["Falcon", "Bills", "Titans", "Dolphins", "Giants", "Steelers", "Cardinals", "Steelers", "Cardinals", "Ravens"]
})

i = 0
week = 1
start_date = df2008['date'].iloc[0]
#end_date = df2008['date'].iloc[-1]
end_date = pd.Timestamp("2009-03-01")

week_range = pd.interval_range(start=start_date, end=end_date, freq='7D', closed='left')

df2008['week'] = None
for i in range(len(df2008['date'])):
    rd = df2008.loc[i, 'date'].date()

    while True:
        if week == len(week_range):
            break
        if rd in week_range[week - 1]:
            df2008.loc[i, 'week'] = week
            break
        else:
            week += 1

print(df2008)

Out:

        date       home       away  week
0 2008-09-04     Giants     Falcon     1
1 2008-09-07     Falcon      Bills     1
2 2008-09-07      Bills     Titans     1
3 2008-09-07     Titans   Dolphins     1
4 2008-09-07   Dolphins     Giants     1
5 2009-01-11     Giants   Steelers    19
6 2009-01-11   Steelers  Cardinals    19
7 2009-01-18  Cardinals   Steelers    20
8 2009-01-18   Steelers  Cardinals    20
9 2009-02-01  Cardinals     Ravens    22
Geeocode
  • 5,705
  • 3
  • 20
  • 34