1

Pandas Filter by Date How can I filter a CSV with Dates

Example CSV

User    Dates       Hours   shift
User1   01.01.2012      5   aaa 
User1   02.01.2012      5   aaa
User1   03.01.2012      2   bbb
User1   04.01.2012      3   aaa
.....
User1   12.03.2012      1   aaa
User1   13.03.2012      8   ccc
.....
User2   04.02.2012      4   aaa
User2   05.02.2012      3   bbb

end so on

I can filter by a User with

use = users.loc["User1"]

also I can sum all Hours

print(use["Hours"].sum()

and I can count his shifts

counts = use.loc[ou['Shift'] == 'aaa', 'Hours'].value_counts()

But I dont know how I can filter by the Date and and the Statements above. Like count all shifts in March by User2 or Sum all Hours done in Feb by User1

More or less I manged to Filter the Table by Date and User with

use['Date'] = pd.to_datetime(use['Date'], infer_datetime_format=True, exact=True)
mask = (use['Datum'] > Start) & (use['Date'] <= End)
print(use.loc[mask])

But I cant figure out how to combine them. Desired Output

Overview March 2016
User1 made 3 aaa shifts
User1 worked 12h in March 2016

update: I made some progress

print(use[use['Date'] > '02.01.2012'],['hours'].sum()))

works fine but not exctly what I want. With:

print(use[use['Date'] > '02.01.2012'] & (use[use['Date'] < '02.05.2012'],['hours'].sum()))

I get

AttributeError: 'list' object has no attribute 'sum'
Kenny
  • 88
  • 3
  • 14

2 Answers2

1

I think you can use:

Start = '2012-01-01'
End = '2012-03-03'
use['Dates'] = pd.to_datetime(use['Dates'], dayfirst=True)
mask = (use['Dates'] > Start) & (use['Dates'] <= End) & (use['shift'] == 'aaa')
use1 = use.loc[mask]
print (use1)
    User      Dates  Hours shift
1  User1 2012-01-02      5   aaa
3  User1 2012-01-04      3   aaa
6  User2 2012-02-04      4   aaa

use1 = use.query('Dates > @Start and Dates <= @End and shift == "aaa"')
print (use1)
    User      Dates  Hours shift
1  User1 2012-01-02      5   aaa
3  User1 2012-01-04      3   aaa
6  User2 2012-02-04      4   aaa

print (mask.sum())
3

counts = use.loc[mask, 'Hours'].value_counts()
print (counts)
3    1
5    1
4    1
Name: Hours, dtype: int64

EDIT:

Start = '2012-01-01'
End = '2012-03-03'
use['Dates'] = pd.to_datetime(use['Dates'], dayfirst=True)
mask = (use['Dates'] > Start) & (use['Dates'] <= End)
use1 = use.loc[mask]
print (use1)
    User      Dates  Hours shift
1  User1 2012-01-02      5   aaa
2  User1 2012-01-03      2   bbb
3  User1 2012-01-04      3   aaa
6  User2 2012-02-04      4   aaa
7  User2 2012-02-05      3   bbb


counts = use1.groupby(['User','shift'])['Hours'].agg({'SUM':'sum', 'COUNT':'size'})
             .reset_index()
print (counts)
    User shift  SUM  COUNT
0  User1   aaa    8      2
1  User1   bbb    2      1
2  User2   aaa    4      1
3  User2   bbb    3      1

EDIT1:

if need more condition use loc:

print(use.loc[(use['Date'] > '02.01.2012') & (use['Date'] < '02.05.2012'),'hours'].sum())
0

All together:

use = pd.DataFrame({'Date': ['01.01.2012', '02.01.2012', '03.01.2012', '04.01.2012', '12.03.2012', '13.03.2012', '04.02.2012', '05.02.2012'], 'User': ['User1', 'User1', 'User1', 'User1', 'User1', 'User1', 'User2', 'User2'], 'hours': [5, 5, 2, 3, 1, 8, 4, 3], 'shift': ['aaa', 'aaa', 'bbb', 'aaa', 'aaa', 'ccc', 'aaa', 'bbb']})
print (use)

    User        Date  hours shift
0  User1  01.01.2012      5   aaa
1  User1  02.01.2012      5   aaa
2  User1  03.01.2012      2   bbb
3  User1  04.01.2012      3   aaa
4  User1  12.03.2012      1   aaa
5  User1  13.03.2012      8   ccc
6  User2  04.02.2012      4   aaa
7  User2  05.02.2012      3   bbb

Start = '2012-01-01'
End = '2012-01-30'
User = 'User1'
shift = 'aaa'

use['Date'] = pd.to_datetime(use['Date'], dayfirst=True)

#how many Hours by dates (sum)
print(use.loc[(use['Date'] > Start) & (use['Date'] < End),'hours'].sum())
10

#how many Hours by dates and user (sum)
print(use.loc[(use['Date'] > Start) & (use['Date'] < End) & 
              (use['User'] == User),'hours'].sum())
10

#how many Hours by dates and user (count)
print(((use['Date'] > Start) & (use['Date'] < End) & 
       (use['User'] == User)).sum())
3

#how many Hours by dates and user and shift (count)
print(((use['Date'] > Start) & (use['Date'] < End) & 
       (use['User'] == User ) & (use['shift'] == shift)).sum())
2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • How can I count in these range for example all shifts? mask = (use['Date'] > Start) & (use['Date'] <= End) & (use['shift'] == 'aaa').value_counts() so that i know how many aaa shifts are done by user1 – Kenny Apr 11 '17 at 11:40
  • I update answer, can you check it? What is desired output? – jezrael Apr 11 '17 at 11:44
  • I think you can filter by dates first and then output aggregate - [docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once). Is it what you want? – jezrael Apr 11 '17 at 12:02
  • I guess thats what I want.But anyway I try to make it a bit more clear. I have a large CSV and I want to filter this CSV for every user. I need some filter options. First by User and Date Range. For example I want to filter out how many hours User3 has done in Dec 2012. As output I only want the number for example 45. And then for example I want to know how many aaa shifts the user 1 has done in March 2012. Desired output if he made 3 aaa shifts in march then only '3'. I hope that explains it a bit better. – Kenny Apr 11 '17 at 12:07
  • I think you have 2 possible solutions - filtering with boolean indexing and then `sum` boolean mask. But I think better is use groupby with aggregation, get nice output table (df) with all combination which are in input data. Faster is boolean indexing – jezrael Apr 11 '17 at 12:17
  • 1
    Glad can help you and good luck! pandas is really good library. – jezrael Apr 12 '17 at 09:01
  • But not so easy sometime as a beginner. – Kenny Apr 12 '17 at 09:01
  • 1
    No problem, every of us starts. But later it will be only better and better. Good luck! – jezrael Apr 12 '17 at 09:03
  • I get a diferent output on that #how many Hours by dates and user and shift. I changed my code a bit cause I only want the shift count. ur = (use['Date'] > '01.01.2016') & (use['Date'] < "01.05.2016") & (use['shift'] == "aaa").sum(). I got false false false and so on. – Kenny Apr 12 '17 at 09:13
  • Do you have same data? Do youi convert dates to datetime `use['Date'] = pd.to_datetime(use['Date'], dayfirst=True)` ? – jezrael Apr 12 '17 at 09:14
  • SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy use['Date'] = pd.to_datetime(use['Date'], dayfirst=True) – Kenny Apr 12 '17 at 09:17
  • This works shifts = (ou['shift'] == 'aaa').sum() but if I want to combine this with the date it wont work – Kenny Apr 12 '17 at 09:21
  • Hmmm, it seems you need `copy` - check [this](http://stackoverflow.com/a/42439031/2901002) – jezrael Apr 12 '17 at 09:22
0

You'll need to narrow down your dataset before running the aggregation.

use[use['Dates'] == '01.01.2012']['hours'].sum()

The first part of that line is the filtering:

use[use['Dates'] == '01.01.2012']
Quitty
  • 144
  • 6
  • But I need a Date Range for example from 01.01.2012 to 06.01.2012 – Kenny Apr 11 '17 at 11:43
  • Same drill. `use[use['Dates'] >= '01.01.2012'][use['Dates'] <= '05.01.2012']` You may need to reformat your date, though - DataFrame standard is YYYY-MM-DD – Quitty Apr 11 '17 at 12:22
  • Not working ValueError: cannot reindex from a duplicate axis – Kenny Apr 11 '17 at 12:32