
I have the following dataframe:

  country region_1 region_2  country_code
0       A        B        D            88
1       A        B        E            88
2       A        B        F            88
3       A        C        G            88
4       A        C        H            88
5       A        C        I            88

I want to add two new columns, region_1_code and region_2_code, both 9-digit numbers. region_1_code uniquely identifies each region_1 and is constructed as follows: first, a 0 is appended to country_code if country_code is less than 100. Then the region_1's are sorted alphabetically and assigned a numeric code starting from 1, followed by as many 0's as needed to reach a length of 6. Finally, country_code and the newly computed code are concatenated to get region_1_code. In this example, the region_1_code for region B is 880100000.

Similarly, region_2_code for region D will be 880100100. The final dataframe should look like this:

  country region_1 region_2  country_code  region_1_code  region_2_code
0       A        B        D            88      880100000      880100100
1       A        B        E            88      880100000      880100200
2       A        B        F            88      880100000      880100300
3       A        C        G            88      880200000      880200400
4       A        C        H            88      880200000      880200500
5       A        C        I            88      880200000      880200600
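
To be concrete, here is the arithmetic I have in mind for region B, as a plain-Python sketch (the variable names are just illustrative):

country_code = 88
region_1_rank = 1                               # B is first alphabetically

country_part = str(country_code) + '0'          # '880' (0 appended since 88 < 100)
region_part = str(region_1_rank).ljust(6, '0')  # '100000'

print(country_part + region_part)               # 880100000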

user308827
  • Just curious: are all your country_code values less than 100? – Kevin Aug 25 '22 at 03:32
  • Please look at [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors-when-asking-a-question) and [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – ThePyGuy Aug 25 '22 at 03:33
  • @Kevin, it is not, but it is definitely less than 999 – user308827 Aug 25 '22 at 03:35

2 Answers


Using groupby.ngroup:

df['region_1_id'] = (df.groupby(['country', 'region_1'])
                     .ngroup().add(1)                    # 1-based alphabetical rank of region_1
                     .add(df['country_code'].mul(100))   # 88 -> 8801
                     .mul(100000)                        # pad with trailing zeros -> 880100000
                     )

df['region_2_id'] = (df['region_1_id']
                     .add(df.groupby(['country', 'region_2'])
                          .ngroup().add(1)               # 1-based alphabetical rank of region_2
                          .mul(100))                     # placed in the hundreds slot
                     )

Output:

  country region_1 region_2  country_code  region_1_id  region_2_id
0       A        B        D            88    880100000    880100100
1       A        B        E            88    880100000    880100200
2       A        B        F            88    880100000    880100300
3       A        C        G            88    880200000    880200400
4       A        C        H            88    880200000    880200500
5       A        C        I            88    880200000    880200600
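
Note that ngroup() numbers the groups starting from 0 in the order of the sorted group keys, which is what makes the codes follow the alphabetical ordering of the regions:

df.groupby(['country', 'region_1']).ngroup()
# 0    0
# 1    0
# 2    0
# 3    1
# 4    1
# 5    1
# dtype: int64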

Maximizing the information

If you want to handle up to 999 of each country_code/region_1/region_2, then use 3 digits for each one:

df['region_1_id'] = (df.groupby(['country', 'region_1'])
                     .ngroup().add(1)                     # 3 digits for the region_1 rank
                     .add(df['country_code'].mul(1000))   # 88 -> 88001
                     .mul(1000)                           # reserve 3 digits for region_2
                     )

df['region_2_id'] = (df['region_1_id']
                     .add(df.groupby(['country', 'region_2'])
                          .ngroup().add(1))               # region_2 rank in the last 3 digits
                     )

# left-pad with zeros to a fixed width of 9
df['region_1_id'] = df['region_1_id'].astype(str).str.zfill(9)
df['region_2_id'] = df['region_2_id'].astype(str).str.zfill(9)

Output:

  country region_1 region_2  country_code region_1_id region_2_id
0       A        B        D            88   088001000   088001001
1       A        B        E            88   088001000   088001002
2       A        B        F            88   088001000   088001003
3       A        C        G            88   088002000   088002004
4       A        C        H            88   088002000   088002005
5       A        C        I            88   088002000   088002006
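
As a quick sanity check (a sketch, assuming the dataframe above), every code should be exactly 9 characters and region_2_id should be unique:

assert df['region_1_id'].str.len().eq(9).all()
assert df['region_2_id'].str.len().eq(9).all()
assert df['region_2_id'].is_unique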
mozway
  • thanks @mozway, would this solution work beyond 10 region_1's or 10 region_2's, since it seems like you are multiplying by 100? I want the total length of the output not to exceed 9 digits. – user308827 Aug 25 '22 at 11:18
  • this should handle 99/999 (100/1000 if you start from 0) regions of id1:id2 respectively. The max value would be `99-99-999-00` (dashes for clarity), meaning 99 country_code, 99 id1, 999 id2. I just followed your guidelines, but here 2 digits are wasted. – mozway Aug 25 '22 at 11:26
  • your region_1_id values for region B are all different. They should all be the same for the same region_1. – user308827 Aug 26 '22 at 02:24

This will give you the answer you're looking for, dynamically:

import pandas as pd

df = pd.DataFrame({
    'Country' : ['A', 'A', 'A', 'A', 'A', 'A'],
    'Region_1' : ['B', 'B', 'B', 'C', 'C', 'C'],
    'Region_2' : ['D', 'E', 'F', 'G', 'H', 'I'],
    'Country_CD' : [88, 88, 88, 88, 88, 88]
})

# country part: append a trailing 0 since Country_CD < 100 -> '880'
df['Region_1_ID'] = df['Country_CD'].astype(str) + '0'
# 1-based rank of each Region_1 within its country
df['Region_1_ID_Holder'] = df.groupby(['Country'])['Region_1'].transform(lambda x: x.factorize()[0] + 1)
df['Region_1_ID_Holder'] = df['Region_1_ID_Holder'] * 100000
df['Region_1_ID'] = df['Region_1_ID'] + df['Region_1_ID_Holder'].astype(str)

df['Region_2_ID'] = df['Country_CD'].astype(str) + '0'
# rank of the parent Region_1 again (grouping on Country alone here would give
# every row the same rank and produce wrong codes for region C)
df['Region_2_ID_Holder'] = df.groupby(['Country'])['Region_1'].transform(lambda x: x.factorize()[0] + 1)
df['Region_2_ID_Holder'] = df['Region_2_ID_Holder'] * 100
# running 1-based count of Region_2 within the country
df['Region_2_ID_Count_Holder'] = df.groupby(['Country'])['Region_2'].cumcount() + 1
df['Region_2_ID_Count_Holder'] = df['Region_2_ID_Count_Holder'] * 100
df['Region_2_ID'] = df['Region_2_ID'].astype(str) + df['Region_2_ID_Holder'].astype(str) + df['Region_2_ID_Count_Holder'].astype(str)
df = df.drop(columns = ['Region_1_ID_Holder', 'Region_2_ID_Holder', 'Region_2_ID_Count_Holder'])
df

However, this does not take into account a Country_CD > 100; you could simply turn this into a function and exclude those that are > 100.
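
If you want to keep the question's rule exactly (append a trailing 0 only when the code is below 100), here is a minimal sketch, where Country_Part is a hypothetical helper column:

s = df['Country_CD'].astype(str)
# append '0' only for codes below 100; keep 3-digit codes as-is
df['Country_Part'] = s.where(df['Country_CD'] >= 100, s + '0')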

ArchAngelPwn