# Numpy group by multiple vectors, get group indices

By : chengchao
Date : November 22 2020, 04:01 AM
Stack the arrays a and b with np.stack, then call np.unique with axis=1 and return_inverse=True. The inverse array it returns is the group index you are looking for:
code :
``````
import numpy as np

a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
# Each column of the stacked array is one (a, b) pair; inv holds its group index
_, inv = np.unique(np.stack([a,b]), axis=1, return_inverse=True)
print(inv)
# [0 2 1 1 1 3 4 1]
``````
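To see what the inverse array encodes, the sketch below also keeps the unique columns and counts the rows in each group. The names `uniq`, `counts` and `rebuilt` are illustrative, and the `ravel()` call is only a defensive assumption so that the inverse is 1-D whatever NumPy version is installed.

```python
import numpy as np

a = np.array([1, 2, 1, 1, 1, 2, 3, 1])
b = np.array([1, 2, 2, 2, 2, 3, 3, 2])

# uniq holds one column per distinct (a, b) pair, in sorted order
uniq, inv = np.unique(np.stack([a, b]), axis=1, return_inverse=True)
inv = inv.ravel()  # defensive: keep the group index 1-D

# Number of rows that fall into each group
counts = np.bincount(inv)

# Indexing the unique columns with inv reproduces the original pairs
rebuilt = uniq[:, inv]

print(uniq)
print(counts)
```

Indexing `uniq` with `inv` gives back the stacked input, which is a quick way to check that the labels are right.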
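``````
def group_np_sum(groupcols):
    # Scale factor for each later column: running product of (max + 1) of the columns before it
    groupcols_max = np.cumprod([ar.max()+1 for ar in groupcols[:-1]])
    # Combine all columns into one integer key per row, then label the distinct keys
    return np.unique( sum([groupcols[0]] +
                          [ ar*m for ar, m in zip(groupcols[1:],groupcols_max)]),
                      return_inverse=True)[1]
``````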
``````
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np_sum([a,b]))
# [0 2 1 1 1 3 4 1]
``````
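The integer key built by group_np_sum appears to be the same number that np.ravel_multi_index produces in column-major (`order='F'`) mode. The sketch below checks this on random data, assuming non-negative integer columns. The variable names and the random test data are illustrative and not part of the original answer.

```python
import numpy as np

def group_np_sum(groupcols):
    # Same function as above, repeated so the sketch is self-contained
    groupcols_max = np.cumprod([ar.max() + 1 for ar in groupcols[:-1]])
    key = sum([groupcols[0]] +
              [ar * m for ar, m in zip(groupcols[1:], groupcols_max)])
    return np.unique(key, return_inverse=True)[1]

rng = np.random.default_rng(0)
cols = [rng.integers(0, 5, 50) for _ in range(3)]

# Column-major linear index: cols[0] + cols[1]*d0 + cols[2]*d0*d1
dims = tuple(int(c.max()) + 1 for c in cols)
key_rmi = np.ravel_multi_index(tuple(cols), dims, order='F')

labels_sum = group_np_sum(cols)
labels_rmi = np.unique(key_rmi, return_inverse=True)[1]
print(np.array_equal(labels_sum, labels_rmi))
```

Both variants rely on the combined key fitting in the integer dtype, so very wide value ranges or many columns could overflow.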
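``````
# group_np2 is not defined in this excerpt; its output is shown as given in the source
a = np.array([3,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np2([a,b]))
print(group_np_sum([a,b]))
# [3 1 0 0 0 2 4 0]
# [0 2 1 1 1 3 4 1]
# The labels differ, but both outputs describe the same grouping of rows
``````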
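The two label vectors above number the groups differently but split the rows the same way. One way to check that is sketched below. The helper `same_partition` is hypothetical and introduced only for illustration.

```python
import numpy as np

def same_partition(labels_x, labels_y):
    """Return True if two label vectors group the rows identically."""
    labels_x = np.asarray(labels_x)
    labels_y = np.asarray(labels_y)
    # Distinct (x, y) label pairs that actually occur together
    pairs = np.unique(np.stack([labels_x, labels_y]), axis=1)
    # A one-to-one relabelling has exactly one pair per label on each side
    n_pairs = pairs.shape[1]
    return bool(n_pairs == np.unique(labels_x).size == np.unique(labels_y).size)

x = np.array([3, 1, 0, 0, 0, 2, 4, 0])
y = np.array([0, 2, 1, 1, 1, 3, 4, 1])
print(same_partition(x, y))
```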
``````
# group_pd and group_np2 are not defined in this excerpt; timings are as reported in the source
a = np.random.randint(1, 100, 30000)
b = np.random.randint(1, 100, 30000)
c = np.random.randint(1, 100, 30000)
groupcols = [a,b,c]

%timeit group_pd(groupcols)
#13.7 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit group_np2(groupcols)
#34.2 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit group_np_sum(groupcols)
#3.63 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
``````

## Group by multiple columns, get group total count and specific column from last two rows in each group

By : user56946
Date : March 29 2020, 07:55 AM
For a SQL Server table like this, I would use a WITH clause (a common table expression) that computes the window functions per group and then keeps only the latest row of each group:
code :
``````
WITH RUL AS (
    select
        UserId,
        Area,
        Action,
        ObjectId,
        -- RelatedUserLink of the previous row (by Created) in the same group
        LAG(RelatedUserLink) OVER (PARTITION BY UserId, Area, Action, ObjectId ORDER BY Created) as RelatedUserLink2,
        -- 1 marks the most recent row of each group
        ROW_NUMBER() OVER (PARTITION BY UserId, Area, Action, ObjectId ORDER BY Created DESC) latest_to_earliest,
        MAX(Created) OVER (PARTITION BY UserId, Area, Action, ObjectId) as Created,
        COUNT(*) OVER (PARTITION BY UserId, Area, Action, ObjectId) as Count
    from /* table name missing in the source */
    where UserId = 10
)
select
    UserId,
    Area,
    Action,
    ObjectId,
    Created,
    Count
from
    RUL
where
    latest_to_earliest = 1;
``````
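To try the window-function pattern on toy data, the sketch below runs a similar query from Python against an in-memory SQLite database. It makes several assumptions: it uses SQLite rather than SQL Server, it needs an SQLite build with window-function support, and the table name `UserLog`, the sample rows and the aliases `LastCreated` and `GroupCount` are all hypothetical. It also selects the link columns in the outer query so that the LAG result is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserLog (UserId INTEGER, Area TEXT, Action TEXT, "
    "ObjectId INTEGER, RelatedUserLink TEXT, Created TEXT)"
)
rows = [
    (10, "blog", "comment", 1, "u/a", "2020-01-01"),
    (10, "blog", "comment", 1, "u/b", "2020-01-02"),
    (10, "blog", "comment", 1, "u/c", "2020-01-03"),
    (10, "blog", "like", 2, "u/d", "2020-01-05"),
    (11, "blog", "like", 2, "u/e", "2020-01-06"),
]
conn.executemany("INSERT INTO UserLog VALUES (?, ?, ?, ?, ?, ?)", rows)

query = """
WITH RUL AS (
    SELECT
        Area, Action, ObjectId, RelatedUserLink,
        LAG(RelatedUserLink) OVER (
            PARTITION BY UserId, Area, Action, ObjectId ORDER BY Created
        ) AS RelatedUserLink2,
        ROW_NUMBER() OVER (
            PARTITION BY UserId, Area, Action, ObjectId ORDER BY Created DESC
        ) AS latest_to_earliest,
        MAX(Created) OVER (
            PARTITION BY UserId, Area, Action, ObjectId
        ) AS LastCreated,
        COUNT(*) OVER (
            PARTITION BY UserId, Area, Action, ObjectId
        ) AS GroupCount
    FROM UserLog
    WHERE UserId = 10
)
SELECT Area, Action, ObjectId, LastCreated, GroupCount,
       RelatedUserLink, RelatedUserLink2
FROM RUL
WHERE latest_to_earliest = 1
ORDER BY ObjectId
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each output row carries the group's latest timestamp, its row count, and the links from the latest and second-latest rows. The second link is NULL when a group has a single row.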

## Group by and aggregate problems for numpy arrays over word vectors

By : Asmita
Date : March 29 2020, 07:55 AM
There is a bug in your code: inside the lambda function you sum across the entire dataframe instead of just the group. This should fix it:
code :
``````
movie_groupby = movie_data.groupby('movie_id').agg(lambda v: np.sum(v['textvec']))
``````
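The same per-group vector sum can be written in plain NumPy, which makes the "sum only within the group" step explicit. This is a different technique from the pandas call above, not the original answer's method. The arrays `movie_ids` and `textvecs` are made-up sample data.

```python
import numpy as np

# Hypothetical data: one movie id and one 2-D word vector per row
movie_ids = np.array([7, 7, 9, 7, 9])
textvecs = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0],
                     [0.5, 0.5],
                     [1.0, 1.0]])

# inv[i] is the position of movie_ids[i] within the sorted unique ids
unique_ids, inv = np.unique(movie_ids, return_inverse=True)

# Accumulate each row's vector into the slot of its own group only
group_sums = np.zeros((unique_ids.size, textvecs.shape[1]))
np.add.at(group_sums, inv, textvecs)

print(unique_ids)
print(group_sums)
```

`np.add.at` is used instead of `group_sums[inv] += textvecs` because the latter does not accumulate when an index repeats.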

## Means within each group with more than 1 column of group indices

By : user6240922
Date : March 29 2020, 07:55 AM
Another option: since groupings is already a matrix, you can use apply with MARGIN set to 2 to loop over its columns and pass each column to ave as the grouping variable. You can set FUN = mean explicitly or leave it out, since mean is the default function used by ave:
code :
``````
apply(groupings, 2, ave, x = var)  # pass var by name as x, the first parameter of ave;
# otherwise apply would hand each column to ave as x,
# which is not what you want

#      [,1]      [,2]   [,3]
#[1,] 0.630 0.5940000 0.5625
#[2,] 0.625 0.5940000 0.5625
#[3,] 0.470 0.5940000 0.5625
#[4,] 0.630 0.7900000 0.6500
#[5,] 0.470 0.4166667 0.5650
#[6,] 0.625 0.5940000 0.5650
#[7,] 0.470 0.4166667 0.5650
#[8,] 0.625 0.7900000 0.5650
#[9,] 0.630 0.5940000 0.5625
#[10,] 0.625 0.4166667 0.6400
``````
``````
library(dplyr)
mutate_all(as.data.frame(groupings), funs(ave(var, .)))

#      V1        V2     V3
#1  0.630 0.5940000 0.5625
#2  0.625 0.5940000 0.5625
#3  0.470 0.5940000 0.5625
#4  0.630 0.7900000 0.6500
#5  0.470 0.4166667 0.5650
#6  0.625 0.5940000 0.5650
#7  0.470 0.4166667 0.5650
#8  0.625 0.7900000 0.5650
#9  0.630 0.5940000 0.5625
#10 0.625 0.4166667 0.6400
``````
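For readers working in NumPy rather than R, a rough analogue of applying ave to every grouping column is sketched below. The function name `group_means_per_column` and the sample data are hypothetical. For each column it replaces every element of `var` by the mean of its group.

```python
import numpy as np

def group_means_per_column(var, groupings):
    """For each grouping column, give every row the mean of var in its group."""
    var = np.asarray(var, dtype=float)
    groupings = np.asarray(groupings)
    out = np.empty(groupings.shape, dtype=float)
    for j in range(groupings.shape[1]):
        # inv maps each row to its group within column j
        _, inv = np.unique(groupings[:, j], return_inverse=True)
        sums = np.bincount(inv, weights=var)
        counts = np.bincount(inv)
        out[:, j] = (sums / counts)[inv]
    return out

var = np.array([1.0, 2.0, 3.0, 4.0])
groupings = np.array([[1, 1],
                      [1, 2],
                      [2, 1],
                      [2, 2]])
means = group_means_per_column(var, groupings)
print(means)
```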

## What is the fastest way to map group names of numpy array to indices?

By : user3478615
Date : March 29 2020, 07:55 AM
Constant number of indices per group

Approach #1

We can reduce each row of cubes to a single integer, which turns the problem into a 1D one. The rows are mapped onto an n-dimensional grid and replaced by their linear-index equivalents. Because equal rows get equal linear indices, sorting those indices brings each group together, which gives both the unique groups and the row indices belonging to each. One solution along those lines:
code :
``````
import numpy as np

N = 4 # number of indices per group
# One linear index per row of cubes
c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)
sidx = c1D.argsort()
# After sorting, every consecutive run of N positions is one group
indices = sidx.reshape(-1,N)
unq_groups = cubes[indices[:,0]]

# If you need in a zipped dictionary format
out = dict(zip(map(tuple,unq_groups), indices))
``````
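A small worked example may make the reshaping step easier to follow. The toy `cubes` array below is made up, with N = 2 rows per group instead of 4, and a stable sort is requested only so that the row order within each group is predictable.

```python
import numpy as np

# Hypothetical data: two distinct rows, each appearing exactly N = 2 times
cubes = np.array([[0, 0, 1],
                  [1, 2, 0],
                  [0, 0, 1],
                  [1, 2, 0]])
N = 2

# Linear index of each row on a grid of shape cubes.max(0) + 1
c1D = np.ravel_multi_index(cubes.T, cubes.max(0) + 1)

# Sorting puts equal keys next to each other; reshape cuts them into groups
sidx = c1D.argsort(kind="stable")
indices = sidx.reshape(-1, N)
unq_groups = cubes[indices[:, 0]]

out = {tuple(int(v) for v in g): idx for g, idx in zip(unq_groups, indices)}
print(c1D)
print(out)
```

The reshape only works when every group has exactly N rows. Uneven groups need explicit split points.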
``````
# Alternative: compute the 1D keys with a dot product
s1,s2 = cubes[:,:2].max(0)+1
s = np.r_[s2,1,s1*s2]
c1D = cubes.dot(s)
``````
``````
# Alternative using a k-d tree: the N nearest neighbours of a row are the rows identical to it
from scipy.spatial import cKDTree

idx = cKDTree(cubes).query(cubes, k=N)[1] # N = 4 as discussed earlier
I = idx[:,0].argsort().reshape(-1,N)[:,0]
unq_groups,indices = cubes[I],idx[I]
``````
``````
# Variable number of indices per group: split the sorted order wherever the key changes
c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)

sidx = c1D.argsort()
c1Ds = c1D[sidx]
split_idx = np.flatnonzero(np.r_[True,c1Ds[:-1]!=c1Ds[1:],True])
grps = cubes[sidx[split_idx[:-1]]]

indices = [sidx[i:j] for (i,j) in zip(split_idx[:-1],split_idx[1:])]
# If needed as dict o/p
out = dict(zip(map(tuple,grps), indices))
``````
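The split-point version can be followed step by step on a toy array with uneven groups. The data below are made up, and the stable sort is an added assumption so that indices inside each group stay in their original order.

```python
import numpy as np

# Hypothetical data: groups of size 3, 1 and 1
cubes = np.array([[0, 0, 1],
                  [1, 2, 0],
                  [0, 0, 1],
                  [0, 0, 1],
                  [1, 0, 0]])

c1D = np.ravel_multi_index(cubes.T, cubes.max(0) + 1)
sidx = c1D.argsort(kind="stable")
c1Ds = c1D[sidx]

# True at the start of every run of equal keys, plus a final sentinel
split_idx = np.flatnonzero(np.r_[True, c1Ds[:-1] != c1Ds[1:], True])

# First row of each run identifies the group; slices give its members
grps = cubes[sidx[split_idx[:-1]]]
indices = [sidx[i:j] for i, j in zip(split_idx[:-1], split_idx[1:])]

out = {tuple(int(v) for v in g): idx for g, idx in zip(grps, indices)}
print(split_idx)
print(out)
```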
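``````
def numpy1(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)
    sidx = c1D.argsort()
    c1Ds = c1D[sidx]
    # Group boundaries: positions where the sorted key changes
    split_idx = np.flatnonzero(np.r_[True,c1Ds[:-1]!=c1Ds[1:],True])
    grps = cubes[sidx[split_idx[:-1]]]
    indices = [sidx[i:j] for (i,j) in zip(split_idx[:-1],split_idx[1:])]
    out = dict(zip(map(tuple,grps), indices))
    return out
``````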
``````
import numpy as np
from numba import njit

@njit
def _numba1(sidx, c1D):
    out = []
    n = len(sidx)
    start = 0
    grpID = []
    for i in range(1,n):
        if c1D[sidx[i]]!=c1D[sidx[i-1]]:
            out.append(sidx[start:i])
            grpID.append(c1D[sidx[start]])
            start = i
    out.append(sidx[start:])
    grpID.append(c1D[sidx[start]])
    return grpID,out

def numba1(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)
    sidx = c1D.argsort()
    out = dict(zip(*_numba1(sidx, c1D)))
    return out
``````
``````
import numpy as np
from numba import njit, types
from numba.typed import Dict

int_array = types.int64[:]

@njit
def _numba2(sidx, c1D):
    n = len(sidx)
    start = 0
    # Typed dictionary: linear group id -> array of row indices
    outt = Dict.empty(
        key_type=types.int64,
        value_type=int_array,
    )
    for i in range(1,n):
        if c1D[sidx[i]]!=c1D[sidx[i-1]]:
            outt[c1D[sidx[start]]] = sidx[start:i]
            start = i
    outt[c1D[sidx[start]]] = sidx[start:]
    return outt

def numba2(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)
    sidx = c1D.argsort()
    out = _numba2(sidx, c1D)
    return out
``````
``````
In [4]: cubes = np.load('cubes.npz')['array']

In [5]: %timeit numpy1(cubes)
...: %timeit numba1(cubes)
...: %timeit numba2(cubes)
2.38 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.13 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 5.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````
``````
# Optional: compute the 1D keys with numexpr
import numexpr as ne

s0,s1 = cubes[:,0].max()+1,cubes[:,1].max()+1
d = {'s0':s0,'s1':s1,'c0':cubes[:,0],'c1':cubes[:,1],'c2':cubes[:,2]}
c1D = ne.evaluate('c0+c1*s0+c2*s0*s1',d)
``````

## Convert indices to vectors in Numpy

By : RUGIGANA SIMON PETER
Date : March 29 2020, 07:55 AM
A fairly common way to do this in NumPy is to compare data with np.arange and cast the resulting boolean array to an integer type:
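A minimal sketch of that comparison follows, assuming the goal is one indicator (one-hot) row vector per index. The sample `data` array and the vector length `n` are made up for illustration.

```python
import numpy as np

data = np.array([0, 2, 1])  # hypothetical indices
n = 3                       # assumed length of each output vector

# Broadcasting compares every index against 0..n-1, giving one boolean row per index
vectors = (data[:, None] == np.arange(n)).astype(int)
print(vectors)
```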