
Python: Fast indexing of strings in nested list without loop

Data Science Asked on May 10, 2021

I have an array dataset from which I need to build D whose each element is a list of strings (Third column of dataset). The dataset looks like:

600,900,3418309  
600,900,3418309  
600,900,3418314   
600,900,3418314  
600,900,3418319   
600,900,3418319  
610,800,3418324  
610,700,3418324  
600,900,3418329  
620,900,3418329  
600,900,3418329  
600,900,3418334  
610,900,3418334  
600,900,3418339  
600,900,3418339  
600,900,3418339  
660,700,3418339  
610,800,3418339  
660,700,3418339  
600,900,3418339  
600,900,3418339

I want to check, for each new string, whether it is already in the corresponding array element, and append the string to that element only if it is not. Since there are many new strings to add, I would like to avoid explicit loops. Is there a fast way to do this? I have to use Python.

Right now I am using this code, which is very slow:

for i in range(len(dataset)):
    for j in range(int(dataset[i, 0]) - 600, int(dataset[i, 1]) - 600 + 1):
        if str(dataset[i, 2]) not in D[j]:
            D[j].append(str(dataset[i, 2]))
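One change that usually helps here, independent of any library: store each element of D as a set rather than a list, so the membership test is O(1) instead of a scan of the whole list (and the explicit `not in` check disappears, since sets ignore duplicates). A minimal sketch, assuming `dataset` is a NumPy array shaped like the sample above and that the buckets of D are indexed by value − 600:

```python
import numpy as np

# A few rows mirroring the question's sample data: (start, end, id).
dataset = np.array([
    [600, 900, 3418309],
    [600, 900, 3418309],
    [610, 800, 3418324],
])

# One bucket per index from 0 to max(end) - 600.
n_buckets = int(dataset[:, 1].max()) - 600 + 1
D = [set() for _ in range(n_buckets)]

for start, end, value in dataset:
    s = str(value)
    for j in range(int(start) - 600, int(end) - 600 + 1):
        D[j].add(s)  # a set silently skips values it already contains
```

If the rest of the code needs lists, each bucket can be converted back with `list(D[j])` at the end.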

One Answer

This answer assumes I have correctly understood the question... I can alter my answer if the OP updates the question with more details.

Using your example data, you can use Pandas to easily drop all duplicates.

Setup

First, dump your data above into a DataFrame with three columns (one for each of the items in each row).

Import pandas:

import pandas as pd

Import your data - assuming it is a list of lists, where each of your rows is a list of three items, so we get three columns:

df = pd.DataFrame.from_records(your_list_of_lists, columns=["col1", "col2", "col3"])

Have a look at the first 5 rows:

df.head()
   col1  col2     col3
0   600   900  3418309  
1   600   900  3418309  
2   600   900  3418314  
3   600   900  3418314  
4   600   900  3418319  

The values will be parsed as integers by default, not strings (provided they all look numeric), but the solutions below work the same in either case.

Solutions

If you want to get all unique values of col3, you can do one of the following:

uniques1 = set(df.col3)              # returns a Python set
uniques2 = df.col3.unique()          # returns a NumPy ndarray
uniques3 = df.col3.drop_duplicates() # returns a pandas Series object
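As a quick sanity check on a small frame built from the sample rows (column names as assumed above), all three return the same distinct values, just in different containers:

```python
import pandas as pd

df = pd.DataFrame.from_records(
    [[600, 900, 3418309], [600, 900, 3418309],
     [600, 900, 3418314], [610, 800, 3418324]],
    columns=["col1", "col2", "col3"],
)

uniques1 = set(df.col3)               # Python set, unordered
uniques2 = df.col3.unique()           # NumPy array, first-seen order
uniques3 = df.col3.drop_duplicates()  # Series, keeps the original index labels
```

The index labels that `drop_duplicates` preserves (here 0, 2, 3) can be useful if you need to know which rows the unique values came from.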

If you want to remove only those rows where col3 repeats itself consecutively (which, depending on the data, could produce the same result as the methods above), you can look at the methods here. An example adapted for your columnar situation:

import numpy as np

def drop_consecutive_duplicates(a, col_name):
    # returns the dataframe with consecutive duplicates in col_name removed
    ar = a[col_name].values
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]

This will return the entire dataframe, with those rows of consecutive duplicate values removed.
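As a quick check, here is a self-contained version of that helper (with the numpy import it needs) applied to a few of the sample rows:

```python
import numpy as np
import pandas as pd

def drop_consecutive_duplicates(a, col_name):
    # Keep a row whenever its col_name value differs from the previous row's.
    ar = a[col_name].values
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]

df = pd.DataFrame.from_records(
    [[600, 900, 3418309],
     [600, 900, 3418309],
     [600, 900, 3418314],
     [610, 800, 3418324],
     [600, 900, 3418324]],
    columns=["col1", "col2", "col3"],
)

result = drop_consecutive_duplicates(df, "col3")
# Rows 1 and 4 are dropped: their col3 value repeats the row above.
```

Note that, unlike `drop_duplicates`, this keeps a value that reappears later after a different value in between, which is exactly the "consecutive" semantics.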

Performance

There are many other ways to achieve the same result. Of the above, the first method is the fastest (on your small dataset example). Here are the benchmarks:

In [23]: %timeit df.col3.drop_duplicates()                                      
263 µs ± 883 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [24]: %timeit df.col3.unique()                                               
37.2 µs ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [25]: %timeit set(df.col3)                                                   
10.5 µs ± 45.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The consecutive version:

In [26]: %timeit drop_consecutive_duplicates(df, "col3")                                                                                                                                                                                          
266 µs ± 3.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I imagine the Pandas methods might be able to scale better to a DataFrame with many rows, so this example might be rather biased on the dummy dataset with only ~20 rows.

The final method clearly has a little bit of overhead, as it has to perform some extra operations.

Answered by n1k31t4 on May 10, 2021
