Data Science Asked on May 10, 2021
I have an array dataset from which I need to build D, where each element of D is a list of strings (taken from the third column of dataset). The dataset looks like:
600,900,3418309
600,900,3418309
600,900,3418314
600,900,3418314
600,900,3418319
600,900,3418319
610,800,3418324
610,700,3418324
600,900,3418329
620,900,3418329
600,900,3418329
600,900,3418334
610,900,3418334
600,900,3418339
600,900,3418339
600,900,3418339
660,700,3418339
610,800,3418339
660,700,3418339
600,900,3418339
600,900,3418339
For every new string, I want to check whether it is already part of the corresponding array element, and append it only if it is not. Since the number of new strings to be added is large, I would like to avoid a plain loop. Is there a fast way to do this? I have to use Python.
Right now I am using the following code, which is very slow:
for i in range(len(dataset)):
    # columns 0 and 1 give the index range of D to update; column 2 is the string to add
    for j in range(int(dataset[i, 0] - 600), int(dataset[i, 1] - 600) + 1):
        if str(dataset[i, 2]) not in D[j]:
            D[j].append(str(dataset[i, 2]))
This answer assumes I have correctly understood the question... I can alter my answer if the OP updates the question with more details.
Using your example data, you can use Pandas to easily drop all duplicates.
First, dump your data above into a DataFrame with three columns (one for each of the items in each row).
Import pandas:
import pandas as pd
Import your data - assuming it is a list of lists, each of your rows is a list of three items, so we have three columns:
df = pd.DataFrame.from_records(your_list_of_lists, columns=["col1", "col2", "col3"])
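For reference, a minimal sketch of how such a list of lists could be built from the raw comma-separated rows, assuming they are available as lines of text (raw_rows here is a hypothetical name, e.g. lines read from a file):

# Hypothetical setup: raw_rows holds the comma-separated lines from the question,
# e.g. raw_rows = open("data.csv").read().splitlines()
raw_rows = ["600,900,3418309", "600,900,3418309", "600,900,3418314"]

# Split each line on commas and convert every field to an int
your_list_of_lists = [[int(value) for value in line.split(",")] for line in raw_rows]

The result can then be passed to from_records as above.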
Have a look at the first 5 rows:
df.head()
col1 col2 col3
0 600 900 3418309
1 600 900 3418309
2 600 900 3418314
3 600 900 3418314
4 600 900 3418319
The values will be parsed as integers by default, not strings (provided all fields are numeric). The solutions below work the same in either case.
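If you do need them as strings (for example to match the lists of strings D in the question), a minimal sketch is to cast the column explicitly (assuming df was built as above):

# Cast col3 to strings so downstream code that expects string IDs keeps working
df["col3"] = df["col3"].astype(str)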
If you want to get all unique values of col3, you can do one of the following:
uniques1 = set(df.col3) # returns a Python set
uniques2 = df.col3.unique() # returns a NumPy ndarray
uniques3 = df.col3.drop_duplicates() # returns a pandas Series object
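On the example data, all three contain the same seven values, differing only in container type and ordering; assuming the column was kept as integers, sorting the set should print something like:

print(sorted(uniques1))
# [3418309, 3418314, 3418319, 3418324, 3418329, 3418334, 3418339]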
If you want to remove only those rows where col3 repeats itself consecutively (which in some cases could produce the same result as the methods above, depending on the data), then you can look at the methods here. An example adapted for your columnar situation:
import numpy as np

def drop_consecutive_duplicates(a, col_name):
    # returns the dataframe, keeping only the first row of each consecutive run of col_name values
    ar = a[col_name].values
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]
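As a usage sketch (assuming df was built as above), calling it keeps only the first row of each consecutive run of col3 values:

deduped = drop_consecutive_duplicates(df, "col3")
print(deduped.index.tolist())  # surviving row indices on the example data: [0, 2, 4, 6, 8, 11, 13]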
There are many other ways to achieve the same result. Of the above, the first method (the plain Python set) is the fastest on your small example dataset. Here are the benchmarks:
In [23]: %timeit df.col3.drop_duplicates()
263 µs ± 883 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: %timeit df.col3.unique()
37.2 µs ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [25]: %timeit set(df.col3)
10.5 µs ± 45.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The consecutive version:
In [26]: %timeit drop_consecutive_duplicates(df, "col3")
266 µs ± 3.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I imagine the Pandas methods might be able to scale better to a DataFrame with many rows, so this example might be rather biased on the dummy dataset with only ~20 rows.
The final method clearly has a little overhead, as it has to perform some extra operations.
Answered by n1k31t4 on May 10, 2021