Stack Overflow Asked by wonder kid on December 20, 2021
Let’s consider two dataframes: Person and Movie.
dataframe Person
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| | nconst | primaryName | primaryProfession | knownForTitles |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 0 | nm0000103 | Fairuza Balk | actress,soundtrack | tt0181875,tt0089908,tt0120586,tt0115963 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 1 | nm0000106 | Drew Barrymore | producer,actress,soundtrack | tt0120888,tt0343660,tt0151738,tt0120631 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 2 | nm0000117 | Neve Campbell | actress,producer,soundtrack | tt0134084,tt1262416,tt0120082,tt0117571 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 3 | nm0000132 | Claire Danes | actress,producer,soundtrack | tt0274558,tt0108872,tt1796960,tt0117509 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 4 | nm0000138 | Leonardo DiCaprio | actor,producer,writer | tt0120338,tt0993846,tt1375666,tt0407887 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
dataframe Movie
+---+-----------+-----------+---------------------+-----------------------+
| | tconst | titleType | originalTitle | genres |
+---+-----------+-----------+---------------------+-----------------------+
| 0 | tt0192789 | movie | While Supplies Last | Comedy,Musical |
+---+-----------+-----------+---------------------+-----------------------+
| 1 | tt4914592 | movie | Electric Heart | Adventure,Drama,Music |
+---+-----------+-----------+---------------------+-----------------------+
| 2 | tt4999994 | movie | Rain Doll | Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 3 | tt2690572 | movie | Polaris | Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 4 | tt1562859 | movie | Golmaal 3 | Action,Comedy |
+---+-----------+-----------+---------------------+-----------------------+
As you can see, knownForTitles from Person is a comma-separated list of tconst values from the Movie dataframe.
Question: how many actors have ever acted in an action movie?
First, we create person as a DataFrame:
import pandas as pd
columns = ['nconst', 'primaryName', 'primaryProfession', 'knownForTitles']
data = [
('nm0000103', 'Fairuza Balk', 'actress,soundtrack', 'tt0181875,tt0089908,tt0120586,tt0115963'),
('nm0000106', 'Drew Barrymore', 'producer,actress,soundtrack', 'tt0120888,tt0343660,tt0151738,tt0120631'),
('nm0000117', 'Neve Campbell', 'actress,producer,soundtrack', 'tt0134084,tt1262416,tt0120082,tt0117571'),
('nm0000132', 'Claire Danes', 'actress,producer,soundtrack', 'tt0274558,tt0108872,tt1796960,tt0117509'),
('nm0000138', 'Leonardo DiCaprio', 'actor,producer,writer', 'tt0120338,tt0993846,tt1375666,tt0407887'),
]
person = pd.DataFrame(data=data, columns=columns)
Second, we split strings into lists for two of the columns:
for field in ['primaryProfession', 'knownForTitles']:
person[field] = person[field].str.split(',')
Third, we use the explode function to convert one row into many:
person = person.explode('knownForTitles').explode('primaryProfession')
Fourth, we select only actress/actor as the primary profession:
actor_actress = person[ person['primaryProfession'].isin(['actress', 'actor'])]
Now, we have a data frame in so-called tidy format (each cell has a single value, not a list):
nconst primaryName primaryProfession knownForTitles
0 nm0000103 Fairuza Balk actress tt0181875
0 nm0000103 Fairuza Balk actress tt0089908
0 nm0000103 Fairuza Balk actress tt0120586
0 nm0000103 Fairuza Balk actress tt0115963
1 nm0000106 Drew Barrymore actress tt0120888
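One detail visible in the output above: explode repeats the original index label for every row it produces (0 appears four times). The following is a small sketch of that behaviour, using a made-up two-row frame (not the data from the question) and assuming pandas 1.1 or later for the ignore_index argument:

```python
import pandas as pd

# Made-up frame for illustration only: one list-valued column.
df = pd.DataFrame({"name": ["a", "b"], "titles": [["t1", "t2"], ["t3"]]})

# Default behaviour: each new row keeps the index label of its source row.
exploded = df.explode("titles")
print(exploded.index.tolist())

# ignore_index=True renumbers the rows 0..n-1 instead.
flat = df.explode("titles", ignore_index=True)
print(flat.index.tolist())
```

The repeated labels are harmless for a later merge, but a fresh index is easier to work with if you plan to select rows by label.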
At this point, we can repeat these steps for the Movie data frame, and then join actors (using knownForTitles) and Movies (using tconst).
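The remaining steps could look roughly like the sketch below. It is not part of the original answer. The frames are shortened stand-ins, and one tconst value in the movie frame is changed on purpose (an assumption, since the sample data in the question has no overlapping IDs) so that the join returns a match.

```python
import pandas as pd

# Stand-in for the exploded and filtered actor_actress frame.
actor_actress = pd.DataFrame({
    "nconst": ["nm0000117", "nm0000117", "nm0000138"],
    "primaryName": ["Neve Campbell", "Neve Campbell", "Leonardo DiCaprio"],
    "primaryProfession": ["actress", "actress", "actor"],
    "knownForTitles": ["tt0134084", "tt1262416", "tt0120338"],
})

# Stand-in for Movie; the first tconst is altered so that it matches an actor.
movie = pd.DataFrame({
    "tconst": ["tt0134084", "tt4999994"],
    "titleType": ["movie", "movie"],
    "originalTitle": ["Golmaal 3", "Rain Doll"],
    "genres": ["Action,Comedy", "Drama"],
})

# Same split/explode treatment: one row per (movie, genre) pair.
movie["genres"] = movie["genres"].str.split(",")
movie = movie.explode("genres")

# Keep only the action rows, then join actors to movies on the title ID.
action_movies = movie[movie["genres"] == "Action"]
joined = actor_actress.merge(
    action_movies, left_on="knownForTitles", right_on="tconst", how="inner"
)

# A person may match several action movies, so count distinct IDs.
n_action_actors = joined["nconst"].nunique()
print(n_action_actors)
```

Counting with nunique on nconst rather than taking the number of joined rows avoids counting the same person once per matching movie.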
Sorry for the length of this response. The key points of this approach are to use str.split(',') and then explode() to transform the data frame into a format suitable for join, merge, etc.
Answered by jsmart on December 20, 2021
I'm learning pandas, so there's a good chance I'm going the wrong way with this. That said, let's give this a go:
First, let's see if we can find all rows in df Movie that are action films. Looking at Pandas dataframe select rows where a list-column contains any of a list of strings, I came up with this:
Movies['isAction'] = [ 'Action' in x for x in Movies['genres'].tolist() ]
Here's the result:
tconst titleType originalTitle genres isAction
0 tt0407887 movie WhileSuppliesLast [Comedy, Musical] False
1 tt1375666 movie ElectricHeart [Adventure, Drama, Music] False
2 tt4999994 movie RainDoll [Drama] False
3 tt2690572 movie Polaris [Drama] False
4 tt0134084 movie Golmaal3 [Action, Comedy] True
I added the isAction column to the Movies df. I also changed some of the tconst values so that we can get some positive results (rows 0, 1, and 4 changed). I changed row 4 so that Neve Campbell would appear in the results.
We can now produce a list of the tconst values of action movies:
listOfActionMovies = Movies[ Movies["isAction"] == True]["tconst"].tolist()
Now using the solution from Pandas dataframe select rows where a list-column contains any of a list of strings again:
Person["inAction"] = pd.DataFrame(Person.knownForTitles.tolist()).isin( listOfActionMovies ).any(axis=1)
This yields:
nconst primaryName primaryProfession knownForTitles inAction
0 nm0000103 FairuzaBalk [actress, soundtrack] [tt0181875, tt0089908, tt0120586, tt0115963] False
1 nm0000106 DrewBarrymore [producer, actress, soundtrack] [tt0120888, tt0343660, tt0151738, tt0120631] False
2 nm0000117 NeveCampbell [actress, producer, soundtrack] [tt0134084, tt1262416, tt0120082, tt0117571] True
3 nm0000132 ClaireDanes [actress, producer, soundtrack] [tt0274558, tt0108872, tt1796960, tt0117509] False
4 nm0000138 LeonardoDiCaprio [actor, producer, writer] [tt0120338, tt0993846, tt1375666, tt0407887] False
Now finally we can count all the people in action movies:
len(Person[ Person["inAction"] == True ])
The len() solution comes from "get dataframe row count based on conditions".
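For comparison, here is a compact variant of the same idea that skips the intermediate wide DataFrame. It is only a sketch: the two-row person frame and the action_titles set are shortened, assumed stand-ins for Person and listOfActionMovies.

```python
import pandas as pd

# Shortened stand-in for Person, with knownForTitles already split into lists.
person = pd.DataFrame({
    "primaryName": ["Fairuza Balk", "Neve Campbell"],
    "knownForTitles": [
        ["tt0181875", "tt0089908"],
        ["tt0134084", "tt1262416"],
    ],
})

# Assumed stand-in for listOfActionMovies, stored as a set for fast lookups.
action_titles = {"tt0134084"}

# True when at least one of the person's titles is an action movie.
person["inAction"] = person["knownForTitles"].apply(
    lambda titles: not action_titles.isdisjoint(titles)
)

# Summing a boolean column counts the True values.
count = int(person["inAction"].sum())
print(count)
```

Summing the boolean column gives the same number as len() on the filtered frame.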
Answered by Mark on December 20, 2021