Web Scraping Project

I love reading: books, newspapers, magazines. I enjoy keeping up with what is being published and, above all, sharing my finds and curating content. Thanks to the web, we can now access newspapers and magazines from all over the world that used to be hard to find in print.

For this web scraping project, I chose to scrape two web media sites: The Lily and Nylon.


Goal: build a table that, for a given word, gathers the articles listed on the search results pages of these two sites.
The challenge: harmonize the results so that a single final table contains all the data.

Importing the libraries

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

Creating an input field to enter the search word of your choice

In [3]:
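word = input("Type a word: ")

# Build the search URL of each site from the chosen word
url_lily = f'https://www.thelily.com/tag/{word}/'
url_nylon = f'https://nylon.com/search/?q={word}'

# Name the parser explicitly so the result does not depend on which parsers are installed
html_lily = BeautifulSoup(requests.get(url_lily).content, 'html.parser')
nylon_html = BeautifulSoup(requests.get(url_nylon).content, 'html.parser')
Type a word: psychology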
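
A word containing spaces or accented characters would go into the two URLs unescaped. The sketch below shows one possible way to percent-encode it first, using only the standard library. The helper name `build_search_urls` is hypothetical, the URL patterns are simply the two used in the cell above, and whether each site actually expects spaces in this form has not been checked here.

```python
from urllib.parse import quote, quote_plus


def build_search_urls(word):
    """Return the search URL of each site for `word` (hypothetical helper)."""
    cleaned = word.strip()
    return {
        # The word sits in the URL path here, so encode every reserved character.
        "lily": f"https://www.thelily.com/tag/{quote(cleaned, safe='')}/",
        # The word is a query-string value here, so spaces become '+'.
        "nylon": f"https://nylon.com/search/?q={quote_plus(cleaned)}",
    }


urls = build_search_urls("mental health")
print(urls["lily"])
print(urls["nylon"])
```

The returned strings can be passed to `requests.get` exactly like `url_lily` and `url_nylon`.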

Working with the data from The Lily

Formatting the scraped data

In [4]:
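# Each search result sits in a card <div> carrying exactly this class string
lily_card = html_lily.find_all('div', class_='card-content align-items-start flex-container-row justify-space-between flex-mobile-column')

lily_article = []
for x in lily_card:
    a = x.find_all('a')    # links: carry the theme, the title and the article URL
    h3 = x.find_all('h3')  # headings: carry the teaser, the author and the date
    lily_article.append([a, h3])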
In [5]:
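df_lily = pd.DataFrame(lily_article)
# For now 'titre' holds the <a> tags and 'auteur' the <h3> tags; both are split into their final columns later
df_lily.columns = ['titre', 'auteur']
# Keep a second copy of the <a> tags, used below to extract the links
df_lily['theme'] = df_lily['titre']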
In [6]:
def cleandf(a):
    return '|'.join([x.text.strip() for x in a])

dfcol=['titre','auteur']

for col in dfcol:
    df_lily[col]=list(map(cleandf, df_lily[col]))
In [7]:
def linkdf(b):
    # Fall back to '' so an <a> tag without href does not break the join
    return '|'.join([x.get('href', '') for x in b])

df_lily['theme']=list(map(linkdf, df_lily['theme']))
In [8]:
# The prefix lands only on the first link of each joined string, which is the one kept below
link = lambda a: 'https://www.thelily.com' + str(a)
df_lily['theme'] = list(map(link, df_lily['theme']))
In [9]:
df_lily['link'] = df_lily['theme'].str.split('|').str[0]
df_lily['theme'] = df_lily['titre'].str.split('|').str[1]
df_lily['titre'] = df_lily['titre'].str.split('|').str[2]
df_lily['accroche'] = df_lily['auteur'].str.split('|').str[0]
df_lily['date'] = df_lily['auteur'].str.split('|').str[2]
df_lily['auteur'] = df_lily['auteur'].str.split('|').str[1]
df_lily['source'] = 'The Lily'
In [10]:
df_lily = df_lily[['source', 'theme', 'titre', 'auteur', 'accroche', 'date', 'link']]
df_lily.head(3)
Out[10]:
 | source | theme | titre | auteur | accroche | date | link
0 | The Lily | Books | I’m a therapist, and a shocking breakup landed... | Lori Gottlieb | An excerpt from Lori Gottlieb’s ‘Maybe You Sho... | May 3 | https://www.thelily.com/im-a-therapist-and-a-s...
1 | The Lily | Family | Two decades after disappearing, her daughter s... | The Lily News | She returned with children and a new name, and... | March 14 | https://www.thelily.com/two-decades-after-disa...
2 | The Lily | Books | This book’s heroine unexpectedly sees her abus... | Maureen Corrigan | What comes next in Christobel Kent’s psycholog... | February 8 | https://www.thelily.com/this-books-heroine-une...
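
The formatting cells for The Lily rely on a join-then-split trick: the text of every tag in a card is joined with '|', and each field is then picked by its position after splitting. Here is a minimal, self-contained sketch of that idea. `SimpleNamespace` only stands in for a BeautifulSoup tag (it mimics the `.text` attribute), and the sample values are copied, truncation included, from the first row of the output above.

```python
from types import SimpleNamespace


def cleandf(tags):
    """Join the stripped text of every tag with '|', as in the notebook cell."""
    return '|'.join(tag.text.strip() for tag in tags)


# Stand-ins for the three <h3> tags of one card, in the order the
# notebook assumes: teaser, author, date.
h3_tags = [
    SimpleNamespace(text=" An excerpt from Lori Gottlieb’s ‘Maybe You Sho... "),
    SimpleNamespace(text="Lori Gottlieb"),
    SimpleNamespace(text="May 3"),
]

joined = cleandf(h3_tags)

# Splitting on '|' recovers each field by position.
accroche, auteur, date = joined.split('|')
print(auteur, date)
```

The positions only line up while no text contains '|' itself and every card exposes the same number of tags in the same order.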

Working with the data from Nylon

In [11]:
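nylon_card = nylon_html.find_all('div', class_='widget')
card = [x.find_all('article') for x in nylon_card]

# One row per <article>: its <a> and <span> tags, in document order
nylon_article = [y.find_all(['a', 'span']) for x in card for y in x]

df_nylon = pd.DataFrame(nylon_article)
# The column names rely on the tags always appearing in this order
df_nylon = df_nylon.rename(columns={0: 'theme', 1: 'link', 2: 'titre', 3: 'auteur', 4: 'date'})
df_nylon['source'] = 'nylon'
df_nylon['accroche'] = ''  # no teaser is extracted for Nylon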
In [12]:
dfcol=['theme','titre','auteur','date']
for col in dfcol:
    df_nylon[col] = [x.text.strip() for x in df_nylon[col]]

df_nylon['link'] = [x.get('href') for x in df_nylon['link']]
df_nylon = df_nylon[['source','theme','titre','auteur','accroche','date','link']]
In [13]:
df_nylon.head(3)
Out[13]:
 | source | theme | titre | auteur | accroche | date | link
0 | nylon | Film | Megan Fox Says She Had A "Psychological Breakd... | Bailey Calfee |  | 19 September | https://nylon.com/megan-fox-psychological-brea...
1 | nylon | Justice | The Psychological Impact Of Anti-Abortion Legi... | Ivana Rihter |  | 21 August | https://nylon.com/psychological-impact-anti-ab...
2 | nylon | TV | Alia Shawkat Can't Wait To Lose Control | Caitlin Wolper |  | 16 October | https://nylon.com/alia-shawkat-second-woman-in...

Merging the two tables

In [14]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two tables the same way
df_search_word = pd.concat([df_lily, df_nylon])
df_search_word
Out[14]:
 | source | theme | titre | auteur | accroche | date | link
0 | The Lily | Books | I’m a therapist, and a shocking breakup landed... | Lori Gottlieb | An excerpt from Lori Gottlieb’s ‘Maybe You Sho... | May 3 | https://www.thelily.com/im-a-therapist-and-a-s...
1 | The Lily | Family | Two decades after disappearing, her daughter s... | The Lily News | She returned with children and a new name, and... | March 14 | https://www.thelily.com/two-decades-after-disa...
2 | The Lily | Books | This book’s heroine unexpectedly sees her abus... | Maureen Corrigan | What comes next in Christobel Kent’s psycholog... | February 8 | https://www.thelily.com/this-books-heroine-une...
3 | The Lily | Lifestyle | Minimalism isn’t for me but here’s how I’m get... | Monica Castillo | After years of constant moving, I’m finally or... | July 16 | https://www.thelily.com/minimalism-isnt-for-me...
4 | The Lily | Violence | The Annapolis shooting was an attack on a news... | Elizabeth Chang | The fight isn’t just about guns or the news media | July 2 | https://www.thelily.com/the-annapolis-shooting...
5 | The Lily | Obituaries | Psychologist Anne Treisman ‘changed the way we... | The Lily News | She died last week at 82 | February 15 | https://www.thelily.com/psychologist-anne-trei...
0 | nylon | Film | Megan Fox Says She Had A "Psychological Breakd... | Bailey Calfee |  | 19 September | https://nylon.com/megan-fox-psychological-brea...
1 | nylon | Justice | The Psychological Impact Of Anti-Abortion Legi... | Ivana Rihter |  | 21 August | https://nylon.com/psychological-impact-anti-ab...
2 | nylon | TV | Alia Shawkat Can't Wait To Lose Control | Caitlin Wolper |  | 16 October | https://nylon.com/alia-shawkat-second-woman-in...
3 | nylon | Music | Artist To Watch Mahalia Hates Drama, But Not T... | Allison Stubblebine |  | 29 October | https://nylon.com/mahalia-love-compromise-inte...
4 | nylon | Film | Everyone Who Wanted To See Natalie Portman In ... | Allison Stubblebine |  | 12 September | https://nylon.com/lucy-sky-natalie-portman-dia...
5 | nylon | Film | 'Don't Let Go' Is Not Just Another Time Travel... | Sesali Bowen |  | 28 August | https://nylon.com/storm-reid-david-oyelowo-int...
6 | nylon | Film | In 'The Lighthouse' Robert Pattinson Descends ... | Jesse Hassenger |  | 17 October | https://nylon.com/the-lighthouse-review-robert...
7 | nylon | Astrology | Here Are The Zodiac Signs For All The Characte... | Gala Mukomolova |  | 14 October | https://nylon.com/zodiac-signs-euphoria-charac...
8 | nylon | Justice | How Art Inspired By Sexual Assault Can Help Su... | Caitlin Wolper |  | 19 November | https://nylon.com/sexual-assault-art-healing
9 | nylon | Justice | 'The Good Place' Star Jameela Jamil Says Abort... | Allison Stubblebine |  | 14 May | https://nylon.com/jameela-jamil-abortion
10 | nylon | Film | How Octavia Spencer Became One Of Hollywood's ... | Jesse Hassenger |  | 01 August | https://nylon.com/octavia-spencer-luce-review
11 | nylon | Books | Jia Tolentino Tells The Stories Of Generation ... | Joanna Demkiewicz |  | 06 August | https://nylon.com/review-jia-tolentino-trick-m...
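
The merged table still mixes two date layouts: "May 3" for The Lily and "19 September" for Nylon. As one possible harmonization step, the sketch below converts both to a common 'MM-DD' string with the standard library. The function name `normalize_date` and the placeholder year are assumptions made for this illustration; the scraped dates carry no year, so none can be recovered. Month names are matched according to the process locale, which is English by default.

```python
from datetime import datetime

# Day/month layouts seen in the merged table: "May 3" (The Lily)
# and "19 September" (Nylon). A year is appended before parsing.
DATE_FORMATS = ("%B %d %Y", "%d %B %Y")


def normalize_date(text, year=2000):
    """Return `text` as 'MM-DD', or None when no known layout matches.

    `year` is an arbitrary placeholder (a leap year, so that 29 February
    parses); it exists only because the scraped dates have no year.
    """
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(f"{text.strip()} {year}", fmt)
        except ValueError:
            continue  # try the next layout
        return parsed.strftime("%m-%d")
    return None


print(normalize_date("May 3"))
print(normalize_date("19 September"))
```

Applied to the table, this could look like `df_search_word['date'].map(normalize_date)`.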

Exporting to a .json file

In [20]:
df_search_word.to_json('search_word_result.json', orient='records')
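
With `orient='records'`, the file contains a JSON array with one object per row, keyed by column name; the DataFrame index is not written. The sketch below reads such a payload back with the standard `json` module. The inline string is only a stand-in for the file contents: it keeps three of the seven columns, with values taken from the tables above.

```python
import json

# Stand-in for the contents of search_word_result.json:
# a JSON array holding one object per article.
payload = '''[
  {"source": "The Lily", "theme": "Books", "auteur": "Lori Gottlieb"},
  {"source": "nylon", "theme": "Film", "auteur": "Bailey Calfee"}
]'''

articles = json.loads(payload)

# Each row comes back as a plain dict keyed by column name.
authors_by_source = {a["source"]: a["auteur"] for a in articles}
print(authors_by_source)
```

To read the real file, `json.load` on the opened file would replace `json.loads(payload)`.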

Ideas for improvement and next steps

  • add other sites, in several languages
  • add a translation module so that the search works across these different sites
  • count the number of articles per theme, when themes are available
  • analyze the publication rate around the chosen word, for example to look for seasonality or an emerging trend
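
As a starting point for the per-theme count listed above, here is a minimal sketch using `collections.Counter`. The records are shaped like the rows of the exported JSON but keep only the 'source' and 'theme' fields; the values come from the first three rows of each site in the merged table.

```python
from collections import Counter

# One dict per article, reduced to the fields needed for the count.
records = [
    {"source": "The Lily", "theme": "Books"},
    {"source": "The Lily", "theme": "Family"},
    {"source": "The Lily", "theme": "Books"},
    {"source": "nylon", "theme": "Film"},
    {"source": "nylon", "theme": "Justice"},
    {"source": "nylon", "theme": "TV"},
]

# Skip records whose theme is missing or empty before counting.
theme_counts = Counter(r["theme"] for r in records if r.get("theme"))

for theme, count in theme_counts.most_common():
    print(theme, count)
```

On the DataFrame itself, `df_search_word['theme'].value_counts()` should give the same kind of count.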