Web Scraping Project

I love reading: books, newspapers, magazines. I enjoy keeping up with what is being published and, above all, sharing what I find and curating it. Thanks to the web, we can now read newspapers and magazines from all over the world that used to be hard to find in print.

For this web scraping project, I chose to scrape two online media sites: The Lily and Nylon.

Goal: build a table that, for a given word, gathers the articles from the search results pages of these two sites.
The challenge: standardize the results so that a single final table holds all the data.

Importing the libraries

import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

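Creating an input field to enter the search word of your choice

word = input("Type a word: ")

# The Lily exposes its results as a tag page, Nylon as a search query
url_lily = f'https://www.thelily.com/tag/{word}/'
url_nylon = f'https://nylon.com/search/?q={word}'

# Name the parser explicitly so the result does not depend on which parsers are installed
html_lily = BeautifulSoup(requests.get(url_lily).content, 'html.parser')
nylon_html = BeautifulSoup(requests.get(url_nylon).content, 'html.parser')

Type a word: psychology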
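
The URLs above interpolate the raw word, which could break for input containing spaces or special characters. A minimal sketch of percent-encoding the word first, using only the standard library. `build_urls` is a hypothetical helper name, not part of the original notebook, and whether The Lily's tag pages expect encoded spaces or hyphens is an assumption left unverified here.

```python
from urllib.parse import quote, urlencode


def build_urls(word):
    """Return (url_lily, url_nylon) with the search word safely encoded."""
    # Path segment: encode everything, including '/' (safe="")
    tag = quote(word, safe="")
    url_lily = f"https://www.thelily.com/tag/{tag}/"
    # Query string: urlencode handles spaces and reserved characters
    query = urlencode({"q": word})
    url_nylon = f"https://nylon.com/search/?{query}"
    return url_lily, url_nylon


url_lily, url_nylon = build_urls("mental health")
print(url_lily)
print(url_nylon)
```

For a single plain word such as "psychology", the helper returns the same two URLs as the f-strings above.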

Working on the data from The Lily

Formatting the scraped data

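# Each result card is a <div> whose class attribute is exactly this string
lily_card = html_lily.find_all('div', class_='card-content align-items-start flex-container-row justify-space-between flex-mobile-column')

# For each card, collect its <a> tags and its <h3> tags
lily_article = []
for x in lily_card:
    a = x.find_all('a')
    h3 = x.find_all('h3')
    lily_article.append([a, h3])

# Temporary column names: 'titre' holds the <a> tags, 'auteur' the <h3> tags
df_lily = pd.DataFrame(lily_article)
df_lily.columns = ['titre', 'auteur']
df_lily['theme'] = df_lily['titre']  # second copy of the <a> tags, used for the hrefs

def cleandf(a):
    # Join the stripped text of every tag with '|'
    return '|'.join([x.text.strip() for x in a])

dfcol = ['titre', 'auteur']

for col in dfcol:
    df_lily[col] = list(map(cleandf, df_lily[col]))

def linkdf(b):
    # Join the href of every <a> tag with '|'
    return '|'.join([x.get('href') for x in b])

df_lily['theme'] = list(map(linkdf, df_lily['theme']))

# Prefix the site root; only the first href gets it, and only that one is kept below
link = lambda a: 'https://www.thelily.com' + str(a)
df_lily['theme'] = list(map(link, df_lily['theme']))

# Split the joined strings back into the final columns by position
# (order matters: each source column is read before it is overwritten)
df_lily['link'] = df_lily['theme'].str.split('|').str[0]
df_lily['theme'] = df_lily['titre'].str.split('|').str[1]
df_lily['titre'] = df_lily['titre'].str.split('|').str[2]
df_lily['accroche'] = df_lily['auteur'].str.split('|').str[0]
df_lily['date'] = df_lily['auteur'].str.split('|').str[2]
df_lily['auteur'] = df_lily['auteur'].str.split('|').str[1]
df_lily['source'] = 'The Lily'

df_lily = df_lily[['source', 'theme', 'titre', 'auteur', 'accroche', 'date', 'link']]
df_lily.head(3)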
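
The positional splits above assume that every card yields its tags in the same order. As a rough illustration on plain Python lists (no scraping involved), the same mapping can be written with a guard for missing positions. The sample values and the `pick` and `card_to_record` helpers are hypothetical; only the positions mirror the indices used in the notebook code.

```python
def pick(items, index, default=""):
    """Return items[index], or default when that position is missing."""
    return items[index] if index < len(items) else default


def card_to_record(link_texts, hrefs, h3_texts):
    """Map the tags of one card to the final columns by position."""
    return {
        "source": "The Lily",
        "theme": pick(link_texts, 1),
        "titre": pick(link_texts, 2),
        "auteur": pick(h3_texts, 1),
        "accroche": pick(h3_texts, 0),
        "date": pick(h3_texts, 2),
        # Only the first href is kept, prefixed with the site root
        "link": "https://www.thelily.com" + pick(hrefs, 0),
    }


# Hypothetical card: the date <h3> is left out on purpose
record = card_to_record(
    link_texts=["", "Books", "Sample title"],
    hrefs=["/sample-article/", "/books/", "/sample-article/"],
    h3_texts=["Sample teaser", "Sample Author"],
)
print(record)
```

A card with a missing tag then produces an empty field instead of shifting or failing, which makes irregular cards easier to spot in the final table.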

Working on the data from Nylon

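# Result articles sit inside <div> elements carrying the 'widget' class
nylon_card = nylon_html.find_all('div', class_='widget')
card = [x.find_all('article') for x in nylon_card]

# For each article, keep its <a> and <span> tags in document order
nylon_article = [y.find_all(['a', 'span']) for x in card for y in x]

# Name the first five positions; 'link' holds the <a> tag whose href is read below
df_nylon = pd.DataFrame(nylon_article)
df_nylon = df_nylon.rename(columns={0: 'theme', 1: 'link', 2: 'titre', 3: 'auteur', 4: 'date'})
df_nylon['source'] = 'nylon'
df_nylon['accroche'] = ''  # Nylon results have no teaser; keep the column for the merge

# Replace each tag with its stripped text
dfcol = ['theme', 'titre', 'auteur', 'date']
for col in dfcol:
    df_nylon[col] = [x.text.strip() for x in df_nylon[col]]

df_nylon['link'] = [x.get('href') for x in df_nylon['link']]
df_nylon = df_nylon[['source', 'theme', 'titre', 'auteur', 'accroche', 'date', 'link']]

df_nylon.head(3)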

Merging the two tables

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df_search_word = pd.concat([df_lily, df_nylon])
df_search_word
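
The merge works because both tables were given the same seven columns beforehand, with an empty 'accroche' for Nylon. As a rough illustration of that alignment step with plain dictionaries rather than DataFrames, here is a small sketch. `COLUMNS` matches the column list used above, while `align` and the sample rows are hypothetical.

```python
COLUMNS = ["source", "theme", "titre", "auteur", "accroche", "date", "link"]


def align(record):
    """Return a copy of record restricted to COLUMNS, filling gaps with ''."""
    return {col: record.get(col, "") for col in COLUMNS}


# Hypothetical partial rows: each source provides a different subset of fields
lily_rows = [{"source": "The Lily", "theme": "Books", "accroche": "Sample teaser"}]
nylon_rows = [{"source": "nylon", "theme": "Film", "extra": "ignored"}]

merged = [align(r) for r in lily_rows + nylon_rows]
for row in merged:
    print(row)
```

Every merged row ends up with the same keys in the same order, which is the property the final table relies on.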

Exporting to a .json file

df_search_word.to_json('search_word_result.json', orient='records')
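
With orient='records', the file holds a JSON array with one object per article. A small sketch of reading such a file back with the standard library follows. So that it runs on its own, it first writes a hypothetical two-row sample in that same layout to a temporary file instead of relying on the real export.

```python
import json
import os
import tempfile

# Hypothetical sample in the "records" layout: a list of one dict per row
sample = [
    {"source": "The Lily", "theme": "Books", "date": "May 3"},
    {"source": "nylon", "theme": "Film", "date": "19 September"},
]

path = os.path.join(tempfile.mkdtemp(), "search_word_result_sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

# Reading back gives a plain list of dicts, one per article
with open(path, encoding="utf-8") as f:
    records = json.load(f)

sources = sorted({r["source"] for r in records})
print(len(records), sources)
```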

Ideas for improvement and next steps

add more sites, in several languages
add a translation module so that the search works across these different sites
count the number of articles per theme, when themes are available
analyze the publication rhythm around the chosen word, for example to look for seasonality or an emerging trend
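
The last two ideas can be sketched with the standard library alone. The themes and dates below are copied from the result tables of this notebook for the word "psychology". The `month_of` helper is hypothetical, and since the scraped dates carry no year, only a per-month count is possible here.

```python
from collections import Counter
from datetime import datetime

themes = ["Books", "Family", "Books", "Lifestyle", "Violence", "Obituaries",
          "Film", "Justice", "TV", "Music", "Film", "Film", "Film",
          "Astrology", "Justice", "Justice", "Film", "Books"]
dates = ["May 3", "March 14", "February 8", "July 16", "July 2", "February 15",
         "19 September", "21 August", "16 October", "29 October",
         "12 September", "28 August", "17 October", "14 October",
         "19 November", "14 May", "01 August", "06 August"]


def month_of(text):
    """Return the month number for 'May 3' or '19 September' style dates."""
    # A placeholder year is appended only to make parsing unambiguous;
    # it is never used afterwards.
    for fmt in ("%B %d %Y", "%d %B %Y"):
        try:
            return datetime.strptime(f"{text} 2000", fmt).month
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


theme_counts = Counter(themes)
month_counts = Counter(month_of(d) for d in dates)

print(theme_counts.most_common(3))
print(sorted(month_counts.items()))
```

The two sites format their dates differently ("May 3" versus "19 September"), so trying both layouts in turn is one simple way to bring them to a common form before counting.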

	source	theme	titre	auteur	accroche	date	link
0	The Lily	Books	I’m a therapist, and a shocking breakup landed...	Lori Gottlieb	An excerpt from Lori Gottlieb’s ‘Maybe You Sho...	May 3	https://www.thelily.com/im-a-therapist-and-a-s...
1	The Lily	Family	Two decades after disappearing, her daughter s...	The Lily News	She returned with children and a new name, and...	March 14	https://www.thelily.com/two-decades-after-disa...
2	The Lily	Books	This book’s heroine unexpectedly sees her abus...	Maureen Corrigan	What comes next in Christobel Kent’s psycholog...	February 8	https://www.thelily.com/this-books-heroine-une...

	source	theme	titre	auteur	date	link
0	nylon	Film	Megan Fox Says She Had A "Psychological Breakd...	Bailey Calfee	19 September	https://nylon.com/megan-fox-psychological-brea...
1	nylon	Justice	The Psychological Impact Of Anti-Abortion Legi...	Ivana Rihter	21 August	https://nylon.com/psychological-impact-anti-ab...
2	nylon	TV	Alia Shawkat Can't Wait To Lose Control	Caitlin Wolper	16 October	https://nylon.com/alia-shawkat-second-woman-in...

	source	theme	titre	auteur	accroche	date	link
0	The Lily	Books	I’m a therapist, and a shocking breakup landed...	Lori Gottlieb	An excerpt from Lori Gottlieb’s ‘Maybe You Sho...	May 3	https://www.thelily.com/im-a-therapist-and-a-s...
1	The Lily	Family	Two decades after disappearing, her daughter s...	The Lily News	She returned with children and a new name, and...	March 14	https://www.thelily.com/two-decades-after-disa...
2	The Lily	Books	This book’s heroine unexpectedly sees her abus...	Maureen Corrigan	What comes next in Christobel Kent’s psycholog...	February 8	https://www.thelily.com/this-books-heroine-une...
3	The Lily	Lifestyle	Minimalism isn’t for me but here’s how I’m get...	Monica Castillo	After years of constant moving, I’m finally or...	July 16	https://www.thelily.com/minimalism-isnt-for-me...
4	The Lily	Violence	The Annapolis shooting was an attack on a news...	Elizabeth Chang	The fight isn’t just about guns or the news media	July 2	https://www.thelily.com/the-annapolis-shooting...
5	The Lily	Obituaries	Psychologist Anne Treisman ‘changed the way we...	The Lily News	She died last week at 82	February 15	https://www.thelily.com/psychologist-anne-trei...
0	nylon	Film	Megan Fox Says She Had A "Psychological Breakd...	Bailey Calfee		19 September	https://nylon.com/megan-fox-psychological-brea...
1	nylon	Justice	The Psychological Impact Of Anti-Abortion Legi...	Ivana Rihter		21 August	https://nylon.com/psychological-impact-anti-ab...
2	nylon	TV	Alia Shawkat Can't Wait To Lose Control	Caitlin Wolper		16 October	https://nylon.com/alia-shawkat-second-woman-in...
3	nylon	Music	Artist To Watch Mahalia Hates Drama, But Not T...	Allison Stubblebine		29 October	https://nylon.com/mahalia-love-compromise-inte...
4	nylon	Film	Everyone Who Wanted To See Natalie Portman In ...	Allison Stubblebine		12 September	https://nylon.com/lucy-sky-natalie-portman-dia...
5	nylon	Film	'Don't Let Go' Is Not Just Another Time Travel...	Sesali Bowen		28 August	https://nylon.com/storm-reid-david-oyelowo-int...
6	nylon	Film	In 'The Lighthouse' Robert Pattinson Descends ...	Jesse Hassenger		17 October	https://nylon.com/the-lighthouse-review-robert...
7	nylon	Astrology	Here Are The Zodiac Signs For All The Characte...	Gala Mukomolova		14 October	https://nylon.com/zodiac-signs-euphoria-charac...
8	nylon	Justice	How Art Inspired By Sexual Assault Can Help Su...	Caitlin Wolper		19 November	https://nylon.com/sexual-assault-art-healing
9	nylon	Justice	'The Good Place' Star Jameela Jamil Says Abort...	Allison Stubblebine		14 May	https://nylon.com/jameela-jamil-abortion
10	nylon	Film	How Octavia Spencer Became One Of Hollywood's ...	Jesse Hassenger		01 August	https://nylon.com/octavia-spencer-luce-review
11	nylon	Books	Jia Tolentino Tells The Stories Of Generation ...	Joanna Demkiewicz		06 August	https://nylon.com/review-jia-tolentino-trick-m...