Cannot extract the description and rating while scraping using requests and Beautiful Soup

By : Darek
Date : November 22 2020, 04:01 AM
The question: I'm a beginner to web scraping. I was scraping the page https://myanimelist.net/anime/394 but was unable to fetch the description and rating with my Python code using requests and Beautiful Soup. The same code works fine for the other pages at that URL, so I can't find the bug.

The answer: the problem is the parser. Change:
code :
# before (fails for this page):
soup = BeautifulSoup(source.content, 'lxml')
# after (works):
soup = BeautifulSoup(source.content, 'html.parser')
Anime : Ai Yori Aoshi: Enishi
Rating : 7.22
Description : Two years after meeting Aoi, Kaoru and gang are still up to their normal habits. Kaoru now in grad school and the tenants being as rowdy as ever what will become of Aoi and Kaoru's love.

Two years has passed since Aoi and Kaoru were freed from the bonds of their families. They continue to live their normal lives with their usual friends in their house.
Ranked #2737

Episodes:
  12

None
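
A fuller sketch of the fix, for context: the URL comes from the question, but the tag and class names in the lookups below are assumptions about the MyAnimeList markup and may need adjusting against the live page.

import requests
from bs4 import BeautifulSoup

url = 'https://myanimelist.net/anime/394'
source = requests.get(url)
# Per the answer above, html.parser succeeds on this page where lxml does not.
soup = BeautifulSoup(source.content, 'html.parser')

# Illustrative lookups - verify these selectors against the live page.
title_tag = soup.find('h1')
score_tag = soup.find('div', class_='score-label')
desc_tag = soup.find('p', itemprop='description')

print('Anime :', title_tag.text.strip() if title_tag else None)
print('Rating :', score_tag.text.strip() if score_tag else None)
print('Description :', desc_tag.text.strip() if desc_tag else None)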


Web scraping data using Python Beautiful Soup - can't extract field


By : Gab Esz
Date : March 29 2020, 07:55 AM
The question: I am trying to extract the ticker (South Africa 40) field from an IG Index page using Python Beautiful Soup, but I can't retrieve it.

The answer: have you tried find_all instead of select? Something like:
code :
name_div = soup.find_all('div', {'class': 'ma-content title'})[0]
name = name_div.find('h1').text
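
For context, a minimal end-to-end sketch around that snippet; the URL below is a placeholder and the 'ma-content title' class comes from the answer above, so check both against the live page.

import requests
from bs4 import BeautifulSoup

# Placeholder URL - substitute the IG Index page being scraped.
url = 'https://www.ig.com/uk/indices/markets-indices/south-africa-40'
# Some sites refuse requests that lack a browser-like User-Agent.
headers = {'User-Agent': 'Mozilla/5.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

# find_all returns a list; take the first matching div, then its <h1>.
divs = soup.find_all('div', {'class': 'ma-content title'})
if divs:
    print(divs[0].find('h1').text.strip())
else:
    print('No matching div found - the class name may have changed.')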
Web Scraping: Extract different recurring classes and their content with Beautiful Soup in Python


By : Muhammad Ramadan
Date : March 29 2020, 07:55 AM
The question: I am quite new to web scraping. I have a long website which follows this format: class A is the title, class B the subtitle, and class C the paragraphs of text.

The answer: I'm guessing that your HTML looks somewhat like this:
code :
<p class="A">Title1</p>
<p class="B">Subtitle1</p>
<p class="C">Text1</p>
<p class="C">Text1</p>
<p class="C">Text1</p>
<p class="A">Title2</p>
<p class="B">Subtitle2</p>
<p class="C">Text2</p>
<p class="C">Text2</p>
<p class="C">Text2</p>
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # html is the HTML shown above

title, subtitle, para = '', '', ''
for p in soup.find_all('p', class_=['A', 'B', 'C']):
    if p['class'][0] == 'A':
        if title:
            print(title, subtitle, para)  # Or add these values in CSV
        title = p.text
        para = ''
        continue
    if p['class'][0] == 'B':
        subtitle = p.text
        continue

    para += p.text + ' '

print(title, subtitle, para)  # Or add these values in CSV
Title1 Subtitle1 Text1 Text1 Text1 
Title2 Subtitle2 Text2 Text2 Text2 
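
The comments in the code above mention adding the values to a CSV instead of printing them; a rough sketch of that variant (same parsing logic, with the rows collected into a list and written out via the csv module):

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # html is the HTML shown above

rows = []
title, subtitle, para = '', '', ''
for p in soup.find_all('p', class_=['A', 'B', 'C']):
    if p['class'][0] == 'A':
        if title:
            rows.append([title, subtitle, para.strip()])
        title, para = p.text, ''
        continue
    if p['class'][0] == 'B':
        subtitle = p.text
        continue
    para += p.text + ' '
rows.append([title, subtitle, para.strip()])  # flush the last group

with open('sections.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'subtitle', 'text'])
    writer.writerows(rows)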
Scraping with correct requests for Beautiful Soup - HTML Error 400


By : cabraham
Date : March 29 2020, 07:55 AM
In your URLs you have \n characters; these need to be stripped out. There is also no pre tag in the HTML, so in this example the second h1 tag is used for testing.
code :
import requests
from bs4 import BeautifulSoup

# In your function you need to strip out "\n" as it has no place in your URLs.
def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene.rstrip() + '.1?report=fasta'

csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
genes_urls = [build_url(gene) for gene in csv]

results = []
for url in genes_urls:
    r = requests.get(url)
    # Using html.parser but you can use lxml if you like.
    soup = BeautifulSoup(r.text,"html.parser") 
    # there is no <pre> tag in the soup so we will find the second occurrence of H1 for testing.
    result = soup.find_all('h1')[1].text
    print (result)
    results +=[result]

print (results)
Impatiens amoena internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens amphorata internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens andohahelae internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens andringitrensis internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens angulata voucher S.X. Yu 3777 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
Impatiens angulata voucher S.X. Yu 3777 atpB-rbcL intergenic spacer, partial sequence; chloroplast
Impatiens angulata voucher S.X. Yu 3777 tRNA-Leu (trnL) gene, partial sequence; trnL-trnF intergenic spacer, complete sequence; and tRNA-Phe (trnF) gene, partial sequence; plastid
Impatiens anovensis internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens apalophylla voucher S.X. Yu 4042 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
....
import requests
from bs4 import BeautifulSoup

# In your function you need to strip out "\n" as it has no place in your URLs.
def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene.rstrip() + '.1?report=fasta'

csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
genes_urls = [build_url(gene) for gene in csv]

results = []
for url in genes_urls:
    r = requests.get(url)
    # Using html.parser but you can use lxml if you like.
    soup = BeautifulSoup(r.text,"html.parser")
    # You need to get the value of content in <meta content="38155510" name="ncbi_uidlist"/>
    content = soup.find('meta', {'name':"ncbi_uidlist"})['content']
    # Simulate the XHR request using "content"
    result = requests.get("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=" + content + "&db=nuccore&report=fasta&extrafeat=null&conwithfeat=on&retmode=ht").text
    print (result)
    results +=[result]

print (results)
>AY348795.1 Impatiens amoena internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
TCGAAAACTATTTCAAACAACCAGTGAACATAATAATAAATCTTGTGTTGAGATTGACTTTTGTTTAATC
TCTTCCTATTAATGTACTTGGAGTGCTTGCTTGGCAACAAATTTGTATGCCATTTTGTAGGTTCCCTCAA
CTCATAAACAAACCCCGGCGTAAACCGCCAAGGAATGTTAAAAACAATTGCCATTATTTTACCCATTTAT
ATGGGATGAAATTTTGGTTTTAGTTATCAATAAACTAAAATGACTCTCGACAACGGATATCTCGGCTCTC
GCATCGATGAAGAACGTAGCAAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTTT
TTGAACCCAAGTTGCGCCTGAAGCTATTAGGTTGAAGGCACGTCTGCCTGGGCGTCTCGCTTCGTGTCGT
CTCATTTCATCTATTATGGGACGGATAATGGCCTCCTGTACGTTTATATATCGAGCAGTTGGTTGAAATA
TAAGTCCATATTATAGGACACACGGTTAGTGGTGGTTGAAAAAACTGTTTCAAACCCGTGTTGTAACTTA
ATTTGGATTGATTGACCCTTCTTGTGCCTTTAATGGTGCATCGTTTGC

>AY348740.1 Impatiens amphorata internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
TTCATCACCGNCGAACTTGTTATTAAAATCGGGCTGCGATTGGCCTTTGGNCGGTCGCTTCCCATCATGC
GGTTGGGGTGCACGGTGTTGTATTCTATCTTGGGTACAATCGCGTGTTCCCCCNACTCATAAACAAACCC
CGGCGTAAACCGCCAAGGAATGTTAAAAAGGACTTCCCATACCAGACCCATTTTATTTTTGGGGGATGCG
TAATGGTGTTAGTTTTCCATAAACATAACGACTCTCGACAACGGATATCTCGGCTCTCGCATCGATGAAG
AACGTAGCAAAATGCGATACTTGGTGTGAATTGCARAATTCCCGTGAACCATCGAGTTTTTGAACGCAAG
TTGCGCCTGAAGCCATTAGGTTGAGGGCACGTCTGCCTGGGCGTCTCGCTTCGTGTCGCCCCATTTCATA
ACTGTTTTGGGACGTATAATGGCCTCCTGTGCAATACCCATGCAGCAGTTGGCCGAAATAGAAGTCCATA
TGATAGGACACACGGTTAGTGGTGGTTGARAAACTGTTTC
...
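
As a small follow-on usage sketch (an assumption, not part of the original answer), the FASTA text collected in results could be written out to one file per record:

# Hypothetical follow-on to the loop above: save each FASTA record to a file.
for i, fasta in enumerate(results):
    with open('record_{}.fasta'.format(i), 'w') as out:
        out.write(fasta)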
Extract data using Beautiful Soup and Requests


By : ignacio
Date : March 29 2020, 07:55 AM
The question: I am trying to scrape data from Stack Overflow using the Beautiful Soup and requests packages in Python. I have been able to extract most of the details, but when I try to extract a user's reputation scores I can only pull the reputation score and Gold count, not the Silver and Bronze counts.

The answer: find() returns only the first matching element; to get multiple elements, use find_all():
code :
badge = article.find('div', class_='-flair').find_all('span', class_='badgecount')
gold_badge = badge[0].text
silver_badge = badge[1].text
bronze_badge = badge[2].text
print(gold_badge, silver_badge, bronze_badge) # 2 7 26
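
For context, article in the snippet above is assumed to be an element parsed from a Stack Overflow page. A rough self-contained sketch might look like the following; the question URL is a placeholder, and the '-flair' and 'badgecount' class names come from the answer above, so verify them against the current Stack Overflow markup.

import requests
from bs4 import BeautifulSoup

# Placeholder question URL - substitute the page being scraped.
url = 'https://stackoverflow.com/questions/11227809'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for flair in soup.find_all('div', class_='-flair'):
    badges = flair.find_all('span', class_='badgecount')
    # A user can be missing a badge tier, so guard the indexing.
    if len(badges) >= 3:
        gold, silver, bronze = (b.text for b in badges[:3])
        print(gold, silver, bronze)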
Python Web scraping Beautiful Soup - Clinicaltrials.gov - getting detailed description (novice question)


By : user3212566
Date : March 29 2020, 07:55 AM
Hope this helps. Here is an example of how to extract the short and detailed study descriptions using the requests and lxml.html modules:
code :
import requests
import lxml.html


def scraper(url: str, timeout: int = 5) -> tuple:
    """
    Scrape short and detailed study descriptions.

    :param url: The url of the study.
    :type url: str

    :param timeout: How long to wait for a response.
    :type timeout: int

    :return: A tuple consisting of the short and long study description.
    """
    # Add long description toggler to url
    url += "?show_desc=Y#desc"
    # Make the request and parse as tree
    response = requests.get(url=url, timeout=timeout)
    tree = lxml.html.fromstring(response.text)
    short, long = tree.find_class("ct-body3 tr-indent2")
    short, long = short.text_content(), long.text_content()
    return short, long
short, long = scraper("https://clinicaltrials.gov/ct2/show/study/NCT03089801")
print('Short description:\n\n%s\n%s\n\nLong description:\n\n%s' % (short, '-' * 25, long))
# Short description:
# 
# In order to enhance access to clinical and mental health services for Veterans who have geographic, clinical, or social barriers to in-person care, VA Offices of Connected Care and Rural Health began distributing 5,000 tablets to Veterans with access barriers in 2016. The objective of this Quality Improvement evaluation is to:
# 
#     Understand characteristics of Veterans who received tablets, the frequency and ways in which they used the tablets, and the effects of tablet use on access to VA services.
#     Through a survey of Veterans, evaluate patient experiences using the tablets, and determine how tablets influenced patients' experiences with VA care, including their satisfaction, communication with providers, and access to needed services.
#     Identify implementation barriers and facilitators to tablet distribution and use through interviews with clinicians and staff in a purposive sample of VA facilities
#     Evaluate the effects of tablet use on chronic medical condition outcomes (e.g., hypertension, diabetes) and mental health treatment initiation and engagement (e.g., for depression, PTSD, and substance use).
# 
# -------------------------
# 
# Long description:
# 
# Background:
#   Telehealth is a cornerstone of enhanced access for Veterans and across a range of conditions is associated with improved disease control, quality of life, and patient satisfaction. Increasingly Veterans are able to monitor their chronic conditions and communicate with clinicians and care teams via tablets and other devices. However, this service is currently only available to Veterans with in-home Internet and video capability, or Veterans who are able to travel to a VA community based outpatient clinics to connect with providers at other facilities. In 2016, in order to address this access gap and disparity, VA launched an initiative to distribute tablets to Veterans who have clinical needs for remote care, and barriers to traditional in-person access.
#   Veterans who meet specific need-based (access, technology, and clinical) criteria may be issued one of two devices: Commercially available Off the Shelf (COTS) for basic connectivity or Healthcare Access Tablet (HAT) with a general exam camera and optional peripheral devices (i.e., stethoscope, BP monitor, pulse oximeter, thermometer, or weight scale). VA providers refer eligible patients for the devices using a consult template in VA's electronic health record. Care delivered via the tablet is indicated in the referral and may include one or more of the following: Home Based Primary Care, Palliative Care, Mental Health Intensive Case Management, Spinal Cord Injury, Mental Health Care, care for patients with marked mobility problems, care for patients with cognitive problems (these patients must have a caregiver who can assist with technology), home evaluations, and rehabilitation/prosthetics. Once the patient is issued the device, he or she will receive tablet services from trained teleproviders.
#   The VA began distributing tablets in the spring of 2016, with the plan of distributing 5,000 tablets over the following 1-2 years. Veteran eligibility criteria for tablets include the following: 1) Enrolled in VA Healthcare, 2) Does not own a device or does not have working broadband or cellular internet connection, 3) Physically and cognitively able to operate the technology (or has caregiver who can assist), 4) Barriers to access, such as a) distance or geography, b) transportation issues, c) homebound or difficulty leaving home, d) other (described by provider), and 5) Provider and patient give informed consent agreeing to utilize telehealth for care.
#   The tablet initiative and evaluation have been designated as Quality Improvement by VA's Office of Rural Health. The evaluation will include the following:
# 
#     Tablet Recipient Characteristics, Use of Tablets, and Effects on Access. The investigators will first characterize Veterans who are issued and use the devices (e.g., age, sex, medical and mental health conditions, rural location/distance from VA). Investigators will describe the frequency of tablet use and the types of services that the Veteran receives (e.g., chronic disease management, mental health therapy, palliative care, home-based primary care). Investigators will analyze rates of in-person (outpatient, emergency care), telephone, and telehealth-based care before and after tablet distribution, and compare patterns to those observed in a cohort of comparable patients to assess whether tablets influence access and patterns of use.
#     Effects on Patient Experience. For patients receiving tablets beginning in March, 2017, the investigators will administer a survey at time of tablet receipt, and 3-6 months after that time, to examine changes in patients' satisfaction with VA care and their perceived access and communication, and to evaluate their experiences using the tablets. The survey will also assess patients' needs and risk factors (e.g., social support, health literacy), and how these factors impact patients' experiences with the tablets and VA care. If resources permit, the survey may be administered to a cohort of comparable patients who have not received tablets (to be determined as of March, 2017).
#     Implementation Evaluation. The implementation evaluation will be guided by the Consolidated Framework for Implementation Research (CFIR). The investigators will first administer an online survey to Facility Telehealth Coordinators (FTCs) at facilities that are distributing tablets. The survey will query FTCs about the tablet initiative, resources that facilitated implementation, and barriers that impeded implementation. The investigators will use survey responses to identify FTCs who represent a range of VA facilities (in terms of high vs. low tablet distribution rates). Follow-up interviews will be conducted by telephone. The investigators will transcribe and code the interviews using standard content analysis methods with the goal of understanding barriers and facilitators to tablet distribution within each of the CFIR domains.
#     Effects on Chronic Disease and Mental Health Outcomes. If resources are available in FY18, the investigators will evaluate how device distribution influences clinical outcomes for Veterans with common and high-risk conditions, such as hypertension, diabetes, and PTSD (conditions to be determined based on prevalence rates in the tablet recipient population). The investigators will compare measures of disease control (e.g., blood pressure readings, hemoglobin A1C levels) at 3 and 6 months after device shipment, and compare these levels to comparable patients from other facilities, using propensity scores to match patients on the basis of sociodemographic and clinical characteristics. The investigators will use similar methods to examine treatment initiation and engagement rates among patients with common mental health conditions, such as depression, PTSD, and substance use disorder.
# 
#   The proposed project will be conducted with support from the eHealth Partnered Evaluation Initiative, a partnership between QUERI and Office of Connected Health that aims to evaluate the implementation of patient-provider technologies across VA, and understand their impacts on Veteran experience, perceived burdens and benefits to clinical teams, access to care, other care processes, and Veteran health outcomes.
# 
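
The question asked about Beautiful Soup specifically; an equivalent sketch using requests and BeautifulSoup (the 'ct-body3 tr-indent2' class pair is taken from the lxml example above, so verify it against the live page):

import requests
from bs4 import BeautifulSoup

# Same study URL as above, with the long-description toggle appended.
url = 'https://clinicaltrials.gov/ct2/show/study/NCT03089801?show_desc=Y#desc'
soup = BeautifulSoup(requests.get(url, timeout=5).text, 'html.parser')

# First match is the short description, second the detailed one.
blocks = soup.select('.ct-body3.tr-indent2')
if len(blocks) >= 2:
    short_desc = blocks[0].get_text(strip=True)
    long_desc = blocks[1].get_text(strip=True)
    print(short_desc)
    print('-' * 25)
    print(long_desc)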