Web Scraping with Python

In this blog, we will look at web scraping and how to implement it in the Python programming language.

Readers may want to consult our earlier blogs on NLP for a better understanding of the concepts.

Web scraping is used whenever we want to extract or copy large amounts of information from a website as quickly as possible, without visiting each page manually to get the data.

Web scraping makes this task easier and faster.

Applications of Web Scraping

Web scraping can be used for many reasons. But why would anyone need to collect so much data from websites? Let's take a look:

  • Some companies use users' email addresses as a marketing channel, so they use web scraping to collect email IDs and send bulk emails.
  • Web scraping is sometimes run on social media sites such as Twitter to collect data and find out what is trending.
  • It is used to gather data from different review forums and run sentiment analysis on it.
  • Web scraping is also used to gather data for training and testing machine learning models.

However, some sites do not allow web scraping. To find out whether a site allows it, all we need to do is examine the site's 'robots.txt' file.

We only need to append /robots.txt to the root URL of the site we want to scrape.
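The same check can also be done from Python. The sketch below uses the standard library's urllib.robotparser; the helper name is_allowed, the sample rules, and the example.com URLs are assumptions made up for illustration. It parses rules from memory so that it runs without a network connection. For a real site you could call set_url() with the robots.txt address and then read() instead of parse().

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, as they might appear in a site's robots.txt file
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "http://example.com/index.html"))
print(is_allowed(sample_rules, "http://example.com/private/data.html"))
```

With these sample rules, the first URL is allowed and the second is blocked, because only paths under /private/ are disallowed.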

How Web Scraping Works

First we choose the URL we want to scrape. When the code runs, a request is sent to that URL. The server sends the data back as a response, which lets us read the HTML/XML page.

The code then parses the page, finds the data, and extracts it.
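The parse-and-extract step can be sketched without any third-party package. The example below is only an illustration: it swaps in the standard library's html.parser in place of Beautiful Soup, and the class name TextExtractor, the helper extract_text, and the sample HTML string are all assumptions rather than part of the original walkthrough. It works on an in-memory string, so no request is sent.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Parse an HTML string and return its text joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page standing in for the body of a server response
sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>web <b>scraping</b></p></body></html>"
)
print(extract_text(sample_html))
```

In a real scraper the string passed to extract_text would come from decoding the bytes returned by the response's read() method.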

In this blog, we will find the frequency of words on a web page, using urllib and BeautifulSoup to extract the text from the page. We will then remove the stop words from it and plot the result.

Beautiful Soup is a Python library for extracting data from HTML and XML files.

Let's see how to do web scraping with Python:

We start by importing the required libraries, then fetch and parse the page:

from bs4 import BeautifulSoup
import urllib.request
import nltk

# Fetch the page and parse the returned HTML (the html5lib parser must be installed)
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, 'html5lib')

The urllib.request module is used to open URLs. The Beautiful Soup package is used to extract data from HTML files.

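# Extract the page text; the space separator keeps words from adjacent tags apart
text = soup.get_text(separator=' ', strip=True)
# Split on whitespace and count how often each token occurs
tokens = text.split()
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))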

'soup.get_text' returns the text of the web page, and 'nltk.FreqDist' computes the frequency of each vocabulary item in the text.
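To see what the frequency count does without downloading a page, here is a small sketch that uses collections.Counter from the standard library (nltk.FreqDist is itself built on Counter). The function name word_frequencies, the punctuation stripping, the lowercasing, and the sample sentence are assumptions added for illustration and are not part of the original code.

```python
from collections import Counter
import string


def word_frequencies(text):
    """Count words after stripping surrounding punctuation and lowercasing."""
    tokens = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:
            tokens.append(word)
    return Counter(tokens)


# Made-up sentence standing in for scraped page text
freq = word_frequencies("PHP is popular. PHP is fast, and php is everywhere!")
for word, count in freq.most_common(3):
    print(f"{word}: {count}")
```

Normalizing case and punctuation is a design choice: without it, "PHP", "php" and "PHP." would be counted as three different tokens.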

Plotting the frequency graph:

freq.plot(20, cumulative=False)

The resulting graph plots the 20 most frequent tokens against their frequencies.

Now that we have all the words from the site, the next goal is to remove the irrelevant ones (the stop words). We can do this with the 'stopwords' corpus from nltk.

nltk.download('stopwords')
from nltk.corpus import stopwords

We will remove the stop words and print every remaining word with its frequency.

text = soup.get_text(separator=' ', strip=True)
tokens = text.split()
# Load the English stop words once, as a set for fast lookups
sr = set(stopwords.words('english'))
# Keep only tokens that are not stop words (the list is lowercase, so compare in lowercase)
clean_tokens = [token for token in tokens if token.lower() not in sr]
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
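The filtering step on its own can be sketched as follows. This is a minimal illustration, assuming a tiny hand-written stop word list (SAMPLE_STOPWORDS) in place of the nltk corpus; the helper name remove_stopwords and the sample sentence are also hypothetical.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop word list (case-insensitive)."""
    stop_set = {w.lower() for w in stopwords}  # a set gives constant-time lookups
    return [t for t in tokens if t.lower() not in stop_set]


# Tiny made-up list; the real nltk English list is much longer
SAMPLE_STOPWORDS = ["the", "is", "a", "of", "and"]

tokens = "The history of PHP is a story of the web".split()
print(remove_stopwords(tokens, SAMPLE_STOPWORDS))
```

Building the set once, outside the loop, avoids reloading and rescanning the stop word list for every token.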

Plotting the frequency graph after removing the stop words:

freq.plot(20, cumulative=False)

This is how we do web scraping with BeautifulSoup. We hope you enjoyed this blog. If you have any questions or suggestions, leave us a comment.

Credits: acadgild

Keep visiting our site for more blogs on Data Science and Data Analysis.
