Web Scraping
Steps and Tools
Explain Crawling
Crawling is essentially following links, both internal and external.
Explain Scraping
Scraping means automatically extracting the data of interest from retrieved web pages.
Explain Parsing
Parsing means making something understandable (by analyzing its parts): converting information represented in one form into another form that is easier to work with
what are Technologies for
disseminating content
information extraction
storing web data
What is this an example of?
HTML5
tags
elements
attributes
tags and elements
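To keep tag, element, and attribute apart, here is a small sketch with R's XML package (the <a> snippet and variable names are my own illustration, not from the card):

library(XML)

# parse a tiny made-up snippet and grab the <a> node
node <- getNodeSet(htmlParse('<a href="https://example.com">link</a>', asText = TRUE), "//a")[[1]]

xmlName(node)   # tag name:        "a"
xmlAttrs(node)  # attribute:       href = "https://example.com"
xmlValue(node)  # element content: "link"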
Semantic Tags
DOM (Document Object Model)
platform- and language-neutral interface
allows programs and scripts to dynamically access and update the content, structure and style of documents
documents can be further processed and the results of that processing can be incorporated back into the presented page
HTML DOM
is an Object Model for HTML -> it defines:
HTML elements as objects
Properties for all HTML elements
Methods for all HTML elements
Events for all HTML elements
DOM
tree of nodes
the document is represented as a tree of nodes (DOM tree)
A node can have child nodes and has a direct parent
Sibling nodes -> same level
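To make the parent/child/sibling terms concrete, a minimal R sketch (the sample markup and variable names are my own, not from the card):

library(XML)

doc       <- htmlParse("<html><body><h1>A</h1><p>B</p></body></html>", asText = TRUE)
body_node <- getNodeSet(doc, "//body")[[1]]   # one node in the DOM tree
kids      <- xmlChildren(body_node)           # its child nodes: <h1> and <p>
xmlParent(kids[[1]])                          # direct parent of <h1> -> <body>
getSibling(kids[[1]])                         # sibling on the same level -> <p>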
Process HTML -> DOM view
how to load web documents into R
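A minimal sketch of the usual options (the URL and file name are placeholders):

url <- "https://example.com/index.html"       # placeholder URL

download.file(url, destfile = "index.html")   # save a local copy to parse later
html_lines <- readLines(url)                  # or read the raw HTML into R directly

library(RCurl)                                # RCurl helps when headers/SSL options matter
html_txt <- getURL(url)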
How to Parse the HTML file
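A sketch with the XML package, assuming the page was saved locally as "index.html" (file name carried over from the loading step above):

library(XML)
doc <- htmlParse("index.html", encoding = "UTF-8")  # also accepts a URL, or raw HTML with asText = TRUE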
How does htmlParse() work?
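htmlParse() hands the raw HTML to the libxml2 parser, which tolerates and repairs malformed markup and builds the DOM tree in memory; the returned document object can then be queried, e.g. with XPath. A small sketch (markup made up):

library(XML)
doc <- htmlParse("<html><body><p class='intro'>Hello</p></body></html>", asText = TRUE)
xpathSApply(doc, "//p[@class='intro']", xmlValue)   # query the DOM tree -> "Hello"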
XML
Meta language for the definition of markup languages
IDEA: Metadata about structure, format and (partly) semantics becomes part of the message itself and is thus adaptable to sender and receiver
HTML: displays data on a web page
XML: describes data and information
DTD: Document Type Definition
How to parse XML in R
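A sketch with the XML package (the file name "books.xml" is hypothetical):

library(XML)
doc  <- xmlParse("books.xml")
root <- xmlRoot(doc)            # root element of the document
xmlName(root)                   # its tag name
xmlSApply(root, xmlValue)       # values of its child elements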
How to read XML
XML has stricter rules than HTML
to check whether a document conforms to the rules, a validation step can be included after the DOM has been created by setting the validate argument to TRUE
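A sketch of that validation step (hypothetical file name; the document must reference a DTD for validation to make sense):

library(XML)
doc <- xmlParse("books.xml", validate = TRUE)   # build the DOM and validate against the DTD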
relevant extensions of XML => RSS
Use case of RSS
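RSS is used to syndicate frequently updated content (news, blog posts, podcasts) as a standard XML feed that readers and aggregators can poll. Since a feed is plain XML, it can be read with the same tools; a sketch with a placeholder feed URL:

library(XML)
feed   <- xmlParse("https://example.com/feed.rss")        # placeholder feed URL
titles <- xpathSApply(feed, "//item/title",   xmlValue)   # item titles
dates  <- xpathSApply(feed, "//item/pubDate", xmlValue)   # publication dates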
What are Web scraping alternatives to R
R is not a typical language for web scraping
alternatives
Python
Java
C++
Rule & Tips for Crawling
Problems by web scrape:
IP Blockade, Captcha triggers, request throttling
=> choose the path of least resistence
=> follow rules of the webmaster
=> be ethical
Robots exclusion protocol (REP)
created after a badly behaved crawler caused a denial-of-service attack on Koster's server
the REP (robots.txt) instructs crawlers how to crawl and index pages on a website
webmasters usually define rules for crawlers -> www.xxx/robots.txt
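A sketch of checking those rules before crawling (placeholder domain; the robotstxt package is one option, argument names to the best of my knowledge):

rules <- readLines("https://www.example.com/robots.txt")   # inspect the rules directly

library(robotstxt)
paths_allowed(paths = "/some/path", domain = "www.example.com", bot = "*")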
What are the advantages of following the REP
Compliance with the REP protects against
Tripwires, IP address blocking / blacklisting, misguidance, legal aftermath
a script is created that will keep working for a long time
REP
what are the ethical “rules”
request data at a reasonable rate (see the sketch after this list)
only request and save the data you absolutely need from the page
scrape during off-peak hours
respect copyright law -> check the website's terms of service
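A sketch of a crawl loop that applies the first rule by pausing between requests (URLs and delay are placeholders):

library(XML)

urls  <- c("https://example.com/a", "https://example.com/b")  # placeholder URLs
pages <- vector("list", length(urls))

for (i in seq_along(urls)) {
  pages[[i]] <- htmlParse(urls[i])   # fetch and parse one page
  Sys.sleep(2)                       # pause -> request data at a reasonable rate
}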