Web Scraping
Steps and Tools
Explain Crawling
Crawling is essentially following links, both internal and external.
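A minimal sketch of this link-following idea in R with the XML package (the URL is only a placeholder, not from the source):

```r
library(XML)

# parse a start page and collect every link it points to
# ("http://example.com" is only a placeholder URL)
doc   <- htmlParse("http://example.com")
links <- xpathSApply(doc, "//a/@href")

# a crawler would now queue these URLs, fetch each one,
# and repeat the same extraction on the newly fetched pages
head(links)
```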
Explain Scraping
Scraping means extracting the data of interest from the retrieved web pages.
Explain Parsing
Parsing means making something understandable by analyzing its parts: converting information represented in one form into another form that is easier to work with.
What are technologies for
disseminating content
extracting information
storing web data
This is an example of?
HTML5
tags
elements
attributes
tags and elements
Semantic Tags
DOM (Document Object Model)
platform- and language-neutral interface
allows programs and scripts to dynamically access and update the content, structure and style of documents
documents can be further processed and the results of that processing can be incorporated back into the presented page
HTML DOM
is an Object Model for HTML -> it defines:
HTML elements as objects
Properties for all HTML elements
Methods for all HTML elements
Events for all HTML elements
DOM
tree of nodes
the document is represented as a tree of nodes (the DOM tree)
a node can have child nodes; each child has a direct parent
sibling nodes -> same level
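A small sketch of these parent/child/sibling relations in R with the XML package (the HTML string is invented for illustration):

```r
library(XML)

html <- "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
doc  <- htmlParse(html, asText = TRUE)

root <- xmlRoot(doc)                # <html>, the root node of the DOM tree
body <- xmlChildren(root)[["body"]] # <body> is a child of <html>
kids <- xmlChildren(body)           # <h1>, <p>, <p> are children of <body>

xmlName(kids[[1]])                  # "h1"
xmlName(kids[[2]])                  # "p"  -- a sibling of <h1>, same level
xmlValue(kids[[3]])                 # "Second"
```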
Process HTML -> DOM view
How to load web documents into R
How to parse the HTML file
How does htmlParse() work?
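A hedged sketch of both steps using the RCurl and XML packages (the URL is a placeholder; htmlParse() builds a DOM tree from the source and tolerates malformed HTML):

```r
library(RCurl)
library(XML)

url  <- "http://example.com"   # placeholder URL
page <- getURL(url)            # 1. load the raw HTML source into R

# 2. parse it: htmlParse() builds a DOM tree and is forgiving
#    about malformed HTML (missing end tags etc.)
doc  <- htmlParse(page, asText = TRUE)

# the DOM tree can then be queried, e.g. with XPath
xpathSApply(doc, "//title", xmlValue)
```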
XML
Meta language for the definition of markup languages
IDEA: metadata about structure, format and (partly) semantics becomes part of the message itself, and is thus adaptable to sender and receiver
HTML: displays data on a web page
XML: describes data and information
DTD: Document Type Definition
How to parse XML into R
How to read XML
XML has stricter rules than HTML
to check whether a document conforms to the rules, a validation step can be included after the DOM has been created by setting the validate argument to TRUE
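A sketch with the XML package, assuming a local file books.xml that declares a DTD (the file name is invented):

```r
library(XML)

# parse an XML file into a DOM tree; validate = TRUE additionally
# checks the document against its DTD and reports violations
doc  <- xmlParse("books.xml", validate = TRUE)

root <- xmlRoot(doc)          # root element of the document
xmlName(root)                 # its tag name
xmlSApply(root, xmlValue)     # text values of its child elements
```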
relevant extensions of XML => RSS
Use case of RSS
subscribing to frequently updated content (e.g. news sites, blogs) in a standardized XML format that can be polled and parsed automatically
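RSS feeds are ordinary XML, so they can be read with the same tools; a hedged sketch (the feed URL is a placeholder, element names follow the usual RSS 2.0 layout):

```r
library(XML)

feed <- xmlParse("http://example.com/feed.rss")   # placeholder feed URL

# in RSS 2.0 every entry is an <item> with <title>, <link>, <pubDate>, ...
titles <- xpathSApply(feed, "//item/title", xmlValue)
links  <- xpathSApply(feed, "//item/link",  xmlValue)

head(titles)
```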
What are Web scraping alternatives to R
R is not a typical language for web scraping
alternatives
Python
Java
C++
Rules & Tips for Crawling
Problems when scraping the web:
IP blocking, CAPTCHA triggers, request throttling
=> choose the path of least resistance
=> follow rules of the webmaster
=> be ethical
Robots Exclusion Protocol (REP)
created after a badly behaved crawler caused a denial-of-service attack on Koster's server
the REP (robots.txt) instructs crawlers how to crawl and index pages on a website
webmasters usually define rules for crawlers -> www.xxx/robots.txt
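A minimal way to inspect a site's rules before crawling (the domain is a placeholder; dedicated packages such as robotstxt can also evaluate the rules, but plain readLines() already shows them):

```r
# fetch and inspect the robots.txt of the target site before crawling
# ("http://example.com" is only a placeholder domain)
rules <- readLines("http://example.com/robots.txt")

# typical entries: which crawler a block applies to and what is off-limits
#   User-agent: *
#   Disallow: /private/
#   Crawl-delay: 10
rules[grepl("^(User-agent|Disallow|Crawl-delay)", rules)]
```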
What are the advantages of following the REP?
Compliance with the REP protects against:
tripwires, IP address blocking/blacklisting, misguidance, legal aftermath
a script is created that will keep working for a long time
REP
What are the ethical “rules”?
request data at a reasonable rate (see the sketch below)
only request and save the data you absolutely need from the page
scrape during off-peak hours
respect copyright law -> check the website's terms of service
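A sketch of a polite request loop for the first rule (URLs and the delay are made up; Sys.sleep() throttles the request rate):

```r
library(XML)

urls <- c("http://example.com/page1",   # placeholder URLs
          "http://example.com/page2")

pages <- list()
for (u in urls) {
  pages[[u]] <- htmlParse(u)   # request and parse one page
  Sys.sleep(5)                 # pause a few seconds between requests
                               # to keep the request rate reasonable
}
```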