Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Prerequisites:
- PHP 5+ or 7+
- Basic knowledge of PHP (and OOP concepts)
- Basic knowledge of HTML
- Simple HTML DOM Parser
- cURL
Understanding HTML:
HTML is a language used for creating web pages.
It stands for HyperText Markup Language.
It uses markup to describe the structure of web pages.
HTML pages are made up of HTML elements.
HTML elements are represented by tags.
There are various tags, such as headings, paragraphs, and tables.
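To make the idea of elements and tags concrete, here is a minimal sketch that loads a small page with PHP's built-in DOMDocument class and lists the text inside a few tags. The sample markup and the variable names ($sampleHtml, $doc, $found) are made up for illustration only.

```php
<?php
// Hypothetical sample page, used only for illustration.
$sampleHtml = '<html><body>'
    . '<h1>Product list</h1>'
    . '<p>Two items are in stock.</p>'
    . '<table><tr><td>Pen</td><td>Ink</td></tr></table>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($sampleHtml);

// Collect "tag name: text content" for a heading, a paragraph and table cells.
$found = [];
foreach (['h1', 'p', 'td'] as $tagName) {
    foreach ($doc->getElementsByTagName($tagName) as $element) {
        $found[] = $tagName . ': ' . $element->textContent;
    }
}

print_r($found);
```

Each entry in $found pairs a tag name with the text that tag encloses, which is the same structure a scraper relies on when it picks data out of a page.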
Now we will create a PHP function scrapeWebsiteData in scrape-data.php that fetches a website's HTML using PHP cURL:
<?php
// Fetch a page with cURL and return its HTML, or false on failure.
function scrapeWebsiteData($url)
{
    $timeout = 5; // seconds to wait while connecting
    $userAgent = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $html = curl_exec($ch);
    curl_close($ch);

    return $html;
}

$url = "Your scraping website URL HERE";
$html = scrapeWebsiteData($url);
?>
Here we have used three cURL functions: curl_init() initializes the session, curl_exec() executes the request, and curl_close() closes the connection. The CURLOPT_URL option sets the URL of the website we are scraping. The CURLOPT_RETURNTRANSFER option tells cURL to return the fetched page as a string so that it can be stored in a variable, rather than the default behavior of printing the whole page directly.
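A request can fail, and curl_exec() then returns false rather than a page. The following is a minimal sketch of how that case might be handled. The function name fetchPageOrNull and the timeout values are assumptions chosen for illustration, not part of the original script.

```php
<?php
// Hypothetical helper: returns the page body, or null when the request fails.
// On failure, $error receives a short description of what went wrong.
function fetchPageOrNull($url, &$error = null)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // seconds allowed to connect (assumed value)
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // seconds allowed for the whole request (assumed value)

    $body = curl_exec($ch);

    if ($body === false) {
        // Transport-level failure: DNS error, refused connection, timeout, ...
        $error = curl_error($ch);
        curl_close($ch);
        return null;
    }

    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status >= 400) {
        // The server answered, but with an error page.
        $error = 'HTTP status ' . $status;
        return null;
    }

    $error = null;
    return $body;
}
```

A caller could then write `$html = fetchPageOrNull($url, $error);` and stop early with the message in $error when the result is null, instead of passing an empty page to the parser.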
Next, create a DOM parser object and parse the HTML page:
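<?php
$dom = new DOMDocument();
// Real-world HTML is rarely valid, so @ suppresses the parser warnings.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Replace "DIV CLASS ID" with the class name of the <div> elements to extract.
$title = $xpath->query('//div[contains(@class, "DIV CLASS ID")]');
?>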
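One detail of the XPath query above is worth noting: contains(@class, "...") is a substring test, so it can match more elements than intended. The sketch below compares it with a whole-word match on made-up markup. The sample HTML and the class names are assumptions for illustration, and libxml_use_internal_errors() is used here as an alternative to the @ operator.

```php
<?php
// Hypothetical markup: only the first two <div> elements carry the class "title".
$sampleHtml = '<html><body>'
    . '<div class="title">First post</div>'
    . '<div class="title featured">Second post</div>'
    . '<div class="subtitle">Not a title</div>'
    . '</body></html>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// Substring match: also matches class="subtitle".
$loose = $xp->query('//div[contains(@class, "title")]');

// Whole-word match: pad the class list with spaces and look for " title ".
$exact = $xp->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " title ")]'
);

echo $loose->length . " loose matches, " . $exact->length . " exact matches\n";
// prints: 3 loose matches, 2 exact matches
```

Which form is appropriate depends on the target page. The substring form is shorter, while the padded form avoids picking up unrelated classes that merely contain the same text.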
After parsing the HTML, fetch the required data and store it in an array that can be converted to JSON:
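<?php
$results = array();
// DOMXPath::query() returns false (not null) if the expression is invalid.
if ($title !== false) {
    foreach ($title as $element) {
        $results[] = [
            // Replace line breaks with a space and trim the surrounding whitespace.
            'title' => trim(preg_replace("/[\r\n]+/", " ", $element->nodeValue))
        ];
    }
}
echo json_encode($results, JSON_UNESCAPED_SLASHES);
?>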
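To show what the resulting JSON looks like, here is a small self-contained sketch that runs the parse, query and encode steps on an inline HTML string instead of a downloaded page. The function name extractTitlesAsJson and the sample markup are hypothetical. Unlike the code above, it collapses every run of whitespace with /\s+/ rather than only line breaks.

```php
<?php
// Hypothetical helper: extracts the text of every node matched by $query
// and returns it as a JSON array of {"title": ...} objects.
function extractTitlesAsJson($html, $query)
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $nodes = (new DOMXPath($doc))->query($query);
    $results = [];
    if ($nodes !== false) {
        foreach ($nodes as $node) {
            // Collapse any run of whitespace, including line breaks, into one space.
            $results[] = ['title' => trim(preg_replace('/\s+/', ' ', $node->nodeValue))];
        }
    }
    return json_encode($results, JSON_UNESCAPED_SLASHES);
}

// Made-up sample input: one title contains a line break, the other a slash.
$sample = "<html><body><div class=\"title\">PHP\n  basics</div>"
    . '<div class="title">cURL/DOM</div></body></html>';

echo extractTitlesAsJson($sample, '//div[@class="title"]'), "\n";
// prints: [{"title":"PHP basics"},{"title":"cURL/DOM"}]
```

The JSON_UNESCAPED_SLASHES flag is what keeps "cURL/DOM" readable. Without it, json_encode() would write the slash as "\/".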