curl - Getting page information from page source by name in C++

Question

asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am trying to write a C++ program that fetches a URL and picks out text from the page by its name. I know this is achievable in Python with requests and Beautiful Soup, but I am trying to do it in C++. I have had a look at cURL, and this is what I have so far:

#include <iostream>
#include <string>
#include <curl/curl.h>


static size_t WriteCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    ((std::string*)userp)->append((char*)contents, size * nmemb);
    return size * nmemb;
}

int main(void)
{
  CURL *curl;
  CURLcode res;
  std::string readBuffer;

  curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://www.barcodelookup.com/763649064870");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
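    res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);

    // Report transport-level failures instead of printing an empty buffer.
    if(res != CURLE_OK) {
      std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;
      return 1;
    }

    std::cout << readBuffer << std::endl;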
  }
  return 0;
}

The output from the program does not include the text that appears next to that name when the page source is viewed in a browser. For example, I am interested in picking out line 5 of the source of https://www.barcodelookup.com/763649064870, which reads: <meta name="description" content="Barcode Lookup provides info on UPC 763649064870 - Seagate 1Tb Expansion Portable Drive USB."> In Python I would just search for <meta name="description" ...>, but my C++ program never receives that line. The output is actually:

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>

<title>Attention Required! | Cloudflare</title>

<meta name="captcha-bypass" id="captcha-bypass" />
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>

That is only the first portion of the output, but it should already have included line 5, the line I am interested in. How can I get the same output that the browser shows under "View Page Source"?
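
The title in the output above, "Attention Required! | Cloudflare", suggests that the server returned a Cloudflare block page instead of the product page, so the <meta name="description"> line was never in readBuffer to begin with. libcurl sends no User-Agent header by default, and setting one with CURLOPT_USERAGENT is one thing to try, although there is no guarantee the site will accept it. Independently of that, the sketch below shows one way to check for the block page and to pull a meta value out of a std::string without an HTML library. Both helper names (looksLikeCloudflareBlock, metaContent) are hypothetical, and the search assumes double-quoted attributes with name written before content, so it is a plain string search rather than a real HTML parser.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true when the HTML carries the <title> of the
// Cloudflare block page shown in the output above.
bool looksLikeCloudflareBlock(const std::string& html)
{
    return html.find("<title>Attention Required! | Cloudflare</title>")
           != std::string::npos;
}

// Hypothetical helper: finds the first <meta name="NAME" ...> tag and copies
// its content="..." value into `out`. Returns false if the tag or the
// attribute is missing. Assumes double quotes and name before content.
bool metaContent(const std::string& html, const std::string& name, std::string& out)
{
    const std::string tagStart = "<meta name=\"" + name + "\"";
    const std::size_t tagPos = html.find(tagStart);
    if (tagPos == std::string::npos) return false;

    // First '>' after the tag start; used to keep the search inside this tag.
    const std::size_t tagEnd = html.find('>', tagPos);
    if (tagEnd == std::string::npos) return false;

    const std::string attr = "content=\"";
    const std::size_t attrPos = html.find(attr, tagPos);
    if (attrPos == std::string::npos || attrPos > tagEnd) return false;

    const std::size_t valueStart = attrPos + attr.size();
    const std::size_t valueEnd = html.find('"', valueStart);
    if (valueEnd == std::string::npos) return false;

    out = html.substr(valueStart, valueEnd - valueStart);
    return true;
}
```

In the program above, these could be called on readBuffer after curl_easy_perform returns CURLE_OK: first looksLikeCloudflareBlock(readBuffer) to detect the block page, then metaContent(readBuffer, "description", value) to read the description.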

Question from: https://stackoverflow.com/questions/65915787/getting-page-information-from-page-source-by-name-in-c

1 Answer

深蓝 · Answer 1 · 2021-10-06T19:09:35+0000

answered Oct 7, 2021 by 深蓝 (71.8m points)

Waiting for answers
