The page contains many links, scattered throughout the content, and this was a really short article. So let's save them all to an external file and see how we can tell them apart.
First, we create and open a CSV file. We do this outside the function, right below the #include directives (std::ofstream also needs #include <fstream>). Our function is recursive, so if it opened the file itself, links.csv would be reopened and overwritten on every single call.
std::ofstream writeCsv("links.csv");
Then, in our main function, we write the first row of the CSV file right before calling the function for the first time. Do not forget to close the file after the execution is done.
writeCsv << "type,link" << "\n";
search_for_links(parsed_response->root);
writeCsv.close();
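If you would rather not rely on a global stream, one alternative is to open the output once and pass it down the recursion as a std::ostream& parameter. Below is a minimal sketch of that idea, not the code used in this article: Node is a hypothetical stand-in for a parsed HTML node (it is not Gumbo's type), and write_links and links_as_csv are made-up names. It writes into a string stream so it can run without touching the disk.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed HTML node; this is NOT the Gumbo type.
struct Node {
    std::string href;            // empty when the node is not a link
    std::vector<Node> children;  // nested nodes
};

// Recursive walk that writes to a stream the caller opened exactly once.
void write_links(const Node& node, std::ostream& out) {
    if (!node.href.empty())
        out << "link," << node.href << "\n";
    for (const Node& child : node.children)
        write_links(child, out);
}

// Writes the header row once, then every link row, and returns the CSV text.
std::string links_as_csv(const Node& root) {
    std::ostringstream out;
    out << "type,link\n";
    write_links(root, out);
    return out.str();
}
```

In the real scraper, the same pattern would mean giving search_for_links a second std::ostream& parameter and passing writeCsv to it from main.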
Now, we write its content. In our search_for_links function, when we find an <a> tag, instead of printing the link to the console, we do this:
if (node->v.element.tag == GUMBO_TAG_A)
{
    GumboAttribute* href = gumbo_get_attribute(&node->v.element.attributes, "href");
    if (href)
    {
        std::string link = href->value;
        // rfind(prefix, 0) only looks for a match at index 0, so it acts as a "starts with" check
        if (link.rfind("/wiki", 0) == 0)
            writeCsv << "article," << link << "\n";
        else if (link.rfind("#cite", 0) == 0)
            writeCsv << "cite," << link << "\n";
        else
            writeCsv << "other," << link << "\n";
    }
}
This code takes the value of the href attribute and sorts each link into one of three categories: articles, citations, and everything else.
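To make the classification easier to test on its own, the prefix checks can be pulled out into a small function that needs neither Gumbo nor a file. This is only a sketch: has_prefix and classify_link are hypothetical names, not part of the original code, and the categories are the three described above.

```cpp
#include <string>

// True when `text` begins with `prefix`.
bool has_prefix(const std::string& text, const std::string& prefix) {
    // compare() looks only at the first prefix.size() characters of text
    return text.compare(0, prefix.size(), prefix) == 0;
}

// Returns the CSV "type" column for a given href value.
std::string classify_link(const std::string& link) {
    if (has_prefix(link, "/wiki")) return "article";
    if (has_prefix(link, "#cite")) return "cite";
    return "other";
}
```

With a helper like this, the body of the tag check could shrink to a single line such as writeCsv << classify_link(link) << "," << link << "\n";.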
Of course, you can go much further and define your own link types, such as links that look like an article but actually point to a file.
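As a rough illustration of such an extra type, the sketch below adds a "file" category. It assumes that file pages on the scraped site use the "/wiki/File:" prefix, which you should verify against the actual links in your CSV. The function name classify_link_extended is hypothetical. Note that the more specific prefix has to be checked before the general "/wiki" one, otherwise file links would still be labeled as articles.

```cpp
#include <string>

// Returns the CSV "type" column, with an extra "file" category.
// Assumption: file pages start with "/wiki/File:" on the target site.
std::string classify_link_extended(const std::string& link) {
    // Local "starts with" check against the current link
    auto starts_with = [&link](const std::string& prefix) {
        return link.compare(0, prefix.size(), prefix) == 0;
    };

    if (starts_with("/wiki/File:")) return "file";     // most specific first
    if (starts_with("/wiki"))       return "article";
    if (starts_with("#cite"))       return "cite";
    return "other";
}
```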