Easy way to parse a URL in C++ cross-platform?

Question

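I need to parse a URL to get the protocol, host, path, and query in an application I am writing in C++. The application is intended to be cross-platform. I'm surprised I can't find anything that does this in the boost or POCO libraries. Is it somewhere obvious that I'm not looking? Any suggestions for appropriate open-source libraries? Or is this something I just have to do myself? It's not very complicated, but it seems like such a common task that I'm surprised there isn't a common solution.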

Dean Michael · Accepted Answer

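There is a library that has been proposed for inclusion in Boost and that lets you parse HTTP URIs easily. It uses Boost.Spirit and is released under the Boost Software License. The library is cpp-netlib; you can find its documentation at http://cpp-netlib.github.com/ and download the latest release from http://github.com/cpp-netlib/cpp-netlib/downloads .

The relevant type to use is boost::network::http::uri, which is covered in the cpp-netlib documentation.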

wilhelmtell · Answer

Terribly sorry, couldn't help it. :s

url.hh

    #ifndef URL_HH_
    #define URL_HH_
    #include <string>

    struct url {
        url(const std::string& url_s);
        // omitted copy, ==, accessors, ...
    private:
        void parse(const std::string& url_s);
    private:
        std::string protocol_, host_, path_, query_;
    };
    #endif /* URL_HH_ */

url.cc

    #include "url.hh"
    #include <string>
    #include <algorithm>
    #include <cctype>
    #include <iterator>
    using namespace std;

    // ctors, copy, equality, ...

    void url::parse(const string& url_s)
    {
        // std::ptr_fun was removed in C++17, so a lambda does the lowercasing
        const auto lower = [](unsigned char c) { return static_cast<char>(tolower(c)); };

        const string prot_end("://");
        string::const_iterator prot_i = search(url_s.begin(), url_s.end(),
                                               prot_end.begin(), prot_end.end());
        protocol_.reserve(distance(url_s.begin(), prot_i));
        transform(url_s.begin(), prot_i,
                  back_inserter(protocol_), lower); // protocol is icase
        if( prot_i == url_s.end() )
            return;
        advance(prot_i, prot_end.length());
        string::const_iterator path_i = find(prot_i, url_s.end(), '/');
        host_.reserve(distance(prot_i, path_i));
        transform(prot_i, path_i,
                  back_inserter(host_), lower); // host is icase
        string::const_iterator query_i = find(path_i, url_s.end(), '?');
        path_.assign(path_i, query_i);
        if( query_i != url_s.end() )
            ++query_i;
        query_.assign(query_i, url_s.end());
    }

main.cc

    // ...
    url u("HTTP://stackoverflow.com/questions/2616011/parse-a.py?url=1");
    cout << u.protocol() << '\t' << u.host() << ...

Tom · Answer

A wstring version of the above, with the other fields I needed added. It could definitely be refined, but it is good enough for my purposes.

    #include <string>
    #include <algorithm>    // find

    struct Uri
    {
    public:
        std::wstring QueryString, Path, Protocol, Host, Port;

        static Uri Parse(const std::wstring &uri)
        {
            Uri result;

            typedef std::wstring::const_iterator iterator_t;

            if (uri.length() == 0)
                return result;

            iterator_t uriEnd = uri.end();

            // get query start
            iterator_t queryStart = std::find(uri.begin(), uriEnd, L'?');

            // protocol
            iterator_t protocolStart = uri.begin();
            iterator_t protocolEnd = std::find(protocolStart, uriEnd, L':');    //"://");

            if (protocolEnd != uriEnd)
            {
                std::wstring prot = &*(protocolEnd);
                if ((prot.length() > 3) && (prot.substr(0, 3) == L"://"))
                {
                    result.Protocol = std::wstring(protocolStart, protocolEnd);
                    protocolEnd += 3;   // ://
                }
                else
                    protocolEnd = uri.begin();  // no protocol
            }
            else
                protocolEnd = uri.begin();  // no protocol

            // host
            iterator_t hostStart = protocolEnd;
            iterator_t pathStart = std::find(hostStart, uriEnd, L'/');  // get pathStart

            iterator_t hostEnd = std::find(protocolEnd,
                (pathStart != uriEnd) ? pathStart : queryStart,
                L':');  // check for port

            result.Host = std::wstring(hostStart, hostEnd);

            // port
            if ((hostEnd != uriEnd) && ((&*(hostEnd))[0] == L':'))  // we have a port
            {
                hostEnd++;
                iterator_t portEnd = (pathStart != uriEnd) ? pathStart : queryStart;
                result.Port = std::wstring(hostEnd, portEnd);
            }

            // path
            if (pathStart != uriEnd)
                result.Path = std::wstring(pathStart, queryStart);

            // query
            if (queryStart != uriEnd)
                result.QueryString = std::wstring(queryStart, uri.end());

            return result;

        }   // Parse
    };  // uri

Tests/Usage

    Uri u0 = Uri::Parse(L"http://localhost:80/foo.html?&q=1:2:3");
    Uri u1 = Uri::Parse(L"https://localhost:80/foo.html?&q=1");
    Uri u2 = Uri::Parse(L"localhost/foo");
    Uri u3 = Uri::Parse(L"https://localhost/foo");
    Uri u4 = Uri::Parse(L"localhost:8080");
    Uri u5 = Uri::Parse(L"localhost?&foo=1");
    Uri u6 = Uri::Parse(L"localhost?&foo=1:2:3");

    u0.QueryString, u0.Path, u0.Protocol, u0.Host, u0.Port....
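For anyone who only needs the host/port step of the parser above, here is a minimal standard-library sketch of splitting an authority string of the form [user@]host[:port]. The `Authority` struct and `split_authority` function are hypothetical names introduced for illustration and are not part of the answer's code. The sketch uses std::string rather than std::wstring and assumes the host is not a bracketed IPv6 literal (which contains colons of its own).

```cpp
#include <string>

// Hypothetical helper type for illustration only.
struct Authority {
    std::string user_info, host, port;
};

// Splits "[user@]host[:port]" into its parts.
// Assumption: the host is not a bracketed IPv6 literal such as "[::1]".
inline Authority split_authority(const std::string& authority) {
    Authority result;
    std::string rest = authority;

    // Everything before the last '@' is treated as user info.
    const std::string::size_type at = rest.rfind('@');
    if (at != std::string::npos) {
        result.user_info = rest.substr(0, at);
        rest = rest.substr(at + 1);
    }

    // Everything after the last ':' is treated as the port.
    const std::string::size_type colon = rest.rfind(':');
    if (colon != std::string::npos) {
        result.port = rest.substr(colon + 1);
        rest = rest.substr(0, colon);
    }

    result.host = rest;
    return result;
}
```

The port is kept as a string here, as in the answer above; converting and range-checking it is left to the caller.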

Elliot Cameron · Answer

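For completeness, there is one written in C that you could use (with a little wrapping, no doubt): http://uriparser.sourceforge.net/

[RFC-compliant and supports Unicode]

Here is a very basic wrapper that I have been using to simply grab the results of a parse.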

    #include <string>
    #include <uriparser/Uri.h>

    namespace uriparser
    {
        class Uri //: boost::noncopyable
        {
        public:
            Uri(std::string uri)
                : uri_(uri)
            {
                UriParserStateA state_;
                state_.uri = &uriParse_;
                isValid_   = uriParseUriA(&state_, uri_.c_str()) == URI_SUCCESS;
            }

            ~Uri() { uriFreeUriMembersA(&uriParse_); }

            bool isValid() const { return isValid_; }

            std::string scheme()   const { return fromRange(uriParse_.scheme); }
            std::string host()     const { return fromRange(uriParse_.hostText); }
            std::string port()     const { return fromRange(uriParse_.portText); }
            std::string path()     const { return fromList(uriParse_.pathHead, "/"); }
            std::string query()    const { return fromRange(uriParse_.query); }
            std::string fragment() const { return fromRange(uriParse_.fragment); }

        private:
            std::string uri_;
            UriUriA     uriParse_;
            bool        isValid_;

            std::string fromRange(const UriTextRangeA & rng) const
            {
                return std::string(rng.first, rng.afterLast);
            }

            std::string fromList(UriPathSegmentA * xs, const std::string & delim) const
            {
                UriPathSegmentStructA * head(xs);
                std::string accum;

                while (head)
                {
                    accum += delim + fromRange(head->text);
                    head = head->next;
                }

                return accum;
            }
        };
    }

Michael Mc Donnell · Answer

POCO's URI class can parse URLs for you. The following example is a shortened version of the one in the POCO URI and UUID slides:

    #include "Poco/URI.h"
    #include <iostream>

    int main(int argc, char** argv)
    {
        Poco::URI uri1("http://www.appinf.com:88/sample?example-query#frag");

        std::string scheme(uri1.getScheme());    // "http"
        std::string auth(uri1.getAuthority());   // "www.appinf.com:88"
        std::string host(uri1.getHost());        // "www.appinf.com"
        unsigned short port = uri1.getPort();    // 88
        std::string path(uri1.getPath());        // "/sample"
        std::string query(uri1.getQuery());      // "example-query"
        std::string frag(uri1.getFragment());    // "frag"
        std::string pathEtc(uri1.getPathEtc());  // "/sample?example-query#frag"

        return 0;
    }

Tom Makin · Answer

The Poco library now has a class for dissecting URIs and returning the host, path segments, query string, and so on.

https://pocoproject.org/pro/docs/Poco.URI.html

velcrow · Answer

    //sudo apt-get install libboost-all-dev; #install boost
    //g++ urlregex.cpp -lboost_regex; #compile
    #include <string>
    #include <iostream>
    #include <boost/regex.hpp>

    using namespace std;

    int main(int argc, char* argv[])
    {
        string url="https://www.google.com:443/webhp?gws_rd=ssl#q=cpp";
        // "\\x3f" reaches the regex engine as \x3f, i.e. a literal '?'
        boost::regex ex("(http|https)://([^/ :]+):?([^/ ]*)(/?[^ #?]*)\\x3f?([^ #]*)#?([^ ]*)");
        boost::cmatch what;
        if(regex_match(url.c_str(), what, ex))
        {
            cout << "protocol: " << string(what[1].first, what[1].second) << endl;
            cout << "domain:   " << string(what[2].first, what[2].second) << endl;
            cout << "port:     " << string(what[3].first, what[3].second) << endl;
            cout << "path:     " << string(what[4].first, what[4].second) << endl;
            cout << "query:    " << string(what[5].first, what[5].second) << endl;
            cout << "fragment: " << string(what[6].first, what[6].second) << endl;
        }
        return 0;
    }

Sun · Answer

Facebook's Folly library can do the job for you easily. Simply use the Uri class:

    #include <folly/Uri.h>

    int main() {
        folly::Uri folly("https://code.facebook.com/posts/177011135812493/");

        folly.scheme(); // https
        folly.host();   // code.facebook.com
        folly.path();   // /posts/177011135812493/
    }

Sergey K. · Answer

This library is very small and lightweight: https://github.com/corporateshark/LUrlParser

However, it only parses; it does no URL normalization or validation.
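To illustrate what "normalization" could mean if you need it on top of a parse-only library: RFC 3986 treats the scheme and host as case-insensitive, so one common step is lowercasing just those two parts. The sketch below is standard-library only and does not use LUrlParser; `to_lower_ascii` and `normalize_case` are hypothetical names, and the code assumes input shaped like scheme://[user@]host[:port]/rest.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns an ASCII-lowercased copy of s.
inline std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Lowercases the scheme and the host[:port] part of a URL.
// User info, path, query and fragment are left untouched.
// Assumption: the input looks like "scheme://[user@]host[:port][/...]".
inline std::string normalize_case(const std::string& url) {
    const std::string sep = "://";
    const std::string::size_type scheme_end = url.find(sep);
    if (scheme_end == std::string::npos)
        return url;  // not in the expected shape, return unchanged

    const std::string::size_type auth_begin = scheme_end + sep.size();
    std::string::size_type auth_end = url.find_first_of("/?#", auth_begin);
    if (auth_end == std::string::npos)
        auth_end = url.size();

    const std::string authority = url.substr(auth_begin, auth_end - auth_begin);
    const std::string::size_type at = authority.rfind('@');
    const std::string user =
        (at == std::string::npos) ? std::string() : authority.substr(0, at + 1);
    const std::string host_port =
        (at == std::string::npos) ? authority : authority.substr(at + 1);

    return to_lower_ascii(url.substr(0, auth_begin)) + user +
           to_lower_ascii(host_port) + url.substr(auth_end);
}
```

Full normalization involves more than this (percent-encoding case, dot segments, default ports), so treat it as a starting point only.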

Ralf · Answer

Also of interest could be http://code.google.com/p/uri-grammar/ which, like Dean Michael's netlib, uses Boost.Spirit to parse a URI. See also the related question "Simple expression parser example using Boost::Spirit?".

sdgfsdh · Answer

A small dependency you can use is uriparser, which recently moved to GitHub.

You can find a minimal example in their code: https://github.com/uriparser/uriparser/blob/63384be4fb8197264c55ff53a135110ecd5bd8c4/tool/uriparse.c

This will be more lightweight than Boost or Poco. The only catch is that it is C.

There is also a Buckaroo package:

    buckaroo add github.com/buckaroo-pm/uriparser

Vivit · Answer

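You could try the open-source library called C++ REST SDK (created by Microsoft, distributed under the Apache License 2.0). It can be built for several platforms (including Windows, Linux, OSX, iOS, and Android). It has a class called web::uri: you pass in a string and can then retrieve the individual URL components. Here is a code sample (tested on Windows):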

    #include <cpprest/base_uri.h>
    #include <iostream>
    #include <ostream>

    web::uri sample_uri( L"http://dummyuser@localhost:7777/dummypath?dummyquery#dummyfragment" );
    std::wcout << L"scheme: "   << sample_uri.scheme()    << std::endl;
    std::wcout << L"user: "     << sample_uri.user_info() << std::endl;
    std::wcout << L"host: "     << sample_uri.host()      << std::endl;
    std::wcout << L"port: "     << sample_uri.port()      << std::endl;
    std::wcout << L"path: "     << sample_uri.path()      << std::endl;
    std::wcout << L"query: "    << sample_uri.query()     << std::endl;
    std::wcout << L"fragment: " << sample_uri.fragment()  << std::endl;

The output will be:

    scheme: http
    user: dummyuser
    host: localhost
    port: 7777
    path: /dummypath
    query: dummyquery
    fragment: dummyfragment

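There are other easy-to-use methods as well, for example to access individual attribute/value pairs from the query or to split the path into its components.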
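To make concrete what splitting a query into attribute/value pairs and a path into components involves, here is a minimal standard-library sketch. It is not the SDK's own API: `query_pairs` and `path_segments` are hypothetical names, the code works on std::string, and it does no percent-decoding.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits "a=1&b=2&flag" into {a: "1", b: "2", flag: ""}.
// Assumptions: '&' separates pairs, no percent-decoding is performed,
// and a repeated key keeps its last value.
inline std::map<std::string, std::string> query_pairs(const std::string& query) {
    std::map<std::string, std::string> result;
    std::istringstream stream(query);
    std::string pair;
    while (std::getline(stream, pair, '&')) {
        if (pair.empty())
            continue;  // skip empty pieces such as a leading '&'
        const std::string::size_type eq = pair.find('=');
        if (eq == std::string::npos)
            result[pair] = "";
        else
            result[pair.substr(0, eq)] = pair.substr(eq + 1);
    }
    return result;
}

// Splits "/foo/bar/" into {"foo", "bar"}; empty segments are dropped.
inline std::vector<std::string> path_segments(const std::string& path) {
    std::vector<std::string> segments;
    std::istringstream stream(path);
    std::string segment;
    while (std::getline(stream, segment, '/')) {
        if (!segment.empty())
            segments.push_back(segment);
    }
    return segments;
}
```

A std::map is used for simplicity; if the order of parameters or repeated keys matter, a vector of pairs would be the safer choice.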

Mike Ellery · Answer

There is the newly released google-url library:

http://code.google.com/p/google-url/

The library provides a low-level URL parsing API as well as a higher-level abstraction called GURL. Here is an example using the latter:

    #include <googleurl\src\gurl.h>

    wchar_t url[] = L"http://www.facebook.com";
    GURL parsedUrl (url);
    assert(parsedUrl.DomainIs("facebook.com"));

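Two small gripes: (1) it wants to use ICU in order to deal with different string encodings. In other words, the library is not completely stand-alone as it exists, but I think it is still a good basis to start from, especially if you are already using ICU.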

Mr. Jones · Answer

May I offer another self-contained solution based on std::regex:

    #include <regex>
    #include <string>

    const char* SCHEME_REGEX   = "((http[s]?)://)?";  // match http or https before the ://
    const char* USER_REGEX     = "(([^@/:\\s]+)@)?";  // match anything other than @ / : or whitespace before the ending @
    const char* HOST_REGEX     = "([^@/:\\s]+)";      // mandatory. match anything other than @ / : or whitespace
    const char* PORT_REGEX     = "(:([0-9]{1,5}))?";  // after the : match 1 to 5 digits
    const char* PATH_REGEX     = "(/[^:#?\\s]*)?";    // after the / match anything other than : # ? or whitespace
    const char* QUERY_REGEX    = "(\\?(([^?;&#=]+=[^?;&#=]+)([;|&]([^?;&#=]+=[^?;&#=]+))*))?"; // after the ? match any number of x=y pairs, separated by & or ;
    const char* FRAGMENT_REGEX = "(#([^#\\s]*))?";    // after the # match anything other than # or whitespace

    // m_scheme, m_user, ... are assumed to be std::string members of the
    // enclosing class (not shown).
    bool parseUri(const std::string &i_uri)
    {
        static const std::regex regExpr(std::string("^")
            + SCHEME_REGEX + USER_REGEX
            + HOST_REGEX + PORT_REGEX
            + PATH_REGEX + QUERY_REGEX
            + FRAGMENT_REGEX + "$");

        std::smatch matchResults;
        if (std::regex_match(i_uri.cbegin(), i_uri.cend(), matchResults, regExpr))
        {
            m_scheme.assign(matchResults[2].first, matchResults[2].second);
            m_user.assign(matchResults[4].first, matchResults[4].second);
            m_host.assign(matchResults[5].first, matchResults[5].second);
            m_port.assign(matchResults[7].first, matchResults[7].second);
            m_path.assign(matchResults[8].first, matchResults[8].second);
            m_query.assign(matchResults[10].first, matchResults[10].second);
            m_fragment.assign(matchResults[15].first, matchResults[15].second);

            return true;
        }

        return false;
    }

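I added an explanation for each part of the regular expression. This approach lets you choose exactly which parts to parse for the URLs you expect to receive. Just remember to change the regular expression group indices accordingly.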
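As a point of comparison for the group-index bookkeeping, RFC 3986 (Appendix B) gives a generic regular expression for splitting any URI reference, where groups 2, 4, 5, 7 and 9 hold the scheme, authority, path, query and fragment. The sketch below wraps that expression with std::regex; the `UriParts` struct and `split_uri` function are hypothetical names introduced here, and unlike the answer's pattern it does not validate or further split the authority.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for illustration only.
struct UriParts {
    std::string scheme, authority, path, query, fragment;
};

// Splits a URI reference using the regular expression from RFC 3986, Appendix B.
// Returns false only if the input does not match at all (for example, if it
// contains a line break, which '.' does not match in ECMAScript mode).
inline bool split_uri(const std::string& uri, UriParts& out) {
    static const std::regex pattern(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");

    std::smatch m;
    if (!std::regex_match(uri, m, pattern))
        return false;

    out.scheme    = m[2].str();  // group 2: scheme without the ':'
    out.authority = m[4].str();  // group 4: [user@]host[:port]
    out.path      = m[5].str();  // group 5: path
    out.query     = m[7].str();  // group 7: query without the '?'
    out.fragment  = m[9].str();  // group 9: fragment without the '#'
    return true;
}
```

Because the groups are fixed by the RFC, the indices do not shift when you decide to ignore a component; the trade-off is that the pattern accepts almost anything, so it splits rather than validates.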

Matthew Flaschen · Answer

Qt has QUrl for this. GNOME has SoupURI in libsoup, which you will probably find a little more lightweight.

Fabiano Tarlao · Answer

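I developed an "object-oriented" solution: a single C++ class that works with one regex, like the solutions from @Mr.Jones and @velcrow. The Url class performs the URL/URI "parsing".

I think I have improved velcrow's regex to be more robust, and it now also includes the username part.

Following the first version of my idea, I have released the same code, improved, in my GPL3-licensed open-source project Cpp URL Parser.

With the #ifdef/#ndef bloat omitted, here is Url.h: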

    #include <string>
    #include <iostream>
    #include <boost/regex.hpp>

    using namespace std;

    class Url {
    public:
        boost::regex ex;
        string rawUrl;

        string username;
        string protocol;
        string domain;
        string port;
        string path;
        string query;
        string fragment;

        Url();

        Url(string &rawUrl);

        Url &update(string &rawUrl);
    };

This is the code of the Url.cpp implementation file:

    #include "Url.h"

    Url::Url() {
        // backslashes are doubled so that \/, \d and \? reach the regex engine intact
        this -> ex = boost::regex("(ssh|sftp|ftp|smb|http|https):\\/\\/(?:([^@ ]*)@)?([^:?# ]+)(?::(\\d+))?([^?# ]*)(?:\\?([^# ]*))?(?:#([^ ]*))?");
    }

    Url::Url(string &rawUrl) : Url() {
        this->rawUrl = rawUrl;
        this->update(this->rawUrl);
    }

    Url &Url::update(string &rawUrl) {
        this->rawUrl = rawUrl;
        boost::cmatch what;
        if (regex_match(rawUrl.c_str(), what, ex)) {
            this -> protocol = string(what[1].first, what[1].second);
            this -> username = string(what[2].first, what[2].second);
            this -> domain = string(what[3].first, what[3].second);
            this -> port = string(what[4].first, what[4].second);
            this -> path = string(what[5].first, what[5].second);
            this -> query = string(what[6].first, what[6].second);
            this -> fragment = string(what[7].first, what[7].second);
        }
        return *this;
    }

Usage example:

    string urlString = "http://gino@ciao.it:67/ciao?roba=ciao#34";
    Url *url = new Url(urlString);
    std::cout << " username: " << url->username << " URL domain: " << url->domain;
    std::cout << " port: " << url->port << " protocol: " << url->protocol;

You can also update the Url object to represent (and parse) another URL:

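    // update() takes a non-const string reference, so a named string is needed
    string newUrlString = "http://gino@nuovociao.it:68/nuovociao?roba=ciaoooo#";
    url->update(newUrlString);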

I am just learning C++ now, so I am not sure I followed C++ best practices 100%. Any tip is appreciated.

P.S.: have a look at Cpp URL Parser; there are refinements there.

Have fun

Larytet · Answer

There is yet another library, https://snapwebsites.org/project/libtld , which handles all possible top-level domains and URI schemes.