C＃のURL Slugifyアルゴリズム？

Question

したがって、SO=上の slug タグを検索して参照し、2つの魅力的なソリューションのみを見つけました。

これは問題の部分的な解決策です。これを自分で手動でコーディングすることもできますが、まだ解決策がまだないことに驚いています。

だから、ラテン文字、ユニコード、その他のさまざまな言語の問題に適切に対処するC＃および/または.NETのslugifyアルゴリズム実装はありますか？

Marcel · Accepted Answer

http://predicatet.blogspot.com/2009/04/improved-c-slug-generator-or-how-to.html

public static string GenerateSlug(this string phrase) { string str = phrase.RemoveAccent().ToLower(); // invalid chars str = Regex.Replace(str, @"[^a-z0-9\s-]", ""); // convert multiple spaces into one space str = Regex.Replace(str, @"\s+", " ").Trim(); // cut and trim str = str.Substring(0, str.Length <= 45 ? str.Length : 45).Trim(); str = Regex.Replace(str, @"\s", "-"); // hyphens return str; } public static string RemoveAccent(this string txt) { byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt); return System.Text.Encoding.ASCII.GetString(bytes); }

Joan Caron · Answer

ここでは、C＃でURLスラッグを生成する方法を見つけます。この関数は、すべてのアクセント（Marcelの答え）を削除し、スペースを置き換え、無効な文字を削除し、末尾からダッシュを削除し、「-」または「_」の二重出現を置き換えます

コード：

public static string ToUrlSlug(string value){ //First to lower case value = value.ToLowerInvariant(); //Remove all accents var bytes = Encoding.GetEncoding("Cyrillic").GetBytes(value); value = Encoding.ASCII.GetString(bytes); //Replace spaces value = Regex.Replace(value, @"\s", "-", RegexOptions.Compiled); //Remove invalid chars value = Regex.Replace(value, @"[^a-z0-9\s-_]", "",RegexOptions.Compiled); //Trim dashes from end value = value.Trim('-', '_'); //Replace double occurences of - or _ value = Regex.Replace(value, @"([-_]){2,}", "$1", RegexOptions.Compiled); return value ; }

dana · Answer

ジョーンとマルセルの答えに基づいた私の演出です。私が行った変更は次のとおりです。

コードは次のとおりです。

public class UrlSlugger { // white space, em-dash, en-dash, underscore static readonly Regex WordDelimiters = new Regex(@"[\s—–_]", RegexOptions.Compiled); // characters that are not valid static readonly Regex InvalidChars = new Regex(@"[^a-z0-9\-]", RegexOptions.Compiled); // multiple hyphens static readonly Regex MultipleHyphens = new Regex(@"-{2,}", RegexOptions.Compiled); public static string ToUrlSlug(string value) { // convert to lower case value = value.ToLowerInvariant(); // remove diacritics (accents) value = RemoveDiacritics(value); // ensure all Word delimiters are hyphens value = WordDelimiters.Replace(value, "-"); // strip out invalid characters value = InvalidChars.Replace(value, ""); // replace multiple hyphens (-) with a single hyphen value = MultipleHyphens.Replace(value, "-"); // trim hyphens (-) from ends return value.Trim('-'); } /// See: http://www.siao2.com/2007/05/14/2629747.aspx private static string RemoveDiacritics(string stIn) { string stFormD = stIn.Normalize(NormalizationForm.FormD); StringBuilder sb = new StringBuilder(); for (int ich = 0; ich < stFormD.Length; ich++) { UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]); if (uc != UnicodeCategory.NonSpacingMark) { sb.Append(stFormD[ich]); } } return (sb.ToString().Normalize(NormalizationForm.FormC)); } }

それでも、非ラテン文字の問題は解決しません。完全に別の解決策は、 ri.EscapeDataString を使用して、文字列を16進表現に変換することです。

string original = "测试公司"; // %E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8 string converted = Uri.EscapeDataString(original);

次に、データを使用してハイパーリンクを生成します。

<a href="http://www.example.com/100/%E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8"> 测试公司 </a>

多くのブラウザは、中国語の文字をアドレスバーに表示します（下記を参照）が、私の限られたテストに基づいて、完全にサポートされていません。

address bar with Chinese characters

注： ri.EscapeDataString をこのように機能させるには、 iriParsing を有効にする必要があります。

[〜＃〜] edit [〜＃〜]

C＃でURLスラッグを生成する場合は、次の関連する質問を確認することをお勧めします。

Stack OverflowはどのようにSEOに優しいURLを生成しますか？

それは私が私のプロジェクトに使用することになったものです。

Matthew Groves · Answer

スラッグ化（新しいWord！）で発生した問題の1つは、衝突です。たとえば、「Stack-Overflow」というブログ記事と「Stack Overflow」というブログ記事がある場合、これら2つのタイトルのスラッグは同じです。したがって、私のスラグジェネレーターは通常、何らかの方法でデータベースを使用する必要があります。これが、より一般的な解決策が見当たらない理由かもしれません。

Simon Mourier · Answer

これが私のショットです。以下をサポートします。

分音記号の削除（「無効な」文字を削除するだけではありません）
結果の最大長（または分音記号の削除前-「早期切り捨て」）
正規化されたチャンク間のカスタムセパレーター
結果は強制的に大文字または小文字にすることができます
サポートされているユニコードカテゴリの構成可能なリスト
許可される文字の範囲の構成可能なリスト
フレームワーク2.0をサポート

コード：

/// <summary> /// Defines a set of utilities for creating slug urls. /// </summary> public static class Slug { /// <summary> /// Creates a slug from the specified text. /// </summary> /// <param name="text">The text. If null if specified, null will be returned.</param> /// <returns> /// A slugged text. /// </returns> public static string Create(string text) { return Create(text, (SlugOptions)null); } /// <summary> /// Creates a slug from the specified text. /// </summary> /// <param name="text">The text. If null if specified, null will be returned.</param> /// <param name="options">The options. May be null.</param> /// <returns>A slugged text.</returns> public static string Create(string text, SlugOptions options) { if (text == null) return null; if (options == null) { options = new SlugOptions(); } string normalised; if (options.EarlyTruncate && options.MaximumLength > 0 && text.Length > options.MaximumLength) { normalised = text.Substring(0, options.MaximumLength).Normalize(NormalizationForm.FormD); } else { normalised = text.Normalize(NormalizationForm.FormD); } int max = options.MaximumLength > 0 ? Math.Min(normalised.Length, options.MaximumLength) : normalised.Length; StringBuilder sb = new StringBuilder(max); for (int i = 0; i < normalised.Length; i++) { char c = normalised[i]; UnicodeCategory uc = char.GetUnicodeCategory(c); if (options.AllowedUnicodeCategories.Contains(uc) && options.IsAllowed(c)) { switch (uc) { case UnicodeCategory.UppercaseLetter: if (options.ToLower) { c = options.Culture != null ? char.ToLower(c, options.Culture) : char.ToLowerInvariant(c); } sb.Append(options.Replace(c)); break; case UnicodeCategory.LowercaseLetter: if (options.ToUpper) { c = options.Culture != null ? char.ToUpper(c, options.Culture) : char.ToUpperInvariant(c); } sb.Append(options.Replace(c)); break; default: sb.Append(options.Replace(c)); break; } } else if (uc == UnicodeCategory.NonSpacingMark) { // don't add a separator } else { if (options.Separator != null && !EndsWith(sb, options.Separator)) { sb.Append(options.Separator); } } if (options.MaximumLength > 0 && sb.Length >= options.MaximumLength) break; } string result = sb.ToString(); if (options.MaximumLength > 0 && result.Length > options.MaximumLength) { result = result.Substring(0, options.MaximumLength); } if (!options.CanEndWithSeparator && options.Separator != null && result.EndsWith(options.Separator)) { result = result.Substring(0, result.Length - options.Separator.Length); } return result.Normalize(NormalizationForm.FormC); } private static bool EndsWith(StringBuilder sb, string text) { if (sb.Length < text.Length) return false; for (int i = 0; i < text.Length; i++) { if (sb[sb.Length - 1 - i] != text[text.Length - 1 - i]) return false; } return true; } } /// <summary> /// Defines options for the Slug utility class. /// </summary> public class SlugOptions { /// <summary> /// Defines the default maximum length. Currently equal to 80. /// </summary> public const int DefaultMaximumLength = 80; /// <summary> /// Defines the default separator. Currently equal to "-". /// </summary> public const string DefaultSeparator = "-"; private bool _toLower; private bool _toUpper; /// <summary> /// Initializes a new instance of the <see cref="SlugOptions"/> class. /// </summary> public SlugOptions() { MaximumLength = DefaultMaximumLength; Separator = DefaultSeparator; AllowedUnicodeCategories = new List<UnicodeCategory>(); AllowedUnicodeCategories.Add(UnicodeCategory.UppercaseLetter); AllowedUnicodeCategories.Add(UnicodeCategory.LowercaseLetter); AllowedUnicodeCategories.Add(UnicodeCategory.DecimalDigitNumber); AllowedRanges = new List<KeyValuePair<short, short>>(); AllowedRanges.Add(new KeyValuePair<short, short>((short)'a', (short)'z')); AllowedRanges.Add(new KeyValuePair<short, short>((short)'A', (short)'Z')); AllowedRanges.Add(new KeyValuePair<short, short>((short)'0', (short)'9')); } /// <summary> /// Gets the allowed unicode categories list. /// </summary> /// <value> /// The allowed unicode categories list. /// </value> public virtual IList<UnicodeCategory> AllowedUnicodeCategories { get; private set; } /// <summary> /// Gets the allowed ranges list. /// </summary> /// <value> /// The allowed ranges list. /// </value> public virtual IList<KeyValuePair<short, short>> AllowedRanges { get; private set; } /// <summary> /// Gets or sets the maximum length. /// </summary> /// <value> /// The maximum length. /// </value> public virtual int MaximumLength { get; set; } /// <summary> /// Gets or sets the separator. /// </summary> /// <value> /// The separator. /// </value> public virtual string Separator { get; set; } /// <summary> /// Gets or sets the culture for case conversion. /// </summary> /// <value> /// The culture. /// </value> public virtual CultureInfo Culture { get; set; } /// <summary> /// Gets or sets a value indicating whether the string can end with a separator string. /// </summary> /// <value> /// <c>true</c> if the string can end with a separator string; otherwise, <c>false</c>. /// </value> public virtual bool CanEndWithSeparator { get; set; } /// <summary> /// Gets or sets a value indicating whether the string is truncated before normalization. /// </summary> /// <value> /// <c>true</c> if the string is truncated before normalization; otherwise, <c>false</c>. /// </value> public virtual bool EarlyTruncate { get; set; } /// <summary> /// Gets or sets a value indicating whether to lowercase the resulting string. /// </summary> /// <value> /// <c>true</c> if the resulting string must be lowercased; otherwise, <c>false</c>. /// </value> public virtual bool ToLower { get { return _toLower; } set { _toLower = value; if (_toLower) { _toUpper = false; } } } /// <summary> /// Gets or sets a value indicating whether to uppercase the resulting string. /// </summary> /// <value> /// <c>true</c> if the resulting string must be uppercased; otherwise, <c>false</c>. /// </value> public virtual bool ToUpper { get { return _toUpper; } set { _toUpper = value; if (_toUpper) { _toLower = false; } } } /// <summary> /// Determines whether the specified character is allowed. /// </summary> /// <param name="character">The character.</param> /// <returns>true if the character is allowed; false otherwise.</returns> public virtual bool IsAllowed(char character) { foreach (var p in AllowedRanges) { if (character >= p.Key && character <= p.Value) return true; } return false; } /// <summary> /// Replaces the specified character by a given string. /// </summary> /// <param name="character">The character to replace.</param> /// <returns>a string.</returns> public virtual string Replace(char character) { return character.ToString(); } }