Regex for a range of Unicode code points in PHP

I'm trying to remove all characters from a string except:

  • Alphanumeric characters
  • The dollar sign ($)
  • The underscore (_)
  • Unicode characters between code points U+0080 and U+FFFF

I've covered the first three conditions with this:

preg_replace('/[^a-zA-Z\d$_]+/', '', $foo);

How do I match the fourth condition? I looked at using \X, but there must be a better way than listing 65,000+ characters.

Answer 1

You can use:

$foo = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);
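  • \w - equivalent to [a-zA-Z0-9_] by default; note that with /u PHP also turns on Unicode properties, so \w additionally matches letters and digits outside ASCII, including those above U+FFFF
  • \x{0080}-\x{FFFF} - matches the characters between code points U+0080 and U+FFFF
  • /u - enables Unicode (UTF-8) support in the regex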
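
As a rough check of how this behaves, here is a minimal sketch run against an arbitrary sample string (the sample and the variable names are assumptions for illustration). It also shows a stricter variant, not part of the answer, that spells out the ASCII ranges instead of \w so that letters above U+FFFF are stripped as well.

```php
<?php
// Arbitrary sample (an assumption for illustration): ASCII, U+00E9, a space,
// punctuation, an emoji (U+1F600) and a supplementary-plane letter (U+10400).
$foo = "a\u{E9}_\$ -!\u{1F600}\u{10400}";

// Pattern from the answer above.
$kept = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);

// Stricter variant (not from the answer): ASCII ranges spelled out instead of \w.
$strict = preg_replace('/[^a-zA-Z0-9$_\x{0080}-\x{FFFF}]+/u', '', $foo);

echo $kept, "\n", $strict, "\n";
```

Both variants drop the space, the punctuation and the emoji; they differ only in whether the U+10400 letter survives.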

Answer 2

An updated answer.

It would be unwise to exclude only the code points U+80 to U+FFFF, given that the Unicode range extends to U+10FFFF.

Nowadays that covers many characters beyond the 16-bit BMP range.

I'll show how to handle the range you want in either UTF-16 or UTF-8/32, whichever encoding you may or may not control.
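
For PHP specifically, a rough illustration of why the UTF-8/32 form is the relevant one: strings are byte sequences, and the /u modifier makes the preg functions read them as UTF-8 and match whole code points. The sample characters and variable names below are arbitrary choices, not taken from the question.

```php
<?php
// Arbitrary sample characters (assumptions for illustration).
$eAcute = "\u{E9}";     // U+00E9, inside the BMP
$emoji  = "\u{1F600}";  // U+1F600, outside the BMP

// strlen() counts bytes: UTF-8 needs 2 bytes for U+00E9 and 4 for U+1F600.
$eAcuteBytes = strlen($eAcute);
$emojiBytes  = strlen($emoji);

// With /u, "." consumes one whole code point, regardless of its byte length.
$eAcutePoints = preg_match_all('/./su', $eAcute);
$emojiPoints  = preg_match_all('/./su', $emoji);

// A \x{...} escape above U+FFFF can be used directly in a /u pattern.
$emojiInUpperPlanes = preg_match('/[\x{10000}-\x{10FFFF}]/u', $emoji);

printf("%d %d %d %d %d\n", $eAcuteBytes, $emojiBytes, $eAcutePoints, $emojiPoints, $emojiInUpperPlanes);
```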

UTF-16

 # UTF-16 regex ;   equivalent UTF-8/32 regex   (?![\x{80}-\x{FFFF}])[$\w]

 (?!
      (?:
           [\x{80}-\x{D7FF}\x{E000}-\x{FFFF}] 
        |  
           [\x{D800}-\x{DBFF}] 
           (?! [\x{DC00}-\x{DFFF}] )
        |  
           [\x{DC00}-\x{DFFF}] 
           (?<! [\x{D800}-\x{DBFF}] [\S\s] )
      )
 )
 [$\w] 

 # Output --------------------------------
 # 77,905 Unicode characters
 # UTF-16 regex  equivalent (using codepoints)
 (?:
      [\x{24}\x{30}-\x{39}\x{41}-\x{5A}\x{5F}\x{61}-\x{7A}] 
   |  
      (?:
           \x{D800} [\x{DC00}-\x{DC0B}\x{DC0D}-\x{DC26}\x{DC28}-\x{DC3A}\x{DC3C}-\x{DC3D}\x{DC3F}-\x{DC4D}\x{DC50}-\x{DC5D}\x{DC80}-\x{DCFA}\x{DDFD}\x{DE80}-\x{DE9C}\x{DEA0}-\x{DED0}\x{DEE0}\x{DF00}-\x{DF1F}\x{DF2D}-\x{DF40}\x{DF42}-\x{DF49}\x{DF50}-\x{DF7A}\x{DF80}-\x{DF9D}\x{DFA0}-\x{DFC3}\x{DFC8}-\x{DFCF}] 
        |  \x{D801} [\x{DC00}-\x{DC9D}\x{DCA0}-\x{DCA9}\x{DCB0}-\x{DCD3}\x{DCD8}-\x{DCFB}\x{DD00}-\x{DD27}\x{DD30}-\x{DD63}\x{DE00}-\x{DF36}\x{DF40}-\x{DF55}\x{DF60}-\x{DF67}] 
        |  \x{D802} [\x{DC00}-\x{DC05}\x{DC08}\x{DC0A}-\x{DC35}\x{DC37}-\x{DC38}\x{DC3C}\x{DC3F}-\x{DC55}\x{DC60}-\x{DC76}\x{DC80}-\x{DC9E}\x{DCE0}-\x{DCF2}\x{DCF4}-\x{DCF5}\x{DD00}-\x{DD15}\x{DD20}-\x{DD39}\x{DD80}-\x{DDB7}\x{DDBE}-\x{DDBF}\x{DE00}-\x{DE03}\x{DE05}-\x{DE06}\x{DE0C}-\x{DE13}\x{DE15}-\x{DE17}\x{DE19}-\x{DE35}\x{DE38}-\x{DE3A}\x{DE3F}\x{DE60}-\x{DE7C}\x{DE80}-\x{DE9C}\x{DEC0}-\x{DEC7}\x{DEC9}-\x{DEE6}\x{DF00}-\x{DF35}\x{DF40}-\x{DF55}\x{DF60}-\x{DF72}\x{DF80}-\x{DF91}] 
        |  \x{D803} [\x{DC00}-\x{DC48}\x{DC80}-\x{DCB2}\x{DCC0}-\x{DCF2}\x{DD00}-\x{DD27}\x{DD30}-\x{DD39}\x{DF00}-\x{DF1C}\x{DF27}\x{DF30}-\x{DF50}\x{DFE0}-\x{DFF6}] 
        |  \x{D804} [\x{DC01}\x{DC03}-\x{DC46}\x{DC66}-\x{DC6F}\x{DC7F}-\x{DC81}\x{DC83}-\x{DCAF}\x{DCB3}-\x{DCB6}\x{DCB9}-\x{DCBA}\x{DCD0}-\x{DCE8}\x{DCF0}-\x{DCF9}\x{DD00}-\x{DD2B}\x{DD2D}-\x{DD34}\x{DD36}-\x{DD3F}\x{DD44}\x{DD50}-\x{DD73}\x{DD76}\x{DD80}-\x{DD81}\x{DD83}-\x{DDB2}\x{DDB6}-\x{DDBE}\x{DDC1}-\x{DDC4}\x{DDC9}-\x{DDCC}\x{DDD0}-\x{DDDA}\x{DDDC}\x{DE00}-\x{DE11}\x{DE13}-\x{DE2B}\x{DE2F}-\x{DE31}\x{DE34}\x{DE36}-\x{DE37}\x{DE3E}\x{DE80}-\x{DE86}\x{DE88}\x{DE8A}-\x{DE8D}\x{DE8F}-\x{DE9D}\x{DE9F}-\x{DEA8}\x{DEB0}-\x{DEDF}\x{DEE3}-\x{DEEA}\x{DEF0}-\x{DEF9}\x{DF00}-\x{DF01}\x{DF05}-\x{DF0C}\x{DF0F}-\x{DF10}\x{DF13}-\x{DF28}\x{DF2A}-\x{DF30}\x{DF32}-\x{DF33}\x{DF35}-\x{DF39}\x{DF3B}-\x{DF3D}\x{DF40}\x{DF50}\x{DF5D}-\x{DF61}\x{DF66}-\x{DF6C}\x{DF70}-\x{DF74}] 
        |  \x{D805} [\x{DC00}-\x{DC34}\x{DC38}-\x{DC3F}\x{DC42}-\x{DC44}\x{DC46}-\x{DC4A}\x{DC50}-\x{DC59}\x{DC5E}-\x{DC5F}\x{DC80}-\x{DCAF}\x{DCB3}-\x{DCB8}\x{DCBA}\x{DCBF}-\x{DCC0}\x{DCC2}-\x{DCC5}\x{DCC7}\x{DCD0}-\x{DCD9}\x{DD80}-\x{DDAE}\x{DDB2}-\x{DDB5}\x{DDBC}-\x{DDBD}\x{DDBF}-\x{DDC0}\x{DDD8}-\x{DDDD}\x{DE00}-\x{DE2F}\x{DE33}-\x{DE3A}\x{DE3D}\x{DE3F}-\x{DE40}\x{DE44}\x{DE50}-\x{DE59}\x{DE80}-\x{DEAB}\x{DEAD}\x{DEB0}-\x{DEB5}\x{DEB7}-\x{DEB8}\x{DEC0}-\x{DEC9}\x{DF00}-\x{DF1A}\x{DF1D}-\x{DF1F}\x{DF22}-\x{DF25}\x{DF27}-\x{DF2B}\x{DF30}-\x{DF39}] 
        |  \x{D806} [\x{DC00}-\x{DC2B}\x{DC2F}-\x{DC37}\x{DC39}-\x{DC3A}\x{DCA0}-\x{DCE9}\x{DCFF}\x{DDA0}-\x{DDA7}\x{DDAA}-\x{DDD0}\x{DDD4}-\x{DDD7}\x{DDDA}-\x{DDDB}\x{DDE0}-\x{DDE1}\x{DDE3}\x{DE00}-\x{DE38}\x{DE3A}-\x{DE3E}\x{DE47}\x{DE50}-\x{DE56}\x{DE59}-\x{DE96}\x{DE98}-\x{DE99}\x{DE9D}\x{DEC0}-\x{DEF8}] 
        |  \x{D807} [\x{DC00}-\x{DC08}\x{DC0A}-\x{DC2E}\x{DC30}-\x{DC36}\x{DC38}-\x{DC3D}\x{DC3F}-\x{DC40}\x{DC50}-\x{DC59}\x{DC72}-\x{DC8F}\x{DC92}-\x{DCA7}\x{DCAA}-\x{DCB0}\x{DCB2}-\x{DCB3}\x{DCB5}-\x{DCB6}\x{DD00}-\x{DD06}\x{DD08}-\x{DD09}\x{DD0B}-\x{DD36}\x{DD3A}\x{DD3C}-\x{DD3D}\x{DD3F}-\x{DD47}\x{DD50}-\x{DD59}\x{DD60}-\x{DD65}\x{DD67}-\x{DD68}\x{DD6A}-\x{DD89}\x{DD90}-\x{DD91}\x{DD95}\x{DD97}-\x{DD98}\x{DDA0}-\x{DDA9}\x{DEE0}-\x{DEF4}] 
        |  \x{D808} [\x{DC00}-\x{DF99}] 
        |  \x{D809} [\x{DC80}-\x{DD43}] 
        |  \x{D80C} [\x{DC00}-\x{DFFF}] 
        |  \x{D80D} [\x{DC00}-\x{DC2E}] 
        |  \x{D811} [\x{DC00}-\x{DE46}] 
        |  \x{D81A} [\x{DC00}-\x{DE38}\x{DE40}-\x{DE5E}\x{DE60}-\x{DE69}\x{DED0}-\x{DEED}\x{DEF0}-\x{DEF4}\x{DF00}-\x{DF36}\x{DF40}-\x{DF43}\x{DF50}-\x{DF59}\x{DF63}-\x{DF77}\x{DF7D}-\x{DF8F}] 
        |  \x{D81B} [\x{DE40}-\x{DE7F}\x{DF00}-\x{DF4A}\x{DF4F}-\x{DF50}\x{DF8F}-\x{DF9F}\x{DFE0}-\x{DFE1}\x{DFE3}] 
        |  [\x{D81C}-\x{D820}] [\x{DC00}-\x{DFFF}] 
        |  \x{D821} [\x{DC00}-\x{DFF7}] 
        |  \x{D822} [\x{DC00}-\x{DEF2}] 
        |  \x{D82C} [\x{DC00}-\x{DD1E}\x{DD50}-\x{DD52}\x{DD64}-\x{DD67}\x{DD70}-\x{DEFB}] 
        |  \x{D82F} [\x{DC00}-\x{DC6A}\x{DC70}-\x{DC7C}\x{DC80}-\x{DC88}\x{DC90}-\x{DC99}\x{DC9D}-\x{DC9E}] 
        |  \x{D834} [\x{DD67}-\x{DD69}\x{DD7B}-\x{DD82}\x{DD85}-\x{DD8B}\x{DDAA}-\x{DDAD}\x{DE42}-\x{DE44}] 
        |  \x{D835} [\x{DC00}-\x{DC54}\x{DC56}-\x{DC9C}\x{DC9E}-\x{DC9F}\x{DCA2}\x{DCA5}-\x{DCA6}\x{DCA9}-\x{DCAC}\x{DCAE}-\x{DCB9}\x{DCBB}\x{DCBD}-\x{DCC3}\x{DCC5}-\x{DD05}\x{DD07}-\x{DD0A}\x{DD0D}-\x{DD14}\x{DD16}-\x{DD1C}\x{DD1E}-\x{DD39}\x{DD3B}-\x{DD3E}\x{DD40}-\x{DD44}\x{DD46}\x{DD4A}-\x{DD50}\x{DD52}-\x{DEA5}\x{DEA8}-\x{DEC0}\x{DEC2}-\x{DEDA}\x{DEDC}-\x{DEFA}\x{DEFC}-\x{DF14}\x{DF16}-\x{DF34}\x{DF36}-\x{DF4E}\x{DF50}-\x{DF6E}\x{DF70}-\x{DF88}\x{DF8A}-\x{DFA8}\x{DFAA}-\x{DFC2}\x{DFC4}-\x{DFCB}\x{DFCE}-\x{DFFF}] 
        |  \x{D836} [\x{DE00}-\x{DE36}\x{DE3B}-\x{DE6C}\x{DE75}\x{DE84}\x{DE9B}-\x{DE9F}\x{DEA1}-\x{DEAF}] 
        |  \x{D838} [\x{DC00}-\x{DC06}\x{DC08}-\x{DC18}\x{DC1B}-\x{DC21}\x{DC23}-\x{DC24}\x{DC26}-\x{DC2A}\x{DD00}-\x{DD2C}\x{DD30}-\x{DD3D}\x{DD40}-\x{DD49}\x{DD4E}\x{DEC0}-\x{DEF9}] 
        |  \x{D83A} [\x{DC00}-\x{DCC4}\x{DCD0}-\x{DCD6}\x{DD00}-\x{DD4B}\x{DD50}-\x{DD59}] 
        |  \x{D83B} [\x{DE00}-\x{DE03}\x{DE05}-\x{DE1F}\x{DE21}-\x{DE22}\x{DE24}\x{DE27}\x{DE29}-\x{DE32}\x{DE34}-\x{DE37}\x{DE39}\x{DE3B}\x{DE42}\x{DE47}\x{DE49}\x{DE4B}\x{DE4D}-\x{DE4F}\x{DE51}-\x{DE52}\x{DE54}\x{DE57}\x{DE59}\x{DE5B}\x{DE5D}\x{DE5F}\x{DE61}-\x{DE62}\x{DE64}\x{DE67}-\x{DE6A}\x{DE6C}-\x{DE72}\x{DE74}-\x{DE77}\x{DE79}-\x{DE7C}\x{DE7E}\x{DE80}-\x{DE89}\x{DE8B}-\x{DE9B}\x{DEA1}-\x{DEA3}\x{DEA5}-\x{DEA9}\x{DEAB}-\x{DEBB}] 
        |  [\x{D840}-\x{D868}] [\x{DC00}-\x{DFFF}] 
        |  \x{D869} [\x{DC00}-\x{DED6}\x{DF00}-\x{DFFF}] 
        |  [\x{D86A}-\x{D86C}] [\x{DC00}-\x{DFFF}] 
        |  \x{D86D} [\x{DC00}-\x{DF34}\x{DF40}-\x{DFFF}] 
        |  \x{D86E} [\x{DC00}-\x{DC1D}\x{DC20}-\x{DFFF}] 
        |  [\x{D86F}-\x{D872}] [\x{DC00}-\x{DFFF}] 
        |  \x{D873} [\x{DC00}-\x{DEA1}\x{DEB0}-\x{DFFF}] 
        |  [\x{D874}-\x{D879}] [\x{DC00}-\x{DFFF}] 
        |  \x{D87A} [\x{DC00}-\x{DFE0}] 
        |  \x{D87E} [\x{DC00}-\x{DE1D}] 
        |  \x{DB40} [\x{DD00}-\x{DDEF}] 
      )
 )

UTF-8/32

 # UTF-8/32 regex ;

 (?! [\x{80}-\x{FFFF}] )
 [$\w] 


 # Output --------------------------------
 # 77,905 Unicode characters
 # UTF-8 / 32 regex equivalent (using codepoints)

 (?:
      [\x{24}\x{30}-\x{39}\x{41}-\x{5A}\x{5F}\x{61}-\x{7A}\x{10000}-\x{1000B}\x{1000D}-\x{10026}\x{10028}-\x{1003A}\x{1003C}-\x{1003D}\x{1003F}-\x{1004D}\x{10050}-\x{1005D}\x{10080}-\x{100FA}\x{101FD}\x{10280}-\x{1029C}\x{102A0}-\x{102D0}\x{102E0}\x{10300}-\x{1031F}\x{1032D}-\x{10340}\x{10342}-\x{10349}\x{10350}-\x{1037A}\x{10380}-\x{1039D}\x{103A0}-\x{103C3}\x{103C8}-\x{103CF}\x{10400}-\x{1049D}\x{104A0}-\x{104A9}\x{104B0}-\x{104D3}\x{104D8}-\x{104FB}\x{10500}-\x{10527}\x{10530}-\x{10563}\x{10600}-\x{10736}\x{10740}-\x{10755}\x{10760}-\x{10767}\x{10800}-\x{10805}\x{10808}\x{1080A}-\x{10835}\x{10837}-\x{10838}\x{1083C}\x{1083F}-\x{10855}\x{10860}-\x{10876}\x{10880}-\x{1089E}\x{108E0}-\x{108F2}\x{108F4}-\x{108F5}\x{10900}-\x{10915}\x{10920}-\x{10939}\x{10980}-\x{109B7}\x{109BE}-\x{109BF}\x{10A00}-\x{10A03}\x{10A05}-\x{10A06}\x{10A0C}-\x{10A13}\x{10A15}-\x{10A17}\x{10A19}-\x{10A35}\x{10A38}-\x{10A3A}\x{10A3F}\x{10A60}-\x{10A7C}\x{10A80}-\x{10A9C}\x{10AC0}-\x{10AC7}\x{10AC9}-\x{10AE6}\x{10B00}-\x{10B35}\x{10B40}-\x{10B55}\x{10B60}-\x{10B72}\x{10B80}-\x{10B91}\x{10C00}-\x{10C48}\x{10C80}-\x{10CB2}\x{10CC0}-\x{10CF2}\x{10D00}-\x{10D27}\x{10D30}-\x{10D39}\x{10F00}-\x{10F1C}\x{10F27}\x{10F30}-\x{10F50}\x{10FE0}-\x{10FF6}\x{11001}\x{11003}-\x{11046}\x{11066}-\x{1106F}\x{1107F}-\x{11081}\x{11083}-\x{110AF}\x{110B3}-\x{110B6}\x{110B9}-\x{110BA}\x{110D0}-\x{110E8}\x{110F0}-\x{110F9}\x{11100}-\x{1112B}\x{1112D}-\x{11134}\x{11136}-\x{1113F}\x{11144}\x{11150}-\x{11173}\x{11176}\x{11180}-\x{11181}\x{11183}-\x{111B2}\x{111B6}-\x{111BE}\x{111C1}-\x{111C4}\x{111C9}-\x{111CC}\x{111D0}-\x{111DA}\x{111DC}\x{11200}-\x{11211}\x{11213}-\x{1122B}\x{1122F}-\x{11231}\x{11234}\x{11236}-\x{11237}\x{1123E}\x{11280}-\x{11286}\x{11288}\x{1128A}-\x{1128D}\x{1128F}-\x{1129D}\x{1129F}-\x{112A8}\x{112B0}-\x{112DF}\x{112E3}-\x{112EA}\x{112F0}-\x{112F9}\x{11300}-\x{11301}\x{11305}-\x{1130C}\x{1130F}-\x{11310}\x{11313}-\x{11328}\x{1132A}-\x{11330}\x{11332}-\x{11333}\x{11335}-\x{11339}\x{1133B}-\x{1133D}\
x{11340}\x{11350}\x{1135D}-\x{11361}\x{11366}-\x{1136C}\x{11370}-\x{11374}\x{11400}-\x{11434}\x{11438}-\x{1143F}\x{11442}-\x{11444}\x{11446}-\x{1144A}\x{11450}-\x{11459}\x{1145E}-\x{1145F}\x{11480}-\x{114AF}\x{114B3}-\x{114B8}\x{114BA}\x{114BF}-\x{114C0}\x{114C2}-\x{114C5}\x{114C7}\x{114D0}-\x{114D9}\x{11580}-\x{115AE}\x{115B2}-\x{115B5}\x{115BC}-\x{115BD}\x{115BF}-\x{115C0}\x{115D8}-\x{115DD}\x{11600}-\x{1162F}\x{11633}-\x{1163A}\x{1163D}\x{1163F}-\x{11640}\x{11644}\x{11650}-\x{11659}\x{11680}-\x{116AB}\x{116AD}\x{116B0}-\x{116B5}\x{116B7}-\x{116B8}\x{116C0}-\x{116C9}\x{11700}-\x{1171A}\x{1171D}-\x{1171F}\x{11722}-\x{11725}\x{11727}-\x{1172B}\x{11730}-\x{11739}\x{11800}-\x{1182B}\x{1182F}-\x{11837}\x{11839}-\x{1183A}\x{118A0}-\x{118E9}\x{118FF}\x{119A0}-\x{119A7}\x{119AA}-\x{119D0}\x{119D4}-\x{119D7}\x{119DA}-\x{119DB}\x{119E0}-\x{119E1}\x{119E3}\x{11A00}-\x{11A38}\x{11A3A}-\x{11A3E}\x{11A47}\x{11A50}-\x{11A56}\x{11A59}-\x{11A96}\x{11A98}-\x{11A99}\x{11A9D}\x{11AC0}-\x{11AF8}\x{11C00}-\x{11C08}\x{11C0A}-\x{11C2E}\x{11C30}-\x{11C36}\x{11C38}-\x{11C3D}\x{11C3F}-\x{11C40}\x{11C50}-\x{11C59}\x{11C72}-\x{11C8F}\x{11C92}-\x{11CA7}\x{11CAA}-\x{11CB0}\x{11CB2}-\x{11CB3}\x{11CB5}-\x{11CB6}\x{11D00}-\x{11D06}\x{11D08}-\x{11D09}\x{11D0B}-\x{11D36}\x{11D3A}\x{11D3C}-\x{11D3D}\x{11D3F}-\x{11D47}\x{11D50}-\x{11D59}\x{11D60}-\x{11D65}\x{11D67}-\x{11D68}\x{11D6A}-\x{11D89}\x{11D90}-\x{11D91}\x{11D95}\x{11D97}-\x{11D98}\x{11DA0}-\x{11DA9}\x{11EE0}-\x{11EF4}\x{12000}-\x{12399}\x{12480}-\x{12543}\x{13000}-\x{1342E}\x{14400}-\x{14646}\x{16800}-\x{16A38}\x{16A40}-\x{16A5E}\x{16A60}-\x{16A69}\x{16AD0}-\x{16AED}\x{16AF0}-\x{16AF4}\x{16B00}-\x{16B36}\x{16B40}-\x{16B43}\x{16B50}-\x{16B59}\x{16B63}-\x{16B77}\x{16B7D}-\x{16B8F}\x{16E40}-\x{16E7F}\x{16F00}-\x{16F4A}\x{16F4F}-\x{16F50}\x{16F8F}-\x{16F9F}\x{16FE0}-\x{16FE1}\x{16FE3}\x{17000}-\x{187F7}\x{18800}-\x{18AF2}\x{1B000}-\x{1B11E}\x{1B150}-\x{1B152}\x{1B164}-\x{1B167}\x{1B170}-\x{1B2FB}\x{1BC00}-\x{1BC6A}\x{1BC70}-\x{1BC7C}\x{1BC80}-\x{
1BC88}\x{1BC90}-\x{1BC99}\x{1BC9D}-\x{1BC9E}\x{1D167}-\x{1D169}\x{1D17B}-\x{1D182}\x{1D185}-\x{1D18B}\x{1D1AA}-\x{1D1AD}\x{1D242}-\x{1D244}\x{1D400}-\x{1D454}\x{1D456}-\x{1D49C}\x{1D49E}-\x{1D49F}\x{1D4A2}\x{1D4A5}-\x{1D4A6}\x{1D4A9}-\x{1D4AC}\x{1D4AE}-\x{1D4B9}\x{1D4BB}\x{1D4BD}-\x{1D4C3}\x{1D4C5}-\x{1D505}\x{1D507}-\x{1D50A}\x{1D50D}-\x{1D514}\x{1D516}-\x{1D51C}\x{1D51E}-\x{1D539}\x{1D53B}-\x{1D53E}\x{1D540}-\x{1D544}\x{1D546}\x{1D54A}-\x{1D550}\x{1D552}-\x{1D6A5}\x{1D6A8}-\x{1D6C0}\x{1D6C2}-\x{1D6DA}\x{1D6DC}-\x{1D6FA}\x{1D6FC}-\x{1D714}\x{1D716}-\x{1D734}\x{1D736}-\x{1D74E}\x{1D750}-\x{1D76E}\x{1D770}-\x{1D788}\x{1D78A}-\x{1D7A8}\x{1D7AA}-\x{1D7C2}\x{1D7C4}-\x{1D7CB}\x{1D7CE}-\x{1D7FF}\x{1DA00}-\x{1DA36}\x{1DA3B}-\x{1DA6C}\x{1DA75}\x{1DA84}\x{1DA9B}-\x{1DA9F}\x{1DAA1}-\x{1DAAF}\x{1E000}-\x{1E006}\x{1E008}-\x{1E018}\x{1E01B}-\x{1E021}\x{1E023}-\x{1E024}\x{1E026}-\x{1E02A}\x{1E100}-\x{1E12C}\x{1E130}-\x{1E13D}\x{1E140}-\x{1E149}\x{1E14E}\x{1E2C0}-\x{1E2F9}\x{1E800}-\x{1E8C4}\x{1E8D0}-\x{1E8D6}\x{1E900}-\x{1E94B}\x{1E950}-\x{1E959}\x{1EE00}-\x{1EE03}\x{1EE05}-\x{1EE1F}\x{1EE21}-\x{1EE22}\x{1EE24}\x{1EE27}\x{1EE29}-\x{1EE32}\x{1EE34}-\x{1EE37}\x{1EE39}\x{1EE3B}\x{1EE42}\x{1EE47}\x{1EE49}\x{1EE4B}\x{1EE4D}-\x{1EE4F}\x{1EE51}-\x{1EE52}\x{1EE54}\x{1EE57}\x{1EE59}\x{1EE5B}\x{1EE5D}\x{1EE5F}\x{1EE61}-\x{1EE62}\x{1EE64}\x{1EE67}-\x{1EE6A}\x{1EE6C}-\x{1EE72}\x{1EE74}-\x{1EE77}\x{1EE79}-\x{1EE7C}\x{1EE7E}\x{1EE80}-\x{1EE89}\x{1EE8B}-\x{1EE9B}\x{1EEA1}-\x{1EEA3}\x{1EEA5}-\x{1EEA9}\x{1EEAB}-\x{1EEBB}\x{20000}-\x{2A6D6}\x{2A700}-\x{2B734}\x{2B740}-\x{2B81D}\x{2B820}-\x{2CEA1}\x{2CEB0}-\x{2EBE0}\x{2F800}-\x{2FA1D}\x{E0100}-\x{E01EF}] 
 )
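
To see the UTF-8/32 pattern at work in PHP, the sketch below tests it against a handful of single characters. The samples are arbitrary (assumptions for illustration), and the pattern is anchored with ^ and \z only so that each one-character string must match as a whole.

```php
<?php
// UTF-8/32 pattern from above, anchored to the whole one-character string.
$pattern = '/^(?![\x{80}-\x{FFFF}])[$\w]\z/u';

// Arbitrary sample characters (assumptions for illustration).
$samples = [
    'a'       => 'a',
    '$'       => '$',
    'U+00E9'  => "\u{E9}",     // BMP letter: rejected by the lookahead
    'U+10400' => "\u{10400}",  // supplementary-plane letter (Deseret)
    'U+1F600' => "\u{1F600}",  // emoji: not a word character
];

$results = [];
foreach ($samples as $label => $char) {
    $results[$label] = preg_match($pattern, $char);
    printf("%-8s %s\n", $label, $results[$label] ? 'match' : 'no match');
}
```

The U+10400 case is the kind of character that makes up the long supplementary-plane part of the class listed above.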

Finally, the simplest option is to extend the range up to U+10FFFF

 # UTF-8/32 regex ;

 (?! [\x{80}-\x{10FFFF}] )
 [$\w] 


 # Output --------------------------------
 # 64 Unicode characters
 # UTF-8 / 16/ 32 regex equivalent (using codepoints)

 [\x{24}\x{30}-\x{39}\x{41}-\x{5A}\x{5F}\x{61}-\x{7A}] 

 # Codepoint -> character substitution

 [$0-9A-Z_a-z]  
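
The 64-character figure can be checked from PHP with a short loop. Once the lookahead rejects everything from U+0080 upward, only ASCII candidates remain, so it is enough to test code points 0x00 to 0x7F. This is a sketch; the variable names are illustrative.

```php
<?php
// Extended-range pattern from above, anchored to a single character.
$pattern = '/^(?![\x{80}-\x{10FFFF}])[$\w]\z/u';

$matched = '';
for ($cp = 0; $cp < 0x80; $cp++) {
    $char = chr($cp); // single-byte ASCII characters are valid UTF-8
    if (preg_match($pattern, $char) === 1) {
        $matched .= $char;
    }
}

// Print how many ASCII characters matched, then the characters themselves.
echo strlen($matched), "\n", $matched, "\n";
```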

Answer 3

If what you really want is to sanitize a string for MySQL's default collation (utf8_general_ci), removing emoji will not be enough. utf8_general_ci belongs to the utf8/utf8mb3 character set, which supports only the range 0x0000 to 0xFFFF (the Basic Multilingual Plane). So I suggest removing every character with a code point above 0xFFFF (up to 0x10FFFF, plane 16: SPUA-B, which I believe is the highest currently defined according to https://en.wikipedia.org/wiki/Plane_(Unicode))

function removeNonBasicMultilingualPlane(string $text): string
{
   return \preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
}
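
One thing to watch: preg_replace() returns null on failure (for example when the input is not valid UTF-8), which does not satisfy the string return type. Below is a hedged sketch of a variant that makes that case explicit. The name stripSupplementaryPlanes and the choice of exception are assumptions for illustration, not part of the answer.

```php
<?php
// Hypothetical variant of the function above: same pattern, but a failed
// preg_replace() (null result, e.g. for invalid UTF-8) raises an exception.
function stripSupplementaryPlanes(string $text): string
{
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace failed with error code ' . preg_last_error()
        );
    }
    return $result;
}

// Arbitrary sample: U+00E9 is inside the BMP and stays, U+1F600 is removed.
echo stripSupplementaryPlanes("caf\u{E9} \u{1F600} ok"), "\n";
```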