If you see a great character in a tweet and want to find similar or related characters, you can use tools like this one. Copy the characters from the tweet and paste them into this tool. Let’s assume this is the input you copied from a tweet: ▬ⓘ▬ⓝ▬ⓖ▬▬ⓙ▬ⓤ▬ⓜ▬ⓟ+▬▬®
When you paste it and hit enter, the result will be something like this:
It gives two columns of numerical character codes for the characters. Note the number in the left column for ⓝ. It’s 24dd. Yes, that’s a number, a hexadecimal number.
The 24 part of the number can help find the group or block the character is part of:
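For anyone who would rather not depend on a web tool, the same left-column numbers can be produced with a few lines of Python. This is only a sketch: the sample string is the one from the tweet above, and `code_points` is a made-up helper name, not part of any library.

```python
def code_points(text):
    """Return (character, hex code point) pairs for every character in text."""
    return [(ch, f"{ord(ch):04x}") for ch in text]

sample = "▬ⓘ▬ⓝ▬ⓖ▬▬ⓙ▬ⓤ▬ⓜ▬ⓟ+▬▬®"
for ch, hex_code in code_points(sample):
    # Everything except the last two hex digits is the part that hints at the block.
    print(ch, hex_code, "prefix:", hex_code[:-2])
```

For ⓝ this prints 24dd with the prefix 24, the same number the converter shows.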
There’s something really fabulous on this http://www.decodeunicode.org page. Notice under the big bold type there are two rows — the upper one longer than the lower one — of vertical dashes. Move the mouse/cursor (mouseover) those vertical marks and watch the page show the names and a sample character from the unicode blocks. It also gives the category of the block. I can tell you from experience with other MUCH MUCH more cumbersome user interfaces for browsing Unicode characters that this one is absolutely brilliant. And VERY concise. And fast. Amazing. The best ever. If you intend to play with Unicode characters, you’ll want to bookmark this site.
The characters in the chart on that screen are displayed with little .gif graphics files. That means you can’t just copy and paste them into twitter from there. You can only type or paste *characters* into twitter tweets, not graphics files. If you click on the chart/table, then click on the character you’re interested in, the site will give you a page for that letter. The main and big display of the character is also a graphic and is no help for twitter. But on the right side of the page, there’s a small input box with that character automatically written into it. You can copy it there and paste into twitter. Here’s an example:
The main source for all the characters and character codes is the Unicode Consortium
The home page – http://unicode.org/
A list of the character blocks/charts – http://www.unicode.org/charts/
Here’s an example: The Japanese hiragana characters – http://www.unicode.org/charts/PDF/U3040.pdf
Escape Characters, Escape Sequences, Character Codes, Unicode, and Such
THE USUAL MOTIVATION
pretty girl in a hat was interested, so, since chivalry and desire to impress a girl in order to get into … well … since chivalry and that other thing are not dead … in fact, since they’re alive and well … we’ll have a go at these deliciously complicated techie nerd character code issues …
we mostly think of “characters” as letters and numbers and some other things we see on the keys of a keyboard, computer screens, and printed pages … if we think about it a little more, we realize there are far more “characters” showing up on our computer screens and printed pages than we see painted on the tops of our keyboard keys … the difference makes a big difference …
IT ALL BEGAN WITH THE TYPEWRITER
The difference started being important back when “teletype” online terminals first began to replace typewriters. Typewriters had shift keys for upper case letters and to get to the characters at the tops of the keys. In order to replace a typewriter, computers needed to be able to replicate all that, but, because they were computers and not humans, more codes had to be added so the human at the teletype terminal could exert certain kinds of control over what the computer was doing in various parts of the process. So things like the control key, escape key, and the interrupt key were added to the keyboard of the teletype machine. Not because they wanted to be able to type on an English keyboard and get Chinese characters, but for simple control of typing, display, printing, and the overall computer connection.
WHAT IS A CHARACTER?
to the computer in the middle between the keyboard and the screen, all of these characters are binary numbers that are combined with other numbers to form an address in memory where there’s information about how to create the image of that number using the pixels of the screen or printed page. right, we sort of already knew that. but the difference between “sorta already knew that” and “really knowing how it works” is the difference between being able and not being able to sort out nasty little character code problems that can make our wonderful web-page or document or presentation slide or graphic work look dumb.
when you really get into this, it becomes impressive that ANY input (like a tweet including ordinary letters and numbers, but especially things like hashtags and twitter art) into ANY application (like twitter or wordpress or blogspot or the text caption portions of twitpic or yfrog) displays the same way (and the way the writer intended it) on ALL (or even all the latest) versions of ALL browsers and, oh by the way, ALL iphone and blackberry and other portable apps.
if you were fooling around
hope you were
but back to computing issues …
if you were fooling around with computers in the 70’s or 80’s, or even still a little in the 90’s, but even still for the more exotic characters now, you may have noticed that something you typed looked ok on your screen when you were typing it, but came out with weird %$#%^ stuff in some places when you tried to print it. that was because, somewhere, in and between several pieces of software (the computer’s keyboard driver software that “catches” the keystrokes and mouse moves you make, the computer’s screen display driver software, the printer’s printer driver software, and the parts of the main word processor or game or other application software that “talk” to those other three pieces of software), there were disagreements on which binary numbers represented which characters.
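That kind of disagreement is easy to reproduce on purpose. Here is a small Python sketch, assuming one program writes its text as UTF-8 bytes and another reads those same bytes as Latin-1. The word "café" is just an example picked for this illustration.

```python
text = "café"

raw_bytes = text.encode("utf-8")        # what the first program writes out
garbled = raw_bytes.decode("latin-1")   # what the second program thinks it received
round_trip = raw_bytes.decode("utf-8")  # what it would get if both sides agreed

print(raw_bytes.hex())  # the numbers that actually travel between the programs
print(garbled)          # the accented letter turns into two odd characters
print(round_trip)       # the original text again
```

The bytes never change. Only the list used to turn numbers back into characters does.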
That’s good for that part for now. Let’s restart now in a different place. If you want to represent characters in a computer, you need to make a list and give each one a number. Let’s not worry how it’s really done for a moment. Let’s say we’re inventing computers for the first time and we get to decide which number goes with which letter and number. So we let “a” be character 1, “b” be 2, “c” be 3, and so forth, right? But wait a minute. We also need “1” and “2” and “3”, don’t we. Why not let them be 1 and 2 and 3 on our list? So there’s the first issue. When “character codes” are established, different people can easily — sometimes on some logical basis, sometimes arbitrarily, almost always a combination of both — assign different character code numbers to different intended characters. (By the way, that is what we were just inventing — a “character code”, a list of characters from 1 to 32, or from 1 to 64, or from 1 to some power of 2. Why a power of 2? Think about it. I’ll come back to it. )
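Here is a toy version of that thought experiment in Python. Both tables are invented for this example (neither is a real character code): one numbers the letters first, the other numbers the digits first. The `bits_needed` helper is also made up. It hints at the power-of-2 question, since n bits can hold exactly 2 to the n different codes.

```python
import string

# Table A: letters first. Table B: digits first. Both are made up for illustration.
table_a = {ch: i for i, ch in enumerate(string.ascii_lowercase + string.digits, start=1)}
table_b = {ch: i for i, ch in enumerate(string.digits + string.ascii_lowercase, start=1)}
decode_b = {number: ch for ch, number in table_b.items()}

message = "abc123"
numbers = [table_a[ch] for ch in message]        # the sender encodes with table A
misread = "".join(decode_b[n] for n in numbers)  # the receiver decodes with table B

def bits_needed(character_count):
    """Smallest number of bits that can give each character its own number."""
    return max(1, (character_count - 1).bit_length())

print(numbers, misread)
print(bits_needed(len(table_a)))  # 36 characters do not fit in 5 bits (32 codes)
```

Same numbers on the wire, different characters on the screen, which is the whole problem in miniature.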
BY THE WAY, IT’S SIMPLE NOW
Actually, the underlying challenge is still detailed and complicated by the basic physical and logical facts of characters, computers, connections, pixel displays, and printers. But it’s now remarkably easy for the masses of non-technical or somewhat technical users. For very technical users and designers, there are still some headaches because they’re trying to do unusual things. But 50 years … of tech progress … of plummeting costs per unit of performance of cpu cycles and memory and storage and display and printers … and of leaders of various industry segments cooperating on creating, agreeing to, and adopting … better and simpler yet more natural and comprehensive standards for dealing with character and control codes … have resulted in users and designers being able to do MUCH more complex things more simply, more naturally, more quickly, and MUCH more reliably. We mostly don’t see odd ugly strings of characters showing up on display screens and printed pages these days even though there’s a lot more multi-language, font flexibility, and such.
so these issues i’m discussing are “under the hood” of the modern laptop, desktop, individual phone/ipod/blackberry, and on-line connection, just handled by standards and certain smart design disciplines to take advantage of the standards. bet @al3x and his pals would agree.
Unless an additional conceptual, administrative, or technical breakthrough has happened (I doubt it, but i’m no longer current), Unicode is the key to having all the right players and parts singing from the same songsheet. Unicode is backed by an organization, a group of people (the Unicode Consortium). It’s also a VERY long and VERY comprehensive list of characters and (I think) reserved numbers for printer and other control codes (other control codes like maybe robotics and remote sensors and stuff, although i may be giving unicode too much credit there … that may fall to other standards-making bodies like IEEE and such … not important for this discussion).
It used to be important, or at least useful, to know a handful of character codes like ascii, latin-1, i don’t know, latin extended, and a few more. But now I’m pretty sure the only one to know is unicode. I’m pretty sure that anything that’s still relevant about the old codes is preserved and documented in some part of the Unicode numbering scheme.
the unicode character lists are available online at unicode’s site. the last several times i’ve gone to it, i found it authoritative, but not so easy to use.
the best interface to unicode codes and characters i found earlier this year or maybe it was last year … where did i put that … there it is, further up on this page with a short discussion about how to use its brilliant mouseover user interface to browse the unicode, which, in my experience, is otherwise, NOWHERE NEAR this quick, natural, and easy. Here’s that site set to show the “enclosed alphanumerics” segment — actually unicode calls them, “blocks”:
here’s the same site, but on a different block of unicode, miscellaneous symbols
As more and more developers in various parts of the overall system software and hardware world have adopted unicode, the more all those crazy system and printer “crashes”, and printers printing all the lines of a page on one line (carriage return disabled by some errant character or control code), and printers printing a few dozen characters of gibberish and then ejecting the page and doing that until the user leaped over and shut off the printer’s power or the stack of paper ran out :), and other weird displayed and printed results have pretty much just disappeared.
USING UNICODE CHARACTER CODES IN TWEETS
Up at the top of this page is how to use a few different sources of unicode characters for copy-and-paste of Unicode characters into tweets (vs. typing or copy/pasting Unicode character codes).
I just did some trial and error on twitter and, partly remembering, partly figuring it out all over again, worked out how to insert the decimal (base 10) or hexadecimal (base 16) numerical Unicode character codes into tweets.
For hexadecimal numbers (which are the ones usually available by default in the code block charts for unicode characters and use 0-9, a-f as digit values), the formula, using the little umbrella (hex 2602) as the example, is: & #x2602; (no space after ampersand) for ☂. The fact that the little umbrella shows up says that this also works for wordpress, not just twitter.
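The "& # x number ;" recipe can be checked away from twitter with Python's standard `html` module. This is a sketch: `hex_reference` is a hypothetical helper name, and `html.unescape` only shows how Python decodes the reference, not what twitter or wordpress do internally.

```python
import html

def hex_reference(ch):
    """Build the hexadecimal character reference for a single character."""
    return f"&#x{ord(ch):X};"

umbrella_ref = hex_reference("\u2602")  # \u2602 is the little umbrella
print(umbrella_ref)                     # the text you would type
print(html.unescape(umbrella_ref))      # the character it stands for
```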
For using decimal (base 10, in other words, the usual 0-9 for digit placeholder values) versions of numerical Unicode character codes, it’s & x2602; (no space after &) for &x2602;, which looks like a character in the Russian alphabet.
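Nope. Ok, that’s an excellent error for use as a teaching point. If you’re working with these codes, you have to be really picky and not go too far into “auto” typing mode. The decimal form drops the x but keeps the #, so it should be & #2602; (no space after &) for ਪ. One more catch: that character is not Russian at all. Decimal 2602 is hex 0A2A, a letter from the Gurmukhi script used to write Punjabi. The same digits mean different numbers in base 10 and base 16, so the decimal code for the little umbrella (hex 2602) is 9730, written & #9730; (no space after &).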
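A quick way to keep decimal and hex straight is to let Python do the base conversion. This sketch uses only the standard library. The variable names are just for illustration.

```python
import html

hex_digits = "2602"               # the number from the code chart (base 16)
as_decimal = int(hex_digits, 16)  # the same number written in base 10

decimal_ref = f"&#{as_decimal};"  # decimal reference for the umbrella
wrong_ref = "&#2602;"             # chart digits pasted in as if they were decimal

print(as_decimal)
print(html.unescape(decimal_ref))  # the umbrella
print(html.unescape(wrong_ref))    # a different character entirely
```

Reading the digits 2602 as decimal lands on code point hex 0A2A, which is why a stray letter shows up instead of the umbrella.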
And the semi-colon is NOT optional. It does work sometimes without the semi-colons, but software somewhere is making assumptions if the opening “& plus #” (plus x for hex) and closing “semi-colon” are not there to make the unicode starting and stopping point definitive. I know from experience today that not using the semi-colon causes weird, confusing, varying/inconsistent results both within the same browser page and also across different browsers (i.e., IE, FF, and Chrome). With the semi-colon, it’s all good.
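One way the missing semi-colon can go wrong, sketched with Python's `html.unescape` (which, per the Python documentation, follows the HTML 5 rules for character references). Browsers may behave differently in the details. The point is only that without the semi-colon, a following letter that happens to be a hex digit can get swallowed into the number.

```python
import html

# Umbrella followed by the letter "a", with and without the semi-colon.
with_semicolon = html.unescape("&#x2602;a")
without_semicolon = html.unescape("&#x2602a")

print(with_semicolon)                    # umbrella, then "a"
print(f"{ord(without_semicolon):x}")     # one character, code point 2602a
```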
Let’s try that cheer again …
Ok, once we’ve found a unicode chart with cool stuff, picked a character, gotten its number, and tweeted it the first time using the number (with & and # and x and the stupid ; semi-colon), we can copy and paste the character itself for future tweets.
Another useful little skill comes in when we see a tweet arrive with a character we like. We can copy and paste that particular one, of course. But if it’s like a D in a circle, but we want an E within a circle, then we want to find the unicode block with all the characters that are letters inside circles. So we need the unicode number for the cool character that we saw in the incoming tweet, the D in the circle.
There’s probably an easy way to do this but I haven’t found it yet. I tried looking at the HTML view here on this WordPress page, but the “html” view is a modified HTML view, not a raw plain text source file. I tried looking at the browser “view source” for the page that had the character, using Ctrl-F to find words close to it, and that didn’t work. I thought it should have. Maybe i wasn’t looking at it right. But it seemed the tweet content itself wasn’t displayed on the web page HTML source display. and i think it might have been XML ish or something. anyway, if nothing else works, i may go back to “view source” or “view xml” or whatever options are available in these reasonably up-to-date browsers. i tried to google for a coder/decoder site. … … bothered me then and now that getting the unicode number for a character isn’t easier than it is … i might just be missing something or screwing something up by typing with my elbows or something …
so starting again on that … may as well use the little umbrella character and try to find a place that will let me input that (copy from a tweet and paste into this tool we need to find) and get in return hex 2602. once i can do that, when combined with the “& plus # plus x plus ;” stuff from earlier, we’ll be able to get any characters related to the tweets we see from all the brilliant twitter art folk. btw, those guys already know all of this, plus how to control line breaks and spacing. but that’s them doing what they do and this is us having fun figuring out this … 🙂
i’m inclined to try (maybe this is try again, not sure) just pasting the character into google and see what shows up … google’s capabilities are a moving target … they get better and better all the time as google adds “ok the user just put in 10 miles so she probably wants the equivalent in kilometers” types of functionality … i think all the major search engines are doing that … and, now that i think about it, even the browsers are doing it, or the browsers are handing off cryptic input in the address line to the default search engine … so both search engine input boxes and browser address input lines are getting smarter and smarter all the time … like you and me …
ok that didn’t work. the chrome browser address line accepted and displayed the pasted ☂, but the google search engine came back and said, didn’t match any documents. don’t be surprised to see that working someday. but people work with unicode all day, every day, all over the world. there must be some simple tool everybody in the unicode programming game uses as quickly and easily as you and I do a Clint Eastwood lookup on imdb, or a Nirvana lookup on wikipedia.
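For what it's worth, one such simple tool ships with Python's standard library. This is a sketch, not a claim about what unicode programmers actually use all day. The built-in `ord` gives the number, the `unicodedata` module gives the official character name, and `describe` is a hypothetical helper name.

```python
import unicodedata

def describe(ch):
    """Return the code point and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}"

print(describe("☂"))  # paste any single character between the quotes
print(chr(0x2602))    # and the other direction: hex number back to character
```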
found this while googling. i didn’t finish reading it, but it gets off to an excellent start, setting the right tone and attitude. it’s from 2003. like the author says, in 2003, the change to unicode was necessary and obvious, and not that hard, but lots of people were still not yet on board. like me. i was learning it the following year mainly because of the work i was doing with chinese, russian, thai, etc. etc. on the web.
I scanned it looking for maybe a reference to a character to code converting tool. That wasn’t there, but I can see that, if you’re interested in this set of issues, this is a GREAT article that takes one from beginner to pretty darn knowledgeable. One can then leave it there or learn more from other places.
TRY IT OUT
There’s a twitter artist, @GuyVincent, who used some characters I hadn’t seen before, but they looked like a combination of Korean Hangul composites (except with 4 little positions instead of only 3) with Japanese kana like characters in the positions. Wondered what character set they were from.
Here’s the GuyVincent tweet
Here’s a copy of one of the character strings: ㌂∞㌡ↂ. It’s the first and third that were catching my eye. I posted that string into Richard’s converter and, voila, the hex codes. The two I was interested in are in the hex 3300 block.
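The converter step and the "which block is that" step can be roughed out in Python too. Python's standard library has no block lookup that I know of, so the `BLOCKS` table below is a small hand-typed subset of the Unicode block ranges (the real list is far longer), and `block_of` is a made-up helper.

```python
# (start, end, name) for a handful of blocks; a tiny illustrative subset only.
BLOCKS = [
    (0x2150, 0x218F, "Number Forms"),
    (0x2200, 0x22FF, "Mathematical Operators"),
    (0x2460, 0x24FF, "Enclosed Alphanumerics"),
    (0x2600, 0x26FF, "Miscellaneous Symbols"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3300, 0x33FF, "CJK Compatibility"),
]

def block_of(ch):
    """Look up the block name for one character in the small table above."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "not in this small table"

for ch in "㌂∞㌡ↂ":
    print(ch, f"{ord(ch):04x}", block_of(ch))
```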
Went to the decodeunicode.org site, and here’s the full block! Cool!
I know from prior work with unicode, japanese, chinese, and hangul that the CJK in “CJK compatibility” refers to “chinese japanese korean.” both japan and korea borrowed the chinese writing system to form the basis for their own writing systems. japanese still uses most of what they borrowed from chinese (they call their version of the chinese characters, kanji … they use them for the base meanings of words, basic nouns, verbs, adjectives, basic ideas), but they also developed two 50-char sets of phonetic syllable symbols called kana. the two kana are hiragana (that they use for all sorts of inflections around base meanings … chinese doesn’t have this and works around it by, for example, indicating future without inflection, figuring saying “eat” and “tomorrow” makes it clear it’s future without “will eat” or “will be eating” kinds of stuff … makes sense … it must … they’ve used it for thousands of years and a billion use it today … not better or worse … just different … ) and katakana (used primarily for foreign words, like for when the japanese want to say barack obama or general motors or miley cyrus or pizza).
here’s some unicode for the hiragana part of japanese kana
here’s unicode for the katakana part of kana
(chinese writing is not phonetic … which means you can’t sound it out from its letters … doesn’t have letters and alphabet … has building blocks called radicals and combos of radicals called characters … in the mandarin and cantonese chinese dialects, each character is pronounced as a syllable … in other chinese dialects and maybe in korean and japanese, each character may be pronounced as 1 or more syllables … one must memorize the sound for each character … as to the meaning, the character is somewhat pictorial and often gives some clue to its meaning … once the meaning is known, the sound of any language can be applied to it … so, this isn’t an actual example, but it will work to show how it works … let’s say there’s a chinese character that means work … someone speaking english could see that char and say “work” … another language say french would see it and say “travail” … and so forth … that’s why it worked for lots of spoken languages to unify ancient china … that’s why japan and korea borrowed it when they had no written language).
the koreans used the chinese system at first, then a korean king (Sejong, in the 15th century) introduced a phonetic system called hangul, so only for ancient, esoteric, literary, or certain academic flourishes is the chinese writing used today in korea.
here’s some of the unicode for hangul
anyway, to deal with both the similarities among chinese, korean, and japanese, but also the unique characters each one needed for computing applications to be complete, the unicode consortium developed the CJK blocks of characters, and CJK compatibility is one of them. so that’s what that block is. what the individual character is, i still don’t know. i didn’t run into characters like that when i was learning the korean, chinese, and japanese writing systems, so this particular character may be like some of the older chinese characters that have been retired from much modern use, but are still needed for academic purposes.
But cool that the tools worked to solve my @GuyVincent question!
SUMMARY: BASIC TOOLSET
So there may be other tools on richard’s site or in other places, but, for now, my toolset is richard ishida’s flexible conversion tool (with so many input/output options) and that decodeunicode.org site (with the amazing fast mouseover user interface for changing among unicode blocks). plus for intro to the whole area, joelonsoftware’s tutorial. Here they are again:
or use packetizer converter to get hex codes (note below)
Note: Just looked over (re-read) the earlier stuff at the top of the page and it turns out that, back in March, I had already found a way to paste characters from a cool twitart tweet into a converter (the packetizer site which works like the one on the richard ishida site) and get hex codes to find the unicode block (on the decodeunicode.org site), find other characters in the block, and paste them into twitter. I had forgotten I’d gotten all the way through to a way to get and use new characters in twitter. So either converter works and both the “copy from decodeunicode.org and paste into twitter” process and the “get the hex code and add & and # and x and ; to it” process work to get the new characters into twitter.
In that earlier work, I hadn’t yet remembered or figured out again how to insert characters into twitter using the hex numerical codes and the well-known, standard, and simple (but easy to screw up in the details of typing) “& # x number ;” format. There was also one confusion factor back then. In the packetizer site, in the left column of the output, are the hex codes we’re using (e.g., 2602 for the little umbrella). But there’s also a right column in the output and I don’t understand why the left one is unicode and the right one is utf-8, and why, when the U in utf means Unicode, the two numbers are different, but that’s not important right now. Having wrassled with this a bit now, it seems likely there are some sites just sitting out there waiting for us to find them that make this whole process very easy. In other words, a site that lets us input the character we like, hit a button, gives us hex, also gives us a button that just takes us to the block the character is in, and at the block chart, lets every character (displayed as a little .gif file) be a button that can be clicked to copy that character onto the clipboard for easy pasting into twitter. Like the “insert symbol” charts in some word processors. twitter apps or twitter itself could probably fairly easily add a utility that did that. like bit.ly lets you push a button to put its shortened URL onto the clipboard for easy pasting. that’s so easy and obvious, i bet it’s already out there and the twitArt crowd probably use such tools all day everyday. fun stuff.
So the packetizer site works too. Can use either the packetizer converter or the Richard Ishida converter to get the hex (hexadecimal number) for an interesting new character that showed up in some incoming tweet, which can then be used with the decodeunicode.org site to find other cool things in that character’s block, which then can be put for the first time into one of your tweets, and fav’d/saved for later copy/pasting into other tweets.
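On the two-column puzzle, a short Python sketch may help. The left number is the character's position in the Unicode list (its code point). UTF-8 is one way of packing that number into bytes for storage or sending, and for anything beyond plain ASCII the packed bytes look different from the number itself. Variable names here are just for illustration.

```python
ch = "\u2602"  # the little umbrella

code_point_hex = f"{ord(ch):04x}"    # the "unicode" column: position in the big list
utf8_hex = ch.encode("utf-8").hex()  # the "utf-8" column: the bytes actually stored

print(code_point_hex)
print(utf8_hex)

# Decoding the bytes gets the same character back, so the two columns agree
# about WHICH character it is; they only differ in how the number is written down.
print(bytes.fromhex(utf8_hex).decode("utf-8") == ch)
```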
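The whole loop (character in, hex out, a peek at the neighbours, and the "& # x number ;" text to type) can be sketched as one small Python function. `explore` is a hypothetical helper, and the "row of 16" is only a rough stand-in for browsing a real block chart, which is usually much bigger than one row.

```python
def explore(ch, row_width=16):
    """Return the hex code, the typed reference, and the chart row around ch."""
    cp = ord(ch)
    row_start = cp - (cp % row_width)  # start of the 16-character chart row
    neighbours = [chr(n) for n in range(row_start, row_start + row_width)]
    return {
        "hex": f"{cp:04x}",
        "reference": f"&#x{cp:04x};",
        "neighbours": neighbours,
    }

info = explore("\u2602")  # the little umbrella again
print(info["hex"], info["reference"])
print(" ".join(info["neighbours"]))
```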
Here’s one … ♌
hex 264c …
Comes from this block
♌ … reminds me of somebody … hex 264c … catchy lyric … maybe make a good backmasking message … “hex 264c is wonderful … hex 264c is beautiful … ” : )