Non ASCII Characters: find out what they are and how to remove them

Anyone who works with code every day, such as SEO specialists or web developers, knows what it means to come across non-ASCII characters and what handling them entails.

On the one hand, ASCII is an invaluable resource: it lets you insert special characters into text or send commands in just a few steps. On the other hand, the characters that fall outside it, the non-ASCII characters, can cause significant problems for anyone who works in programming.

To build a good website and optimize it well, you need to be comfortable with code and to know the difference between ASCII and non-ASCII characters. In theory, the distinction between the two might be taken for granted, but knowing how these characters work and how to manage them helps a website achieve a good position in the SERP.

In this regard, to learn how to best create a site, both in terms of structure and content, I suggest the Digital Coach SEO Course Certification, a theoretical-practical path that allows you to acquire useful skills to position your website higher on Google than your competitors.

This article will give you some theoretical background on the world of codes and will lead you to the discovery of ASCII characters. Specifically, you will understand what non ASCII characters are and how to remove them to better optimize a website.

What are non ASCII characters?

To explain what non-ASCII characters are, we need to start from the beginning, with how ASCII itself came about. The acronym ASCII stands for American Standard Code for Information Interchange.

Introduced in 1963, ASCII is a 7-bit character encoding that represents 128 characters. It made it possible to give specific commands to computers and to convert information into a standardized digital format. It was adopted by operating systems and, more generally, by computers of all kinds.

I guess you’re wondering what 7-bit character encoding is, right? First of all, a bit (binary digit) is the smallest unit of digital information: it is either 0 or 1. Standard ASCII represents each character as a sequence of seven bits. Seven bits allow 2^7 = 128 possible combinations of 0s and 1s, which is why ASCII can represent 128 different characters. To give you an example, the sequence 1010000 (80 in decimal) represents the capital letter “P”.

Subsequently, to respond to more complex needs, an eighth bit was added, which extended the code to 256 characters. Today, the most widely used system is Unicode, a standard that assigns a unique code to every character and covers a much larger set, including Greek and Cyrillic.

Unicode text can be stored in encodings built on 8-, 16-, or 32-bit units (UTF-8, UTF-16, and UTF-32), and the standard is designed to cover all writing systems. It also includes the Braille patterns and the well-known emoji.
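
To make the arithmetic concrete, here is a minimal JavaScript sketch. The names asciiSize, charFromBits, and bitsFromChar are my own, chosen only for illustration; the sketch relies on the standard parseInt and String.fromCharCode functions.

```javascript
// Number of distinct values that fit in 7 bits: 2 to the power of 7.
const asciiSize = 2 ** 7;

// Read a string of 0s and 1s as a binary number, then map it to a character.
function charFromBits(bits) {
  const code = parseInt(bits, 2);
  return String.fromCharCode(code);
}

// The reverse: turn a character into its 7-digit binary form.
function bitsFromChar(ch) {
  return ch.charCodeAt(0).toString(2).padStart(7, "0");
}

console.log(asciiSize);               // 128
console.log(charFromBits("1010000")); // P
console.log(bitsFromChar("A"));       // 1000001
```

The sequence 1010000 is 80 in decimal, and 80 is the ASCII code of the capital letter “P”.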
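
As a rough illustration of how the size of a character changes once you leave ASCII, the sketch below counts the bytes each character needs in UTF-8. It assumes an environment where TextEncoder is available as a global (modern browsers and recent Node.js versions); utf8Length is a helper name I made up for the example.

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// Count how many bytes a piece of text occupies in UTF-8.
function utf8Length(text) {
  return encoder.encode(text).length;
}

console.log(utf8Length("A"));         // 1 byte: plain ASCII
console.log(utf8Length("\u00e9"));    // 2 bytes: e with acute accent
console.log(utf8Length("\u4e2d"));    // 3 bytes: a Chinese character
console.log(utf8Length("\u{1F600}")); // 4 bytes: an emoji

// JavaScript strings use UTF-16 internally, so the same emoji counts as 2 units.
console.log("\u{1F600}".length);      // 2
```

ASCII characters always take a single byte in UTF-8, which is why plain English text looks identical in both encodings.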

What, then, is a non-ASCII character? To answer, I’ll borrow the distinction made by Emacs, a free text editor popular among programmers. Emacs has two ways of representing text in a string or in a buffer.

These two types are called unibyte and multibyte, and each text string uses one of these two representations:

  • in unibyte, each character occupies one byte, so the character codes range from 0 to 255. Codes 0 to 127 are ASCII characters, while codes 128 to 255 are non-ASCII
  • in multibyte, a character can occupy more than one byte, so the full range of Emacs character codes can be stored. In the internal encoding described in older Emacs documentation, the first byte of a multibyte character is always in the range 128 to 159, and the following bytes are in the range 160 to 255.

Non-ASCII characters are, therefore, everything that lies beyond the standard ASCII code. They include the characters of writing systems such as Japanese, Chinese, or Korean and, more simply, accented vowels, semi-graphic symbols, and other kinds of graphemes.

Difference between ASCII and non ASCII characters

What is the difference between ASCII and non-ASCII characters? Quite simply, the former are the 128 standard characters, while the latter are all the other characters, which fall outside that basic set.

In the previous paragraph, I explained to you that ASCII characters consist of 128 characters. This range can be divided into various sections; in particular, we find the characters:

  • control characters (0 to 31, plus 127): the so-called non-printable characters, used to send commands to the computer
  • special characters (32 to 47, 58 to 64, 91 to 96, and 123 to 126): printable characters that are neither digits nor letters, such as punctuation marks. This group also includes the space, which is not visible but still counts as a printable character
  • digits (48 to 57): the ten Arabic numerals from 0 to 9
  • letters (65 to 90 and 97 to 122): the first group contains the uppercase letters, the second the lowercase letters.
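
If you prefer to see these ranges as code, here is a small sketch that classifies a numeric code into one of the groups. It assumes the standard ASCII layout, in which the digits occupy codes 48 to 57; classifyCode is a hypothetical helper written only for this example.

```javascript
// Classify a numeric character code into the ASCII groups listed in the text.
function classifyCode(code) {
  if (!Number.isInteger(code) || code < 0) return "invalid";
  if (code > 127) return "non-ASCII";
  if (code <= 31 || code === 127) return "control";
  if (code >= 48 && code <= 57) return "digit";
  if ((code >= 65 && code <= 90) || (code >= 97 && code <= 122)) return "letter";
  return "special"; // punctuation, symbols, and the space (32)
}

console.log(classifyCode(10));                // control (line feed)
console.log(classifyCode(32));                // special (space)
console.log(classifyCode("7".charCodeAt(0))); // digit
console.log(classifyCode("P".charCodeAt(0))); // letter
console.log(classifyCode("@".charCodeAt(0))); // special
console.log(classifyCode(233));               // non-ASCII
```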

If you want to enter a character or symbol that is present on the keyboard using its ASCII code, on Windows hold down the “Alt” key and type the corresponding code on the numeric keypad.

To give you an example: do you want to insert the at sign (@) in a text? Hold down the “Alt” key and type 64, and you’ll get exactly the character you want! Would you like to insert an emoticon? Just press “Alt” + 1 to get a nice smiley face (☺)!

If, on the other hand, you would like to insert characters and symbols that are not present on the keyboard or on mobile devices, you will have to consider some shortcuts based on the Unicode coding system I told you about earlier, which contains both ASCII characters and other types of characters.

In an 8-bit extended encoding, the non-ASCII characters are those numbered 128 to 255, the so-called ASCII extension; in Unicode, they are all the characters with a code above 127. As previously mentioned, this covers accented letters, symbols, and the characters of East Asian languages and other non-Latin alphabets.

In some cases, non ASCII characters can cause problems and should be removed to clean up the code of a website.

How to remove non ASCII characters

Why should we remove non-ASCII characters? How often they appear depends on the language and cultural context, precisely because they are signs that belong to specific alphabets. These characters can cause problems when software assumes a different encoding from the one the text was saved in; for this reason, it helps to have tools that can detect and fix such encoding errors.

To give you an example, it is always better to avoid using non-ASCII characters in URLs, especially if you aim to have a URL that is SEO-friendly, clean, and linear.
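
To show what this can look like in practice, here is a minimal sketch that turns a title into an ASCII-only URL slug. The function name toAsciiSlug and the sample title are hypothetical; the approach (Unicode normalization followed by stripping the accents) is one common option, not the only one, and it only handles accented Latin letters.

```javascript
// Turn a title into a lowercase, hyphen-separated, ASCII-only slug.
function toAsciiSlug(title) {
  return title
    .normalize("NFD")                // split an accented letter into letter + accent
    .replace(/[\u0300-\u036f]/g, "") // drop the combining accents
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")     // every other run of characters becomes a hyphen
    .replace(/^-+|-+$/g, "");        // trim hyphens at both ends
}

console.log(toAsciiSlug("Caff\u00e8 & Cr\u00e8me Br\u00fbl\u00e9e")); // caffe-creme-brulee

// For comparison: a URL that keeps the accented letter gets percent-encoded.
console.log(encodeURI("/menu/caff\u00e8")); // /menu/caff%C3%A8
```

The percent-encoded version still works, but it is much harder to read and share than the plain slug.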

Precisely for this reason, you will often work on the code to remove non-ASCII characters that could negatively affect the site’s ranking or break the code. This is certainly the goal of an SEO expert, who tries to optimize the company’s website to ensure its success.

But pay attention to one very important aspect: not all non-ASCII characters are “bad” and should be eliminated. Instead, it is important to consider the context in which they are used. For example, if you are writing text in Chinese, you cannot avoid using ideograms, which are non-ASCII characters.

To remove non-ASCII characters, you need to handle the code with care. To strip unwanted characters from a string, you’ll need to be comfortable with regular expressions and character codes.

There are many methods, and the right one often depends on how the data file is read and written. For example, if you read the file in R, it is usually best to stay in R from the raw data through to the final product. In that case, there may be packages or functions that let you delete non-ASCII characters very easily.

In some cases, it will be enough to enter an expression, locate the non-ASCII characters, and remove them; in others, a software package may come to your aid.

I’ll now show you some common ways to strip non-ASCII characters from your string.

Detect non ASCII characters

The first step in eliminating non-ASCII characters is to locate them. To find out if your file contains this type of character, you could use the following R expressions, where x is a character vector of lines:

# Do *any* lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII")))

# Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))

Are you working with Notepad++? You can simply use this regular expression, which will look for non-ASCII values for you: [^\x00-\x7F]+

Just select “Regular expression” under “Search Mode” and click on the Find Next button. Alternatively, you can open the “Search” menu, choose “Find characters in range”, and select the “Non-ASCII Characters (128-255)” item. This way, you will be able to scroll through the text and find all the non-ASCII characters.

Once all the non-ASCII characters have been found, don’t forget to mark them: use a function that highlights them or bookmarks each line of text that contains one. This way, you can keep track of them and not lose them in the chaos of the code.
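
If you would rather do the same check in a script, the sketch below applies the same regular expression to a block of text and reports which lines contain non-ASCII characters. findNonAsciiLines is a hypothetical helper, and the sample text is invented for the example.

```javascript
// Return the 1-based line numbers (and contents) of lines with non-ASCII characters.
function findNonAsciiLines(text) {
  const nonAscii = /[^\x00-\x7F]/; // anything outside codes 0 to 127
  return text
    .split(/\r?\n/)
    .map((line, index) => ({ lineNumber: index + 1, line }))
    .filter(({ line }) => nonAscii.test(line));
}

const sampleText = "plain text\ncaf\u00e9 menu\nstill plain\nna\u00efve idea";
for (const { lineNumber, line } of findNonAsciiLines(sampleText)) {
  console.log(lineNumber + ": " + line); // reports lines 2 and 4
}
```

The list of line numbers plays the same role as the bookmarks in a text editor: it tells you where to look before you decide what to remove.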

Use a positive or negative expression of removal

To remove non-ASCII characters, you can use two types of regular expression, called positive and negative. A positive removal expression states directly which characters to remove. Conversely, a negative removal expression states which characters to keep and removes everything else. In the first case, the JavaScript formula is:

textContent = textContent.replace(/[\u{0080}-\u{FFFF}]/gu, "");

As for the negative formula, instead, you have to write:

textContent = textContent.replace(/[^\x00-\x7F]/g, "");

How does the first formula differ from the second? To explain it, I’ll start with the meaning of the symbols: inside the square brackets, the caret (^) in the second expression means “not”, while “\x00-\x7F” is the range of codes from 0 to 127, that is, “ASCII”.

Put the two together and you get “not ASCII”: a negative expression that removes everything outside the ASCII range, which is particularly suitable for text written in English. The first expression, by contrast, lists a range explicitly, from U+0080 to U+FFFF, so characters above U+FFFF (most emoji, for example) are left in place. Also, keep in mind that in both cases the characters are identified by their Unicode values, Unicode being the universal encoding system.
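
To see how the two expressions behave on the same input, here is a short sketch. The sample string is invented for the example, and the variable names are my own; the two patterns are the ones shown in this section.

```javascript
// The two expressions from this section.
const positive = /[\u{0080}-\u{FFFF}]/gu; // remove what falls inside this range
const negative = /[^\x00-\x7F]/g;         // remove what is NOT ASCII

// Sample text with an accented letter (U+00FC) and an emoji (U+1F600).
const sample = "Z\u00fcrich \u{1F600} ok";

const afterPositive = sample.replace(positive, "");
const afterNegative = sample.replace(negative, "");

console.log(JSON.stringify(afterPositive)); // the accent is gone, the emoji is still there
console.log(JSON.stringify(afterNegative)); // "Zrich  ok"
```

The emoji survives the positive expression because its code, U+1F600, lies above the upper limit of the range. If you want the positive form to catch it too, one option is to extend the range to \u{10FFFF}, the highest Unicode code point.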

Use software packages

To eliminate non-ASCII characters, you can also use various packages. In programming, a package is a collection of ready-made functions that you install and load to extend a language such as R.

In this case, I will introduce you to the stringi package for R, whose stri_trans_general function transliterates text to ASCII, allowing you to preserve much of the original text.

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"

stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom" "Joreskog" "bisschen Zurcher"
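
If you work in JavaScript rather than R, a similar result can be approximated without any package. The sketch below is not a port of stringi: toAscii and the FALLBACKS table are hypothetical names, and the table covers only a handful of letters chosen for this example.

```javascript
// Hypothetical lookup table for letters that Unicode does not split into
// "base letter + accent" (only a few entries, for illustration).
const FALLBACKS = {
  "\u00f8": "o",  // o with stroke
  "\u00d8": "O",
  "\u00df": "ss", // German sharp s
  "\u00e6": "ae",
  "\u00c6": "AE",
};

// Transliterate to ASCII where possible, and drop whatever is left over.
function toAscii(text) {
  return text
    .normalize("NFD")                                        // separate letters from accents
    .replace(/[\u0300-\u036f]/g, "")                         // remove the accents
    .replace(/[^\x00-\x7F]/g, (ch) => FALLBACKS[ch] ?? "");  // map or drop the rest
}

const names = ["Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher"];
console.log(names.map(toAscii)); // [ 'Ekstrom', 'Joreskog', 'bisschen Zurcher' ]
```

As with the R example, the point is to preserve as much of the original text as possible instead of simply deleting every non-ASCII character.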

The stringi package isn’t the only option. There are many other solutions you can use to get rid of the non-ASCII characters in your text. An alternative is to use the xfun package together with filter() from dplyr.

Conclusions and request for more information

As you have noticed, eliminating non-ASCII characters is not child’s play: you need to know how to identify them and find the right method to remove them, also taking into consideration the tools you have. Remember, however, that it is not always necessary to eliminate this type of character: you need to evaluate the cultural context in which you find yourself and which language you are using.

In general, however, removing unnecessary non-ASCII characters is a good way to keep the code clean and improve your website’s ranking. In short, it is one of the best practices that every SEO copywriter should follow to do their job in the best possible way.
