
Introducing spaCy v2.3


spaCy now speaks Chinese, Japanese, Danish, Polish and Romanian! Version 2.3 of the spaCy Natural Language Processing library adds models for five new languages. We’ve also updated all 15 model families with word vectors and improved accuracy, while also decreasing model size and loading times for models with vectors.

This is the last major release of v2, by the way. We’ve been working hard on spaCy v3, which comes with a lot of cool improvements, especially for training, configuration and custom modeling. We’ll start making prereleases on spacy-nightly soon, so stay tuned.

New languages

spaCy v2.3 provides new model families for five languages: Chinese, Danish, Japanese, Polish and Romanian. The Chinese and Japanese language models are the first provided models that use external libraries for word segmentation rather than spaCy’s tokenizer.


The new Chinese models use pkuseg for word segmentation and ship with a custom model trained on OntoNotes with a token accuracy of 94.6%. Users can initialize the tokenizer with both pkuseg and custom models and customize the user dictionary. Details can be found in the Chinese docs. The Chinese tokenizer continues to support jieba as the default word segmenter along with character-based segmentation as in v2.2.


The updated Japanese language class switches to SudachiPy for word segmentation and part-of-speech tagging. Using SudachiPy greatly simplifies installing spaCy for Japanese, which is now possible with a single command: pip install spacy[ja]. More details are in the Japanese docs.

Model Performance

Following our usual convention, the sm, md and lg models differ in their word vectors. The lg models include one word vector for most words in the training data, while the md model prunes the vectors table to only include entries for the 20,000 most common words, mapping less frequent words to the most similar vector in the reduced table. The sm models do not use pretrained vectors.
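
To make the pruning idea concrete, here is a minimal sketch in plain Python. It is not spaCy’s implementation: the function names (cosine, prune_vectors), the toy two-dimensional vectors and the frequency counts are all made up for illustration. It keeps the most frequent words in the table and remaps every other word to the most similar kept word by cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_vectors(vectors, frequencies, keep):
    """Keep the `keep` most frequent words (hypothetical helper).

    Returns the reduced table plus a remapping from each dropped word
    to the most similar word that is still in the table.
    """
    ranked = sorted(vectors, key=lambda w: frequencies.get(w, 0), reverse=True)
    kept = ranked[:keep]
    remap = {}
    for word in ranked[keep:]:
        remap[word] = max(kept, key=lambda k: cosine(vectors[word], vectors[k]))
    table = {w: vectors[w] for w in kept}
    return table, remap


# Toy data, invented for this example.
vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "kitten": [0.9, 0.1],
    "puppy": [0.2, 0.8],
}
frequencies = {"cat": 100, "dog": 90, "kitten": 5, "puppy": 3}

table, remap = prune_vectors(vectors, frequencies, keep=2)
print(sorted(table))  # words that keep their own vector
print(remap)          # rare words mapped to their nearest kept word
```

In the md models the same idea applies with a table of 20,000 rows instead of two.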

Language   Model              Size     TAG     UAS     LAS     ENTS F
Chinese    zh_core_web_sm     45 MB    89.63   68.55   63.21   66.57
           zh_core_web_md     75 MB    90.23   69.39   64.43   68.52
           zh_core_web_lg     575 MB   90.55   69.77   64.99   69.33
Danish     da_core_news_sm    16 MB    92.79   80.48   75.65   72.79
           da_core_news_md    46 MB    94.13   82.71   78.98   81.45
           da_core_news_lg    546 MB   94.95   82.53   78.99   82.73
Japanese   ja_core_news_sm    7 MB     97.30   88.68   86.87   59.93
           ja_core_news_md    37 MB    97.30   89.26   87.76   67.68
           ja_core_news_lg    526 MB   97.30   88.94   87.55   70.48
Polish     pl_core_news_sm    46 MB    98.03   85.61   78.09   81.32
           pl_core_news_md    76 MB    98.28   90.41   84.47   84.68
           pl_core_news_lg    576 MB   98.45   90.80   85.52   85.67
Romanian   ro_core_news_sm    13 MB    95.65   87.20   79.79   71.05
           ro_core_news_md    43 MB    96.32   88.69   81.77   75.42
           ro_core_news_lg    545 MB   96.78   88.87   82.05   76.71

The training data for Danish, Japanese and Romanian is relatively small, so the pretrained word vectors improve accuracy quite a lot, in particular for NER. The Chinese model uses a larger training corpus, but word segmentation errors may make the word vectors less effective. Word segmentation accuracy also explains some of the lower scores for Chinese, as the model has to get the word segmentation correct before it can be scored as accurate on any of the subsequent tasks.

Word vectors for all model families

All model families now include medium and large models with 20k and 500k unique vectors respectively. For most languages, spaCy v2.3 introduces custom word vectors trained using spaCy’s language-specific tokenizers on data from OSCAR and Wikipedia. The vectors are trained using FastText with the same settings as FastText’s word vectors (CBOW, 300 dimensions, character n-grams of length 5).

In particular for languages with smaller training corpora, the addition of word vectors greatly improves the model accuracy. For example, the Lithuanian tagger increases from 81.7% for the small model (no vectors) to 89.3% for the large model. The parser increases by a similar margin and the NER F-score increases from 66.0% to 70.1%. For German, updating the word vectors increases the scores for the medium model for all components by 1.5 percentage points across the board.

Remember that models trained with v2.2 will be incompatible with the new version. To find out if you need to update your models, you can run python -m spacy validate. If you’re using your own custom models, you’ll need to retrain them with the new version.

Updated training data

All spaCy training corpora based on Universal Dependencies corpora have been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish). The updated data improves the quality and size of the training corpora, increasing the tagger and parser accuracy for all provided models. For example, the Dutch training data is extended to include both UD Dutch Alpino and LassySmall, which improves the tagger and parser scores for the small models by 3%, and the addition of the new word vectors improves the scores further by 3-5%.

Fine-grained POS tags

As a result of the updates, many of the fine-grained part-of-speech tag sets will differ from v2.2 models. The coarse-grained tag set remains the same, although there are some minor differences in how the coarse-grained tags are calculated from the fine-grained tags.

For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tag sets contain new merged tags related to contracted forms, such as ADP_DET for French "au", which maps to UPOS ADP based on the head "à". This increases the accuracy of the models by improving the alignment between spaCy’s tokenization and Universal Dependencies multi-word tokens used for contractions.
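
As a rough illustration of the merged tags, the sketch below builds one token for a contraction from its Universal Dependencies parts. This is an assumption-laden toy, not spaCy’s conversion code: the function merge_contraction, the head_index parameter and the dictionary layout are hypothetical, and how the head is actually chosen is not shown here.

```python
def merge_contraction(surface, parts, head_index=0):
    """Merge the UD syntactic words of one multi-word token (hypothetical helper).

    `parts` is a list of (form, UPOS) pairs. The fine-grained tag joins the
    parts' UPOS values; the coarse-grained tag is taken from the part treated
    as the head (assumed here to be passed in by the caller).
    """
    fine_tag = "_".join(upos for _form, upos in parts)
    coarse_tag = parts[head_index][1]
    return {"text": surface, "tag": fine_tag, "pos": coarse_tag}


# French "au" corresponds to the two UD words "à" (ADP) and "le" (DET).
token = merge_contraction("au", [("à", "ADP"), ("le", "DET")])
print(token)
```

Keeping "au" as a single token with a merged tag means the tokenization no longer has to be split to match the treebank.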

Smaller models and faster loading times

The medium model packages with 20k vectors are at least 2× smaller than in v2.2, the large English model is 120 MB smaller, and the loading times are 2-4× faster for all models with vectors. To achieve this, models no longer store derivable lexeme attributes such as lower and is_alpha, and the remaining lexeme attributes (norm, cluster and prob) have been moved to spacy-lookups-data.
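
“Derivable” here means the attribute can be recomputed from the word’s text alone, so nothing needs to be stored for it. A minimal sketch of that idea follows; the DerivedAttrs type and derive_attrs function are hypothetical names for illustration and say nothing about how spaCy implements this internally.

```python
from functools import lru_cache
from typing import NamedTuple


class DerivedAttrs(NamedTuple):
    """Attributes that depend only on the string itself (illustrative subset)."""
    lower: str
    is_alpha: bool


@lru_cache(maxsize=None)
def derive_attrs(text: str) -> DerivedAttrs:
    """Compute the attributes on first use and cache them, instead of
    loading a precomputed value for every word in the vocabulary."""
    return DerivedAttrs(lower=text.lower(), is_alpha=text.isalpha())


attrs = derive_attrs("Apple")
print(attrs.lower, attrs.is_alpha)
print(derive_attrs("v2.3").is_alpha)
```

Attributes like norm, cluster and prob cannot be recomputed this way, which is why they need lookup tables.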

If you’re training new models, you’ll probably want to install spacy-lookups-data for normalization and lemmatization tables! The provided models include the norm lookup tables for use with the core pipeline components, but the optional cluster and prob features are now only available through spacy-lookups-data.

Free online course and tutorials

We’re also proud to announce updates and translations of our online course, “Advanced NLP with spaCy”. We’ve made a few small updates to the English version, including new videos to go with the interactive exercises. It’s really the translations we’re excited about though. We have translations into Japanese, German and Spanish, with Chinese, French and Russian soon to come.


Speaking of videos, you should also check out Sofie’s tutorial on training a custom entity linking model with spaCy. You can find the code and data in our growing projects repository.

Another cool video to check out is the new episode of Vincent Warmerdam’s “Intro to NLP with spaCy”. The series lets you sit beside Vincent as he works through an example data science project using spaCy. In episode 5, “Rules vs. Machine Learning”, Vincent uses spaCy’s rule-based matcher to probe the decisions of the NER model he trained previously, using the rules to understand the model’s behavior and figure out how to improve the training data to get better results.

What’s next?

spaCy v2.3 is the last big release of v2. We’ve been working hard on v3, and we expect to start publishing prereleases in the next few weeks. spaCy v3 comes with a lot of cool improvements, especially for training, configuration and custom modeling. The training and data formats are the main thing we’ve taken the opportunity to fix, so v3 will have some breaking changes, but don’t worry: it’s nothing like the big transformations seen in libraries like TensorFlow or Angular. It should be pretty easy to upgrade, but we’ve still tried to backport as much as possible into v2.3, so you can use it right away. We’ll also continue to make maintenance releases of v2.3 with bug fixes as they come in.

We also have a big release of our annotation tool Prodigy pretty much ready to go. In addition to the spaCy v2.3 update (giving you all the new models), Prodigy v1.10 comes with a new annotation interface for tasks like relation extraction and coreference resolution, full-featured audio and video annotation (including recipes using pyannote.audio models in the loop), a new and improved manual image UI, more options for NER annotation, new recipe callbacks, and lots more. To get notified when it’s ready, follow us on Twitter!

19 days ago

What’s more stressful than lockdown? The easing of lockdown | Suzanne Moore


We have been through a collective trauma and need time to adjust to the new world. Instead, we are being told to go and have a pint. No wonder we can’t handle it

The least-helpful piece of advice in the world is: “Just be yourself.” (Seriously? It won’t go well.) The second-least-helpful is: “Just act normal.” I have been acting normal sitting on a train in a mask and gloves, while my glasses misted up, telling myself: “This is just fine.” Then I acted normally by sitting in the drizzle outside a pub with a young man desperate for his first pint of Guinness in a while, only to be told by frazzled bar staff that they had no Guinness. Then I acted normally in another small town and had a drink outside! Until I got up to go to the loo, and my mate stopped me. “Don’t go in there,” she said ominously. “I have a bad feeling about it.”

In a world in which we’re all “acting normal”, we shouldn’t have all these bad feelings, should we? But they won’t go away. We are now required to make snap judgments about what is safe and what isn’t because, actually, we don’t really know. Friends have reported “accidentally hugging” their own grownup kids over the weekend. And despite the pictures, most people in Soho in London on Saturday night were not really engaged in some bacchanal. (Nudge, wink, what was all the outcry about, if not homophobia?) The truth is, most of us are edging back into the water, not diving in.

27 days ago

Python Raiz (Old-School Python)


As everyone knows, I have been a Python programmer for almost two decades, and a good part of my career was built around this language. That shows how excellent it is. For someone like me, who loves learning and studying programming languages, prioritizing a single language for almost half my life means a lot.

But what motivated me to write this text are the recent (or not so recent) additions to the language: things like support for asynchronous development (async, await, etc.), type annotations, f-strings, the “walrus” operator (:=), and the most recent of them (whose PEP is still in “Draft”): Pattern Matching.

Python Nutella (Newfangled Python)

The concept of Pattern Matching has existed for a long time (Prolog?), but it became more popular with the recent adoption of the excellent Elixir language by many developers.

The most basic use of Pattern Matching can be seen in the switch/case statements present in many structured programming languages. But Pattern Matching is not “just” a switch/case. In traditional switch/case statements, the language evaluates an expression (switch expr) and, depending on the result, looks for a block of code that “matches” that result (case constant-expr).

But in languages like Elixir (the one I will use as an example because I know it best), this “match” can follow much more elaborate rules and use complete data structures instead of just a constant value, as in the languages I mentioned above.

Below is an example of pseudo-Python code that shows how a function can be implemented without and with Pattern Matching:

This case shows how powerful and practical the concept is for solving a range of problems. And the syntax looks reasonably natural. But that “naturalness” seems to vanish as we move through the PEP’s examples and start seeing things like:

What scares me in all this are the inventions of things that were never in Python:

  • _ as a wildcard character: the _ character has always been used, by convention, as a variable whose value can be discarded. An example would be something like: name, url, *_ = 'name,url,extra,data'.split(','). Why not use an else: clause with match?
  • Dotted names: the Constant Value Patterns cases, where a new syntax with dotted names had to be invented, are another thing I have never seen in Python. This kind of syntax was rejected in the past for much simpler things, such as when people asked for something like Pascal’s with.
  • | as the alternative operator: | already exists in Python, where it acts as bitwise OR. We also have or, which acts as logical OR. It seems to me that, even though it is not semantically the same thing, logical OR makes more sense than bitwise OR. An acceptable option, for familiarity, would be ||, which is also used as logical OR in many languages.

But what displeases me most about these recent additions is my concern for the “natural readability of the language”. And no, this readability is not directly about writing clean, organized code for Python programmers. It is about code being understandable even by a developer who does not know the language.

Learning Python

Python was not my first language. I already knew others before it. Those languages were mostly structured, but I had already played a little with object-oriented development using Object Pascal in Turbo Pascal 6 and 7 (with Turbo Vision and so on).

My first contact with Python was around the year 2000 at the late, much-missed Conectiva (the one behind Conectiva Linux, which later merged with Mandrake to form Mandriva, which also disappeared… anyway… a mess 🙂).

I was working on a “configuration compiler for graphical interfaces” project. That is the fancy name I just gave it… at the time it was only a program written in C that would generate the text configuration files for Gnome, KDE and WindowMaker from a set of centralized settings.

But what matters for this discussion is the “written in C” and “text files” part. Everyone knows that C and text do not get along very well, right? So… in the end everything would work out. It would just take longer to finish.

One day my boss brought me an article that taught Python. He had translated the original article into Portuguese and asked me to review it (I think his intention was something else, since I did not even know English well). The original article is no longer online, but I have a copy of it (the translation was lost).

The Python version at the time was 1.5.2 (the article was updated after I read it to add things from Python 2). And by reading just that article I learned Python in 1, one, ONE f*cking night! From a blog article!

Did I learn Python in one night because I am intelligent? Smart? Superhuman? No! Take a look at the post.

I’ll wait.

See how simple the language in the article is? It is readable and its constructs are intuitive: v = 1 assigns the value 1 to v, v == 2 compares the value of v with 2, v[0] accesses the first element of an array/list, class Person: ... defines a class, and so on. In other words, anyone who could program even minimally in something could learn the language very quickly.

There was little in this language that was strange and did not exist in others. Perhaps the use of indentation to delimit code blocks and that self parameter in instance methods. But apart from that, it is all pretty ordinary.

Now imagine a person learning Python with things like:

Imagine bumping into code like that right at the start? The guy goes back to Perl 🙂 Just kidding…

I know the code in the article is practically the same as what would run on today’s Python (perhaps only the use of print() and input() changed with Python 3).

I also know the article does not teach everything about the language (even for the version of that time). It is only an introduction. And indeed, when I decided to go deeper, I looked for other material. I read The Python Tutorial, which comes with the language itself, and then imported a book that really taught the whole language: Learning Python, published by O’Reilly.

Back then I bought the first edition of the book, which today is in its fifth edition and covers up to Python 3.3.

I could finally study everything Python had. To this day I recommend that book. But here is a curious fact about it: the 1st edition had 384 pages. The 5th edition has 1648 pages! More than 4 times larger. And it does not even describe the (many) new features of Python 3.4, 3.5, 3.6, 3.7, etc.

In other words, if you really intend to master everything the language offers, it will take a lifetime. Besides Learning Python, I also recommend Fluent Python (I will update the link as soon as the second edition comes out) by my dear friend Luciano Ramalho.

Python for non-programmers

Everyone knows that Python has grown a lot in the scientific community. This happened largely thanks to initiatives like SciPy and to projects that were born inside that initiative and took on a life of their own, such as Jupyter, matplotlib, pandas, scikit-learn, etc.

On the other hand, Python has also become widely used for teaching programming in schools and universities around the world. They use it to teach programming to everyone, not only to students of computing and related courses.

And these two things are related. If you teach programming with Python to a biology student, which language will that student use to write a tool that helps with a genetics project?

Why did these schools choose Python to teach programming? Why did projects like SciPy and Jupyter choose to use Python? Why didn’t they choose other languages?

I suspect it is a set of attributes of the language, among them:

  1. Educational: Python was born from an educational language project (ABC), but also because its creator (Guido van Rossum) believed that educational languages did not need to be toy languages and that it should be possible to use them day to day.
  2. Cross-platform: easy installation, implementation and use on the main available platforms.
  3. Multipurpose: you can develop command-line software, but you can also implement a graphical interface or a web server.
  4. Multi-paradigm: know structured programming? OK. Know OO modeling? OK too. An ace at lambda functions? There is a basic set in there for you to use as well.
  5. Easy integration with libraries written in other languages such as C, Fortran, etc. You can imagine the immense number of scientific libraries implemented in other languages. Why rewrite everything?

But what I consider most important for this choice is that any scientist who has ever programmed something for their work can look at Python code and get an idea, even a superficial one, of what it does. This is the “natural readability of the language” I mentioned earlier.


If you want a car that fits the whole family, is powerful, fast, economical, has a big trunk, is roomy, comfortable, etc., you will probably end up with this:

Python was born to be easy to learn and use, with a clean, clear and familiar syntax. No outlandish things like excessive symbols, or less conventional things like Smalltalk’s object message syntax (which inspires Ruby’s syntax).

Python was also born without types or type annotations. Everything in it was designed to abstract away this “mundane” concept of computing. That makes it easier for people to learn.

Python never aimed for extreme performance, only the performance needed to be usable on real problems. If you needed extreme performance, you would probably write code in C or Assembly and “glue” it together with Python.

Python was never designed for concurrency or parallelism, so it cannot “compete” with languages like Go or Elixir (Erlang) in that regard. Those languages were born to deal with this kind of problem. Adding a handful of reserved words and a few libraries does not make Python ideal for this purpose.

Python was not born as a functional language. Even though it has tools that let you express some ideas functionally, it is not a true functional language. Python was born as a mostly object-oriented language.

Putting a map() here and a lambda there does not make the language suited to problems that are more easily solved with functional languages. Python has mutable objects. Python has no native support for tail recursion. Macros? Guido himself has said “over my dead body” (I could not find the reference, but believe me, I saw it).
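
The tail-recursion point is easy to check. In the sketch below, which assumes the standard CPython interpreter, the recursive call is in tail position, yet each call still consumes a stack frame and the interpreter gives up at its recursion limit, while the equivalent loop has no such problem. The function names are made up for this example.

```python
import sys


def count_down(n):
    # The recursive call is the last thing the function does (tail position),
    # but CPython does not reuse the current frame for it.
    return 0 if n == 0 else count_down(n - 1)


def count_down_loop(n):
    # The same computation written as a loop uses constant stack space.
    while n:
        n -= 1
    return 0


depth = sys.getrecursionlimit() + 100
try:
    count_down(depth)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(hit_limit)
print(count_down_loop(depth))
```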


Python is no longer a niche or underground language. Python is mainstream, and hundreds or thousands of companies of all sizes already use it in their business.

As a consequence, the number of Python job openings has grown dramatically in recent years, and so, inevitably, programmers of other languages end up having to deal with Python at some point in their careers.

Because of this, I have the feeling that the push to add certain features to Python comes from these programmers’ desire to use their favorite part of the other language in Python too, because they cannot think “the Python way”.

I remember well when Java was in fashion and every Python programmer created getters/setters in classes. People invented a thousand crazy things like Interfaces (in Zope), protocols, … to have something like what Java offered. A huge effort to program Java in Python.

Nowadays it seems everyone wants to program Elixir in Python, JS/Node in Python, …

No Conclusion

You may be imagining that I am growing disgusted with the language or that I am going to abandon it. I am not. The new things I do not like to use I will ignore (e.g. the := operator). Others that made me turn up my nose at first but that I came to like have entered my repertoire (e.g. f-strings).

Then there are features that make me turn up my nose but that, once I gave them a chance, turned out to be useful if used very sparingly (e.g. type annotations); those I will use sparingly.

Anyway… I will keep using and liking Python, not least because it has so many qualities that it has made me too lazy to switch. But I intend to keep programming Python in Python. That old-school way. That scrappy way… that sandlot kind of Python programming.

Update: fixed an error in the “logical OR” operator, which is now or, and added a suggestion to use ||.

The post Python Raiz appeared first on osantana.

38 days ago

Brazil's Fake News Bill Would Dismantle Crucial Rights Online and is on a Fast Track to Become Law


Update: A new draft text was released shortly before the voting set for June 25th. It doesn’t include blocking and data localization measures, but the surveillance and identification rules remain. Read more in the analysis of a coalition of digital rights groups in Brazil.

Despite widespread complaints about its effects on free expression and privacy, Brazil's Congress is moving forward in its attempts to hastily approve a "Fake News" bill. We've already reported on some of the most concerning issues in previous proposals, but the draft text released this week is even worse. It will hinder users' access to social networks and applications, require the construction of massive databases of users' real identities, and oblige companies to keep track of our private communications online.

It creates demands that disregard key characteristics of the Internet, like end-to-end encryption and decentralised tool-building, running afoul of innovation, and it could criminalize the online expression of political opinions. Although the initial bill arose as an attempt to address legitimate concerns about the spread of online disinformation, it has opened the door to arbitrary and unnecessary measures that undermine settled privacy and freedom of expression safeguards.

You can join the hundreds of other protestors and organizations telling Brazil’s lawmakers why not to approve this Fake News bill right now.

Here’s how the latest proposals measure up:

Providers Are Required to Retain the Chain of Forwarded Communications

Social networks and any other Internet application that allows social interaction would be obliged to keep the chain of all communications that have been forwarded, whether distribution of the content was done maliciously or not. This is a massive data retention obligation which would affect millions of innocent users instead of only those investigated for an illegal act. Although Brazil already has obligations for retaining specific communications metadata, the proposed rule goes much further. Piecing together a communication chain may reveal highly sensitive aspects of individuals, groups, and their interactions -- even when none are actually involved in illegitimate activities. The data will end up as a constantly-updated map of connections and relations between nearly every Brazilian Internet user: it will be ripe for abuse.

Furthermore, this obligation disregards the way more decentralized communication architectures work. It assumes that application providers are always able to identify and distinguish forwarded and non-forwarded content, and also able to identify the origin of a forwarded message. In practice, this depends on the design of the service and on the relationship between applications and services. When the two are independent, it is common that the service provider will not be able to differentiate between forwarded and non-forwarded content, and that the application does not store the forwarding history except on the user's device. This architectural separation is traditional in Internet communications, including web browsers, FTP clients, email, XMPP, file sharing, etc. All of them allow actions equivalent to forwarding content or copying and pasting it, where the client application and its functions are technically and legally independent from the service to which it connects. The obligation would also negatively impact open source applications, designed to let end-users not only understand but also modify and adapt the functioning of local applications.

It Compels Applications to Get All Users' IDs and Cell Phone Numbers

The bill creates a general monitoring obligation on users' identities, compelling Internet applications to require all users to give proof of identity through a national ID or passport, as well as their phone number. This requirement runs counter to the principles and safeguards set out in the country's data protection law, which is yet to enter into force. A vast database of identity cards, held by private actors, is in no way aligned with the standards of data minimization, purpose limitation and the prevention of risks in processing and storing personal data that Brazil’s data protection law represents.

Current versions of the "Fake News" Bill do not even ensure the use of pseudonyms for Internet users. As we've said many times before, there are myriad reasons why individuals may wish to use a name other than the one they have on their IDs and were born with. Women rebuilding their lives despite the harassment of domestic violence abusers, activists and community leaders facing threats, investigative journalists carrying out sensitive research in online groups, and transgender users affirming their identities are just a few examples of the need for pseudonymity in a modern society.

Under the new bill, users' accounts would be linked to their cell phone numbers, allowing -- and in some cases requiring -- telecom service providers and Internet companies to track users even more closely. Anyone without a mobile number would be prevented from using any social network -- if users' numbers are disabled for any reason, their social media accounts would be suspended. In addition to privacy harms, the rule creates serious hurdles to speaking, learning, and sharing online.

Censorship, Data Localization, and Blocking

These proposals seriously curb the online expression of political opinions and could quickly lead to political persecution. The bill sets high fines in cases of online sponsored content that mocks electoral candidates or questions election reliability. Although elections' trustworthiness is crucial for democracy and disinformation attempts to disrupt it should be properly tackled, a broad interpretation of the bill would severely endanger the vital work of e-voting security researchers in preserving that trustworthiness and reliability. Electoral security researchers already face serious harassment in the region. Other new and vague criminal offenses set by the bill are prone to silence legitimate critical speech and could criminalize users' routine actions without the proper consideration of malicious intent.

The bill revives the disastrous idea of data localization. One of its provisions would force social networks to store user data in a special database that would be required to be hosted in Brazil. Data localization rules such as this can make data especially vulnerable to security threats and surveillance, while also imposing serious barriers to international trade and e-commerce.

Finally, as the icing on the cake of a raft of provisions that disregard the Internet's global nature, providers that fail to comply with the rules would be subject to a suspension penalty. Such suspensions are unjustifiable and disproportionate, curtailing the communications of millions of Brazilians and incentivizing applications to overcomply to the detriment of users' privacy, security, and free expression.

EFF has joined many other organizations across the world calling on the Brazilian parliament to reject the latest version of the bill and stop the fast-track mode that has been adopted. You can also take action against the "Fake News" bill now, with our Twitter campaign aimed at senators of the National Congress.

38 days ago

.ORG Domain Registry Sale to Ethos Capital Rejected in Stunning Victory for Public Interest Internet

ICANN Withholds Consent, Says Deal Lacked ‘Meaningful Plan to Protect’ .ORG Community

San Francisco—In an important victory for thousands of public interest groups around the world, a proposal to sell the .ORG domain registry to private equity firm Ethos Capital and convert it to a for-profit entity was rejected late yesterday by the Internet Corporation for Assigned Names and Numbers (ICANN).

The Electronic Frontier Foundation (EFF), which worked hand in hand with Access Now, NTEN, National Council of Nonprofits, Americans for Financial Reform, and many other organizations to oppose the sale, applauds ICANN’s well-reasoned decision to stop the $1.1 billion transaction from moving forward. In a statement, ICANN said rejecting the deal was the right thing to do because it lacked a meaningful plan to protect the interests of nonprofits and NGOs that rely on the .ORG registry to exist on the Internet and connect with the people they serve.

The sale would change Public Interest Registry (PIR), the nonprofit operator of .ORG, into an entity bound to serve the interests of its corporate stakeholders, not the nonprofit world. .ORG is the third-largest Internet domain name registry, with over 10 million domain names held by a diverse group of charities, public interest organizations, and nonprofits, from the Girl Scouts of America and American Bible Society to Farm Aid and Meals On Wheels.

“We’re gratified that ICANN listened to the .ORG community, which was united in its opposition to the sale,” said EFF Senior Staff Attorney Mitch Stoltz. “Under the deal, .ORG would be converted to a for-profit entity controlled by domain name industry insiders and their secret investors. Nonprofits are vulnerable to the governments and corporations who they often seek to hold accountable. The public interest community rightly questioned whether an owner motivated by profits would stand up to demands for censorship of charities who rely on .ORG so that people can find and rely on their vital services.”

“The sale of .ORG was announced, without .ORG community input, not long after price caps on registration fees for domain names were lifted and PIR acquired new powers to allegedly ‘protect’ the rights of third parties,” said EFF Staff Attorney Cara Gagliano. “It was obvious to many that .ORG registrants could face higher operating costs and degradation of service as Ethos sought to increase fees and seek profitable arrangements with businesses keen to silence nonprofits. This concern grew after it was revealed that the transaction required taking on a $360 million debt obligation.”

If PIR wishes to press forward, it still must seek approval from courts in the state of Pennsylvania, where PIR is incorporated. As part of that process, the Pennsylvania state Attorney General may weigh in. EFF urges both to follow ICANN’s lead and reject the transaction. This will pave the way for a transparent process to select a new operator for .ORG that will act in the interests of the nonprofits that it serves.


76 days ago

Governments Shouldn’t Use “Centralized” Proximity Tracking Technology

Companies and governments across the world are building and deploying a dizzying number of systems and apps to fight COVID-19. Many groups have converged on using Bluetooth-assisted proximity tracking for the purpose of exposure notification. Even so, there are many ways to approach the problem, and dozens of proposals have emerged.

One way to categorize them is based on how much trust each proposal places in a central authority. In more “centralized” models, a single entity—like a health organization, a government, or a company—is given special responsibility for handling and distributing user information. This entity has privileged access to information that regular users and their devices do not. In “decentralized” models, on the other hand, the system doesn’t depend on a central authority with special access. A decentralized app may share data with a server, but that data is made available for everyone to see—not just whoever runs the server. 

Both centralized and decentralized models can claim to make a slew of privacy guarantees. But centralized models all rest on a dangerous assumption: that a “trusted” authority will have access to vast amounts of sensitive data and choose not to misuse it. As we’ve seen, time and again, that kind of trust doesn’t often survive a collision with reality. Carefully constructed decentralized models are much less likely to harm civil liberties. This post will go into more detail about the distinctions between these two kinds of proposals, and weigh the benefits and pitfalls of each.

Centralized Models

There are many different proximity tracking proposals that can be considered “centralized,” but generally, it means a single “trusted” authority knows things that regular users don’t. Centralized proximity tracking proposals are favored by many governments and public health authorities. A central server usually stores private information on behalf of users, and makes decisions about who may have been exposed to infection. The central server can usually learn which devices have been in contact with the devices of infected people, and may be able to tie those devices to real-world identities. 

For example, a European group called PEPP-PT has released a proposal called NTK. In NTK, a central server generates a private key for each device, but keeps the keys to itself. This private key is used to generate a set of ephemeral IDs for each user. Users get their ephemeral IDs from the server, then exchange them with other users. When someone tests positive for COVID-19, they upload the set of ephemeral IDs from other people they’ve been in contact with (plus a good deal of metadata). The authority links those IDs to the private keys of other people in its database, then decides whether to reach out to those users directly. The system is engineered to prevent users from linking ephemeral IDs to particular people, while allowing the central server to do exactly that.
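The linkage described above can be illustrated with a toy model. This is a minimal sketch, not NTK's actual protocol: the `CentralServer` class, the HMAC-based derivation, and the epoch counter are all assumptions chosen only to show why a server that keeps every device's key can map any ephemeral ID back to a device while ordinary users cannot.

```python
import hashlib
import hmac
import secrets


class CentralServer:
    """Toy central authority: it alone holds every device's key."""

    def __init__(self):
        self._keys = {}  # device name -> secret key, never sent to the device

    def register(self, device):
        self._keys[device] = secrets.token_bytes(32)

    def ephemeral_ids(self, device, epochs):
        # Derive one short-lived ID per epoch from the device's key
        # (HMAC is an illustrative choice, not the scheme NTK specifies).
        key = self._keys[device]
        return [
            hmac.new(key, str(e).encode(), hashlib.sha256).digest()[:16]
            for e in range(epochs)
        ]

    def link(self, eph_id, epochs):
        # Only the server can map an ephemeral ID back to a device.
        for device in self._keys:
            if eph_id in self.ephemeral_ids(device, epochs):
                return device
        return None


server = CentralServer()
for name in ("alice", "bob"):
    server.register(name)

bob_ids = server.ephemeral_ids("bob", epochs=4)

# Alice tests positive and uploads an ID she observed from Bob.
uploaded_by_alice = [bob_ids[2]]
exposed = {server.link(i, epochs=4) for i in uploaded_by_alice}
```

To anyone without the keys, `bob_ids[2]` is an opaque 16-byte value; to the server it identifies Bob's device.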

Some proposals, like Inria’s ROBERT, go to a lot of trouble to be pseudonymous—that is, to keep users’ real identities out of the central database. This is laudable, but not sufficient, since pseudonymous IDs can often be tied back to real people with a little bit of effort. Many other centralized proposals, including NTK, don’t bother. Singapore’s TraceTogether and Australia’s COVIDSafe apps even require users to share their phone numbers with the government so that health authorities can call or text them directly. Centralized solutions may collect more than just contact data, too: some proposals have users upload the time and location of their contacts as well.

Decentralized Models

In a “decentralized” proximity tracking system, the role of a central authority is minimized. Again, there are a lot of different proposals under the “decentralized” umbrella. In general, decentralized models don’t trust any central actor with information that the rest of the world can’t also see. There are still privacy risks in decentralized systems, but in a well-designed proposal, those risks are greatly reduced.

EFF recommends the following characteristics in decentralized proximity tracking efforts:

  1. The goal should be exposure notification. That is, an automated alert to the user that they may have been infected by proximity to a person with the virus, accompanied by advice to that user about how to obtain health services. The goal should not be automated delivery to the government or anyone else of information about the health or person-to-person contacts of individual people.
  2. A user’s ephemeral IDs should be generated and stored on their own device. The ephemeral IDs can be shared with devices the user comes into contact with, but nobody should have a database mapping sets of IDs to particular people. 
  3. When a user learns they are infected, as confirmed by a physician or health authority, it should be the user’s absolute prerogative to decide whether or not to provide any information to the system’s shared server. 
  4. When a user reports ill, the system should transmit from the user’s device to the system’s shared server the minimum amount of data necessary for other users to learn their exposure risk. For example, they may share either the set of ephemeral IDs they broadcast, or the set of IDs they came into contact with, but not both.
  5. No single entity should know the identities of the people who have been potentially exposed by proximity to an infected person. This means that the shared server should not be able to “push” warnings to at-risk users; rather, users’ apps must “pull” data from the central server without revealing their own status, and use it to determine whether to notify their user of risk. For example, in a system where ill users report their own ephemeral IDs to a shared server, other users’ apps should regularly pull from the shared server a complete set of the ephemeral IDs of ill users, and then compare that set to the ephemeral IDs already stored on the app because of proximity to other users.  
  6. Ephemeral IDs should not be linkable to real people or to each other. Anyone who gathers lots of ephemeral IDs should not be able to tell whether they come from the same person.
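
The "pull" flow in these recommendations can be sketched as follows. This is an illustrative toy under stated assumptions, not any deployed protocol: `Device` and `SharedServer` are hypothetical names, IDs are plain random bytes, and the variant shown is the one where ill users report the IDs they broadcast.

```python
import secrets


class Device:
    def __init__(self):
        self.own_ids = []      # IDs this device has broadcast
        self.seen_ids = set()  # IDs heard from nearby devices

    def new_ephemeral_id(self):
        # Random bytes, so IDs are not linkable to a person or to each other.
        eph = secrets.token_bytes(16)
        self.own_ids.append(eph)
        return eph

    def hear(self, eph):
        self.seen_ids.add(eph)

    def check_exposure(self, published):
        # The comparison runs on the device; the server never learns the result.
        return bool(self.seen_ids & published)


class SharedServer:
    """Holds only the IDs that ill users chose to publish."""

    def __init__(self):
        self.published = set()

    def report(self, ids):
        self.published.update(ids)

    def pull(self):
        # Everyone receives the same complete set.
        return set(self.published)


alice, bob, carol = Device(), Device(), Device()
bob.hear(alice.new_ephemeral_id())   # Alice and Bob were near each other
carol.hear(bob.new_ephemeral_id())   # Carol only met Bob

server = SharedServer()
server.report(alice.own_ids)         # Alice chooses to report

bob_at_risk = bob.check_exposure(server.pull())
carol_at_risk = carol.check_exposure(server.pull())
```

Because every app pulls the same complete set, a pull reveals nothing about who is at risk; only Bob's own device learns that it should notify him.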

Decentralized models don’t have to be completely decentralized. For example, public data about which ephemeral IDs correspond to devices that have reported ill may be hosted in a central database, as long as that database is accessible to everyone. No blockchains need to be involved. Furthermore, most models require users to get authorization from a physician or health authority before reporting that they have COVID-19. This kind of “centralization” is necessary to prevent trolls from flooding the system with fake positive reports.

Apple and Google’s exposure notification API is an example of a (mostly) decentralized system. Keys are generated on individual devices, and nearby phones exchange ephemeral IDs. When a user tests positive, they can upload their private keys—now called “diagnosis keys”—to a publicly accessible database. It doesn’t matter if the database is hosted by a health authority or on a peer-to-peer network; as long as everyone can access it, the contact tracing system functions effectively.
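
A rough sketch of that flow, assuming a simplified derivation: the real Apple and Google specification defines its own key schedule, which this does not reproduce. The HMAC construction, `derive_ids`, and the 144-intervals-per-day figure are illustrative assumptions.

```python
import hashlib
import hmac
import secrets

INTERVALS_PER_DAY = 144  # assumption: one ephemeral ID per 10-minute interval


def derive_ids(daily_key):
    # Anyone holding the key can recompute every ID derived from it.
    return [
        hmac.new(daily_key, i.to_bytes(2, "big"), hashlib.sha256).digest()[:16]
        for i in range(INTERVALS_PER_DAY)
    ]


# Alice's phone generates its key locally and broadcasts the derived IDs.
alice_key = secrets.token_bytes(16)
alice_ids = derive_ids(alice_key)

# Bob's phone stores whatever it hears nearby.
bob_heard = {alice_ids[37], secrets.token_bytes(16)}

# Alice tests positive and publishes her key as a "diagnosis key".
public_diagnosis_keys = [alice_key]

# Bob's phone downloads the keys and re-derives the IDs locally.
bob_exposed = any(
    bob_heard & set(derive_ids(k)) for k in public_diagnosis_keys
)
```

Publishing one key instead of 144 IDs keeps the public database small, and the matching still happens entirely on Bob's phone.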

What Are the Trade-Offs?

There are benefits and risks associated with both models. However, for the most part, centralized models benefit governments, and the risks fall on users.

Centralized models make more data available to whoever sets themselves up as the controlling authority, and they could potentially use that data for far more than contact tracing. The authority has access to detailed logs of everyone that infected people came into contact with, and it can easily use those logs to construct detailed social graphs that reveal how people interact with one another. This is appealing to some health authorities, who would like to use the data gathered by these tools to do epidemiological research or measure the impact of interventions. But personal data collected for one purpose should not be used for another (no matter how righteous) without the specific consent of the data subjects. Some decentralized proposals, like DP-3T, include ways for users to opt in to sharing certain kinds of data for epidemiological studies. The data shared in that way can be de-identified and aggregated to minimize risk.

More important, the data collected by proximity tracking apps isn’t just about COVID—it’s really about human interactions. A database that tracks who interacts with whom could be extremely valuable to law enforcement and intelligence agencies. Governments might use it to track who interacts with dissidents, and employers might use it to track who interacts with union organizers. It would also make an attractive target for plain old hackers. And history has shown that, unfortunately, governments don’t tend to be the best stewards of personal data.

Centralization means that the authority can use contact data to reach out to exposed people directly. Proponents argue that notifications from public health authorities will be more effective than exposure notification from apps to users. But that claim is speculative. Indeed, more people may be willing to opt in to a decentralized proximity tracking system than a centralized one. Moreover, the privacy intrusion of a centralized system is too high.

Even in an ideal, decentralized model, there’s some degree of unavoidable risk of infection unmasking: that when someone reports they are sick, everyone they've been in contact with (and anyone with enough Bluetooth beacons) can theoretically learn the fact that they are sick. This is because lists of infected ephemeral IDs are shared publicly. Anyone with a Bluetooth device can record the time and place they saw a particular ephemeral ID, and when that ID is marked as infected, they learn when and where they saw the ID. In some cases this may be enough information to determine who it belonged to. 
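
The unmasking risk comes down to a simple lookup. The sketch below uses made-up IDs, times, and places purely for illustration: an observer who logs where and when each ephemeral ID was seen can check that log against the public list.

```python
from datetime import datetime

# Observer's passive log: ephemeral ID -> (time, place) of the sighting.
# All values here are hypothetical.
sightings = {
    b"id-aaa": (datetime(2020, 5, 1, 9, 0), "cafe"),
    b"id-bbb": (datetime(2020, 5, 1, 13, 30), "office lobby"),
    b"id-ccc": (datetime(2020, 5, 2, 18, 15), "bus stop"),
}

# IDs later published as belonging to users who reported ill.
published_infected = {b"id-bbb", b"id-zzz"}

# Any published ID the observer logged reveals when and where it was seen.
unmasked = {
    eph: sightings[eph] for eph in published_infected if eph in sightings
}
```

If only one person was in the office lobby at 13:30, the time and place are enough to identify them, even though the ID itself carries no name.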

Some centralized models, like ROBERT, claim to eliminate this risk. In ROBERT’s model, users upload the list of IDs they have encountered to the central authority. If a user has been in contact with an infected person, the authority will tell them, "You have been potentially exposed," but not when or where. This is similar to the way traditional contact tracing works, where health authorities interview infected people and then reach out directly to those they’ve been in contact with. In truth, ROBERT’s model makes it less convenient to learn who’s infected, but not impossible. 

Automatic systems are easy to game. If a bad actor only turns on Bluetooth when they’re near a particular person, they’ll be able to learn whether their target is infected. If they have multiple devices, they can target multiple people. Actors with more technical resources could exploit the system more effectively. It’s impossible to solve the problem of infection unmasking completely, and users need to understand that before they choose to share their status with any proximity app. Meanwhile, it’s easy to avoid the privacy risks involved with granting a central authority privileged access to our data.
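
The single-target trick can be shown with a toy model of a server that answers only "exposed" or "not exposed." This is a sketch under assumptions, not ROBERT's actual design: `ExposureServer`, the string IDs, and the one-device-per-target setup are all hypothetical.

```python
class ExposureServer:
    """Toy authority that answers only 'exposed' or 'not exposed'."""

    def __init__(self, infected_ids):
        self._infected = set(infected_ids)

    def query(self, encountered_ids):
        # The answer carries no time or place, just a yes or no.
        return bool(self._infected & set(encountered_ids))


# Hypothetical ephemeral IDs broadcast by two people.
broadcast = {"target_a": ["a1", "a2"], "target_b": ["b1"]}
server = ExposureServer(infected_ids=["a2"])  # target_a reported ill

# One attacker device per target, listening only near that target.
attacker_logs = {name: list(ids) for name, ids in broadcast.items()}

# Each yes/no answer now describes exactly one person.
inferred = {name: server.query(log) for name, log in attacker_logs.items()}
```

Hiding the time and place of an exposure does not help when the querying device has only ever been near one person.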


EFF remains wary of proximity tracking apps. It is unclear how much they will help; at best, they will supplement tried-and-tested disease-fighting techniques like widespread testing and manual contact tracing. We should not pin our hopes on a techno-solution. And with even the best-designed apps, there is always risk of misuse of personal information about who we've been in contact with as we go about our days.

One point is clear: governments and health authorities should not turn to centralized models for automatic exposure notification. Centralized systems are unlikely to be more effective than decentralized alternatives. They will create massive new databases of human behavior that are going to be difficult to secure, and more difficult to destroy once this crisis is over.

76 days ago