Monday, May 25, 2009

Stop Spam, Read Books


An article in this Sunday's New York Times by Anne Eisenberg titled "New Puzzles That Tell Humans From Machines" provided a fascinating glimpse into the workings of reCaptcha.

I was particularly fascinated by this passage: "The system has an unusual twist that provides an added benefit to projects that are digitizing books and papers in archives: the source of the wiggly images that people must decipher is not random. The images are drawn from books and other media that are being digitized in mass projects, but that machines haven’t been able to read because, for instance, the page is wrinkled.

"Automatic character recognition lets people who are having the work scanned know which words it cannot read. These are the words that reCaptcha farms out and, once they are interpreted, returns to the original document. In this way, word by word, most of the mystery words are deciphered, in this case by humans.

“We are digitizing about 25 million words per day by having people type in captchas,” Dr. von Ahn said.

"The audio captchas are also being used for transcription and digitization projects. We are doing both speech and text," Dr. von Ahn said. Take your choice.

"The Times is paying reCaptcha for its help in digitizing its archives, said Marc Frons, chief technology officer, digital operations. So far, puzzling words in archives covering about 30 years have been deciphered with reCaptchas, he said."

The reCaptchas above came from the USPTO PAIR site back in March.  I wonder whose phone number found its way into the middle of all this.  Google provided no results.
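
The workflow described in the article (character recognition flags the words it cannot read, people type them in, and the answers go back into the original document) can be sketched roughly like this. It is only an illustration: the `[?n]` placeholder format, the function names, and the "at least two people agree" rule are my own assumptions, not details from the article or from reCaptcha itself.

```python
from collections import Counter

def resolve_word(human_answers, min_votes=2):
    """Return the most common typed answer if enough people agree, else None.

    The agreement rule is an assumption made for illustration only.
    """
    if not human_answers:
        return None
    normalized = [answer.strip().lower() for answer in human_answers]
    word, votes = Counter(normalized).most_common(1)[0]
    return word if votes >= min_votes else None

def fill_in_document(ocr_tokens, answers_by_slot):
    """Replace OCR placeholders like '[?0]' with words that humans agreed on."""
    result = []
    for token in ocr_tokens:
        if token.startswith("[?") and token.endswith("]"):
            slot = int(token[2:-1])
            word = resolve_word(answers_by_slot.get(slot, []))
            # Leave the placeholder in place when no agreement was reached.
            result.append(word if word is not None else token)
        else:
            result.append(token)
    return result

# Hypothetical scanned line with two words the OCR could not read.
tokens = ["the", "[?0]", "page", "was", "[?1]"]
answers = {0: ["wrinkled", "Wrinkled", "wrinkied"], 1: ["torn"]}
print(" ".join(fill_in_document(tokens, answers)))
```

In this made-up example the first mystery word is filled in because two people typed the same thing, while the second stays flagged until more answers arrive.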

Friday, May 15, 2009

The Census Bureau and Nail Salons

The Census Bureau knows data.  

Over the years we've worked with the Census Bureau. Everyone knows about the Decennial Census where they count everyone in the country. For Census this is the big show, the part of their mission that is in the Constitution. But a lot of people don't realize what else they do. The Census Bureau also does the Economic Census and the Census of Agriculture, both of which are amazing feats of operational and logistical coordination, as well as about 450 other surveys.

We learned an important lesson about information and finding the real information from Census.  They have rules.  Rule Number 1:  Give us the data the way the respondent put it on the form.  Don't modify it, don't interpret it, don't manipulate it.  Just give us the data the way the respondent put it on the form.  None of this using the software to add up the columns of numbers or make decisions about the address based on the zip code.  We want the data the way that the respondent gave it to us.  It's important, the respondents tell us things. We need to pay attention to what they say.

This is a hard lesson for people in the software and analytics industry. All those years spent creating special-purpose software techniques to correct problems with orders and forms and other documents go out the window. Add up the order, calculate the sales tax, add the shipping, check out. That works for order processing; it doesn't work for the mathematicians and statisticians at the Census Bureau. This is a good thing.

We learned why you don't want to translate or make assumptions on the data.  One of their best stories is about Nail Salons. 

During one of the Economic Censuses there were questions on the forms sent to businesses identified as beauty salons, a designation for businesses that performed grooming-related personal services. Census asks a lot of questions. And somewhere in the middle of the form they ask the respondent to break down their sales and revenue. They provide some fields that have information on what one would expect for a beauty salon to help the respondent and then they include "OTHER".

OTHER turned out to be very important. Census's commitment to take the data the way the respondent provided it is very important when things are written into the "OTHER" section. It turned out a lot of companies in this business sector were filling out the "OTHER" field with data. This was an OUTLIER. At Census, they live and die by outliers. It's where they find the interesting new stuff. Anyway, they found an outlier.

Close examination of the "OTHER" category and phone calls to the respondents revealed the emergence of a new business.  The "Nail Place".  Census has a more official name but they found the data that led them down the path of discovering that a new business emerged; a place that only did nails -- manicure, pedicure, french, silk wrap, acrylics -- nails.

It was interesting that they were able to confirm a reality.  The women of the world knew that there were nail salons.  But Census confirmed that what we thought was just a trend was a legitimate industry, creating jobs and providing services to the public.

Census found the trend by not letting the data be manipulated, by trusting their respondents. Census trusted that the citizens answering their surveys were giving them good information about their businesses. Census, in turn, recognized a new business and was able to provide us with insight about what was going on in the economy.
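
Here is a small sketch of the idea behind the story: keep what the respondent wrote exactly as written, and look for sectors where the "OTHER" field is being used far more than usual. The sector labels, the answers, and the cutoff are all invented for illustration; this is not Census data or the Census Bureau's actual method.

```python
from statistics import mean, pstdev

# Hypothetical responses: (business sector, text written in the OTHER field).
# Sector labels and answers are invented for illustration, not Census data.
responses = [
    ("beauty salon", "nails only"),
    ("beauty salon", "manicure and pedicure"),
    ("beauty salon", ""),
    ("beauty salon", "acrylic nails"),
    ("barber shop", ""),
    ("barber shop", ""),
    ("barber shop", "shoe shine"),
    ("barber shop", ""),
    ("dry cleaner", ""),
    ("dry cleaner", ""),
    ("dry cleaner", ""),
    ("dry cleaner", "alterations"),
]

def other_usage_by_sector(rows):
    """Share of forms in each sector with something written under OTHER."""
    totals, filled = {}, {}
    for sector, other_text in rows:
        totals[sector] = totals.get(sector, 0) + 1
        if other_text.strip():
            filled[sector] = filled.get(sector, 0) + 1
    return {sector: filled.get(sector, 0) / totals[sector] for sector in totals}

def flag_outliers(shares, z_cutoff=1.0):
    """Flag sectors whose OTHER usage is well above the average share.

    The z-score cutoff is an arbitrary assumption for this sketch.
    """
    values = list(shares.values())
    avg, spread = mean(values), pstdev(values)
    if spread == 0:
        return []
    return [s for s, v in shares.items() if (v - avg) / spread > z_cutoff]

shares = other_usage_by_sector(responses)
for sector in flag_outliers(shares):
    # Show the answers exactly as written so a person can read them.
    verbatim = [text for s, text in responses if s == sector and text.strip()]
    print(sector, verbatim)
```

Note that the code never cleans up or reinterprets the "OTHER" text. It only counts how often the field is used and then hands the untouched answers to a person, which is the point of Rule Number 1.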

Next time you get one of those forms in the mail, you really should fill it out.  Census knows how to create information out of the data.  


Thursday, May 14, 2009

Boo to Boolean Searching

Boolean searches and I don't get along. A Boolean search makes you think about how to construct your search, NOT what you are looking for.

One of our doctor friends, who is a very smart guy with lots of polysyllabic words in his vocabulary, tried to construct a Boolean search for me to prove he could do it. The man is brilliant; he can operate using the da Vinci surgical robotics system and teach medical school. The Boolean thing...too hard. I gave him my keyword guessing machine analogy. He equated it to a slot machine. Then I made a really tragic mistake; I tried to explain it. Aside from being able to talk about George Boole, the father of Boolean logic and its "ands," "ors," and "nots," the rest of the conversation was unfulfilling all the way around.

I like cognitive searching.  Or semantic searching, searching related to the structure of language and logic, if you want to be more accurate.  

Cognitive searching lets you search using your ideas. You can create a stream of consciousness text entry that describes what you are looking for and then hit enter. The search uses latent semantic analysis to understand what you are looking for and then goes and finds it. This is a much more satisfying search experience.

At Coronado Group we are very good at cognitive searching.

Wednesday, May 13, 2009

Painful Patents

On Monday, May 11th, Greg Aharonian of Internet Patent News Service reported the following:                                

"NO GERMAN PATENT FOR A SAUDI KILLER CHIP"


"Newswires report that the German Patent Office last Friday rejected a patent application from a Saudi inventor which tried to claim implanting semiconductors under the skins of visitors and remotely killing them if they misbehave.  The chip would allow GPS tracking to prevent immigrants from overstaying, with some chips containing cyanide to be released by remote control to "eliminate" people if they become a security risk."

Gruesome...

But during a recent search focused on wearable technology we found US Patent 3,885,576. Inventor Eliot Symmes's 1975 patent, "Wrist Band Including a Mercury Switch To Induce an Electric Shock" discloses, "a wrist band including a normally open mercury switch is worn by a person so that when the person raises his arm to put a cigarette to his lips...the mercury switch closes to connect a source of power and induce an electrical shock in the person in order to deter the person from smoking, drinking or the like."

Ouch... 


Sunday, May 10, 2009

Peer Review in the Digital World

We had a conversation with some of the leading diagnostic imaging informatics experts, the leading practitioners in the field; people who understand how to read the CT scan, the MRI, the X-ray, the mammogram. We asked them what they thought about search and looking for important information on the web. They told us that one of the problems they have when they search the web looking for medical research and other clinical and diagnostic information is that they don't have confidence in the results. They get blogs, opinion, and other unreliable information that they don't feel they can use to make decisions on behalf of their patients. The Drs. have to evaluate the reliability of the information that pops up on their search results list and they are not happy with what they find with a search of the web.

The Drs. still don't have the same level of confidence in what they find on the web as they have when they find and read the peer-reviewed journals that report on radiology and the amazing imaging views that they use to diagnose disease and to improve the outcome for their patients. They need the experts; they need to know that the information they find is reliable; they need to know the credentials of the people who are publishing their findings and opinions. What they want is a tool to help them make a judgement on the reliability of the content of the results of their search. After all, real people with real health issues are counting on them. They need to understand the authority of the information that they review.

So we talked about "bibliometrics", the study of written documents and their citations; bibliometrics uses citations to produce a quantitative and qualitative estimate of the importance and impact of the scientific research papers, journals, and analyses that appear in the results of their search. In short, they need a way to determine the merit of the work before they use it to help them in their decision making process.

We talked about creating a way to use the citations in these writings to evaluate the work and add weight to the importance and authority of works when you search. These same techniques have been used to find "important patents" based on how many times a work is cited by other inventors, and to determine which new patents are going to be important.

So, we went to the drawing board to build an electronic equivalent of the peer review system based on the content of the citations, the organizations that published the work, the credentials of the authors, and the frequency of citations on published articles to help searchers find the works that are most important. We are building the social network of the researchers so that these experts can connect with each other and leverage their important work. Stay tuned.
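
To give a feel for how citations could add weight to search results, here is a minimal sketch. It uses only two of the signals mentioned above (citation frequency and whether a work is peer reviewed); the formula, the bonus value, the `Result` fields, and the sample results are all assumptions for illustration, not the system we are building.

```python
import math
from dataclasses import dataclass

@dataclass
class Result:
    title: str
    text_relevance: float   # 0..1 score from whatever search engine is in use
    citations: int          # how many other works cite this one
    peer_reviewed: bool

def authority_score(result, peer_review_bonus=1.5):
    """Combine text relevance with a citation-based authority weight.

    The formula and the bonus value are illustrative assumptions.
    """
    # log1p dampens the effect so one heavily cited paper does not swamp the rest.
    citation_weight = 1.0 + math.log1p(result.citations)
    bonus = peer_review_bonus if result.peer_reviewed else 1.0
    return result.text_relevance * citation_weight * bonus

def rank(results):
    """Order results from highest to lowest authority score."""
    return sorted(results, key=authority_score, reverse=True)

# Hypothetical search results.
results = [
    Result("Blog post on reading chest CT", 0.90, 0, False),
    Result("Journal article on chest CT findings", 0.70, 40, True),
    Result("Conference abstract on CT dose", 0.60, 3, True),
]
for r in rank(results):
    print(f"{authority_score(r):.2f}  {r.title}")
```

With these made-up numbers the blog post has the best keyword match but drops to the bottom, because nobody cites it and it has not been peer reviewed.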

Saturday, May 9, 2009

The Problem With Keywords

In modern search technology a keyword that appears once in the body of a document will receive a low score.  This scorekeeping method potentially keeps important material from being exposed to the searcher.
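
A toy example makes the problem concrete. The scoring below is plain term frequency (keyword count divided by document length), which is a deliberate simplification of what real search engines do; the two documents are invented.

```python
import re
from collections import Counter

def term_frequency_score(document, keyword):
    """Score = occurrences of the keyword divided by total words."""
    words = re.findall(r"[a-z0-9]+", document.lower())
    if not words:
        return 0.0
    return Counter(words)[keyword.lower()] / len(words)

# Hypothetical documents: one repeats the keyword, one mentions it once.
docs = {
    "repeats the keyword": "wallet wallet wallet payment wallet checkout",
    "mentions it once": "a stored consumer profile fills in payment forms like a wallet would",
}
scores = {name: term_frequency_score(text, "wallet") for name, text in docs.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}  {name}")
```

The document that actually describes the idea mentions the keyword once and lands at the bottom, while the one that just repeats the word floats to the top.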

Friday, May 8, 2009

Keyword Guessing Machine

An invention can’t be described in the 10 words supported by the leading internet search engines. Is a series of words and a few “ands”, and “ors” enough to describe the state of the art in any technology?   How do you find the terms to describe innovations over a 20 or 30 year time horizon?  What words do you need to know to get good results?  For most intellectual property research activities, conventional search becomes a Keyword Guessing Machine. 

The Keyword Guessing Machine takes you down the tedious path from one search to the next, keyword to keyword, combining keywords with Boolean operators, ands, ors, and nots, to try to find the right art. The quality of the results is contingent on the searcher’s understanding of the keywords and vocabulary associated with the art. A series of Boolean operators is needed to build searches that define complex ideas.

Searches are complicated by the need to look across time and address a constantly evolving vocabulary used to describe inventive art.   Keyword and full text searching do not compensate for the evolution in the lexicon used to describe a field of research and often have limited mechanisms to understand all of the concepts embodied by a single term. Researchers need to execute multiple queries using the vocabulary of the era or weed out search results that contain the same words but don’t embody the right concept. 
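
The vocabulary problem is easy to demonstrate. In this sketch two invented abstracts describe the same idea in the words of two different eras, and a Boolean AND search only finds the one whose vocabulary the searcher happened to guess. The abstracts, titles, and search terms are hypothetical.

```python
import re

def words_in(text):
    """Lowercased set of the words that appear in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_and(documents, required_terms):
    """Return titles of documents containing every required term (a Boolean AND)."""
    required = {term.lower() for term in required_terms}
    return [title for title, text in documents.items() if required <= words_in(text)]

# Hypothetical abstracts describing the same idea in the vocabulary of two eras.
documents = {
    "older filing": "stored purchaser record completes the order form over a computer network",
    "newer filing": "digital wallet autofills checkout for online shopping",
}

first_try = boolean_and(documents, ["wallet", "checkout"])
second_try = boolean_and(documents, ["purchaser", "order", "network"])
print(first_try)   # only the newer filing matches
print(second_try)  # only the older filing matches
```

Neither query finds both documents, so the researcher has to guess each era's words and run the search again.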

Conventional search dilutes the power of terms and their underlying concepts. Associations become overused, losing important related concepts or the meaning of search words over time.

Search paralysis sets in when combinations of words yield no new or usable results.  Users then start again with a new set of terms.  The more novel or complex the idea, the harder it is to get meaningful results or to follow a thread of subject matter the way an inventor, attorney, or patent professional thinks about and defines their art.

If you look at the May 6th posting for the original Cybercash electronic commerce patent, it's hard to imagine that meaningful prior art could be found using the search terms defined by the examiner or that a researcher would know the right combination of words to find the appropriate art. Keyword searching when looking for prior art is difficult, time consuming, and very frustrating.
 

Thursday, May 7, 2009

Search Can Get Ugly


Part of the Examiner's search request for Patent Number 6,092,053, now owned by PayPal, Inc.

The invention is described as:

A system and method for merchant invoked electronic commerce allowing consumers to purchase items over a network and merchants to receive payment information relating to the purchases. The system includes a server having software which gathers the purchasing information from a consumer to complete a purchasing transaction over a network. The system has a consumer data structure that stores purchasing information for registered consumers. The software is able to access the consumer data structure and enter the consumer's purchasing information during subsequent purchases. Having the software obtain and enter the consumer's purchasing information, the consumer does not have to enter the same information every time they purchase an item over the network. In alternate embodiments, the same technology can be applied to other arenas where a user may have to enter the same repetitive information.

Wednesday, May 6, 2009

A Fundamental Intellectual Property Challenge

Digital scientific literature is accelerating inventive activity and the development of emerging technology.

There aren’t enough Subject Matter Experts to find prior art for new and emerging technology.

A new search paradigm is needed to support expansion of inventive activity.