Archive for November, 2009
by Bill Ives
Here is another in a series of notes from the Enterprise Search Summit, connected with the 2009 KM World, that focus on applications. This session was titled Evolve From a Tactical E-Discovery Approach to Search and E-Discovery, presented by Brian W. Hill, Senior Analyst, Forrester Research. Here is the session description.
During the next 18 months, initiatives to cut costs and rising regulatory demands will drive enterprise e-discovery investments. Although search is a critical component of e-discovery, enterprises Forrester surveyed use a grab bag of search tools and techniques for litigation or regulatory inquiries. Lacking an end-to-end approach to gather and filter information, few enterprises report having a holistic approach to e-discovery. A full two-thirds of Forrester respondents consider their e-discovery strategy reactive rather than proactive. This indicates overall enterprise readiness for e-discovery is dismal. In this session, Hill will demonstrate the need for information and knowledge management (I&KM) professionals to collaborate with IT, legal, and business teams to understand how search technology can help cut e-discovery costs and mitigate risk. Formalized, cross-functional e-discovery programs that facilitate internal communication, encourage standardized methodologies, and provide a “big picture” view will help drive results. These programs can help support initiatives to synchronize e-discovery,
records management, and archiving.
Brian started by saying that eDiscovery solutions are not cheap, but not using them can be much more expensive. Companies are grappling with regulatory issues and getting ready for possible litigation. Organizations have responded by rolling out a series of archives. The result is content and technology chaos that is difficult to search across and does not manage to mitigate legal risk.
There is massive fragmentation within many enterprises. Security and metadata can be very different and not connected.
eDiscovery is a large market, but it is also immature. Many organizations are having challenges here. There are different stakeholders, including legal and regulatory, and a large set of technology options. eDiscovery now invites fear, uncertainty, and doubt. At the same time, organizations are trying to take out cost. (See my AppGap posts: Equivio Releases Relevance™ to Enhance eDiscovery and Darwin Ecosystem Brings Awareness Engine™ to Enterprise 2.0 eDiscovery.)
There are different content types: email, collaboration systems, desktop files, etc. Email is the fastest growing content type. However, the law says that all types of electronic content are subject to eDiscovery, including things like voice mail. Only 20% of organizations surveyed felt their records were accessible and searchable.
There is a change point at the 5,000 employee level; above it, there is more pain. Companies at this size feel more vulnerable and have much more complexity to deal with. There is more to it than search: there needs to be proper legal hold capability within records management. Brian looked at the records management vendors. Many people are using SharePoint, but it is not designed for this task (see the Forrester ECM Wave report).
Less than half of stakeholders are satisfied with their records management. Technology is not the answer. You need a well-defined process with proper stakeholder alignment. The market is fragmented, and there are about 500 vendors that address this need in some way. But again, the right process is the main need, and it is perceived as the top challenge. Defining a defensible, repeatable process is key.
Records managers report all over the place in organizations. This is another example of fragmentation. There is a gradual trend toward reporting to IT and legal. IT and legal are key players here, and Brian said he often finds that IT and legal have not met before.
The maturity of this function within large organizations varies widely, from those just starting to those that are very well developed. The company’s legal risk profile is a main driver of maturity, as is how many times it has been sued.
One search approach is keywords, but then you only find what you are looking for, and you might not know the right keywords. I will add that this is a place where Darwin Ecosystem can help, as it allows for discovery of relationships that you were not necessarily looking for. There are tools that use fuzzy logic, clustering, and other means to uncover keywords that you did not think about.
The most commonly used technologies to support search are operating system and desktop search tools. There are a number of shortcomings with this approach. A very low percentage are using enterprise search.
In summary, it is much more than technology. You need to define objectives, build queries, measure success, automate, review, and analyze. You need a team of experts: legal, search, statistics, etc. Make sure you have the right executive support. Get the right stakeholders. Develop a records retention strategy and policies. Create a consistent process. You need integration between records management and search. And stay on top of evolving case law.
Brian’s slides are available at http://www.forrester.com/esswest09
by Bill Ives
This is another in a series of notes from the Enterprise Search Summit connected with the 2009 KM World. Enterprise Search Technologies was a preconference workshop, led by independent consultant Miles Kehoe from New Idea Engineering, Inc. Here is the session description.
“This workshop, by a vendor neutral consultant who has hands-on experience with a broad range of “out of the box,” open source, commercial, and home grown solutions, provides an overview of the enterprise search technology landscape. It reviews technologies currently on the market; discusses pros and cons, strengths and weaknesses, and specific requirements. Kehoe shares case studies that illuminate how search technologies are leveraged in different types of organizations, and provides a good introduction to and understanding of the enterprise search world.”
Miles said the characteristics of great search include being conversational, open ended, flexible, and smart. Conversational search allows you to interact to focus your quest. This is especially important for enterprise search, as search is much harder inside the enterprise. Never return just “no hits”; ask for more if you cannot find anything. Every search engine is built around a set of indices. This even applies to Google, which creates an index through its spiders. Different search engines just add different capabilities around the indices. Every search engine goes through a process; some expose parts of it, which gives you added flexibility to pull specific information out.
It used to be you got plain search results pages in enterprise search like basic Google Web search. Now you get what he called enterprise search 2.0 with visualization, navigation, people, facets, etc. strung around the basic results.
There are two basic parts of search: indexing and the actual search. It is better to take time at indexing (when people are not waiting) than at search time, when people are waiting. However, I asked if the real-time capabilities of Twitter were changing those expectations. People want to see content as soon as it exists.
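To make that trade-off concrete, here is a minimal sketch of the inverted-index idea (my own illustration, not any vendor's engine): the expensive work happens once at indexing time, so each query is a cheap lookup.

```python
from collections import defaultdict

# A minimal inverted index: the heavy work (tokenizing, normalizing)
# happens at indexing time, so each query is a fast dictionary lookup.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        # Index-time work: done while no one is waiting.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Query-time work: intersect posting lists while users wait.
        terms = query.lower().split()
        if not terms:
            return set()
        results = self.postings[terms[0]].copy()
        for term in terms[1:]:
            results &= self.postings[term]
        return results

index = InvertedIndex()
index.add(1, "enterprise search summit notes")
index.add(2, "real time search on Twitter")
print(index.search("search notes"))  # {1}
```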
Before he reviewed vendors, Miles said it is not the technology but the methodology. It is how you implement the search engine. I can agree with this.
As he started to review vendors, Miles mentioned Lucene/Solr, a free open source search engine that is behind a number of search engines, including some commercial ones. It is Java based with an Apache license, prolific documentation, many implementations, and you have total control over search and relevance. However, there is some implementation work required, and it can be hard to find answers. There are limited enterprise support options. SearchBlox is packaged Lucene; Lucid Imagination is packaged Solr.
Miles’ tier one vendors are: Autonomy, Endeca, Exalead, Fast Search (the original independent version), Google, Vivisimo. I have reviewed Exalead (see Exalead’s CloudView Offers Integrated Search Capabilities). His criteria are: broad enterprise presence, multi-platform search, market penetration, and clear product vision. People like the Google brand so they have a perception that Google enterprise search works well. Not being in Tier One is not necessarily bad, just not meeting all the criteria. Other vendors I have reviewed that Miles also mentioned include Attivio (see Attivio Aligns with Traction and Releases New Features) that is newer and Recommind (see Recommind Provides Axcelerate eDiscovery 3.0 with New Features) which is more vertical focused.
Dates are important, but web servers often provide bad date data, so it is hard to trust what you get. Miles gave the example of a 1996 document appearing as new because it had just been re-indexed.
The wifi started working, so Miles showed us Web sites with good search capabilities. Globrix is a UK real estate site that uses FAST, and you could see a lot of facets in home listings such as number of bedrooms, bathrooms, price range, etc. Then we looked at Newssift, which displays sentiment on topics. We also looked at Kosmix, which provides an example of exploratory search: it shows things that are related and loosely related.
Next we covered supporting technologies including document filters, connectors, social search, and federation. Document filters are part of the indexing process that converts binary source documents (PDF, Office, etc.) into a stream of text for indexing. Connectors are utility tools to provide a clearly defined interface between a search engine and external content. Some relate to indexing and others to display. Connectbeam is an example (see my reviews: Connectbeam Offers New Social Networking Application Integration Possibilities).
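As a rough illustration of the document-filter idea (a toy interface of my own, not any product's API), each registered filter turns one source format into a plain text stream for the indexer:

```python
# Toy document filters: each converts a binary source document into a
# stream of text for indexing. Real PDF and Office filters are far more
# involved; this only shows the shape of the interface.
FILTERS = {}

def filter_for(extension):
    def register(func):
        FILTERS[extension] = func
        return func
    return register

@filter_for(".txt")
def filter_text(raw_bytes):
    return raw_bytes.decode("utf-8", errors="replace")

@filter_for(".csv")
def filter_csv(raw_bytes):
    # Flatten rows into one text stream for the indexer.
    return raw_bytes.decode("utf-8", errors="replace").replace(",", " ")

def extract_text(filename, raw_bytes):
    for extension, func in FILTERS.items():
        if filename.endswith(extension):
            return func(raw_bytes)
    raise ValueError(f"no filter registered for {filename}")

print(extract_text("report.csv", b"q3,revenue,up"))  # "q3 revenue up"
```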
Social search is a popular term that applies to the capability to search corporate personal profiles to find people in an organization with certain skills or experience. It typically requires users to explicitly self-profile in order for searches to return accurate results. Some products now track user behavior to implicitly associate interests with users.
Federation refers to a program that can dispatch user queries to one or more external data sources (search engines, RDBMS systems, etc.) and present the combined results to the user. Federation across unsecured resources is fairly easy. Because relevance from each source is calculated differently, it is sometimes difficult to integrate results in a meaningful way.
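Here is a minimal sketch of that relevance problem (both sources and their scores are invented): each engine scores on its own scale, so the federator normalizes per source before merging:

```python
# A toy federator: dispatch one query to several sources and merge the
# results. Scores are normalized per source because each engine
# calculates relevance on its own scale.
def search_wiki(query):
    return [("wiki:intranet-faq", 8.2), ("wiki:hr-policy", 3.1)]

def search_crm(query):
    return [("crm:acme-account", 0.91), ("crm:acme-contact", 0.40)]

def federate(query, sources):
    merged = []
    for source in sources:
        hits = source(query)
        if not hits:
            continue
        top = max(score for _, score in hits)
        # Map each source's scores to 0..1 so they can be compared.
        merged.extend((doc, score / top) for doc, score in hits)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

print(federate("acme", [search_wiki, search_crm]))
```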
Entity extraction recognizes people, places, or things during indexing. In unsupervised extraction, entities are recognized through algorithms. In supervised extraction, the process is seeded by human operators prior to processing.
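A tiny sketch of the supervised, seeded variant (the seed list is hypothetical): human operators supply known entities ahead of processing, and indexing tags each occurrence; unsupervised extraction would instead recognize entities algorithmically:

```python
# Seeded ("supervised") entity extraction: operators provide the
# entities ahead of processing, and indexing tags any occurrence.
SEED_ENTITIES = {
    "Forrester": "ORGANIZATION",
    "Brian Hill": "PERSON",
    "Chicago": "PLACE",
}

def extract_entities(text):
    return [(name, kind) for name, kind in SEED_ENTITIES.items()
            if name in text]

print(extract_entities("Brian Hill presented Forrester survey data."))
# [('Forrester', 'ORGANIZATION'), ('Brian Hill', 'PERSON')]
```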
Sentiment analysis recognizes positive or negative sentiment algorithmically during indexing. It is easier to tell positive sentiment than negative.
Results clustering groups sets of documents into categories based on content. It looks similar to facets and entity extraction; however, clustering can be done independent of the query. Clustering is often used in search results to help the user discover additional related terms and content.
Faceted search is the result of assigning documents in a search result list into a pre-defined, taxonomy-like order. Unlike clustering, which can appear similar, facets are based on the query and populate pre-defined classes of content (author, location, etc.). Facets are often used to encourage interaction with the user.
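The difference shows up even in miniature (the documents below are made up): a facet bins the query's matches into pre-defined classes, while clustering would derive the groups from the content itself:

```python
from collections import Counter

# Toy faceted search: documents matching the query are counted into
# pre-defined classes such as author and type.
DOCS = [
    {"id": 1, "text": "search summit notes", "author": "Ives", "type": "post"},
    {"id": 2, "text": "search tips", "author": "Kehoe", "type": "slide"},
    {"id": 3, "text": "summit recap", "author": "Ives", "type": "post"},
]

def facet_counts(query, facet):
    hits = [doc for doc in DOCS if query in doc["text"]]
    return Counter(doc[facet] for doc in hits)

print(facet_counts("search", "author"))  # Counter({'Ives': 1, 'Kehoe': 1})
print(facet_counts("summit", "type"))    # Counter({'post': 2})
```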
A key to having good search is to monitor it over time after the initial implementation. Look at what is happening and make corrections. Look at what people are searching for and accommodate them. You need to pull together a diverse collection of skills to have a great search function (e.g., business domain experts and corporate librarians, beyond just technical skills).
Miles mentioned two blogs on the topic that he writes: EnterpriseSearchBlog.com and SearchComponentsOnline.com.
by Bill Ives
I have written about Attivio on the AppGap before (see Attivio Tightly Integrates Structured Data and Unstructured Content for a New Approach to Information Access and Attivio on Some Potential Winners in our New Economic World). Recently, I talked again with Attivio’s CTO, Sid Probstein, and their VP of Marketing, MaryAnne Sinville, about some of their latest moves. We first discussed Traction’s selection of the Attivio Active Intelligence Engine™ (AIE) to power their information access.
I know Traction well and have great respect for their offering (see for example, Traction Software Releases Team Page 4.1 – Better Enabling the Social Side of Work). Sid said that Traction chose Attivio because of its single, flexible API, full Java support, multi-language capability, as well as its granular and secure permissioning model. This makes sense as both platforms are Java-based. More importantly, I am familiar with Traction’s strong granular permissioning capability so it would be important to have a search engine that supports this model.
Sid explained that there are two ways search can support permission granularity: early bound and late bound. With early bound, the search capability handles permission information as content is loaded. This way, users see in search results only what they are allowed to access. It is the cleanest and most secure way to handle this requirement. However, it can be hard to accomplish and maintain, especially when changes are made to the status of a large number of documents. Attivio is able to keep up with these demands by storing permissions as structured data – in tables – and using its query-side JOIN operator to link security with content that matches the user’s query and permissions.
The fallback method is called late bound. This approach does searches without regard to security levels, then checks results against a relational database of security levels before releasing them. This can cause performance issues, as it adds a step and brings back far more documents that then have to be narrowed down. It also adds an extra layer of expense and maintenance requirements. In addition, there can be security leaks through actions such as spell checking. Moreover, certain features become impossible, like facet recognition and partitioning, another capability built into Traction that I cover next.
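Here is a toy contrast of the two models (my own illustration, not Attivio's or Traction's implementation): early binding folds permissions into the query itself, while late binding filters an oversized result set after the fact:

```python
# Early vs. late binding, in miniature. The documents and ACL table
# below are invented for illustration.
DOCS = {1: "q3 layoffs memo", 2: "holiday party plan", 3: "merger draft"}
ACL = {1: {"hr", "exec"}, 2: {"hr", "exec", "staff"}, 3: {"exec"}}

def early_bound(query, role):
    # Permissions are part of the query: only authorized documents
    # are ever touched or scored.
    return [doc_id for doc_id, text in DOCS.items()
            if query in text and role in ACL[doc_id]]

def late_bound(query, role):
    # Search ignores security, then a second pass checks every hit:
    # extra work, and anything computed on the raw hits can leak.
    raw_hits = [doc_id for doc_id, text in DOCS.items() if query in text]
    return [doc_id for doc_id in raw_hits if role in ACL[doc_id]]

print(early_bound("memo", "staff"))  # []
print(late_bound("memo", "staff"))   # also [], but doc 1 was read first
```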
Sid showed me a sample Traction search page powered by Attivio (see above). The results are offered in the middle column. In the left column are aspects of the results grouped by facet (e.g., keywords, projects, labels, types, authors). This is a very useful way to dig into the content by displaying more than just the titles of documents. You can drill down into these facets to find related content, such as other works by the same author. You can also select search parameters such as projects, date ranges, and authors to narrow the results. Below is a sample return on authors.
Next, we moved to some of the new features and capabilities within Attivio. They have released a Sentiment Module that extracts and measures the sentiment or “attitude” of documents, as well as the entities within documents. The attitude may be the person’s judgment (e.g. ‘positive’ vs. ‘negative’) or emotional tone (e.g. ‘objective’ vs. ‘subjective’). Sid explained that the sentiment analyzer is a trainable component that can be used with any language. The tool comes out of the box trained for common Web usage but you can increase the sophistication to fit your requirements.
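As a rough sketch of what "trainable" means here (a toy scorer of my own, not Attivio's module), the analyzer learns word weights from labeled examples rather than relying on a fixed lexicon, which is why it can be retrained for any language or domain:

```python
from collections import Counter

# A toy trainable sentiment scorer: word weights are learned from
# labeled examples, so retraining adapts it to a new domain.
class SentimentScorer:
    def __init__(self):
        self.weights = Counter()

    def train(self, text, label):  # label: +1 positive, -1 negative
        for word in text.lower().split():
            self.weights[word] += label

    def score(self, text):
        # Positive totals suggest positive sentiment, negative the reverse.
        return sum(self.weights[word] for word in text.lower().split())

scorer = SentimentScorer()
scorer.train("great product love it", +1)
scorer.train("terrible dealer avoid", -1)
print(scorer.score("love the product"))   # 2  (positive)
print(scorer.score("avoid this dealer"))  # -2 (negative)
```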
For many situations, document-level sentiment is sufficient. However, for multi-topic content such as long documents and emails, you can get sentiment at the entity level. For example, you can find that a blogger liked the product but did not like the dealer. You need to train the tool to handle multiple entities in a document, but the ability to do this comes with the tool. Here is a sample sentiment alert.
You can also use the sentiment analyzer to look at Twitter and other micro-messaging tools. Here you need to first train the sentiment analyzer on the common text shortcuts used in these tools, but, again, the ability to perform this training comes with the tool. I think this capability will be very useful for business intelligence and marketing activity. You can use it both inside and outside the enterprise. While the Web applications might seem more obvious, I also think it will be very useful for looking at internal documents.
Attivio also released a new Classification Module that provides an enhanced means to classify documents to a set of categories. For example, in the media industry, information could be classified by news categories (local, world, business) or classified ads (community, housing, for sale), etc. Previously they had a rules-based classifier. Now Attivio has brought the machine learning technology used in the sentiment analyzer to enable auto-classification of new documents. You first train the tool on a category through examples, and then it will be able to carry on the task with new documents. Sid mentioned that these new features are also part of the Traction implementation. With Traction, you can use prior manual classifications to train the classifier, saving a lot of time.
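A minimal sketch of train-by-example classification (scikit-learn is used purely for illustration; it is not what Attivio ships, and the training documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Train on a few example documents per category, then classify new ones.
train_docs = [
    "city council vote tonight", "mayor opens new library",        # local
    "treaty signed by both nations", "embassy talks continue",     # world
    "shares rally on earnings", "quarterly profit beats forecast", # business
]
train_labels = ["local", "local", "world", "world", "business", "business"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["profit warning hits shares"]))  # ['business']
```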
These are all good moves for a product that I have been hearing good things about. Sid said that one of the reasons he is excited about the Traction connection is that they will be able to innovate together. I look forward to seeing what comes out of this pairing.
by Bill Ives
The introduction of Web 2.0 social media into the enterprise creates large amounts of potentially useful information about the social side of business processes. The key is gaining awareness of and access to this social intelligence. Paula Thornton recently covered this issue nicely in The Context of “Intent.” She opens with, “Two differentiating attributes of 2.0 are adaptation and emergence. Adaptive systems rely on feedback loops for continuous assessment. Emergence is the result of self-organizing adaptation.”
Paula later adds, “The goal is to bring together relevant facts to inform discovery (the possibilities) that then lead to design — especially adaptive design to support individuals interacting with or on behalf of a business. Such facts are often difficult to find and difficult to effectively interpret and leverage.” And then offers, “A more 2.0 approach would bring the facts into the context they’re related to, featuring (draw attention to via teasers) certain findings in tidbits, leading to more detail.”
I added in a comment on Paula’s post that enterprise 2.0 provides a wealth of social “data” as a byproduct of its use. We are only beginning to figure out how to harvest this content. The organizations that do this are truly working in the enterprise 2.0 space.
Increasing awareness in context and finding related content is the goal of Darwin Ecosystem. Its Awareness Engine™ gives users the ability to perceive and be conscious of events and patterns of activities captured in the enterprise and Web 2.0. The introduction of Web 2.0 tools into the enterprise also opens up possibilities for even more information silos if these tools do not connect. Darwin provides a way to see emerging patterns of related content across information generated through a variety of tools. As a disclosure before going further, I am working with Darwin and have a small stake in the firm.
Darwin reduces the effort of keeping up with enterprise 2.0 content (Information Overload Management). It classifies and correlates patterns of the business activities trapped in your Web content. Through its Scan Cloud™, Darwin makes the value of enterprise 2.0 content visible and measurable (Awareness and Monitoring). You can see related items in a tag-cloud-like visualization, and the relationships shift as you move through the cloud. This dynamic nature makes it easy for users to see the correlation across enterprise 2.0 content (Discovering and Sharing).
Using correlation metrics based on Chaos Theory, Darwin looks for the emergence of correlated themes within chaotic content. This moves Darwin away from the voting and popularity rankings used by many information aggregators and search engines to rank information. By using correlation metrics, it does not require a known process or taxonomy to discover useful and related information. Darwin allows for the emergence that Paula discussed. You can set up ongoing filters or “attractors” to explore emerging themes in specific topics of interest, or simply look at the broad picture.
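As a generic illustration only (Darwin's actual metrics are its own), the flavor of the idea is co-occurrence: themes emerge when terms keep appearing together across items, with no taxonomy or vote counting involved:

```python
from collections import Counter
from itertools import combinations

# Count term pairs that co-occur across items; the strongest pairs are
# the kind of emergent correlation an "attractor" would surface.
items = [
    "social media gain nasdaq network",
    "social media gain display ads uk",
    "facebook brand loyalty traffic",
    "social media gain facebook brand",
]

pair_counts = Counter()
for item in items:
    for pair in combinations(sorted(set(item.split())), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)
# e.g. ('gain', 'media') 3, ('gain', 'social') 3, ('media', 'social') 3
```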
Darwin is not a replacement for existing enterprise 2.0 technologies, dashboards, and document management tools. It is designed to complement and leverage these technologies by making their tacit knowledge more visible. Darwin looks at events, blogs, and other Web 2.0 sources that may correlate with the enterprise’s actions, its competitors, and the voice of its critics and customers. This allows the enterprise to more easily discover its own knowledge assets, as well as its market and competitive positioning.
For example, using Darwin’s Awareness Engine, a program manager in Marketing becomes aware of concerns in R&D, Sales, or other initiatives that are emerging as he or she tries to promote a product (all without having to wait for and depend on the next meeting or coffee break encounter). They can also see competitive moves that may correlate with their own plans or the efforts of their customers and prospects. Likewise, management can have a high-level view of what topics are emerging across all divisions and initiatives to better steer the business or measure the effectiveness of the vision’s execution, in the process discovering emerging and stronger initiatives or noteworthy employees.
It is a Web browser application (Scan Cloud™), or it can become a custom solution through API access. It is delivered through a Web server with services and a database correlating the different Web 2.0 sources. For the enterprise, there is an on-premise solution running on Ruby on Rails and making use of RSS feeds. Its Virtual Cortex™ database can run on Oracle, MS SQL, or MySQL according to scalability needs. You can also go from the inside out, so an enterprise can correlate its initiatives, knowledge, and business intelligence against the pulse of the Web. The application is currently in alpha stage and available for use. You can access the Web edition through free registration on the Darwin Ecosystem site. The enterprise edition is available for early adopters.
Here is a brief sample of how Darwin works. I set up a query on social media and saw the word “gain” in the cluster shown below. I looked to see why “gain” is associated with social media by selecting it and highlighting the cluster of correlations between social media and gain.
Then I saw three articles from the past few days that matched this correlation, as you see in the image below.
There was a story about how NASDAQ launched a social networking site, as shown below. This was very interesting to me.
I also saw an article reporting that in the UK, social networking sites account for 25% of display ads, as well as an article on using Facebook traffic to drive brand loyalty. None of these articles appeared in a Google search on social media and gain, nor did they appear in Google News on the topic. Below is a shot of the complete Darwin interface so you can see the relationship of the detailed components shown above. This is a sample of how you can discover, through correlation, new information that is hard, if not impossible, to find in traditional search.
You can also find stories about themes that are emerging on the Web, as well as images that correlate with content, as shown in results about a very recent helicopter crash in the Pacific Ocean.
Darwin also allows you to correlate content in traditional mainstream media with what is happening in the blogs and other social media. In 2008 Darwin won a Young Entrepreneurs Award from the Office for Science and Technology of the French Embassy in the United States. The Young Entrepreneurs Initiative (YEi) is a platform for mentoring and networking US-based entrepreneurs who understand the importance of internationalizing their vision and their activity, and who wish to set up a technology venture in France.
They recently started the Darwin Discovery Engine Blog to provide more background on their efforts, as well as commentary on content issues on the Web and enterprise 2.0.
by Bill Ives
A consortium of technology companies announced the creation of the Open Mashup Alliance in late September. This effort is being led by JackBe and provides open access to their Enterprise Mashup Markup Language (EMML) through a Creative Commons arrangement. I have written about JackBe before (see: JackBe is Refining its Enterprise Mashup Offering). Recently I spoke with John Crupi, CTO of JackBe, about this new move and other things going on at the company.
John said that the alliance is open to any organization with an interest in the advancement of EMML and Enterprise Mashup interoperability and compatibility. The other founding companies include: Adobe, Bank of America, Capgemini, Hinchcliffe & Co., HP, Intel, Kapow Technologies, ProgrammableWeb, Synteractive, and Xignite. JackBe also provides a free-to-use EMML reference runtime engine. The Open Mashup Alliance will steward and enhance the EMML v1.0 specification for future contribution to a standards body.
John said that JackBe wanted to contribute EMML because the company felt that this was the logical next step in the evolution of mashups. The development of an open language allows for interoperability and portability, reducing adopter risk and cost. Additionally, JackBe says customers have been asking for EMML to be made open. John believes this is the best proven technology for enterprise mashups thus far, but knows there is more that can be done in this area.
John said he saw what happened to the early Web services market when vendors competed on standards and the industry became fractured. By using a Creative Commons approach, innovation can be built on top of the current EMML and made available to all members. The EMML specification, along with a supporting runtime reference implementation, documentation, and sample code, is also available on the Alliance website. I think the alliance is a smart move, and it will be interesting to see how it evolves.
JackBe has become increasingly focused on enhancing the speed and efficiency of information access. Customers can pull data from several sources to create dashboards on the fly. JackBe is well aware of the need for security and governance based on working so closely with the Department of Defense. John said that they have seen increased adoption in the economic downturn as people look for less expensive ways to manage business data. This is consistent with what I have heard from other Enterprise 2.0 vendors.
Last year they launched the Mashup Developer Community (MDC). This community offers the general availability of the Presto Enterprise Mashup Platform Developer Edition, along with free training and support. The Presto Developer Edition is a complete edition of the Presto Enterprise Mashup software that can be used by developers indefinitely at no cost. The Presto Developer Edition also includes 50 mashup-ready APIs from ProgrammableWeb. The MDC also provides a support environment for enterprise mashup developers through interactive forums monitored and managed by a large mashup community, including JackBe’s mashup engineers. The MDC also supports the integration of mashups into the enterprise through specialty areas such as ‘Mashups and SOA’, ‘Mashups and Portals’, and ‘Mashups and Oracle’. The MDC is another good move to spread the creation of mashups and support the Enterprise 2.0 concept.
by Jenny Ambrozek
This post began as a response (unpublished) to @DesignerDepot’s popular The History & Evolution of Social Media post.
Realizing that I was reflecting on the evolution of social media and projecting forward on October 29, 2009, the 40th anniversary of ARPANET and the beginning of the Internet, I was inspired to share more widely and honor the occasion.
I wonder: as you look back at the evolution of social media and then forecast forward, what do you see?
For me, that landscape scan focuses on sociologist Moreno’s sociograms and social network analysis dating from the 1930s. Seven decades on, social network analysis is an evolved discipline, as evidenced in the work of Mark Granovetter, Ron Burt, David Krackhardt, Valdis Krebs, Steve Borgatti, Duncan Watts, Albert-Laszlo Barabasi, Rob Cross, Patti Anklam, FAS Research, Doris Spielthenner, recent books by Christakis and Fowler and by Easley & Kleinberg, and more.
It’s my experience that the real value for enterprises comes when you apply a social network analysis lens to understanding if, and how, value is created through social network platforms. To me the missing functionality from the platforms we call “social networks” like Facebook and LinkedIn is the ability to make the networks visible, analyze the evolving ties and work them.
In 2008, a group of 10 Facebook group owners came together for the Facebook Groups in Business Investigation (FGIBI). Our original plan was to map the relationships that new members joining a group had to existing members.
I’d learned from Valdis Krebs (in analyzing results of the Online Communities in Business Study 2004 with Joe Cothrel) that 1st-degree ties are interesting, but for understanding influence, 2nd-degree ties are more important. Hence, in our Facebook Groups Investigation, we wanted to track the relationships of new people joining. Were they first-degree ties to the group owner or potentially more valuable distant ties? Study member Kimberly Samaha, owner of our participating Bordeaux Colloquium Group, persevered in manually tracking ties, but for large groups this is impossible.
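To show what automating that tracking might look like (a sketch with invented names and ties, using the networkx library rather than anything the study used), classifying each new member by degree of separation from the group owner is straightforward once the network exists as data:

```python
import networkx as nx

# Classify new group members as 1st-, 2nd-, or more distant ties of the
# group owner. All names and edges are made up for illustration.
G = nx.Graph()
G.add_edges_from([
    ("owner", "ana"), ("owner", "ben"),  # the owner's direct ties
    ("ana", "carla"), ("ben", "dev"),    # friends of friends
    ("carla", "ed"),                     # a 3rd-degree tie
])

new_members = ["ana", "dev", "ed"]
distance = nx.single_source_shortest_path_length(G, "owner")
for member in new_members:
    print(f"{member} joined as a degree-{distance[member]} tie")
# ana: degree 1; dev: degree 2; ed: degree 3
```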
I welcome dissenting views but make the case that the real value and potential of online social networking in the next 40 years will come when platforms like Facebook and LinkedIn integrate network analysis. IBM Atlas and Trampoline Systems are working to deliver this knowledge inside enterprises.
On November 6 I’ll be in Chicago for a Collective Intelligence Summit. As I investigate the platform providers sponsoring the event, I’ll be looking to see whether network visualization and analysis are included.
Please take a moment to share what you think will be most impactful in the Internet’s next 40 years.
~ Jenny Ambrozek
by Celine Roque
Have you ever looked at an image and thought, “I think I’ve seen that somewhere,” but couldn’t quite place it? Wouldn’t it be great to just grab a picture and run it through a search engine? Google has an image search, but it runs on keywords, not real image comparisons. While we wait for Google to develop and polish one, we can use tools such as TinEye.
TinEye, created by Idée Inc., is a reverse image search engine. Upload an image, and this tool will tell you where matches can be found on the web, so you can trace the original source, possibly learn about its history, and get a hi-res version. If you are the owner or creator of an image, use TinEye to track how your work is being used by others and see the modifications they’ve made, if any.
Like most modern search engines, TinEye uses crawlers to look for images around the web. Right now they have over a billion images in their index – quite a small number if you think about the ever-expanding volume of Internet content. Here’s how it works:
“When you submit an image to be searched, TinEye creates a unique and compact digital signature or ‘fingerprint’ for it, then compares this fingerprint to every other image in our index to retrieve matches. TinEye can even find a partial fingerprint match. TinEye does not typically find similar images (i.e. a different image with the same subject matter [faces]); it finds exact matches including those that have been cropped, edited or resized [logos, symbols].”
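TinEye’s actual signature algorithm is proprietary, but a toy “average hash” fingerprint conveys the flavor: shrink the image, threshold each pixel against the mean brightness, and compare fingerprints by Hamming distance (the file names below are hypothetical):

```python
from PIL import Image

def fingerprint(path):
    # Shrink to 8x8 grayscale, then mark each pixel as above or below
    # the average brightness; resizing and mild edits barely change it.
    img = Image.open(path).convert("L").resize((8, 8))
    pixels = list(img.getdata())
    average = sum(pixels) / len(pixels)
    return [1 if p > average else 0 for p in pixels]

def distance(fp_a, fp_b):
    # Hamming distance: a small value means a likely match.
    return sum(a != b for a, b in zip(fp_a, fp_b))

# A low distance would flag these two files as versions of one image.
print(distance(fingerprint("original.jpg"), fingerprint("resized.jpg")))
```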
The search engine works with JPEG, PNG and GIF files, with optimum results for images greater than 300×300 pixels and without clear watermarks. Maximum file size accepted is 1MB. For unregistered users, uploaded images are automatically discarded after 72 hours. Registration is free, and allows you to keep the file and retain a permalink to the query.
To see examples of what TinEye is capable of, have a look at their Cool Searches page.