Data Pre-Processing (Concept Identification)


The quality of the map (or network) extracted from the text can be enhanced by pre-processing the data prior to running the analysis: text pre-processing condenses the data to the concepts that capture the features of the texts that are relevant to the analyst. This technique is also the first step in the procedure of performing meta-matrix text analysis (see section 4). In a previous publication we have described text pre-processing strategies and results with AutoMap in detail (Diesner & Carley, 2004). As a first pre-processing technique we applied a delete list customized for this dataset¹. Deletion removes non-content-bearing concepts such as conjunctions and articles from texts (Carley, 1993). This reduces the number of concepts the analyst needs to consider when forming thesauri. Then we stemmed the texts with the AutoMap stemmer, which is based on the Porter stemmer (Porter, 1980). Stemming detects inflections and derivations of concepts in order to convert each concept into the related morpheme (Jurafsky & Martin, 2000). Stemming simplifies the process of constructing a generalization thesaurus and can often eliminate spelling errors and typos. Finally, we used AutoMap's Named-Entity Recognition functionality. Named-Entity Recognition retrieves concepts such as proper names, numerals, and abbreviations contained in a text set (Magnini, Negri, Prevete & Tanev, 2002). This technique helps to index agents, organizations, places, and events, and it facilitates building the meta-matrix thesaurus. There were 591 named entities in our dataset. This list of named entities was used to:

1. Translate relevant phrases into a unit that will be recognized as a single concept. This can be realized in the generalization thesaurus in AutoMap by, e.g., replacing the spaces within a phrase with underscores so that its words form a single concept.

Examples:
Holi War into Holy_War. The apparent misspelling of Holi results from stemming.
Golan Height into Golan_Heights.

Table 4. Dataset: number of texts in which each terror group appears

Source                 Aksa  Fatah  Hamas  Hezbollah  Islamic Jihad  al Qaeda
The Washington Post     2     1      2      1          1              2
The New York Times      1     2      3      2          2              1
The Economist           1     2      4      1          2              1
Total                   4     5      9      4          5              4

2. Translate people's names, the various versions of their names as they appear in the data set, and the aliases and synonyms that these people use into one unique name per person (noting the organization that this person is associated with).

Examples:
Dr. Abdel Aziz Rantisi and Dr. Rantisi into Aziz_Al-Rantisi, who is a member of Hamas.
Mahmoud Abba and Abu Mazen into Mahmoud_Abbas, who is a member of the Palestinian Authority.

3. Translate various spellings of a group and synonyms for groups into one unique

name of the related group or organization.

Examples:

Hizbullah into Hezbollah.

Islamic Resistance Movement into Hamas.
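To make the pre-processing steps described above concrete, the following minimal Python sketch mimics the deletion and stemming steps. It is not AutoMap's code: the delete list, the placeholder token, and the sample sentence are illustrative, and nltk's Porter stemmer stands in for AutoMap's Porter-based stemmer.

import re
from nltk.stem.porter import PorterStemmer  # stand-in for AutoMap's Porter-based stemmer

DELETE_LIST = {"the", "a", "an", "and", "or", "of", "to", "in"}  # customized per dataset

def apply_delete_list(tokens, placeholder="xxx"):
    # With AutoMap's rhetorical adjacency option, deleted concepts are replaced by
    # placeholders so that only concepts that were near each other in the original
    # text can later form statements (see Endnote 1); the placeholder mimics that.
    return [t if t.lower() not in DELETE_LIST else placeholder for t in tokens]

def stem(tokens, stemmer=PorterStemmer()):
    # Reduce inflections and derivations to a common morpheme, e.g. "members" -> "member".
    return [stemmer.stem(t) for t in tokens]

tokens = re.findall(r"[A-Za-z'-]+", "The members of the network moved the weapons to the camp")
print(stem(apply_delete_list(tokens)))
# ['xxx', 'member', 'xxx', 'xxx', 'network', 'move', 'xxx', 'weapon', 'xxx', 'xxx', 'camp']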

Thesaurus Creation

The resulting 170 pairs of associations of text-level concepts with higher-level concepts formed a generalization thesaurus. As noted, a generalization thesaurus translates text-level concepts into higher-level concepts. A single higher-level concept typically has multiple text-level entries associated with it in a thesaurus. For example, Imad Falouji (the higher-level concept), a Hamas member, appeared in the text set as Imad Falouji and Mr. Falouji (two related text-level concepts). The more text-level entries are associated with a higher-level concept, the greater the level of generalization being employed by the analyst.
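A generalization thesaurus can be thought of as a mapping from text-level concepts to higher-level concepts. The sketch below applies such a mapping to raw lowercased text; the entries are taken from the examples above, but the function itself is only an illustration (AutoMap applies the thesaurus to the pre-processed texts, and its matching rules may differ).

GENERALIZATION_THESAURUS = {
    "holi war": "Holy_War",
    "golan height": "Golan_Heights",
    "dr. abdel aziz rantisi": "Aziz_Al-Rantisi",
    "dr. rantisi": "Aziz_Al-Rantisi",
    "imad falouji": "Imad_Falouji",
    "mr. falouji": "Imad_Falouji",
    "hizbullah": "Hezbollah",
    "islamic resistance movement": "Hamas",
}

def generalize(text):
    # Replace longer phrases first so that sub-phrases do not pre-empt them.
    lowered = text.lower()
    for phrase in sorted(GENERALIZATION_THESAURUS, key=len, reverse=True):
        lowered = lowered.replace(phrase, GENERALIZATION_THESAURUS[phrase])
    return lowered

print(generalize("Dr. Rantisi spoke for the Islamic Resistance Movement"))
# "Aziz_Al-Rantisi spoke for the Hamas"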

Since no pre-defined thesaurus was available to us that would have matched terrorism-related concepts to meta-matrix entity classes, we built a second generalization thesaurus ourselves. After applying the first generalization thesaurus, we built and applied this second thesaurus, which contains 50 entries and translates people's names into the organizations or more abstract groups with which these people are associated. We used the following basic guidelines:

1. Translate members of the six terrorist groups that the data set focuses on into the related terrorist organization.

Example:
Aziz Al-Rantisi into Hamas.

2. Translate representatives of the governments of various countries into the country's government.

Examples:

Omar Sulieman into Egypt_Government.

Mahmoud Abbas into Palestinian_Authority.

3. Translate people's names into organizations or abstract groups that they belong to.

Examples:

Hans_Blix, Kofi_Annan, and Michael_Chandler into UN.
Hanadi Jaradat and Saed Hanani into Suicide_Bomber.
Haviv_Dodon, Muhammad_Faraj, and Samer_Ufi into Victim_Killed.

In doing this, the basic principle we applied was to retain specific actors, namely those who appeared to play primary roles, whereas secondary actors were reclassified by their role, such as victim. Not all names of people that can be associated with a group were translated into the related group. We applied this strategy so that these names, which we assigned to the entity class Agent in the meta-matrix thesaurus applied after the second generalization thesaurus, could later be retranslated from the Agent class back into the names of key players relevant to us in a sub-matrix text analysis, which can be run after the meta-matrix text analysis. Names that we decided not to match with an organization include, for example, Osama bin Laden, Yasser Arafat, and Ariel Sharon. How much detail is maintained in this way always depends on the research question or goal. Our goal was to detect the network structure of terrorist groups.

After finishing the generalization process² we built and employed a meta-matrix thesaurus. In order to support the analyst in matching text-level concepts against meta-matrix categories, AutoMap offers the options to a) load a list of all unique concepts appearing in the text set into the leftmost column of the meta-matrix thesaurus, or b) save a list of the union of all unique concepts to a directory of the analyst's choice. In the next step the analyst has to go through this list manually and decide whether or not to associate each single concept with one or more meta-matrix categories. Our dataset contained 2,083 unique concepts after applying the generalization thesauri. Of these unique concepts, 303 were assigned to entity classes in the meta-matrix, 23 of them to two entity classes (Table 5, sum of column one). The remaining 1,780 of the 2,083 unique concepts were not assigned to any meta-matrix entity class but were kept as non-categorized concepts. The creation of a meta-matrix thesaurus is step 3, concept classification, in the procedure of performing meta-matrix text analysis (see section 4).
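Conceptually, the meta-matrix thesaurus is a second mapping, this time from higher-level concepts to one or more entity classes. The sketch below shows the idea; the specific class assignments are invented for illustration and are not the thesaurus used in this study.

META_MATRIX_THESAURUS = {
    "Hamas": ["Organization"],
    "Hezbollah": ["Organization"],
    "Aziz_Al-Rantisi": ["Agent"],
    "Golan_Heights": ["Location"],
    "money": ["Resource"],
    "training_camp": ["Resource", "Location"],  # example of a cross-classified concept
    "arrest": ["Task-Event"],
}

def classify(concept):
    # Concepts not assigned to any entity class are kept as non-categorized concepts.
    return META_MATRIX_THESAURUS.get(concept, [])

print(classify("Hamas"))          # ['Organization']
print(classify("peace_process"))  # [] -> non-categorized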

In the next step we applied the meta-matrix thesaurus to the data set³ and ran a meta-matrix text analysis on the pre-processed text set⁴. This technique forms step 4, perform map analysis, in the procedure of performing meta-matrix text analysis (see section 4).
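The statement formation settings reported in Endnote 4 (uni-directional links, window size 4) can be approximated as follows; the placeholder handling mimics rhetorical adjacency, but the exact window semantics of AutoMap may differ from this sketch.

from collections import Counter

def form_statements(tokens, window=4, placeholder="xxx"):
    # Form directed statements between concepts that occur within the same window;
    # placeholders from the delete list occupy positions but never enter statements.
    statements = Counter()
    for i, source in enumerate(tokens):
        if source == placeholder:
            continue
        for target in tokens[i + 1 : i + window]:
            if target != placeholder:
                statements[(source, target)] += 1  # uni-directional: source precedes target
    return statements

tokens = ["Hamas", "xxx", "money", "xxx", "Treasury_Department"]
print(form_statements(tokens))
# {('Hamas', 'money'): 1, ('money', 'Treasury_Department'): 1}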

Characteristics of the Textual Networks as Meta-Matrices (Graph and Analyze Results)

In this section, we report the results of the meta-matrix text analysis and sub-matrix text analysis we ran on our data set. This task is step 5 in the procedure of performing meta-matrix text analysis. The intent in this section is to illustrate the type of results and graphs possible using the proposed meta-matrix approach to NTA, not to present a comprehensive analysis of terrorist networks. In doing this example, we will analyze: 1) unique and total frequencies of the concepts and statements, 2) unique and total frequencies of the statements that were formed from concepts associated with meta-matrix entity classes, and 3) the distribution of statements formed from meta-matrix entity classes across the data set.

For our analysis we considered the six meta-matrix entity classes in Table 3; at the entity-class level we therefore have six unique concepts. Considering only concepts that fall into one or more of these categories, we found an average of 99.2 total concepts per text, ranging from 37 to 163. Based on these concepts, an average of 18.9 unique statements (ranging from 8 to 29) and 45.7 total statements (ranging from 12 to 84) were formed per text. Thus, on average, each unique statement appeared 2.4 times per text. Theoretically, each text could contain up to 36 unique statements, corresponding to the 6 x 6 ordered pairs of entity classes (including same-class pairs). This theoretical maximum would be achieved if at least one concept were associated with each entity class, and at least one concept of each entity class formed a statement with at least one concept in every entity class. The multiple occurrences of unique statements are expressed in the number of total statements.

Table 5. Creation and application of the meta-matrix thesaurus (sorted by frequency)

Columns: (a) number of concepts assigned to the entity class in the meta-matrix thesaurus; (b) total appearances of the entity class in the texts after application of the meta-matrix thesaurus; (c) total number of times concepts associated with the entity class were linked into statements.

Category        (a)    (b)    (c)
Organization     48    569    434
Location         81    404    404
Agent            54    250    217
Resource         75    261    188
Task-Event       27    168    146
Knowledge        41    134    128

Across the 18 meta-matrices extracted from our sample texts, 822 total statements were formed within and between the cells of the meta-matrix (see Table 6 for the distribution of total statements across the meta-matrix). Notice that the upper and lower triangles of the meta-matrix in Table 6 are not symmetric. For example, in Table 6 from Resource (row) to Organization (column) there are a total of 23 statements, but from Organization (row) to Resource (column) there are a total of 35 statements. Indeed, there is no need for symmetry, as the relations between concepts (edges between nodes) found with AutoMap are directed, which is inherently pre-defined by the directed structure of language. The results in Table 6 show that concepts associated with each meta-matrix entity class appear approximately as often in posterior positions of statements (last row in Table 6) as in anterior positions (last column in Table 6). Thus, the in-degree or receptivity of a meta-matrix entity class approximately equals the out-degree or expansiveness of the class. This is due, in part, to the use of proximity in the text to place links among concepts and reflects, if anything, the lack of overly stylized sentential form.

Within the meta-matrix, the entity class that linked most frequently to other entity classes was Organization (179 links), followed by Location (108), Agent (95), Resource (71), Task-Event (66), and Knowledge (53). If we look not at these absolute values but at the percentage of links that each meta-matrix entity class forms with the same versus other entity classes, our results reveal that concepts in the entity class Task-Event are the most likely to be connected to concepts in classes other than their own, whereas concepts in the entity class Location are the most likely to link to other Location concepts (Table 7).
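The percentages in Table 7 follow directly from the rows of Table 6: for each entity class, the share of its outgoing statements that stay within the same class versus the share that go to other classes. A short script reproducing them:

TABLE_6 = {               # rows of Table 6: class -> counts toward (A, K, R, T, O, L)
    "Agent":        [24,  8,  8, 12, 55, 12],
    "Knowledge":    [10, 18,  9,  3, 20, 11],
    "Resource":     [ 8,  9, 39, 11, 23, 20],
    "Task-Event":   [13,  7,  9, 10, 20, 17],
    "Organization": [58, 23, 35, 19, 90, 44],
    "Location":     [ 9, 10, 17, 25, 47, 69],
}
ORDER = ["Agent", "Knowledge", "Resource", "Task-Event", "Organization", "Location"]

for cls, row in TABLE_6.items():
    same = row[ORDER.index(cls)]
    total = sum(row)
    print(f"{cls}: {100 * same / total:.0f}% same class, "
          f"{100 * (total - same) / total:.0f}% other classes")
# Output matches Table 7, e.g. Task-Event: 13% / 87%, Location: 39% / 61%.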

Furthermore, the results indicate that within the networks that we extracted from the texts,

most information refers to membership networks (13.8% of all statements, Figure 1).

Table 6. Number of links (total number of statements) between meta-matrix categories

Meta-Matrix     Agent  Knowledge  Resource  Task-Event  Organization  Location  Sum
Agent              24          8         8          12            55        12  119
Knowledge          10         18         9           3            20        11   71
Resource            8          9        39          11            23        20  110
Task-Event         13          7         9          10            20        17   76
Organization       58         23        35          19            90        44  269
Location            9         10        17          25            47        69  177
Sum               122         75       117          80           255       173  822

Table 7. Linkage of meta-matrix entity classes

Meta-Matrix entity class   With same entity class (%)   With other entity classes (%)
Task-Event                             13                            87
Agent                                  20                            80
Knowledge                              25                            75
Organization                           33                            67
Resource                               35                            65
Location                               39                            61

There is also substantial information on inter-organizational networks (11.1%) and organizational location networks (10.4%). The least information is provided on precedence networks (1.2%) and knowledge requirement networks (1.2%). This suggests that more is known, or at least presented in the news, about who the terrorists are and where they are than about what they do and when, what they need to know in order to engage in such actions, or why.

The analysis of the distribution of statements formed from meta-matrix entity classes

across the text set reveals that all entities are covered in at least one third of the texts.

In addition, Organization, Location, and Agent classes appear in more than half of the

texts (Table 8). Again, this suggests that more is reported about who and where than

about what, how and why. We note that a human reading of these texts may pick up a

little more about what and how, although such information does appear to be less

common in general in the texts used for this purely illustrative analysis.

Table 8. Number of texts in which each link appears

Meta-Matrix     Agent  Knowledge  Resource  Task-Event  Organization  Location  Mean
Agent              13          5         6          10            17         9  10.0
Knowledge           7          9         5           3             9         5   6.3
Resource            4          4         9           7            12        11   7.8
Task-Event          9          3         7           4            11        10   7.3
Organization       17         11        13          11            18        16  14.3
Location            7          7        10          11            17        14  11.0
Mean              9.5        6.5       8.3         7.7          14.0      10.8   9.5

Figure 1. Total number of links between meta-matrix categories. The bar chart plots the total number of statements per meta-matrix cell, sorted in descending order: AO/OA 113, OL/LO 91, OO 90, LL 69, RO/OR 58, KO/OK 43, TL/LT 42, RR 39, TO/OT 39, RL/LR 37, AT/TA 25, AA 24, AL/LA 21, KL/LK 21, RT/TR 20, KK 18, KR/RK 18, AK/KA 18, AR/RA 16, KT/TK 10, TT 10 (letter codes: A = Agent, K = Knowledge, R = Resource, T = Task/Event, O = Organization, L = Location).

In Figure 1 and Tables 6 and 8, we have been discussing the total links or statements.

Looking at the total links provides information about the overall structure of the

discussion and the elements of the structure (agents, knowledge, etc.) that are considered

critical by the authors or for which they have a wealth of information. It is often useful

to ask about unique links, however, if we want to understand the structure itself. In Figure

2, we display the number of links per sub-matrix that are unique. That is, a link or statement

is only counted once regardless of how many texts it appears in.

Comparison of Figures 1 and 2 shows that a great deal of information, particularly in the Agent-to-Agent sub-matrix, is repeated across texts. This suggests that either many of the texts were discussing the same information (repetition), or they got their information from the same source. Note that if we knew that each source was unique, then the difference between the total (Figure 1) and the unique (Figure 2) would be an indicator of the reliability of the information.
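As a rough sketch of that idea, the ratio of total statements (Figure 1) to unique statements (Figure 2) per cell gives an average repetition count per unique statement; the three cells below use the values read off the two figures:

totals  = {"AO/OA": 113, "OL/LO": 91, "OO": 90}   # from Figure 1
uniques = {"AO/OA": 34,  "OL/LO": 33, "OO": 18}   # from Figure 2

for cell, total in totals.items():
    # Average number of times each unique statement in the cell appears across texts.
    print(cell, round(total / uniques[cell], 1))
# AO/OA 3.3, OL/LO 2.8, OO 5.0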

The overall structure for this covert network is very sparse. In some sense, based on these texts, more is known about the affiliations, locations, resources, and knowledge of agents and organizations than is known about the interrelations of knowledge, resources, and tasks (Table 9).

Figure 2. Number of unique links between meta-matrix categories. The bar chart plots the number of unique statements per meta-matrix cell, sorted in descending order: AO/OA 34, OL/LO 33, RO/OR 25, TO/OT 22, RL/LR 21, TL/LT 21, KO/OK 20, AT/TA 19, OO 18, AL/LA 16, LL 14, RT/TR 14, AA 13, AK/KA 12, KL/LK 12, AR/RA 10, KK 9, KR/RK 9, RR 9, KT/TK 6, TT 4.

Legend: A = Agent, K = Knowledge, R = Resource, T = Task/Event, O = Organization, L = Location.

Table 9. Number of unique links between meta-matrix categories

Meta-Matrix     Agent  Knowledge  Resource  Task-Event  Organization  Location
Agent              13         12        10          19            34        16
Knowledge                      9         9           6            20        12
Resource                                 9          14            25        21
Task-Event                                           4            22        21
Organization                                                      18        33
Location                                                                    14

Further, if we compare the number of unique links (Table 9) to the number of texts that contain links for each sub-matrix (Table 8), we see that the two tables are similar. In other words, many links appear in only one text. It is interesting to note which sub-matrices have more unique links than texts, e.g., the Agent-by-Knowledge and the Organization-by-Knowledge sub-matrices. This indicates that texts that discuss the knowledge network tend to do so by discussing multiple linkages (e.g., all of these people know item z), whereas texts that discuss, e.g., the social network (Agent-by-Agent) are more likely to simply talk about a single pair of actors and the nature of their relationship.

Whether this pattern of reporting would hold in other cultures is debatable.

Beyond learning about the network structure of the meta-matrices and the distribution of concepts and connections between them across the sample data, analysts might be interested in investigating in more detail the concepts and links contained in the meta-matrix. In order to gain this knowledge, sub-matrix text analysis⁵ can be run. To illustrate the results of this procedure, Tables 10 to 12 show the resulting maps for two sample texts; a map contains one coded statement per line together with its frequency.

Table 10. Who has what means? Organizational capability network (organization by resource). Statements formed from higher-level concepts (sub-matrix analysis).

Sample text 1:                      Sample text 2:
1 Al-Qaeda - training camp          1 Al-Aksa - assets
1 network - Hawala                  1 Al-Aksa - money
1 Hawala - money                    1 Hamas - sponsoring
1 finance - network                 1 aid - Hamas
1 camp - US-Government              1 aid - Treasury Department
1 money - Hamas
1 support - Hamas
1 Treasury - assistance
1 US-Government - assistance
1 assets - Treasury Department

Table 11. Who knows what? Knowledge network (agent by knowledge). Statements formed from higher-level concepts (sub-matrix analysis).

Sample text 1:                      Sample text 2:
1 chairman - monitoring             1 FBI - Analyst
1 evidence - Saddam Hussein

Table 12. Who is located where and does what? Localized assignment network (agent by task-event by location). Statements formed from higher-level concepts (sub-matrix analysis).

Sample text 1:                      Sample text 2:
1 Saddam Hussein - Iraq             1 arrest - Leader
                                    1 Leader - Germany

These various sub-matrix networks enable a better understanding of which attributes of the meta-matrix link to other attributes, and with what strength. All three sub-matrices together enable a broader view of the situation. Figures 3 and 4 illustrate this broader picture. The comparison of Figures 3 and 4 shows that text 1 presents a more disconnected story than does text 2. Further, even if the two stories were combined, the overall map would tell us little about the structure of the two terrorist groups, al-Qaeda and Hamas.

Figure 3. Visualization of sub-matrices from sample text 1

Figure 4. Visualization of sub-matrices from sample text 2

Meta-matrix data and sub-matrix data generated with AutoMap can be saved and then re-analyzed outside of AutoMap using standard social network analysis tools. AutoMap can both code these networks and then output them in two useful exchange formats for use with other network analysis tools: DL for UCINET and DyNetML for ORA (www.casos.cs.cmu.edu/projects/ORA; Carley & Reminga, 2004). For this chapter, we use ORA as it enables the analysis of all the cells in the meta-matrix at once. In either case, the combination of text and network analysis enables the analyst to readily combine rich textual data with organizational data collected through other methods, thus enhancing the analysis process.
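For analysts who want to move such networks between tools by hand, the sketch below writes an entity-class matrix as a UCINET-style DL full-matrix file. AutoMap exports DL and DyNetML itself; this writer is only illustrative, and the exact DL dialect is an assumption to check against the UCINET documentation.

ORDER = ["Agent", "Knowledge", "Resource", "Task-Event", "Organization", "Location"]

def write_dl(matrix, labels, path):
    # matrix: dict mapping (row_label, col_label) -> weight; missing cells default to 0.
    with open(path, "w") as f:
        f.write(f"dl n={len(labels)}\n")
        f.write("format = fullmatrix\n")
        f.write("labels:\n")
        f.write(",".join(labels) + "\n")
        f.write("data:\n")
        for row in labels:
            f.write(" ".join(str(matrix.get((row, col), 0)) for col in labels) + "\n")

# Example: the Agent/Organization cells from Table 6.
write_dl({("Agent", "Organization"): 55, ("Organization", "Agent"): 58},
         ORDER, "meta_matrix.dl")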

Discussion: Features and Limitations

The techniques of meta-matrix text analysis and sub-matrix text analysis described herein

can support analysts in investigating the network structure of social and organizational

systems that are represented in textual data. Furthermore, these novel and integrative

methods enable analysts to classify words in texts into entity classes (node types)

associated with networks common to organizational structures according to a theoretically

and empirically validated ontology — the meta-matrix.

The validity of the method and the results presented in this chapter is constrained by the limited experience we have gained so far with these novel techniques, the small number of texts analyzed, and the implementation of the techniques in a single software package. The tool should also be applied to multiple, larger data sets.

Lessons Learned

In general, we find that the entity-name recognizer greatly enhances the ability to locate

concepts associated with the meta-matrix ontology. In particular, it facilitates locating

Agents, Organizations, and Locations. For entity classes that are less associated with

proper nouns, the name recognizer is of less value.

Coding texts using AutoMap is not a completely automated process. However, AutoMap

does provide a high degree of automation that assists the user and increases the

efficiency and effectiveness of meta-matrix text analysis in comparison to manual coding.

As with most text analysis techniques that seek to extract meaning, significant manual

effort needs to be expended on constructing the delete list and thesauri, even though the

method is computer-supported. For example, the delete list used in this study took 30

minutes to construct. However, the thesauri (and there are three) took four days to

construct. Thesauri minimize miscoding, such as missed relations due to aliases, misspellings, and differences in the underlying language. Analysts have to decide, according to their goals and resources, on a trade-off between the speed of the computer-supported research process and the improvement in the quality of automated coding that comes from manually creating and refining the pre-processing tools.

It is worth noting that significant improvement over straight manual coding can be

achieved by building thesauri and delete lists based on only a fraction of texts. As more

texts in this domain are coded, we will have to expend relatively little additional effort to

expand the delete list and thesauri. For example, we suspect that hundreds of additional

texts will be codable with maybe only a day more attention to the thesauri. The reason

is that, when in the same domain, construction of thesauri is like building a sample via

the snowball method (i.e., with each iteration fewer and fewer novel concepts are found).

How large that fraction should be is a point for future work. However, preliminary studies

suggest 10% is probably sufficient. Future work should explore whether intelligent data

mining and machine learning techniques can be combined with social network analysis

and text analysis to provide a more automated approach to constructing thesauri on the

fly.

We also find that the higher the level of generalization used in the generalization

thesaurus, the greater the ability to compare two diverse texts. Not counting typographical

errors, often the translation of two to ten text-level concepts per high-level concept

seems sufficient to generate a “language” for the domain being studied.

We note that when forming thesauri, it is often critical to keep track of why certain types

of concepts are generalized into others. At the moment there is no way to keep that

rationalization within AutoMap. In general, the user should keep a lab notebook or readme

file for keeping such rationalizations.

Finally, we note that for extracting social or organizational structure from texts a large

corpus is needed. The point here is comprehensiveness, not necessarily a specific

number of texts. Thus, one might use the entire content of a book that describes and

discusses an organization or a large set of newspaper articles. In building this corpus,

not all texts have to be of the same type. Thus, the analyst can combine newspaper

reports, books, board-of-directors reports, Web pages, etc. Once the networks are

extracted via AutoMap they can be combined into a comprehensive description of the

organization being examined. Further, the analyst needs to pre-define what the basic

criteria are for including a text in the corpus — e.g., it might be publication venue, time

frame, geographic area, specific people, organizations, or locations mentioned.

Considerations for Future Work

We also note that the higher the level of generalization, the more ideas are being inferred

from, rather than extracted from, the texts. Research needs to be done on the appropriate

levels of generalization. Note that the level of generalization can be measured as the

average number of text-level concepts associated with each higher level concept.
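This measure can be computed directly from a generalization thesaurus; a minimal sketch with made-up entries:

from collections import Counter

def generalization_level(thesaurus):
    # thesaurus: text-level concept -> higher-level concept;
    # the measure is the mean number of text-level entries per higher-level concept.
    entries_per_higher = Counter(thesaurus.values())
    return sum(entries_per_higher.values()) / len(entries_per_higher)

thesaurus = {"imad falouji": "Imad_Falouji",
             "mr. falouji": "Imad_Falouji",
             "hizbullah": "Hezbollah"}
print(generalization_level(thesaurus))  # (2 + 1) / 2 = 1.5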

One of the strengths of NTA is that the networks extracted from the texts can be combined

in a set theoretic fashion. So we can talk about the network formed by the union or

intersection of the set of networks drawn from the set of texts. When combining these

networks we can, for each statement, track the number of texts that contained that

statement. Since a statement is a relation connecting two concepts, this approach

effectively provides a weight for that relation. Alternatively, the analyst can compute

whether any text contained that statement. In this case, there are no weights and the links

in the network are simply present or not (binary). If these texts represent diverse sources

of information, then the weights are indicative of the certainty or verifiability of a relation.

Future work might also explore utilizing Bayesian learning techniques for estimating the

overall confidence in a relation rather than just summing up the number of texts in which

the statement was present.
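A minimal sketch of this set-theoretic combination, using hypothetical per-text statement sets:

from collections import Counter

def union_with_weights(per_text_statements):
    # per_text_statements: one set of (source, target) statements per text.
    # The weight of a statement is the number of texts that contain it.
    weights = Counter()
    for statements in per_text_statements:
        weights.update(statements)
    return weights

def union_binary(per_text_statements):
    # Links are simply present or absent, without weights.
    return set().union(*per_text_statements)

def intersection(per_text_statements):
    return set.intersection(*per_text_statements)

texts = [{("Hamas", "money")},
         {("Hamas", "money"), ("Al-Qaeda", "training_camp")}]
print(union_with_weights(texts))  # ('Hamas','money'): 2, ('Al-Qaeda','training_camp'): 1
print(union_binary(texts))
print(intersection(texts))        # {('Hamas', 'money')}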

We also note that when people read texts there is a process of automatic inference. For

example, when people read about a child talking to a parent they infer based on social

experience that the child is younger. Similarly, it appears that such inferences are common

between the entity classes. For example, if Agent X has Resource Y and Knowledge K

is needed to use Resource Y, then in general Agent X will have Knowledge K. Future work

needs to investigate whether a simple inference engine at the entity class level would

facilitate coding. We note that previous work found that using expert systems to assist

coding in terms of adding general social knowledge was quite effective (Carley, 1988).

Thus, we expect this to be a promising avenue for future research.
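A toy illustration of such an inference rule at the entity-class level (the statements and the rule are hypothetical; no such engine is part of AutoMap):

def infer_agent_knowledge(agent_has_resource, resource_needs_knowledge):
    # Rule: if an agent has a resource and knowledge K is needed to use that
    # resource, infer that the agent has knowledge K.
    inferred = set()
    for agent, resource in agent_has_resource:
        for res, knowledge in resource_needs_knowledge:
            if res == resource:
                inferred.add((agent, knowledge))
    return inferred

print(infer_agent_knowledge({("Agent_X", "Resource_Y")},
                            {("Resource_Y", "Knowledge_K")}))
# {('Agent_X', 'Knowledge_K')}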

Finally, we note that the use of an ontology adds a hierarchical level to the coding. This

is invaluable from an interpretative perspective. There is no reason, conceptually, why

multiple hierarchical levels could not be added, denoting finer and finer levels of detail.

We suspect however, based on the use of hierarchical coding schemes in various

scientific fields (e.g., biology and organization theory) that: a) such hierarchies are likely

to not be infinitely deep, b) a certain level of theoretical maturity and consensus in a field

is needed for such a hierarchy to be generally useful, and c) eventually we will need to

move beyond such a “flat” scheme for extracting meaning. As to this last point, by flat

what we are referring to is the fact that a hierarchy can be completely represented in two

dimensions. We found, even when doing this limited coding, that some text-level

concepts and higher-level concepts needed to be cross-classified into two or more entity

classes. As more levels are added in an ontological hierarchy, such cross classification

is likely to occur at each level, resulting in a network of inference, not a simple hierarchy

and so a non-flat structure. Future work should examine how to code, represent, and

reason about such networks.

Conclusion

One of the key advantages of classic content analysis was that macro social change could

be tracked by changes in content, and over- or under-representation of various words.

For example, movements toward war might be signaled by an increasing usage of words

describing hostile acts, foreign powers, and weapons. One of the key advantages of

Network Text Analysis (NTA) over standard text analysis is that it enables the extraction

of meaning and enables interpretation by signaling not just what words are used but how

they are used. This enables differences and similarities in viewpoints to be examined, and

it enables the tracking of micro social change as evidenced by changes in meaning. By

adding an ontology to NTA, differences and similarities in viewpoints about a meta-structure described or discussed in the text can be examined.

In this chapter, we used the meta-matrix ontology as we were interested in the underlying

social/organizational structure described in the texts. Several points are critical to note.

First, the mere fact that we used an ontology to define a set of meta-concepts enables

the extraction of a hierarchy of meaning, thus affording the analyst greater interpretive

ability. Second, any ontology could be used, and the analyst needs to consider the

appropriate ontology for their work. In creating this ontology the analyst wants to think

in terms of the set of entity classes and the relations among them that define the second

level network of interest. For us, these entity classes and relations were those relevant

to defining the organizational structure of a group.

The proposed meta-matrix approach to text analysis makes it possible to track more micro

social change in terms of changes, not just in meaning, but in the social and organizational

structures. Using techniques such as this facilitates a more systematic analysis of

groups, broadens the types of questions that can be effectively answered using texts,

and brings the richness of textual information to bear in defining and understanding the

structure of the organizations and society in which we live.

Acknowledgments

We want to thank Maksim Tsvetovat and Jeffrey Reminga from CASOS, CMU for helping

with generating the visualizations.

References

Alexa, M. (1997). Computer-assisted text analysis methodology in the social sciences.

Arbeitsbericht: ZUMA.

Bakker, R.R. (1987). Knowledge graphs: Representation and structuring of scientific

knowledge. Dissertation. University of Twente.

Batagelj, V., Mrvary, A., & Zaveršnik, M. (2002). Network analysis of texts. In T. Erjavec

& J. Gros (Eds.), Proceedings of the 5th International Multi-Conference Information

Society - Language Technologies (pp. 143-148). Ljubljana, October. Jezikovne

tehnologije / Language Technologies, Ljubljana.

Burkart, M. (1997). Thesaurus. In M. Buder, W. Rehfeld, T. Seeger, & D. Strauch (Eds.),

Grundlagen der Praktischen Information und Dokumentation: Ein Handbuch

zur Einführung in die Fachliche Informationsarbeit (pp. 160-179) (4th edition).

München: Saur.

Carley, K.M., & Reminga, J. (2004). ORA: Organization risk analyzer. Carnegie Mellon

University. School of Computer Science, Institute for Software Research International,

Technical Report CMU-ISRI-04-101.

Carley, K.M. (2003). Dynamic network analysis. In R. Breiger, K.M. Carley, & P. Pattison

(Eds.), Summary of the NRC workshop on social network modeling and analysis

(pp. 133-145). Committee on Human Factors, National Research Council.

Carley, K.M. (2002). Smart agents and organizations of the future. In L. Lievrouw & S.

Livingstone (Eds.), The handbook of new media (pp. 206-220). Thousand Oaks,

CA: Sage.

Carley, K.M. (1997a). Extracting team mental models through textual analysis. Journal

of Organizational Behavior, 18, 533-558.

Carley, K.M. (1997b). Network text analysis: The network position of concepts. In C.W.

Roberts (Ed.), Text analysis for the social sciences (pp. 79-102). Mahwah, NJ:

Lawrence Erlbaum.

Carley, K.M. (1993). Coding choices for textual analysis: A comparison of content

analysis and map analysis. In P. Marsden (Ed.), Sociological Methodology, 23, 75-

126. Oxford: Blackwell.

Carley, K.M. (1988). Formalizing the social expert’s knowledge. Sociological Methods

and Research, 17(2), 165-232.

Carley, K.M. (1986). An approach for relating social structure to cognitive structure.

Journal of Mathematical Sociology, 12, 137-189.

Carley, K.M., Dombrowski, M., Tsvetovat, M., Reminga, J., & Kamneva, N. (2003).

Destabilizing dynamic covert networks. Proceedings of the 8th International

Command and Control Research and Technology Symposium. Washington, DC.

Evidence Based Research, Vienna, V.A.

Carley, K. M., & Hill, V. (2001). Structural change and learning within organizations. In

A. Lomi & E.R. Larsen (Eds.), Dynamics of organizations: Computational modeling

and organizational theories (pp. 63-92). Live Oak, CA: MIT Press/AAAI

Press.

Carley, K.M., & Krackhardt, D. (1999). A typology for C2 measures. In Proceedings of

the 1999 International Symposium on Command and Control Research and

Technology. Newport, RI, June.

Carley, K.M., & Palmquist, M. (1992). Extracting, representing, and analyzing mental

models. Social Forces, 70(3), 601-636.

Carley, K.M., & Reminga, J. (2004). ORA: Organizational risk analyzer. Carnegie Mellon

University, School of Computer Science, Institute for Software Research International,

Technical Report CMU-ISRI-04-106.

Carley, K.M, Ren, Y., & Krackhardt, D. (2000). Measuring and modeling change in c3i

architectures. In Proceedings of the 2000 Command and Control Research and

Technology Symposium. Naval Postgraduate School, Monterrey, CA, June, 2000.

Corman, S.R., Kuhn, T., Mcphee, R.D., & Dooley, K.J. (2002). Studying complex discursive

systems: Centering resonance analysis of communication. Human Communication Research, 28(2), 157-206.

Danowski, J. (1993). Network analysis of message content. In W.D. Richards & G.A.

Barnett (Eds.), Progress in communication science, XII (pp. 197-222). Norwood,

NJ: Ablex.

Diesner, J., & Carley, K.M. (2004). AutoMap1.2 - Extract, analyze, represent, and

compare mental models from texts. Carnegie Mellon University, School of Computer

Science, Institute for Software Research International, Technical Report

CMU-ISRI-04-100.

Galbraith, J. (1977). Organizational design. Reading, MA: Addison-Wesley.

Hill, V., & Carley, K.M. (1999). An approach to identifying consensus in a subfield: The

case of organizational culture. Poetics, 27, 1-30.

James, P. (1992). Knowledge graphs. In R.P. van der Riet & R.A. Meersman (Eds.),

Linguistic instruments in knowledge engineering (pp. 97-117). Amsterdam: Elsevier.

Jurafsky, D., & Martin, J.H. (2000). Speech and language processing. Upper Saddle

River, NJ: Prentice Hall.

Kelle, U. (1997). Theory building in qualitative research and computer programs for the

management of textual data. Sociological Research Online, 2(2). Retrieved from

the WWW at: http://www.socresonline.org.uk/2/2/1.html

Klein, H. (1997). Classification of text analysis software. In R. Klar & O. Opitz (Eds.),

Classification and knowledge organization: Proceedings of the 20th annual

conference of the Gesellschaft für Klassifikation e.V. (pp. 255-261). University of

Freiburg, Berlin. New York: Springer.

Kleinnijenhuis, J., de Ridder, J.A., & Rietberg, E.M. (1996). Reasoning in economic

discourse: An application of the network approach in economic discourse. In C.W.

Roberts (Ed.), Text analysis for the social sciences (pp. 79-102). Mahwah, NJ:

Lawrence Erlbaum.

Krackhardt, D., & Carley, K.M. (1998). A PCANS model of structure in organization. In

Proceedings of the 1998 International Symposium on Command and Control

Research and Technology Evidence Based Research (pp. 113-119). Vienna, VA.

Magnini, B., Negri, M., Prevete, R., & Tanev, H. (2002). A WordNet-based approach to

named entities recognition. In Proceedings of SemaNet’02: Building and Using

Semantic Networks (pp. 38-44). Taipei, Taiwan.

March, J.G., & Simon, H.A. (1958). Organizations. New York: Wiley.

Monge, P.R., & Contractor, N.S. (2003). Theories of communication networks. Oxford

University Press.

Monge, P.R., & Contractor, N.S. (2001). Emergence of communication networks. In F.M.

Jablin, & L.L. Putnam (Eds.), The new handbook of organizational communication:

Advances in theory, research and methods (pp. 440-502). Thousand Oaks,

CA: Sage.

Popping, R. (2003). Knowledge graphs and network text analysis. Social Science

Information, 42(1), 91-106.

Popping, R. (2000). Computer-assisted text analysis. London, Thousand Oaks: Sage.

Popping, R., & Roberts, C.W. (1997). Network approaches in text analysis. In R. Klar &

O. Opitz (Eds.), Classification and Knowledge Organization: Proceedings of the

20th annual conference of the Gesellschaft für Klassifikation e.V. (pp. 381-389),

University of Freiburg, Berlin. New York: Springer.

Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.

Reimer, U. (1997). Neue Formen der Wissensrepräsentation. In M. Buder, W. Rehfeld, T.

Seeger & D. Strauch (Eds.), Grundlagen der praktischen Information und

Dokumentation: Ein Handbuch zur Einführung in die fachliche Informationsarbeit

(pp. 180-207) (4th edition). München: Saur.

Ryan, G.W., & Bernard, H.R. (2000). Data management and analysis methods. In N.

Denzin & Y. Lincoln (Eds.), Handbook of qualitative research (pp. 769-802) (2nd

edition). Thousand Oaks, CA: Sage.

Scott, J.P. (2000). Social network analysis: A handbook (2nd edition). London: Sage.

Simon, H.A. (1973). Applying information technology to organizational design. Public

Administration Review, 33, 268-78.

Sowa, J.F. (1984). Conceptual structures: Information processing in mind and machine.

Reading, MA: Addison-Wesley.

Wasserman, S., & Faust, K. (1994). Social network analysis. Methods and applications.

Cambridge: Cambridge University Press.

Zuell, C., & Alexa, M. (2001). Automatisches Codieren von Textdaten. Ein Ueberblick

ueber neue Entwicklungen. In W. Wirth & E. Lauf (Eds.), Inhaltsanalyse –

Perspektiven, Probleme, Potenziale (pp. 303-317). Koeln: Herbert von Halem.

Endnotes

1 The delete list was applied with the rhetorical adjacency option. Rhetorical

adjacency means that text-level concepts matching entries in the delete list are

replaced by imaginary placeholders. Those placeholders ensure that only concepts that occurred within a window of each other before pre-processing can form statements (Diesner & Carley, 2004).

2 We did not choose the thesaurus content only option. Thus, adjacency does not

apply.

3 We used the thesaurus content only option in combination with the rhetorical

adjacency. Thus, the meta-matrix categories are the unique concepts.

4 We used the following statement formation settings: Directionality: uni-directional,

Window Size: 4, Text Unit: Text (for detailed information about analysis

settings in AutoMap see Diesner & Carley, 2004).

5 Sub-Matrix selection was performed with the rhetorical adjacency option.

i This work was supported in part by the National Science Foundation under grants

ITR/IM IIS-0081219, IGERT 9972762 in CASOS, and CASOS – the Center for

Computational Analysis of Social and Organizational Systems at Carnegie Mellon

University (http://www.casos.cs.cmu.edu). The views and conclusions contained

in this document are those of the authors and should not be interpreted as

representing the official policies, either expressed or implied, of the National

Science Foundation or the U.S. government.

Appendix

Software: AutoMap: Diesner, J. & Carley, K.M. (2004). AutoMap1.2: Software for

Network Text Analysis.

AutoMap is a network text analysis tool that extracts, analyzes, represents, and compares

mental models from texts. The software package performs map analysis, meta-matrix text

analysis, and sub-matrix text analysis. As an input, AutoMap takes raw, free flowing, and

unmarked texts with ASCII characters. When performing analysis, AutoMap encodes

the links between concepts in a text and builds a network of the linked concepts. As an

output, AutoMap generates representations of the extracted mental models as a map file

and a stat file per text, various term distribution lists and matrices in comma separated

value (csv) format, and outputs in DL format for UCINET and DyNetML format. The scope

of functionalities and outputs supported by AutoMap enables one way of analyzing

complex, large-scale systems and provides multi-level access to the meaning of textual

data.

Limitations: Coding in AutoMap is computer-assisted. Computer-assisted coding

means that the machine applies a set of coding rules that were defined by a human

(Ryan and Bernard, 2000, p.786; Kelle, 1997, p. 6; Klein, 1997, p. 256). Coding rules

in AutoMap imply text pre-processing. Text pre-processing condenses the data to

the concepts that capture the features of the texts that are relevant to the user. Preprocessing

techniques provided in AutoMap are Named-Entity Recognition,

Stemming, Deletion, and Thesaurus application. The creation of delete lists and

thesauri requires some manual effort (see Discussion section for details).

Hardware and software requirements: AutoMap1.2 has been implemented in Java 1.4. The

system has been validated for Windows. The installer for AutoMap1.2 for Windows and

a help file that includes examples of all AutoMap1.2 functionalities are available online

under http://www.casos.cs.cmu.edu/projects/automap/software.html at no charge. More

information about AutoMap, such as publications, sponsors, and contact information

is provided under http://www.casos.cs.cmu.edu/projects/automap/index.html.

AutoMap has been written such that the only limits on the number of texts that can be analyzed, the number of concepts that can be extracted, etc., are determined by the processing power and storage space of the user's machine.