Data Pre-Processing (Concept Identification)


The quality of the map (or network) extracted from the text can be enhanced by pre-processing the data prior to running the analysis: text pre-processing condenses the data to the concepts that capture the features of the texts that are relevant to the analyst.

This technique is also the first step in the procedure of performing meta-matrix text

analysis (see section 4). In a previous publication we have described text pre-processing

strategies and results with AutoMap in detail (Diesner & Carley, 2004). As a first preprocessing

technique we applied a delete list customized for this dataset1. Deletion

removes non-content bearing concepts such as conjunctions and articles from texts

(Carley, 1993). This reduces the number of concepts the analyst needs to consider when

forming thesauri. Then we stemmed the texts with the AutoMap stemmer, which is based

on the Porter Stemmer (Porter, 1980). Stemming detects inflections and derivations of

concepts in order to convert each concept into the related morpheme (Jurafsky & Martin,

2000). Stemming simplifies the process of constructing a generalization thesaurus and

can often eliminate spelling errors and typos. Then we used AutoMap’s Named-Entity

Recognition functionality. Named-Entity Recognition retrieves concepts such as proper

names, numerals, and abbreviations contained in a text set (Magnini, Negri, Prevete &

Tanev, 2002). This technique helps to index agents, organizations, places, and events

and facilitates building the meta-matrix thesaurus. There were 591 named entities in our

dataset. This list of named entities was used to:

1. Translate relevant phrases into a unit that will be recognized as a single concept. This can be realized in the generalization thesaurus in AutoMap by, e.g., replacing the spaces between words with underscores.

Table 4. Dataset — number of texts that each terror group appears in

Source               Aksa  Fatah  Hamas  Hezbollah  Islamic Jihad  al Qaeda
The Washington Post     2      1      2          1              1         2
The New York Times      1      2      3          2              2         1
The Economist           1      2      4          1              2         1
Total                   4      5      9          4              5         4


Holi War into Holy_War. (The apparent misspelling of Holi results from stemming.)

Golan Height into Golan_Heights.

2. Translate people’s names, various versions of their names as they appear in the

data set, aliases and synonyms that these people use into the organization that this

person is associated with.


Dr. Abdel Aziz Rantisi and Dr. Rantisi into Aziz_Al-Rantisi, who is a

member of Hamas.

Mahmoud Abba and Abu Mazen into Mahmoud_Abbas, who is a

member of the Palestinian Authority.

3. Translate various spellings of a group and synonyms for groups into one unique

name of the related group or organization.


Hizbullah into Hezbollah.

Islamic Resistance Movement into Hamas.
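The pre-processing steps described above (deletion, stemming, and underscore-joining of multi-word phrases) can be sketched as follows. The delete list, the toy suffix-stripper standing in for the Porter-based stemmer AutoMap actually uses, and the thesaurus entries are illustrative assumptions, not AutoMap's actual resources; note how the toy y-to-i rule reproduces the Holi spelling mentioned above.

```python
# Illustrative sketch of the pre-processing pipeline: delete list,
# stemming, then fusing multi-word phrases into single concepts.
# All resources below are toy stand-ins, not AutoMap's actual data.

DELETE_LIST = {"the", "a", "an", "and", "or", "of", "in", "to"}

GENERALIZATION = {          # stemmed text-level phrase -> higher-level concept
    "Holi War": "Holy_War",          # "Holy" stems to "Holi", as noted above
    "Golan Height": "Golan_Heights",
}

def delete_concepts(tokens):
    """Remove non-content bearing concepts such as articles and conjunctions."""
    return [t for t in tokens if t.lower() not in DELETE_LIST]

def toy_stem(token):
    """Crude suffix stripping; AutoMap's stemmer is Porter-based instead."""
    low = token.lower()
    for suffix in ("ing", "ed", "s"):
        if low.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    if low.endswith("y") and len(token) > 3:
        return token[:-1] + "i"      # mimics Porter's y -> i step
    return token

def generalize(text):
    """Fuse multi-word phrases into single underscore-joined concepts."""
    for phrase, concept in GENERALIZATION.items():
        text = text.replace(phrase, concept)
    return text

raw = "The leaders of the Holy War moved to the Golan Heights"
tokens = delete_concepts(raw.split())
stems = [toy_stem(t) for t in tokens]
processed = generalize(" ".join(stems))
```

Run on the sample sentence, the pipeline drops the articles and prepositions, stems the remaining tokens, and then maps the stemmed phrases onto single concepts.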

Thesaurus Creation

The resulting 170 pairs of associations of text-level concepts with higher-level concepts formed a generalization thesaurus. As noted, a generalization thesaurus translates text-level concepts into higher-level concepts. A single higher-level concept typically has multiple text-level entries associated with it in a thesaurus. For example, Imad Falouji (the higher-level concept), a Hamas member, appeared in the text set as Imad Falouji and Mr. Falouji (two related text-level concepts). The more text-level entries are associated with a higher-level concept, the greater the level of generalization being employed by the analyst.
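The many-to-one character of a generalization thesaurus can be sketched as below. The entries mirror the examples in the text; the replacement mechanics (longest phrase first, so shorter entries cannot clobber parts of longer ones) are our assumption, not AutoMap's actual implementation.

```python
# A generalization thesaurus: several text-level entries map to one
# higher-level concept. Entries follow the examples given in the text.

THESAURUS = {
    "Imad Falouji": "Imad_Falouji",
    "Mr. Falouji": "Imad_Falouji",
    "Hizbullah": "Hezbollah",
    "Islamic Resistance Movement": "Hamas",
}

def apply_thesaurus(text, thesaurus):
    """Replace text-level entries with higher-level concepts, longest
    entry first, so partial overlaps are resolved safely."""
    for entry in sorted(thesaurus, key=len, reverse=True):
        text = text.replace(entry, thesaurus[entry])
    return text

s1 = apply_thesaurus("Imad Falouji spoke; later Mr. Falouji denied it", THESAURUS)
s2 = apply_thesaurus("The Islamic Resistance Movement and Hizbullah", THESAURUS)
```

Both name variants collapse onto the single higher-level concept, which is what makes the subsequent network extraction treat them as one node.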


Since no pre-defined thesaurus was available to us that would have matched terrorism-related concepts to meta-matrix entity classes, we built a second generalization thesaurus. After applying the first generalization thesaurus, we built and applied this second thesaurus, with 50 entries, which translates people's names into the organizations or more abstract groups with which these people are associated. We used four basic translation strategies:


1. Members of the six terrorist groups that the data set focuses on into the related

terrorist organization.


Aziz Al-Rantisi into Hamas.

2. Representatives of the governments of various countries into the country's government.



Omar Sulieman into Egypt_Government.

Mahmoud Abbas into Palestinian_Authority.

3. People’s names into organizations or abstract groups that they belong to.


Hans_Blix, Kofi_Annan, and Michael_Chandler into UN.

Hanadi Jaradat and Saed Hanani into Suicide_Bomber.

Haviv_Dodon, Muhammad_Faraj, and Samer_Ufi into Victim_Killed.

In doing this, the basic principle we were applying was to retain specific actors — those who appeared to play primary roles — whereas secondary actors were reclassified by their role, such as victim. Not all names of people that can be associated with a group were translated into the related group. We applied this strategy so that, in a sub-matrix text analysis run after the meta-matrix text analysis, the entity class Agent (to which we assigned these names in the meta-matrix thesaurus applied after the second generalization thesaurus) could be translated back into the names of the key players relevant to us. Names that we decided not to match with an organization include, for example, Osama bin Laden, Yasser Arafat, and Ariel Sharon. How much detail to maintain always depends on the research question or goal. Our goal was to detect the network structure of terrorist groups.

After finishing the generalization process2 we built and employed a meta-matrix thesaurus.

In order to support the analyst in matching text-level concepts against meta-matrix categories, AutoMap offers the options to: a) load a list of all unique concepts appearing in the text set into the leftmost column of the meta-matrix thesaurus, or b) save a list of the union of all unique concepts to a directory of the analyst's choice. In the next step the analyst has to go through this list manually and decide whether or not to associate each concept with meta-matrix categories. Our dataset contained 2,083 unique concepts after applying the generalization thesauri. Of these unique concepts, 303 were assigned to an entity class in the meta-matrix, 23 of them to two entity classes (Table 5, sum of column one). The remaining 1,780 unique concepts were not assigned to any meta-matrix entity class but were kept as non-categorized concepts.

The creation of a meta-matrix thesaurus is step 3, concept classification, in the

procedure of performing meta-matrix text analysis (see section 4).
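The concept-classification step can be sketched as follows. The tiny thesaurus and concept list below are hypothetical; the point is that each unique concept is either associated with one or more entity classes or left non-categorized, and that cross-classified concepts are counted twice in the cumulated assignment sum (as in column one of Table 5).

```python
# Sketch of concept classification against a meta-matrix thesaurus.
# The thesaurus and the concept list are hypothetical toy data.

META_THESAURUS = {               # concept -> entity classes (one or two)
    "Hamas": {"Organization"},
    "Golan_Heights": {"Location"},
    "Aziz_Al-Rantisi": {"Agent"},
    "training_camp": {"Resource", "Location"},   # cross-classified concept
}

unique_concepts = ["Hamas", "Golan_Heights", "Aziz_Al-Rantisi",
                   "training_camp", "yesterday", "statement"]

categorized = [c for c in unique_concepts if c in META_THESAURUS]
dual = [c for c in categorized if len(META_THESAURUS[c]) == 2]
non_categorized = [c for c in unique_concepts if c not in META_THESAURUS]

# Cumulated assignments (dual-classified concepts contribute twice)
cumulated = sum(len(META_THESAURUS[c]) for c in categorized)
```

Non-categorized concepts stay in the coded text but carry no entity-class label, mirroring how the 1,780 unassigned concepts were handled.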

In the next step we applied the meta-matrix thesaurus to the data set3 and ran a meta-matrix

text analysis on the pre-processed text set4. This technique forms step 4, perform map

analysis, in the procedure of performing meta-matrix text analysis (see section 4).

Characteristics of the Textual Networks as Meta-Matrices (Graph and Analyze Data)


In this section, we report the results of the meta-matrix text analysis and sub-matrix text

analysis we ran on our data set. This task is step 5 in the procedure of performing meta-matrix text analysis. The intent in this section is to illustrate the type of results and graphs

possible using the proposed meta-matrix approach to NTA, not to present a comprehensive

analysis of terrorist networks. In doing this example, we will analyze: 1) unique and

total frequencies of the concepts and statements, 2) unique and total frequencies of the

statements that were formed from concepts associated with meta-matrix entity classes,

and 3) the distribution of statements formed from meta-matrix entity classes across the

data set.

For our analysis we considered the six meta-matrix entity classes in Table 3. Therefore, we have six unique entity-level concepts. Considering only concepts that fall into one or more of these categories, we found an average of 99.2 total concepts per text, ranging from 37 to 163. Based on these concepts, an average of 18.9 unique statements (ranging from 8 to 29) and 45.7 total statements (ranging from 12 to 84) were formed per text. Thus, on average, each unique statement appeared 2.4 times per text. Theoretically, each text could contain up to 36 unique statements. The theoretical maximum would be achieved if there existed at least one concept associated with each entity class, and at least one concept of each entity class formed a statement with at least one concept in each other entity class. The multiple occurrences of unique statements are expressed in the number of total statements.
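AutoMap links concepts that co-occur within a proximity window in the text. The sketch below uses a simplified adjacency window, an assumption rather than AutoMap's exact windowing, to show how total and unique statement counts diverge, and why six entity classes cap the class-level unique statements at 36.

```python
# Simplified proximity-based statement formation: directed pairs between
# concepts that co-occur within a sliding window. The window mechanics
# are an assumption, not AutoMap's actual coding rules.

def statements(concepts, window=2):
    """Form directed (anterior, posterior) statements within the window."""
    pairs = []
    for i, c in enumerate(concepts):
        for d in concepts[i + 1 : i + window]:
            pairs.append((c, d))
    return pairs

coded = ["Hamas", "money", "Hamas", "money"]   # hypothetical coded text
total = statements(coded)                      # repeated statements counted
unique = set(total)                            # each statement counted once

ENTITY_CLASSES = ["Agent", "Knowledge", "Resource",
                  "Task-Event", "Organization", "Location"]
max_unique_class_statements = len(ENTITY_CLASSES) ** 2   # 6 x 6 = 36
```

Here three total statements reduce to two unique ones, the same distinction that yields the 45.7 total versus 18.9 unique statements per text reported above.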

Table 5. Creation and application of meta-matrix thesaurus (sorted by frequency)

Columns: (1) cumulated sum of assignment of concepts to entity classes in the meta-matrix thesaurus; (2) cumulated sum of appearance of entity classes in texts after application of the meta-matrix thesaurus; (3) cumulated sum of linkage of concepts associated with meta-matrix entity classes into statements.

Category       (1)   (2)   (3)
Organization    48   569   434
Location        81   404   404
Agent           54   250   217
Resource        75   261   188
Task-Event      27   168   146
Knowledge       41   134   128

Across the 18 meta-matrices extracted from our sample texts, 822 total statements were formed within and between the cells of the meta-matrix (see Table 6 for the distribution of total statements across the meta-matrix). Notice that the upper and lower triangle of the meta-matrix in Table 6 are not symmetric. For example, in Table 6 from Resource (row) to Organization (column) there are a total of 23 statements, but from Organization (row) to Resource (column) there are a total of 35 statements. Indeed, there is no need for symmetry, as the relations between concepts (edges between nodes) found with AutoMap are directed, which is inherently pre-defined by the directed structure of language. The results in Table 6 show that concepts associated with each meta-matrix entity class appear approximately as often in posterior positions of statements (last row in Table 6) as in anterior positions (last column in Table 6). Thus, the in-degree or receptivity of a meta-matrix entity class approximately equals the out-degree or expansiveness of the class. This is due, in part, to the use of proximity in the text to place links among concepts and reflects, if anything, the lack of overly stylized sentential form.
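The in-degree versus out-degree comparison can be reproduced directly from Table 6; the matrix below is transcribed from that table, with row sums giving expansiveness and column sums giving receptivity.

```python
# Row sums (out-degree, expansiveness) and column sums (in-degree,
# receptivity) of the entity-class network, using the counts in Table 6.

classes = ["Agent", "Knowledge", "Resource", "Task-Event",
           "Organization", "Location"]
table6 = [
    [24,  8,  8, 12, 55, 12],   # Agent
    [10, 18,  9,  3, 20, 11],   # Knowledge
    [ 8,  9, 39, 11, 23, 20],   # Resource
    [13,  7,  9, 10, 20, 17],   # Task-Event
    [58, 23, 35, 19, 90, 44],   # Organization
    [ 9, 10, 17, 25, 47, 69],   # Location
]

out_degree = {c: sum(row) for c, row in zip(classes, table6)}
in_degree = {c: sum(col) for c, col in zip(classes, zip(*table6))}
grand_total = sum(out_degree.values())
```

The grand total recovers the 822 statements reported above, and each class's in-degree is close to its out-degree (e.g., 122 versus 119 for Agent), as the text notes.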

Within the meta-matrix, the entity class that linked most frequently to other entity classes

was Organization (179 links), followed by Location (108), Agent (95), Resource (71), Task-

Event (66), and Knowledge (53). If we do not look at these absolute values, but at

percentages of the linkage of meta-matrix entity classes to the same or other entity

classes, our results reveal that concepts in the entity class Task-Event are more likely

to be connected to concepts in classes other than Task-Event. In contrast to Task-Event,

concepts in the entity class Location are most likely to link to other Location concepts

(Table 7).
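The percentages in Table 7 follow from Table 6: for each entity class, its diagonal entry over its row sum, rounded to a whole percent. The computation below is our reconstruction of how the table was derived.

```python
# Reproducing Table 7 from Table 6: the share of each class's outgoing
# statements that stay within the same class, rounded to whole percent.

classes = ["Agent", "Knowledge", "Resource", "Task-Event",
           "Organization", "Location"]
table6 = [
    [24,  8,  8, 12, 55, 12],
    [10, 18,  9,  3, 20, 11],
    [ 8,  9, 39, 11, 23, 20],
    [13,  7,  9, 10, 20, 17],
    [58, 23, 35, 19, 90, 44],
    [ 9, 10, 17, 25, 47, 69],
]

same_class_pct = {
    c: round(100 * row[i] / sum(row))
    for i, (c, row) in enumerate(zip(classes, table6))
}
```

This reproduces the ordering discussed in the text: Task-Event links within its own class least often, Location most often.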

Furthermore, the results indicate that within the networks that we extracted from the texts,

most information refers to membership networks (13.8% of all statements, Figure 1).

Table 6. Number of links (total number of statements) between meta-matrix categories

Meta-Matrix    Agent  Knowledge  Resource  Task-Event  Organization  Location   Sum
Agent             24          8         8          12            55        12   119
Knowledge         10         18         9           3            20        11    71
Resource           8          9        39          11            23        20   110
Task-Event        13          7         9          10            20        17    76
Organization      58         23        35          19            90        44   269
Location           9         10        17          25            47        69   177
Sum              122         75       117          80           255       173   822

Table 7. Linkage of meta-matrix entity classes

Meta-Matrix entity class   With same entity class (%)   With other entity classes (%)
Task-Event                 13                           87
Agent                      20                           80
Knowledge                  25                           75
Organization               33                           67
Resource                   35                           65
Location                   39                           61

There is also substantial information on inter-organizational networks (11.1%) and organizational location networks (10.4%). The least information is provided on precedence networks (1.2%) and knowledge requirement networks (1.2%). This suggests that more is known, or at least presented in the news, about who the terrorists are and where they are than about what they do when, what they need to know in order to engage in such actions, or why.

The analysis of the distribution of statements formed from meta-matrix entity classes

across the text set reveals that all entities are covered in at least one third of the texts.

In addition, Organization, Location, and Agent classes appear in more than half of the

texts (Table 8). Again, this suggests that more is reported about who and where than

about what, how and why. We note that a human reading of these texts may pick up a

little more about what and how, although such information does appear to be less

common in general in the texts used for this purely illustrative analysis.

Figure 1: Total number of links between meta-matrix categories

Table 8: Number of texts in which links appear

Meta-Matrix    Agent  Knowledge  Resource  Task-Event  Organization  Location  Mean
Agent             13          5         6          10            17         9  10.0
Knowledge          7          9         5           3             9         5   6.3
Resource           4          4         9           7            12        11   7.8
Task-Event         9          3         7           4            11        10   7.3
Organization      17         11        13          11            18        16  14.3
Location           7          7        10          11            17        14  11.0
Mean             9.5        6.5       8.3         7.7          14.0      10.8   9.5


[Figure 1: bar chart of the total number of statements per meta-matrix cell; values range from 91 down to 10.]


In Figure 1 and Tables 6 and 8, we have been discussing the total links or statements.

Looking at the total links provides information about the overall structure of the

discussion and the elements of the structure (agents, knowledge, etc.) that are considered

critical by the authors or for which they have a wealth of information. It is often useful

to ask about unique links, however, if we want to understand the structure itself. In Figure

2, we display the number of links per sub-matrix that are unique. That is, a link or statement

is only counted once regardless of how many texts it appears in.

Comparison of Figures 1 and 2 shows that a great deal of information — particularly in the Agent-to-Agent sub-matrix — is repeated across texts. This suggests that either many

of the texts were discussing the same information (repetition), or they got their information

from the same source. Note that if we knew that each source was unique, then the

difference between the total (Figure 1) and the unique (Figure 2) would be an indicator

of the reliability of the information.

The overall structure for this covert network is very sparse. In some sense, based on these

texts, more is known about the affiliations, locations, resources, and knowledge of agents

and organizations than is known about the interrelations of knowledge, resources and

tasks (Table 9). Further, if we compare the number of unique links (Table 9) to the number

Figure 2. Number of unique links between meta-matrix categories

[Figure 2: bar chart of the number of unique statements per meta-matrix cell; values range from 34 down to 9.]


Key: A = Agent, K = Knowledge, R = Resource, T = Task/Event, O = Organization, L = Location

Table 9. Number of links (unique number) between meta-matrix categories

Meta-Matrix    Agent  Knowledge  Resource  Task-Event  Organization  Location
Agent             13         12        10          19            34        16
Knowledge                     9         9           6            20        12
Resource                                9          14            25        21
Task-Event                                          4            22        21
Organization                                                     18        33
Location                                                                   14


of texts that contain links (for each sub-matrix) (Table 8) we see that the two tables are

similar. In other words, many links appear in only one text. It is interesting to note which

sub-matrices have more unique links than texts – e.g., the Agent-by-Knowledge and the

Organization-by-Knowledge sub-matrices. This indicates that the texts that discuss the

knowledge network tend to do so by discussing multiple linkages (e.g., all of these people know item z), whereas texts that discuss, e.g., the social network (Agent-by-Agent) are more likely to simply talk about a single pair of actors and the nature of their relationship.

Whether this pattern of reporting would hold in other cultures is debatable.

Beyond learning about the network structure of the meta-matrices and the distribution

of concepts and connections between them across the sample data, analysts might be

interested in investigating in more detail the concepts and links contained in the meta-matrix. In order to gain this knowledge, sub-matrix text analysis5 can be run. For

Table 10: Who has what means? Organizational capability network (organization by resource)


Table 11: Who knows what? Knowledge network (agent by knowledge)

Table 12: Who is located where and does what? (Localized assignment network: agent

by task-event by location)

Statements formed from Higher-Level Concepts (Sub-Matrix Analysis)

Sample text 1:
1 Al-Qaeda – training camp
1 network – Hawala
1 Hawala – money
1 finance – network
1 camp – US-Government
1 money – Hamas
1 support – Hamas
1 Treasury – assistance
1 US-Government – assistance
1 assets – Treasury Department

Sample text 2:
1 Al-Aksa – assets
1 Al-Aksa – money
1 Hamas – sponsoring
1 aid – Hamas
1 aid – Treasury Department

Statements formed from Higher-Level Concepts (Sub-Matrix Analysis)

Sample text 1:
1 chairman – monitoring
1 evidence – Saddam Hussein

Sample text 2:
1 FBI – Analyst

Statements formed from Higher-Level Concepts (Sub-Matrix Analysis)

Sample text 1:
1 Saddam Hussein – Iraq

Sample text 2:
1 arrest – Leader
1 Leader – Germany

illustrating the results of this procedure, we show maps from the two sample texts in Tables 10 to 12. A map contains one coded statement per line and its frequency.
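Sub-matrix extraction restricts the coded statements to one cell of the meta-matrix, e.g., organization by resource for the capability network. The sketch below is a stand-in for this filtering step; the concept-to-class map and the statements are hypothetical.

```python
from collections import Counter

# Sketch of sub-matrix extraction: keep only statements whose two
# concepts fall into a chosen pair of entity classes. The class map
# and statements are hypothetical stand-ins for coded output.

CLASS_OF = {
    "Al-Qaeda": "Organization", "Hamas": "Organization",
    "training camp": "Resource", "money": "Resource",
    "Saddam Hussein": "Agent", "Iraq": "Location",
}

statements = [("Al-Qaeda", "training camp"), ("money", "Hamas"),
              ("Saddam Hussein", "Iraq"), ("Al-Qaeda", "training camp")]

def sub_matrix(stmts, class_a, class_b):
    """Return 'frequency concept – concept' map lines for one sub-matrix,
    keeping statements in either direction between the two classes."""
    counts = Counter(
        s for s in stmts
        if {CLASS_OF.get(s[0]), CLASS_OF.get(s[1])} == {class_a, class_b}
    )
    return ["%d %s – %s" % (n, a, b) for (a, b), n in counts.items()]

capability = sub_matrix(statements, "Organization", "Resource")
```

The output lines follow the one-statement-per-line format of the maps in Tables 10 to 12, with the frequency leading each line.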

These various sub-matrix networks enable a better understanding of what attributes of

the meta-matrix link to other attributes, and with what strength. All three sub-matrices

together enable a broader view of the situation. Figures 3 and 4 illustrate this broader

picture. The comparison of Figures 3 and 4 illustrates that text 1 presents a more disconnected story than does text 2. Further, even if the two stories were combined, the overall map would tell us little about the structure of the two terrorist groups — al-Qaeda and Hamas.

Figure 3: Visualization of sub-matrices from sample text 1

Figure 4: Visualization of sub-matrices from sample text 2

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

Meta-matrix data and sub-matrix data generated with AutoMap can be saved and then

re-analyzed outside of AutoMap using standard social network analysis tools. AutoMap

can both code these networks and then output them in two useful exchange formats for

use with other network analysis tools — DL for UCINET and DyNetML for ORA (Carley & Reminga, 2004). For this chapter, we

use ORA as it enables the analysis of all the cells in the meta-matrix at once. In either case,

the combination of text and network analysis enables the analyst to readily combine rich

textual data with organizational data connected through other methods, thus enhancing

the analysis process.

Discussion: Features and Limitations

The techniques of meta-matrix text analysis and sub-matrix text analysis described herein

can support analysts in investigating the network structure of social and organizational

systems that are represented in textual data. Furthermore, these novel and integrative

methods enable analysts to classify words in texts into entity classes (node types)

associated with networks common to organizational structures according to a theoretically

and empirically validated ontology — the meta-matrix.

The validity of the method and the results presented in this chapter are constrained by the limited experience we have gained so far with these novel techniques, the small number of texts analyzed, and the implementation of the techniques in a single software package. The tool should also be applied to multiple larger data sets.

Lessons Learned

In general, we find that the entity-name recognizer greatly enhances the ability to locate

concepts associated with the meta-matrix ontology. In particular, it facilitates locating

Agents, Organizations, and Locations. For entity classes that are less associated with

proper nouns, the name recognizer is of less value.

Coding texts using AutoMap is not a completely automated process. However, AutoMap

does provide a high degree of automation that assists the user and increases the

efficiency and effectiveness of meta-matrix text analysis in comparison to manual coding.

As with most text analysis techniques that seek to extract meaning, significant manual

effort needs to be expended on constructing the delete list and thesauri, even though the

method is computer-supported. For example, the delete list used in this study took 30

minutes to construct. However, the thesauri (and there are three) took four days to

construct. Thesauri enable the minimization of miscoding, as in missed relations, due to

aliases and misspellings, and differences due to the underlying languages. Analysts

have to decide on an optimal trade-off between the speed of the computer-supported research process and the enhancement in quality of automated coding achieved by the manual creation and refinement of pre-processing tools, according to their goals and needs.


It is worth noting that significant improvement over straight manual coding can be

achieved by building thesauri and delete lists based on only a fraction of texts. As more

texts in this domain are coded, we will have to expend relatively little additional effort to

expand the delete list and thesauri. For example, we suspect that hundreds of additional

texts will be codable with maybe only a day more attention to the thesauri. The reason

is that, when in the same domain, construction of thesauri is like building a sample via

the snowball method (i.e., with each iteration fewer and fewer novel concepts are found).

How large that fraction should be is a point for future work. However, preliminary studies

suggest 10% is probably sufficient. Future work should explore whether intelligent data

mining and machine learning techniques can be combined with social network analysis

and text analysis to provide a more automated approach to constructing thesauri on the fly.


We also find that the higher the level of generalization used in the generalization

thesaurus, the greater the ability to compare two diverse texts. Not counting typographical

errors, often the translation of two to ten text-level concepts per high-level concept

seems sufficient to generate a “language” for the domain being studied.

We note that when forming thesauri, it is often critical to keep track of why certain types

of concepts are generalized into others. At the moment there is no way to record that rationale within AutoMap. In general, the user should keep a lab notebook or readme file for such rationales.

Finally, we note that for extracting social or organizational structure from texts a large

corpus is needed. The point here is comprehensiveness, not necessarily a specific

number of texts. Thus, one might use the entire content of a book that describes and

discusses an organization or a large set of newspaper articles. In building this corpus,

not all texts have to be of the same type. Thus, the analyst can combine newspaper

reports, books, board-of-directors reports, Web pages, etc. Once the networks are

extracted via AutoMap they can be combined into a comprehensive description of the

organization being examined. Further, the analyst needs to pre-define what the basic

criteria are for including a text in the corpus — e.g., it might be publication venue, time

frame, geographic area, specific people, organizations, or locations mentioned.

Considerations for Future Work

We also note that the higher the level of generalization, the more ideas are being inferred

from, rather than extracted from, the texts. Research needs to be done on the appropriate

levels of generalization. Note that the level of generalization can be measured as the

average number of text-level concepts associated with each higher-level concept.
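The proposed measure of generalization level can be computed directly from a thesaurus, as sketched below. The thesaurus entries, including the variant spellings, are hypothetical toy data.

```python
# Level of generalization: average number of text-level entries per
# higher-level concept. The thesaurus below is a hypothetical toy.

thesaurus = {                       # text-level entry -> higher-level concept
    "Imad Falouji": "Imad_Falouji",
    "Mr. Falouji": "Imad_Falouji",
    "Hizbullah": "Hezbollah",
    "Hisbollah": "Hezbollah",       # invented variant spelling
    "Hezbolah": "Hezbollah",        # invented variant spelling
    "Islamic Resistance Movement": "Hamas",
}

higher_level = set(thesaurus.values())
generalization_level = len(thesaurus) / len(higher_level)
```

Six text-level entries mapping to three higher-level concepts give a generalization level of 2.0, which falls inside the two-to-ten range the text suggests is often sufficient.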

One of the strengths of NTA is that the networks extracted from the texts can be combined

in a set theoretic fashion. So we can talk about the network formed by the union or

intersection of the set of networks drawn from the set of texts. When combining these

networks we can, for each statement, track the number of texts that contained that

statement. Since a statement is a relation connecting two concepts, this approach

effectively provides a weight for that relation. Alternatively, the analyst can compute

whether any text contained that statement. In this case, there are no weights and the links

in the network are simply present or not (binary). If these texts represent diverse sources

of information, then the weights are indicative of the certainty or verifiability of a relation.

Future work might also explore utilizing Bayesian learning techniques for estimating the

overall confidence in a relation rather than just summing up the number of texts in which

the statement was present.
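The set-theoretic combination described above can be sketched as follows: the weight of a statement is the number of texts containing it, and the binary union simply records presence. The three tiny per-text networks are hypothetical.

```python
from collections import Counter

# Combining per-text networks: weighted union (weight = number of texts
# containing the statement) versus binary union (present or not).
# The three per-text networks below are hypothetical.

texts = [
    {("Hamas", "money"), ("Hamas", "Gaza")},
    {("Hamas", "money"), ("al-Qaeda", "training camp")},
    {("Hamas", "money"), ("Hamas", "Gaza")},
]

weights = Counter(stmt for network in texts for stmt in network)
binary_union = set().union(*texts)
```

If the texts come from diverse sources, the weights serve as the certainty or verifiability indicator discussed above; the binary union discards that information and keeps only the link structure.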

We also note that when people read texts there is a process of automatic inference. For

example, when people read about a child talking to a parent they infer based on social

experience that the child is younger. Similarly, it appears that such inferences are common

between the entity classes. For example, if Agent X has Resource Y and Knowledge K

is needed to use Resource Y, then in general Agent X will have Knowledge K. Future work

needs to investigate whether a simple inference engine at the entity class level would

facilitate coding. We note that previous work found that using expert systems to assist

coding in terms of adding general social knowledge was quite effective (Carley, 1988).

Thus, we expect this to be a promising avenue for future research.
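The Agent/Resource/Knowledge rule above can be sketched as a one-step inference over two relations; the facts below are hypothetical, and a real engine would of course need to handle exceptions to such social-knowledge rules.

```python
# Sketch of the entity-class-level inference rule: if an agent has a
# resource, and knowledge K is needed to use that resource, infer that
# the agent has K. All facts below are hypothetical.

has_resource = {("Agent_X", "explosives")}           # (agent, resource)
needs_knowledge = {("explosives", "bomb_making")}    # (resource, knowledge)

def infer_knowledge(has_res, needs_know):
    """Derive (agent, knowledge) pairs by joining the two relations
    on the shared resource."""
    return {(agent, k)
            for agent, res in has_res
            for res2, k in needs_know
            if res == res2}

inferred = infer_knowledge(has_resource, needs_knowledge)
```

Such inferred links would populate the Agent-by-Knowledge sub-matrix even when no text states the connection explicitly, which is exactly the kind of coding assistance the text proposes investigating.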

Finally, we note that the use of an ontology adds a hierarchical level to the coding. This

is invaluable from an interpretative perspective. There is no reason, conceptually, why

multiple hierarchical levels could not be added, denoting finer and finer levels of detail.

We suspect however, based on the use of hierarchical coding schemes in various

scientific fields (e.g., biology and organization theory) that: a) such hierarchies are likely

to not be infinitely deep, b) a certain level of theoretical maturity and consensus in a field

is needed for such a hierarchy to be generally useful, and c) eventually we will need to

move beyond such a “flat” scheme for extracting meaning. As to this last point, by flat

what we are referring to is the fact that a hierarchy can be completely represented in two

dimensions. We found, even when doing this limited coding that some text-level

concepts and higher-level concepts needed to be cross-classified into two or more entity

classes. As more levels are added in an ontological hierarchy, such cross classification

is likely to occur at each level, resulting in a network of inference, not a simple hierarchy

and so a non-flat structure. Future work should examine how to code, represent, and

reason about such networks.


Conclusion

One of the key advantages of classic content analysis was that macro social change could

be tracked by changes in content, and over- or under-representation of various words.

For example, movements toward war might be signaled by an increasing usage of words

describing hostile acts, foreign powers, and weapons. One of the key advantages of

Network Text Analysis (NTA) over standard text analysis is that it enables the extraction

of meaning and enables interpretation by signaling not just what words are used but how

they are used. This enables differences and similarities in viewpoints to be examined, and

it enables the tracking of micro social change as evidenced by changes in meaning. By

adding an ontology to NTA, differences and similarities in viewpoints about a meta-structure described or discussed in the text can be examined.

In this chapter, we used the meta-matrix ontology as we were interested in the underlying

social/organizational structure described in the texts. Several points are critical to note.

First, the mere fact that we used an ontology to define a set of meta-concepts enables

the extraction of a hierarchy of meaning, thus affording the analyst greater interpretive

ability. Second, any ontology could be used, and the analyst needs to consider the

appropriate ontology for their work. In creating this ontology the analyst wants to think

in terms of the set of entity classes and the relations among them that define the second

level network of interest. For us, these entity classes and relations were those relevant

to defining the organizational structure of a group.

The proposed meta-matrix approach to text analysis makes it possible to track more micro

social change in terms of changes, not just in meaning, but in the social and organizational

structures. Using techniques such as this facilitates a more systematic analysis of

groups, broadens the types of questions that can be effectively answered using texts,

and brings the richness of textual information to bear in defining and understanding the

structure of the organizations and society in which we live.


Acknowledgments

We want to thank Maksim Tsvetovat and Jeffrey Reminga from CASOS, CMU for helping with generating the visualizations.


References

Alexa, M. (1997). Computer-assisted text analysis methodology in the social sciences. Arbeitsbericht: ZUMA.

Bakker, R.R. (1987). Knowledge graphs: Representation and structuring of scientific

knowledge. Dissertation. University Twente.

Batagelj, V., Mrvary, A., & Zaveršnik, M. (2002). Network analysis of texts. In T. Erjavec

& J. Gros (Eds.), Proceedings of the 5th International Multi-Conference Information

Society - Language Technologies (pp. 143-148). Ljubljana, October. Jezikovne

tehnologije / Language Technologies, Ljubljana.

Burkart, M. (1997). Thesaurus. In M. Buder, W. Rehfeld, T. Seeger, & D. Strauch (Eds.),

Grundlagen der Praktischen Information und Dokumentation: Ein Handbuch

zur Einführung in die Fachliche Informationsarbeit (pp. 160-179) (4th edition).

München: Saur.

Carley, K.M. (2003). Dynamic network analysis. In R. Breiger, K.M. Carley, & P. Pattison

(Eds.), Summary of the NRC workshop on social network modeling and analysis

(pp. 133-145). Committee on Human Factors, National Research Council.

Carley, K.M. (2002). Smart agents and organizations of the future. In L. Lievrouw & S.

Livingstone (Eds.), The handbook of new media (pp. 206-220). Thousand Oaks,

CA: Sage.

Carley, K.M. (1997a). Extracting team mental models through textual analysis. Journal

of Organizational Behavior, 18, 533-558.

Carley, K.M. (1997b). Network text analysis: The network position of concepts. In C.W.

Roberts (Ed.), Text analysis for the social sciences (pp. 79-102). Mahwah, NJ:

Lawrence Erlbaum.

Carley, K.M. (1993). Coding choices for textual analysis: A comparison of content

analysis and map analysis. In P. Marsden (Ed.), Sociological Methodology, 23, 75-

126. Oxford: Blackwell.

Carley, K.M. (1988). Formalizing the social expert’s knowledge. Sociological Methods

and Research, 17(2), 165-232.

Carley, K.M. (1986). An approach for relating social structure to cognitive structure.

Journal of Mathematical Sociology, 12, 137-189.

Carley, K.M., Dombrowski, M., Tsvetovat, M., Reminga, J., & Kamneva, N. (2003).

Destabilizing dynamic covert networks. Proceedings of the 8th International

Command and Control Research and Technology Symposium. Washington, DC.

Evidence Based Research, Vienna, VA.

Carley, K. M., & Hill, V. (2001). Structural change and learning within organizations. In

A. Lomi & E.R. Larsen (Eds.), Dynamics of organizations: Computational modeling

and organizational theories (pp. 63-92). Live Oak, CA: MIT Press/AAAI.


Carley, K.M., & Krackhardt, D. (1999). A typology for C2 measures. In Proceedings of

the 1999 International Symposium on Command and Control Research and

Technology. Newport, RI, June.

Carley, K.M., & Palmquist, M. (1992). Extracting, representing, and analyzing mental

models. Social Forces, 70(3), 601-636.

Carley, K.M., & Reminga, J. (2004). ORA: Organizational risk analyzer. Carnegie Mellon

University, School of Computer Science, Institute for Software Research International,

Technical Report CMU-ISRI-04-106.

Carley, K.M., Ren, Y., & Krackhardt, D. (2000). Measuring and modeling change in C3I

architectures. In Proceedings of the 2000 Command and Control Research and

Technology Symposium. Naval Postgraduate School, Monterey, CA, June, 2000.

Corman, S.R., Kuhn, T., Mcphee, R.D., & Dooley, K.J. (2002). Studying complex discursive

systems: Centering resonance analysis of communication. Human Communication

Research, 28(2), 157-206.

Danowski, J. (1993). Network analysis of message content. In W.D. Richards & G.A.

Barnett (Eds.), Progress in communication science, XII (pp. 197-222). Norwood,

NJ: Ablex.

Diesner, J., & Carley, K.M. (2004). AutoMap1.2 - Extract, analyze, represent, and

compare mental models from texts. Carnegie Mellon University, School of Computer

Science, Institute for Software Research International, Technical Report


Galbraith, J. (1977). Organizational design. Reading, MA: Addison-Wesley.

Hill, V., & Carley, K.M. (1999). An approach to identifying consensus in a subfield: The

case of organizational culture. Poetics, 27, 1-30.

James, P. (1992). Knowledge graphs. In R.P. van der Riet & R.A. Meersman (Eds.),

Linguistic instruments in knowledge engineering (pp. 97-117). Amsterdam: Elsevier.

Jurafsky, D., & Martin, J.H. (2000). Speech and language processing. Upper Saddle

River, NJ: Prentice Hall.

Kelle, U. (1997). Theory building in qualitative research and computer programs for the

management of textual data. Sociological Research Online, 2(2). Retrieved from

the WWW at:

Klein, H. (1997). Classification of text analysis software. In R. Klar & O. Opitz (Eds.),

Classification and knowledge organization: Proceedings of the 20th annual

conference of the Gesellschaft für Klassifikation e.V. (pp. 255-261). University of

Freiburg, Berlin. New York: Springer.

Kleinnijenhuis, J., de Ridder, J.A., & Rietberg, E.M. (1996). Reasoning in economic

discourse: An application of the network approach in economic discourse. In C.W.

Roberts (Ed.), Text analysis for the social sciences (pp. 79-102). Mahwah, NJ:

Lawrence Erlbaum.

Krackhardt, D., & Carley, K.M. (1998). A PCANS model of structure in organization. In

Proceedings of the 1998 International Symposium on Command and Control

Research and Technology Evidence Based Research (pp. 113-119). Vienna, VA.

Magnini, B., Negri, M., Prevete, R., & Tanev, H. (2002). A WordNet-based approach to

named entities recognition. In Proceedings of SemaNet’02: Building and Using

Semantic Networks (pp. 38-44). Taipei, Taiwan.

March, J.G., & Simon, H.A. (1958). Organizations. New York: Wiley.

Monge, P.R., & Contractor, N.S. (2003). Theories of communication networks. Oxford:

Oxford University Press.

Monge, P.R., & Contractor, N.S. (2001). Emergence of communication networks. In F.M.

Jablin, & L.L. Putnam (Eds.), The new handbook of organizational communication:

Advances in theory, research and methods (pp. 440-502). Thousand Oaks,

CA: Sage.

Popping, R. (2003). Knowledge graphs and network text analysis. Social Science

Information, 42(1), 91-106.

Popping, R. (2000). Computer-assisted text analysis. London, Thousand Oaks: Sage.

Popping, R., & Roberts, C.W. (1997). Network approaches in text analysis. In R. Klar &

O. Opitz (Eds.), Classification and Knowledge Organization: Proceedings of the

20th annual conference of the Gesellschaft für Klassifikation e.V. (pp. 381-389),

University of Freiburg, Berlin. New York: Springer.

Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.

Reimer, U. (1997). Neue Formen der Wissensrepräsentation. In M. Buder, W. Rehfeld, T.

Seeger & D. Strauch (Eds.), Grundlagen der praktischen Information und

Dokumentation: Ein Handbuch zur Einführung in die fachliche Informationsarbeit

(pp. 180-207) (4th edition). München: Saur.

Ryan, G.W., & Bernard, H.R. (2000). Data management and analysis methods. In N.

Denzin & Y. Lincoln (Eds.), Handbook of qualitative research (pp. 769-802) (2nd

edition). Thousand Oaks, CA: Sage.

Scott, J.P. (2000). Social network analysis: A handbook (2nd edition). London: Sage.

Simon, H.A. (1973). Applying information technology to organizational design. Public

Administration Review, 33, 268-78.

Sowa, J.F. (1984). Conceptual structures: Information processing in mind and machine.

Reading, MA: Addison-Wesley.

Wasserman, S., & Faust, K. (1994). Social network analysis. Methods and applications.

Cambridge: Cambridge University Press.

Züll, C., & Alexa, M. (2001). Automatisches Codieren von Textdaten. Ein Überblick

über neue Entwicklungen. In W. Wirth & E. Lauf (Eds.), Inhaltsanalyse –

Perspektiven, Probleme, Potenziale (pp. 303-317). Köln: Herbert von Halem.


1 The delete list was applied with the rhetorical adjacency option. Rhetorical

adjacency means that text-level concepts matching entries in the delete list are

replaced by imaginary placeholders. These placeholders ensure that only concepts

that occurred within the same window before pre-processing can form statements

(Diesner & Carley, 2004).
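The placeholder mechanism in footnote 1 can be pictured with a minimal sketch. This is an illustrative reimplementation, not AutoMap's actual code; the delete list and placeholder token below are made up for the example:

```python
DELETE_LIST = {"the", "and", "of", "a", "an"}  # illustrative non-content concepts
PLACEHOLDER = "xxx"                            # illustrative placeholder token

def apply_delete_list(text):
    """Replace delete-list concepts with placeholders, preserving positions
    so that the surviving concepts keep their original window distances."""
    return [PLACEHOLDER if w.lower() in DELETE_LIST else w
            for w in text.split()]

print(apply_delete_list("the network of concepts"))
# ['xxx', 'network', 'xxx', 'concepts']
```

Because 'network' and 'concepts' remain two positions apart, later statement formation links exactly the pairs that co-occurred within a window before pre-processing.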

2 We did not choose the thesaurus content only option. Thus, adjacency does not


3 We used the thesaurus content only option in combination with the rhetorical

adjacency. Thus, the meta-matrix categories are the unique concepts.

4 We used the following statement formation settings: Directionality: uni-directional,

Window Size: 4, Text Unit: Text (for detailed information about analysis

settings in AutoMap see Diesner & Carley, 2004).
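The statement formation settings in footnote 4 can be sketched as follows. This is a hypothetical illustration of windowed, uni-directional linking, not AutoMap's implementation; it assumes the pre-processed text arrives as an ordered list of concepts:

```python
WINDOW_SIZE = 4  # footnote 4 setting: a window spans four consecutive concepts

def form_statements(concepts, window=WINDOW_SIZE):
    """Form uni-directional statements: each concept links to every concept
    that follows it within the same window (earlier concept -> later concept)."""
    statements = []
    for i, source in enumerate(concepts):
        # a window of size 4 starting at position i covers positions i .. i+3
        for target in concepts[i + 1 : i + window]:
            statements.append((source, target))
    return statements

print(form_statements(["agents", "use", "resources", "for", "tasks"]))
```

With uni-directionality, (source, target) and (target, source) are distinct statements, which is why the resulting network is directed.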

5 Sub-Matrix selection was performed with the rhetorical adjacency option.

i This work was supported in part by the National Science Foundation under grants

ITR/IM IIS-0081219, IGERT 9972762 in CASOS, and CASOS – the Center for

Computational Analysis of Social and Organizational Systems at Carnegie Mellon

University. The views and conclusions contained

in this document are those of the authors and should not be interpreted as

representing the official policies, either expressed or implied, of the National

Science Foundation or the U.S. government.


Software: AutoMap: Diesner, J. & Carley, K.M. (2004). AutoMap1.2: Software for

Network Text Analysis.

AutoMap is a network text analysis tool that extracts, analyzes, represents, and compares

mental models from texts. The software package performs map analysis, meta-matrix text

analysis, and sub-matrix text analysis. As input, AutoMap takes raw, free-flowing, and

unmarked texts with ASCII characters. When performing analysis, AutoMap encodes

the links between concepts in a text and builds a network of the linked concepts. As an

output, AutoMap generates representations of the extracted mental models as a map file

and a stat file per text, various term distribution lists and matrices in comma separated

value (csv) format, and outputs in DL format for UCINET and DyNetML format. The scope

of functionalities and outputs supported by AutoMap enables one way of analyzing

complex, large-scale systems and provide multi-level access to the meaning of textual
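The encoding step described above, in which linked concepts are aggregated into a network, can be pictured with a small sketch. The function and output format are hypothetical rather than AutoMap's API, and the sketch assumes statements have already been formed as (source, target) concept pairs:

```python
from collections import Counter

def build_map(statements):
    """Aggregate (source, target) statements into a weighted directed network:
    each edge weight counts how often that concept pair was linked."""
    return Counter(statements)

edges = build_map([("agent", "task"), ("agent", "task"), ("task", "resource")])
print(edges[("agent", "task")])  # -> 2: the pair was linked twice
```

A weighted edge list of this kind maps directly onto matrix outputs such as DL or DyNetML, where the weight becomes the cell value for the concept pair.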


Limitations: Coding in AutoMap is computer-assisted. Computer-assisted coding

means that the machine applies a set of coding rules that were defined by a human

(Ryan & Bernard, 2000, p. 786; Kelle, 1997, p. 6; Klein, 1997, p. 256). Coding rules

in AutoMap imply text pre-processing. Text pre-processing condenses the data to

the concepts that capture the features of the texts that are relevant to the user. Preprocessing

techniques provided in AutoMap are Named-Entity Recognition,

Stemming, Deletion, and Thesaurus application. The creation of delete lists and

thesauri requires some manual effort (see Discussion section for details).

Hardware and software requirements: AutoMap1.2 has been implemented in Java 1.4. The

system has been validated for Windows. The installer for AutoMap1.2 for Windows and

a help file that includes examples of all AutoMap1.2 functionalities are available online

at no charge. More

information about AutoMap, such as publications, sponsors, and contact information,

is also provided online.

AutoMap has been written such that the only limits on the number of texts that can be

analyzed, the number of concepts that can be extracted, and so on are determined by the

processing power and storage space of the user's machine.