back to SVP
project
Report on SVP( Strategic Vectorization Project )
Director: Bob Futrelle
Code: Dan Crispell
Report: Mingyan Shao
The idea of
SVP is: beginning with JPEG file, apply moment
analysis on the pixel intensity information, then create QuadricFit
objects, group these QuadricFit objects together and fit the groups into
line segments. Two possible operations may be done to the
line segments, they can be chained together if
they could merge into a slow curve, or they can be
extended to find intersections. The result image contains
a set of lines, some are chained, andsome are
connected to an intersection.
Steps:
- Preprocess: Convert Image to
2-dimensional array of pixel intensities
- Perform PCA(Principal Component
Analysis)
- Group PCAs based on the location,
orientation, and width
- Fit Line segments to PCA groups
- Postprocess: Detect and deal with
slow curves and intersections
Details:
- Preprocess:
Convert Image to 2-dimensional array of pixel intensities
- Brief Idea
Read an image of JPEG format into an BufferedImage
object, from which get a 2D int array called 'image' of the pixel
intensities. In the 'image' array, the more intensity a pixel is, the
larger number in the corresponding cell of the array. The numbers
in the array vary from 0 to 255. For the convenience of looking
for high intensity, SVP use 0 as white and 255 as black(while
Java2D uses the opposite values), .
When changing an imge to BufferedImage object, a border is added to the
original image. The border width is set by a SVP parameter, OUTER_BORDER_WIDTH.
The default value is 100.
- Perform
PCA(Principal Component Analysis)
- int[][] --> QuadricFit[][]
- Brief Idea
For each pixel in the 'image' array(described in previous part), the
0th, 1st and 2nd moment analyses are performed, so that we can seperate
the cells which are possibly part of a LINE from others, e.g., part of BAR. Only those
cells which pass the 0th and 1st moment tests can be thought of
part of a LINE, and then can have a QuadricFit object create based
on these cells.
This step creates a new 2D array for the image, called 'quadrics', each
cell in which is either a QuadricFit object or NULL. If a pixel passes
the 0th, 1st and 2nd moment tests in sequence, a QuadricFit object will
be created for the corresponding cell in 'quadrics' array, otherwise the
cell will be NULL.
Moment Tests
To take these moment tests, for each 'image' cell(described in
Preprocess), say A, we work on a region of cells. The center of this
region is A, the offset/size of this region is decided by a SVP
parameter, XYBAR_BOX_SIZE. In this region, three
values will be calculated. X_weighted_total is the sum of the cells'
product of intensity and column number, Y_weighted_total is the sum of
the cells' product of intensity and row number. Total is the sum of all
the cells' intensity. In the 0th moment test, the average
density of this region is calculated using Total. If the average
density is smaller than a SVP parameter, MAX_BLACK_DENSITY,
the center cell A passes the 0th moment test. In other words, the 0th
moment test guarantees that the surrounding area of a pixel is not too
dense, otherwise, the cell is probably part of a bar chart which has a
solid inner part. The 1st moment test is used to check if this cell is
the actual center of the region using the weighted
total(X_weighted_total, and Y_weighted_total). Since the weights in the
weighted total are row and column respectively, the calculated center of
the region will be a point which contributed most to the
weighted_total. If the actual center cell of the region is
roughly(decided by a SVP parameter, XYBAR_ERROR_TOLERANCE)
the calculated center, this cell passes the 1st moment test---the
cell is part of a line. Next, we will use 2nd moment analysis to
create a QuadricFit object for this specific cell. For this cell,
like in 1st moment test, we also work on a region, whose size is
set by a SVP parameter, QUADRIC_BOX_SIZE. Based
on the cell and the region surrounding it, we construct a QuadricFit
object, which will be described next.
QuadricFit
A QuadricFit object can be considered as a tiny segment of a line. Its
data members include center point, the angle of major axis, the lengths
of major and minor axes, and group id(described later). For each cell,
we construct a region, as described before. From this region, we get the
offset of each pixel to the center pixel, based on which we can
calculate the angle of major axis, then lengths of major and minor axes.
, QuadricFit, InterpolatedFunction,
etc.
- Parameters
XYBAR_BOX_SIZE: Indicate
the offse/size of the region to be considered around a ‘center’ cell
QUADRIC_BOX_SIZE: Region
whose size should be smaller than this parameter could be a quadric
MAX_BLACK_DENSITY: We calculate the average
density of a region. If the average density is smaller than this
parameter, this region passes the 0th moment test, which
guarantees that surrounding area of a pixel is not too dense.
XYBAR_ERROR_TOLERANCE: For the 1st
moment test. By deciding if the weighted center is smaller than this
parameter, we decide if the calculated center is the actual center or
not.
- Group PCAs
based on the location, orientation, and width
- QuadricFit[][]
--> Vector of QuadricGroup
- Brief Idea
Till now, the JPEG image has been converted into a 2D array of
QuadricFit element. Each QuadricFit is possiblely a tiny part of some
line(s). Since our purpose(currently) is to get those lines, we
should first group some QuadricFit objects together according to
some criteria, e.g., position, orientation, then try to fit line
segments to the groups(described later). In this step, we focus on
grouping QuadricFit objects. The grouped QuadricFit objects are
kept in QuadricGroup object, which is then stored in a Vector of
QuadricGroup.
QuadricGroup
To decide if a QuadricFit belongs to it, the QuadricGroup checks four
things. If the QuadricFit pass these 4 test, it will be added to the
QuadricGroup. The four tests are: ( From QuadricGroup.java )
1. Its(QuadricFit) theta value must be within a
certain factor times the std. deviation of the mean group theta.
2. if added to the group, it must not bring the theta
variance over a certain limit. This prevents slowly changing curves from
being treated as single lines.
3. Its width value must be within a certain factor times
the std. deviation of the mean group width.
4. if added to the group, it must not bring the width
variance over a certain limit.
- Related Classes: ImageQuadricData, QuadricGroup,QuadricFit, etc
- Parameters
MAX_QUADRIC_ASPECT_RATIO: This
parameter is useful in grouping quadricfit objects. A quadricfit object
can be grouped into a quadric group if its width and length satisfy a
criteria like width/length < this parameter
MIN_QUADRIC_GROUP_SIZE: Every
group should contain at least quadricfit objects more than this
parameter.
MAX_WIDTH_VARIANCE: guarantee
that after a quadricfit object has been added to this group, the change
of width of this group will not be too big.
MAX_THETA_VARIANCE:
guarantee that after a quadricfit object has been added to this group,
the change of theta of this group will not be too big.
- Fit Line
segments to PCA groups
- Vector of
QuadricGroup --> Vector of LineSegment
- Brief Idea:
For each QuadricGroup,
- decide the possible line is
closer to horizontal or vertical
- Best fit line. Linear regression using least square.
- find endpoints
- given
endpoints, get best line model and construct a
LineSegment object
- Decide the
possible line is closer to horizontal or vertical
Construct the boundingbox of a group of QuadricFit objects. The boundingbox,
which usually is a rectangle, shows the rough
coverage of the group of cells. Using the right upper and left
lower points, get the direction of the possible“line” and determine whether this line is closer to
horizontal or vertical.
- Best Fit Line
Check if the line to close to be horizontal,
or vertical, then use Least Squared Fitting to get slope and
intercept. The slope we got now is the best_fit_slope
and is calculated. From the QuadricFitGroup, we have another slope
called quad_slope. We need a balance between these two slopes, thus
we use a SVP parameter, THETA_MEAN_CUTOFF_GROUP_SIZE,
to decide the contribution of each of them to the slope of the group.
If the group size is greater than this parameter, best_fit_slope will contribute more than quad_slope because
best_fit_slope is based on enough number of Quadrics objects and
should be quite reliable, otherwise, quad_slop of
the QuadricGroup will contribute more to the
slope of the whole group. Given best_fit_slope,
x_total and y_total, it’s easy to calculate best_fit_inercept.
Now,
we know slope and intercept, we need to calculate two endpoints. To do
this, first we calculate the unit x and y of the direction, which is easy
since we know slope already. Next, find the minimal and maximal
projected distances(on x if closer to horizontal, on y if closer to
vertical) by going through the quadric group. Then, two endpoints can be
calculated by these projected distances, slope and intercept.
- Given QuadricGroup and two endpoints, we
need to find the best line model. The best line model means
minimizing type 1(core) and type 2(wing) errors.
In this step, we need to find a
critical point that can not only cover the QuadricGroup, but also
minimize the core and wings errors. First we use QuadricGroup’s
width and a SVP parameter, LINE_FITER_WIDTH,
to set a range of width to look for the critical point. We split this
range into a number of strips and calculate the average pixel density of
each strip. By going through these strips, we calculate the sum of the
errors, and keep the width for the smallest error. After we scan through
all the strips, the width we get is the critical point, for one wing.
For the other wing, do the same thing to get the critical point.
Using these two critical points, we can calculate the width of
the line model and adjust the center of the line.
Based
on a QuadricGroup, we get a line model of endpoints, width, slope,
intercept, and related statistical information.
- Related Classes:
LineSegmentFitter.java LineSegment.java
- Parameters:
LINE_FITTER_WIDTH: Given
two endpoints, to find best line model, this parameter is used to
calculate max-width of the possible line
THETA_MEAN_CUTOFF_GROUP_SIZE: For
best-fit-line. We use this parameter and actual group size to calculate
the best-fit-ratio = min(1.0,
actual-group-size/theta_mean_cutoff_group_size), to keep best-fit-ratio
<= 1.
- Postprocess:
Detect and deal with slow curves and intersections
- Vector
of LineSegments, Pixels, LineSegmentFitter --> Chain segments
if possible, or find and handle TroubleSpots if necessary.
- Brief
Idea:
Based on the position distance of the endpoints of LineSegments,
process the lines in two possible ways, chain together, or find and
handle troublespots.
If SVP parameter DBG_DO_CHAINING
is set, we try to chain some linesegments together. First, put all the
endpoints of the linesegments into a vector. For each endpoint, scan all
the other endpoints, select those who are not in the same line with it,
but are near to it, then chain the related linesegments together. To
chain them together, two endpoints at a time, first calculate the
mid-point between these two points, next extend the two points by
increasing its x and y coordinates in its original
direction. The distance between one endpoint and
the middle point decides the increasement. Both of the
chained endpoints set "chained" attribution to be true.
Continue on other endpoints in the vector until no more left.
For those points who are not been chained, they will be extended as far as possible. Thus the endpoints
of the QuadricGroup will be as close as possible to each other.
This step is proceed if SVP parameter DBG_DO_EXTENSIONS is true,
and this step is also the preprocessing of finding and handling trouble spots (which will be called if
SVP parameter DBG_DO_TROUBLESPOTS
is set).
To extend the endpoint, we should use four SVP parameters, MAX_WING_EXTENSION_DEVIATION(after extension, the pixel density should be smaller
than this parameter), MAX_CORE_EXTENSION_DEVIATION(after extension, the pixel density should be smaller
than this parameter), MAX_LINE_EXTENSIONS(max number of slice extended), COMPARISON_TOLERANCE(to
deal with very small std. deviation). The idea of extension is trying
to extend for as many as MAX_LINE_EXTENSIONS times, until the noise of
core or wings is too high. In each time of extension, the size of
extension is determined by two SVP parameters, SLICE_SIZE and EXTENSION_INCREMENT. When we try
to extend one slice forward, we should check the new core and new
wings, if they contain much more noise, the extension will be ended.
For each endpoints, which is not chained and has not been add into some
other endpoint's troublespot, we look through
all the available endpoints, and test if the distance between them
is smaller than half of a SVP parameter INTERSECTION_FINDER_SIZE,
if so, that point will be added into its troublespot. In the end, every
unchained endpoint is in some troublespot, some of which
may only contain one endpoint though.
The troublespot will be handled in different ways according to its
size, i.e., how many endpoints are contained in it. The smallest
possible size is one, which does not need handling. If two, there
are two cases. One is that the two linesegments are actually part
of a single line which is seperated by tick mark, the other case
is that they are just two lines. For the first case, we check if
the two lines/endpoints are colinear, i.e., both of them are parts
of a single line. If they are colinear, they are seperated by
tickmark. We assume the tick mark(s) is(are) in the middle pints of the
two endpoints. After change the direcitons of the two endpoints/lines
so that they are in the same direction, and get the middle point, we
scan two wings for the possible tickmarks. One possible tickmars may be
the one starts from the middle point, ends with some point in the
normal vector of the two lines. If the possible tick-line's width
is bigger than 1.0, we create a tickmark and add it into the
linesegments vector, and increase the tickmark counter by 1. The same
work is done for the other wing and possible tickmark. If the two
endpoints in one troublespot are not colinear, just push endpoints
using one wing model.
If the troublespot has more than two endpoints, which is the
generic case, we then check for colinear lines, merge them, and merge
others. To do this, compare all the angles between each pair of
two lines, and get those colinear pairs(whose angle is almost PI, then
merge them just like troublespot with two colinear endpoints. For those
lines which are not colinear, just do one wing extension.
- Related
classes: IntersectionFinder.java,LineSegmentFitter.java, etc.
- Parameters
CHAINING_BOX_SIZE: Each
chaining box should have more endpoints than this parameter.
INTERSECTION_FINDER_SIZE: The
size of pixel box to find possible intersections.
MAX_WING_EXTENSION_DEVIATION: Used
to guarantee the pixel density to be high enough when extending wing(s)
of endpoints
MAX_CORE_EXTENSION_DEVIATION: same
as MAX_WING_EXTENSION_DEVIATION, works for core.
MAX_LINE_EXTENSIONS: Max times of
extension.
COMPARISON_TOLERANCE: to deal with very small std. Deviation in extension.
SLICE_SIZE: Size of
one slice to extend
EXTENSION_INCREMENT: If the
extension can be done, the actual extension is not decided by
SLICE_SIZE, but this parameter. This parameter is smaller than
SLICE_SIZE, making small move.
back to SVP project