SVP Report

Report on SVP( Strategic Vectorization Project )

Director: Bob Futrelle
Code: Dan Crispell
Report: Mingyan Shao

The idea of SVP is: beginning with JPEG file, apply moment analysis on the pixel intensity information, then create QuadricFit objects, group these QuadricFit objects together and fit the groups into line segments. Two possible operations may be done to the line segments, they can be chained together if they could merge into a slow curve, or they can be extended to find intersections. The result image contains a set of lines, some are chained, andsome are connected to an intersection.

Steps:

Preprocess: Convert Image to 2-dimensional array of pixel intensities

Perform PCA(Principal Component Analysis)

Group PCAs based on the location, orientation, and width

Fit Line segments to PCA groups

Postprocess: Detect and deal with slow curves and intersections

Details:

Preprocess: Convert Image to 2-dimensional array of pixel intensities

JPEG --> int[][]

Brief Idea
Read an image of JPEG format into an BufferedImage object, from which get a 2D int array called 'image' of the pixel intensities. In the 'image' array, the more intensity a pixel is, the larger number in the corresponding cell of the array. The numbers in the array vary from 0 to 255. For the convenience of looking for high intensity, SVP use 0 as white and 255 as black(while Java2D uses the opposite values), .
When changing an imge to BufferedImage object, a border is added to the original image. The border width is set by a SVP parameter, OUTER_BORDER_WIDTH. The default value is 100.

Related Class: ArrayImage

Perform PCA(Principal Component Analysis)

int[][] --> QuadricFit[][]

Brief Idea
For each pixel in the 'image' array(described in previous part), the 0th, 1st and 2nd moment analyses are performed, so that we can seperate the cells which are possibly part of a LINE from others, e.g., part of BAR. Only those cells which pass the 0th and 1st moment tests can be thought of part of a LINE, and then can have a QuadricFit object create based on these cells.
This step creates a new 2D array for the image, called 'quadrics', each cell in which is either a QuadricFit object or NULL. If a pixel passes the 0th, 1st and 2nd moment tests in sequence, a QuadricFit object will be created for the corresponding cell in 'quadrics' array, otherwise the cell will be NULL.

Moment Tests
To take these moment tests, for each 'image' cell(described in Preprocess), say A, we work on a region of cells. The center of this region is A, the offset/size of this region is decided by a SVP parameter, XYBAR_BOX_SIZE. In this region, three values will be calculated. X_weighted_total is the sum of the cells' product of intensity and column number, Y_weighted_total is the sum of the cells' product of intensity and row number. Total is the sum of all the cells' intensity. In the 0th moment test, the average density of this region is calculated using Total. If the average density is smaller than a SVP parameter, MAX_BLACK_DENSITY, the center cell A passes the 0th moment test. In other words, the 0th moment test guarantees that the surrounding area of a pixel is not too dense, otherwise, the cell is probably part of a bar chart which has a solid inner part. The 1st moment test is used to check if this cell is the actual center of the region using the weighted total(X_weighted_total, and Y_weighted_total). Since the weights in the weighted total are row and column respectively, the calculated center of the region will be a point which contributed most to the weighted_total. If the actual center cell of the region is roughly(decided by a SVP parameter, XYBAR_ERROR_TOLERANCE) the calculated center, this cell passes the 1st moment test---the cell is part of a line. Next, we will use 2nd moment analysis to create a QuadricFit object for this specific cell. For this cell, like in 1st moment test, we also work on a region, whose size is set by a SVP parameter, QUADRIC_BOX_SIZE. Based on the cell and the region surrounding it, we construct a QuadricFit object, which will be described next.

QuadricFit
A QuadricFit object can be considered as a tiny segment of a line. Its data members include center point, the angle of major axis, the lengths of major and minor axes, and group id(described later). For each cell, we construct a region, as described before. From this region, we get the offset of each pixel to the center pixel, based on which we can calculate the angle of major axis, then lengths of major and minor axes.

Related Classes: ImageQuadricData,QuadricFitter

, QuadricFit, InterpolatedFunction, etc.

Parameters
XYBAR_BOX_SIZE:             Indicate the offse/size of the region to be considered around a ‘center’ cell
QUADRIC_BOX_SIZE:        Region whose size should be smaller than this parameter could be a quadric
MAX_BLACK_DENSITY:     We calculate the average density of a region. If the average density is smaller than this parameter, this region passes the 0^th moment test, which guarantees that surrounding area of a pixel is not too dense.
XYBAR_ERROR_TOLERANCE:      For the 1^st moment test. By deciding if the weighted center is smaller than this parameter, we decide if the calculated center is the actual center or not.

Group PCAs based on the location, orientation, and width

QuadricFit[][] --> Vector of QuadricGroup
Brief Idea
Till now, the JPEG image has been converted into a 2D array of QuadricFit element. Each QuadricFit is possiblely a tiny part of some line(s). Since our purpose(currently) is to get those lines, we should first group some QuadricFit objects together according to some criteria, e.g., position, orientation, then try to fit line segments to the groups(described later). In this step, we focus on grouping QuadricFit objects. The grouped QuadricFit objects are kept in QuadricGroup object, which is then stored in a Vector of QuadricGroup.

QuadricGroup
To decide if a QuadricFit belongs to it, the QuadricGroup checks four things. If the QuadricFit pass these 4 test, it will be added to the QuadricGroup. The four tests are: ( From QuadricGroup.java )
1. Its(QuadricFit) theta value must be within a certain factor times the std. deviation of the mean group theta.
2. if added to the group, it must not bring the theta variance over a certain limit. This prevents slowly changing curves from being treated as single lines.
3. Its width value must be within a certain factor times the std. deviation of the mean group width.
4. if added to the group, it must not bring the width variance over a certain limit.
Related Classes: ImageQuadricData, QuadricGroup,QuadricFit, etc
Parameters
MAX_QUADRIC_ASPECT_RATIO: This parameter is useful in grouping quadricfit objects. A quadricfit object can be grouped into a quadric group if its width and length satisfy a criteria like width/length < this parameter
MIN_QUADRIC_GROUP_SIZE: Every group should contain at least quadricfit objects more than this parameter.
MAX_WIDTH_VARIANCE: guarantee that after a quadricfit object has been added to this group, the change of width of this group will not be too big.
MAX_THETA_VARIANCE: guarantee that after a quadricfit object has been added to this group, the change of theta of this group will not be too big.

Fit Line segments to PCA groups

Vector of QuadricGroup --> Vector of LineSegment
Brief Idea:
For each QuadricGroup,

decide the possible line is closer to horizontal or vertical
Best fit line. Linear regression using least square.
find endpoints
given endpoints, get best line model and construct a LineSegment object

Decide the possible line is closer to horizontal or vertical
Construct the boundingbox of a group of QuadricFit objects. The boundingbox, which usually is a rectangle, shows the rough coverage of the group of cells. Using the right upper and left lower points, get the direction of the possible“line” and determine whether this line is closer to horizontal or vertical.
Best Fit Line
Check if the line to close to be horizontal, or vertical, then use Least Squared Fitting to get slope and intercept. The slope we got now is the best_fit_slope and is calculated. From the QuadricFitGroup, we have another slope called quad_slope. We need a balance between these two slopes, thus we use a SVP parameter, THETA_MEAN_CUTOFF_GROUP_SIZE, to decide the contribution of each of them to the slope of the group. If the group size is greater than this parameter, best_fit_slope will contribute more than quad_slope because best_fit_slope is based on enough number of Quadrics objects and should be quite reliable, otherwise, quad_slop of the QuadricGroup will contribute more to the slope of the whole group. Given best_fit_slope, x_total and y_total, it’s easy to calculate best_fit_inercept.

Now, we know slope and intercept, we need to calculate two endpoints. To do this, first we calculate the unit x and y of the direction, which is easy
since we know slope already. Next, find the minimal and maximal projected distances(on x if closer to horizontal, on y if closer to vertical) by going through the quadric group. Then, two endpoints can be calculated by these projected distances, slope and intercept.

Given QuadricGroup and two endpoints, we need to find the best line model. The best line model means minimizing type 1(core) and type 2(wing) errors.
In this step, we need to find a critical point that can not only cover the QuadricGroup, but also minimize the core and wings errors. First we use QuadricGroup’s width and a SVP parameter, LINE_FITER_WIDTH, to set a range of width to look for the critical point. We split this range into a number of strips and calculate the average pixel density of each strip. By going through these strips, we calculate the sum of the errors, and keep the width for the smallest error. After we scan through all the strips, the width we get is the critical point, for one wing. For the other wing, do the same thing to get the critical point. Using these two critical points, we can calculate the width of the line model and adjust the center of the line.

Based on a QuadricGroup, we get a line model of endpoints, width, slope, intercept, and related statistical information.

Related Classes: LineSegmentFitter.java LineSegment.java
Parameters:
LINE_FITTER_WIDTH: Given two endpoints, to find best line model, this parameter is used to calculate max-width of the possible line
THETA_MEAN_CUTOFF_GROUP_SIZE: For best-fit-line. We use this parameter and actual group size to calculate the best-fit-ratio = min(1.0, actual-group-size/theta_mean_cutoff_group_size), to keep best-fit-ratio <= 1.

Postprocess: Detect and deal with slow curves and intersections

Vector of LineSegments, Pixels, LineSegmentFitter --> Chain segments if possible, or find and handle TroubleSpots if necessary.
Brief Idea:
Based on the position distance of the endpoints of LineSegments, process the lines in two possible ways, chain together, or find and handle troublespots.

If SVP parameter DBG_DO_CHAINING is set, we try to chain some linesegments together. First, put all the endpoints of the linesegments into a vector. For each endpoint, scan all the other endpoints, select those who are not in the same line with it, but are near to it, then chain the related linesegments together. To chain them together, two endpoints at a time, first calculate the mid-point between these two points, next extend the two points by increasing its x and y coordinates in its original direction. The distance between one endpoint and the middle point decides the increasement. Both of the chained endpoints set "chained" attribution to be true. Continue on other endpoints in the vector until no more left.

For those points who are not been chained, they will be extended as far as possible. Thus the endpoints of the QuadricGroup will be as close as possible to each other. This step is proceed if SVP parameter DBG_DO_EXTENSIONS is true, and this step is also the preprocessing of finding and handling trouble spots (which will be called if SVP parameter DBG_DO_TROUBLESPOTS is set).

To extend the endpoint, we should use four SVP parameters, MAX_WING_EXTENSION_DEVIATION(after extension, the pixel density should be smaller than this parameter), MAX_CORE_EXTENSION_DEVIATION(after extension, the pixel density should be smaller than this parameter), MAX_LINE_EXTENSIONS(max number of slice extended), COMPARISON_TOLERANCE(to deal with very small std. deviation). The idea of extension is trying to extend for as many as MAX_LINE_EXTENSIONS times, until the noise of core or wings is too high. In each time of extension, the size of extension is determined by two SVP parameters, SLICE_SIZE and EXTENSION_INCREMENT. When we try to extend one slice forward, we should check the new core and new wings, if they contain much more noise, the extension will be ended.

For each endpoints, which is not chained and has not been add into some other endpoint's troublespot, we look through all the available endpoints, and test if the distance between them is smaller than half of a SVP parameter INTERSECTION_FINDER_SIZE, if so, that point will be added into its troublespot. In the end, every unchained endpoint is in some troublespot, some of which may only contain one endpoint though.

The troublespot will be handled in different ways according to its size, i.e., how many endpoints are contained in it. The smallest possible size is one, which does not need handling. If two, there are two cases. One is that the two linesegments are actually part of a single line which is seperated by tick mark, the other case is that they are just two lines. For the first case, we check if the two lines/endpoints are colinear, i.e., both of them are parts of a single line. If they are colinear, they are seperated by tickmark. We assume the tick mark(s) is(are) in the middle pints of the two endpoints. After change the direcitons of the two endpoints/lines so that they are in the same direction, and get the middle point, we scan two wings for the possible tickmarks. One possible tickmars may be the one starts from the middle point, ends with some point in the normal vector of the two lines. If the possible tick-line's width is bigger than 1.0, we create a tickmark and add it into the linesegments vector, and increase the tickmark counter by 1. The same work is done for the other wing and possible tickmark. If the two endpoints in one troublespot are not colinear, just push endpoints using one wing model.
If the troublespot has more than two endpoints, which is the generic case, we then check for colinear lines, merge them, and merge others. To do this, compare all the angles between each pair of two lines, and get those colinear pairs(whose angle is almost PI, then merge them just like troublespot with two colinear endpoints. For those lines which are not colinear, just do one wing extension.
Related classes: IntersectionFinder.java,LineSegmentFitter.java, etc.
Parameters
CHAINING_BOX_SIZE:       Each chaining box should have more endpoints than this parameter.
INTERSECTION_FINDER_SIZE:    The size of pixel box to find possible intersections.
MAX_WING_EXTENSION_DEVIATION: Used to guarantee the pixel density to be high enough when extending wing(s) of endpoints
MAX_CORE_EXTENSION_DEVIATION:   same as MAX_WING_EXTENSION_DEVIATION, works for core.
MAX_LINE_EXTENSIONS: Max times of extension.
COMPARISON_TOLERANCE: to deal with very small std. Deviation in extension.
SLICE_SIZE:   Size of one slice to extend
EXTENSION_INCREMENT: If the extension can be done, the actual extension is not decided by SLICE_SIZE, but this parameter. This parameter is smaller than SLICE_SIZE, making small move.

back to SVP project