DAO design for writing big XML file on database [on hold]



I am currently working on a Java EE application (Spring, Hibernate). I have to load a big XML file (more than 1 gigabyte) into a relational database (Postgres).

The application does not use batch processing. I've done some searching, but I did not find any solution for the design of the DAO layer: if I use only one transaction, the server will not respond to any request until it finishes inserting the rows (a huge number of them), so using a single transaction is not a good idea. I can split the XML file based on its tags: the content of each tag is inserted as one row. The idea is to use multithreading to manage the transactions (each transaction inserts a defined number of rows). I am having difficulty working out how many transactions (and how many rows per transaction) are needed to keep the application's response time acceptable. I am also looking for a way to handle the failure of some transactions: for example, if only 3 transactions out of over 1,000,000 fail, should I retry all of them?

While searching, I found that batch frameworks such as Spring Batch handle database records and transaction failures, but this application does not use batch processing.

Unfortunately, I cannot switch to a NoSQL database or add the Spring Batch framework to the project.
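For illustration, here is a minimal sketch of the chunking idea without Spring Batch: stream the XML with StAX, collect a fixed number of elements, and commit each chunk in its own transaction through Spring's TransactionTemplate, so a failed chunk can be retried on its own instead of re-running the whole import. The element name, table name, chunk size and the use of JdbcTemplate (rather than Hibernate) for the bulk insert are assumptions, not taken from the question.

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import org.springframework.dao.DataAccessException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.support.TransactionTemplate;

@Repository
public class XmlImportDao {

    private static final int CHUNK_SIZE = 1000;   // hypothetical value, to be tuned

    private final JdbcTemplate jdbcTemplate;
    private final TransactionTemplate transactionTemplate;

    public XmlImportDao(JdbcTemplate jdbcTemplate, PlatformTransactionManager txManager) {
        this.jdbcTemplate = jdbcTemplate;
        this.transactionTemplate = new TransactionTemplate(txManager);
    }

    // Stream the file so the 1 GB document is never fully loaded into memory.
    public void importFile(InputStream xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(xml);
        List<String> chunk = new ArrayList<String>(CHUNK_SIZE);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(reader.getLocalName())) {     // hypothetical element name
                chunk.add(reader.getElementText());                  // assumes text-only content
                if (chunk.size() == CHUNK_SIZE) {
                    insertChunk(chunk);
                    chunk = new ArrayList<String>(CHUNK_SIZE);
                }
            }
        }
        if (!chunk.isEmpty()) {
            insertChunk(chunk);
        }
    }

    // One transaction per chunk: if it fails, only this chunk is rolled back,
    // so only this chunk needs to be retried (here: a single naive retry).
    private void insertChunk(List<String> rows) {
        try {
            writeChunk(rows);
        } catch (DataAccessException firstAttemptFailed) {
            writeChunk(rows);   // log or queue the chunk instead if it fails again
        }
    }

    private void writeChunk(final List<String> rows) {
        transactionTemplate.executeWithoutResult(status ->
                jdbcTemplate.batchUpdate("INSERT INTO xml_row (content) VALUES (?)",   // hypothetical table
                        rows, rows.size(), (ps, content) -> ps.setString(1, content)));
    }
}

The chunk size is the knob that trades insert throughput against how long each transaction blocks other requests; in practice it would be tuned against the response-time measurements mentioned above, and chunks that still fail after a retry can be logged and replayed later rather than restarting the whole file.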


How to read a movie XML file in C# for use in a SQL Server database? [on hold]

I have an XML file which looks exactly like this, just a lot longer:

<movielist>
<movie>
    <title>Amazon Quest </title>
    <year>1954</year>
    <length>75 min</length>
    <director>Steve Sekely</director>
    <rating>7</rating>
    <genre>Action</genre>
    <genre>Drama</genre>
    <actor>Tom Neal</actor>
    <actor>Carole Mathews</actor>
    <actor>Carole Donne</actor>
    <actor>Don Zelaya</actor>
    <actor>Ralph Graves</actor>
</movie>
<movie>
    <title>American Ninja 3: Blood Hunt </title>
    <year>1989</year>
    <length>89 min</length>
    <certification>R</certification>
    <director>Cedric Sundstrom</director>
    <rating>7</rating>
    <genre>Action</genre>
    <genre>Drama</genre>
    <actor>David Bradley</actor>
    <actor>Steve James</actor>
    <actor>Marjoe Gortner</actor>
    <actor>Michele B. Chan</actor>
    <actor>Yehuda Efroni</actor>
</movie>
</movielist>

This is my code to read it:

using System;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        XmlTextReader reader = null;

        try
        {
            reader = new XmlTextReader("movies.xml");

            // walk through the file node by node
            while (reader.Read())
            {
                if (reader.IsStartElement())
                {
                    if (reader.IsEmptyElement)
                        Console.WriteLine("<{0}/>", reader.Name);
                    else
                    {
                        reader.Read();
                        Console.WriteLine(reader.ReadString());
                    }
                }
            }
        }
        finally
        {
            if (reader != null)
                reader.Close();
        }
    }
}

but it gives me an error saying:

An unhandled exception of type 'System.Xml.XmlException' occurred in System.Xml.dll

Additional information: Unexpected end of file has occurred. The following elements are not closed: movielist. Line ___(last line which looks exactly like the last line in this), position 1.

What could be the problem? It reads the file just fine until then, and the supposed line of the error is the line where I say:

if (reader.IsStartElement())


Writing a netcdf4 file is 6 times slower than writing a netcdf3_classic file, and the file is 8 times as big?

I am using the netCDF4 library in Python and just came across the issue stated in the title. At first I blamed groups for this, but it turns out to be a difference between the NETCDF4 and NETCDF3_CLASSIC formats.

In the program below (which you will unfortunately not be able to run, because it requires database access), I am creating a simple time-series netCDF file of the same data in 3 different ways: 1) as a NETCDF3_CLASSIC file, 2) as a NETCDF4 flat file, 3) as a NETCDF4 file with groups. What I find with simple timing and the ls command is:

1) NETCDF3          1.3483 seconds      1922704 bytes
2) NETCDF4 flat     8.5920 seconds     15178689 bytes
3) NETCDF4 groups   8.5565 seconds     15178896 bytes

It's exactly the same routine that creates 1) and 2); the only difference is the format argument of the netCDF4.Dataset call. Is this a bug or a feature?

Thanks, Martin

Code:

import sys
import datetime as dt
import numpy as np
from obs_stations_database import ObsStationsDatabase
import netCDF4 as nc

SERVER = "************"

def read_timeseries(user, password, network="GAW", station="GLH", parameter="O3",
        daterange=None):
    # interpret daterange if given (convert string to datetime, format YYYY-MM-DD)
    if daterange is not None:
        try:
            if isinstance(daterange[0], basestring):
                daterange[0] = dt.datetime.strptime(daterange[0], "%Y-%m-%d")
            if isinstance(daterange[1], basestring):
                daterange[1] = dt.datetime.strptime(daterange[1], "%Y-%m-%d")
        except IOError as e:
            raise IOError(e)
    with ObsStationsDatabase(user_name=user, user_passcode=password,
                             database_host=SERVER) as db:
        station_id = db.get_stations(network_name=network, station_id=station,
                key_only=True, as_dict=False)[0][1]
        print "station_id = ", station_id
        series_id = db.get_parameter_series_id(network, station_id, parameter)
        print "series_id = ", series_id
        if series_id is not None:
            t0 = dt.datetime.now()
            data = db.get_hourly_data(series_id, daterange=daterange)
            t1 = dt.datetime.now()
            print "Database loading took %10.4f seconds." %
((t1-t0).total_seconds())
        series_info = db.get_parameter_series_info(series_id)
    return data, series_info


def write_to_netcdf_single(filename, data, series_info, format='NETCDF4'):
    vname = series_info["parameter_name"]
    t0 = dt.datetime.now()
    with nc.Dataset(filename, "w", format=format) as f:
        # define dimensions and variables
        dim = f.createDimension('time', None)
        time = f.createVariable('time', 'f8', ('time',))
        time.units = "days since 1900-01-01 00:00:00"
        time.calendar = "gregorian"
        param = f.createVariable(vname, 'f4', ('time',))
        param.units = "nmol mol-1"    ### replace this with database
query result!
        flag = f.createVariable(vname+'_flag', 'i2', ('time',))
        flag.long_name = "Data quality flag for %s. Values, see WMO
code table 033 020" % (v
        # define global attributes
        for k, v in sorted(series_info.items()):
            if isinstance(v, dt.datetime):
                v = v.isoformat(" ")
            setattr(f, k, v)
        # store data values
        time[:] = nc.date2num(data.time, units=time.units, calendar=time.calendar)
        param[:] = data.value
        flag[:] = data.flag
    t1 = dt.datetime.now()
    print "Writing simple file took %10.4f seconds." %
((t1-t0).total_seconds())


def write_to_netcdf_grouped(filename, data, series_info):
    t0 = dt.datetime.now()
    with nc.Dataset(filename, "w", format='NETCDF4') as f:
        for i, sinfo in enumerate(series_info):
            print i, sinfo
            vname = sinfo["parameter_name"]
            # define groups
            grp = f.createGroup(sinfo["station_id"])
            # define dimensions and variables
            dim = grp.createDimension('time', None)
            time = grp.createVariable('time', 'f8', ('time',))
            time.units = "days since 1900-01-01 00:00:00"
            time.calendar = "gregorian"
            param = grp.createVariable(vname, 'f4', ('time',))
            param.units = "nmol mol-1"    ### replace this with
database query result!
            flag = grp.createVariable(vname+'_flag', 'i2', ('time',))
            flag.long_name = "Data quality flag for %s. Values, see
WMO code table 033 020"
            # define global attributes
            for k, v in sorted(sinfo.items()):
                if isinstance(v, dt.datetime):
                    v = v.isoformat(" ")
                setattr(grp, k, v)
            # store data values
            time[:] = nc.date2num(data[i].time, units=time.units, calendar=time.calendar)
            param[:] = data[i].value
            flag[:] = data[i].flag
    t1 = dt.datetime.now()
    print "Writing grouped file took %10.4f seconds." %
((t1-t0).total_seconds())



if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage: obs_station_to_netcdf user password"
        print "(username and password for the obs_surface_stations
database"
        exit(2)
    user = sys.argv[1]
    password = sys.argv[2]
    network = "GAW"
    station = raw_input("Enter station code: ")
    data, series_info = read_timeseries(user, password, network, station, parameter="O3")
    print series_info
    filename = "%s_%s_nc3.nc" % (series_info["station_id"],
series_info["parameter_name"])
    write_to_netcdf_single(filename, data, series_info,
format='NETCDF3_CLASSIC')
    filename = "%s_%s.nc" % (series_info["station_id"],
series_info["parameter_name"])
    write_to_netcdf_single(filename, data, series_info)
    filename = filename.rstrip(".nc") + "_grouped.nc"
    write_to_netcdf_grouped(filename, [data], [series_info])

And to prove that this is really the same data, here are the ncdumps (global attribute/group attributes truncated):

NETCDF3_CLASSIC:

netcdf ASK123N00_O3_nc3 {
dimensions:
    time = UNLIMITED ; // (120069 currently)
variables:
    double time(time) ;
            time:units = "days since 1900-01-01 00:00:00" ;
            time:calendar = "gregorian" ;
    float O3(time) ;
            O3:units = "nmol mol-1" ;
    short O3_flag(time) ;
            O3_flag:long_name = "Data quality flag for O3. Values, see
WMO code table 033 020" ;

// global attributes:
            :comments = "Time range 1-24 detected: Converted to 0-23
assuming data was given at interval endpoints" ;
...
}

NETCDF4 flat:

netcdf ASK123N00_O3 {
dimensions:
    time = UNLIMITED ; // (120069 currently)
variables:
    double time(time) ;
            time:units = "days since 1900-01-01 00:00:00" ;
            time:calendar = "gregorian" ;
    float O3(time) ;
            O3:units = "nmol mol-1" ;
    short O3_flag(time) ;
            O3_flag:long_name = "Data quality flag for O3. Values, see
WMO code table 033 020" ;

// global attributes:
...
}

NETCDF4 groups:

netcdf ASK123N00_O3_grouped {

group: ASK123N00 {
  dimensions:
    time = UNLIMITED ; // (120069 currently)
  variables:
    double time(time) ;
            time:units = "days since 1900-01-01 00:00:00" ;
            time:calendar = "gregorian" ;
    float O3(time) ;
            O3:units = "nmol mol-1" ;
    short O3_flag(time) ;
            O3_flag:long_name = "Data quality flag for O3. Values, see
WMO code table 033 020" ;

  // group attributes:
....
  } // group ASK123N00
}

MySQL database design concept for a hospital [on hold]

I am developing a hospital management application in MySQL and PHP. I am giving a brief outline of the tables, for brevity. Among other tables, I have:

medicine

id, medicine_name, medicine_id

patient_table

id, general_regn_no, ipd_regn_no, patient_name

patient_detail_entry(patient_admission_table)

id, general_regn_no, ipd_regn_no, patient_name, room_name_id,
room_category_id, admission_time, ....

I am conceptualizing a form which will have the following fields.

Primary fields

general_regn_no, ipd_regn_no, patient_name, room_name, date_time

Medicine request fields, for requesting the issue of medicine from the medicine store (billing is done directly by the medicine store, hence price is not needed):

medicine_name  medicine_id  quantity
medicine_name  medicine_id  quantity
medicine_name  medicine_id  quantity

So the medicine request fields will be repeating fields. This request should go to the medicine store, which in turn will issue the medicine on receipt of the request.

medicine_store_table

is_delivered, ipd_patient_id, patient_name,
issue_date_time, medicine_name, quantity

Now the question is: should I have separate tables for issue and request, or a common table with a many-to-many relationship between patient and medicine?

Thanks.


How to create one XML file for each record from an XML file exported from a MySQL database?

Good day to everyone. I exported a set of data as XML from a MySQL database, but I want to split the existing XML file so that each ROW ends up in its own XML file. An example is below:

Exported XML result from database:

Filename: result01.xml

Content of the file:

<ROWDATA>
<ROW>
    <DOCKEY>57911</DOCKEY>
    <DOCNO>MY1113</DOCNO>
    <DOCDATE>20141201</DOCDATE>
</ROW>
<ROW>
    <DOCKEY>57913</DOCKEY>
    <DOCNO>MY1114</DOCNO>
    <DOCDATE>20141201</DOCDATE>

</ROW>
<ROW>
    <DOCKEY>57915</DOCKEY>
    <DOCNO>MY1115</DOCNO>
    <DOCDATE>20141201</DOCDATE>
</ROW>
<ROW>
    <DOCKEY>57915</DOCKEY>
    <DOCNO>MY1115</DOCNO>
    <DOCDATE>20141201</DOCDATE>
</ROW>
<ROW>
    <DOCKEY>57957</DOCKEY>
    <DOCNO>MY1160</DOCNO>
    <DOCDATE>20141201</DOCDATE>
</ROW>
</ROWDATA>

But what I need is to create one file per row:

Filename: 57911.MY1113.xml
XML in the file:

<ROWDATA>
    <ROW DOCKEY="57911" DOCNO="MY1113" DOCDATE="20141201">
</ROW></ROWDATA>

Filename: 57913.MY1114.xml
XML in the file:

<ROWDATA>
      <ROW DOCKEY="57913" DOCNO="MY1114" DOCDATE="20141201">
</ROW></ROWDATA>

Does anyone know if there's a simple way of creating multiple XML files as I mentioned? Your feedback is highly appreciated.

Thank you very much.
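The question does not name a target language, so here is a minimal sketch using Java's standard DOM API, assuming the whole export fits in memory. The element names and the DOCKEY.DOCNO.xml file-name pattern are taken from the example above; everything else is an assumption.

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SplitRows {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new File("result01.xml"));   // the exported file
        NodeList rows = source.getElementsByTagName("ROW");

        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            String dockey  = row.getElementsByTagName("DOCKEY").item(0).getTextContent();
            String docno   = row.getElementsByTagName("DOCNO").item(0).getTextContent();
            String docdate = row.getElementsByTagName("DOCDATE").item(0).getTextContent();

            // Build a one-row document with the values as attributes, as in the desired output.
            Document out = builder.newDocument();
            Element rowdata = out.createElement("ROWDATA");
            Element outRow = out.createElement("ROW");
            outRow.setAttribute("DOCKEY", dockey);
            outRow.setAttribute("DOCNO", docno);
            outRow.setAttribute("DOCDATE", docdate);
            rowdata.appendChild(outRow);
            out.appendChild(rowdata);

            // File name pattern taken from the example: DOCKEY.DOCNO.xml
            Transformer writer = TransformerFactory.newInstance().newTransformer();
            writer.transform(new DOMSource(out), new StreamResult(new File(dockey + "." + docno + ".xml")));
        }
    }
}

For very large exports, a streaming parser (SAX/StAX) or an XSLT 2.0 stylesheet with xsl:result-document would avoid loading the whole file at once.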


Database design for book, author and editor [on hold]

1) Author

public class Author {

    Integer id;
    String  name;

}

2) Editor

public class Editor {

    Integer id;
    String  name;

}

3) Book

import java.util.List;

public class Book {

    Integer      id;

    String       title;

    // A book may have several authors. Note that the order of the authors
    // is important, i.e. we want to be able to tell who's the first author,
    // who's the second author, and so on.
    List<Author> authors;

    // A book is edited by one editor.
    Editor       editor;

}

Q-1) What is the type of the relationship between authors and books?
Q-2) What is the type of the relationship between editors and books?
Q-3) How many tables are needed for this database?
Q-4) Which of the following is the correct schema for the authors table?
1-(id, name)
2-(id, name, book_id)
3-(id, name, editor_id)
4-(id, name, book_id, editor_id)
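For illustration only (not necessarily the answer the exercise expects), below is one common way such relationships are expressed with JPA/Hibernate annotations; the join-table and column names are made up, and Author and Editor are assumed to be mapped as entities with ids of their own. An ordered author list is usually carried by a separate join table with a position column, while the editor reference is a plain many-to-one.

import java.util.List;

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.JoinTable;
import javax.persistence.ManyToMany;
import javax.persistence.ManyToOne;
import javax.persistence.OrderColumn;

@Entity
public class Book {

    @Id
    Integer id;

    String title;

    // Many-to-many: stored in a separate join table, with an extra column
    // that remembers each author's position in the list.
    @ManyToMany
    @JoinTable(name = "book_author",                        // hypothetical join table
               joinColumns = @JoinColumn(name = "book_id"),
               inverseJoinColumns = @JoinColumn(name = "author_id"))
    @OrderColumn(name = "author_position")
    List<Author> authors;

    // Many-to-one: each book has exactly one editor, an editor can edit many books.
    @ManyToOne
    @JoinColumn(name = "editor_id")
    Editor editor;
}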

