#archiveteam 2015-02-04,Wed

Time Nickname Message
00:04 🔗 Ymgve has joined #archiveteam
00:24 🔗 achip has joined #archiveteam
00:29 🔗 marvinw has quit IRC (Max SendQ exceeded)
00:30 🔗 lytv has quit IRC (Max SendQ exceeded)
00:33 🔗 lytv has joined #archiveteam
00:34 🔗 marvinw has joined #archiveteam
00:44 🔗 Mayonaise has joined #archiveteam
01:08 🔗 xk_id_ has quit IRC (Remote host closed the connection)
01:25 🔗 xk_id has joined #archiveteam
01:28 🔗 Ymgve has quit IRC ()
01:39 🔗 achip has quit IRC ()
01:46 🔗 lytv has quit IRC (Read error: Connection reset by peer)
01:49 🔗 lytv has joined #archiveteam
01:55 🔗 mistym has quit IRC (Remote host closed the connection)
02:10 🔗 mistym has joined #archiveteam
02:10 🔗 chfoo the ovi store project is currently in progress in #downlovi. tracker: http://tracker.archiveteam.org/ovi-store/
02:38 🔗 rejon has joined #archiveteam
02:39 🔗 abartov has quit IRC (Ping timeout: 258 seconds)
02:49 🔗 lytv has quit IRC (Read error: Connection reset by peer)
02:51 🔗 parsons_ has quit IRC (Ping timeout: 248 seconds)
02:51 🔗 parsons_ has joined #archiveteam
02:52 🔗 lytv has joined #archiveteam
02:59 🔗 primus104 has quit IRC (Leaving.)
03:06 🔗 achip has joined #archiveteam
03:29 🔗 rejon has quit IRC (Read error: Operation timed out)
03:56 🔗 Nertsy` is now known as Nertsy
04:13 🔗 Infreq has joined #archiveteam
04:15 🔗 Infreq WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
04:24 🔗 garyrh_ Infreq, yahoosucks
04:25 🔗 Infreq xd, awesome thanks
04:33 🔗 ruukasu has quit IRC (Remote host closed the connection)
04:35 🔗 ruukasu has joined #archiveteam
04:51 🔗 SketchCow Do good
04:56 🔗 kyan has joined #archiveteam
05:11 🔗 achip has quit IRC (Remote host closed the connection)
05:21 🔗 aaaaaaaaa has quit IRC (Leaving)
05:25 🔗 punx has left
05:33 🔗 mistym has quit IRC (Remote host closed the connection)
05:36 🔗 sep332 has quit IRC (bye)
05:37 🔗 ruukasu has quit IRC (Remote host closed the connection)
05:40 🔗 ruukasu has joined #archiveteam
06:06 🔗 underscor has quit IRC (Ping timeout: 370 seconds)
06:31 🔗 underscor has joined #archiveteam
06:31 🔗 swebb sets mode: +o underscor
06:32 🔗 Start is now known as StartAway
07:32 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
07:32 🔗 dashcloud has joined #archiveteam
08:04 🔗 Selanda_ has joined #archiveteam
08:08 🔗 midas has quit IRC (hub.dk irc.underworld.no)
08:08 🔗 S[h]O[r]T has quit IRC (hub.dk irc.underworld.no)
08:08 🔗 Selanda has quit IRC (hub.dk irc.underworld.no)
08:08 🔗 raylee has quit IRC (hub.dk irc.underworld.no)
08:08 🔗 Atluxity has quit IRC (hub.dk irc.underworld.no)
08:08 🔗 Nemo_bis has quit IRC (hub.dk irc.underworld.no)
08:12 🔗 achip has joined #archiveteam
08:14 🔗 cloudmons has joined #archiveteam
08:20 🔗 achip has quit IRC (Read error: Operation timed out)
08:26 🔗 useretail has quit IRC (hub.se irc.ac.za)
09:24 🔗 LittUp has joined #archiveteam
09:34 🔗 primus104 has joined #archiveteam
09:48 🔗 Nemo_bis has joined #archiveteam
09:48 🔗 midas has joined #archiveteam
09:49 🔗 raylee has joined #archiveteam
10:14 🔗 cloudmons has quit IRC (Remote host closed the connection)
10:22 🔗 cloudmons has joined #archiveteam
10:36 🔗 BlueMaxim has quit IRC (Quit: Leaving)
10:49 🔗 schbirid has joined #archiveteam
10:49 🔗 phuzion has quit IRC (Read error: Operation timed out)
10:52 🔗 phuzion has joined #archiveteam
11:06 🔗 xtr-201 has quit IRC (Ping timeout: 370 seconds)
11:21 🔗 MMovie1 has joined #archiveteam
11:46 🔗 Infreq has quit IRC ()
11:47 🔗 Iggytm has joined #archiveteam
11:47 🔗 Iggytm has quit IRC (Client Quit)
12:36 🔗 Ymgve has joined #archiveteam
12:41 🔗 useretail has joined #archiveteam
13:14 🔗 primus104 has quit IRC (Leaving.)
14:17 🔗 SketchCow OK, who wants this one. (godane?)
14:17 🔗 SketchCow http://www.metmuseum.org/research/metpublications/titles-with-full-text-online?searchtype=F
14:17 🔗 SketchCow 416 PDFs (with keywords to scrape, maybe other data points to scrape) of highest quality
14:17 🔗 Nertsy has quit IRC (Read error: Operation timed out)
14:18 🔗 Nertsy has joined #archiveteam
14:27 🔗 signius has quit IRC (Ping timeout: 512 seconds)
14:34 🔗 sep332 has joined #archiveteam
14:36 🔗 signius has joined #archiveteam
15:01 🔗 sankin has joined #archiveteam
15:07 🔗 ruukasu has quit IRC (Remote host closed the connection)
15:07 🔗 SadDM Oh wow! Lots of good stuff there. Unfortunately I don't have the temporal bandwidth at the moment.
15:08 🔗 SketchCow Whoever goes for it, call it.
15:08 🔗 StartAway has quit IRC (Disconnected.)
15:09 🔗 SadDM Is it in danger?
15:10 🔗 ruukasu has joined #archiveteam
15:16 🔗 wacky_ is now known as wacky
15:32 🔗 mistym has joined #archiveteam
15:33 🔗 mistym has quit IRC (Remote host closed the connection)
15:52 🔗 mistym has joined #archiveteam
15:55 🔗 Ravenloft has joined #archiveteam
15:58 🔗 achip has joined #archiveteam
16:02 🔗 ersi If it exists, it's in danger.
16:02 🔗 ersi DANGERZONE!
16:07 🔗 brook_ is now known as broke
16:10 🔗 dashcloud has quit IRC (Ping timeout: 512 seconds)
16:14 🔗 dashcloud has joined #archiveteam
16:14 🔗 primus104 has joined #archiveteam
16:17 🔗 dashcloud has quit IRC (Read error: Operation timed out)
16:19 🔗 dashcloud has joined #archiveteam
16:25 🔗 primus104 has quit IRC (Leaving.)
16:58 🔗 mistym has quit IRC (Remote host closed the connection)
17:00 🔗 Start has joined #archiveteam
17:17 🔗 primus104 has joined #archiveteam
17:17 🔗 mistym has joined #archiveteam
17:29 🔗 rejon has joined #archiveteam
17:38 🔗 Nertsy has quit IRC (Read error: Operation timed out)
17:42 🔗 Nertsy has joined #archiveteam
17:43 🔗 Ravenloft has quit IRC (Read error: Connection reset by peer)
17:45 🔗 Start has quit IRC (Disconnected.)
17:53 🔗 phuzion has quit IRC (Quit: Adios y'all)
17:53 🔗 aaaaaaaaa has joined #archiveteam
18:19 🔗 chfoo SketchCow: can you hold off doing anything with ovi-store rsync directory on fos? most of the data is http 403 junk.
18:21 🔗 SketchCow OK.
18:21 🔗 SketchCow Inkblazers 100% uploaded.
18:23 🔗 Nemo_bis Archivebot: "http://koti.kapsi.fi/~federico/tmp/SBN-URLs.txt on 01-25; 9,101.8 MB in 82,492 resp. at 0.1/s, 434,076 in q.; 1 con. w/ 1000-50000 ms delay; igoff 6ho7afbue5ag4f7jrvuckfl9u"
18:24 🔗 Nemo_bis This is not right, why does the number of queued URLs keep increasing? I guess it also loads requisite resources, but by doing so it consumes time and makes the throttle stricter
18:26 🔗 xtr-201 has joined #archiveteam
18:29 🔗 Selanda_ has quit IRC (Read error: Operation timed out)
18:32 🔗 Selanda has joined #archiveteam
18:41 🔗 useretail has quit IRC (hub.se irc.ac.za)
18:42 🔗 useretai- has joined #archiveteam
19:24 🔗 Ravenloft has joined #archiveteam
19:53 🔗 mistym has quit IRC (Remote host closed the connection)
20:08 🔗 mistym has joined #archiveteam
20:13 🔗 K4k has joined #archiveteam
20:15 🔗 dashcloud has quit IRC (Read error: Operation timed out)
20:20 🔗 dashcloud has joined #archiveteam
20:32 🔗 Start has joined #archiveteam
20:33 🔗 yipdw Nemo_bis: because the job was set up that way
20:34 🔗 yipdw there's also no "don't fetch page requisites" option because it's not a common thing
20:39 🔗 mistym has quit IRC (Remote host closed the connection)
20:42 🔗 BlueMaxim has joined #archiveteam
20:42 🔗 schbirid someone put https://www.backblaze.com/hard-drive-test-data.html into the datasets collection please
20:55 🔗 mistym has joined #archiveteam
21:02 🔗 Ravenloft has quit IRC (Ping timeout: 512 seconds)
21:16 🔗 Nemo_bis yipdw: which way? what are the new URLs being queued?
21:22 🔗 Start has quit IRC (Quit: Disconnected.)
21:24 🔗 K4k has quit IRC (Read error: Operation timed out)
21:36 🔗 schbirid has quit IRC (Quit: Leaving)
21:59 🔗 sankin has quit IRC (Leaving.)
22:14 🔗 Sanqui <@Sanqui> wget is saving both in the warc and to directories
22:14 🔗 Sanqui <@Sanqui> anybody wanna look over my wget command? http://pastie.org/9887160
22:23 🔗 Peetz0r Is there something wrong with the archive.org servers?
22:24 🔗 Peetz0r it seems that my 'ia upload' which is in progress, is much slower than usual
22:24 🔗 Peetz0r the ETA jumps between 2 hours and 20 hours
22:25 🔗 yipdw Nemo_bis: probably page requisites
22:25 🔗 Peetz0r where about 1,5 hour would be as fast as usua;
22:25 🔗 Peetz0r usual*
22:25 🔗 Peetz0r the identifier of my upload is 2015-02-04.ftp.susx.ac.uk.tar and the source IP would be 82.197.212.29
22:25 🔗 Peetz0r SketchCow: maybe you can look?
22:28 🔗 Peetz0r the ETA is jumping back and forth again
22:29 🔗 Peetz0r I have even seen it say less than 1 hour, and more than 24
22:29 🔗 Peetz0r both are not okay
22:34 🔗 Peetz0r it is definitely not my connection, I measured 868/684 while a few downloads and at least one upload were running just now: http://www.speedtest.net/result/4117114775.png
22:35 🔗 Sanqui is it possible to run wget in parallel? if I start another warcing instance, will they fight each other?
22:37 🔗 Peetz0r In general, yes, it is possible to run wget in parallel. But I don't know about your context
22:38 🔗 Peetz0r I just download ftp sites (see #effteepee) and since most of those are quite slow, I download a few of them simultaneously
22:38 🔗 Sanqui downloading a list of sites
22:38 🔗 Sanqui of internet centrum
22:38 🔗 Sanqui but it's going like 25kBps
22:38 🔗 Peetz0r Are you using scripts that do stuff for you?
22:39 🔗 Sanqui <Sanqui> <@Sanqui> anybody wanna look over my wget command? http://pastie.org/9887160
22:39 🔗 Peetz0r if so, then the answer would depend on how those scripts are designed
22:39 🔗 Peetz0r ah :)
22:39 🔗 Peetz0r I didn't know wget had warc support :)
22:40 🔗 Sanqui http://archiveteam.org/index.php?title=Wget
22:40 🔗 Sanqui do you suggest something different?
22:40 🔗 Sanqui /better?
22:40 🔗 Peetz0r if I were you, I'd just split that list in a few smaller parts and later figure out how to merge your different warc files
22:40 🔗 Sanqui hm, guess that's the way then
22:40 🔗 Peetz0r https://github.com/maturban/WARCMerge
22:40 🔗 Sanqui though it'll be annoying figuring out what's done and what isn't then
22:41 🔗 Sanqui would be nicer to just start a wget instance for each site, while launching like 5 at once at most
22:41 🔗 Sanqui also, wget should have a -e robots=abuse option, where it would parse robots.txt and scrape the disallowed URLs :)
22:43 🔗 Peetz0r heh, evil :p
22:44 🔗 Sanqui hey, gotta save that shit :P
22:44 🔗 midas Sanqui: you can throw them into the archivebot
22:45 🔗 theChip it needs the spam filter cookie though
22:45 🔗 Sanqui 81k+ sites
22:45 🔗 Sanqui and that
22:45 🔗 midas hm that kinda sucks yeah
22:45 🔗 midas dont think yipdw got around to create a cookiejar yet
22:45 🔗 theChip a warrior script would work as was pointed out with the wget --header="Cookie: iccmtspmvrfy=ano"
22:46 🔗 midas Peetz0r: your slow upload is normal, s3 can get overrun with uploads and slow down
22:46 🔗 Sanqui yeah idk if I can set up a warrior project myself though :(
22:46 🔗 theChip but that depends on me learning how to write a pipeline
22:46 🔗 theChip I can do it I want to learn anyways
22:46 🔗 midas check the programming part
22:46 🔗 midas http://archiveteam.org/index.php?title=Dev
22:47 🔗 Peetz0r midas: ah, okay
22:47 🔗 Peetz0r will just be patient then
22:47 🔗 midas yep
22:47 🔗 Sanqui I've read it, but it's still somewhat complicated
22:47 🔗 Sanqui looks like something that should be done by people already familiar :/
22:47 🔗 theChip I've got my dev environment running but can get wget-lua on my macbook, I should be able to figure out one if I could talk to a tracker *shifty eyes*
22:47 🔗 Peetz0r also, I have the issue that my downloading machine disk becomes slow and iowait goes through the roof
22:47 🔗 Peetz0r even with ionice this is an issue
22:48 🔗 Peetz0r http://stream.haas-en-berg.nl:81/munin/Home/flappie/diskstats_utilization/sda-day.png and http://stream.haas-en-berg.nl:81/munin/Home/flappie/cpu-day.png
22:48 🔗 ersi Peetz0r: You in Europe?
22:48 🔗 Peetz0r yes
22:48 🔗 midas also normalish, lots of small files on ftp boxes
22:49 🔗 Peetz0r midas: what I upload is one huuuge tar file
22:49 🔗 midas and laptop harddrive
22:49 🔗 Sanqui hmm
22:49 🔗 ersi Peetz0r: OK, that's probably why your speed goes up and down. I've had transit issues to IA as well
22:49 🔗 Peetz0r it's a WD black :D
22:49 🔗 Sanqui I think I'm going to write a shell script to do parallel site-by-site wget
22:49 🔗 midas oh Peetz0r, nah that has to do with the ia upload command
22:49 🔗 midas it calculates a hash if im not mistaken
22:50 🔗 Peetz0r does ia upload split stuff in many small files?
22:50 🔗 Peetz0r does ia upload also kill my disk?
22:50 🔗 midas yep
22:50 🔗 Peetz0r because I use ionice only on the tarring so far
22:50 🔗 midas first, nope, second yes
22:50 🔗 Peetz0r okay, will add ionice to ia upload as well
22:50 🔗 midas ia doesnt split it
22:51 🔗 midas it will put some hurt on your disks and cpu
22:51 🔗 Peetz0r my cpu handles it just fine
22:51 🔗 Peetz0r the red and green parts of the graph are actual cpu usage
22:51 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:52 🔗 Peetz0r the purple (huuge) part is iowait
22:52 🔗 midas didnt check your graphs yet
22:53 🔗 theChip is there a dev tracker stood up anywhere? I've tried two boxes and the redis always gets screwed up when I follow Dev/Tracker wiki page.
22:54 🔗 Peetz0r the CPU is an i3 370M (2.4G dualcore) which seems to be overkill for this
22:54 🔗 aaaaaaaaa there is a pre-built ova for the tracker
22:54 🔗 Peetz0r overkill is good because the same machine does more than just effteepee'ing
22:56 🔗 yipdw if you just need a cookie that can be set in a specific pipeline
22:56 🔗 yipdw for archivebot
22:56 🔗 yipdw I'm assuming this isn't some gigantic 20 million URL job
22:57 🔗 theChip ya I have that in my VMWare box and I just realized I could just stand up an ubuntu in the same subnet and dev there, thanks aaaaaaaaa
22:57 🔗 theChip well its 81k *seeds*
22:57 🔗 yipdw what's a seed
22:57 🔗 theChip n00b speak for individual domains i guess
22:58 🔗 theChip so if we get one shitty little forum in there and I can see the URL count shooting up
22:58 🔗 yipdw 81,000 domains?
22:58 🔗 yipdw those are seriously all related?
22:58 🔗 Peetz0r I need a cookie because hungry
22:58 🔗 Peetz0r :p
22:59 🔗 midas Peetz0r: #archiveteam-bs for bs please
22:59 🔗 theChip this is just what was discovered in the ic.cz directory from the wayback machine https://raw.githubusercontent.com/chpwssn/ic.czstuff/master/waybackcatalogresults.txt
22:59 🔗 theChip sorry this is probably -bs by now
22:59 🔗 Peetz0r sorry midas
22:59 🔗 Sanqui #internetcentury also :)
22:59 🔗 yipdw oh ic.cz
23:00 🔗 Sanqui ok I don't trust wget any more lool
23:02 🔗 dashcloud has joined #archiveteam
23:07 🔗 phuzion has joined #archiveteam
23:07 🔗 phuzion has quit IRC (Remote host closed the connection)
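
Editor's note on the parallel wget idea from 22:41 and 22:49 (one wget instance per site, at most 5 at once, without losing track of what is done): a minimal sketch of how that could look. This is not Sanqui's actual script and not the command from the pastie link. The cookie header is the one theChip quoted at 22:45. The WARC naming scheme, the done-file format and all function names are assumptions made up for illustration. The use of --delete-after is also an assumption, meant to address the 22:14 remark that wget was saving both to the WARC and to directories.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Cookie header quoted in the log (theChip, 22:45).
COOKIE = "Cookie: iccmtspmvrfy=ano"
# "launching like 5 at once at most" (Sanqui, 22:41).
MAX_PARALLEL = 5


def warc_name(site):
    """Hypothetical naming scheme: one WARC per site, named after its URL."""
    return site.replace("://", "_").replace("/", "_").strip("_")


def build_command(site):
    """Build the wget argument list for a single site."""
    return [
        "wget",
        "--recursive",
        "--page-requisites",
        "--delete-after",  # assumption: keep only the WARC, not the mirrored files
        "--header=" + COOKIE,
        "--warc-file=" + warc_name(site),
        site,
    ]


def fetch(site, run=subprocess.run):
    """Run one wget process and return (site, exit code)."""
    result = run(build_command(site))
    return site, result.returncode


def crawl(sites, done_file, run=subprocess.run):
    """Fetch every site not yet listed in done_file, MAX_PARALLEL at a time.

    Each finished site is appended to done_file as "site<TAB>exit code", so a
    rerun skips it. A nonzero code is not always a failure: wget exits with 8
    when any server error response was seen during the crawl.
    """
    done_path = Path(done_file)
    done = set()
    if done_path.exists():
        for line in done_path.read_text().splitlines():
            if line:
                done.add(line.split("\t")[0])
    todo = [s for s in sites if s not in done]
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool, done_path.open("a") as log:
        for site, code in pool.map(lambda s: fetch(s, run), todo):
            log.write(f"{site}\t{code}\n")
            log.flush()
    return todo


# Example call (file names are hypothetical):
# crawl(Path("sites.txt").read_text().split(), "done.txt")
```

Each wget process writes its own WARC, so the instances do not share an output file, and the done file answers the "what's done and what isn't" question after an interruption. The per-site WARCs would still have to be merged or uploaded separately afterwards.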