[00:04] *** Ymgve has joined #archiveteam
[00:24] *** achip has joined #archiveteam
[00:29] *** marvinw has quit IRC (Max SendQ exceeded)
[00:30] *** lytv has quit IRC (Max SendQ exceeded)
[00:33] *** lytv has joined #archiveteam
[00:34] *** marvinw has joined #archiveteam
[00:44] *** Mayonaise has joined #archiveteam
[01:08] *** xk_id_ has quit IRC (Remote host closed the connection)
[01:25] *** xk_id has joined #archiveteam
[01:28] *** Ymgve has quit IRC ()
[01:39] *** achip has quit IRC ()
[01:46] *** lytv has quit IRC (Read error: Connection reset by peer)
[01:49] *** lytv has joined #archiveteam
[01:55] *** mistym has quit IRC (Remote host closed the connection)
[02:10] *** mistym has joined #archiveteam
[02:10] the ovi store project is currently in progress in #downlovi. tracker: http://tracker.archiveteam.org/ovi-store/
[02:38] *** rejon has joined #archiveteam
[02:39] *** abartov has quit IRC (Ping timeout: 258 seconds)
[02:49] *** lytv has quit IRC (Read error: Connection reset by peer)
[02:51] *** parsons_ has quit IRC (Ping timeout: 248 seconds)
[02:51] *** parsons_ has joined #archiveteam
[02:52] *** lytv has joined #archiveteam
[02:59] *** primus104 has quit IRC (Leaving.)
[03:06] *** achip has joined #archiveteam
[03:29] *** rejon has quit IRC (Read error: Operation timed out)
[03:56] *** Nertsy` is now known as Nertsy
[04:13] *** Infreq has joined #archiveteam
[04:15] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[04:24] Infreq, yahoosucks
[04:25] xd, awesome thanks
[04:33] *** ruukasu has quit IRC (Remote host closed the connection)
[04:35] *** ruukasu has joined #archiveteam
[04:51] Do good
[04:56] *** kyan has joined #archiveteam
[05:11] *** achip has quit IRC (Remote host closed the connection)
[05:21] *** aaaaaaaaa has quit IRC (Leaving)
[05:25] *** punx has left
[05:33] *** mistym has quit IRC (Remote host closed the connection)
[05:36] *** sep332 has quit IRC (bye)
[05:37] *** ruukasu has quit IRC (Remote host closed the connection)
[05:40] *** ruukasu has joined #archiveteam
[06:06] *** underscor has quit IRC (Ping timeout: 370 seconds)
[06:31] *** underscor has joined #archiveteam
[06:31] *** swebb sets mode: +o underscor
[06:32] *** Start is now known as StartAway
[07:32] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[07:32] *** dashcloud has joined #archiveteam
[08:04] *** Selanda_ has joined #archiveteam
[08:08] *** midas has quit IRC (hub.dk irc.underworld.no)
[08:08] *** S[h]O[r]T has quit IRC (hub.dk irc.underworld.no)
[08:08] *** Selanda has quit IRC (hub.dk irc.underworld.no)
[08:08] *** raylee has quit IRC (hub.dk irc.underworld.no)
[08:08] *** Atluxity has quit IRC (hub.dk irc.underworld.no)
[08:08] *** Nemo_bis has quit IRC (hub.dk irc.underworld.no)
[08:12] *** achip has joined #archiveteam
[08:14] *** cloudmons has joined #archiveteam
[08:20] *** achip has quit IRC (Read error: Operation timed out)
[08:26] *** useretail has quit IRC (hub.se irc.ac.za)
[09:24] *** LittUp has joined #archiveteam
[09:34] *** primus104 has joined #archiveteam
[09:48] *** Nemo_bis has joined #archiveteam
[09:48] *** midas has joined #archiveteam
[09:49] *** raylee has joined #archiveteam
[10:14] *** cloudmons has quit IRC (Remote host closed the connection)
[10:22] *** cloudmons has joined #archiveteam
[10:36] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:49] *** schbirid has joined #archiveteam
[10:49] *** phuzion has quit IRC (Read error: Operation timed out)
[10:52] *** phuzion has joined #archiveteam
[11:06] *** xtr-201 has quit IRC (Ping timeout: 370 seconds)
[11:21] *** MMovie1 has joined #archiveteam
[11:46] *** Infreq has quit IRC ()
[11:47] *** Iggytm has joined #archiveteam
[11:47] *** Iggytm has quit IRC (Client Quit)
[12:36] *** Ymgve has joined #archiveteam
[12:41] *** useretail has joined #archiveteam
[13:14] *** primus104 has quit IRC (Leaving.)
[14:17] OK, who wants this one. (godane?)
[14:17] http://www.metmuseum.org/research/metpublications/titles-with-full-text-online?searchtype=F
[14:17] 416 PDFs (with keywords to scrape, maybe other data points to scrape) of highest quality
[14:17] *** Nertsy has quit IRC (Read error: Operation timed out)
[14:18] *** Nertsy has joined #archiveteam
[14:27] *** signius has quit IRC (Ping timeout: 512 seconds)
[14:34] *** sep332 has joined #archiveteam
[14:36] *** signius has joined #archiveteam
[15:01] *** sankin has joined #archiveteam
[15:07] *** ruukasu has quit IRC (Remote host closed the connection)
[15:07] Oh wow! Lots of good stuff there. Unfortunately I don't have the temporal bandwidth at the moment.
[15:08] Whoever goes for it, call it.
[15:08] *** StartAway has quit IRC (Disconnected.)
[15:09] Is it in danger?
[15:10] *** ruukasu has joined #archiveteam
[15:16] *** wacky_ is now known as wacky
[15:32] *** mistym has joined #archiveteam
[15:33] *** mistym has quit IRC (Remote host closed the connection)
[15:52] *** mistym has joined #archiveteam
[15:55] *** Ravenloft has joined #archiveteam
[15:58] *** achip has joined #archiveteam
[16:02] If it exists, it's in danger.
[16:02] DANGERZONE!
[16:07] *** brook_ is now known as broke
[16:10] *** dashcloud has quit IRC (Ping timeout: 512 seconds)
[16:14] *** dashcloud has joined #archiveteam
[16:14] *** primus104 has joined #archiveteam
[16:17] *** dashcloud has quit IRC (Read error: Operation timed out)
[16:19] *** dashcloud has joined #archiveteam
[16:25] *** primus104 has quit IRC (Leaving.)
[16:58] *** mistym has quit IRC (Remote host closed the connection)
[17:00] *** Start has joined #archiveteam
[17:17] *** primus104 has joined #archiveteam
[17:17] *** mistym has joined #archiveteam
[17:29] *** rejon has joined #archiveteam
[17:38] *** Nertsy has quit IRC (Read error: Operation timed out)
[17:42] *** Nertsy has joined #archiveteam
[17:43] *** Ravenloft has quit IRC (Read error: Connection reset by peer)
[17:45] *** Start has quit IRC (Disconnected.)
[17:53] *** phuzion has quit IRC (Quit: Adios y'all)
[17:53] *** aaaaaaaaa has joined #archiveteam
[18:19] SketchCow: can you hold off doing anything with ovi-store rsync directory on fos? most of the data is http 403 junk.
[18:21] OK.
[18:21] Inkblazers 100% uploaded.
[18:23] Archivebot: "http://koti.kapsi.fi/~federico/tmp/SBN-URLs.txt on 01-25; 9,101.8 MB in 82,492 resp. at 0.1/s, 434,076 in q.; 1 con. w/ 1000-50000 ms delay; igoff 6ho7afbue5ag4f7jrvuckfl9u"
[18:24] This is not right, why does the number of queued URLs keep increasing? I guess it also loads requisite resources, but by doing so it consumes time and makes the throttle stricter
[18:26] *** xtr-201 has joined #archiveteam
[18:29] *** Selanda_ has quit IRC (Read error: Operation timed out)
[18:32] *** Selanda has joined #archiveteam
[18:41] *** useretail has quit IRC (hub.se irc.ac.za)
[18:42] *** useretai- has joined #archiveteam
[19:24] *** Ravenloft has joined #archiveteam
[19:53] *** mistym has quit IRC (Remote host closed the connection)
[20:08] *** mistym has joined #archiveteam
[20:13] *** K4k has joined #archiveteam
[20:15] *** dashcloud has quit IRC (Read error: Operation timed out)
[20:20] *** dashcloud has joined #archiveteam
[20:32] *** Start has joined #archiveteam
[20:33] Nemo_bis: because the job was set up that way
[20:34] there's also no "don't fetch page requisites" option because it's not a common thing
[20:39] *** mistym has quit IRC (Remote host closed the connection)
[20:42] *** BlueMaxim has joined #archiveteam
[20:42] someone put https://www.backblaze.com/hard-drive-test-data.html into the datasets collection please
[20:55] *** mistym has joined #archiveteam
[21:02] *** Ravenloft has quit IRC (Ping timeout: 512 seconds)
[21:16] yipdw: which way? what are the new URLs being queued?
[21:22] *** Start has quit IRC (Quit: Disconnected.)
[21:24] *** K4k has quit IRC (Read error: Operation timed out)
[21:36] *** schbirid has quit IRC (Quit: Leaving)
[21:59] *** sankin has quit IRC (Leaving.)
[22:14] <@Sanqui> wget is saving both in the warc and to directories
[22:14] <@Sanqui> anybody wanna look over my wget command? http://pastie.org/9887160
[22:23] Is there something wrong with the archive.org servers?
[22:24] it seems that my 'ia upload', which is in progress, is much slower than usual
[22:24] the ETA jumps between 2 hours and 20 hours
[22:25] Nemo_bis: probably page requisites
[22:25] where about 1,5 hour would be as fast as usua;
[22:25] usual*
[22:25] the identifier of my upload is 2015-02-04.ftp.susx.ac.uk.tar and the source IP would be 82.197.212.29
[22:25] SketchCow: maybe you can look?
[22:28] the ETA is jumping back and forth again
[22:29] I have even seen it say less than 1 hour, and more than 24
[22:29] both are not okay
[22:34] it is definitely not my connection, I measured 868/684 while a few downloads and at least one upload was running just now: http://www.speedtest.net/result/4117114775.png
[22:35] is it possible to run wget in parallel? if I start another warcing instance, will they fight each other?
[22:37] In general, yes, it is possible to run wget in parallel. But I don't know about your context
[22:38] I just download ftp sites (see #effteepee) and since most of those are quite slow, I download a few of them simultaneously
[22:38] downloading a list of sites
[22:38] of internet centrum
[22:38] but it's going like 25kBps
[22:38] Are you using scripts that do stuff for you?
[22:39] <@Sanqui> anybody wanna look over my wget command? http://pastie.org/9887160
[22:39] if so, then the answer would depend on how those scripts are designed
[22:39] ah :)
[22:39] I didn't know wget had warc support :)
[22:40] http://archiveteam.org/index.php?title=Wget
[22:40] do you suggest something different?
[22:40] /better?
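A minimal sketch of the kind of invocation being discussed here, writing the crawl into a WARC without also keeping a mirrored directory tree on disk (the list filename and WARC name are made up; Sanqui's actual command lived in the pastie and isn't reproduced here):

    # Record everything into a WARC and delete the extracted files afterwards,
    # so only the .warc.gz (plus a CDX index) is left on disk.
    wget --input-file=site-list.txt \
         --recursive --level=inf --page-requisites --no-parent \
         --warc-file=site-crawl --warc-cdx \
         --delete-after --no-directories \
         --wait=1 --random-wait

With --delete-after, wget still fetches every page and records it into the WARC, but removes the local copy, which avoids the "both in the warc and to directories" duplication.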
[22:40] if I were you, I'd just split that list in a few smaller parts and later figure out how to merge your different warc files
[22:40] hm, guess that's the way then
[22:40] https://github.com/maturban/WARCMerge
[22:40] though it'll be annoying figuring out what's done and what isn't then
[22:41] would be nicer to just start a wget instance for each site, while launching like 5 at once at most
[22:41] also, wget should have a -e robots=abuse option, where it would parse robots.txt and scrape the disallowed URLs :)
[22:43] heh, evil :p
[22:44] hey, gotta save that shit :P
[22:44] Sanqui: you can throw them into the archivebot
[22:45] it needs the spam filter cookie though
[22:45] 81k+ sites
[22:45] and that
[22:45] hm that kinda sucks yeah
[22:45] don't think yipdw got around to creating a cookiejar yet
[22:45] a warrior script would work as was pointed out with the wget --header="Cookie: iccmtspmvrfy=ano"
[22:46] Peetz0r: your slow upload is normal, s3 can get overrun with uploads and slow down
[22:46] yeah idk if I can set up a warrior project myself though :(
[22:46] but that depends on me learning how to write a pipeline
[22:46] I can do it, I want to learn anyways
[22:46] check the programming part
[22:46] http://archiveteam.org/index.php?title=Dev
[22:47] midas: ah, okay
[22:47] will just be patient then
[22:47] yep
[22:47] I've read it, but it's still somewhat complicated
[22:47] looks like something that should be done by people already familiar :/
[22:47] I've got my dev environment running but can get wget-lua on my macbook, I should be able to figure out one if I could talk to a tracker *shifty eyes*
[22:47] also, I have the issue that my downloading machine disk becomes slow and iowait goes through the roof
[22:47] even with ionice this is an issue
[22:48] http://stream.haas-en-berg.nl:81/munin/Home/flappie/diskstats_utilization/sda-day.png and http://stream.haas-en-berg.nl:81/munin/Home/flappie/cpu-day.png
[22:48] Peetz0r: You in Europe?
[22:48] yes
[22:48] also normalish, lots of small files on ftp boxes
[22:49] midas: what I upload is one huuuge tar file
[22:49] and laptop harddrive
[22:49] hmm
[22:49] Peetz0r: OK, that's probably why your speed goes up and down. I've had transit issues to IA as well
[22:49] it's a WD black :D
[22:49] I think I'm going to write a shell script to do parallel site-by-site wget
[22:49] oh Peetz0r, nah that has to do with the ia upload command
[22:49] it calculates a hash if I'm not mistaken
[22:50] does ia upload split stuff into many small files?
[22:50] does ia upload also kill my disk?
[22:50] yep
[22:50] because I use ionice only on the tarring so far
[22:50] first, nope, second yes
[22:50] okay, will add ionice to ia upload as well
[22:50] ia doesn't split it
[22:51] it will put some hurt on your disks and cpu
[22:51] my cpu handles it just fine
[22:51] the red and green parts of the graph are actual cpu usage
[22:51] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:52] the purple (huuge) part is iowait
[22:52] didn't check your graphs yet
[22:53] is there a dev tracker stood up anywhere? I've tried two boxes and the redis always gets screwed up when I follow the Dev/Tracker wiki page.
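One possible shape for the "shell script to do parallel site-by-site wget" mentioned above: one wget per domain, a limited number running at once, each writing its own WARC so it stays obvious which sites are done (sites.txt, the WARC naming, and the five-job limit are assumptions; the cookie header is the spam-filter one quoted at 22:45):

    # Launch one wget per hostname listed in sites.txt, at most 5 in parallel.
    # ionice -c3 (idle class) keeps the crawls from hammering the disk.
    xargs -P 5 -I{} \
        ionice -c3 wget --recursive --page-requisites --no-parent \
            --no-directories --delete-after \
            --warc-file="warc-{}" --wait=1 \
            --header="Cookie: iccmtspmvrfy=ano" \
            "http://{}/" \
        < sites.txt

If the per-site WARCs later need combining, gzip-compressed WARCs can generally just be concatenated (cat warc-*.warc.gz > combined.warc.gz) or merged with a tool like the WARCMerge linked above.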
[22:54] the CPU is an i3 370M (2.4G dualcore) which seems to be overkill for this
[22:54] there is a pre-built ova for the tracker
[22:54] overkill is good because the same machine does more than just effteepee'ing
[22:56] if you just need a cookie that can be set in a specific pipeline
[22:56] for archivebot
[22:56] I'm assuming this isn't some gigantic 20 million URL job
[22:57] ya I have that in my VMWare box and I just realized I could just stand up an ubuntu in the same subnet and dev there, thanks aaaaaaaaa
[22:57] well it's 81k *seeds*
[22:57] what's a seed
[22:57] n00b speak for individual domains i guess
[22:58] so if we get one shitty little forum in there and I can see the URL count shooting up
[22:58] 81,000 domains?
[22:58] those are seriously all related?
[22:58] I need a cookie because hungry
[22:58] :p
[22:59] Peetz0r: #archiveteam-bs for bs please
[22:59] this is just what was discovered in the ic.cz directory from the wayback machine https://raw.githubusercontent.com/chpwssn/ic.czstuff/master/waybackcatalogresults.txt
[22:59] sorry this is probably -bs by now
[22:59] sorry midas
[22:59] #internetcentury also :)
[23:00] oh ic.cz
[23:00] ok I don't trust wget any more lool
[23:02] *** dashcloud has joined #archiveteam
[23:07] *** phuzion has joined #archiveteam
[23:07] *** phuzion has quit IRC (Remote host closed the connection)