Time |
Nickname |
Message |
00:04
🔗
|
|
Ymgve has joined #archiveteam |
00:24
🔗
|
|
achip has joined #archiveteam |
00:29
🔗
|
|
marvinw has quit IRC (Max SendQ exceeded) |
00:30
🔗
|
|
lytv has quit IRC (Max SendQ exceeded) |
00:33
🔗
|
|
lytv has joined #archiveteam |
00:34
🔗
|
|
marvinw has joined #archiveteam |
00:44
🔗
|
|
Mayonaise has joined #archiveteam |
01:08
🔗
|
|
xk_id_ has quit IRC (Remote host closed the connection) |
01:25
🔗
|
|
xk_id has joined #archiveteam |
01:28
🔗
|
|
Ymgve has quit IRC () |
01:39
🔗
|
|
achip has quit IRC () |
01:46
🔗
|
|
lytv has quit IRC (Read error: Connection reset by peer) |
01:49
🔗
|
|
lytv has joined #archiveteam |
01:55
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
02:10
🔗
|
|
mistym has joined #archiveteam |
02:10
🔗
|
chfoo |
the ovi store project is currently in progress in #downlovi. tracker: http://tracker.archiveteam.org/ovi-store/ |
02:38
🔗
|
|
rejon has joined #archiveteam |
02:39
🔗
|
|
abartov has quit IRC (Ping timeout: 258 seconds) |
02:49
🔗
|
|
lytv has quit IRC (Read error: Connection reset by peer) |
02:51
🔗
|
|
parsons_ has quit IRC (Ping timeout: 248 seconds) |
02:51
🔗
|
|
parsons_ has joined #archiveteam |
02:52
🔗
|
|
lytv has joined #archiveteam |
02:59
🔗
|
|
primus104 has quit IRC (Leaving.) |
03:06
🔗
|
|
achip has joined #archiveteam |
03:29
🔗
|
|
rejon has quit IRC (Read error: Operation timed out) |
03:56
🔗
|
|
Nertsy` is now known as Nertsy |
04:13
🔗
|
|
Infreq has joined #archiveteam |
04:15
🔗
|
Infreq |
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD |
04:24
🔗
|
garyrh_ |
Infreq, yahoosucks |
04:25
🔗
|
Infreq |
xd, awesome thanks |
04:33
🔗
|
|
ruukasu has quit IRC (Remote host closed the connection) |
04:35
🔗
|
|
ruukasu has joined #archiveteam |
04:51
🔗
|
SketchCow |
Do good |
04:56
🔗
|
|
kyan has joined #archiveteam |
05:11
🔗
|
|
achip has quit IRC (Remote host closed the connection) |
05:21
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
05:25
🔗
|
|
punx has left |
05:33
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
05:36
🔗
|
|
sep332 has quit IRC (bye) |
05:37
🔗
|
|
ruukasu has quit IRC (Remote host closed the connection) |
05:40
🔗
|
|
ruukasu has joined #archiveteam |
06:06
🔗
|
|
underscor has quit IRC (Ping timeout: 370 seconds) |
06:31
🔗
|
|
underscor has joined #archiveteam |
06:31
🔗
|
|
swebb sets mode: +o underscor |
06:32
🔗
|
|
Start is now known as StartAway |
07:32
🔗
|
|
dashcloud has quit IRC (Read error: Connection reset by peer) |
07:32
🔗
|
|
dashcloud has joined #archiveteam |
08:04
🔗
|
|
Selanda_ has joined #archiveteam |
08:08
🔗
|
|
midas has quit IRC (hub.dk irc.underworld.no) |
08:08
🔗
|
|
S[h]O[r]T has quit IRC (hub.dk irc.underworld.no) |
08:08
🔗
|
|
Selanda has quit IRC (hub.dk irc.underworld.no) |
08:08
🔗
|
|
raylee has quit IRC (hub.dk irc.underworld.no) |
08:08
🔗
|
|
Atluxity has quit IRC (hub.dk irc.underworld.no) |
08:08
🔗
|
|
Nemo_bis has quit IRC (hub.dk irc.underworld.no) |
08:12
🔗
|
|
achip has joined #archiveteam |
08:14
🔗
|
|
cloudmons has joined #archiveteam |
08:20
🔗
|
|
achip has quit IRC (Read error: Operation timed out) |
08:26
🔗
|
|
useretail has quit IRC (hub.se irc.ac.za) |
09:24
🔗
|
|
LittUp has joined #archiveteam |
09:34
🔗
|
|
primus104 has joined #archiveteam |
09:48
🔗
|
|
Nemo_bis has joined #archiveteam |
09:48
🔗
|
|
midas has joined #archiveteam |
09:49
🔗
|
|
raylee has joined #archiveteam |
10:14
🔗
|
|
cloudmons has quit IRC (Remote host closed the connection) |
10:22
🔗
|
|
cloudmons has joined #archiveteam |
10:36
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
10:49
🔗
|
|
schbirid has joined #archiveteam |
10:49
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
10:52
🔗
|
|
phuzion has joined #archiveteam |
11:06
🔗
|
|
xtr-201 has quit IRC (Ping timeout: 370 seconds) |
11:21
🔗
|
|
MMovie1 has joined #archiveteam |
11:46
🔗
|
|
Infreq has quit IRC () |
11:47
🔗
|
|
Iggytm has joined #archiveteam |
11:47
🔗
|
|
Iggytm has quit IRC (Client Quit) |
12:36
🔗
|
|
Ymgve has joined #archiveteam |
12:41
🔗
|
|
useretail has joined #archiveteam |
13:14
🔗
|
|
primus104 has quit IRC (Leaving.) |
14:17
🔗
|
SketchCow |
OK, who wants this one. (godane?) |
14:17
🔗
|
SketchCow |
http://www.metmuseum.org/research/metpublications/titles-with-full-text-online?searchtype=F |
14:17
🔗
|
SketchCow |
416 PDFs (with keywords to scrape, maybe other data points to scrape) of highest quality |
14:17
🔗
|
|
Nertsy has quit IRC (Read error: Operation timed out) |
14:18
🔗
|
|
Nertsy has joined #archiveteam |
14:27
🔗
|
|
signius has quit IRC (Ping timeout: 512 seconds) |
14:34
🔗
|
|
sep332 has joined #archiveteam |
14:36
🔗
|
|
signius has joined #archiveteam |
15:01
🔗
|
|
sankin has joined #archiveteam |
15:07
🔗
|
|
ruukasu has quit IRC (Remote host closed the connection) |
15:07
🔗
|
SadDM |
Oh wow! Lots of good stuff there. Unfortunately I don't have the temporal bandwidth at the moment. |
15:08
🔗
|
SketchCow |
Whoever goes for it, call it. |
15:08
🔗
|
|
StartAway has quit IRC (Disconnected.) |
15:09
🔗
|
SadDM |
Is it in danger? |
15:10
🔗
|
|
ruukasu has joined #archiveteam |
15:16
🔗
|
|
wacky_ is now known as wacky |
15:32
🔗
|
|
mistym has joined #archiveteam |
15:33
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
15:52
🔗
|
|
mistym has joined #archiveteam |
15:55
🔗
|
|
Ravenloft has joined #archiveteam |
15:58
🔗
|
|
achip has joined #archiveteam |
16:02
🔗
|
ersi |
If it exists, it's in danger. |
16:02
🔗
|
ersi |
DANGERZONE! |
16:07
🔗
|
|
brook_ is now known as broke |
16:10
🔗
|
|
dashcloud has quit IRC (Ping timeout: 512 seconds) |
16:14
🔗
|
|
dashcloud has joined #archiveteam |
16:14
🔗
|
|
primus104 has joined #archiveteam |
16:17
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
16:19
🔗
|
|
dashcloud has joined #archiveteam |
16:25
🔗
|
|
primus104 has quit IRC (Leaving.) |
16:58
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
17:00
🔗
|
|
Start has joined #archiveteam |
17:17
🔗
|
|
primus104 has joined #archiveteam |
17:17
🔗
|
|
mistym has joined #archiveteam |
17:29
🔗
|
|
rejon has joined #archiveteam |
17:38
🔗
|
|
Nertsy has quit IRC (Read error: Operation timed out) |
17:42
🔗
|
|
Nertsy has joined #archiveteam |
17:43
🔗
|
|
Ravenloft has quit IRC (Read error: Connection reset by peer) |
17:45
🔗
|
|
Start has quit IRC (Disconnected.) |
17:53
🔗
|
|
phuzion has quit IRC (Quit: Adios y'all) |
17:53
🔗
|
|
aaaaaaaaa has joined #archiveteam |
18:19
🔗
|
chfoo |
SketchCow: can you hold off doing anything with ovi-store rsync directory on fos? most of the data is http 403 junk. |
18:21
🔗
|
SketchCow |
OK. |
18:21
🔗
|
SketchCow |
Inkblazers 100% uploaded. |
18:23
🔗
|
Nemo_bis |
Archivebot: "http://koti.kapsi.fi/~federico/tmp/SBN-URLs.txt on 01-25; 9,101.8 MB in 82,492 resp. at 0.1/s, 434,076 in q.; 1 con. w/ 1000-50000 ms delay; igoff 6ho7afbue5ag4f7jrvuckfl9u" |
18:24
🔗
|
Nemo_bis |
This is not right, why does the number of queued URLs keep increasing? I guess it also loads requisite resources, but by doing so it consumes time and makes the throttle stricter |
18:26
🔗
|
|
xtr-201 has joined #archiveteam |
18:29
🔗
|
|
Selanda_ has quit IRC (Read error: Operation timed out) |
18:32
🔗
|
|
Selanda has joined #archiveteam |
18:41
🔗
|
|
useretail has quit IRC (hub.se irc.ac.za) |
18:42
🔗
|
|
useretai- has joined #archiveteam |
19:24
🔗
|
|
Ravenloft has joined #archiveteam |
19:53
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
20:08
🔗
|
|
mistym has joined #archiveteam |
20:13
🔗
|
|
K4k has joined #archiveteam |
20:15
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
20:20
🔗
|
|
dashcloud has joined #archiveteam |
20:32
🔗
|
|
Start has joined #archiveteam |
20:33
🔗
|
yipdw |
Nemo_bis: because the job was set up that way |
20:34
🔗
|
yipdw |
there's also no "don't fetch page requisites" option because it's not a common thing |
20:39
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
20:42
🔗
|
|
BlueMaxim has joined #archiveteam |
20:42
🔗
|
schbirid |
someone put https://www.backblaze.com/hard-drive-test-data.html into the datasets collection please |
20:55
🔗
|
|
mistym has joined #archiveteam |
21:02
🔗
|
|
Ravenloft has quit IRC (Ping timeout: 512 seconds) |
21:16
🔗
|
Nemo_bis |
yipdw: which way? what are the new URLs being queued? |
21:22
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
21:24
🔗
|
|
K4k has quit IRC (Read error: Operation timed out) |
21:36
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:59
🔗
|
|
sankin has quit IRC (Leaving.) |
22:14
🔗
|
Sanqui |
<@Sanqui> wget is saving both in the warc and to directories |
22:14
🔗
|
Sanqui |
<@Sanqui> anybody wanna look over my wget command? http://pastie.org/9887160 |
22:23
🔗
|
Peetz0r |
Is there something wrong with the archive.org servers? |
22:24
🔗
|
Peetz0r |
it seems that my 'ia upload' which is in progress, is much slower than usual |
22:24
🔗
|
Peetz0r |
the ETA jumps between 2 hours and 20 hours |
22:25
🔗
|
yipdw |
Nemo_bis: probably page requisites |
22:25
🔗
|
Peetz0r |
where about 1,5 hour would be as fast as usua; |
22:25
🔗
|
Peetz0r |
usual* |
22:25
🔗
|
Peetz0r |
the identifier of my upload is 2015-02-04.ftp.susx.ac.uk.tar and the source IP would be 82.197.212.29 |
22:25
🔗
|
Peetz0r |
SketchCow: maybe you can look? |
22:28
🔗
|
Peetz0r |
the ETA is jumping back and forth again |
22:29
🔗
|
Peetz0r |
I have even seen it say less than 1 hour, and more than 24 |
22:29
🔗
|
Peetz0r |
both are not okay |
22:34
🔗
|
Peetz0r |
it is definitely not my connection, I measured 868/684 while a few downloads and at least one upload were running just now: http://www.speedtest.net/result/4117114775.png |
22:35
🔗
|
Sanqui |
is it possible to run wget in parallel? if I start another warcing instance, will they fight each other? |
22:37
🔗
|
Peetz0r |
In general, yes, it is possible to run wget in parallel. But I don't know about your context |
22:38
🔗
|
Peetz0r |
I just download ftp sites (see #effteepee) and since most of those are quite slow, I download a few of them simultaneously |
22:38
🔗
|
Sanqui |
downloading a list of sites |
22:38
🔗
|
Sanqui |
of internet centrum |
22:38
🔗
|
Sanqui |
but it's going like 25kBps |
22:38
🔗
|
Peetz0r |
Are you using scripts that do stuff for you? |
22:39
🔗
|
Sanqui |
<Sanqui> <@Sanqui> anybody wanna look over my wget command? http://pastie.org/9887160 |
22:39
🔗
|
Peetz0r |
if so, then the answer would depend on how those scripts are designed |
22:39
🔗
|
Peetz0r |
ah :) |
22:39
🔗
|
Peetz0r |
I didn't know wget had warc support :) |
22:40
🔗
|
Sanqui |
http://archiveteam.org/index.php?title=Wget |
22:40
🔗
|
Sanqui |
do you suggest something different? |
22:40
🔗
|
Sanqui |
/better? |
22:40
🔗
|
Peetz0r |
if I were you, I'd just split that list in a few smaller parts and later figure out how to merge your different warc files |
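The splitting step suggested here could look roughly like the sketch below. It is only an illustration: the function name `split_into_parts` and the choice of contiguous, near-equal chunks are assumptions, not something anyone in the channel specified.

```python
from typing import List


def split_into_parts(items: List[str], n_parts: int) -> List[List[str]]:
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for index in range(n_parts):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if index < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts
```

Each part could then be written to its own file and handed to a separate wget run, with one WARC per part to merge afterwards.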
22:40
🔗
|
Sanqui |
hm, guess that's the way then |
22:40
🔗
|
Peetz0r |
https://github.com/maturban/WARCMerge |
22:40
🔗
|
Sanqui |
though it'll be annoying figuring out what's done and what isn't then |
22:41
🔗
|
Sanqui |
would be nicer to just start a wget instance for each site, while launching like 5 at once at most |
22:41
🔗
|
Sanqui |
also, wget should have a -e robots=abuse option, where it would parse robots.txt and scrape the disallowed URLs :) |
22:43
🔗
|
Peetz0r |
heh, evil :p |
22:44
🔗
|
Sanqui |
hey, gotta save that shit :P |
22:44
🔗
|
midas |
Sanqui: you can throw them into the archivebot |
22:45
🔗
|
theChip |
it needs the spam filter cookie though |
22:45
🔗
|
Sanqui |
81k+ sites |
22:45
🔗
|
Sanqui |
and that |
22:45
🔗
|
midas |
hm that kinda sucks yeah |
22:45
🔗
|
midas |
dont think yipdw got around to create a cookiejar yet |
22:45
🔗
|
theChip |
a warrior script would work as was pointed out with the wget --header="Cookie: iccmtspmvrfy=ano" |
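As a rough sketch of that wget invocation, the snippet below builds an argument list carrying the cookie header quoted above. Only the `--header` value comes from the chat; the remaining flags (`--warc-file`, `--mirror`, `--page-requisites`) are real wget options chosen here as assumptions, and the helper names are hypothetical. The actual command in the pastie may differ.

```python
from typing import List
from urllib.parse import urlsplit

# Cookie header quoted in the chat; it gets past the site's spam filter.
COOKIE_HEADER = "Cookie: iccmtspmvrfy=ano"


def warc_name_for(url: str) -> str:
    """Derive a per-site WARC base name from the URL's hostname (assumed naming scheme)."""
    return urlsplit(url).hostname or "unknown"


def build_wget_command(url: str) -> List[str]:
    """Build a wget argument list for one site; flags other than --header are assumptions."""
    return [
        "wget",
        "--header=" + COOKIE_HEADER,
        "--warc-file=" + warc_name_for(url),
        "--mirror",
        "--page-requisites",
        url,
    ]
```

Passing the result to `subprocess.run` as a list avoids shell quoting problems with the space inside the header value.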
22:46
🔗
|
midas |
Peetz0r: your slow upload is normal, s3 can get overrun with uploads and slow down |
22:46
🔗
|
Sanqui |
yeah idk if I can set up a warrior project myself though :( |
22:46
🔗
|
theChip |
but that depends on me learning how to write a pipeline |
22:46
🔗
|
theChip |
I can do it I want to learn anyways |
22:46
🔗
|
midas |
check the programming part |
22:46
🔗
|
midas |
http://archiveteam.org/index.php?title=Dev |
22:47
🔗
|
Peetz0r |
midas: ah, okay |
22:47
🔗
|
Peetz0r |
will just be patient then |
22:47
🔗
|
midas |
yep |
22:47
🔗
|
Sanqui |
I've read it, but it's still somewhat complicated |
22:47
🔗
|
Sanqui |
looks like something that should be done by people already familiar :/ |
22:47
🔗
|
theChip |
I've got my dev environment running but can get wget-lua on my macbook, I should be able to figure out one if I could talk to a tracker *shifty eyes* |
22:47
🔗
|
Peetz0r |
also, I have the issue that my downloading machine disk becomes slow and iowait goes through the roof |
22:47
🔗
|
Peetz0r |
even with ionice this is an issue |
22:48
🔗
|
Peetz0r |
http://stream.haas-en-berg.nl:81/munin/Home/flappie/diskstats_utilization/sda-day.png and http://stream.haas-en-berg.nl:81/munin/Home/flappie/cpu-day.png |
22:48
🔗
|
ersi |
Peetz0r: You in Europe? |
22:48
🔗
|
Peetz0r |
yes |
22:48
🔗
|
midas |
also normalish, lots of small files on ftp boxes |
22:49
🔗
|
Peetz0r |
midas: what I upload is one huuuge tar file |
22:49
🔗
|
midas |
and laptop harddrive |
22:49
🔗
|
Sanqui |
hmm |
22:49
🔗
|
ersi |
Peetz0r: OK, that's probably why your speed goes up and down. I've had transit issues to IA as well |
22:49
🔗
|
Peetz0r |
it's a WD black :D |
22:49
🔗
|
Sanqui |
I think I'm going to write a shell script to do parallel site-by-site wget |
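A parallel site-by-site runner along these lines might be sketched as follows, in Python rather than shell. Everything here is an assumption for illustration: the names `crawl_sites`, `run_command`, and `build_command` are hypothetical, and the limit of 5 simply mirrors the "5 at once at most" idea mentioned earlier.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def run_command(command: List[str]) -> int:
    """Run one command to completion and return its exit status."""
    return subprocess.run(command).returncode


def crawl_sites(
    urls: Iterable[str],
    build_command: Callable[[str], List[str]],
    max_parallel: int = 5,
    run: Callable[[List[str]], int] = run_command,
) -> Dict[str, int]:
    """Run one command per URL, never more than max_parallel at a time."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map keeps results in the same order as url_list.
        codes = list(pool.map(lambda url: run(build_command(url)), url_list))
    return dict(zip(url_list, codes))
```

The returned mapping of URL to exit status helps with the "what's done and what isn't" bookkeeping. wget exits non-zero when it hit any problem, including server error responses, so a non-zero code marks a site to review rather than a certain failure. Instances should not interfere with each other as long as `build_command` gives every site its own `--warc-file` name and output directory.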
22:49
🔗
|
midas |
oh Peetz0r, nah that has to do with the ia upload command |
22:49
🔗
|
midas |
it calculates a hash if im not mistaken |
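To illustrate why hashing a huge tar file is hard on the disk, here is a minimal sketch of a chunked checksum. It assumes MD5 and a 1 MiB chunk size, and the function name `file_md5` is hypothetical; whether `ia upload` works exactly this way is not confirmed in the chat.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Every byte of the file is read from disk once before any upload starts.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays flat, but the full sequential read competes with any other disk activity on the machine, which would fit the iowait pattern described here.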
22:50
🔗
|
Peetz0r |
does ia upload split stuff in many small files? |
22:50
🔗
|
Peetz0r |
does ia upload also kill my disk? |
22:50
🔗
|
midas |
yep |
22:50
🔗
|
Peetz0r |
because I use ionice only on the tarring so far |
22:50
🔗
|
midas |
first, nope, second yes |
22:50
🔗
|
Peetz0r |
okay, will add ionice to ia upload as well |
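Wrapping the upload in ionice could be sketched like this. The helper name `with_idle_io` and the local file path are hypothetical; the identifier is the one quoted earlier in the log. `ionice -c 3` requests the idle I/O class, which only takes effect with I/O schedulers that honor priorities, so it may not help on every setup.

```python
from typing import List


def with_idle_io(command: List[str]) -> List[str]:
    """Prefix a command so it runs in the idle I/O scheduling class (ionice -c 3)."""
    return ["ionice", "-c", "3"] + list(command)


# Identifier taken from the chat; the local file path is a placeholder assumption.
upload_command = with_idle_io(
    ["ia", "upload", "2015-02-04.ftp.susx.ac.uk.tar", "/path/to/archive.tar"]
)
```

The resulting list can be passed to `subprocess.run` unchanged.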
22:50
🔗
|
midas |
ia doesnt split it |
22:51
🔗
|
midas |
it will put some hurt on your disks and cpu |
22:51
🔗
|
Peetz0r |
my cpu handles it just fine |
22:51
🔗
|
Peetz0r |
the red and green parts of the graph are actual cpu usage |
22:51
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:52
🔗
|
Peetz0r |
the purple (huuge) part is iowait |
22:52
🔗
|
midas |
didnt check your graphs yet |
22:53
🔗
|
theChip |
is there a dev tracker stood up anywhere? I've tried two boxes and the redis always gets screwed up when I follow Dev/Tracker wiki page. |
22:54
🔗
|
Peetz0r |
the CPU is an i3 370M (2.4G dualcore) which seems to be overkill for this |
22:54
🔗
|
aaaaaaaaa |
there is a pre-built ova for the tracker |
22:54
🔗
|
Peetz0r |
overkill is good because the same machine does more than just effteepee'ing |
22:56
🔗
|
yipdw |
if you just need a cookie that can be set in a specific pipeline |
22:56
🔗
|
yipdw |
for archivebot |
22:56
🔗
|
yipdw |
I'm assuming this isn't some gigantic 20 million URL job |
22:57
🔗
|
theChip |
ya I have that in my VMWare box and I just realized I could just stand up an ubuntu in the same subnet and dev there, thanks aaaaaaaaa |
22:57
🔗
|
theChip |
well its 81k *seeds* |
22:57
🔗
|
yipdw |
what's a seed |
22:57
🔗
|
theChip |
n00b speak for individual domains i guess |
22:58
🔗
|
theChip |
so if we get one shitty little forum in there and I can see the URL count shooting up |
22:58
🔗
|
yipdw |
81,000 domains? |
22:58
🔗
|
yipdw |
those are seriously all related? |
22:58
🔗
|
Peetz0r |
I need a cookie because hungry |
22:58
🔗
|
Peetz0r |
:p |
22:59
🔗
|
midas |
Peetz0r: #archiveteam-bs for bs please |
22:59
🔗
|
theChip |
this is just what was discovered in the ic.cz directory from the wayback machine https://raw.githubusercontent.com/chpwssn/ic.czstuff/master/waybackcatalogresults.txt |
22:59
🔗
|
theChip |
sorry this is probably -bs by now |
22:59
🔗
|
Peetz0r |
sorry midas |
22:59
🔗
|
Sanqui |
#internetcentury also :) |
22:59
🔗
|
yipdw |
oh ic.cz |
23:00
🔗
|
Sanqui |
ok I don't trust wget any more lool |
23:02
🔗
|
|
dashcloud has joined #archiveteam |
23:07
🔗
|
|
phuzion has joined #archiveteam |
23:07
🔗
|
|
phuzion has quit IRC (Remote host closed the connection) |