Time |
Nickname |
Message |
00:21
π
|
|
kristian_ has quit IRC (Leaving) |
00:39
π
|
aschmitz |
Has anyone done any work on NPR's comments? |
00:54
π
|
r3c0d3x |
Asked about this a few days back, didn't get any response, so I'd assume no. |
01:47
π
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
01:47
π
|
|
HCross has joined #archiveteam |
01:54
π
|
|
khaoohs has joined #archiveteam |
02:03
π
|
|
khaoohs has quit IRC (Quit: Leaving) |
02:10
π
|
|
tomwsmf has quit IRC (Read error: Operation timed out) |
02:30
π
|
|
mr-b has left |
02:45
π
|
|
db48x has joined #archiveteam |
03:10
π
|
|
db48x` has joined #archiveteam |
03:11
π
|
|
db48x has quit IRC (Read error: Operation timed out) |
03:14
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
03:15
π
|
|
BartoCH has joined #archiveteam |
03:22
π
|
|
nicolas17 has quit IRC (Quit: U+1F634) |
04:09
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
04:12
π
|
|
BartoCH has joined #archiveteam |
04:17
π
|
|
JesseW has joined #archiveteam |
04:17
π
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:24
π
|
|
Sk1d has joined #archiveteam |
04:26
π
|
JesseW |
we should probably get all the sites we can from http://www.users.totalise.co.uk as it appears to be a small ISP, in the process of being merged with another one (although they don't explicitly talk about shutting down the web sites) |
04:29
π
|
JesseW |
!ig 28j6lpt5lmtyrdi4dhfugpmto squarespace\.com |
04:35
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
04:35
π
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
04:35
π
|
|
HCross has joined #archiveteam |
04:43
π
|
|
DFJustin has quit IRC (Ping timeout: 260 seconds) |
04:43
π
|
|
Meroje has quit IRC (Quit: bye!) |
04:44
π
|
|
Meroje has joined #archiveteam |
04:53
π
|
|
DFJustin has joined #archiveteam |
04:53
π
|
|
swebb sets mode: +o DFJustin |
05:05
π
|
|
DFJustin has quit IRC (Remote host closed the connection) |
05:10
π
|
|
DFJustin has joined #archiveteam |
05:15
π
|
|
HCross has quit IRC (Read error: Operation timed out) |
05:15
π
|
|
HCross has joined #archiveteam |
05:45
π
|
JesseW |
I'm in the process of grabbing the ones I can find with archivebot |
05:51
π
|
|
quails has quit IRC (Ping timeout: 250 seconds) |
05:56
π
|
|
quails has joined #archiveteam |
05:57
π
|
|
phuzion has quit IRC (Read error: Operation timed out) |
05:58
π
|
|
phuzion has joined #archiveteam |
06:04
π
|
|
patrickod has quit IRC (Read error: Operation timed out) |
06:04
π
|
|
patrickod has joined #archiveteam |
06:05
π
|
|
phuzion has quit IRC (Read error: Operation timed out) |
06:05
π
|
|
sep332 has quit IRC (Read error: Operation timed out) |
06:05
π
|
|
midas1 has quit IRC (Read error: Operation timed out) |
06:07
π
|
|
midas1 has joined #archiveteam |
06:07
π
|
|
swebb sets mode: +o midas1 |
06:07
π
|
|
sep332 has joined #archiveteam |
06:10
π
|
|
Fake-Name has quit IRC (Ping timeout: 501 seconds) |
06:13
π
|
|
BlueMaxim has joined #archiveteam |
06:13
π
|
|
phuzion has joined #archiveteam |
06:13
π
|
|
Fake-Name has joined #archiveteam |
06:49
π
|
|
zerbrnky has joined #archiveteam |
06:49
π
|
zerbrnky |
hi all, anyone around? |
06:49
π
|
tuankiet |
Any problem? |
06:50
π
|
JesseW |
Zebranky: no. But ask whatever you were going to ask anyway... |
06:50
π
|
zerbrnky |
hm i should use a different nick D: |
06:51
π
|
zerbrnky |
i'm not Zebranky (i use a variant of this nick on places where longer names are allowed) |
06:51
π
|
|
zerbrnky is now known as rbraun |
06:51
π
|
JesseW |
oops, sorry |
06:51
π
|
rbraun |
i was looking through the gawker dumps on archive.org and yeah, there might be a problem |
06:52
π
|
JesseW |
well, a lot of our most recent stuff may not have made it up there yet |
06:52
π
|
JesseW |
and we know about the robots.txt issues |
06:52
π
|
JesseW |
is there a different problem? |
06:52
π
|
rbraun |
it looks like they were grabbed by grabbing the sitemap for each month and then grabbing from there |
06:52
π
|
rbraun |
the problem is that the sitemap for especially busy months can't be grabbed a whole month at a time |
06:53
π
|
JesseW |
hm, yeah that could be an issue. godane? |
06:53
π
|
rbraun |
so e.g. everything in January 2010 before 1/19 is missing from both this: https://archive.org/details/gawker.com-sitemap-2010-20160322 |
06:53
π
|
rbraun |
and from web.archive.org too |
06:53
π
|
rbraun |
rather it's not all missing from web.archive.org but some pages are |
06:54
π
|
rbraun |
and many of the pages that /are/ there weren't crawled this year, indicating the bulk grab in march didn't hit them |
06:54
π
|
rbraun |
this seems to be a bigger problem for older pages (probably back when they still paid their writers by the article) |
06:54
π
|
JesseW |
do you know of a way to get a list of the missing pages? |
06:55
π
|
rbraun |
yeah, you just see what the start date for the sitemap was and edit the end date to be that, iterate until it grabs thru the first of the month |
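The procedure rbraun describes can be sketched roughly as below. This is only an illustration in Python, not anyone's actual script: `backfill_windows` and the `earliest_day_returned` callback are hypothetical names, and the callback stands in for "fetch the sitemap for this window and read off the earliest article date in it". The 12-day cap in the fake server is made up for the example.

```python
from datetime import date, timedelta
from typing import Callable, List, Tuple

Window = Tuple[date, date]


def backfill_windows(
    first_day: date,
    last_day: date,
    earliest_day_returned: Callable[[date, date], date],
) -> List[Window]:
    """Return the (start, end) windows to request so the whole range is covered.

    Each pull keeps the same start and moves the end back to the earliest
    day the previous pull actually reached, until the first day is reached.
    """
    windows: List[Window] = []
    end = last_day
    while True:
        windows.append((first_day, end))
        earliest = earliest_day_returned(first_day, end)
        if earliest <= first_day:
            return windows
        if earliest >= end:
            # No progress: a single day holds more than the sitemap returns.
            raise RuntimeError(f"sitemap is stuck at {earliest}")
        end = earliest


# Fake server (assumption): a sitemap never reaches back more than 12 days.
def fake_earliest(start: date, end: date) -> date:
    return max(start, end - timedelta(days=12))


windows = backfill_windows(date(2010, 1, 1), date(2010, 1, 31), fake_earliest)
for start, end in windows:
    print(start, "to", end)
```

With this fake cap the loop needs three pulls for the month, and every window ends on a whole day.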
06:55
π
|
rbraun |
i'm working on it now but i was wondering if anyone had already done it |
06:56
π
|
rbraun |
e.g. january 2010 takes 3 pulls |
06:56
π
|
rbraun |
and then of course all the pages... |
06:56
π
|
rbraun |
(january 2012 is complete, though) |
06:56
π
|
JesseW |
godane is the person who has been working on it; hopefully he'll speak up |
07:03
π
|
|
Honno has joined #archiveteam |
07:11
π
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
07:11
π
|
rbraun |
is there a faster way to force wayback to crawl a list of URLs than just loading http://web.archive.org/save/[URL] for each? |
07:13
π
|
PurpleSym |
Try #archivebot |
07:18
π
|
rbraun |
oh, nice, there is a non-recursive option |
07:20
π
|
rbraun |
archiveonly < FILE is probably what i need, thanks |
07:21
π
|
rbraun |
when archivebot uploads a WARC to archive.org, does it end up in web.archive.org too? |
07:21
π
|
rbraun |
in wayback, that is |
07:23
π
|
PurpleSym |
Yes, that's the point. |
07:23
π
|
rbraun |
ok, thanks, this looks easier than i thought |
07:24
π
|
rbraun |
(fwiw i first discovered this issue when i noticed something from jan 2010 *wasn't* in wayback at all; then, found it wasn't in the collection i linked either) |
07:28
π
|
rbraun |
gut feeling is that 2007-2011 are affected in part |
07:28
π
|
rbraun |
(looking at http://gawkerdata.kinja.com/closing-the-book-on-gawker-com-1785555716) |
07:31
π
|
|
REiN^ has quit IRC (Read error: Connection reset by peer) |
07:33
π
|
|
phuzion has quit IRC (Read error: Operation timed out) |
07:36
π
|
|
phuzion has joined #archiveteam |
07:52
π
|
|
schbirid has joined #archiveteam |
08:14
π
|
godane |
based on site map its 2010-01-19 on: gawker.com/sitemap_bydate.xml?startTime=2010-01-01T00:00:00&endTime=2010-01-31T23:59:59 |
08:14
π
|
godane |
ok i see the problem: http://gawker.com/sitemap_bydate.xml?startTime=2010-01-01T00:00:00&endTime=2010-01-01T23:59:59 |
08:15
π
|
godane |
sometimes those sitemaps do some weird shit |
08:16
π
|
rbraun |
godane: do you have the missing ones or should i keep compiling them and feed them to archivebot? |
08:16
π
|
rbraun |
i have 2010 almost ready |
08:17
π
|
godane |
you can feed them into archivebot if you want to |
08:17
π
|
godane |
i will also see about doing it |
08:17
π
|
rbraun |
i checked several URLs from my file; some of them are in wayback and some not |
08:18
π
|
rbraun |
ok; i'm working on 2010 but i think all of 2007-11 might be affected based on volume |
08:18
π
|
rbraun |
(and the URLs not in wayback weren't saved in the big March dump, they were crawled earlier) |
08:19
π
|
godane |
i may be doing daily grabs now |
08:19
π
|
godane |
regrabs of what i got |
08:19
π
|
rbraun |
also really uncertain how long any of the site will stay up so |
08:20
π
|
godane |
i will work on gawker.com sitemap |
08:20
π
|
rbraun |
note that in every case i saw, if the monthly grab by default returned through X date, the original grab had all of those articles |
08:21
π
|
rbraun |
but not all the ones before that |
08:26
π
|
godane |
i'm redumping gawker.com as a daily sitemap grab |
08:30
π
|
godane |
kataku.com has the same problem |
08:30
π
|
godane |
*kotaku.com |
08:37
π
|
|
REiN^ has joined #archiveteam |
08:43
π
|
|
WinterFox has joined #archiveteam |
08:58
π
|
rbraun |
godane: do you want what i have for 2010? might save some time |
08:59
π
|
godane |
its not going to save me time sadly |
09:00
π
|
godane |
my script makes a run at the sitemap by the day now |
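A day-by-day run like this one could build its request list along the following lines. This is a hedged sketch in Python, not godane's script: the function name `daily_sitemap_urls` is hypothetical, and the URL pattern is simply copied from the `sitemap_bydate.xml` links pasted earlier in the log.

```python
import calendar
from typing import List


def daily_sitemap_urls(site: str, year: int, month: int) -> List[str]:
    """Build one whole-day sitemap_bydate.xml URL per day of the month."""
    days_in_month = calendar.monthrange(year, month)[1]
    urls = []
    for day in range(1, days_in_month + 1):
        d = f"{year:04d}-{month:02d}-{day:02d}"
        urls.append(
            f"http://{site}/sitemap_bydate.xml"
            f"?startTime={d}T00:00:00&endTime={d}T23:59:59"
        )
    return urls


urls = daily_sitemap_urls("gawker.com", 2010, 1)
print(len(urls))
print(urls[0])
```

Each window covers a full day, from 00:00:00 to 23:59:59.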
09:00
π
|
rbraun |
ok |
09:00
π
|
godane |
also i will have to do that with all of gawker sites |
09:00
π
|
rbraun |
some of them i think don't have enough articles for this to have been an issue |
09:01
π
|
rbraun |
not sure which ones though |
09:01
π
|
godane |
i have uploaded some of those |
09:01
π
|
godane |
they were in the 10 to 100mb |
09:01
π
|
godane |
range |
09:02
π
|
rbraun |
might save the crawler time at least to not have to recrawl what's known already in the archive? |
09:02
π
|
|
BartoCH has joined #archiveteam |
09:02
π
|
rbraun |
(several different ways to do that; i was just using the date cutoff) |
09:05
π
|
rbraun |
also i'm not sure how much time is left for gawker.com specifically |
09:08
π
|
godane |
btw the sitemap cut off is weird |
09:09
π
|
godane |
like for 2008-11 i can get 3034 urls with gawker but only 1971 urls with kotaku.com |
09:09
π
|
Medowar |
google code is empty. Can someone requeue |
09:12
π
|
rbraun |
godane: there are fewer articles total for that month on kotaku though |
09:13
π
|
rbraun |
godane: for 2008-11 if i request the whole month it cuts off at the 14th for gawker but the 6th for kotaku |
09:13
π
|
rbraun |
oh, i see |
09:13
π
|
rbraun |
yeah, why didn't it grab the whole month for kotaku... |
09:13
π
|
rbraun |
fwiw their own sitemaps link in 1-week increments |
09:14
π
|
rbraun |
http://gawker.com/sitemap.xml |
09:14
π
|
rbraun |
not sure i trust that given how uneven it is but i haven't found a case where it failed yet |
09:17
π
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
09:17
π
|
|
HCross has joined #archiveteam |
09:22
π
|
godane |
sitemaps for 2006-01 are starting to be uploaded: https://archive.org/details/gawker.com-sitemap-2006-01-09-20160823 |
09:24
π
|
godane |
i'm doing 11 months of daily sitemaps at once :-D |
09:24
π
|
rbraun |
that's going to produce a lot of collections... any reason not to combine those by month? |
09:24
π
|
rbraun |
also FYI while investigating this, the sitemap_bydate.xml was giving me 500 errors sometimes |
09:25
π
|
rbraun |
that was reliable if i didn't request whole-day increments |
09:25
π
|
rbraun |
but it happened some other times too; just reloading fixed it |
09:25
π
|
godane |
my script uses curl to grab the sitemap by day then starts the download |
09:26
π
|
rbraun |
why not cat those together like a month at a time? |
09:27
π
|
godane |
cause i was not planning on doing that |
09:27
π
|
rbraun |
well, the reason i ask is |
09:27
π
|
rbraun |
the sitemaps provide an index of article titles |
09:28
π
|
rbraun |
so if i know gawker published an article in 1/2010 but i don't know which day... |
09:28
π
|
rbraun |
and i only know one word of the title or something |
09:28
π
|
godane |
https://archive.org/details/archiveteam-fire?and[]=subject%3A%22www.dailymail.co.uk%22 |
09:28
π
|
godane |
i do it by date of sitemap |
09:29
π
|
rbraun |
it's also easier to verify everything is in there if it's in larger chunks |
09:29
π
|
|
vOYtEC has quit IRC (Ping timeout: 244 seconds) |
09:29
π
|
godane |
making a monthly sitemap may confuse me |
09:30
π
|
godane |
thinking it was done with the old method, where the gawker sitemap doesn't get everything |
09:30
π
|
godane |
so the daily dumps are meant to be different since the month and yearly failed |
09:31
π
|
godane |
i can turn the daily dumps into monthly or yearly for that reason |
09:32
π
|
rbraun |
hmm ok |
09:34
π
|
godane |
i'm mostly trying to keep the raw sitemap urls the same set as date of urls |
09:34
π
|
|
HCross2 has quit IRC (Quit: Connection closed for inactivity) |
09:37
π
|
|
schbird has joined #archiveteam |
09:37
π
|
schbird |
is there a way to record mouse/keyboard interaction with webrecorder.io or a similar tool? |
09:37
π
|
rbraun |
godane: can your script handle the case where it returns a 500 error and retry? |
09:38
π
|
schbird |
to actually replay all "user" interaction |
09:38
π
|
rbraun |
godane: i guess curl --retry 10 or something |
09:41
π
|
rbraun |
i was getting those intermittently even on sitemap_bydate requests that would later complete |
09:49
π
|
godane |
i'm not really getting those errors |
09:49
π
|
godane |
i get them on days that don't exist i think |
09:49
π
|
rbraun |
no, i get empty files (or with the front page only) on days that don't exist |
09:50
π
|
rbraun |
i get 500 errors when it's cranky or if i try to pull a partial day (which doesn't work) |
09:50
π
|
rbraun |
but in the former case i had to retry a few times |
09:50
π
|
rbraun |
if you pass --retry <#> to curl with some number of retries allowed, you should have no problem though |
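For scripts that don't go through curl, the same retry idea might look like the sketch below. It is an assumption-laden illustration: `fetch_with_retry` and the `fetch` callback (which returns a status code and body) are hypothetical names, and the flaky fake server exists only to show the behaviour without touching the network.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, bytes]


def fetch_with_retry(
    fetch: Callable[[str], Response],
    url: str,
    retries: int = 10,
    delay: float = 1.0,
) -> Response:
    """Call fetch(url), retrying up to `retries` more times on a 500 status."""
    attempts = retries + 1
    for attempt in range(attempts):
        status, body = fetch(url)
        if status != 500:
            return status, body
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before asking again
    return status, body  # still 500 after all attempts


# Fake server (assumption): fails twice with 500, then succeeds.
calls = {"n": 0}


def flaky_fetch(url: str) -> Response:
    calls["n"] += 1
    if calls["n"] < 3:
        return 500, b""
    return 200, b"<urlset/>"


status, body = fetch_with_retry(
    flaky_fetch,
    "http://gawker.com/sitemap_bydate.xml"
    "?startTime=2010-01-01T00:00:00&endTime=2010-01-01T23:59:59",
    delay=0,
)
print(status, calls["n"])
```

Only the 500 status is retried here. An empty sitemap for a day with no articles comes back as a normal response and is returned as-is.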
09:52
π
|
rbraun |
only getting that on the sitemaps occasionally, not the actual pages |
09:52
π
|
|
Selavi has quit IRC (Ping timeout: 260 seconds) |
09:53
π
|
|
Kksmkrn has joined #archiveteam |
09:53
π
|
|
Kksmkrn has quit IRC (Connection closed) |
09:53
π
|
|
Kksmkrn has joined #archiveteam |
09:53
π
|
godane |
i'm going to bed now |
09:53
π
|
godane |
i will continue tomorrow |
09:54
π
|
rbraun |
ok good night |
10:00
π
|
|
Selavi has joined #archiveteam |
10:09
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
10:16
π
|
|
BartoCH has joined #archiveteam |
10:28
π
|
|
enr1c0 has joined #archiveteam |
10:31
π
|
|
Kksmkrn has quit IRC (Quit: leaving) |
11:00
π
|
|
enr1c0 has quit IRC (Quit: ZNC 1.6.3+deb1 - http://znc.in) |
11:00
π
|
|
enr1c0 has joined #archiveteam |
11:01
π
|
|
enr1c0 has left |
11:26
π
|
|
enr1c0 has joined #archiveteam |
11:30
π
|
|
enr1c0 has quit IRC (Client Quit) |
11:31
π
|
|
enr1c0 has joined #archiveteam |
11:31
π
|
|
enr1c0 has left |
11:35
π
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
11:35
π
|
|
HCross has joined #archiveteam |
12:28
π
|
|
irl has joined #archiveteam |
12:29
π
|
irl |
ok, so i was here a while ago and i'm trying to archive a whole bunch of paper manuals and documents from the 70s-90s from obscure networking hardware and computer programs relating to networking and such |
12:30
π
|
irl |
following a complete mess trying to use the university's MFD devices (they scan to email only, and couldn't do large attachments, so i was limited to ~5 pages) |
12:30
π
|
irl |
i've now decided i want to buy a scanner with an ADF to sit in the lab |
12:30
π
|
irl |
can anyone recommend such a scanner that can handle various paper types, and paper with binding holes etc. that isn't going to break constantly? |
12:31
π
|
irl |
ideally it would have linux support and not be networked, but direct into the pc |
12:31
π
|
irl |
ideally it would also be fast-ish, but i'll take reliability over speed |
12:32
π
|
PurpleSym |
I recently *built* a 25€ DIY book scanner, but it's quite slow. |
12:32
π
|
irl |
i'm talking ~10,000 ish pages of manuals |
12:32
π
|
irl |
they're mostly A4 paper that's been punched and hand-bound |
12:32
π
|
PurpleSym |
So, destructive scanning then? |
12:33
π
|
irl |
with those plastic binding things |
12:33
π
|
PurpleSym |
I see. |
12:33
π
|
irl |
my hope is to be able to just put the plastic things back on them afterwards |
12:34
π
|
irl |
i've looked through ebay for scanners with adf, but i have no idea how reliable these things are |
12:35
π
|
irl |
the HP 9200C 9200 Digital Sender seems to come up a lot and looks quite heavy duty |
12:35
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
12:37
π
|
|
BartoCH has joined #archiveteam |
12:46
π
|
irl |
purchased a 9200c, seems to have good reviews |
12:47
π
|
irl |
i'm guessing a lot of these things will have valid copyright |
12:47
π
|
irl |
any advice on how to work out what i can publish and what i shouldn't publish? |
12:48
π
|
irl |
is there a place i can stash things until the copyright expires? |
12:50
π
|
joepie91 |
irl: IA :) |
12:50
π
|
joepie91 |
irl: IA will dark things if they get complaints |
12:51
π
|
joepie91 |
where 'dark' === "it's still in the archives but not publicly accessible" |
12:51
π
|
joepie91 |
(also you might want to talk to SketchCow regarding manuals) |
12:51
π
|
|
atomotic has joined #archiveteam |
13:03
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:04
π
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
13:04
π
|
|
BartoCH has joined #archiveteam |
13:21
π
|
irl |
joepie91: ah cool (: |
13:21
π
|
irl |
so i can basically automate most of this then using scanner->ftp->git-annex-assistant->ia |
13:21
π
|
irl |
just need to get the right metadata in the right places |
13:21
π
|
irl |
SketchCow: i might want to talk to you |
13:24
π
|
|
WinterFox has quit IRC (Read error: Operation timed out) |
13:27
π
|
|
beardicus has quit IRC (bye) |
13:27
π
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
13:28
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
13:31
π
|
|
beardicus has joined #archiveteam |
13:31
π
|
|
swebb sets mode: +o beardicus |
13:35
π
|
|
beardicus has quit IRC (Client Quit) |
13:37
π
|
|
beardicus has joined #archiveteam |
13:37
π
|
|
swebb sets mode: +o beardicus |
13:45
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:45
π
|
|
BartoCH has joined #archiveteam |
13:46
π
|
|
wp494 has quit IRC (Read error: Operation timed out) |
13:47
π
|
|
dashcloud has joined #archiveteam |
14:42
π
|
|
tomwsmf has joined #archiveteam |
14:47
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
14:52
π
|
|
irl_ has joined #archiveteam |
14:53
π
|
|
irl has quit IRC (Quit: WeeChat 1.5) |
14:53
π
|
|
irl_ is now known as irl |
14:56
π
|
|
irl has quit IRC (Client Quit) |
14:57
π
|
|
irl has joined #archiveteam |
15:00
π
|
irl |
SketchCow: if you're interested in old manuals, i can get you a list of the things we maybe have |
15:01
π
|
|
nicolas17 has joined #archiveteam |
15:01
π
|
irl |
SketchCow: i'm now idling here via znc, so i'll see when you respond as i guess you're not around right now |
15:02
π
|
irl |
i'll be at debian uk bbq eating burgers this weekend, but will probably start a go at this the following weekend |
15:02
π
|
irl |
(slow start - not diving in) |
15:15
π
|
|
wp494 has joined #archiveteam |
15:18
π
|
|
JesseW has joined #archiveteam |
15:23
π
|
|
schbird has quit IRC (Read error: Operation timed out) |
15:25
π
|
|
JesseW has quit IRC (Read error: Operation timed out) |
15:34
π
|
|
BartoCH has joined #archiveteam |
15:56
π
|
|
VADemon has joined #archiveteam |
16:13
π
|
SketchCow |
Hugs to irl |
16:35
π
|
irl |
hello |
16:35
π
|
irl |
SketchCow: |
16:35
π
|
irl |
still here? |
16:42
π
|
|
HCross2 has joined #archiveteam |
16:42
π
|
SketchCow |
Yep. |
16:43
π
|
SketchCow |
So much talking. Come to #archiveteam-bs |
17:00
π
|
|
tomaspark has quit IRC (Ping timeout: 255 seconds) |
17:02
π
|
|
db48x` is now known as db48x |
17:06
π
|
arkiver |
bayimg is online again |
17:06
π
|
arkiver |
I restarted the script |
17:06
π
|
arkiver |
it's not yet in the warrior |
17:06
π
|
arkiver |
http://tracker.archiveteam.org/bayimg/ |
17:06
π
|
arkiver |
* restarted the projects |
17:06
π
|
arkiver |
project* |
17:18
π
|
SketchCow |
OK SO FINALLY |
17:19
π
|
SketchCow |
http://fos.textfiles.com/pipeline.html is in version 1.0. It'll run once a day (with an indication of when it was run). It's NOT real-time, it's just a way for you nerds to notice what's going on on the site, and be able to communicate with me or each other on a status. |
17:20
π
|
SketchCow |
It's Inbox --> Outbox --> IA, and if there's interruptions at IA, the Outbox might fill and "work" but will leave some items untouched. |
17:20
π
|
arkiver |
some projects seem to be missing |
17:21
π
|
SketchCow |
It's generating right now, but Orkut is such a nightmare, it will sit there for a while. I added another black-label "line" at the bottom of the table so you can see the difference between "running" and done. Looks like 10-15 minutes of disk thrashing to get through the mess. |
17:21
π
|
arkiver |
I see |
17:21
π
|
SketchCow |
In the future, when it has the second black line at the bottom, if it's not there, it's not in the pipeline. |
17:22
π
|
SketchCow |
The script in the future will probably run in 5 minutes, as long as insanities like orkut aren't going on. |
17:23
π
|
SketchCow |
So for example, the WHOLE pipeline is backed up (google code is at 187g) because of Orkut |
17:23
π
|
arkiver |
Yep |
17:23
π
|
SketchCow |
But at least now, in the future, one of you can go "Hey, looks like boombox project is at 300gb for some reason" and we can jump on that. |
17:24
π
|
SketchCow |
Or "it's time to add an upload script to this or that project" |
17:24
π
|
arkiver |
orkut is going down in 8 days, so just a little more time |
17:26
π
|
DigDug |
i thought orkut was long gone |
17:26
π
|
arkiver |
still here as an archive https://orkut.google.com/en.html |
17:27
π
|
Kaz |
are we on track to finish orkut? I have more available if FOS can handle it, if needed. |
17:27
π
|
arkiver |
I think we're going to make it |
17:27
π
|
arkiver |
Tomorrow or the day after we're going to retry the larger communities, so you might have to do a little less concurrent |
17:28
π
|
Kaz |
nod |
17:28
π
|
arkiver |
But I'll warn you before we do that |
17:28
π
|
arkiver |
the larger communities can be millions of posts |
17:28
π
|
arkiver |
(and URLs) |
17:29
π
|
SketchCow |
So, the script is going to finish running, and I'm going to make two improvements. |
17:29
π
|
SketchCow |
First, it will not copy over the finished .html file until it's 100% done, so in the future, it's just "there" and not "in progress" |
17:30
π
|
SketchCow |
Second, I'm going to make a "cheat sheet" which will occasionally be forgotten by me to update but will change the "Project" name into something better. |
17:30
π
|
nicolas17 |
I tried archiving orkut and it seemed like you didn't need more nodes |
17:30
π
|
nicolas17 |
since most of the time I got rate-limiting by the tracker anyway |
17:31
π
|
nicolas17 |
so the download rate was limited by that setting, not by how many people were running the warrior |
17:42
π
|
|
AlexLehm has joined #archiveteam |
17:54
π
|
SketchCow |
http://fos.textfiles.com/pipeline.html just finished. |
17:55
π
|
SketchCow |
NOW you can rain down questions |
18:00
π
|
|
schbird has joined #archiveteam |
18:25
π
|
|
pfallenop has quit IRC (Ping timeout: 260 seconds) |
18:25
π
|
|
pfallenop has joined #archiveteam |
18:30
π
|
|
schbird has quit IRC (Read error: Operation timed out) |
18:34
π
|
|
Zialus has quit IRC (Read error: Operation timed out) |
18:34
π
|
HCross2 |
arkiver: let me know when, and I'll reduce my quarter of a trillion concurrent |
18:38
π
|
|
Zialus has joined #archiveteam |
18:40
π
|
arkiver |
SketchCow: it would be nice if it also shows megaWARC size |
19:19
π
|
|
VerifiedJ has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
19:24
π
|
SketchCow |
Not really easy to do that, since stuff will be either out or stuck. |
19:24
π
|
SketchCow |
Oh wait. |
19:24
π
|
SketchCow |
Mmm, let me see |
19:28
π
|
SketchCow |
I got it working. It's re-running and it'll update with it after it's done. |
19:29
π
|
SketchCow |
(almost all are 40gb but it's trivial to print it) |
19:29
π
|
SketchCow |
if someone wants to be a hero and wiki all this, go ahead |
19:52
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
19:58
π
|
|
BartoCH has joined #archiveteam |
20:39
π
|
ats |
one that needs crawling for the magazines collection: http://www.muzines.co.uk |
20:39
π
|
ats |
sadly it has a stupid obnoxious Javascript-based interface... |
20:47
π
|
schbirid |
seems to work mostly fine without js here |
21:04
π
|
|
HCross2 has quit IRC (Quit: Connection closed for inactivity) |
21:13
π
|
|
Morbus has joined #archiveteam |
21:15
π
|
|
VerifiedJ has joined #archiveteam |
21:19
π
|
|
schbird has joined #archiveteam |
21:27
π
|
VerifiedJ |
GTAGaming.com's database was compromised and they may be thinking about shutting the website down along with www.gta4-mods.com. http://www.gtagaming.com/news/comments.php?i=2369 |
21:40
π
|
|
Honno has quit IRC (Read error: Operation timed out) |
21:50
π
|
|
VerifiedJ has left |
21:58
π
|
|
vOYtEC has joined #archiveteam |
21:59
π
|
|
schbird has quit IRC (Leaving) |
22:07
π
|
|
schbirid2 has joined #archiveteam |
22:10
π
|
|
schbirid has quit IRC (Read error: Operation timed out) |
22:33
π
|
|
RichardG has joined #archiveteam |
22:42
π
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
22:45
π
|
|
schbirid2 has joined #archiveteam |
22:47
π
|
|
AlexLehm has quit IRC (Ping timeout: 260 seconds) |
23:16
π
|
|
JW_work1 has joined #archiveteam |
23:18
π
|
|
JW_work has quit IRC (Read error: Operation timed out) |
23:23
π
|
|
RichardG has quit IRC (Read error: Operation timed out) |
23:28
π
|
SketchCow |
Who here can read an ext3 disk and is comfortable with possibly having to do a dd and then extracting of data |
23:28
π
|
SketchCow |
US preferred |
23:29
π
|
nicolas17 |
you mean a physical disk, or? |
23:36
π
|
SketchCow |
Physical, here in front of me. |
23:37
π
|
* |
nicolas17 is physically too far |
23:37
π
|
Frogging |
what's involved in it? i.e. why can't you do it? |
23:37
π
|
|
rchrch has joined #archiveteam |
23:37
π
|
SketchCow |
Don't want to |
23:37
π
|
Frogging |
ah |
23:37
π
|
SketchCow |
If you're asking what's involved, you're not for the job |
23:38
π
|
nicolas17 |
well, he's asking eg. is it a corrupted ext3 you have to recover things out of, or just a clean filesystem but you have no Linux? :P |
23:38
π
|
Frogging |
yeah, basically^ |
23:39
π
|
Frogging |
I can do magic with block devices but I'm not so good at fixing physically broken disks |
23:39
π
|
SketchCow |
Not broken |
23:40
π
|
Frogging |
ah. can you ship? |
23:43
π
|
Frogging |
I assume so because you said US preferred. I'm in Canada though. But if nobody closer wants to then I volunteer |
23:44
π
|
Frogging |
I enjoy this sort of thing |
23:45
π
|
|
kristian_ has joined #archiveteam |
23:48
π
|
|
RichardG has joined #archiveteam |
23:56
π
|
|
Stiletto has quit IRC (Ping timeout: 246 seconds) |
23:57
π
|
SketchCow |
You're in line |
23:57
π
|
SketchCow |
We'll see if anyone else in the US wants it. |
23:58
π
|
SketchCow |
I can sustain a canadian mailing |