#archiveteam 2015-04-08,Wed


Time Nickname Message
00:06 🔗 pir^2 has joined #archiveteam
00:08 🔗 pir^2 Does anyone have a complete archive of eff.org?
00:11 🔗 primus104 has quit IRC (Leaving.)
00:11 🔗 wyatt8750 robots.txt been fixed yet?
00:12 🔗 mistym_ has joined #archiveteam
00:13 🔗 mistym_ has quit IRC (Client Quit)
00:13 🔗 pir^2 wyatt8750 - http://web.archive.org/web/20150407003912/http://wyatt8740.no-ip.org/git-glkterm-win32.tar.xz still gives me the robots error
00:13 🔗 wyatt8750 grr
00:14 🔗 wyatt8750 feex eeet! (I'll be patient, just checking in)
00:14 🔗 pir^2 wait, I just saw it work!
00:14 🔗 wyatt8750 holy crap
00:14 🔗 pir^2 wyatt:
00:15 🔗 pir^2 https://web.archive.org/web/*/http://wyatt8740.no-ip.org/git-glkterm-win32.tar.xz
00:18 🔗 pir^2 BTW Wayback (Heritrix) has a limit for file size IIRC. Maybe 5 MB? Can't remember.
00:24 🔗 pir^2 wyatt8740 - is that good now?
00:32 🔗 pir^2 For saving raw data like XML dumps (not HTML mostly), is using a tarball or WARC better?
00:33 🔗 aschmitz WARC if you want it to be browseable in the Wayback Machine or something. Tarball if it's just data you want to have archived, but isn't worthwhile to have in the Wayback Machine.
00:35 🔗 erbylnt has joined #archiveteam
00:36 🔗 pir^2 Is there any reason to have a bunch of zip files, or XML, JSON, etc. in the Wayback Machine?
00:37 🔗 aschmitz If you'd normally get to them by clicking links on a webpage, maybe. If not, not really.
00:39 🔗 mistym_ has joined #archiveteam
00:40 🔗 yipdw_ pir^2: WARC for HTTP only makes sense if you're recording request-response
00:41 🔗 yipdw_ if you're not, faking it is not a good idea
00:43 🔗 pir^2 has quit IRC (Ping timeout: 370 seconds)
00:44 🔗 pir^2 has joined #archiveteam
00:45 🔗 pir^2 I wasn't going to fake it... I was thinking of what settings to use for wget (or maybe wpull)
00:46 🔗 yipdw_ --warc-file is usually enough
00:46 🔗 yipdw_ oh wait, the other settings
00:46 🔗 yipdw_ ?
00:47 🔗 pir^2 well, whether to use --warc-file or not
00:49 🔗 pir^2 I previously used it for https://archive.org/details/dmoz-rdf-20150327, got a 27GB file, and it's still not "approved" in the Archive Team collection. And for the data I am going to save (not dmoz but similar "data dumps"), preserving the HTTP requests is probably not all that important
00:50 🔗 yipdw_ if you have a bunch of URLs you can throw them into a file and !ao < FILE them to archivebot
00:50 🔗 aschmitz For me, it depends on how I expect people will find the data. If they'd run into it in the Wayback Machine, then sure, I'll do WARCs. If they're more likely to find it by either a new link to it or browsing the IA website, then I wouldn't bother with WARCs.
00:51 🔗 yipdw_ that solves at least two open questions
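[Editor's note: a minimal sketch of the "--warc-file is usually enough" advice above, assuming a plain GNU wget install and a text file of URLs. The function name, the file name urls.txt, and the WARC base name data-dumps are hypothetical and chosen only for illustration. The script only builds and prints the command; it does not run it.]

```python
import shlex


def build_wget_warc_command(url_list_path, warc_basename):
    """Return the argv list for a wget run that records a WARC.

    url_list_path and warc_basename are hypothetical example inputs.
    --input-file reads URLs from a file, one per line.
    --warc-file makes wget write the request/response records to
    <warc_basename>.warc.gz alongside the normal downloads.
    """
    return [
        "wget",
        "--input-file=" + url_list_path,
        "--warc-file=" + warc_basename,
    ]


if __name__ == "__main__":
    cmd = build_wget_warc_command("urls.txt", "data-dumps")
    # Print a shell-quoted command line for review before running it by hand.
    print(shlex.join(cmd))
```

[If the request/response records do not matter for the data in question, as discussed above, a plain tarball of the downloaded files may be all that is needed, and the --warc-file option can be dropped.]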
00:52 🔗 aschmitz Who can help debug a pipeline script?
00:58 🔗 Wolfie has joined #archiveteam
00:58 🔗 Wolfie has left
01:05 🔗 dugo has joined #archiveteam
01:07 🔗 dugo_ has quit IRC (Read error: Operation timed out)
01:13 🔗 pir^2 Does IA have a secret archive of YouTube videos?
01:14 🔗 pir^2 https://archive.org/details/youtubecrawl
01:18 🔗 yipdw_ is now known as yipdw
01:18 🔗 aschmitz Hardly seems secret. :)
01:18 🔗 aschmitz But also quite dated.
01:19 🔗 pir^2 Secret as in darkened (or whatever they call it) - " These files are currently not publicly accessible."
01:20 🔗 aschmitz Ah, hadn't tried to download any of them.
01:20 🔗 pir^2 It's not clear what is actually in the collection. Is the "crawl" just metadata or actual, full, videos?
01:20 🔗 * aschmitz shuffles over to -bs
01:38 🔗 wyatt8750 I looked into it and still wish I could figure out how this youtube archive worked
01:38 🔗 wyatt8750 and I wish it were public :\
01:38 🔗 wyatt8750 I would like to relive the better days of youtube
01:39 🔗 wyatt8750 ...before the dark times. before the EMPIRE
01:39 🔗 * wyatt8750 goes to -bs
01:54 🔗 erbylnt has quit IRC (Read error: Connection reset by peer)
02:12 🔗 pir^2 has quit IRC (-)
02:31 🔗 beardicus has quit IRC (My MacBook Pro has gone to sleep. ZZZzzz…)
02:47 🔗 mistym_ has quit IRC (Remote host closed the connection)
02:48 🔗 donkus has quit IRC (Read error: Connection reset by peer)
02:48 🔗 donkus has joined #archiveteam
03:12 🔗 mistym_ has joined #archiveteam
03:37 🔗 chfoo has quit IRC (Remote host closed the connection)
03:40 🔗 chfoo has joined #archiveteam
04:03 🔗 Ravenloft has joined #archiveteam
04:19 🔗 aaaaaaaaa has quit IRC (Leaving)
05:11 🔗 philpem has joined #archiveteam
05:17 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
05:18 🔗 Control-S has joined #archiveteam
05:19 🔗 BlueMaxim has joined #archiveteam
05:24 🔗 Ctrl-S has quit IRC (Read error: Operation timed out)
05:24 🔗 Control-S is now known as Ctrl-S
05:39 🔗 SN4T14_ has quit IRC (Ping timeout: 306 seconds)
05:45 🔗 Stiletto has joined #archiveteam
05:45 🔗 SN4T14 has joined #archiveteam
06:17 🔗 donkus has quit IRC (Ping timeout: 512 seconds)
06:47 🔗 a3nm has joined #archiveteam
06:47 🔗 a3nm WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
06:48 🔗 BlueMaxim the secret word be "yahoosucks" m'lord
06:48 🔗 a3nm thanks
06:50 🔗 a3nm has quit IRC (Client Quit)
06:50 🔗 a3_nm has joined #archiveteam
06:50 🔗 a3_nm is now known as a3nm
06:51 🔗 mistym_ has quit IRC (Remote host closed the connection)
06:52 🔗 SketchCow make a difference!
07:13 🔗 primus104 has joined #archiveteam
07:23 🔗 signius has quit IRC (Read error: Operation timed out)
07:36 🔗 signius has joined #archiveteam
07:50 🔗 primus104 has quit IRC (Leaving.)
07:53 🔗 schbirid has joined #archiveteam
08:06 🔗 Ymgve has joined #archiveteam
09:55 🔗 primus104 has joined #archiveteam
10:25 🔗 MMovie2 has joined #archiveteam
10:27 🔗 MMovie has quit IRC (Ping timeout: 306 seconds)
10:41 🔗 BlueMaxim has quit IRC (Quit: Leaving)
11:06 🔗 anomie has quit IRC (Read error: Operation timed out)
11:20 🔗 anomie has joined #archiveteam
11:29 🔗 Froggypwn has quit IRC (Read error: Connection reset by peer)
11:30 🔗 Froggypwn has joined #archiveteam
11:32 🔗 wyatt8750 didn't you know? yahoo uses bing results now
12:15 🔗 Rotab yes
12:18 🔗 ersi now? more like since years ago
12:19 🔗 Rotab yes
12:23 🔗 midas since yahoo did rm -rf on /search/, expected from yahoo
12:25 🔗 ersi yada yada yada
12:41 🔗 atomotic has joined #archiveteam
12:44 🔗 beardicus has joined #archiveteam
12:50 🔗 sankin has joined #archiveteam
13:01 🔗 erbylnt has joined #archiveteam
13:26 🔗 Ravenloft has quit IRC (Ping timeout: 265 seconds)
13:41 🔗 primus104 has quit IRC (Leaving.)
13:55 🔗 K4k has joined #archiveteam
14:54 🔗 mistym_ has joined #archiveteam
14:54 🔗 mistym_ has quit IRC (Remote host closed the connection)
15:05 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
15:07 🔗 mistym_ has joined #archiveteam
15:07 🔗 Froggypwn has quit IRC (Read error: Connection reset by peer)
15:10 🔗 Froggypwn has joined #archiveteam
15:11 🔗 johtso Any idea why internet archive doesn't do TIFF images?
15:21 🔗 Nemo_bis johtso: in what sense?
15:21 🔗 johtso in the deriving other image formats sense, don't see it listed in the derivation chart
15:22 🔗 Nemo_bis johtso: IA derives happily from TIFF
15:22 🔗 johtso I only see it listed in the text items section though
15:22 🔗 johtso will it also do the deriving if it's just an image item?
15:23 🔗 Nemo_bis I doubt it cares
15:23 🔗 Nemo_bis Derivers just let imagemagick do whatever is needed, and imagemagick swallows TIFFs happily
15:23 🔗 Nemo_bis At worst you'll have to manually select the format
15:24 🔗 Nemo_bis Example TIFF book https://archive.org/details/VocabolarioDellaLinguaItaliana2
15:30 🔗 SketchCow Certainly derives from TIFF.
15:45 🔗 primus104 has joined #archiveteam
15:46 🔗 erbylnt has quit IRC (Ping timeout: 370 seconds)
15:48 🔗 primus105 has joined #archiveteam
15:50 🔗 mr-b has joined #archiveteam
15:51 🔗 donkus has joined #archiveteam
15:53 🔗 primus104 has quit IRC (Read error: Operation timed out)
16:03 🔗 mistym_ has quit IRC (Remote host closed the connection)
16:23 🔗 Start-mob has joined #archiveteam
16:35 🔗 erbylnt has joined #archiveteam
16:40 🔗 Start has quit IRC (Disconnected.)
16:55 🔗 Start-mob has quit IRC (Remote host closed the connection)
16:55 🔗 aaaaaaaaa has joined #archiveteam
16:55 🔗 Start-mob has joined #archiveteam
16:58 🔗 Froggypwn has quit IRC (Read error: Connection reset by peer)
16:59 🔗 Froggypwn has joined #archiveteam
17:02 🔗 Start-mob has quit IRC (Ping timeout: 370 seconds)
17:28 🔗 scyther has joined #archiveteam
17:49 🔗 nertzy has quit IRC (Remote host closed the connection)
17:50 🔗 habi has joined #archiveteam
17:56 🔗 SimpBrain http://archive.ex.fm gone dark?
17:59 🔗 habi has left
18:07 🔗 nertzy has joined #archiveteam
18:21 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
18:32 🔗 nertzy has joined #archiveteam
18:44 🔗 Start has joined #archiveteam
18:45 🔗 Start has quit IRC (Read error: Connection reset by peer)
18:45 🔗 Start has joined #archiveteam
18:55 🔗 Start has quit IRC (Disconnected.)
19:03 🔗 mistym has quit IRC (Quit: Leaving)
19:07 🔗 Start has joined #archiveteam
19:17 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
19:20 🔗 mistym has joined #archiveteam
19:25 🔗 Start has quit IRC (Disconnected.)
19:26 🔗 FMecha has joined #archiveteam
19:26 🔗 FMecha just some notice about 4chan's stuff
19:27 🔗 FMecha first, /sp/ is now archived again by totally.not4plebs.org (spin-off of 4plebs)
19:27 🔗 FMecha second, archive.moe is no longer archiving /g/
19:28 🔗 FMecha they are archiving /fit/ from heinessen
19:29 🔗 monod has joined #archiveteam
19:38 🔗 cadbury_ has joined #archiveteam
19:43 🔗 FMecha has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
19:45 🔗 Selanda has quit IRC (Ping timeout: 255 seconds)
19:46 🔗 Start-mob has joined #archiveteam
19:49 🔗 SN4T14_ has joined #archiveteam
19:51 🔗 mistym has quit IRC (Remote host closed the connection)
19:53 🔗 SN4T14 has quit IRC (Ping timeout: 306 seconds)
19:59 🔗 Start-mob has quit IRC (Remote host closed the connection)
20:01 🔗 atomotic has joined #archiveteam
20:09 🔗 Start-mob has joined #archiveteam
20:12 🔗 mistym has joined #archiveteam
20:18 🔗 dashcloud has quit IRC (Ping timeout: 260 seconds)
20:18 🔗 Start-mob has quit IRC (Leaving)
20:18 🔗 Start-mob has joined #archiveteam
20:18 🔗 SimpBrain has quit IRC (Quit: Leaving)
20:20 🔗 dashcloud has joined #archiveteam
20:24 🔗 K4k has quit IRC (Quit: WeeChat 1.0.1)
20:26 🔗 nertzy has joined #archiveteam
20:32 🔗 Start has joined #archiveteam
20:33 🔗 Start-mob has quit IRC (Leaving)
20:34 🔗 Start-mob has joined #archiveteam
20:36 🔗 Start-mob has quit IRC (Remote host closed the connection)
20:44 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
20:54 🔗 atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
20:54 🔗 sankin has quit IRC (Leaving.)
20:56 🔗 atomotic has joined #archiveteam
20:59 🔗 scyther has quit IRC (Read error: Connection reset by peer)
20:59 🔗 Selanda has joined #archiveteam
21:06 🔗 acridAxid has quit IRC (Quit: Quitting)
21:09 🔗 acridAxid has joined #archiveteam
21:09 🔗 BlueMaxim has joined #archiveteam
21:12 🔗 dashcloud has quit IRC (Ping timeout: 260 seconds)
21:13 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
21:14 🔗 atomotic has joined #archiveteam
21:16 🔗 dashcloud has joined #archiveteam
21:21 🔗 Start has quit IRC (Read error: Connection reset by peer)
21:21 🔗 Start has joined #archiveteam
21:23 🔗 Start has quit IRC (Read error: Connection reset by peer)
21:23 🔗 Start has joined #archiveteam
21:26 🔗 Start-mob has joined #archiveteam
21:38 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
21:42 🔗 Start-mob has quit IRC (Ping timeout: 370 seconds)
21:46 🔗 Start_ has joined #archiveteam
21:46 🔗 Start has quit IRC (Read error: Connection reset by peer)
21:51 🔗 Start_ is now known as Start
21:53 🔗 nertzy has joined #archiveteam
21:53 🔗 Selanda has quit IRC (Read error: Operation timed out)
22:07 🔗 Selanda has joined #archiveteam
22:08 🔗 monod has quit IRC (Ping timeout: 512 seconds)
22:11 🔗 T31m_ has quit IRC (Read error: Connection reset by peer)
22:12 🔗 T31M has joined #archiveteam
22:20 🔗 Selanda has quit IRC (Quit: leaving)
22:23 🔗 Start has quit IRC (Disconnected.)
22:27 🔗 Start-mob has joined #archiveteam
22:27 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
22:33 🔗 Selanda has joined #archiveteam
22:34 🔗 Start-mob has quit IRC (Remote host closed the connection)
22:38 🔗 chfoo ===== When running the warrior, please don't use proxies or Tor with it. It may cause junk (such as "Tor exit is blocked" pages) to be archived instead of actual content. ======
22:41 🔗 nertzy has joined #archiveteam
22:55 🔗 Start has joined #archiveteam
23:07 🔗 philpem has quit IRC (Ping timeout: 260 seconds)
23:09 🔗 Start-mob has joined #archiveteam
23:12 🔗 wp494_ has joined #archiveteam
23:15 🔗 wp494 has quit IRC (Ping timeout: 740 seconds)
23:17 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
23:18 🔗 Start arkiver: i've got ops again in #helpus
23:24 🔗 Ymgve has quit IRC ()
23:35 🔗 nertzy has joined #archiveteam
23:58 🔗 primus has quit IRC (Read error: Operation timed out)
