#archiveteam-bs 2015-03-29,Sun


Time Nickname Message
00:00 🔗 joepie91_ I certainly haven't ever seen it :P
00:00 🔗 joepie91_ it's enough of an edge case to not care about a misnaming in that case
00:00 🔗 serapeum has joined #archiveteam-bs
00:00 🔗 joepie91_ if it's accurate in 99% of the cases, that's better than confusing in 100% of the cases.. :)
00:01 🔗 dan_ fair enough, exactly :)
00:08 🔗 joepie91_ dan_: https://gist.github.com/joepie91/09aed84c45dc44967699
00:09 🔗 joepie91_ a lot more consistent than the RFC
00:09 🔗 joepie91_ :P
00:10 🔗 dan_ aha yep, RFC had to deal with implementation-specific things tacked on over years though, so I sorta forgive it~
00:13 🔗 joepie91_ dan_: heh, this is 14x, it has no excuse
00:13 🔗 joepie91_ :)
00:14 🔗 c_b2 has joined #archiveteam-bs
00:14 🔗 c_b has quit IRC (Ping timeout: 260 seconds)
00:16 🔗 c_b2 is now known as c_b
00:21 🔗 joepie91_ dan_: hm. is there an equivalent of HTTP 400/500 in IRC?
00:21 🔗 joepie91_ "some error that I don't have an error code for"
00:22 🔗 joepie91_ oh
00:22 🔗 joepie91_ 400
00:22 🔗 joepie91_ heh
00:24 🔗 mistym_ has quit IRC (Remote host closed the connection)
00:24 🔗 mistym has joined #archiveteam-bs
00:25 🔗 primus has quit IRC (Read error: Operation timed out)
00:36 🔗 dan_ https://www.alien.net.au/irc/irc2numerics.html
00:36 🔗 dan_ all those conflicts :)
00:37 🔗 joepie91_ yep
00:37 🔗 joepie91_ that has been my goto numeric guide for a long time
00:37 🔗 joepie91_ lol
00:37 🔗 dan_ haha, (Last updated: Tue, 11 Jan 2005 22:30:30 GMT)
00:38 🔗 dan_ gotta love irc
00:38 🔗 joepie91_ and it's still accurate! heh
00:49 🔗 BlueMaxim has quit IRC (Quit: Leaving)
00:51 🔗 BlueMaxim has joined #archiveteam-bs
00:52 🔗 primus104 has quit IRC (Leaving.)
01:27 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
01:39 🔗 schbirid2 has joined #archiveteam-bs
01:45 🔗 wp494 has quit IRC (Read error: No route to host)
01:48 🔗 BlueMaxim what do you guys think of BetaArchive
01:49 🔗 * kyan thinks they're jackasses, because they won't let other sites mirror their collection — a single point of failure for a valuable chunk of history, with a bureaucratic attitude
01:51 🔗 chfoo logchfoo: off
01:51 🔗 logchfoo has left
01:52 🔗 logchfoo starts logging #archiveteam-bs at Sun Mar 29 01:52:11 2015
01:52 🔗 logchfoo has joined #archiveteam-bs
01:52 🔗 chfoo (sorry to interrupt, i wanted to remove ops from the log bot)
01:59 🔗 joepie91_ lol, wow: https://en.wikipedia.org/wiki/FoundationDB
01:59 🔗 joepie91_ On March 25, 2015 it was reported that Apple has acquired the company.[6] A notice on the FoundationDB web site indicated that the company has "evolved" its mission and would no longer offer downloads of the software.[7]
01:59 🔗 joepie91_ "ha ha fuck you now you can't download our software anymore that you've built your infra on"
02:00 🔗 joepie91_ looks like Apple may soon be joining Yahoo in the list of douchebag-acquisition companies
02:03 🔗 Rotab lol
02:03 🔗 aaaaaaaaa looks like they evolved from "extend" to "extinguish"
02:04 🔗 garyrh Gotta acquire 'em all!
02:31 🔗 wp494 has joined #archiveteam-bs
02:57 🔗 vitzli has joined #archiveteam-bs
03:27 🔗 necenzura has joined #archiveteam-bs
03:53 🔗 necenzura has quit IRC (Quit: Page closed)
04:00 🔗 mistym has quit IRC (Remote host closed the connection)
04:04 🔗 dashcloud has quit IRC (Read error: Operation timed out)
04:11 🔗 dashcloud has joined #archiveteam-bs
04:12 🔗 aaaaaaaaa has quit IRC (Leaving)
04:24 🔗 mistym has joined #archiveteam-bs
04:30 🔗 vitzli has quit IRC (Quit: Leaving)
04:31 🔗 vitzli has joined #archiveteam-bs
04:59 🔗 Start_ has joined #archiveteam-bs
04:59 🔗 Start has quit IRC (Read error: Connection reset by peer)
05:06 🔗 brayden has joined #archiveteam-bs
05:11 🔗 c_b has quit IRC (Quit: c_b)
05:43 🔗 mistym has quit IRC (Remote host closed the connection)
05:49 🔗 godane https://www.youtube.com/watch?v=aOOE7KrrCpE
06:25 🔗 primus104 has joined #archiveteam-bs
07:13 🔗 edsu has joined #archiveteam-bs
07:20 🔗 john has joined #archiveteam-bs
07:21 🔗 john Does wpull not support --no-clobber, despite --help listing it?
07:24 🔗 yipdw it's implemented
07:25 🔗 yipdw if you're writing WARCs you won't need to worry about it
07:25 🔗 john Really? Because for me it downloads everything again.
07:25 🔗 john And when I append --no-clobber it prints the usage and exits.
07:25 🔗 john I built it from git master today.
07:26 🔗 yipdw use a stable version
07:26 🔗 yipdw master is generally good enough for use but I haven't been tracking it
07:27 🔗 john Okay.
07:27 🔗 john I thought it'd be one of those projects where git master is always the recommended version.
07:27 🔗 yipdw what gave you that impression
07:28 🔗 yipdw chfoo is generally pretty good about releases
07:28 🔗 yipdw http://wpull.readthedocs.org/en/master/changelog.html
07:28 🔗 ersi Who doesn't like bleeding edge? It should cut you, else it ain't good
07:28 🔗 ersi and new
07:29 🔗 yipdw FWIW, we don't use no-clobber anywhere in archivebot
07:29 🔗 yipdw I don't know what options you're passing, but download twice is not the default behavior
07:29 🔗 john Still doesn't work.
07:30 🔗 yipdw the list of options we pass is as follows: https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/archivebot/seesaw/wpull.py#L22-L57
07:30 🔗 john http_proxy="127.0.0.1:4444" wpull http://echelon.i2p/ --warc-file echelon.i2p --page-requisites --recursive --level inf --warc-max-size 5000000000 --no-clobber
07:30 🔗 john That's what I'm trying.
07:30 🔗 yipdw clobber doesn't occur with WARC writing
07:31 🔗 yipdw so you don't need to specify it
07:31 🔗 yipdw the data goes right into the WARC
07:32 🔗 john All right.
07:32 🔗 john But it still downloads everything again.
07:32 🔗 yipdw are you seeing duplicate HTTP requests or files along with WARC records
07:33 🔗 john Yes.
07:33 🔗 yipdw what the hell does that mean
07:34 🔗 john It means, it requests files that are already in the warc archive.
07:34 🔗 yipdw if the request comes from a redirect, that'll happen
07:35 🔗 yipdw wpull operates on URLs, not files
07:35 🔗 yipdw at least when doing websites
07:36 🔗 john It's not just that, it will again fetch the robots.txt and index file too.
07:36 🔗 yipdw post the logs
07:36 🔗 john All right.
07:37 🔗 john http://sprunge.us/ZebB
07:38 🔗 yipdw that log looks normal, there's no duplicate fetches in there
07:40 🔗 yipdw are you resuming a stopped grab?
07:40 🔗 yipdw if so you need to record the results to a database with --database
07:41 🔗 yipdw otherwise wpull will use an in-memory database that goes away once the process exits
07:41 🔗 john Oh…
07:41 🔗 john So that's what that's for. All right.
07:41 🔗 yipdw http://wpull.readthedocs.org/en/master/usage.html#stopping-resuming
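A minimal sketch of the resumable invocation yipdw describes, assuming the same options john posted minus `--no-clobber` and plus `--database`. The database filename `echelon.i2p.db` and the helper name `build_wpull_command` are made up for illustration. The script only builds and prints the command line; it does not run wpull.

```python
import shlex


def build_wpull_command(url, warc_name, database_path):
    """Return the argument list for a wpull grab that can be resumed later."""
    return [
        "wpull", url,
        "--warc-file", warc_name,
        "--page-requisites",
        "--recursive",
        "--level", "inf",
        "--warc-max-size", "5000000000",
        # Without --database, wpull keeps its URL table in memory only,
        # so a restarted grab starts from scratch.
        "--database", database_path,
    ]


# Hypothetical database filename; any writable path should do.
command = build_wpull_command("http://echelon.i2p/", "echelon.i2p", "echelon.i2p.db")
print(shlex.join(command))
```

The `http_proxy` environment variable from john's original command would still need to be set in the shell that runs the printed command.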
08:27 🔗 john I must say, I'm very happy with the web archive's new design. ^_^
09:10 🔗 schbirid2 has quit IRC (Leaving)
09:33 🔗 schbirid has joined #archiveteam-bs
09:33 🔗 schbirid anyone know a twitter bot that one can simply feed any text corpus to for funny markov chain tweets? all i found so far are based on your own tweet archive
09:36 🔗 vitzli has quit IRC (Quit: Leaving)
09:36 🔗 schbirid nvm, cant find the corpus i wanted to use as text anyways :(
11:49 🔗 godane i'm uploading Computer Power User 2014 pdfs
12:04 🔗 godane btw i'm also uploading Archival Outlook
12:04 🔗 godane from Society of American Archivists
12:05 🔗 godane i'm only doing that cause there is no collection of it on IA
12:06 🔗 godane and 2014 pdf are being put on bluetoad.org
12:08 🔗 godane https://archive.org/details/Archival_Outlook-2004-07
12:27 🔗 primus104 has quit IRC (Leaving.)
12:39 🔗 Smiley john: it was originally (afaik) blood for the blood god
12:40 🔗 Smiley schbirid: you wanted to use the sweary one?
12:40 🔗 john All right.
12:42 🔗 schbirid Smiley?
12:42 🔗 Smiley sweary corpus
12:45 🔗 lysobit has quit IRC (Quit: quit)
12:46 🔗 schbirid nah
12:49 🔗 Smiley aww
12:55 🔗 lysobit has joined #archiveteam-bs
14:29 🔗 godane so i found something called flightglobal.com
14:29 🔗 godane it has tons of Flight PDFs
14:30 🔗 primus104 has joined #archiveteam-bs
14:32 🔗 godane i may have to convert pdf pages into one pdf
14:32 🔗 godane cause they put every page as its own pdf
14:35 🔗 john Hmm… that's weird.
14:35 🔗 john I thought the .com file extension was usually used for flat binary files.
14:39 🔗 primus104 has quit IRC (Leaving.)
14:49 🔗 joepie91_ john: a file extension is nothing but bytes
14:49 🔗 joepie91_ it doesn't define a file
14:49 🔗 joepie91_ it's just the name
14:49 🔗 john I know.
14:50 🔗 john Usually the header gives you a good idea, but even that can be deceiving.
14:50 🔗 joepie91_ so it's probably an archive of a site named flightglobal.com :P
14:55 🔗 john Oh…
15:19 🔗 underscor has quit IRC (Ping timeout: 370 seconds)
15:28 🔗 underscor has joined #archiveteam-bs
15:28 🔗 swebb sets mode: +o underscor
15:28 🔗 primus104 has joined #archiveteam-bs
15:36 🔗 godane so i finally figured out why kbs korea culture news stopped at the end of Jan 2003
15:42 🔗 godane it looks like they just had high bit rate wmv between june 2002 to jan 2003
15:42 🔗 godane btw i'm getting something called Classic Odyssey
15:45 🔗 johtso anyone know of any very lenient regexes for matching URLs?
15:45 🔗 johtso ie. not requiring the protocol
15:46 🔗 johtso maybe even using a valid tld list..
15:47 🔗 underscor has quit IRC (Ping timeout: 370 seconds)
15:47 🔗 brayden has quit IRC (Ping timeout: 606 seconds)
15:54 🔗 joepie91_ johtso: "valid TLD list" became infeasible since ICANN went overboard with gTLDs
15:55 🔗 joepie91_ technically speaking, 'hi' is a valid URL if you want to ignore the protocol
15:55 🔗 johtso joepie91_, https://www.publicsuffix.org/list/effective_tld_names.dat
15:56 🔗 johtso just grab that and compile it into your regex :)
15:57 🔗 joepie91_ yeah, no
15:57 🔗 joepie91_ there's a number of issues with that list and you probably don't want a regex that large
15:57 🔗 johtso joepie91_, by URL I really mean publicly accessible web address
15:57 🔗 joepie91_ not to mention that this is NOT a complete list
15:57 🔗 joepie91_ yes
15:57 🔗 joepie91_ johtso: try ctrl+Fing that list for .onion
15:57 🔗 joepie91_ publicly accessible, just on a different network
15:57 🔗 joepie91_ not on the list
15:57 🔗 johtso mm, okay
15:58 🔗 johtso well, .onion wouldn't really be something I'd be looking for anyway ;)
15:58 🔗 johtso really I'm trying to extract file locker urls, but for my first pass I want to make sure I don't miss anything
15:58 🔗 joepie91_ extract from
15:59 🔗 joepie91_ ?
15:59 🔗 johtso the html content of blogger posts/comments
15:59 🔗 johtso and can't rely on the links being in html markup
16:00 🔗 joepie91_ and why without the protocol?
16:00 🔗 johtso just guessing that there must be *some* links out there that are missing the protocol
16:01 🔗 johtso I'd rather not miss them
16:05 🔗 joepie91_ just grab anything [\x21-\x7E-]+\.[\x21-\x7E-]+\/[\x21-\x7E-]+
16:06 🔗 joepie91_ chars<dot>chars<slash>chars
16:06 🔗 Start_ is now known as Start
16:06 🔗 joepie91_ you'll get a bunch of false positives I'm sure
16:06 🔗 joepie91_ but that's one HEAD away
16:06 🔗 johtso sounds like a great idea, seeing as I'm not interested in bare urls
16:06 🔗 joepie91_ :P
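A rough sketch of the "chars, dot, chars, slash, chars" pattern in Python, written with lowercase `\x7E` escapes because Python's `re` module rejects `\X`. The sample text and the name `LENIENT_URL` are invented for illustration; real Blogger HTML will produce far noisier matches.

```python
import re

# Any run of printable non-space ASCII, a dot, another run, a slash, another run.
# Deliberately over-matches: the goal of a first pass is to miss nothing.
LENIENT_URL = re.compile(r"[\x21-\x7E-]+\.[\x21-\x7E-]+/[\x21-\x7E-]+")

# Hypothetical sample text, not taken from any real post.
sample = "grab it at mega.co.nz/file/abc123 or http://example.com/a.zip. thanks"

matches = LENIENT_URL.findall(sample)
print(matches)
# ['mega.co.nz/file/abc123', 'http://example.com/a.zip.']
```

Because the character class covers every printable character, each match is effectively a whole whitespace-delimited token, so the second match keeps its trailing period.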
16:06 🔗 Start has quit IRC (Disconnected.)
16:06 🔗 Start has joined #archiveteam-bs
16:06 🔗 Start has quit IRC (Remote host closed the connection)
16:06 🔗 Start has joined #archiveteam-bs
16:06 🔗 johtso one HEAD away? :)
16:07 🔗 Sanqui the problem is you might be too greedy, and grab a period at the end -> 404
16:07 🔗 Sanqui or you might NOT grab the period -> 404
16:07 🔗 Sanqui ideally, you'd get both variations, but I don't think you can do that with a regex
16:08 🔗 joepie91_ johtso: HEAD request
16:08 🔗 joepie91_ requests headers, not body
16:08 🔗 johtso ah right!
16:08 🔗 joepie91_ so you get the status code
16:08 🔗 joepie91_ if it's 200, it's probably valid
16:08 🔗 johtso yeah, see if they're alive
16:08 🔗 joepie91_ Sanqui: that's postprocessing :)
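One way the postprocessing could look, as a sketch only: generate both variants Sanqui mentions (with and without trailing punctuation), add a scheme where it is missing, and send a HEAD request to each. The function names, the punctuation set, and the default `http://` scheme are all assumptions, not anything agreed in the channel.

```python
import http.client
import urllib.error
import urllib.request

# Characters that often trail a URL in running text (an assumed set).
TRAILING = ".,;:!?)\"'"


def candidate_urls(raw):
    """Return the raw match and, if different, a copy without trailing punctuation."""
    variants = [raw]
    stripped = raw.rstrip(TRAILING)
    if stripped and stripped != raw:
        variants.append(stripped)
    # Assume plain HTTP when the match carries no scheme.
    return [v if "://" in v else "http://" + v for v in variants]


def head_status(url, timeout=10):
    """Send a HEAD request; return the status code, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 for the wrong variant
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return None


def first_live(raw):
    """Return the first candidate that answers 200 to a HEAD request, else None."""
    for url in candidate_urls(raw):
        if head_status(url) == 200:
            return url
    return None
```

Some servers answer HEAD differently from GET or refuse it outright, so a non-200 result is a hint that the candidate is dead, not proof.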
16:10 🔗 dashcloud anyone have a good macro recording program so I can record me clicking a button, pressing a different button for a screenshot, and then closing any windows opened by my first button press?
16:11 🔗 joepie91_ OS?
16:14 🔗 dashcloud either windows or linux
16:14 🔗 dashcloud under linux, I'd be running the program under wine
16:16 🔗 schbirid simplescreenrecorder maybe?
16:16 🔗 schbirid i just use ffmpeg x11grab if i need to record something
16:16 🔗 schbirid ffmpeg -f x11grab -s 1280x800 -r 30 -i :0.0 -qscale 0 /tmp/x11grab4.mpg
16:16 🔗 schbirid oh ignore me
16:16 🔗 schbirid haha
16:17 🔗 joepie91_ dashcloud: windows, autohotkey
16:17 🔗 joepie91_ linux, nfi. I never do GUI automation other than some wmctrl hacks to make XBMC play nice with multiple monitors
16:17 🔗 joepie91_ :p
16:19 🔗 johtso dashcloud, I haven't used it, but you might want to check out http://www.sikuli.org/
16:21 🔗 dashcloud thanks!
16:27 🔗 brayden has joined #archiveteam-bs
16:48 🔗 Start have we grabbed the videos from joystiq yet?
16:49 🔗 Start it now redirects to engadget
16:52 🔗 ersi I think godane did a lot of them
16:52 🔗 Start ok
16:55 🔗 underscor has joined #archiveteam-bs
16:55 🔗 swebb sets mode: +o underscor
17:03 🔗 godane i uploaded the tuaw videos to Jason's ftp
17:04 🔗 godane but joystiq videos i didn't grab all yet
17:05 🔗 Start oh
17:05 🔗 Start how much did you get?
17:07 🔗 godane i really don't remember how much i got
17:07 🔗 godane but i want to say 400 to 500 videos
17:07 🔗 godane also joystiq youtube channel still has all of the videos
17:08 🔗 Start that's a relief
17:09 🔗 joepie91_ Facebook is killing their XMPP API on April 30: https://developers.facebook.com/docs/chat
17:12 🔗 xmc oh really, nice.
17:20 🔗 mistym has joined #archiveteam-bs
17:23 🔗 aaaaaaaaa has joined #archiveteam-bs
17:37 🔗 schbirid has quit IRC (Leaving)
17:37 🔗 schbirid has joined #archiveteam-bs
18:39 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
18:42 🔗 dashcloud has joined #archiveteam-bs
18:44 🔗 xtr-201 has quit IRC (Read error: Connection reset by peer)
18:49 🔗 aaaaaaaaa has quit IRC (Read error: Operation timed out)
19:14 🔗 SketchCow https://www.youtube.com/watch?v=uPVQMZ4ikvM
19:17 🔗 schbirid ffs, git/github privacy leaking is ridiculous
19:18 🔗 schbirid if you have more than one account, you are bound to accidentally post with random ones every now and then
19:20 🔗 underscor has quit IRC (Ping timeout: 370 seconds)
19:24 🔗 underscor has joined #archiveteam-bs
19:24 🔗 swebb sets mode: +o underscor
19:30 🔗 BlueMaxim has quit IRC (Ping timeout: 512 seconds)
19:31 🔗 BlueMaxim has joined #archiveteam-bs
19:48 🔗 SN4T14__ has joined #archiveteam-bs
19:50 🔗 aaaaaaaaa has joined #archiveteam-bs
19:51 🔗 lytv has quit IRC (Read error: Operation timed out)
19:51 🔗 lytv has joined #archiveteam-bs
19:55 🔗 SN4T14_ has quit IRC (Ping timeout: 512 seconds)
20:18 🔗 schbirid has quit IRC (Leaving)
20:41 🔗 useretail SketchCow: have they received their nobel prizes?
20:42 🔗 SketchCow They should!
20:43 🔗 useretail http://www.engineering.com/DesignerEdge/DesignerEdgeArticles/ArticleID/9848/VIDEO-Introducing-a-Fire-Extinguisher-Fuelled-by-Sound.aspx
20:43 🔗 useretail In fact, the Defense Advanced Research Projects Agency (DARPA) developed a system back in 2012 that utilized sound to put out flames.
20:44 🔗 useretail However, this marks the first time engineers have created an actual extinguisher using sound.
20:44 🔗 joepie91_ "Engineering seniors Viet Tran and Seth Robertson now hold a preliminary patent application for their potentially revolutionizing device. "
20:44 🔗 joepie91_ well, was nice while it lasted
20:45 🔗 SketchCow Dude, inventors patent shit
20:46 🔗 useretail yep, patents are killing innovation
20:46 🔗 SketchCow changes topic to: Archive Team: https://i.imgur.com/d9dPE6s.gif
20:47 🔗 dashcloud has quit IRC (Read error: Operation timed out)
20:50 🔗 SketchCow I don't agree, but the latest rustling in the fuck drawer came up empty
20:50 🔗 SketchCow Also, roughly $2000 went out the door yesterday into bills and debt and I am not happy
20:56 🔗 dashcloud has joined #archiveteam-bs
21:06 🔗 mistym has quit IRC (Remote host closed the connection)
21:23 🔗 mistym has joined #archiveteam-bs
22:16 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:19 🔗 dashcloud has joined #archiveteam-bs
23:24 🔗 mistym has quit IRC (Remote host closed the connection)
23:24 🔗 mistym has joined #archiveteam-bs
23:24 🔗 mistym has quit IRC (Remote host closed the connection)
23:37 🔗 mistym has joined #archiveteam-bs
23:38 🔗 primus104 has quit IRC (Leaving.)
23:49 🔗 johtso still getting 503 trying to upload to IA :(
23:50 🔗 johtso hopefully they'll sort it out tomorrow
