#archiveteam-bs 2014-10-25,Sat


Time Nickname Message
01:29 🔗 Jonimus USB over IP has been available for a while, perhaps not in the kernel, but it's been a thing you could do.
03:07 🔗 tfgbd Does warcextract.py actually extract files or does it do something else that is considered "extract"?
03:39 🔗 dashcloud tfgbd: I can't tell you how good this warcviewer is, but it does run on Windows, and I did use it before on a small warc: https://github.com/odie5533/WarcQtViewer
03:56 🔗 tfgbd Oh, thanks.
03:56 🔗 tfgbd will check it out
03:56 🔗 tfgbd I'm working on trying the proxy in ubuntu right now
03:57 🔗 tfgbd This may be even better, though
03:57 🔗 tfgbd will have to see how it handles 10GB warcs
03:57 🔗 tfgbd Can you put this on the Wiki?
03:57 🔗 tfgbd I'd have never found this thing on my own
04:11 🔗 tfgbd this darn proxy won't work on linux either
04:12 🔗 tfgbd I get further along but it still doesn't seem to work
04:12 🔗 tfgbd When i go to http://warc I just get
04:12 🔗 tfgbd Please make sure that the WARC server is installed and active.
04:12 🔗 tfgbd WARC server unreachable!
04:12 🔗 tfgbd You need to start the special WARC server before you can browse the past.
04:12 🔗 tfgbd lol, obviously the server is started or I wouldn't see a message saying the server can't be reached
04:13 🔗 tfgbd it is only supposed to be run form localhost?
04:14 🔗 tfgbd from*
04:15 🔗 tfgbd yup, that was it. seems to work on the local machine
04:16 🔗 tfgbd nm, that doesn't work right either
04:16 🔗 tfgbd or it doesn't like links, at least
04:20 🔗 tfgbd Okay, just tried WarcQtViewer.
04:20 🔗 tfgbd It works but it's a bit limited
04:20 🔗 tfgbd it seems to only let you extract one file at a time
04:20 🔗 tfgbd I need to get thousands
04:49 🔗 yipdw tfgbd: if you're seeing that message, you've probably got a cached page
04:49 🔗 yipdw also, what WARC is this
04:50 🔗 yipdw I think you are overcomplicating this
05:12 🔗 tfgbd Well, I'm using linux now
05:12 🔗 tfgbd Still having some issues with some of this stuff
05:12 🔗 tfgbd I'm trying warctozip.py on a network share atm
05:12 🔗 tfgbd lets see how far it goes
05:16 🔗 tfgbd this is the warc: https://archive.org/download/archiveteam_archivebot_go_150/wdl1.winworldpc.com-inf-20140913-144638-8m8pg-00001.warc.gz
05:23 🔗 yipdw tfgbd: ok, so that 00001 indicates a sequence number; ArchiveBot generates 10 GB WARCs
05:24 🔗 tfgbd So they ARE multi-part?
05:24 🔗 yipdw hold the fuck on
05:24 🔗 yipdw the reason for that is that it ensures that you have to repeat an upload of at most 10 GB on failure
05:24 🔗 yipdw they are not "multipart" in the sense that a single part is an incomplete archive
05:24 🔗 tfgbd So some of thest 500GB of archives may be repeats?
05:24 🔗 yipdw no, they're not repeats
05:24 🔗 tfgbd these*
05:25 🔗 tfgbd But I don't think the site is even 500GB
05:25 🔗 tfgbd why are there 500GB of warcs?
05:25 🔗 tfgbd Different crawl dates?
05:25 🔗 yipdw it probably contains extra stuff, like offsite page requisites
05:25 🔗 yipdw not different crawl dates; ArchiveBot goes continuously at a site until it finishes the queue or the job is aborted
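
The sequence number yipdw describes can be read straight off the file name. The following is a minimal sketch, assuming every part of a job is named like the single example quoted in this log (a job prefix, a dash, a five-digit sequence number, then `.warc.gz`). The pattern and the helper names `split_part_name` and `sort_parts` are illustrative assumptions, not something ArchiveBot documents here.

```python
import re

# Assumed naming scheme, inferred only from the one filename quoted in the log:
#   <job prefix>-<5-digit sequence>.warc.gz
PART_RE = re.compile(r"^(?P<job>.+)-(?P<seq>\d{5})\.warc\.gz$")


def split_part_name(filename):
    """Return (job_prefix, sequence_number) for a WARC part filename."""
    match = PART_RE.match(filename)
    if match is None:
        raise ValueError("not a recognised WARC part name: %r" % filename)
    return match.group("job"), int(match.group("seq"))


def sort_parts(filenames):
    """Order part filenames by job prefix, then by sequence number."""
    return sorted(filenames, key=split_part_name)
```

Because each part is a complete WARC on its own, ordering the files this way is only a convenience for processing them one after another; nothing needs to be joined first.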
05:25 🔗 BlueMaxim as someone who can barely upload a gigabyte in the space of six or seven hours
05:25 🔗 BlueMaxim 10GB a part seems a bit high
05:25 🔗 tfgbd what happens if a site blocks it?
05:26 🔗 yipdw then we abort the job when we notice and call it whatever
05:26 🔗 yipdw BlueMaxim: file an issue, it works for us
05:26 🔗 tfgbd Or, I dunno. Do you not mind using up like 500GB of bandwidth for this?
05:26 🔗 yipdw I don't care, I burn through 160 TB or so a month running ArchiveBot nodes
05:26 🔗 tfgbd I mean for the site owners
05:27 🔗 yipdw most people don't care
05:27 🔗 yipdw some do
05:27 🔗 tfgbd The site owner is right in #winboards and he was pissed at me for leeching even 20GB
05:27 🔗 yipdw [00:27:09] <GipsyDngr> yipdw: Job status: 19186 completed, 532 aborted, 137 failed, 48 in progress, 0 pending
05:27 🔗 yipdw [00:27:09] <yipdw> !status
05:27 🔗 yipdw most don't
05:27 🔗 tfgbd Do you do stuff to make it look like the requests come from different people and/or do the download slowly?
05:27 🔗 yipdw (1) no (2) depends on the job
05:27 🔗 tfgbd Or do you hit them as hard as possible with as many connections as it will let you? ;P
05:27 🔗 yipdw depends on the job
05:28 🔗 yipdw archivebot has configurable concurrency and delay
05:28 🔗 tfgbd so, how did you do winworldpc?
05:28 🔗 yipdw archivebot
05:28 🔗 yipdw hence the collection
05:28 🔗 tfgbd but I mean what were the settings
05:28 🔗 yipdw I don't recall the parameters because (1) they aren't stored in the job and (2) the settings may vary over time
05:28 🔗 tfgbd i see
05:28 🔗 chfoo wpull saves the parameters in the warcinfo record
05:29 🔗 yipdw even when they change over time?
05:29 🔗 yipdw I don't recall seeing that in the warcinfo record, but I might just have missed it
05:29 🔗 chfoo no, just the command line args
05:29 🔗 tfgbd well, considering it is mostly an HTTP file server with a bunch of zipped OS installs, what would you usually do for that sort of thing
05:29 🔗 yipdw I usually just throw it in archivebot
05:30 🔗 yipdw but you can also just crawl the site with wget or whatever
05:30 🔗 tfgbd welp, warctozip.py failed again even under linux
05:30 🔗 tfgbd Do I really need more than 300mb?
05:30 🔗 tfgbd ram*
05:31 🔗 chfoo does warcat not do what you want it to?
05:33 🔗 yipdw warcat works fine on that WARC
05:34 🔗 tfgbd Will warcat let me extract all the files into directories/subdirectories?
05:34 🔗 tfgbd I didn't get there yet
05:34 🔗 yipdw https://gist.github.com/anonymous/f8f92c6858259cb8c73c
05:34 🔗 yipdw yes
05:34 🔗 yipdw archivebot3@starnode:~/warctozip/extraction$ python3 -mwarcat extract ../wdl1.winworldpc.com-inf-20140913-144638-8m8pg-00001.warc.gz
05:34 🔗 yipdw for completeness, the command line
05:35 🔗 yipdw also that's still running so yes that gist is incomplete
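
To see what is inside a `.warc.gz` before extracting anything, the records can be walked with the Python standard library alone. This is a rough sketch and not how warcat is implemented: it assumes a well-formed WARC in which every record carries a `Content-Length` header, and the function name `iter_warc_records` is made up for illustration. It also reads each record body fully into memory, which is a simplification that may not suit records holding very large files.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Simplification: the whole body is held in memory at once.
        body = stream.read(int(headers.get("content-length", 0)))
        yield headers, body


def list_target_uris(path):
    """Return the WARC-Target-URI of every record in a .warc.gz file that has one."""
    with gzip.open(path, "rb") as stream:
        return [h["warc-target-uri"] for h, _ in iter_warc_records(stream)
                if "warc-target-uri" in h]
```

`gzip.open` reads through the concatenated gzip members that `.warc.gz` files are normally made of, so no special handling is needed for that.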
05:35 🔗 yipdw 300 MB of RAM is probably pushing the low side
05:36 🔗 tfgbd how about 2GB?
05:36 🔗 tfgbd I tried with that and it still failed
05:36 🔗 yipdw warcat or warctozip
05:36 🔗 tfgbd warctozip
05:36 🔗 yipdw then try 2 GB with warcat and see what happens
05:36 🔗 tfgbd Will it work on NT?
05:37 🔗 yipdw I don't know
05:37 🔗 tfgbd Will try
05:37 🔗 tfgbd warctozip worked but it still failed about 100MB into the convert
05:37 🔗 tfgbd died even sooner on the linux install with 300MB of ram
05:37 🔗 yipdw ps reports memory rss spikes around 300-500 MB using warcat but it drops soon afterwards
05:38 🔗 tfgbd I'll give er a go
05:38 🔗 tfgbd In the end, after all the frustration, I at least get to learn a bit about python, I guess.. ;P
05:38 🔗 yipdw working fine here
05:38 🔗 yipdw the frustration is self-induced
05:39 🔗 tfgbd Well, it's not really frustration. I consider it fun.
05:39 🔗 tfgbd But i tried at least 4 tools by now
05:39 🔗 yipdw I'm not sure why you insisted on running warctozip in a configuration that few people here have ever used, demanded assistance with said configuration, and kept on running warctozip when warcat was recommended a while back
05:39 🔗 yipdw but if that's fun I guess whatever floats your boat
05:41 🔗 chfoo i would suggest in the future to look at the change dates of the software. generally i avoid software that hasn't been updated in years
05:42 🔗 tfgbd I saw no mention of warcat
05:42 🔗 tfgbd This is the first I've heard it
05:42 🔗 chfoo maybe a netsplit or a connection timeout
05:43 🔗 chfoo but the full list is here: http://archiveteam.org/index.php?title=WARC
05:43 🔗 tfgbd I can't even find this warcat with google
05:43 🔗 yipdw https://pypi.python.org/pypi/Warcat/
05:44 🔗 yipdw google customizes their search indexes according to factors unknown, but that's the first search hit I get for "warcat"
05:45 🔗 tfgbd I searched for mwarcat
05:45 🔗 yipdw why
05:45 🔗 tfgbd because: archivebot3@starnode:~/warctozip/extraction$ python3 -mwarcat extract
05:45 🔗 yipdw the package is called warcat, -m specifies a module
05:45 🔗 tfgbd without a space?
05:45 🔗 yipdw we don't call it mwarcat
05:46 🔗 yipdw -mwarcat and -m warcat both work fine
05:46 🔗 chfoo -m is an option
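
The attached form (`-mwarcat`) and the separated form (`-m warcat`) follow the usual convention for short options that take a value. A small demonstration using `argparse`, which follows the same convention; this only illustrates the convention and is not the interpreter's own option parser. The program name `demo` and the argument names are placeholders.

```python
import argparse

# A toy parser with one short option that takes a value, like python3's -m.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("-m", dest="module")
parser.add_argument("rest", nargs="*")

# Value attached directly to the option letter.
attached = parser.parse_args(["-mwarcat", "extract", "file.warc.gz"])

# Value given as the next argument.
separate = parser.parse_args(["-m", "warcat", "extract", "file.warc.gz"])

# Both spellings name the module "warcat"; there is no module called "mwarcat".
print(attached.module, separate.module)
```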
06:07 🔗 DFJustin you could also just try downloading the files through the wayback machine
06:07 🔗 DFJustin that's the primary way these archives are intended to be used
06:12 🔗 tfgbd There is nothing there
06:13 🔗 tfgbd i understood that -m was an option but not using a space confused me
06:13 🔗 tfgbd I find it strange nobody here has tested this on or uses Windows
06:14 🔗 tfgbd while 90% of the rest of the world does
06:14 🔗 DFJustin oh I see it blocks wayback
06:14 🔗 DFJustin well a) this is a self-selecting group of huge nerds and b) nobody is really testing anything a whole lot anywhere because nobody is being paid
06:15 🔗 tfgbd I see
06:15 🔗 tfgbd but this stuff isn't even yours
06:15 🔗 tfgbd It's random 3rd party stuff you made a list of, I guess
06:16 🔗 DFJustin I personally use windows but I mainly download things rather than actually using them
06:17 🔗 yipdw ?
06:17 🔗 yipdw warcat was developed by chfoo
06:17 🔗 yipdw warctozip was built by alard
06:17 🔗 yipdw none of this is random third-party stuff; the reason why it's mentioned is because people who hang out here made it
06:18 🔗 yipdw where did you get the opposite idea
06:32 🔗 tfgbd I suppose because the wiki has stuff by archive.org and other people too
06:35 🔗 tfgbd Do you "support" BSDs?
06:35 🔗 yipdw depends on the maintainer
06:39 🔗 tfgbd I wasn't really expecting "support" but just a few suggestions. The wiki page isn't that helpful and I was kind of expecting the tools I was using to just work
06:39 🔗 tfgbd I mean how should I know the warctozip web site didn't support large files?
06:41 🔗 yipdw you aren't supposed to know
06:42 🔗 DFJustin it's bad that it doesn't work but unless one of the small number of people who can fix it is paying attention there's no point in saying so more than once
06:43 🔗 yipdw I'm more put off by your assumptions; I read them as entitlement
06:43 🔗 DFJustin file a bug on github and move on
06:43 🔗 yipdw if we had paid staff maintaining this stuff ok fine
06:43 🔗 yipdw that's just me though, there are much calmer people here
06:44 🔗 tfgbd yeah, you seem a bit harsh
06:47 🔗 DFJustin I basically agree with you that the stuff is not nearly as working or user friendly as it needs to be but I can't do much about it so it's annoying to keep reading it
07:01 🔗 amerrykan i gather it's still early days for a lot of this stuff, it's largely built for nerds who use linux, don't mind the command-line and can tolerate breakage
07:01 🔗 amerrykan perhaps you can help drive those changes you want to see, others may follow your lead
08:45 🔗 Coderjoe I need a dectalk emulator FF addon...
11:53 🔗 joepie91 [08:13] <tfgbd> I find it strange nobody here has tested this on or uses Windows
11:53 🔗 joepie91 [08:14] <tfgbd> while 90% of the rest of the world does
11:53 🔗 joepie91 Windows is not a nice platform for scripting or archiving
11:54 🔗 joepie91 you'll find that lots of software, unless it *explicitly* targets non-techy end-users, won't run on Windows or won't be tested on Windows
11:54 🔗 joepie91 because the regular user group are advanced users and they've chosen to use Linux, OS X or some other Unix-like so that stuff won't break
11:55 🔗 joepie91 sure, 99%(?) of people playing AAA games will do so on Windows
11:55 🔗 tfgbd i get ya
11:55 🔗 tfgbd Windows is annoying with copying folder date stamps
11:55 🔗 joepie91 but that doesn't hold true for stuff like data processing (which is more or less what archiving is from a technical POV)
11:55 🔗 joepie91 I personally don't even bother testing *anything* I build in Windows or IE (I primarily do webdev)
11:56 🔗 joepie91 if it works, great - if it doesn't, so be it
11:56 🔗 joepie91 I only have an XP VM for reverse-engineering weird USB protocols :P
11:56 🔗 joepie91 tfgbd: not just that, it's annoying in lots of ways
11:57 🔗 joepie91 the perfect testcase is copying over some 3k files of 5kb
11:57 🔗 joepie91 on NTFS, it'll take you a few minutes
11:57 🔗 joepie91 on ext3/ext4, under a second
11:58 🔗 joepie91 if you work with lots of small files, that slowness gets very annoying very fast :)
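
The small-file test case joepie91 describes is easy to try on any machine. A minimal sketch, assuming the defaults of 3000 files of 5 KiB each taken from the numbers above; the function name `time_small_file_copy` is invented here. It copies within the system temporary directory, so the result reflects whatever filesystem that lives on and is influenced by OS caching. Treat the number as a rough indication rather than a benchmark.

```python
import os
import shutil
import tempfile
import time


def time_small_file_copy(count=3000, size=5 * 1024):
    """Create `count` files of `size` bytes, copy the directory, and time the copy.

    Returns (elapsed_seconds, number_of_files_copied).
    """
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        dst = os.path.join(root, "dst")
        os.mkdir(src)
        payload = b"x" * size
        for i in range(count):
            with open(os.path.join(src, "f%05d.bin" % i), "wb") as fh:
                fh.write(payload)
        start = time.perf_counter()
        shutil.copytree(src, dst)  # only the copy itself is timed
        elapsed = time.perf_counter() - start
        copied = len(os.listdir(dst))
    return elapsed, copied


if __name__ == "__main__":
    seconds, files = time_small_file_copy()
    print("copied %d files in %.2f s" % (files, seconds))
```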
11:58 🔗 joepie91 anyway, I was about to leave for the supermarket
12:02 🔗 ersi tfgbd: Just so you know, contributions are very welcome. ArchiveTeam is a small pack of individuals doing things. There's no backing up from anyone really. There's no money, no paid time etc.
12:17 🔗 schbirid uhm, anyone else being hammered with fake bt traffic?
12:18 🔗 schbirid brb
12:18 🔗 tfgbd Amazing you did all this then
12:22 🔗 schbirid that was unpleasant :o
12:24 🔗 schbirid was not AT related though
12:27 🔗 Kazzy schbirid: bittorrent, or bt the ISP?
12:27 🔗 schbirid torrent
12:27 🔗 schbirid someone ddosed my torrent port
12:28 🔗 schbirid well, tried to. or transmission did something bad before (which would not surprise me one bit)
12:28 🔗 schbirid got ~30 kilobytes/s of empty announces :)
12:28 🔗 schbirid no announces, i mean whatever things peers say to each other
12:29 🔗 Kazzy blame it on transmission, useless client
13:10 🔗 godane so i'm grabbing more Japanology episodes
13:11 🔗 godane anyways i upload feb 01 to 10 2008 videos for funny or die collection
13:22 🔗 dashcloud joepie91: you're familiar with USB reverse engineering? if so, msg me- got a question for you
13:34 🔗 schbirid heh, 55k different IPs
13:34 🔗 schbirid that is not transmission's fault
13:34 🔗 schbirid either what, torrentbytes or x264 is attacked
14:00 🔗 joepie91 dashcloud: not really, I just dicked around with it
14:01 🔗 joepie91 tfgbd: archiveteam infrastructure is really rather impressive and polished given the kind of operation it is
14:01 🔗 joepie91 it's not perfect, but I've gone "huh, neat" on a number of occasions
14:01 🔗 joepie91 :p
14:02 🔗 dashcloud thanks
14:02 🔗 joepie91 dashcloud: for context, http://cryto.net/~joepie91/areson
14:02 🔗 joepie91 haven't yet gotten further than that
15:04 🔗 godane SketchCow: so it looks like you guys at least got over half of the images from twitpic
15:05 🔗 godane 500 million number is what is on the globalnews.ca article: http://globalnews.ca/news/1633807/800-million-twitpic-photos-to-vanish-from-the-web-saturday/
15:06 🔗 antomatic Got 500m directly from Cloudfront, but it's literally 500 million unindexed pictures which, since Noah tweaked Twitpic to frustrate the efforts, will be hard to tie back to the actual twitpic URLs they originally lived at
15:06 🔗 antomatic [as I understand it]
15:07 🔗 antomatic It would have been easy to do - but he seemed to deliberately change the site to make it hard
15:17 🔗 ersi Since he's a douchebag
15:21 🔗 godane https://web.archive.org/web/20141025151927/http://twitpic.com/19jjry
15:21 🔗 godane we can still archive it
15:24 🔗 godane does anyone know of the blocks that you guys don't have?
15:59 🔗 midas http://twitpic.com/eabw9h funny.. his pictures still work
16:04 🔗 antomatic something has changed, midas, by the look of it
16:04 🔗 antomatic #quitpic
16:05 🔗 midas moving over
16:17 🔗 joepie91 [17:31] <antomatic> All those weeny commenters on HN, all "they can't afford the transfer costs" and stuff.
16:17 🔗 joepie91 link?
16:18 🔗 antomatic two ticks
16:18 🔗 joepie91 k
16:18 🔗 antomatic https://news.ycombinator.com/item?id=8472047
16:18 🔗 joepie91 process.nextTick(function(){ process.nextTick(function(){ getUrl(); });});
16:18 🔗 joepie91 :)
16:20 🔗 joepie91 "For those downvoting this - why don't you try paying for 800M images to be leached from the server of a failed company of yours with no hope of recouping a single dollar."
16:20 🔗 antomatic :)
16:20 🔗 joepie91 um, yeah, you mean like a number of individuals have been attempting to do for the past weeks?
16:21 🔗 joepie91 >.>
16:24 🔗 SketchCow The 500 million number came from me.
16:28 🔗 SketchCow Why do I keep visiting the HN page
16:30 🔗 SketchCow It's like it has a big banner over the tunnel entrance saying "Tunnel of Cheeto-fingered Libertarian Tech Cultists" and motherfuck if I don't just get into the goddamn boat going "Well, maybe I'll be able to extract some value from [music starts] IT'S A WORLD OF CRUELTY A WORLD OF FATE / IT'S A WORLD OF THE POWER OF DOLLARS AND HATE"
16:37 🔗 midas did you have the talk yet SketchCow ?
17:06 🔗 SketchCow No, that's Monday
17:06 🔗 * SketchCow was just chatting with Rick Prelinger, the god
19:17 🔗 joepie91 .tw https://twitter.com/joepie91/status/526089515119968256
19:17 🔗 botpie91 TwitPic gets acquired. Again. I think. Until it doesn't after all, I suppose. Really, who the fuck knows by now? http://blog.twitpic.com/2014/10/twitpics-future/ (@joepie91)
19:27 🔗 joepie91 pikhq: so perhaps we should continue here
19:27 🔗 joepie91 :P
19:28 🔗 joepie91 anyway, the whole hard-to-contact issue also exists for Google
19:28 🔗 pikhq Ya
19:28 🔗 joepie91 and a bunch of other companies
19:28 🔗 joepie91 it really bothers me
19:28 🔗 pikhq That's really weird.
19:28 🔗 joepie91 they basically just run a skeleton crew for consumer customer support
19:28 🔗 joepie91 responding to abusemails and that kind of crap
19:28 🔗 joepie91 if you want any actual support, you're shit out of luck
19:28 🔗 pikhq At least Google itself has mailing address and phone number findable.
19:29 🔗 joepie91 phone number for Google is useless
19:29 🔗 joepie91 it's just the corporate HQ
19:29 🔗 joepie91 they won't provide any support
19:29 🔗 joepie91 I suspect the same goes for mailing address
19:29 🔗 pikhq For the particular situation I was talking about, corporate HQ is what you *want*. :P
19:29 🔗 joepie91 they'll just tell you to "post on the Google Group for the product"
19:29 🔗 pikhq "We'd like to make off with a copy of your content before you die".
19:29 🔗 joepie91 pikhq: not according to them, you don't
19:30 🔗 joepie91 corporate HQ contact details are for things that make them money, and for unavoidable legal things
19:30 🔗 joepie91 everything else is end-user support
19:30 🔗 joepie91 :p
19:30 🔗 pikhq And fuck those guys.
19:30 🔗 joepie91 basically
19:30 🔗 joepie91 Google doesn't have actual customer support unless you're an enterprise customer
21:11 🔗 Smiley So I'm up to 'Bongo' on adding metadata...
21:12 🔗 Smiley This thing is huge D:
21:12 🔗 Smiley SketchCow: the actual 'internet arcade' collection itself doesn't have any description, I don't know if that's deliberate (and I can't add one anyway).
21:19 🔗 dashcloud Smiley: you're working on arcade metadata as well?
21:20 🔗 Smiley Nod
21:23 🔗 dashcloud I've been doing it, and I'm up to Loderunner- maybe I'll start from the bottom and work up then
21:23 🔗 Smiley hmmm can do
21:23 🔗 Smiley whatever works
21:26 🔗 BlueMaxim what collection is this? just curious
21:28 🔗 Smiley pm BlueMaxim
21:28 🔗 BlueMaxim OK
21:33 🔗 Smiley you got my msg right?
21:51 🔗 BlueMaxim no smiley
21:52 🔗 Smiley and now?
21:52 🔗 BlueMaxim yup :P
21:53 🔗 Smiley odd
21:53 🔗 BlueMaxim wow this looks like a nice collection
21:59 🔗 Smiley Nod
22:37 🔗 ersi Jebus, #-bs day today
22:57 🔗 joepie91 ersi: it's been overall busy and lively today
22:57 🔗 joepie91 no doubt thanks to Jason publicly declaring archival war on Twatpic
22:57 🔗 joepie91 :)
22:58 🔗 ersi or people reading the same news and splattering out over #at and starting the same discussion again
22:58 🔗 ersi Twitpic got acquired! etc
22:59 🔗 joepie91 but haha not really etc
23:10 🔗 godane so looks like Canada had a power outage like i had in 2008
23:11 🔗 godane based on News Hour Toronto for dec 2013
23:12 🔗 godane my power outage post: http://godane.wordpress.com/2008/12/21/out-of-power-for-9-days/
23:52 🔗 SketchCow Smiley: I'm actually fixing that.
23:53 🔗 SketchCow The more people who help with the arcade thing the better
23:53 🔗 SketchCow I redid the structure of the software library: http://archive.org/details/softwarelibrary
