[00:00] *** goekesmi has joined #archiveteam
[00:01] <arkiver> We started downloading what we could find of RapidShare!!
[00:01] <arkiver> RapidShare has some limits, so help us!
[00:01] <arkiver> #rapidscare
[00:04] *** GLaDOS has quit IRC (Read error: Operation timed out)
[00:05] *** GLaDOS has joined #archiveteam
[00:08] <dashcloud> this is a pure vanity thing, but wayback v1 had an option to sort your uploads by downloads/views- I don't see that option in v2. Is that permanently gone, or planned to come back at some point?
[00:18] <garyrh> https://archive.org/search.php?query=uploader:youremail@blah.com&sort=-downloads works
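A minimal sketch of building the search URL garyrh posted, for anyone who wants to generate it per uploader. The `query` and `sort` parameters are taken from that URL; the function name is made up for illustration, and the sketch only assembles the string without contacting archive.org.

```python
from urllib.parse import urlencode


def uploads_by_downloads_url(email: str) -> str:
    """Return a search.php URL listing one uploader's items, most downloads first."""
    # Parameter names and the "-downloads" sort key come from the URL in the log.
    params = {"query": f"uploader:{email}", "sort": "-downloads"}
    # urlencode percent-encodes ":" and "@", which is equivalent to the raw form.
    return "https://archive.org/search.php?" + urlencode(params)


if __name__ == "__main__":
    print(uploads_by_downloads_url("youremail@blah.com"))
```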
[00:20] <dashcloud> thank you!
[00:20] *** primus104 has quit IRC (Leaving.)
[00:21] *** Ymgve has quit IRC (Ping timeout: 506 seconds)
[00:27] *** mistym has quit IRC (Remote host closed the connection)
[00:34] *** SN4T14 has joined #archiveteam
[00:35] *** espes__ has quit IRC (Read error: Operation timed out)
[00:35] *** espes__ has joined #archiveteam
[00:35] *** thechip__ has joined #archiveteam
[00:35] *** khaoohs has joined #archiveteam
[00:35] *** svchfoo1 has quit IRC (Read error: Operation timed out)
[00:35] *** thechip_ has quit IRC (Read error: Operation timed out)
[00:36] *** sivoais has quit IRC (Read error: Operation timed out)
[00:36] *** khaoohs_ has quit IRC (Read error: Operation timed out)
[00:36] *** okeuday has quit IRC (Read error: Operation timed out)
[00:36] *** nertzy2 has joined #archiveteam
[00:36] *** wp494_ has joined #archiveteam
[00:37] *** T31M has joined #archiveteam
[00:37] *** okeuday has joined #archiveteam
[00:38] *** svchfoo1 has joined #archiveteam
[00:39] *** nertzy has quit IRC (Read error: Operation timed out)
[00:39] *** T31m_ has quit IRC (Read error: Operation timed out)
[00:39] *** SN4T14__ has quit IRC (Ping timeout: 369 seconds)
[00:39] *** svchfoo2 sets mode: +o svchfoo1
[00:39] *** wp494 has quit IRC (Read error: Operation timed out)
[00:45] *** lytv has quit IRC (Ping timeout: 306 seconds)
[00:48] *** lytv has joined #archiveteam
[00:48] *** mistym has joined #archiveteam
[00:54] *** patricko- is now known as patrickod
[00:57] *** sivoais has joined #archiveteam
[00:59] *** Ravenloft has joined #archiveteam
[01:04] *** wp494_ is now known as wp494
[01:22] *** signius has quit IRC (Read error: Operation timed out)
[01:23] *** patrickod is now known as patricko-
[01:23] *** mutoso has joined #archiveteam
[01:26] *** patricko- is now known as patrickod
[01:32] *** patrickod is now known as patricko-
[01:33] <dashcloud> I just uploaded a new CD image+scan of the CD, and there wasn't any scaling applied to the image- it shows the picture at what must be native size (huge)
[01:35] *** signius has joined #archiveteam
[01:47] <kyan> It did that to me with one item when it picked a 100+MB PNG as the featured image... I just crossed my fingers that no one looked at that page too much
[01:48] <xmc> hah smooth
[01:50] <dashcloud> so I guess it's not supposed to happen then?
[01:50] <kyan> well, I guess it "scaled" it inasmuch as it had the browser resize it
[01:51] <kyan> for the one I'm thinking of
[01:52] <dashcloud> here's one I just did: https://archive.org/details/Veloc128
[01:59] <dashcloud> does it show the picture ridiculously large for you?
[01:59] *** GLaDOS has quit IRC (Read error: Operation timed out)
[02:00] *** GLaDOS has joined #archiveteam
[02:01] <xmc> the picture takes up the whole screen yes
[02:01] <xmc> but that's ok
[02:03] <DFJustin> it is loading a 3mb png though
[02:03] <DFJustin> not very mobile-friendly
[02:10] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[02:11] *** GLaDOS has joined #archiveteam
[02:26] *** GLaDOS has quit IRC (Read error: Operation timed out)
[02:26] *** GLaDOS has joined #archiveteam
[02:28] <SketchCow> So, in terms of that
[02:28] <SketchCow> It's best to just have the system have originals
[02:28] <SketchCow> And if it doesn't do the right thing, bring it up to info@archive.org.
[02:32] <dashcloud> so, is that considered a bug? if it is, I'll send it in
[02:33] <SketchCow> I think if you go "holy fuck, I can't work here" because it murders your machine, it's worth mentioning
[02:33] <SketchCow> tracey might not agree or will agree.
[02:36] <dashcloud> personally I liked the v1 style better because it lets you see the content without needing to scroll, but it doesn't cause my machine any more trouble than usual
[02:37] <SketchCow> The thing about v2
[02:37] <SketchCow> Is that v2 is actually this case of shooting forward literally 16 years
[02:37] <SketchCow> So first, they had to redo the back-end, and make it work in the system
[02:38] <SketchCow> Now that it's there, improvements and changes are not traumatic at all.
[02:38] <SketchCow> And as seen in the changelog, things are ramping up hella quick
[02:38] <SketchCow> By the way, the web screenshotter's been chugging along.
[02:39] *** GLaDOS has quit IRC (Read error: Operation timed out)
[02:42] <SketchCow> And in some cases, REALLY chugging.
[02:42] *** GLaDOS has joined #archiveteam
[02:49] <SketchCow> http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/ continues (for the moment) to link to the workspace.
[02:50] <kyan> Here's the one with the massive MASSIVE PNG file that got chosen as the default: https://archive.org/details/CHORALECONCERT111613.ripped17jan2014
[02:52] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[02:55] *** GLaDOS has joined #archiveteam
[03:01] *** dan_ has quit IRC (Ping timeout: 260 seconds)
[03:01] <xmc> v1 didn't downscale images either, it just put them in a tiny little <table>
[03:06] *** dan_ has joined #archiveteam
[03:12] <DFJustin> well it did a super-tiny jpeg thumb if the derive ran successfully
[03:12] <xmc> oh
[03:13] <DFJustin> and if the image wasn't a gif because all gifs are animations right
[03:13] <SketchCow> I STILL think of xmc and DFJustin as the same guy
[03:13] <SketchCow> And I STILL flip out when I see you two talk
[03:13] <xmc> how the hell
[03:13] <SketchCow> it's like, "you ok, buddy?"
[03:13] <SketchCow> I know
[03:13] <SketchCow> Also, it's 4am here
[03:13] <xmc> now i'm fucking insulted
[03:13] <xmc> i'm about to leave the channel forever unless you apologize to me publicly
[03:13] <DFJustin> wow mean
[03:14] <SketchCow> .___/\          _________                           
[03:14] <SketchCow> |   )/_____    /   _____/ __________________ ___.__.
[03:14] <SketchCow> |   |/     \   \_____  \ /  _ \_  __ \_  __ <   |  |
[03:14] <SketchCow> |   |  Y Y  \  /        (  <_> )  | \/|  | \/\___  |
[03:14] <SketchCow> |___|__|_|  / /_______  /\____/|__|   |__|   / ____|
[03:14] <SketchCow>           \/          \/                     \/  
[03:14] * xmc mollified
[03:15] *** SimpBrain has quit IRC (Read error: Connection reset by peer)
[03:18] <SketchCow> Ah, I have to wake up in 3 hours
[03:18] <SketchCow> Good thing I got a jump on that.
[03:18] *** Ara_ has joined #archiveteam
[03:24] *** brayden has quit IRC (Read error: Operation timed out)
[03:34] *** mistym has quit IRC (Remote host closed the connection)
[03:58] *** mistym has joined #archiveteam
[04:00] *** Ara_ has quit IRC (Read error: Connection reset by peer)
[04:08] *** aaaaaaaaa has quit IRC (Leaving)
[04:42] *** balrog has quit IRC (Ping timeout: 260 seconds)
[04:53] *** rejon has joined #archiveteam
[04:58] *** balrog has joined #archiveteam
[04:58] *** swebb sets mode: +o balrog
[05:03] *** qwebirc10 has joined #archiveteam
[05:04] *** qwebirc10 has quit IRC (Client Quit)
[05:19] *** brayden has joined #archiveteam
[06:10] *** GLaDOS has quit IRC (Read error: Operation timed out)
[06:11] *** GLaDOS has joined #archiveteam
[06:28] *** wp494 has quit IRC (Ping timeout: 740 seconds)
[06:53] *** mistym has quit IRC (Remote host closed the connection)
[06:53] *** wp494 has joined #archiveteam
[06:55] *** techapj_ has joined #archiveteam
[07:14] <Nemo_bis> http://www.museumofplay.org/online-collections/search/index.php?object_id=&title=&name=&subject=&artist=&manufacturer=&material=&credit_line=Jason+Scott
[07:17] *** thisismyn has quit IRC (Ping timeout: 260 seconds)
[07:22] *** primus104 has joined #archiveteam
[07:37] *** Jonimus has quit IRC (Ping timeout: 370 seconds)
[07:37] *** dashcloud has quit IRC (Ping timeout: 260 seconds)
[07:37] *** dashcloud has joined #archiveteam
[07:48] *** bmcginty_ has joined #archiveteam
[07:48] *** bmcginty has quit IRC (Read error: Connection reset by peer)
[07:58] *** Stilett0 has quit IRC (Read error: Operation timed out)
[08:15] *** dashcloud has quit IRC (Read error: Operation timed out)
[08:18] *** dashcloud has joined #archiveteam
[08:23] *** T31M has quit IRC (Quit: Leaving)
[08:26] *** schbirid has joined #archiveteam
[08:26] *** SimpBrain has joined #archiveteam
[08:27] *** techapj_ has quit IRC (Ping timeout: 240 seconds)
[08:33] *** techapj_ has joined #archiveteam
[08:36] <arkiver> Kenshin: would you like to help us on the rapidshare grab? :)
[08:51] <SketchCow> The speech went well
[08:51] <godane> hey SketchCow
[08:51] <SketchCow> Hey, godane.
[08:51] <godane> i'm uploading more led zeppelin bootlegs to your ftp
[08:52] <godane> april 8, 1970 shows right now
[08:52] *** techapj_ has quit IRC (Quit: Page closed)
[08:53] <xmc> wooooo
[08:54] <xmc> i suspect that led zeppelin bootlegs will be more broadly interesting than the complete archives of glenn beck
[08:54] <xmc> but it's important to make sure that the full breadth of glenn beck's opinions are preserved regardless :)
[08:55] <godane> i have his radio shows going back to august 2005
[08:56] <godane> there are more links before that but they're all dead
[08:57] <xmc> well, ok :)
[08:57] <godane> sort of surprised that i got that much since it wasn't on his main website
[08:57] <godane> that only goes back to 2008
[08:57] <godane> also some of the rtmp streams still worked last time i checked
[08:57] *** techapj_ has joined #archiveteam
[08:58] <godane> but none for the radio shows
[08:58] <godane> i'm also grabbing more south korea news
[08:58] <godane> SBS network now
[08:59] <godane> tons of dial-up videos and some stock video
[09:01] <godane> ok now this is interesting
[09:01] <godane> i'm getting a full 78 minute (news) show in dial-up
[09:03] <godane> i think its some sort of debate show
[09:03] <godane> or game show
[09:04] <godane> i think these videos are from 1998ish now
[09:04] <godane> or 2001
[09:05] <godane> it was flash dates all from 1997
[09:05] <godane> then 2001.1.23
[09:20] *** philpem has joined #archiveteam
[09:40] *** JMC_ has joined #archiveteam
[09:42] *** wp494 has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** Start has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** JMC has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** SadDM has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** xtr-201 has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** garyrh has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** pikhq has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** useretail has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** matthusby has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** will has quit IRC (ircd.shaw.ca irc.shaw.ca)
[09:42] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[09:44] *** dashcloud has joined #archiveteam
[10:00] *** habi has joined #archiveteam
[10:02] *** habi has left 
[10:02] *** wp494 has joined #archiveteam
[10:02] *** Start has joined #archiveteam
[10:02] *** SadDM has joined #archiveteam
[10:02] *** garyrh has joined #archiveteam
[10:02] *** pikhq has joined #archiveteam
[10:02] *** useretail has joined #archiveteam
[10:02] *** matthusby has joined #archiveteam
[10:02] *** will has joined #archiveteam
[10:02] *** irc.shaw.ca sets mode: +o SadDM
[10:02] *** swebb sets mode: +o SadDM
[10:05] *** Ymgve has joined #archiveteam
[10:29] *** cloudmons has quit IRC (Read error: Connection reset by peer)
[10:29] *** cloudmons has joined #archiveteam
[11:17] *** w0rp has quit IRC (Ping timeout: 265 seconds)
[11:17] *** Ravenloft has quit IRC (Ping timeout: 265 seconds)
[11:18] *** lytv has quit IRC (Ping timeout: 265 seconds)
[11:18] *** dx- has quit IRC (Ping timeout: 265 seconds)
[11:18] *** ryan has joined #archiveteam
[11:18] *** dx has joined #archiveteam
[11:19] *** w0rp has joined #archiveteam
[11:20] *** jk[[SVP]] has joined #archiveteam
[11:21] *** lytv has joined #archiveteam
[11:21] *** SketchCo1 has joined #archiveteam
[11:21] *** swebb sets mode: +o SketchCo1
[11:21] *** Kazzy_ has joined #archiveteam
[11:22] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[11:24] *** Sk2d has joined #archiveteam
[11:26] *** Sk1d has quit IRC (hub.se efnet.portlane.se)
[11:26] *** jk[SVP] has quit IRC (hub.se efnet.portlane.se)
[11:26] *** Gfy has quit IRC (hub.se efnet.portlane.se)
[11:26] *** underscor has quit IRC (hub.se efnet.portlane.se)
[11:26] *** filippo has quit IRC (hub.se efnet.portlane.se)
[11:26] *** ryan__ has quit IRC (hub.se efnet.portlane.se)
[11:26] *** Deewiant has quit IRC (hub.se efnet.portlane.se)
[11:26] *** Kazzy has quit IRC (hub.se efnet.portlane.se)
[11:26] *** SketchCow has quit IRC (hub.se efnet.portlane.se)
[11:27] *** techapj_ has quit IRC (Ping timeout: 240 seconds)
[11:28] *** Gfy_ has joined #archiveteam
[11:42] *** Gfy_ is now known as Gfy
[11:42] *** jk[[SVP]] is now known as jk[SVP]
[11:42] *** brayden has quit IRC (Quit: Leaving)
[11:42] *** underscor has joined #archiveteam
[11:42] *** swebb sets mode: +o underscor
[11:42] *** Kazzy_ is now known as Kazzy
[11:42] *** Sk2d is now known as Sk1d
[11:56] *** yan has joined #archiveteam
[11:57] *** filippo has joined #archiveteam
[12:02] *** T31M has joined #archiveteam
[12:08] *** primus104 has quit IRC (Leaving.)
[12:24] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:31] *** dashcloud has joined #archiveteam
[12:36] *** habi has joined #archiveteam
[12:42] *** sankin has joined #archiveteam
[12:49] *** Jonimus has joined #archiveteam
[12:52] *** brayden has joined #archiveteam
[13:09] *** habi has left 
[13:40] *** ionpulse has quit IRC (Read error: Connection reset by peer)
[13:46] *** Start has quit IRC (Disconnected.)
[13:47] *** ionpulse has joined #archiveteam
[14:31] *** T31M has quit IRC (Ping timeout: 1221 seconds)
[14:34] *** mistym has joined #archiveteam
[14:34] *** Start has joined #archiveteam
[14:35] *** Start_ has joined #archiveteam
[14:35] *** Start has quit IRC (Read error: Connection reset by peer)
[14:39] *** mistym has quit IRC (Remote host closed the connection)
[14:43] *** primus104 has joined #archiveteam
[14:46] <midas> http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=RssLanding&cat=news&id=2028891
[14:46] *** Start_ has quit IRC (Read error: Connection reset by peer)
[14:46] <midas> amazon has unlimited storage for 60 bucks a year
[14:46] *** Start has joined #archiveteam
[14:47] *** Stiletto has joined #archiveteam
[14:56] *** mistym has joined #archiveteam
[14:58] <Smiley> Oooo ooo ooo
[14:58] <Smiley> archiving emails
[14:58] <Smiley> promotional emails, do we do it? SHOULD WE? HELLLLLL YEAAAAH
[15:01] <Smiley> but how?
[15:01] <Smiley> Oh, how about an account that receives email and renders them out and uploads?
[15:01] <Sanqui> sign up for ALL the newsletters
[15:01] <Sanqui> sounds like a fairly simple project
[15:01] <Sanqui> Project Newsletter
[15:02] <Sanqui> just advertise the email address and tell people to sign it up for as many newsletters as possible
[15:04] <Sanqui> would be even better if an entire domain was dedicated to this
[15:05] <Sanqui> so you could sign up with something like nytimes@projectneswletter.org
[15:05] <Sanqui> this is sounding awesome already
[15:05] <Sanqui> Smiley, what do you think?
[15:05] <Smiley> sounds good
[15:05] <Smiley> get on it ;)
[15:06] <Smiley> Sanqui: it's possible it's already done, hence the discussion :)
[15:06] <Sanqui> yeah..  now if only I had the money for the domain and a server to dedicate to this :P
[15:10] <johtso> Not just newsletters, archive the spam!
[15:10] *** primus104 has quit IRC (Leaving.)
[15:10] *** habi has joined #archiveteam
[15:10] *** habi has left 
[15:10] <johtso> would be interesting to look back and see how 419 scam emails have changed over time
[15:11] <johtso> http://untroubled.org/spam/
[15:14] <Sanqui> with some newsletters, the line gets blurry
[15:14] <Sanqui> also, you sign up for newsletters and get signed up for unrelated spam
[15:15] <johtso> Many newsletters aren't open and are only sent to members of a particular service
[15:15] <johtso> some would be easier to sign up for than others
[15:16] <achip> might need a way to "confirm" subscription too, view the inbox for the newsletter and click the confirm link, archiving mailing lists would be good too
[15:17] <Sanqui> right, you'd need to confirm (or auto-confirm based on some heuristics)
[15:17] *** schbird has joined #archiveteam
[15:17] <schbird> hey, i just noticed that germany's main tabloid blocks everyone but search engine robots
[15:17] <schbird> bild.de
[15:17] <Sanqui> i think there are already some mailing list archiving services
[15:18] <schbird> if there is any way that we could get it crawled and saved regularly, it would be great. even if not public or in non-waybackmachine warcs
[15:18] *** Start has quit IRC (Disconnected.)
[15:19] <schbird> also http://web.archive.org/save/http://www.mopo.de/panorama/flug-4u9525-zerschellte-am-berg-extra-schub-der-triebwerke-sorgte-fuer-grosse-explosion,5066860,30220524.html fails
[15:19] <Sanqui> didn't fail for me
[15:20] <schbird> nice
[15:20] <schbird> told me the page was not available live
[15:24] <johtso> What's the best way to go about archiving file locker downloads "manually"?
[15:25] <johtso> If I've scraped some collection of links, in what format should I upload them?
[15:25] <johtso> I'm thinking associating the uploads in some way with the original URL would be a good idea
[15:25] *** Start has joined #archiveteam
[15:26] <johtso> uploading them in their archives seems a bit pointless though
[15:26] *** Start has quit IRC (Read error: Connection reset by peer)
[15:26] <johtso> and if it happens to be all links from a certain blog for example, it would probably make sense to upload them all as a group.
[15:28] *** Start has joined #archiveteam
[15:35] *** scyther has joined #archiveteam
[15:42] *** dashcloud has quit IRC (Read error: Operation timed out)
[15:47] *** dashcloud has joined #archiveteam
[15:51] *** garyrh has quit IRC (Remote host closed the connection)
[15:51] *** Start has quit IRC (Disconnected.)
[16:01] *** VADemon has joined #archiveteam
[16:08] *** schbird has quit IRC (Quit: Leaving)
[16:16] *** habi has joined #archiveteam
[16:20] *** habi has left 
[16:28] *** Start has joined #archiveteam
[16:29] *** Start_ has joined #archiveteam
[16:29] *** Start has quit IRC (Read error: Connection reset by peer)
[16:29] *** Start_ is now known as Start
[16:32] <Start> arkiver: now that rapidshare's up and running, let's get started on music unlimited
[16:32] <Start> 3 days left
[16:32] *** primus104 has joined #archiveteam
[16:36] <SimpBrain> o.O
[16:45] *** Start has quit IRC (Disconnected.)
[16:55] *** patricko- is now known as patrickod
[16:56] *** aaaaaaaaa has joined #archiveteam
[17:01] <kyan> johtso: Definitely include the original archive files — it's much better to provide unmodified copies of data. 
[17:02] <johtso> kyan, but then the data can't be explored on IA can it?
[17:02] <kyan> ZIP files and ISOs can be browsed like folders on IA
[17:02] <kyan> and tars I think
[17:02] <johtso> oh nice
[17:03] <johtso> kyan, and relating it to the relevant URL?
[17:03] <kyan> Do bear in mind that for some reason IA's upload form rejects RAR files, so if you have those it would make sense to put them in another file eg "rar-files-packed-2015March26"
[17:03] <johtso> oh, that's annoying..
[17:04] <kyan> I'm not sure about that, I don't know that IA has facilities for referencing things other than WARC files by URL
[17:05] <kyan> Probably the best way to do that would be to make an HTML page that lists the original links, with hyperlinks to the new archived locations. I'm pretty sure it's possible to link directly to a file within a ZIP/ISO/tar on IA
[17:05] <kyan> Alternatively, for uploading RARs you could make a torrent file of them, and then upload the torrent file. IA will automatically download the contents of torrent files, and it seems to accept RARs that way.
[17:07] <kyan> Here's an example of the ZIP file viewer: https://archive.org/download/AElTantamusFuturamerlinID5194/2719.zip/
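A rough sketch of the index page kyan suggests: one HTML list that pairs each original file-locker URL with the place its copy ended up. The function name and the example URLs are hypothetical; only the idea of listing original links next to archived locations comes from the discussion.

```python
from html import escape
from typing import Mapping


def build_link_index(mapping: Mapping[str, str]) -> str:
    """Render an HTML list mapping original URLs to their archived locations."""
    rows = []
    for original, archived in mapping.items():
        # Escape both values so odd characters in URLs cannot break the markup.
        rows.append(
            f'<li>{escape(original)} -> '
            f'<a href="{escape(archived, quote=True)}">archived copy</a></li>'
        )
    return (
        "<!DOCTYPE html>\n<html><body><ul>\n"
        + "\n".join(rows)
        + "\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    # Both URLs are placeholders, not real items.
    example = {
        "http://example.com/locker/album.zip":
            "https://archive.org/download/EXAMPLE-ITEM/album.zip",
    }
    print(build_link_index(example))
```

Keeping the original URL as visible text means a full-text search for the dead link can still land on the page.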
[17:08] *** patrickod is now known as patricko-
[17:08] <DFJustin> upload the original archive and the extracted files both
[17:08] <DFJustin> also you can upload rar via command line or s3
[17:09] <arkiver> Start: yes, having a look at it atm
[17:12] <DFJustin> here's an example I did from a video game music blog https://archive.org/details/Saladedemais_-_Follin_Project_-_Amiga
[17:20] <johtso> DFJustin, ah interesting
[17:20] <johtso> struggling to work out how that archive is structured.. IA uploads do have a file structure don't they?
[17:21] *** mistym has quit IRC (Remote host closed the connection)
[17:28] <johtso> kyan, thanks for the torrent file tip, that sounds like a very convenient solution!
[17:28] <johtso> kyan, does it respect folder structures?
[17:28] <DFJustin> it may be clearer to look at the raw directory https://archive.org/download/Saladedemais_-_Follin_Project_-_Amiga/
[17:28] <johtso> ah nice, yes that is clearer
[17:29] <DFJustin> you can have folder structures but it's not really visible in the ui and it's probably better to split things up usually
[17:29] <johtso> DFJustin, hmm, it's a bit tricky, because I'm dealing with bulk here
[17:30] <johtso> scraping all the archives linked to from a website.. each one of those archives pertaining to some album of music
[17:30] <DFJustin> ah so folders would be like disc 1, disc 2?
[17:31] <DFJustin> generally one album = one IA item
[17:31] <johtso> DFJustin, take this site for example
[17:32] <johtso> so many archives, each containing a directory of mp3 files
[17:32] *** scyther has quit IRC (Read error: Connection reset by peer)
[17:33] <johtso> obviously the pertinent thing here is getting the data preserved, and doing an upload per download would most likely make it too time consuming
[17:34] <DFJustin> there are ways of doing batch uploads
[17:34] <johtso> DFJustin, oops, forgot the link: http://braingoreng.blogspot.co.uk/
[17:36] <johtso> DFJustin, but what about getting the metadata right for the upload? Is there really any point in having individual uploads if I don't have the time to give each one a correct name etc.?
[17:36] *** mistym has joined #archiveteam
[17:37] <DFJustin> if you don't have time, uploading one item with a big zip or tar of all the archives is better than nothing
[17:38] <johtso> DFJustin, I think that's probably the way to go, better off uploading more stuff before the files disappear than spending too long on each item
[17:39] <johtso> I could possibly just add a list of all the linked that were scraped somewhere in the description/metadata
[17:39] <johtso> just so that full text searches would find it
[17:39] <johtso> *links
[17:43] *** SketchCo1 is now known as SketchCOw
[17:43] *** SketchCOw is now known as SketchCow
[17:44] <SketchCow> Hi,  hi.
[17:44] <DFJustin> god dag
[17:47] <SketchCow> Going to have a little social hour with Gothenburg locals.
[17:49] <johtso> Is it possible to incrementally update an IA entry using bittorrent?
[17:50] <SketchCow> It is possible to upload data into an IA item and then regenerate the torrent.
[17:50] *** signius has quit IRC (Read error: Operation timed out)
[17:56] *** Start has joined #archiveteam
[17:57] <johtso> SketchCow, I was thinking more along the lines of seeding a torrent that IA is downloading, and then incrementally adding files, but that obviously goes against how torrents work
[17:58] <SketchCow> Right
[18:06] *** signius has joined #archiveteam
[18:06] *** tom_ has joined #archiveteam
[18:17] <SketchCow> Oh, other screenshotter follies:
[18:17] <SketchCow> There was a motherdubbing 315 GB WARC in an item
[18:18] <yipdw> heh I remember those days when we didn't split
[18:18] <yipdw> those were awesome
[18:29] *** ryan has quit IRC (Ping timeout: 260 seconds)
[18:30] *** ryan_ has joined #archiveteam
[18:37] *** Start has quit IRC (Disconnected.)
[18:39] <arkiver> Fusl: uploading in 50G packs through torrents https://archive.org/details/Wallbase.ccArchive01
[18:59] *** patricko- is now known as patrickod
[19:01] <Atluxity> johtso: your thought made me think about a self-written torrent tracker that a system like IA could poll at regular intervals, with predetermined torrent names based on the current unixtime; then one could make a torrent with a filename one knew would get polled at a certain time, and the system like IA could use it for download
[19:03] <johtso> Atluxity, really you just want to be able to rsync to somewhere..
[19:03] *** Start has joined #archiveteam
[19:15] *** Stiletto has quit IRC (Ping timeout: 306 seconds)
[19:19] *** thechip has quit IRC (Leaving...)
[19:25] *** Start has quit IRC (Disconnected.)
[19:29] *** patrickod is now known as patricko-
[19:32] *** Start has joined #archiveteam
[19:33] *** Start_ has joined #archiveteam
[19:33] *** Start has quit IRC (Read error: Connection reset by peer)
[19:36] *** Start_ is now known as Start
[19:36] <Start> Sanqui: i have an idea for project newsletter
[19:39] <Start> rather than having one email address per newsletter, have a general purpose account (perhaps projectnewsletter@archiveteam.org) and have archivebot-like go packs separated by the sender's email
[19:41] <Sanqui> it makes sense to have one email address per newsletter
[19:41] <Sanqui> because that way you can more easily group together the "source"
[19:42] <Sanqui> it's more information
[19:42] <Sanqui> if they start sending emails from a different address, they'll still be sent to the same YOUR address
[19:42] <Sanqui> or, if they sell the email to spamming companies, you'll know who did it ;)
[19:45] *** Start_ has joined #archiveteam
[19:45] *** Start has quit IRC (Read error: Connection reset by peer)
[19:45] *** Start_ is now known as Start
[19:45] <Start> perhaps we could have a page where people could create an email address for a newsletter
[19:46] *** patricko- is now known as patrickod
[19:46] <schbirid> catchall and ask people to signup with site/service specific addresses
[19:46] <schbirid> not sure how you would handle the confirmation links though
[19:47] *** SN4T14_ has joined #archiveteam
[19:48] <Start> maybe if words such as confirmation, confirm, verify, or verification are detected in the first email on an account, someone on irc gets notified
[19:48] <Start> we could have a password protected area where the confirmation links could be clicked
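A minimal sketch of the keyword check Start describes, using only the Python standard library. The four keywords are the ones named in the channel; the function names are hypothetical, and the IRC notification itself is left out.

```python
import re
from email import message_from_string

# Whole-word match on the keywords suggested in the channel.
CONFIRM_WORDS = re.compile(
    r"\b(confirmation|confirm|verify|verification)\b", re.IGNORECASE
)


def message_text(raw_message: str) -> str:
    """Collect the subject and every text part of a raw RFC 822 message."""
    msg = message_from_string(raw_message)
    chunks = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if not payload:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset label: fall back to UTF-8
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def looks_like_confirmation(raw_message: str) -> bool:
    """True if the first email on an address probably needs a human to confirm."""
    return CONFIRM_WORDS.search(message_text(raw_message)) is not None
```

A match would then trigger whatever alert the project settles on; a miss means the mail can go straight to the archive queue.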
[19:50] <achip> could just be a simple interface that just displays the last 5 emails into the inbox for confirming
[19:53] *** patrickod is now known as patricko-
[19:55] *** tom_ has quit IRC (Ping timeout: 240 seconds)
[19:55] <aaaaaaaaa> just issue a get request for every url that doesn't have "unsubscribe" or the like in the anchor.
[19:56] *** SN4T14 has quit IRC (Ping timeout: 512 seconds)
[20:01] <joepie91_> [20:41] <Sanqui> it makes sense to have one email address per newsletter
[20:01] <joepie91_> there is a very nice Python libraery
[20:01] <joepie91_> library *
[20:01] <joepie91_> that makes it easy to accept email on lots of different addresses
[20:01] <joepie91_> dynamically
[20:01] <joepie91_> hold
[20:02] <achip> could just do it with virtuals in postfix as well
[20:02] <joepie91_> where's the damn thing..
[20:02] <joepie91_> sure, if you like pain :P
[20:04] <joepie91_> not what I was thinking of, but https://github.com/bcoe/smtproutes seems similar
[20:04] <joepie91_> AH
[20:05] <joepie91_> found it!
[20:05] <joepie91_> https://github.com/kennethreitz/inbox.py
[20:05] <joepie91_> cc Start Sanqui schbirid
[20:05] <joepie91_> confirmation links are not generally hard to do - first email you receive, follow whatever is the longest link that isn't an unsubscribe link
[20:05] <joepie91_> chances are it's the confirm link
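The heuristic joepie91_ describes (follow the longest link that is not an unsubscribe link) could be sketched as below, combined with aaaaaaaaa's idea of also checking the anchor text. The class and function names are hypothetical, and this only picks a candidate URL; it does not fetch it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML email body."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def guess_confirm_link(html_body):
    """Return the longest http(s) link that does not look like an unsubscribe link."""
    parser = LinkCollector()
    parser.feed(html_body)
    parser.close()
    candidates = [
        href
        for href, text in parser.links
        if href.startswith(("http://", "https://"))
        and "unsubscribe" not in href.lower()
        and "unsubscribe" not in text.lower()
    ]
    # None when the message has no usable link at all.
    return max(candidates, key=len, default=None)
```

This is a guess, not a guarantee: tracking links or long footer URLs can outrank the real confirm link, so a manual fallback like the inbox view achip mentions would still be useful.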
[20:06] *** caber has quit IRC (Read error: Operation timed out)
[20:09] *** caber has joined #archiveteam
[20:22] *** Start has quit IRC (Disconnected.)
[20:29] *** mistym has quit IRC (Remote host closed the connection)
[20:32] *** habi has joined #archiveteam
[20:33] *** habi has left 
[20:51] *** mistym has joined #archiveteam
[20:53] <achip> got a POC working with that python lib, pretty slick. should probably move to a new chan ya?
[20:59] *** schbirid has quit IRC (Leaving)
[20:59] *** sankin has quit IRC (Leaving.)
[21:01] *** primus has joined #archiveteam
[21:03] *** schbirid has joined #archiveteam
[21:04] *** schbirid has quit IRC (Client Quit)
[21:10] *** schbirid has joined #archiveteam
[21:12] <Sanqui> achip: i like project newsletter, but it's not very catchy/funny
[21:12] <Sanqui> gj, by the way
[21:15] <achip> spewsletter newslater spamcan I'm not good at these haha
[21:22] <johtso> email archive
[21:23] *** mistym has quit IRC (Remote host closed the connection)
[21:27] <johtso> DFJustin, will I not fall foul of "If you upload .zip, .rar, non-audio formats (like .exe), or password-protected files, they may be removed by our moderators."
[21:27] *** yan has quit IRC (Quit: bye)
[21:34] *** BlueMaxim has joined #archiveteam
[21:36] <xmc> to where?
[21:37] <johtso> xmc, internet archive
[21:38] <johtso> xmc, was talking earlier about rescuing filelocker downloads from music blogs, and what the best format was for uploading to IA
[21:38] *** mistym has joined #archiveteam
[21:39] <xmc> they'll be fiiiiine
[21:39] <xmc> upload whatever you downloaded, don't modify it
[21:40] <johtso> awesome.
[21:40] <johtso> jdownloader2 really is an impressive bit of software.. wish there was a nice way it could be incorporated into some kind of automated pipeline
[21:41] <xmc> hmmmm
[21:44] <SketchCow> http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/bewareofmpreg.tumblr.com-inf-20140902-230752-dp7fc-00000.warc.gz.png
[21:47] <schbirid> johtso: there are some similar tools for the commandline, with far fewer features though, and of course captchas are a problem
[21:50] <Sanqui> SketchCow: some amazing things could be done with the snapshot project
[21:50] <Sanqui> I'm thinking a video with several website pictures per second
[21:51] <Sanqui> but there's a lot that can be done
[21:51] <Sanqui> another idea: order sites by hue and make a rainbow
[21:52] <midas> well that image wasnt strange at all
[21:55] <joepie91_> johtso: see plowshare
[21:56] *** schbirid has quit IRC (Leaving)
[21:56] <joepie91_> [22:27] <johtso> DFJustin, will I not fall foul of "If you upload .zip, .rar, non-audio formats (like .exe), or password-protected files, they may be removed by our moderators."
[21:56] <joepie91_> where'd you see that?
[21:57] <xmc> probably some very old faq item
[21:57] <SketchCow> You got a friend in the business
[22:00] *** patricko- is now known as patrickod
[22:02] *** patrickod is now known as patricko-
[22:15] <johtso> joepie91_, https://archive.org/about/faqs.php#236
[22:20] <johtso> joepie91_, nice, hadn't come across plowshare
[22:21] <johtso> there's also pyload, which isn't command line, but is open source python
[22:21] <johtso> the issue with these kinds of tools is that they break ALL THE TIME
[22:21] <johtso> so unless it's a very actively maintained project it can be very frustrating
[22:23] <balrog> SketchCow: the password-protected files thing might still be somewhat valid, right?
[22:25] <kyan> I've uploaded quite a few password protected things that are in copyright, etc. with plans to leave the password in my will...
[22:25] <kyan> I don't think any of them have been taken down.
[22:25] <xmc> why not upload them in plaintext and request darking instead?
[22:26] <johtso> xmc, I was wondering if something like that was possible
[22:26] <kyan> meh, it's easier since I can automate the process: I find something to archive, I drop it into a folder, run a terminal command and voilà, away it goes
[22:26] <johtso> Oh, I suppose the downside to that is that you can't download them again yourself
[22:26] <kyan> That too
[22:27] <johtso> would be good if you could "dark" something and still have access to it
[22:28] <xmc> there's a collection flag that makes the items in it not downloadable by anyone other than the uploader
[22:28] <xmc> but they are still visible
[22:29] <xmc> e.g. https://archive.org/details/bellsystempractices
[22:32] <joepie91_> johtso: I still intend to build a well-maintained Node.js module for this kind of thing
[22:32] <VADemon> does anybody know of a url grabber for google? ~75k search results
[22:32] <joepie91_> but I have a bunch of other things to get done first
[22:32] <joepie91_> :)
[22:32] <joepie91_> VADemon: forget it, heh
[22:33] <xmc> VADemon: google won't give you anything actually past the 1000th result
[22:33] <xmc> and they make it damn hard to get there anyway
[22:33] <joepie91_> VADemon: it's extremely unlikely that you'll manage to automate your way to even 10 pages without getting canned
[22:33] <joepie91_> they're VERY aggressive towards bots
[22:33] <johtso> meanies
[22:34] <xmc> you could probably browse manually through a warc-writing proxy and scrape the results from the cache though
[22:34] <joepie91_> they do crap like comparing user agents to known unusual behaviour for those specific useragents, to detect fakes
[22:34] <joepie91_> it's ridiculous
[22:34] <joepie91_> even if you phantomjs your way in, they'll likely still know
[22:34] <VADemon> I am surprised but it's not surprising %)
[22:35] <VADemon> I will GUI FIREFOX MY WAY IN THEN
[22:35] <joepie91_> I've had very limited success but certainly nowhere near 75k results
[22:35] *** Stiletto has joined #archiveteam
[22:39] <johtso> probably better off trying to break into their datacenters
[22:39] *** Start has joined #archiveteam
[22:39] <johtso> or getting a job at google :P
[22:40] <VADemon> it's easier to start WW3 than this ^ & ^^
[22:44] <Start> i'm going to create a page for project newsletter
[22:45] <Start> i'm kinda feeling meh on the name though
[22:45] <Start> anyone got a better idea for the name?
[22:49] *** Stiletto has quit IRC (Read error: Operation timed out)
[22:52] <VADemon> bad name incoming: rss -> resser. pure creativity
[22:58] <VADemon> got google captcha after 7 pages, 100 links each ;)
[22:59] <johtso> Start, is it just newsletters? or all kinds of automated email?
[22:59] <Start> i suppose it could be for both
[23:01] <Start> article has been created: http://archiveteam.org/index.php?title=Project_Newsletter
[23:03] *** SimpBrain has quit IRC (Quit: Leaving)
[23:03] <achip> there could be one or more semi-public catchall domains where it isn't a big deal if they catch spam, but I like the idea of creating the addresses on separate domains, one per newsletter or set of newsletters, to cut out spam
[23:13] <arkiver> Start: How do you think we can best do music unlimited?
[23:14] <arkiver> the site will totally not show up nicely in wayback machine
[23:14] <Start> as long as the content is archived
[23:15] <Start> if anything needs an account i could create one
[23:18] *** fool8 has joined #archiveteam
[23:18] <fool8> Can I upload a 20GB+ WARC to archive.org?
[23:19] <fool8> I forgot --warc-max-size
[23:21] <arkiver> fool8: yes, please upload it
[23:21] <arkiver> paste a link to the item here and we'll move it to a web collection
[23:21] *** philpem has quit IRC (Ping timeout: 260 seconds)
[23:22] <fool8> I'll come back later or tomorrow
[23:22] *** fool8 has quit IRC (foo bar baz qux)
[23:24] <arkiver> Start: I think we'll just grab all the pages and links on the pages that were discovered
[23:24] <arkiver> the website is totally filled with javascript
[23:24] <arkiver> chfoo: do you think we are able to do a full grab of https://music.sonyentertainmentnetwork.com/ ?
[23:28] <arkiver> For the newsletter archiving I think there should definitely be some sort of human intervention or check before new newsletters are added
[23:30] <Start> !a https://music.sonyentertainmentnetwork.com --phantomjs
[23:30] <Start> wrong channel
[23:31] <Start> arkiver: how did archivebot handle music unlimited
[23:33] <arkiver> running now
[23:34] <Start> i'll carefully watch it and see how effective it is
[23:40] <arkiver> #limitedmusic