#archiveteam 2015-03-26,Thu

Time Nickname Message
00:00 🔗 goekesmi has joined #archiveteam
00:01 🔗 arkiver We started downloading what we could find of RapidShare!!
00:01 🔗 arkiver RapidShare has some limits, so help us!
00:01 🔗 arkiver #rapidscare
00:04 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
00:05 🔗 GLaDOS has joined #archiveteam
00:08 🔗 dashcloud this is a pure vanity thing, but wayback v1 had an option to sort your uploads by downloads/views- I don't see that option in v2. Is that permanently gone, or planned to come back at some point?
00:18 🔗 garyrh https://archive.org/search.php?query=uploader:youremail@blah.com&sort=-downloads works
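
The search URL garyrh gives can be built for any uploader address. This is a minimal sketch that only assembles the query string seen above (`query=uploader:...` and `sort=-downloads`); the function name is made up for illustration, and it assumes archive.org accepts the percent-encoded form of the same parameters.

```python
from urllib.parse import urlencode


def uploads_by_downloads_url(uploader_email):
    """Build an archive.org search URL for one uploader, sorted by downloads (descending)."""
    # Same two parameters as the URL quoted in the log; urlencode percent-encodes ':' and '@'.
    params = {"query": f"uploader:{uploader_email}", "sort": "-downloads"}
    return "https://archive.org/search.php?" + urlencode(params)


if __name__ == "__main__":
    print(uploads_by_downloads_url("youremail@blah.com"))
```
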
00:20 🔗 dashcloud thank you!
00:20 🔗 primus104 has quit IRC (Leaving.)
00:21 🔗 Ymgve has quit IRC (Ping timeout: 506 seconds)
00:27 🔗 mistym has quit IRC (Remote host closed the connection)
00:34 🔗 SN4T14 has joined #archiveteam
00:35 🔗 espes__ has quit IRC (Read error: Operation timed out)
00:35 🔗 espes__ has joined #archiveteam
00:35 🔗 thechip__ has joined #archiveteam
00:35 🔗 khaoohs has joined #archiveteam
00:35 🔗 svchfoo1 has quit IRC (Read error: Operation timed out)
00:35 🔗 thechip_ has quit IRC (Read error: Operation timed out)
00:36 🔗 sivoais has quit IRC (Read error: Operation timed out)
00:36 🔗 khaoohs_ has quit IRC (Read error: Operation timed out)
00:36 🔗 okeuday has quit IRC (Read error: Operation timed out)
00:36 🔗 nertzy2 has joined #archiveteam
00:36 🔗 wp494_ has joined #archiveteam
00:37 🔗 T31M has joined #archiveteam
00:37 🔗 okeuday has joined #archiveteam
00:38 🔗 svchfoo1 has joined #archiveteam
00:39 🔗 nertzy has quit IRC (Read error: Operation timed out)
00:39 🔗 T31m_ has quit IRC (Read error: Operation timed out)
00:39 🔗 SN4T14__ has quit IRC (Ping timeout: 369 seconds)
00:39 🔗 svchfoo2 sets mode: +o svchfoo1
00:39 🔗 wp494 has quit IRC (Read error: Operation timed out)
00:45 🔗 lytv has quit IRC (Ping timeout: 306 seconds)
00:48 🔗 lytv has joined #archiveteam
00:48 🔗 mistym has joined #archiveteam
00:54 🔗 patricko- is now known as patrickod
00:57 🔗 sivoais has joined #archiveteam
00:59 🔗 Ravenloft has joined #archiveteam
01:04 🔗 wp494_ is now known as wp494
01:22 🔗 signius has quit IRC (Read error: Operation timed out)
01:23 🔗 patrickod is now known as patricko-
01:23 🔗 mutoso has joined #archiveteam
01:26 🔗 patricko- is now known as patrickod
01:32 🔗 patrickod is now known as patricko-
01:33 🔗 dashcloud I just uploaded a new CD image+scan of the CD, and there wasn't any scaling applied to the image- it shows the picture at what must be native size (huge)
01:35 🔗 signius has joined #archiveteam
01:47 🔗 kyan It did that to me with one item when it picked a 100+MB PNG as the featured image... I just crossed my fingers that no one looked at that page too much
01:48 🔗 xmc hah smooth
01:50 🔗 dashcloud so I guess it's not supposed to happen then?
01:50 🔗 kyan well, I guess it "scaled" it inasmuch as it had the browser resize it
01:51 🔗 kyan for the one I'm thinking of
01:52 🔗 dashcloud here's one I just did: https://archive.org/details/Veloc128
01:59 🔗 dashcloud does it show the picture ridiculously large for you?
01:59 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
02:00 🔗 GLaDOS has joined #archiveteam
02:01 🔗 xmc the picture takes up the whole screen yes
02:01 🔗 xmc but that's ok
02:03 🔗 DFJustin it is loading a 3mb png though
02:03 🔗 DFJustin not very mobile-friendly
02:10 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
02:11 🔗 GLaDOS has joined #archiveteam
02:26 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
02:26 🔗 GLaDOS has joined #archiveteam
02:28 🔗 SketchCow So, in terms of that
02:28 🔗 SketchCow It's best to just have the system have originals
02:28 🔗 SketchCow And if it doesn't do the right thing, bring it up to info@archive.org.
02:32 🔗 dashcloud so, is that considered a bug? if it is, I'll send it in
02:33 🔗 SketchCow I think if you go "holy fuck, I can't work here" because it murders your machine, it's worth mentioning
02:33 🔗 SketchCow tracey might not agree or will agree.
02:36 🔗 dashcloud personally I liked the v1 style better because it lets you see the content without needing to scroll, but it doesn't cause my machine any more trouble than usual
02:37 🔗 SketchCow The thing about v2
02:37 🔗 SketchCow Is that v2 is actually this case of shooting forward literally 16 years
02:37 🔗 SketchCow So first, they had to redo the back-end, and make it work in the system
02:38 🔗 SketchCow Now that it's there, improvements and changes are not traumatic at all.
02:38 🔗 SketchCow And as seen in the changelog, things are ramping up hella quick
02:38 🔗 SketchCow By the way, the web screenshotter's been chugging along.
02:39 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
02:42 🔗 SketchCow And in some cases, REALLY chugging.
02:42 🔗 GLaDOS has joined #archiveteam
02:49 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/ continues (for the moment) to link to the workspace.
02:50 🔗 kyan Here's the one with the massive MASSIVE PNG file that got chosen as the default: https://archive.org/details/CHORALECONCERT111613.ripped17jan2014
02:52 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
02:55 🔗 GLaDOS has joined #archiveteam
03:01 🔗 dan_ has quit IRC (Ping timeout: 260 seconds)
03:01 🔗 xmc v1 didn't downscale images either, it just put them in a tiny little <table>
03:06 🔗 dan_ has joined #archiveteam
03:12 🔗 DFJustin well it did a super-tiny jpeg thumb if the derive ran successfully
03:12 🔗 xmc oh
03:13 🔗 DFJustin and if the image wasn't a gif because all gifs are animations right
03:13 🔗 SketchCow I STILL think of xmc and DFJustin as the same guy
03:13 🔗 SketchCow And I STILL flip out when I see you two talk
03:13 🔗 xmc how the hell
03:13 🔗 SketchCow it's like, "you ok, buddy?"
03:13 🔗 SketchCow I know
03:13 🔗 SketchCow Also, it's 4am here
03:13 🔗 xmc now i'm fucking insulted
03:13 🔗 xmc i'm about to leave the channel forever unless you apologize to me publicly
03:13 🔗 DFJustin wow mean
03:14 🔗 SketchCow .___/\ _________
03:14 🔗 SketchCow | )/_____ / _____/ __________________ ___.__.
03:14 🔗 SketchCow | |/ \ \_____ \ / _ \_ __ \_ __ < | |
03:14 🔗 SketchCow | | Y Y \ / ( <_> ) | \/| | \/\___ |
03:14 🔗 SketchCow |___|__|_| / /_______ /\____/|__| |__| / ____|
03:14 🔗 SketchCow \/ \/ \/
03:14 🔗 * xmc mollified
03:15 🔗 SimpBrain has quit IRC (Read error: Connection reset by peer)
03:18 🔗 SketchCow Ah, I have to wake up in 3 hours
03:18 🔗 SketchCow Good thing I got a jump on that.
03:18 🔗 Ara_ has joined #archiveteam
03:24 🔗 brayden has quit IRC (Read error: Operation timed out)
03:34 🔗 mistym has quit IRC (Remote host closed the connection)
03:58 🔗 mistym has joined #archiveteam
04:00 🔗 Ara_ has quit IRC (Read error: Connection reset by peer)
04:08 🔗 aaaaaaaaa has quit IRC (Leaving)
04:42 🔗 balrog has quit IRC (Ping timeout: 260 seconds)
04:53 🔗 rejon has joined #archiveteam
04:58 🔗 balrog has joined #archiveteam
04:58 🔗 swebb sets mode: +o balrog
05:03 🔗 qwebirc10 has joined #archiveteam
05:04 🔗 qwebirc10 has quit IRC (Client Quit)
05:19 🔗 brayden has joined #archiveteam
06:10 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
06:11 🔗 GLaDOS has joined #archiveteam
06:28 🔗 wp494 has quit IRC (Ping timeout: 740 seconds)
06:53 🔗 mistym has quit IRC (Remote host closed the connection)
06:53 🔗 wp494 has joined #archiveteam
06:55 🔗 techapj_ has joined #archiveteam
07:14 🔗 Nemo_bis http://www.museumofplay.org/online-collections/search/index.php?object_id=&title=&name=&subject=&artist=&manufacturer=&material=&credit_line=Jason+Scott
07:17 🔗 thisismyn has quit IRC (Ping timeout: 260 seconds)
07:22 🔗 primus104 has joined #archiveteam
07:37 🔗 Jonimus has quit IRC (Ping timeout: 370 seconds)
07:37 🔗 dashcloud has quit IRC (Ping timeout: 260 seconds)
07:37 🔗 dashcloud has joined #archiveteam
07:48 🔗 bmcginty_ has joined #archiveteam
07:48 🔗 bmcginty has quit IRC (Read error: Connection reset by peer)
07:58 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
08:15 🔗 dashcloud has quit IRC (Read error: Operation timed out)
08:18 🔗 dashcloud has joined #archiveteam
08:23 🔗 T31M has quit IRC (Quit: Leaving)
08:26 🔗 schbirid has joined #archiveteam
08:26 🔗 SimpBrain has joined #archiveteam
08:27 🔗 techapj_ has quit IRC (Ping timeout: 240 seconds)
08:33 🔗 techapj_ has joined #archiveteam
08:36 🔗 arkiver Kenshin: would you like to help us on the rapidshare grab? :)
08:51 🔗 SketchCow The speech went well
08:51 🔗 godane hey SketchCow
08:51 🔗 SketchCow Hey, godane.
08:51 🔗 godane i'm uploading more led zeppelin bootlegs to your ftp
08:52 🔗 godane april 8, 1970 shows right now
08:52 🔗 techapj_ has quit IRC (Quit: Page closed)
08:53 🔗 xmc wooooo
08:54 🔗 xmc i suspect that led zeppelin bootlegs will be more broadly interesting than the complete archives of glenn beck
08:54 🔗 xmc but it's important to make sure that the full breadth of glenn beck's opinions are preserved regardless :)
08:55 🔗 godane i have his radio shows going back to august 2005
08:56 🔗 godane there are more links before that but they're all dead
08:57 🔗 xmc well, ok :)
08:57 🔗 godane sort of surprised that i got that much since it wasn't on his main website
08:57 🔗 godane that only goes back to 2008
08:57 🔗 godane also some of the rtmp stream still works last time i checked
08:57 🔗 techapj_ has joined #archiveteam
08:58 🔗 godane but none for the radio shows
08:58 🔗 godane i'm also grabbing more south korea news
08:58 🔗 godane SBS network now
08:59 🔗 godane tons of dial-up videos and some stock video
09:01 🔗 godane ok now this is interesting
09:01 🔗 godane i'm getting a full 78 minute (news) show in dial-up
09:03 🔗 godane i think its some sort of debate show
09:03 🔗 godane or game show
09:04 🔗 godane i think these videos are from 1998ish now
09:04 🔗 godane or 2001
09:05 🔗 godane it was flash dates all from 1997
09:05 🔗 godane then 2001.1.23
09:20 🔗 philpem has joined #archiveteam
09:40 🔗 JMC_ has joined #archiveteam
09:42 🔗 wp494 has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 Start has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 JMC has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 SadDM has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 xtr-201 has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 garyrh has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 pikhq has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 useretail has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 matthusby has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 will has quit IRC (ircd.shaw.ca irc.shaw.ca)
09:42 🔗 dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
09:44 🔗 dashcloud has joined #archiveteam
10:00 🔗 habi has joined #archiveteam
10:02 🔗 habi has left
10:02 🔗 wp494 has joined #archiveteam
10:02 🔗 Start has joined #archiveteam
10:02 🔗 SadDM has joined #archiveteam
10:02 🔗 garyrh has joined #archiveteam
10:02 🔗 pikhq has joined #archiveteam
10:02 🔗 useretail has joined #archiveteam
10:02 🔗 matthusby has joined #archiveteam
10:02 🔗 will has joined #archiveteam
10:02 🔗 irc.shaw.ca sets mode: +o SadDM
10:02 🔗 swebb sets mode: +o SadDM
10:05 🔗 Ymgve has joined #archiveteam
10:29 🔗 cloudmons has quit IRC (Read error: Connection reset by peer)
10:29 🔗 cloudmons has joined #archiveteam
11:17 🔗 w0rp has quit IRC (Ping timeout: 265 seconds)
11:17 🔗 Ravenloft has quit IRC (Ping timeout: 265 seconds)
11:18 🔗 lytv has quit IRC (Ping timeout: 265 seconds)
11:18 🔗 dx- has quit IRC (Ping timeout: 265 seconds)
11:18 🔗 ryan has joined #archiveteam
11:18 🔗 dx has joined #archiveteam
11:19 🔗 w0rp has joined #archiveteam
11:20 🔗 jk[[SVP]] has joined #archiveteam
11:21 🔗 lytv has joined #archiveteam
11:21 🔗 SketchCo1 has joined #archiveteam
11:21 🔗 swebb sets mode: +o SketchCo1
11:21 🔗 Kazzy_ has joined #archiveteam
11:22 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
11:24 🔗 Sk2d has joined #archiveteam
11:26 🔗 Sk1d has quit IRC (hub.se efnet.portlane.se)
11:26 🔗 jk[SVP] has quit IRC (hub.se efnet.portlane.se)
11:26 🔗 Gfy has quit IRC (hub.se efnet.portlane.se)
11:26 🔗 underscor has quit IRC (hub.se efnet.portlane.se)
11:26 🔗 filippo has quit IRC (hub.se efnet.portlane.se)
11:26 🔗 ryan__ has quit IRC (hub.se efnet.portlane.se)
11:26 🔗 Deewiant has quit IRC (hub.se efnet.portlane.se)
11:26 🔗 Kazzy has quit IRC (hub.se efnet.portlane.se)
11:26 🔗 SketchCow has quit IRC (hub.se efnet.portlane.se)
11:27 🔗 techapj_ has quit IRC (Ping timeout: 240 seconds)
11:28 🔗 Gfy_ has joined #archiveteam
11:42 🔗 Gfy_ is now known as Gfy
11:42 🔗 jk[[SVP]] is now known as jk[SVP]
11:42 🔗 brayden has quit IRC (Quit: Leaving)
11:42 🔗 underscor has joined #archiveteam
11:42 🔗 swebb sets mode: +o underscor
11:42 🔗 Kazzy_ is now known as Kazzy
11:42 🔗 Sk2d is now known as Sk1d
11:56 🔗 yan has joined #archiveteam
11:57 🔗 filippo has joined #archiveteam
12:02 🔗 T31M has joined #archiveteam
12:08 🔗 primus104 has quit IRC (Leaving.)
12:24 🔗 dashcloud has quit IRC (Read error: Operation timed out)
12:31 🔗 dashcloud has joined #archiveteam
12:36 🔗 habi has joined #archiveteam
12:42 🔗 sankin has joined #archiveteam
12:49 🔗 Jonimus has joined #archiveteam
12:52 🔗 brayden has joined #archiveteam
13:09 🔗 habi has left
13:40 🔗 ionpulse has quit IRC (Read error: Connection reset by peer)
13:46 🔗 Start has quit IRC (Disconnected.)
13:47 🔗 ionpulse has joined #archiveteam
14:31 🔗 T31M has quit IRC (Ping timeout: 1221 seconds)
14:34 🔗 mistym has joined #archiveteam
14:34 🔗 Start has joined #archiveteam
14:35 🔗 Start_ has joined #archiveteam
14:35 🔗 Start has quit IRC (Read error: Connection reset by peer)
14:39 🔗 mistym has quit IRC (Remote host closed the connection)
14:43 🔗 primus104 has joined #archiveteam
14:46 🔗 midas http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=RssLanding&cat=news&id=2028891
14:46 🔗 Start_ has quit IRC (Read error: Connection reset by peer)
14:46 🔗 midas amazon has unlimited storage for 60 bucks a year
14:46 🔗 Start has joined #archiveteam
14:47 🔗 Stiletto has joined #archiveteam
14:56 🔗 mistym has joined #archiveteam
14:58 🔗 Smiley Oooo ooo ooo
14:58 🔗 Smiley archiving emails
14:58 🔗 Smiley promotional emails, do we do it? SHOULD WE? HELLLLLL YEAAAAH
15:01 🔗 Smiley but how?
15:01 🔗 Smiley Oh, how about an account that receives email and renders them out and uploads?
15:01 🔗 Sanqui sign up for ALL the newsletters
15:01 🔗 Sanqui sounds like a fairly simple project
15:01 🔗 Sanqui Project Newsletter
15:02 🔗 Sanqui just advertise the email address and tell people to sign it up for as many newsletters as possible
15:04 🔗 Sanqui would be even better if an entire domain was dedicated to this
15:05 🔗 Sanqui so you could sign up with something like nytimes@projectneswletter.org
15:05 🔗 Sanqui this is sounding awesome already
15:05 🔗 Sanqui Smiley, what do you think?
15:05 🔗 Smiley sounds good
15:05 🔗 Smiley get on it ;)
15:06 🔗 Smiley Sanqui: it's possible it's already done, hence the discussion :)
15:06 🔗 Sanqui yeah.. now if only I had the money for the domain and a server to dedicate to this :P
15:10 🔗 johtso Not just newsletters, archive the spam!
15:10 🔗 primus104 has quit IRC (Leaving.)
15:10 🔗 habi has joined #archiveteam
15:10 🔗 habi has left
15:10 🔗 johtso would be interesting to look back and see how 419 scam emails have changed over time
15:11 🔗 johtso http://untroubled.org/spam/
15:14 🔗 Sanqui with some newsletters, the line gets blurry
15:14 🔗 Sanqui also, you sign up for newsletters and get signed up for unrelated spam
15:15 🔗 johtso Many newsletters aren't open and are only sent to members of a particular service
15:15 🔗 johtso some would be easier to sign up for than others
15:16 🔗 achip might need a way to "confirm" subscription too, view the inbox for the newsletter and click the confirm link, archiving mailing lists would be good too
15:17 🔗 Sanqui right, you'd need to confirm (or auto-confirm based on some heuristics)
15:17 🔗 schbird has joined #archiveteam
15:17 🔗 schbird hey, i just noticed that germany's main tabloid blocks everyone but search engine robots
15:17 🔗 schbird bild.de
15:17 🔗 Sanqui i think there are already some mailing list archiving services
15:18 🔗 schbird if there is any way that we could get it crawled and saved regularly, it would be great. even if not public or in non-waybackmachine warcs
15:18 🔗 Start has quit IRC (Disconnected.)
15:19 🔗 schbird also http://web.archive.org/save/http://www.mopo.de/panorama/flug-4u9525-zerschellte-am-berg-extra-schub-der-triebwerke-sorgte-fuer-grosse-explosion,5066860,30220524.html fails
15:19 🔗 Sanqui didn't fail for me
15:20 🔗 schbird nice
15:20 🔗 schbird told me the page was not available live
15:24 🔗 johtso What's the best way to go about archiving file locker downloads "manually"?
15:25 🔗 johtso If I've scraped some collection of links, in what format should I upload them?
15:25 🔗 johtso I'm thinking associating the uploads in some way with the original URL would be a good idea
15:25 🔗 Start has joined #archiveteam
15:26 🔗 johtso uploading them in their archives seems a bit pointless though
15:26 🔗 Start has quit IRC (Read error: Connection reset by peer)
15:26 🔗 johtso and if it happens to be all links from a certain blog for example, it would probably make sense to upload them all as a group.
15:28 🔗 Start has joined #archiveteam
15:35 🔗 scyther has joined #archiveteam
15:42 🔗 dashcloud has quit IRC (Read error: Operation timed out)
15:47 🔗 dashcloud has joined #archiveteam
15:51 🔗 garyrh has quit IRC (Remote host closed the connection)
15:51 🔗 Start has quit IRC (Disconnected.)
16:01 🔗 VADemon has joined #archiveteam
16:08 🔗 schbird has quit IRC (Quit: Leaving)
16:16 🔗 habi has joined #archiveteam
16:20 🔗 habi has left
16:28 🔗 Start has joined #archiveteam
16:29 🔗 Start_ has joined #archiveteam
16:29 🔗 Start has quit IRC (Read error: Connection reset by peer)
16:29 🔗 Start_ is now known as Start
16:32 🔗 Start arkiver: now that rapidshare's up and running, let's get started on music unlimited
16:32 🔗 Start 3 days left
16:32 🔗 primus104 has joined #archiveteam
16:36 🔗 SimpBrain o.O
16:45 🔗 Start has quit IRC (Disconnected.)
16:55 🔗 patricko- is now known as patrickod
16:56 🔗 aaaaaaaaa has joined #archiveteam
17:01 🔗 kyan johtso: Definitely include the original archive files — it's much better to provide unmodified copies of data.
17:02 🔗 johtso kyan, but then the data can't be explored on IA can it?
17:02 🔗 kyan ZIP files and ISOs can be browsed like folders on IA
17:02 🔗 kyan and tars I think
17:02 🔗 johtso oh nice
17:03 🔗 johtso kyan, and relating it to the relevant URL?
17:03 🔗 kyan Do bear in mind that for some reason IA's upload form rejects RAR files, so if you have those it would make sense to put them in another file eg "rar-files-packed-2015March26"
17:03 🔗 johtso oh, that's annoying..
17:04 🔗 kyan I'm not sure about that, I don't know that IA has facilities for referencing things other than WARC files by URL
17:05 🔗 kyan Probably the best way to do that would be to make an HTML page that lists the original links, with hyperlinks to the new archived locations. I'm pretty sure it's possible to link directly to a file within a ZIP/ISO/tar on IA
17:05 🔗 kyan Alternatively, for uploading RARs you could make a torrent file of them, and then upload the torrent file. IA will automatically download the contents of torrent files, and it seems to accept RARs that way.
17:07 🔗 kyan Here's an example of the ZIP file viewer: https://archive.org/download/AElTantamusFuturamerlinID5194/2719.zip/
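
kyan's suggestion of an HTML page that pairs each original link with its archived location could look roughly like the sketch below. The function name and the example URLs are hypothetical; the mapping from original URL to archived URL is assumed to have been collected by hand or by the scraper, and nothing here depends on how IA shapes its download URLs.

```python
from html import escape


def build_link_index(title, link_map):
    """Return an HTML page listing each original URL next to a link to its archived copy.

    link_map maps original file-locker URL -> archived URL (both supplied by the caller).
    """
    rows = []
    for original, archived in sorted(link_map.items()):
        # Escape both values so '&' and quotes in URLs cannot break the markup.
        rows.append(
            '<li><code>{0}</code> &rarr; <a href="{1}">{1}</a></li>'.format(
                escape(original), escape(archived)
            )
        )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>{0}</title></head>\n<body><h1>{0}</h1>\n<ul>\n{1}\n</ul>\n</body></html>\n"
    ).format(escape(title), "\n".join(rows))


if __name__ == "__main__":
    page = build_link_index(
        "Rescued downloads",
        {"http://locker.example/f/abc?x=1&y=2": "https://archive.example/download/item/abc.zip"},
    )
    print(page)
```

Uploading this page alongside the files keeps the original URLs in plain text inside the item, which also helps with the full-text search idea raised later in the log.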
17:08 🔗 patrickod is now known as patricko-
17:08 🔗 DFJustin upload the original archive and the extracted files both
17:08 🔗 DFJustin also you can upload rar via command line or s3
17:09 🔗 arkiver Start: yes, having a look at it atm
17:12 🔗 DFJustin here's an example I did from a video game music blog https://archive.org/details/Saladedemais_-_Follin_Project_-_Amiga
17:20 🔗 johtso DFJustin, ah interesting
17:20 🔗 johtso struggling to work out how that archive is structured.. IA uploads do have a file structure don't they?
17:21 🔗 mistym has quit IRC (Remote host closed the connection)
17:28 🔗 johtso kyan, thanks for the torrent file tip, that sounds like a very convenient solution!
17:28 🔗 johtso kyan, does it respect folder structures?
17:28 🔗 DFJustin it may be clearer to look at the raw directory https://archive.org/download/Saladedemais_-_Follin_Project_-_Amiga/
17:28 🔗 johtso ah nice, yes that is clearer
17:29 🔗 DFJustin you can have folder structures but it's not really visible in the ui and it's probably better to split things up usually
17:29 🔗 johtso DFJustin, hmm, it's a bit tricky, because I'm dealing with bulk here
17:30 🔗 johtso scraping all the archives linked to from a website.. each one of those archives pertaining to some album of music
17:30 🔗 DFJustin ah so folders would be like disc 1, disc 2?
17:31 🔗 DFJustin generally one album = one IA item
17:31 🔗 johtso DFJustin, take this site for example
17:32 🔗 johtso so many archives, each containing a directory of mp3 files
17:32 🔗 scyther has quit IRC (Read error: Connection reset by peer)
17:33 🔗 johtso obviously the pertinent thing here is getting the data preserved, and doing an upload per download would most likely make it too time consuming
17:34 🔗 DFJustin there are ways of doing batch uploads
17:34 🔗 johtso DFJustin, oops, forgot the link: http://braingoreng.blogspot.co.uk/
17:36 🔗 johtso DFJustin, but what about getting the metadata right for the upload? Is there really any point in having individual uploads if I don't have the time to give each one a correct name etc.?
17:36 🔗 mistym has joined #archiveteam
17:37 🔗 DFJustin if you don't have time, uploading one item with a big zip or tar of all the archives is better than nothing
17:38 🔗 johtso DFJustin, I think that's probably the way to go, better off uploading more stuff before the files disappear than spending too long on each item
17:39 🔗 johtso I could possibly just add a list of all the linked that were scraped somewhere in the description/metadata
17:39 🔗 johtso just so that full text searches would find it
17:39 🔗 johtso *links
17:43 🔗 SketchCo1 is now known as SketchCOw
17:43 🔗 SketchCOw is now known as SketchCow
17:44 🔗 SketchCow Hi, hi.
17:44 🔗 DFJustin god dag
17:47 🔗 SketchCow Going to have a little social hour with Gothenburg locals.
17:49 🔗 johtso Is it possible to incrementally update an IA entry using bittorrent?
17:50 🔗 SketchCow It is possible to upload data into an IA item and then regenerate the torrent.
17:50 🔗 signius has quit IRC (Read error: Operation timed out)
17:56 🔗 Start has joined #archiveteam
17:57 🔗 johtso SketchCow, I was thinking more along the lines of seeding a torrent that IA is downloading, and then incrementally adding files, but that obviously goes against how torrents work
17:58 🔗 SketchCow Right
18:06 🔗 signius has joined #archiveteam
18:06 🔗 tom_ has joined #archiveteam
18:17 🔗 SketchCow Oh, other screenshotter follies:
18:17 🔗 SketchCow There was a motherdubbing 315 GB WARC in an item
18:18 🔗 yipdw heh I remember those days when we didn't split
18:18 🔗 yipdw those were awesome
18:29 🔗 ryan has quit IRC (Ping timeout: 260 seconds)
18:30 🔗 ryan_ has joined #archiveteam
18:37 🔗 Start has quit IRC (Disconnected.)
18:39 🔗 arkiver Fusl: uploading in 50G packs through torrents https://archive.org/details/Wallbase.ccArchive01
18:59 🔗 patricko- is now known as patrickod
19:01 🔗 Atluxity johtso: your thought made me think about a self-written torrent tracker that a system like IA could poll at regular intervals, with predetermined torrent names based on the current unixtime; then one could make a torrent with a filename one would know would get polled at a certain time, and then the system like IA could use it for download
19:03 🔗 johtso Atluxity, really you just want to be able to rsync to somewhere..
19:03 🔗 Start has joined #archiveteam
19:15 🔗 Stiletto has quit IRC (Ping timeout: 306 seconds)
19:19 🔗 thechip has quit IRC (Leaving...)
19:25 🔗 Start has quit IRC (Disconnected.)
19:29 🔗 patrickod is now known as patricko-
19:32 🔗 Start has joined #archiveteam
19:33 🔗 Start_ has joined #archiveteam
19:33 🔗 Start has quit IRC (Read error: Connection reset by peer)
19:36 🔗 Start_ is now known as Start
19:36 🔗 Start Sanqui: i have an idea for project newsletter
19:39 🔗 Start rather than having one email address per newsletter, have a general purpose account (perhaps projectnewsletter@archiveteam.org) and have archivebot-like go packs separated by the sender's email
19:41 🔗 Sanqui it makes sense to have one email address per newsletter
19:41 🔗 Sanqui because that way you can more easily group together the "source"
19:42 🔗 Sanqui it's more information
19:42 🔗 Sanqui if they start sending emails from a different address, they'll still be sent to the same YOUR address
19:42 🔗 Sanqui or, if they sell the email to spamming companies, you'll know who did it ;)
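
Sanqui's argument for one address per newsletter can be illustrated with a small sketch: the local part of the recipient address acts as the source tag, so mail stays grouped even when the sender changes, and a tag that starts receiving mail from several sender domains hints that the address was passed on. All names and the `newsletters.example` domain are invented for illustration; this is not an existing Archive Team tool.

```python
from collections import defaultdict


def source_tag(recipient):
    """Use the local part of the per-newsletter address as the source tag."""
    return recipient.rsplit("@", 1)[0].lower()


def group_by_source(messages):
    """Map each source tag to the set of sender domains seen for it.

    messages is an iterable of (recipient_address, sender_address) pairs.
    """
    seen = defaultdict(set)
    for recipient, sender in messages:
        seen[source_tag(recipient)].add(sender.rsplit("@", 1)[-1].lower())
    return dict(seen)


def leaked_sources(domains_by_source):
    """Return tags that received mail from more than one sender domain."""
    return sorted(tag for tag, domains in domains_by_source.items() if len(domains) > 1)


if __name__ == "__main__":
    mail = [
        ("nytimes@newsletters.example", "news@nytimes.example"),
        ("nytimes@newsletters.example", "deals@spammer.example"),
        ("bbc@newsletters.example", "daily@bbc.example"),
    ]
    print(leaked_sources(group_by_source(mail)))
```

A second sender domain is only a hint, since a newsletter may legitimately switch mailing providers, so a flagged tag would still want a human look.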
19:45 🔗 Start_ has joined #archiveteam
19:45 🔗 Start has quit IRC (Read error: Connection reset by peer)
19:45 🔗 Start_ is now known as Start
19:45 🔗 Start perhaps would we have a page where people could create an email address for a newsletter
19:46 🔗 patricko- is now known as patrickod
19:46 🔗 schbirid catchall and ask people to signup with site/service specific addresses
19:46 🔗 schbirid not sure how you would handle the confirmation links though
19:47 🔗 SN4T14_ has joined #archiveteam
19:48 🔗 Start maybe if the words such as confirmation, confirm, verify, or verification are detected in the first email on an account, someone on irc gets notified
19:48 🔗 Start we could have a password protected area where the confirmation links could be clicked
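
Start's keyword check on the first email of an account is simple to sketch. The word list is exactly the four words named above; the function name is hypothetical, and the IRC notification itself is left out because nothing in the log says how it would be wired up.

```python
import re

# The four trigger words Start lists, matched as whole words, case-insensitively.
CONFIRM_WORDS = re.compile(r"\b(confirmation|confirm|verify|verification)\b", re.IGNORECASE)


def needs_confirmation(subject, body):
    """Return True if the first email on an address looks like a sign-up confirmation."""
    return bool(CONFIRM_WORDS.search(subject) or CONFIRM_WORDS.search(body))


if __name__ == "__main__":
    print(needs_confirmation("Please confirm your subscription", "Click the link below."))
    print(needs_confirmation("Weekly digest", "Top stories this week."))
```

Whole-word matching keeps "confirmed" or "verified" in an ordinary newsletter from triggering a notification; whether that trade-off is right would need testing against real mail.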
19:50 🔗 achip could just be a simple interface that just displays the last 5 emails into the inbox for confirming
19:53 🔗 patrickod is now known as patricko-
19:55 🔗 tom_ has quit IRC (Ping timeout: 240 seconds)
19:55 🔗 aaaaaaaaa just issue a get request for every url that doesn't have "unsubscribe" or the like in the anchor.
19:56 🔗 SN4T14 has quit IRC (Ping timeout: 512 seconds)
20:01 🔗 joepie91_ [20:41] <Sanqui> it makes sense to have one email address per newsletter
20:01 🔗 joepie91_ there is a very nice Python libraery
20:01 🔗 joepie91_ library *
20:01 🔗 joepie91_ that makes it easy to accept email on lots of different addresses
20:01 🔗 joepie91_ dynamically
20:01 🔗 joepie91_ hold
20:02 🔗 achip could just do it with virtuals in postfix as well
20:02 🔗 joepie91_ where's the damn thing..
20:02 🔗 joepie91_ sure, if you like pain :P
20:04 🔗 joepie91_ not what I was thinking of, but https://github.com/bcoe/smtproutes seems similar
20:04 🔗 joepie91_ AH
20:05 🔗 joepie91_ found it!
20:05 🔗 joepie91_ https://github.com/kennethreitz/inbox.py
20:05 🔗 joepie91_ cc Start Sanqui schbirid
20:05 🔗 joepie91_ confirmation links are not generally hard to do - first email you receive, follow whatever is the longest link that isn't an unsubscribe link
20:05 🔗 joepie91_ chances are it's the confirm link
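
The heuristic joepie91_ describes (follow the longest link that is not an unsubscribe link), combined with aaaaaaaaa's check on the anchor text, might be sketched as follows using only the standard library. The class and function names are invented here, it only handles HTML bodies, and it returns the guessed URL rather than fetching it, so a human or a later step can decide whether to issue the request.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []        # list of [href, text] pairs
        self._current = None   # the pair currently being filled, if inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = [href, ""]
                self.links.append(self._current)

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None


def guess_confirm_link(html_body):
    """Return the longest http(s) link whose URL and anchor text lack 'unsubscribe', or None."""
    parser = _LinkCollector()
    parser.feed(html_body)
    parser.close()
    candidates = [
        href
        for href, text in parser.links
        if href.startswith(("http://", "https://"))
        and "unsubscribe" not in href.lower()
        and "unsubscribe" not in text.lower()
    ]
    return max(candidates, key=len, default=None)


if __name__ == "__main__":
    body = (
        '<a href="https://news.example/">Home</a> '
        '<a href="https://news.example/confirm?token=abcdef123456">Confirm</a> '
        '<a href="https://news.example/unsubscribe?token=abcdef1234567890">Unsubscribe</a>'
    )
    print(guess_confirm_link(body))
```
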
20:06 🔗 caber has quit IRC (Read error: Operation timed out)
20:09 🔗 caber has joined #archiveteam
20:22 🔗 Start has quit IRC (Disconnected.)
20:29 🔗 mistym has quit IRC (Remote host closed the connection)
20:32 🔗 habi has joined #archiveteam
20:33 🔗 habi has left
20:51 🔗 mistym has joined #archiveteam
20:53 🔗 achip got a POC working with that python lib, pretty slick. should probably move to a new chan ya?
20:59 🔗 schbirid has quit IRC (Leaving)
20:59 🔗 sankin has quit IRC (Leaving.)
21:01 🔗 primus has joined #archiveteam
21:03 🔗 schbirid has joined #archiveteam
21:04 🔗 schbirid has quit IRC (Client Quit)
21:10 🔗 schbirid has joined #archiveteam
21:12 🔗 Sanqui achip: i like project newsletter, but it's not very catchy/funny
21:12 🔗 Sanqui gj, by the way
21:15 🔗 achip spewsletter newslater spamcan I'm not good at these haha
21:22 🔗 johtso email archive
21:23 🔗 mistym has quit IRC (Remote host closed the connection)
21:27 🔗 johtso DFJustin, will I not fall foul of "If you upload .zip, .rar, non-audio formats (like .exe), or password-protected files, they may be removed by our moderators."
21:27 🔗 yan has quit IRC (Quit: bye)
21:34 🔗 BlueMaxim has joined #archiveteam
21:36 🔗 xmc to where?
21:37 🔗 johtso xmc, internet archive
21:38 🔗 johtso xmc, was talking earlier about rescuing filelocker downloads from music blogs, and what the best format was for uploading to IA
21:38 🔗 mistym has joined #archiveteam
21:39 🔗 xmc they'll be fiiiiine
21:39 🔗 xmc upload whatever you downloaded, don't modify it
21:40 🔗 johtso awesome.
21:40 🔗 johtso jdownloader2 really is an impressive bit of software.. wish there was a nice way it could be incorporated into some kind of automated pipeline
21:41 🔗 xmc hmmmm
21:44 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/bewareofmpreg.tumblr.com-inf-20140902-230752-dp7fc-00000.warc.gz.png
21:47 🔗 schbirid johtso: there are some similar tools for the commandline, with far fewer features though, and of course captchas are a problem
21:50 🔗 Sanqui SketchCow: some amazing things could be done with the snapshot project
21:50 🔗 Sanqui I'm thinking a video with several website pictures per second
21:51 🔗 Sanqui but there's a lot that can be done
21:51 🔗 Sanqui another idea: order sites by hue and make a rainbow
21:52 🔗 midas well that image wasnt strange at all
21:55 🔗 joepie91_ johtso: see plowshare
21:56 🔗 schbirid has quit IRC (Leaving)
21:56 🔗 joepie91_ [22:27] <johtso> DFJustin, will I not fall foul of "If you upload .zip, .rar, non-audio formats (like .exe), or password-protected files, they may be removed by our moderators."
21:56 🔗 joepie91_ where'd you see that?
21:57 🔗 xmc probably some very old faq item
21:57 🔗 SketchCow You got a friend in the business
22:00 🔗 patricko- is now known as patrickod
22:02 🔗 patrickod is now known as patricko-
22:15 🔗 johtso joepie91_, https://archive.org/about/faqs.php#236
22:20 🔗 johtso joepie91_, nice, hadn't come across plowshare
22:21 🔗 johtso there's also pyload, which isn't command line, but is open source python
22:21 🔗 johtso the issue with these kinds of tools is that they break ALL THE TIME
22:21 🔗 johtso so unless it's a very actively maintained project it can be very frustrating
22:23 🔗 balrog SketchCow: the password-protected files thing might still be somewhat valid, right?
22:25 🔗 kyan I've uploaded quite a few password protected things that are in copyright, etc. with plans to leave the password in my will...
22:25 🔗 kyan I don't think any of them have been taken down.
22:25 🔗 xmc why not upload them in plaintext and request darking instead?
22:26 🔗 johtso xmc, I was wondering if something like that was possible
22:26 🔗 kyan meh, it's easier since I can automate the process: I find something to archive, I drop it into a folder, run a terminal command and voilà, away it goes
22:26 🔗 johtso Oh, I suppose the downside to that is that you can't download them again yourself
22:26 🔗 kyan That too
22:27 🔗 johtso would be good if you could "dark" something and still have access to it
22:28 🔗 xmc there's a collection flag that makes the items in it not downloadable by anyone other than the uploader
22:28 🔗 xmc but they are still visible
22:29 🔗 xmc e.g. https://archive.org/details/bellsystempractices
22:32 🔗 joepie91_ johtso: I still intend to build a well-maintained Node.js module for this kind of thing
22:32 🔗 VADemon does anybody know of a url grabber for google? ~75k search results
22:32 🔗 joepie91_ but I have a bunch of other things to get done first
22:32 🔗 joepie91_ :)
22:32 🔗 joepie91_ VADemon: forget it, heh
22:33 🔗 xmc VADemon: google won't give you anything actually past the 1000th result
22:33 🔗 xmc and they make it damn hard to get there anyway
22:33 🔗 joepie91_ VADemon: it's extremely unlikely that you'll manage to automate your way to even 10 pages without getting canned
22:33 🔗 joepie91_ they're VERY aggressive towards bots
22:33 🔗 johtso meanies
22:34 🔗 xmc you could probably browse manually through a warc-writing proxy and scrape the results from the cache though
22:34 🔗 joepie91_ they do crap like comparing user agents to known unusual behaviour for those specific useragents, to detect fakes
22:34 🔗 joepie91_ it's ridiculous
22:34 🔗 joepie91_ even if you phantomjs your way in, they'll likely still know
22:34 🔗 VADemon I am surprised but its not surprising %)
22:35 🔗 VADemon I will GUI FIREFOX MY WAY IN THEN
22:35 🔗 joepie91_ I've had very limited success but certainly nowhere near 75k results
22:35 🔗 Stiletto has joined #archiveteam
22:39 🔗 johtso probably better off trying to break into their datacenters
22:39 🔗 Start has joined #archiveteam
22:39 🔗 johtso or getting a job at google :P
22:40 🔗 VADemon it's easier to start WW3 than this ^ & ^^
22:44 🔗 Start i'm going to create a page for project newsletter
22:45 🔗 Start i'm kinda feeling meh on the name though
22:45 🔗 Start anyone got a better idea for the name?
22:49 🔗 Stiletto has quit IRC (Read error: Operation timed out)
22:52 🔗 VADemon bad name incoming: rss -> resser. pure creativity
22:58 🔗 VADemon got google captcha after 7 pages 100links each ;)
22:59 🔗 johtso Start, is it just newsletters? or all kinds of automated email?
22:59 🔗 Start i suppose it could be for both
23:01 🔗 Start article has been created: http://archiveteam.org/index.php?title=Project_Newsletter
23:03 🔗 SimpBrain has quit IRC (Quit: Leaving)
23:03 🔗 achip there could be a catchall domain(s) that's semi-public that isn't a big deal if it catches spam, but I like the idea of creating the addresses on another domain(s) for each or a set of newsletters to cut out spam
23:13 🔗 arkiver Start: How do you think we can best do music unlimited?
23:14 🔗 arkiver the site will totally not show up nicely in wayback machine
23:14 🔗 Start as long as the content is archived
23:15 🔗 Start if anything needs an account i could create one
23:18 🔗 fool8 has joined #archiveteam
23:18 🔗 fool8 Can I upload a 20GB+ WARC to archive.org?
23:19 🔗 fool8 I forgot --warc-max-size
23:21 🔗 arkiver fool8: yes, please upload it
23:21 🔗 arkiver paste a link to the item here and we'll move it to a web collection
23:21 🔗 philpem has quit IRC (Ping timeout: 260 seconds)
23:22 🔗 fool8 I'll come back later or tomorrow
23:22 🔗 fool8 has quit IRC (foo bar baz qux)
23:24 🔗 arkiver Start: I think we'll just grab all the pages and links on the pages that were discovered
23:24 🔗 arkiver the website is totally filled with javascript
23:24 🔗 arkiver chfoo: do you think we are able to do a full grab of https://music.sonyentertainmentnetwork.com/ ?
23:28 🔗 arkiver For the newsletter archiving I think there should definitely be some sort of human intervention or check before new newsletters are added
23:30 🔗 Start !a https://music.sonyentertainmentnetwork.com --phantomjs
23:30 🔗 Start wrong channel
23:31 🔗 Start arkiver: how did archivebot handle music unlimited
23:33 🔗 arkiver running now
23:34 🔗 Start i'll carefully watch it and see how effective it is
23:40 🔗 arkiver #limitedmusic