[00:03] *** mutoso has joined #archiveteam-bs
[00:11] *** kyan has joined #archiveteam-bs
[00:37] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[00:39] so we are down to 160tb of free space on IA
[00:39] ?
[01:01] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[01:02] *** wyatt8740 has joined #archiveteam-bs
[01:17] Possibly
[01:18] i'm grabbing the sound cloud stuff from theblaze
[01:19] *** JW_work21 has quit IRC (Read error: Operation timed out)
[01:25] *** slyphic|a is now known as slyphic
[01:38] *** RichardG has joined #archiveteam-bs
[01:46] *** superkuh has quit IRC (Ping timeout: 252 seconds)
[01:50] *** RichardG has quit IRC (Ping timeout: 633 seconds)
[01:52] *** RichardG has joined #archiveteam-bs
[01:52] *** JesseW has joined #archiveteam-bs
[02:12] *** brayden_ has joined #archiveteam-bs
[02:12] *** swebb sets mode: +o brayden_
[02:15] *** vitzli has joined #archiveteam-bs
[02:17] *** brayden has quit IRC (Read error: Operation timed out)
[03:10] *** snape has joined #archiveteam-bs
[04:19] *** kyan has quit IRC (Leaving)
[04:27] *** kyan has joined #archiveteam-bs
[04:41] *** Microguru has quit IRC (Read error: Connection reset by peer)
[04:54] *** kyan has quit IRC (Leaving)
[04:59] *** superkuh has joined #archiveteam-bs
[05:06] *** JetBalsa has quit IRC (Read error: Connection reset by peer)
[05:11] *** Infreq_ has quit IRC (Ping timeout: 252 seconds)
[05:12] *** Infreq has joined #archiveteam-bs
[05:24] *** acridAxid has quit IRC (Quit: marauder)
[05:26] *** acridAxid has joined #archiveteam-bs
[05:40] *** Sk1d has quit IRC (Ping timeout: 200 seconds)
[05:46] *** Sk1d has joined #archiveteam-bs
[05:52] *** JW_work2 has joined #archiveteam-bs
[05:52] *** vitzli has quit IRC (Leaving)
[05:55] *** JW_work2 has quit IRC (Client Quit)
[06:31] *** nickname_ has quit IRC (Ping timeout: 300 seconds)
[06:40] *** GLaDOS has quit IRC (Read error: Operation timed out)
[06:41] *** GLaDOS has joined #archiveteam-bs
[07:27] *** vitzli has joined #archiveteam-bs
[07:39] *** yipdw has quit IRC (Ping timeout: 633 seconds)
[07:57] *** JesseW has quit IRC (Quit: Leaving.)
[08:06] *** yipdw has joined #archiveteam-bs
[08:12] *** yipdw has quit IRC (Ping timeout: 246 seconds)
[08:19] *** JesseW has joined #archiveteam-bs
[08:43] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[09:01] *** schbirid has joined #archiveteam-bs
[09:01] *** yipdw has joined #archiveteam-bs
[09:18] TIL that some of IA's items lack a listed "uploader", e.g. https://archive.org/details/CaseofSp1940 (from the prelinger collection, uploaded in 2002 apparently)
[09:25] some items have "is_darked": True and no metadata at all regarding files and checksums
[09:26] I think I saw the one with no uploader too
[09:49] Hm, I thought most with is_darked gave nothing for metadata.
[09:49] vitzli: example?
[09:51] https://github.com/jjjake/internetarchive/pull/123 <- neat little function I just wrote to display likely spam
[10:01] JesseW, only 4 or 5 in IA.BAK collections, one example would be harvardclassics02elio
[10:02] oops, correct key "is_dark"
[10:03] another is Commodore_MicroComputer_Issue_37_1985_Sep_Oct
[10:04] vitzli: https://archive.org/metadata/harvardclassics02elio looks like a normal item -- no is_dark visible
[10:04] same with https://archive.org/metadata/Commodore_MicroComputer_Issue_37_1985_Sep_Oct
[10:05] hm
[10:06] what data are you seeing it in?
[10:06] the IA.BAK copies?
[10:07] no, the census on IA.BAK items I did
[10:08] hm, that's very odd
[10:09] it's fine now
[10:09] vitzli: pastebin your census entry for one of the ones above?
[10:09] https://paste.ee/p/cKZFK
[10:10] ah, yeah, I have seen that too
[10:11] e.g. https://archive.org/metadata/0002Mistakes
[10:12] but my list of darked ones doesn't include either of those
[10:13] vitzli: when is that paste from?
[10:13] i.e. when did you run your census?
[10:14] 2016-02-03
[10:14] that's bizarre, because https://archive.org/history/Commodore_MicroComputer_Issue_37_1985_Sep_Oct shows no changes since 2015-08
[10:16] the created and uniq values are also different
[10:17] maybe something went really wrong
[10:19] yeah -- also, just as I was watching, a 2nd copy showed up
[10:19] your data shows it only on ia902600 -- and that's what I saw a few minutes ago, but now it is also on ia802600
[10:21] no is_dark items on internetarchivebooks collection
[10:21] I'll drop my result and redo the census
[10:21] vitzli: also, the created value in your data is from the same day you did it.
[10:21] Feb 3
[10:21] (the created value is in unix epoch)
[10:22] how many identifiers are you doing in your census?
[10:22] BTW, I've uploaded some of mine to FOS -- it should get up to IA pretty soon now.
[10:23] 142462 non-unique identifiers
[10:23] 106054 uniques
[10:23] I thought you were just doing the IA.BAK stuff -- that's much smaller.
[10:23] I was going to redo the census anyway, because of duplicates/missing ids
[10:24] all items in all collections listed on iabak page
[10:24] That's still much less than 10 million identifiers, I think.
[10:24] I don't understand the 10 million part
[10:25] Ah, that was a mistake by my eyes. I misread 106,054 as 10,060,054 or something like that.
[10:26] Yeah, about a hundred thousand sounds about right for IA.BAK collections.
[10:27] The total number generated by jake back in March 2015 was 14,926,080.
[10:27] (with one duplicate, I think)
[10:29] so 1% then, not bad considering it took less than an hour
[10:30] wow, that is quick
[10:31] If you have the space, you might consider downloading the old (or my new, once it gets there) census data and extracting identifier lists from that, as it does include collections information
[10:32] that would be the next step or one more step away, I still need to fix duplicates in my data
[10:32] in any case, I should sleep
[10:32] 2:30 AM where I am...
[10:32] good night, it's 16:32 here
[10:33] ah, just late afternoon
[10:35] *** JesseW has quit IRC (Quit: Leaving.)
[11:35] *** dan- has quit IRC (Read error: Operation timed out)
[11:49] *** oldcad has joined #archiveteam-bs
[12:10] JesseW, could you run ia-mine on sahfgtb2007-10-31.sbd.flac item, please? I get an empty string with 'ia-mine', but 'ia metadata' returns json
[13:03] *** dan- has joined #archiveteam-bs
[13:35] *** nickname_ has joined #archiveteam-bs
[13:42] *** Stilett0 has joined #archiveteam-bs
[13:44] *** Stiletto has quit IRC (Read error: Operation timed out)
[14:00] *** RichardG has joined #archiveteam-bs
[14:10] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[14:10] *** GLaDOS has joined #archiveteam-bs
[14:35] *** GLaDOS has quit IRC (Read error: Operation timed out)
[14:37] *** GLaDOS has joined #archiveteam-bs
[14:54] *** zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
[14:59] *** Start has quit IRC (Quit: Disconnected.)
[15:48] *** Start has joined #archiveteam-bs
[16:56] *** vitzli has quit IRC (Leaving)
[17:01] *** JesseW has joined #archiveteam-bs
[17:04] *** VADemon has joined #archiveteam-bs
[17:07] *** Start has quit IRC (Quit: Disconnected.)
[17:09] *** JesseW has quit IRC (Quit: Leaving.)
[17:09] *** JW_work2 has joined #archiveteam-bs
[17:20] *** JW_work2 has quit IRC (Leaving.)
[17:35] SketchCow: i'm up to 2008-03-31 with kpfa
[17:36] Fantastic
[17:37] *** JW_work2 has joined #archiveteam-bs
[17:38] *** JesseW has joined #archiveteam-bs
[17:40] *** JW_work2 has quit IRC (Client Quit)
[17:42] attic init /mnt/backup/erotica
[17:57] *** JesseW has quit IRC (Quit: Leaving.)
[17:57] *** slyphic is now known as slyphic|a
[18:12] haha abusing youtube as storage sounds fun D:
[18:12] antomatic: hmmm drat, true.
[18:12] put a cloud in a cloud in a cloud
[18:12] but contentid stuff is generally _already_ at some somewhere
[18:13] at least somewhere*
[18:13] *** weles has joined #archiveteam-bs
[18:13] problem is there's no way to tell - so if you upload 30,000 files and more than 3 of them get flagged/taken down, they're all lost
[18:14] unless you distribute it across lots and lots of channels
[18:14] oh right
[18:14] hmmm
[18:14] or unless everything is encoded in such a way that the representation is not, in itself, a copyvio
[18:14] cept I thought I had more than 3
[18:14] because lots of gaming stuff gets flagged
[18:14] convert bits to dots, add as frames for a video
[18:14] not that it'd stop google if they thought that people were abusing their generous free peanut offer
[18:15] and add some repair stuff in case a video is deleted
[18:15] actually contentID flags are OK, it's takedowns that count for the '3 strikes' purposes.
[18:15] Normally mine just says 'this content is not allowed in some countries or something:/
[18:15] Dear google, please archive soundcloud, kthx
[18:15] Ah ok nope, no takedowns on me
[18:26] SketchCow: Can you have a look at https://archive.org/details/microsoft_word_5.5_german before I upload more DOS software? And yes, I fucked up mediatype. Can this be fixed?
[18:27] Fixed it.
[18:27] Other than mediatype, it worked.
[18:31] Thanks.
[18:31] Screenshots displaying under it are this tricky stupid business.
[18:31] basically, make the screenshot a GIF and name it .gif
[18:32] Upload that, it'll be under it.
[18:32] Or I can run my screenshotter on it.
[18:32] I’ll convert the .png
[18:35] Hm, Word itself does not seem to work though. Stuck at the copyright dialog for me.
[18:45] *** Stilett0 has quit IRC (Read error: Operation timed out)
[18:48] *** Start has joined #archiveteam-bs
[18:50] *** beardicus has quit IRC (bye)
[19:00] Really? I got right in.
[19:00] *** mismatch has joined #archiveteam-bs
[19:01] *** phuzion has quit IRC (Remote host closed the connection)
[19:02] *** phuzion has joined #archiveteam-bs
[19:02] *** Start has quit IRC (Quit: Disconnected.)
[19:18] Firefox 38 here.
[19:20] Nope, mouse not working, fullscreen blanks the window.
[19:22] PurpleSym, crashed Opera
[19:24] Works on Chrome 48 and Firefox 44 for me
[19:25] Ok, thanks for testing.
[19:25] works chrome 47 linux
[19:25] *** Start has joined #archiveteam-bs
[19:36] *** RichardG has quit IRC (Ping timeout: 492 seconds)
[19:46] *** Stiletto has joined #archiveteam-bs
[20:02] *** RichardG has joined #archiveteam-bs
[20:37] *** slyphic|a is now known as slyphic
[20:43] *** atlogbot has joined #archiveteam-bs
[20:45] *** Start has quit IRC (Quit: Disconnected.)
[20:48] *** signius has quit IRC (Read error: Operation timed out)
[20:59] *** signius has joined #archiveteam-bs
[20:59] *** weles has quit IRC (Read error: Operation timed out)
[21:06] *** mismatch has quit IRC (Ping timeout: 260 seconds)
[21:38] *** bzc6p has joined #archiveteam-bs
[21:38] *** swebb sets mode: +o bzc6p
[21:47] SketchCow: i'm grabbing metadata that could be used to grab pacifica Radio Archives
[21:49] So the other day I've written my first "please let us archive content of your site before it possibly gets all deleted", and I could experience being upset for being refused.
[21:49] Especially cute that the reply was doubly ambiguous thanks to my beautiful language.
[21:51] heh dialogue is the first step towards something good
[21:51] The transcript would be like "we're (unable|unwilling) to [let you] archive our site", so it may have four meanings, but basically all say GTFO.
[21:51] im in discussion with the friendsreunited guy, currently going good
[21:51] Being polite enough, however, to say "thank you for the suggestion" and "we ask for your understanding".
[21:51] got a long email to create tomorrow with ideas and such
[21:52] Good luck for that.
[21:52] yeah a helpful tone goes a long way
[21:53] I wrote a very kind and long letter too. But what to do if it doesn't match their business policy.
[21:54] you win some, you lose some
[21:54] Also, one can't do anything to archive this site, as it's down since July. They say it's a database problem but stuff is not lost. And, most importantly,
[21:54] WE ARE CONTINUOUSLY WORKING ON IT
[21:54] Yeah, since July. Continuously.
[21:55] Sorry for flooding with this, but felt good to tell my first time experience
[21:56] either the company wants to do something helpful, or they just want to trash it
[21:56] especially when upset for 20 million photos probably being lost while they pretend "Oh, don't worry, we'll fix it."
[22:02] I hate such temporary states.
[22:05] Their conscience may not let them say "Well, we'll just delete all your stuff", but they don't want to tire with fixing the database. (Can a database go sooo wrong that they say every day "yuck, maybe tomorrow"?)
[22:05] Sometimes I say that too. The difference is that I'm not sitting on tens of terabytes of user data.
[22:06] Okay, I'm done with my outrage.
[22:08] *** slyphic is now known as slyphic|a
[22:14] bzc6p, I once had to try to migrate a forum between an eight-year-old long-abandoned software and something newer. MySQL DB was a mere 600MB or so. Each attempt to convert it over into a newer more standard format took about seventy hours on a middle-of-the-road dedicated server. And it took about ten tries to get it right, not gonna lie. >.<
[22:15] *** espes__ has quit IRC (Read error: Operation timed out)
[22:17] SketchCow: starting to upload 2008-04 of kpfa
[22:20] Those who have run a site for ten years should have the knowledge and time to do that. Also, it seems they could just re-create a new database from the directory structure (maybe some info would be lost but at least images could be shown per user). Also, WHERE IS THEIR DATABASE BACKUP? Also, why haven't they done anything for months?
[22:22] *** SN4T14 has joined #archiveteam-bs
[22:22] *** SN4T14 has quit IRC (Connection closed)
[22:22] I can't see any viable excuse. I consider this a nasty way of shutting down.
[22:23] * joepie91 agress
[22:23] agrees*
[22:25] *** Stiletto has quit IRC (Read error: Connection reset by peer)
[22:25] I am agress.
[22:26] *** Stiletto has joined #archiveteam-bs
[22:29] Give 'em grudging respect for not selling the "assets" (i.e. user info) to some sleazy marketing corp, at least. Cough, Myspace, cough.
[22:30] userdata is $$$
[22:39] *** bzc6p has left
[22:47] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[23:00] *** bzc6p has joined #archiveteam-bs
[23:00] *** swebb sets mode: +o bzc6p
[23:03] I know I've spoken enough of that service, but there are a few more laughable details I just discovered and can't help telling it.
[23:05] 1. I wrote directly to the company's contact: the company name was on the main page (now contains just plain text about the error) as kind of a signature. Right after they received my mail, they removed their company name from there.
[23:05] 2. The only contact info left there now is just an email address, which, by the way, is not working.
[23:06] * bzc6p doesn't know whether to laugh or to cry
[23:09] Maybe it's time to say I'm talking about fotoalbum.hu
[23:10] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:11] joepie91: I thought you were the most quick-tempered here in ArchiveTeam. Maybe not.
[23:11] Sorry everyone again for writing too much. I'll try not to appear in a while.
[23:12] bzc6p: don't worry about it, I think we all silently share the frustration :p
[23:13] * xmc nods quietly
[23:13] i wrote to a url shortener years ago about sharing their db
[23:13] they said "fuck off i'm running this forever"
[23:13] they went bust months later?
[23:14] six months ago i registered the domain because they let it lapse and nobody cared enough to squat it
[23:14] It seems all of us have their own stories.
[23:18] url shorteners are a dime a dozen.
[23:18] lispurl.com was unique :(
[23:20] Sure it was. :)
[23:21] you'd get things like http://lispurl.com/caadadaaadddar
[23:23] Here's my list from about a year ago that I created from the twitter stream: https://gist.github.com/scumola/5216839
[23:23] That's the top-1000 url shorteners used on twitter in 2014.
[23:23] I'd bet money that 1/2 of them are not working anymore.
[23:27] *** nickname_ has quit IRC (Ping timeout: 300 seconds)
[23:29] I like this one - really short: http://v.ht/
[23:35] *** bzc6p has left
[23:35] Not sure if anyone's interested, but there's a photo/file host in the gulf with no discernable download limits and sequential numeric IDs; mrkzgulf.com/do.php?img=NNNNNN, where NNNNNN is roughly 190000 or below. Most pics appear to be Daesh propaganda; most non-pic files appear to be split rars of older scene movie releases. O.o
[23:39] *** nickname_ has joined #archiveteam-bs
[23:44] hm, that could be a good idea to snag
[23:44] a single-person wpull project, even :P
[23:48] I did an exploratory scrape of IDs 187950-189950 just for the hell of it (2k items), and it wound up being about 23GB, I think the total was. If my math is right that'd make the whole thing probably under 2TB altogether, perhaps a lot less depending on what the starting ID is, which I haven't explored.
[23:50] *** SN4T14 has joined #archiveteam-bs
[23:50] *** SN4T14 has quit IRC (Connection closed)
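
Editor's note: the size estimate in the [23:48] message can be checked with a few lines of arithmetic. This is only a rough sketch using the figures quoted in the log (a 2k-ID sample weighing about 23GB, IDs running up to roughly 190000). The starting ID was not explored, so `FIRST_ID = 0` is an assumption that gives an upper bound, and the function and constant names are made up for illustration.

```python
# Back-of-the-envelope extrapolation from the sample scrape quoted in the log.
# All figures come from the chat; FIRST_ID is an assumption (the real starting
# ID is unknown), so the result is an upper bound.

SAMPLE_FIRST_ID = 187950
SAMPLE_LAST_ID = 189950
SAMPLE_SIZE_GB = 23.0   # "about 23GB"
HIGHEST_ID = 190000     # "roughly 190000 or below"
FIRST_ID = 0            # assumption: IDs start at 0


def estimate_total_gb(first_id, last_id, sample_ids, sample_gb):
    """Scale the sample's average size per ID to the whole ID range."""
    gb_per_id = sample_gb / sample_ids
    return (last_id - first_id) * gb_per_id


sample_ids = SAMPLE_LAST_ID - SAMPLE_FIRST_ID  # the "2k items" in the log
total_gb = estimate_total_gb(FIRST_ID, HIGHEST_ID, sample_ids, SAMPLE_SIZE_GB)
print(f"estimated total: {total_gb:.0f} GB")
```

With a starting ID of 0 this comes to roughly 2.2TB, in the same ballpark as the "under 2TB" guess; a higher starting ID lowers the figure proportionally. It also assumes the sampled IDs are representative of the rest, which the log does not establish.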