[00:00] *** Jon has quit IRC (Quit: ZNC - http://znc.in) [00:01] *** RichardG_ has joined #archiveteam-bs [00:01] *** RichardG has quit IRC (Read error: Connection reset by peer) [00:06] *** icedice2 has quit IRC (Read error: Operation timed out) [00:17] *** robink has quit IRC (Quit: No Ping reply in 210 seconds.) [00:19] *** tuluu_ has quit IRC (Quit: No Ping reply in 180 seconds.) [00:20] *** kristian_ has quit IRC (Quit: Leaving) [00:21] *** tuluu has joined #archiveteam-bs [00:22] *** robink has joined #archiveteam-bs [00:29] *** dd0a13f37 has quit IRC (Quit: Connection closed for inactivity) [00:33] *** robink has quit IRC (Read error: Connection reset by peer) [00:35] *** zgrant has quit IRC (Quit: Leaving.) [00:35] *** zgrant has joined #archiveteam-bs [00:37] *** robink has joined #archiveteam-bs [00:47] *** robink has quit IRC (Read error: Connection reset by peer) [00:52] *** robink has joined #archiveteam-bs [01:06] SketchCow: i'm starting to upload my random captures of qvc japan [01:21] *** kyounko has quit IRC (Read error: Operation timed out) [01:23] *** Ceryn has joined #archiveteam-bs [01:28] *** wp494 has quit IRC (Read error: Operation timed out) [01:34] *** zgrant has quit IRC (Quit: Leaving.) [01:34] *** zgrant has joined #archiveteam-bs [01:35] *** zgrant has quit IRC (Client Quit) [01:35] *** wp494 has joined #archiveteam-bs [01:35] *** zgrant has joined #archiveteam-bs [01:38] hurrah [01:40] i'm going to capture a ton of qvc japan over the next few days [01:46] *** wp494_ has joined #archiveteam-bs [01:49] *** wp494 has quit IRC (Read error: Operation timed out) [02:14] *** pizzaiolo has quit IRC (Remote host closed the connection) [02:25] *** wp494_ is now known as wp494 [03:15] *** Odd0002 has quit IRC (Ping timeout: 600 seconds) [03:24] *** Odd0002 has joined #archiveteam-bs [03:34] *** Odd0002 has quit IRC (Ping timeout: 506 seconds) [03:44] *** Odd0002 has joined #archiveteam-bs [04:24] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES) [04:25] *** wp494 has joined #archiveteam-bs [04:39] *** Odd0002 has quit IRC (Ping timeout: 248 seconds) [04:55] *** qw3rty118 has joined #archiveteam-bs [04:56] *** BlueMaxim has joined #archiveteam-bs [05:01] *** qw3rty117 has quit IRC (Read error: Operation timed out) [05:17] *** zgrant has left [05:30] *** Odd0002 has joined #archiveteam-bs [06:17] *** Odd0002 has quit IRC (Quit: ZNC - http://znc.in) [06:19] *** Odd0002 has joined #archiveteam-bs [06:32] *** Valentin- has joined #archiveteam-bs [06:33] *** Valentine has quit IRC (Ping timeout: 506 seconds) [06:45] *** jschwart has joined #archiveteam-bs [06:50] *** jschwart has quit IRC (Client Quit) [07:18] *** Mateon1 has quit IRC (Ping timeout: 260 seconds) [07:18] *** Mateon1 has joined #archiveteam-bs [07:32] *** kimmer12 has joined #archiveteam-bs [07:38] *** kimmer1 has quit IRC (Ping timeout: 633 seconds) [07:42] *** kimmer1 has joined #archiveteam-bs [07:48] *** kimmer12 has quit IRC (Ping timeout: 633 seconds) [07:53] *** kimmer1 has quit IRC (Ping timeout: 633 seconds) [08:06] *** schbirid has joined #archiveteam-bs [10:15] SketchCow, godane could either of you put 30 new 750GB 2.5" Momentus Hybrid SSHD Drives to use? 
(on at/ia related things) [10:28] odemg: yes, make an rsync box [10:36] *** Mateon1 has quit IRC (Remote host closed the connection) [10:36] *** Mateon1 has joined #archiveteam-bs [10:43] jrwr, yeah I'll be getting these to whoever can do something with them, document what they did and talk about it in an r/DataHoarder post [10:44] nice [11:03] *** pizzaiolo has joined #archiveteam-bs [11:23] odemg: Interesting. I'll happily upfront most of a server build for an rsync target in the US [11:23] Ideally in California [11:23] fuck [11:23] i hate it when an upload to a youtube channel i've been following gets deleted [11:24] Igloo, exactly what we need [11:24] I'm not US based though, we'd need someone to help look at colo [11:25] *** Specular has joined #archiveteam-bs [11:27] so I've read flash memory cards have a terrible shelf life for retaining data without corruption. Hope my two year old backup of something isn't fucked. Will be buying a HDD tomorrow and transferring the contents. [11:27] Igloo: there are a few out there, the best is if we can get hands on it there at the colo [11:36] btw, for hdd brands it seems HGST is more reliable in those famous Backblaze results, but it's harder to compare to WD drives since they lack the same volume. I've always used WD without problems, but is HGST more reliable for long-term storage (even their 2.5" drives for ex)? [11:50] Omfg this is a thing now https://www.reddit.com/r/spacex/comments/7lez5n/elon_musks_midnight_cherry_tesla_roadster/ [11:51] That's the car ON the payload mount [12:11] jrwr: Yep. We'd need someone who either lives locally to be able to do it or some sort of IPKVM / power bar solution [12:11] hmm [12:11] jrwr sending the car to mars huh, interesting [12:12] I know where.. but they aren't cheap at all [12:12] Define not cheap [12:14] not sure of an exact figure. I was thinking Psychz but they aren't cheap [12:16] General Button Pressing $0 [12:16] How much is specific button pressing in order? [12:21] a million quid [12:21] per button [12:21] extra if the button is very deep [12:21] £5 million if it needs a paperclip [12:34] *** icedice has joined #archiveteam-bs [12:36] I've asked for pricing for a total of 4u HCross2. 2 x 2u SuperMicro chassis, one 24 bay for storage / s3 and one 16 bay with disks and a load of NVME drives for the megawarc factory [12:36] If you then back to back the servers over 10G it'll be stupid fast. [12:36] ah nice [12:37] Let's see. Be a good case and maybe a chance to use some of the crowdsourced funds that asparagirl was looking at [12:49] *** icedice has quit IRC (Read error: Connection reset by peer) [12:55] *** icedice has joined #archiveteam-bs [13:06] SketchCow: so when is your new box of tapes getting shipped? [13:07] to me [13:10] Oh man a new FOS [13:22] Specular: Backblaze's stats are interesting but essentially irrelevant for almost every use case. The drives are way outside the specs in those storage pods... [13:23] jrwr: Yeah, Elon's sending the car because nobody wanted to send a real payload due to the risk involved. At least that's what I read. [13:23] that is correct [13:24] i'm 100% happy that he IS sending something cool [13:24] JAA, by that do you mean that they're being used far more than intended? [13:25] Specular: They're exposed to way more vibration, in particular. [13:25] Because there are so many other drives nearby. [13:26] Most of the drives they use aren't even rated for NAS usage.
And if they are, it's usually only for up to 6 drives or something like that. [13:28] it's interesting that I haven't seen that brought up in discussion of BB's results, but vibration would be a considerable factor for sure [13:29] It's mentioned every time someone posts the stats on /r/DataHoarder, at least. [13:47] *** BlueMaxim has quit IRC (Quit: Leaving) [14:35] *** Stilett0 has quit IRC (Read error: Operation timed out) [14:48] *** Stilett0 has joined #archiveteam-bs [15:50] *** ola_norsk has joined #archiveteam-bs [15:52] *** MrDignity has quit IRC (Remote host closed the connection) [15:58] *** Odd0002 has quit IRC (Read error: Operation timed out) [16:18] *** Odd0002 has joined #archiveteam-bs [16:56] *** zgrant has joined #archiveteam-bs [17:00] *** Specular has quit IRC (Quit: Leaving) [17:36] *** icedice has quit IRC (Read error: Connection reset by peer) [17:41] *** beardicus has quit IRC (bye) [17:45] *** beardicus has joined #archiveteam-bs [18:01] *** zgrant has quit IRC (Quit: Leaving.) [18:04] JAA: it's partially a datahoarder meme (a bit like ECC on ZFS). what BB actually said is: we don't know if vibration of a lot of nearby drives matters (it probably does), what we DO know is that consumer vs "nas rated" enterprise drives BOTH fail at the same rate in the "hostile pod environment" - https://www.backblaze.com/blog/enterprise-drive-reliability/ [18:05] Odd0002: when in an sqlite db, would having a separate 'text' content table perhaps be beneficial, to prevent storing identical texts in another table; or would the overhead of keeping an index of it all make it slower? [18:05] Odd0002: i'm guessing quite a huge percentage of tweets contain the exact same text content [18:06] ez: Right. They don't have many enterprise drives though. I'd like to see a proper statistical analysis of the data. [18:06] The uncertainty range on the enterprise drives would be huge. [18:06] (Or confidence interval, if we want to go by statistical lingo.) [18:07] JAA: it's also quite possible that the ent rated drives ARE better, when subjected to milder conditions [18:07] and there's no difference only when subjected to extremes [18:07] True [18:09] I just remembered this analysis, by the way: https://hackernoon.com/applying-medical-statistics-to-the-backblaze-hard-drive-stats-36227cfd5372 [18:09] anyhow, given that price per gb drops by what, 15%-20% a year? i guess the worse MTBF could be worth the additional density [18:09] Not sure if that includes any enterprise drives, too lazy to check all the model numbers right now. [18:10] Haha, price drops per GB, I wish... [18:10] it does, it's just a flattening sigmoid [18:10] Prices here haven't changed much in years. It only started dropping again in recent months. [18:10] it is approaching the limit at an ever-decreasing rate, but it does drop [18:10] Here != US, just in case. [18:11] I'm curious to see what the next few years will bring though with MAMR and HAMR. [18:11] yea, there are a lot of weird markups and outright exotic market events [18:11] like those 4-6TB drives in USB frames [18:11] being way cheaper than the same thing, standalone [18:11] Yeah [18:13] One random article from Germany mentions that hard drive prices dropped from 0.09€ to 0.06€ per GB between 2012 and 2017. So apparently it did drop a bit (I didn't really notice that, but I also haven't bought many HDDs in the last few years), but definitely not 15-20% per year. [18:14] SSDs dropped from 0.99€ to 0.17€ over the same timeframe, by the way.
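A quick sanity check on those figures, assuming a constant annual rate of change over the five years 2012-2017:

\mathrm{CAGR} = \left( \frac{p_{2017}}{p_{2012}} \right)^{1/5} - 1

\text{HDD: } (0.06/0.09)^{1/5} - 1 \approx -7.8\%/\text{yr} \qquad \text{SSD: } (0.17/0.99)^{1/5} - 1 \approx -29.7\%/\text{yr}

So HDD prices did fall, but at roughly 8% a year rather than 15-20%, while SSDs comfortably beat that rate.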
[18:16] JAA: ssd and hdd are on different parts of the sigmoid [18:16] Yeah yeah, I know. [18:17] *** kimmer1 has joined #archiveteam-bs [18:19] JAA: as for MAMR, that alone is supposed to flatten the sigmoid too, to the point it will be comparable to ssd perhaps [18:20] trouble being of course nobody does mamr yet, and second, mamr writes are slow, like really slow [18:21] https://regmedia.co.uk/2017/10/12/wdc_mamr_hdd_vs_ssds.jpg [18:21] it's WD's marketing department, so i'd take it with a grain of salt, but it doesn't smell of complete bullshit either [18:24] Odd0002: here's the database 'schema' i have now https://imgur.com/a/gNxNh , could it be done even better perhaps? [18:25] *** jschwart has joined #archiveteam-bs [18:26] ola_norsk: why split the tweet text? [18:26] also, you need a table of who-follows-who to make that data useful [18:27] ez: I'm just hoping that the prices actually decrease in Europe again as well. [18:27] prices per GB, that is. [18:27] EU retail prices are kinda insane, yea [18:27] of everything computer basically [18:28] Here's a compilation of prices per GB over the last ~20 years: https://blog.tralios.de/wp-content/uploads/2016/03/Festplattenpreise2016.png [18:28] In Germany [18:28] ez: the separate text/content table i think could prevent storing identical tweet content [18:28] i call it "vat abuse" [18:28] sure, we DO have vat [18:28] Prices now are essentially the same as ca. 2010. [18:28] but that does not explain the 30% markup on top of that [18:28] most places have vat [18:28] What does it have to do with VAT? [18:28] ez: i'm not sure though, but e.g tweets containing just 'LOL!!' etc [18:29] ez: instead of storing multiple 'LOL!!' tweets, i mean [18:29] JAA: consumers who compare prices with the US think "oh, that's just VAT, that's why it's more expensive in EU" [18:30] but the reality is that we have simply much higher markups, too [18:31] Ah, yeah. [18:32] ola_norsk: the bloated index won't make up for such a "compression". you do, however, want to store retweets as references in some way [18:33] retweets are usually stored as a separate table of 'tweetid, whoretweetedit'. it's somewhat awkward as you don't get a "timeline" of events in one table, but it is compact and fast to query that way [18:35] ez: hmmm..might be able to pick out '@username' from the tweet texts. I think tweep does pick the first '@' as sender, then includes the others in the content text [18:36] oh yea, indexing threads is nice to have. [18:36] basically for each tweet you need to get 1) whom it refers to (multiple) 2) who retweeted it (multiple) [18:37] but things like the actual author of a tweet can be trivially part of the tweet itself, as well as the other data you now have separate [18:38] i have to go by the output of 'tweep' for now i think https://ia801505.us.archive.org/24/items/tweeptestcrash/tweets.txt [18:42] ola_norsk: yea, with that it's easier to just store one line = one row [18:42] especially if you use sqlite and usernames can be indexed simply by the text [18:43] ok [18:44] the retweet stuff is important only when doing a full scrape.
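A minimal sketch of the layout ez describes (author stored inline with each tweet, separate link tables for mentions and retweets); table and column names are illustrative, not taken from ola_norsk's actual imgur schema:

sqlite3 tweets.db <<'SQL'
-- one row per tweet; the author travels with the tweet itself
CREATE TABLE tweets (
    id        INTEGER PRIMARY KEY,  -- twitter's own tweet id
    author    TEXT NOT NULL,        -- username, indexed below
    posted_at TEXT NOT NULL,        -- timestamp as text
    content   TEXT NOT NULL         -- full tweet text, no dedup table
);
-- whom a tweet refers to (multiple rows per tweet)
CREATE TABLE mentions (
    tweet_id  INTEGER NOT NULL REFERENCES tweets(id),
    mentioned TEXT NOT NULL
);
-- who retweeted a tweet: compact and fast to query,
-- at the cost of not having one single "timeline" table
CREATE TABLE retweets (
    tweet_id  INTEGER NOT NULL REFERENCES tweets(id),
    retweeter TEXT NOT NULL
);
CREATE INDEX idx_tweets_author ON tweets(author);
SQL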
hashtag search will show only original tweets [18:44] so no way the entries could be duplicates [18:46] ez: what i mean is if multiple tweets are found containing the exact same text content (example: "43942641632403456 2017-12-21 20:33:58 CET Ur welcome ") [18:46] ola_norsk: also try NetNetruality and NetNeutralty [18:46] ez if some other tweet also is just "Ur welcome " [18:46] the typos virtually always correspond to the literacy of the users, ie you get cher-like tweets [18:47] maybe there are other common typos [18:47] ola_norsk: again, it's "deduping" a problem which isn't one [18:47] ez: but could storing unique texts prevent a lot of duplicates? [18:47] ok [18:55] *** kimmer12 has joined #archiveteam-bs [18:58] *** odemg has quit IRC (Remote host closed the connection) [19:01] *** kimmer1 has quit IRC (Ping timeout: 633 seconds) [19:02] *** kimmer1 has joined #archiveteam-bs [19:03] *** icedice has joined #archiveteam-bs [19:05] after looking through a couple of tweeted links, it seems sending them all to the waybackmachine might cause some 'dubious' sites and pictures to be stored; Is that a problem? :] [19:06] dubious, as in adult content of various sorts :D [19:06] e.g what's the policy of IA on storing nudes? [19:08] * ola_norsk doesn't exactly want to be banned for piping pr0n :D [19:09] *** kimmer12 has quit IRC (Read error: Operation timed out) [19:12] *** kristian_ has joined #archiveteam-bs [19:13] if _accidentally_ waybacking adult material, does IA blame the submitter or the site that got waybacked? [19:14] e.g if a dickpic from a twitter feed got in there.. [19:22] What bit rate should I convert a 256 kb/s MP3 to AAC at for it to be somewhat lossless (yeah, I know it's lossy, I just want the best quality I can get out of that export) [19:22] For best quality, highest bitrate available in AAC. [19:22] But really, why do you want to do that? [19:22] 1kb lower with constant bitrate? [19:23] what JAA said, why re-encode [19:23] 320 kb/s is max in Adobe Premiere Pro [19:23] Because I want to export the video [19:23] I can't export it as an MP3 track [19:24] increasing bitrate on already compressed audio is futile [19:24] Yeah, I figured [19:24] just export them separately, and use e.g ffmpeg to merge the two [19:24] Yep, even with 320 kb/s AAC, you'll lose quality, but you get a larger file. [19:24] I'll just go with 256 kb/s [19:25] How would it be compatibility-wise? It's going into an MP4 container. [19:25] icedice: if you export the video, and have the original mp3, ffmpeg can merge them [19:25] Where do you want to play it? [19:25] idk [19:26] If it's a computer, you can find software that can handle it for sure. [19:26] Embedded systems, *shrug* [19:26] It wouldn't surprise me if the person who is going to get the video uploads it to YouTube and/or Facebook [19:26] they recode anyway i think [19:26] I doubt they're into Vimeo or Dailymotion at least [19:26] Yeah [19:26] YouTube is AAC [19:27] And I figured that they do a crappier job than Adobe Premiere Pro [19:27] *** schbirid has quit IRC (Quit: Leaving) [19:27] They probably spent a ton of time optimising their transcoding. [19:28] Whether it's "maximum quality" is a different question though. [19:28] YouTube has shit quality though [19:28] Yep [19:28] I tell that to my girlfriend all the time, but she insists on using it anyway.
(For listening to music) [19:28] anyway, i'm no expert by far, but i'd say #1 export the video (without audio) then use ffmpeg to merge the untouched audio with the video stream as e.g .MKV, and upload that [19:28] I've cut out parts of the audio track, so I'd need to do that again in like Audacity and then resave it [19:29] i think working on it as wav is then the best [19:29] Which I don't feel like redoing (and would have trouble getting exactly right compared to the video) [19:29] Yeah [19:29] I think I'll just go with AAC [19:30] The assignment is late enough as it is and my teacher seemed pretty pissed at me for uploading the project file instead of the video file as filler to stall the whole thing [19:31] i doubt a teacher would piss on audio quality if it's not 100% shitty :D [19:31] The video is fucked enough anyway [19:33] My phone stopped filming at the 1 hour mark when it ran out of space [19:33] My classmate then started filming it with his phone [19:34] But the selfiestick/tripod hybrid pressed the power button and shut down his phone [19:34] So another classmate continued filming [19:35] TL;DR: There are two gaps in the video track that I filled with two video screenshots each since there's no video of it [19:35] Luckily the external mic worked well and got all of the audio [19:35] if your teacher can hear the difference between 256kb/s audio and 320kb/s.. :D [19:36] There's no 320 kb/s though [19:36] just 256 kb/s and bloated 256 kb/s [19:36] i mean if you're planning to reencode the audio [19:36] Yeah [19:36] I'm going with 256 kb/s though [19:37] 128kb is CD quality.. [19:37] aye [19:37] I thought that was 192 kb/s [19:37] I was just wondering if it would have been possible to go with less kb/s than 256 and still achieve the same audio quality since AAC is better quality-wise than MP3 [19:38] there's no point in going up in kbs on an already compressed audio file though. It would just make a bigger file, with the same (or even a bit shittier) quality [19:39] but, there's no need to even touch the audio if you have the best possible copy of it [19:40] JAA: Well, YouTube is convenient. I use it for music listening as well. Even if I had a headset that was high-end enough for me to hear the difference between YouTube audio and uncompressed audio I probably wouldn't notice anyway. [19:40] Yeah, well I was wondering about going down in kb/s [19:40] Like 256 kb/s MP3 = how much kb/s in AAC [19:40] https://superuser.com/questions/277642/how-to-merge-audio-and-video-file-in-ffmpeg [19:40] Quality-wise [19:41] I'm pretty sure I have that in a .txt document from before [19:41] It's just having to cut it again that is a pain in the ass [19:42] I guess it could be possible in Avidemux if I had the time [19:42] '-c:a copy' would keep the audio from being recoded [19:42] I still need to cut out extra shit from the audio track [19:42] And it has to match the video track exactly, otherwise the lecturer will look like she's lip syncing [19:43] icedice: Well yeah, it's not just that, also that you have to rely on internet connectivity and the music you want to listen to being available on YouTube. [19:43] The main reason for me personally is quality though. [19:44] icedice: if you're planning to use Youtube to show it anyway, it's going to be recoded no matter what at playback [19:44] transcoded*, you mean, right? [19:44] that [19:47] and i'd be frightened of a lecturer who would exclaim "Wait just a damn minute!
..This audio is 244kb/s, not 320!!" [19:47] that'd be golden-ears deluxe [19:55] JAA: You can download it and demux it to AAC/M4A using JDownloader 2 or Youtube-DLG though. I do that sometimes. [19:56] And yeah, I should search for the FLAC files, I'm just a bit lazy with that atm [19:56] I'll have some FLAC torrenting marathon some day when I have time though [19:56] with 'youtube-dl -k' it keeps the video and audio [19:57] https://github.com/MrS0m30n3/youtube-dl-gui [19:57] ^ I was talking about that youtube-dl GUI [20:00] youtube-dl -f bestaudio -o out.m4a [20:00] it's horrible tho, 128k aac iirc [20:00] i just use -k / --keep-fragments [20:00] why? [20:01] to keep best audio present since youtube-dl often merges and deletes [20:01] 251 webm audio only DASH audio 142k , opus @160k, 3.59MiB [20:01] oh neat [20:02] youtube-dl -f bestaudio -o out.opus then [20:02] ola_norsk: huh? [20:02] no [20:03] ytdl preserves the bitstream unless you tell it to do something stupid, like output mp3 [20:04] e.g 'youtube-dl -k' will keep the audio and video separate, without deleting them when merging into the container [20:05] yea [20:05] and if you let it do its thing, that is just set an output container compatible with the dash track, it will mux it from fragments into a single file [20:05] this is *not* transcoding [20:05] i find it useful when archiving a youtube video that's pretty much just interesting speech/talk [20:06] oh [20:06] e.g this item https://archive.org/details/Tay_Zonday_Net_Neutrality_talk [20:06] your intent actually is to keep a/v separate [20:06] yea, that probably makes sense for a talk show [20:06] aye [20:07] that way it's possible just to listen to it, since it's just talk anyway with no important visual stuff [20:21] speaking of which, it would be cool if IA made a secondary 'mediatype' possible for items, like with 'test item' that is also listed as both main 'community' and 'Test Collection' :D https://archive.org/details/superfunky59_Series_of_Tubes_Music_Video [20:22] this item is audio, but _does_ contain video [20:23] Btw, has Internet Archive gone to Canada yet? [20:23] They'd set up a backup facility there, right? [20:23] i think so [20:24] someone posted a link here once where the infrastructure could be viewed [20:25] ola_norsk: That's not what the mediatype is about though. What you mean is that an item can be in multiple collections, and that's already the case I think. [20:25] ah ok [20:26] it's not in any collection other than 'Community Audio', but i'm guessing IA also goes by filetypes [20:26] I don't get why they'd set up shop in Canada [20:27] Not counting that it's a Five Eyes country, it's geographically the US's neighbour and it piggybacks on the US's soon-to-be much worse Internet [20:27] if it was made available, they should though [20:28] Switzerland or something would have been better imo. Very good privacy laws and geographically distant and safe. [20:28] i've been sending some emails around here in norway [20:29] But I'm just a random guy on the Internet with >opinions, so what do I know [20:29] problem is, i kind of need some sort of 'presentation' [20:30] i'm just a drunk fuck on an island on the west coast of norway; So, it would need someone with a bit more 'umph' behind it to be anything more [20:32] when i got the question (translated); 'have you talked to Brewster Kahle?' , i was damn close to writing back 'who the fuck do you think i am? All i asked you was your stance on a question!'
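Pulling the two recipes from the exchange above into one sketch (file names and the URL are placeholders, not from the log): muxing an exported video-only file with the untouched MP3 via stream copy, and fetching YouTube's best audio-only track without transcoding it.

# mux without re-encoding: -c copy stream-copies both tracks into mkv
ffmpeg -i video_only.mp4 -i audio.mp3 -map 0:v -map 1:a -c copy out.mkv

# grab the best audio-only DASH track (often opus) as-is
youtube-dl -f bestaudio 'https://www.youtube.com/watch?v=VIDEO_ID'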
[20:33] also, it seems to be a common belief that IA is merely 'waybackmachine' [20:34] so, a kind of official 'pitch deck' would be very nice to present [20:35] but, according to 'Norsk Dataforening' (which is not small)..the response to a Norwegian mirror is 'i like the idea' [20:36] followed up with 'what more can you tell me about it'..and there's my problem [20:37] Are you pitching a Norwegian Internet Archive mirror or am I misunderstanding something? [20:37] so if someone with a bit more 'panache' .. http://www.dataforeningen.no/in-english.128921.no.html [20:37] Because that would be pretty sweet [20:37] icedice: aye [20:38] I remember seeing a video on YouTube of some old mine in Norway that they had made into a data center [20:38] Looked pretty sweet [20:38] according to their representative it's a 'good idea'..but i can not be the one carrying it further :D [20:41] EU IA is an interesting dilemma [20:41] on one hand, storage hardware almost twice as expensive [20:41] on the other hand, bandwidth is about 4x cheaper [20:41] Norway is member of EU, we're kind of strange like that :D [20:41] i guess ia is more constrained by storage than bw tho [20:41] not* [20:42] aye, but e.g here in norway there's not only focus on preservation of old shit, but also 'green power' [20:43] that's probably fine, green power often means cheaper power these days [20:43] it's funny for norway to be green-obsessed, when they're arguably the biggest source of CO2 of all eu countries [20:43] not directly, but they originate that much oil nonetheless [20:44] hehe, that might be true, but the gasoline is still more expensive here than the same gasoline when it's exported :/ [20:44] same source though..weird how that works :D [20:44] you're a nordic country [20:44] nordic means insane taxes [20:44] The Netherlands, Germany, and France are pretty cheap for hosting though [20:44] except iceland (?) for some reason [20:45] they were never true vikings to begin with [20:45] they're better than the swedes ;) [20:45] hehe, but damn, this is getting way off topic [20:46] back to topic: I've envisioned e.g https://greenmountain.no/ [20:47] Yeah, that's one of the data centers I was just looking at [20:47] that's where i started to nag first, on twitter.. ~7 months ago, never got a response..so, it needs someone bigger..And Dataforeningen is mildly put quite big [20:48] https://www.youtube.com/watch?v=gYrvRMWiZCA [20:48] https://www.youtube.com/watch?v=aTjF2hJiack [20:48] https://www.youtube.com/watch?v=oN9on73BmSs [20:50] all i know is that when an official representative of NCS writes back 'I like the idea, what more can you tell me about it?'..that's no small thing [20:51] "The Norwegian Computer Society turned 50 years in 2003" [20:52] it needs a response with equal punch though..sadly, i can not provide that :/ [20:52] basically, every IT company in Norway is a member of NCS [20:53] probably Green Mountain AS as well [20:54] at the very least, i need some sort of presentation endorsed by IA, or someone in IA, to send [20:55] ola_norsk: As I think I mentioned the last time you brought this up -- you *do NOT need any permission* to mirror a whole bunch of IA. [20:55] can't just say 'i like backups!' :/ [20:55] Somebody2: i need a 'pitch' though [20:55] Hm, not sure what you mean. [20:55] Somebody2: a kind of 'this is Internet Archive, and this is why our work is important' [20:56] *** kristian_ has quit IRC (Ping timeout: 360 seconds) [20:56] whether it be video, article or powerpoint slides [20:56] Ah.
Does the existing pitch currently displayed in a large banner on every IA page not suffice? [20:56] What about textfiles's 30-days-of-neat-stuff-on-IA tweets? [20:57] Also, you could identify particular collections on IA that you want to suggest NCS mirror, and make a pitch based on those. [20:58] I think it would be FABULOUS if NCS dedicated a few dozen petabytes to mirroring some of the publicly downloadable parts of IA. [20:58] And they could do that without ANY coordination or permission from IA. Just do it, then send an email afterward going, ... [20:59] "Hi, thought you should know we've made a mirror of all this, if you want to direct people to it. [20:59] in my thinking, NCS would be the ones that swayed the Norwegian Government to make sure a complete mirror exists [20:59] "And we'd be glad to mirror some of your restricted stuff, too, now that we've shown we can do the job." [21:00] Having NCS lobby the Norwegian government does make sense, yes. [21:00] aye [21:00] My point is just that doing that does NOT require any coordination or involvement by IA. [21:00] At least for the initial dozen petabytes of mirrored data. [21:01] whether it be the educational department or the culture/historical department, both of which have a say in the matter [21:02] I'd reply back to the person who said it was a good idea, informing them that it doesn't require any coordination with IA, and ... [21:02] Somebody2: in case iabak got some serious traction, is it possible to count on ia's *support* of that endeavor? [21:02] ez: Yes, IA is supportive of mirrors, as I understand (I don't have any formal connection to them, though). [21:03] namely, better access to the current snapshot of ia's data. the current query api works nicely for individual items, but it gets awkward quickly when things are done at a scale this massive [21:03] ez: Well, the IA census seems to work well enough. [21:03] You are familiar with that, right? [21:03] Somebody2: do you have an email address where i might forward the emails to? [21:04] ola_norsk: what, the ones from NCS? Why do you want to forward them? [21:04] Somebody2: of the emails/replies i had with the person in NCS [21:04] Again, why forward them? You DO NOT NEED ANY HELP FROM archive.org FOR THIS. [21:05] Somebody2: yes, i'm familiar with iamine from todd's effort to timestamp it [21:05] ez: good [21:06] ola_norsk: the next step is for you (and/or the person at NCS) to write up a proposal to the Norwegian government to fund storage. [21:07] ola_norsk: then just use the existing torrents provided by IA to mirror a bunch of stuff. [21:07] (storing it on the storage paid for by the Norwegian government) [21:08] Somebody2: i have no connection to IA, no real say in the matter. Basically, when it comes to being a 'middle man' of getting a complete active mirror of archive.org established..I might not be the middle-man that's needed. :D [21:08] There IS NO MIDDLE-MAN needed, as I keep telling you. [21:09] Somebody2: also, i didn't see this explicitly stated anywhere, but is this data for archive.org/web/* as well [21:09] You don't need the permission, knowledge, or connection to ANYONE at IA to do this. [21:10] ez: The IA census does include hashes for the Wayback Machine data, in the "private" section (since the files aren't directly downloadable). [21:10] neat [21:10] Somebody2: how can a full copy of IA be made, functioning as a 'node' then ? [21:11] ola_norsk: Once a mirror of the publicly downloadable data has been made (and paid for), *THEN* reach out to IA about mirroring the rest.
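For reference, a sketch of what that permission-free mirroring can look like with the internetarchive command-line tool (pip install internetarchive); the collection name is just an example, and each item also publishes an <identifier>_archive.torrent file for the torrent route mentioned above:

# list every item identifier in one public collection
ia search 'collection:prelinger' --itemlist > items.txt

# fetch each item's files into ./<identifier>/
while read -r id; do
    ia download "$id"
done < items.txt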
[21:11] so basically "serious iabak" would amount to 1) better access to WBM data 2) a bit saner query api to query diffs from the last time. iamine is kinda slow, and i dont see any reason for it to be [21:11] its just a fairly straighforward database dump [21:12] ez: Eh, data on IA really shouldn't be changing regularly, so no, I don't think better access to diffs is much of a problem. [21:13] Somebody2: i mean delta since the last time [21:13] ideally there would be some IA's official append log structure, not for people to awkwardly reconstruct it every time [21:14] As I see it, the main next step for IA.BAK is clients for more platforms, that are easier to install, and a bunch of promotion to get lots more people to sign up. [21:14] Somebody2: that's kind of the problem as i see it.. ME, alone, reaching out is useless. I can barely reach the toilet in time when i have to take a piss. IA, like someone here said, is not a small thing. [21:14] *** sep332 has quit IRC (Read error: Operation timed out) [21:14] ez: A better log structure would certainly be nice, but I don't think it's a blocker for IA.BAK. [21:15] its a blocker to do this in serverless fashion [21:15] Somebody2: i think the maximum of my effort and use would be to get someone in NCS and IA to contact eachother and talk further [21:15] iabak doesnt need to be centrally coordinated, it works perfectly fine as a stochastic endeavor [21:16] provided the input to the backed up space is uniform [21:16] which it isnt atm [21:16] *** odemg has joined #archiveteam-bs [21:16] ola_norsk: Why, given that NCS *does not need any help from IA* in order to mirror a bunch of the content? [21:16] ola_norsk: I think the good use for your time and effort is to *inform* the person at NCS who you spoke to that they don't need IA's permission to mirror. [21:17] And encourage them to write up a grant proposal for server space. [21:18] ez: Let's not let the possiblity of a serverless architecture block progress on an existing backup. [21:18] Somebody2: given that current iabak stands at 0.5% progress to backup IA, its a bit premature optimization [21:18] im stipulating that looser coordination would yield better number than that [21:18] im not interested in "better" platform support, im interested in a client which doesnt need to coordinate at all [21:19] ez: OK, but can you write such a client without any change to IA's existing infrastructure? If no, it's not as good as one that CAN be written that way. [21:20] i can provided ia provides authoritative snapshot over the domain so the randomly picked items are uniformly random [21:20] basically current server architecture is p much result of IA not doing that [21:20] Yes, but they don't (yet). [21:21] I also don't understand what you mean by "randomly picked items are uniformly random" [21:21] they must be [21:21] ? [21:21] We should also take this to the #iabak channel [21:21] ah [21:21] er, #internetarchive.bak [21:34] Somebody2: here, the email exchange, https://archive.org/details/InboxStordabuenprotonmail-temp-item ..I'm neither an orginazer, spokes person or lobbyer of any imaginable sort. So, if someone are able to bring it further, that would be cool. [21:35] Ha. OK, well thanks for opening up the dialog in any case. [21:36] hopefully there's someone more eloquent than me to keep it going though [21:36] *** BlueMaxim has joined #archiveteam-bs [21:41] ola_norsk: Could you at least reply once more to let them know they can move forward WITHOUT any coordination with IA? 
I think that's not expected (most organizations keep a much tighter hold of their materials than IA does) [21:42] so letting the person you spoke to at NCS know that would be good. [21:42] Somebody2: like i mentioned before, that person, like many others, seems to think IA is just waybackmachine.."I'm quite aware of Waybackmachine" [21:43] ola_norsk: Sure; so informing them that there are petabytes of material on IA that they could arrange to mirror without ... [21:43] ... any coordination with IA would be really good! [21:44] see, that's way beyond the level of complexity of projects where i 'peace out!' :D [21:47] ola_norsk: Wait, just writing a single email saying "You can download a lot of IA without permission" is a high level of complexity? [21:47] How is that any more complex than the emails you already wrote? [21:49] because after the reply to 'could i get some more concrete information about the idea?'..i did not get a response back... [21:49] ola_norsk: I see. [21:50] so sending a 2nd email without having gotten a response..that's no good in my book :D [21:51] Ha. Well, I don't feel like I should write to them, because I don't speak Norwegian and I don't live in Norway. :-( [21:52] pretty sure they know english ;) [21:54] heck, even i know proper english when i put my beer-soaked mind to it, watching out for typos and colloquial terms [21:54] ola_norsk: Yeah, but I feel like they'd respect me less. [21:54] then we have a problem.. [21:55] so, change.org then? [21:55] Yep, if you feel you've worn out your welcome, and we don't have any other Norwegians interested in stepping up... [21:55] i've heard rumours there was another one..but alas, no more than that :/ [21:56] Oh hell, I suppose I'll write something quick up. [21:56] can't get shittier than mine :D [21:57] :-P [21:58] a smashing, awesome, mesmerizing 'pitch deck' though..the kind that could sway the harshest of wall street investors.. [21:58] that would be best [21:59] voice-over by Alex Jones.. "Here's why data preservation is important for survival of the human species!!!" [22:00] *** sep332 has joined #archiveteam-bs [22:00] but yeah, anyone else is better than me [22:01] has there ever been a discussion of ia trying to infer locations from javascript (in before: halting problem)? [22:02] you mean a crawler with a headless browser? [22:03] it's resource-intensive, but a lot of crawlers already do it. if i were to guess, it's not done as it would slow down crawl speed a lot. [22:04] ez: well, possibly a way for the crawler to dump asts, then look if someone registered a way to get file locations from that ast. it's not even obfuscated nor really 'computation', just a format that currently blinds ia [22:05] oh and that crawler would need to time travel too :/ [22:05] *** sep332 has quit IRC (Read error: Operation timed out) [22:05] i want to look up an image output format that my national weather service dropped when they "moved to open data" :((( [22:06] and they only had it on ftp and clients had it on pages w/ javascript animations [22:06] generally, the only reliable way to run js these days is a headless browser [22:06] phantomjs or headless chrome [22:07] 'halting problem' is not the issue as such, it's more like an 'insane, obtusely baroque web platform as a whole' problem [22:09] ola_norsk: OK, here's what I plan to write: https://0bin.net/paste/2+yIgWRGt6IUpjbp#twBLyDm6cHRXGXp-BYmeALC83pScu4311jR94QtyvNk [22:09] Please let me know any comments you have.
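For the headless-browser approach ez describes, a minimal sketch (the URL and the image-extension regex are placeholders): let Chrome execute the page's JavaScript, dump the resulting DOM, and pull candidate asset URLs out of it.

# run the page's js, then print the post-execution DOM to stdout
chromium --headless --disable-gpu --dump-dom 'http://example.com/js-page/' \
    | grep -oE 'https?://[^"]+\.(png|jpe?g|gif)'   # crude asset extraction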
this is pretty much 1990's style code (it just uses >dom0 because it needs to cache images): http://web.archive.org/web/20110119230906/http://wetter.tagesschau.de:80/radarbilder/ [22:09] "mirroring some" ? [22:11] those were just re-scaled versions of stuff from https://www.dwd.de/DE/leistungen/gds/gds.html, which is being phased out in favour of https://opendata.dwd.de/ [22:11] ola_norsk: Yes, that's how I read the response... [22:11] That they liked the idea of mirroring parts of archive.org [22:11] Ideally, all of it. [22:11] yeah [22:12] the email is perfect [22:12] But I certainly didn't see anything in their response that suggested they were *opposed* to starting by mirroring parts of it. [22:12] OK, cool, sending now. [22:13] i think IA needs some kind of public fact sheet, that shows that it's not just WayBackMachine :D [22:14] ola_norsk: Yeah, that would probably be good. [22:14] or preferably a video where kahle and scott finger the storage while pointing it out :D [22:15] email sent. [22:15] "here's the U's of wayback, here's the U's of videos and news" [22:15] Somebody2: did you send it to just the person who responded? [22:15] .oO( here's the U's of backups of old scene releases ;) [22:15] * Harzilein runs [22:15] ola_norsk: Yes. [22:15] ..and that [22:16] ok [22:16] Christian Torp. [22:16] yes he responded [22:17] i'm not sure what the title is called, 1 sec [22:18] It doesn't matter. [22:19] "Chief Operating Officer (COO)" [22:20] tone dalen did not respond when i wrote, but he did [22:21] welcome to bureaucracy, i guess :D [22:22] * Somebody2 shrug [22:22] the benefit of it though, is there are so many instances to nag :D [22:25] other than DCS there's also 'Arts Council' who also have a lot of say in such matters [22:25] http://www.kulturradet.no/english [22:29] Somebody2: let me know if you get a response on the email, though it's christmas now so it might take a while [22:30] *** schbirid has joined #archiveteam-bs [22:30] fucking hell, i need to gooder up my formal english writing if you do :/ [22:31] it's sad i'm the only norwegian presently here :/ [22:34] there's not even a swede or a dane around? [22:35] seriously though, i hope to hear when/if you get a response [22:37] Harzilein: mirroring scene releases is doable. just rent a DC-hut in the marshall islands (one of the few real countries with no copyright laws) [22:38] then again, unauthorized copies kinda tend to "mirror" themselves [22:40] ez: huh? [22:40] I wouldn't be surprised if IA had its own stash of those. Darked, for obvious reasons. [22:41] that's what i'm talking about [22:41] there's oldschool ones in unsystematic blobs. they are far more interesting than the ones with the nice emulator frontends :) [22:43] bbs era isn't that challenging yea. for starters, 20 years of data produced back then equals a week of data produced now. [22:46] anyway, my angle at this was it's hard to get across "our" 10000-feet view that this is just "niche" data like any other, despite the provenance [22:47] ola_norsk a Dane here.. cheers [22:48] Harzilein: most of it is garbage like current warez. the important bits (demo and tracker scene) are how IA actually started, didn't it? [22:49] kimmer1: skål :) [22:50] Harzilein: if we're talking cultural heritage wrt piracy, there are certain niches in that niche where archiving would be of very high value. things like St.GIGA games. [22:50] it's a bit like "pirate" recordings of tv shows which otherwise are long lost to history.
ola_norsk: Swedish speaking Finn here [22:51] ola_norsk: perkele :D [22:51] oops [22:51] That's my reaction to Finnish [22:51] well, that's twice as good as an actual swede :D [22:52] anyway, if there's some south americans and some asians, the globe is covered :D [22:53] ez: Private Layer is what a lot of pirate sites use for hosting. It's a Panamanian company that has servers in Switzerland (which is a pirate's paradise) [22:54] icedice: yes, there are a few shady isps catering to the unsavory markets [22:54] it's funny how the actual scene on one hand shuns commercial sites (hosted in places like you mention), and on the other hand it thrives on them [22:55] icedice: the supposed ethos is to stay under the radar. being herded by a provider "look, you can host your botnet/whatever here" is p much the opposite of that. [22:55] markets can sure play out in fun ways [22:57] This is getting too offtopic for this channel. Mind moving to #archiveteam-ot? [22:58] They're a bit too shady for my taste nowadays though: https://www.lowendtalk.com/discussion/71510/grupo-panaglobal-15-s-a-private-layer-drama-allegedly-james-reed-mccreary-alpha-red [22:58] Isn't #archiveteam-bs meant for off-topic stuff like this? [22:58] *** M9uy3 has joined #archiveteam-bs [22:58] hi, how to start that project? https://www.archiveteam.org/index.php?title=Blog.pl [22:59] M9uy3: 1 sec [23:00] Hey M9uy3. So the first step would be to get a list of all the blogs hosted on blog.pl. [23:00] That would probably mean grabbing all of http://www.blog.pl/katalog and creating a list out of that. [23:00] there doesn't seem to be a specific task made for it "yet" [23:01] ok, only URLs? [23:01] What do the numbers there on the left mean? Are those numbers of blogs in the respective categories? [23:01] If so, we're talking about millions of blogs. [23:01] ez: What other countries are there that have no copyright laws? [23:01] 7752304 blogs in all categories [23:01] Oh dear. [23:01] I think I've heard that Montenegro has none, at least [23:01] ;) [23:02] it will be a great crash [23:02] So more than every sixth Pole has a blog there?? [23:02] (On average) [23:03] the project has been online since 2001 [23:03] Hmm, ok, we'll have to think about how to do this then. [23:03] They shut down end of January, right? [23:04] yes, the 31st [23:04] Mhm [23:05] The links on /katalog appear to point at the newest post for each blog. That might be a good starting point. [23:06] The image links, I mean. [23:06] We're probably looking at billions of links in total though. :-| [23:07] the site is called 'blog.pl' but one can even find school websites there http://pspwasosz10.blog.pl/ :/ [23:07] each of those links, linking an internal thingy/image, usually has a single identifier, does it not? [23:08] ola_norsk: What do you mean? [23:09] i mean, instead of billions of links, some of that billion might all be linking to the same e.g picture/post etc [23:09] stored on that domain, i mean [23:09] No, the billions I mean are probably unique, though quite many of them might be 404s. [23:10] I mean links like http://reniablicharz.blog.pl/?p=1660 [23:10] Changing the p parameter leads you to other posts. [23:10] The next lower value that exists is 1654. [23:10] Which redirects to the second-newest blog post. [23:10] And so on. [23:11] The canonical post URLs look different and consist of a date and a slug, arranged as /YYYY/MM/DD/slug. [23:11] (Plus a slash at the end) [23:11] This will have to be a warrior project, but even then I'm not sure it's feasible. This thing is fucking *massive*.
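A sketch of what enumerating one blog's posts through that p parameter could look like, based on JAA's observation above that /?p=N redirects to the canonical post URL when N exists (the blog and ID range are just the example from the log, and it's an assumption that nonexistent IDs don't also redirect somewhere useful):

# probe candidate post ids and record where each one redirects;
# existing ids 301/302 to the canonical /YYYY/MM/DD/slug/ url
for p in $(seq 1 1660); do
    curl -s -o /dev/null -w "%{http_code} ?p=$p -> %{redirect_url}\n" \
        "http://reniablicharz.blog.pl/?p=$p"
done | grep -E '^30[12]'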
This thing is fucking *massive*. [23:12] what if "Grupa Onet.pl SA" was willing to just give all the shit by closing time? [23:13] that could save a bit of work [23:13] you mean export somehow? [23:13] aye [23:13] it doesn't hurt to ask [23:13] Feel free to do so. [23:13] (or demand, rudely) :D [23:14] kurwa, i do not speek polish :D [23:14] "Give it to us, or we'll DDoS you!" :-P [23:14] that [23:14] i've been already in contact with them today because of the second shutting down :) [23:14] Which would actually not be too far from the truth lol. [23:14] JAA: the blogs are just wordpress, nothing too spectacular there [23:15] "Give it to us, or we'll DDoS your future endevours!" [23:15] the issue indeed is how to get the subdomain urls [23:15] there are blogid and blog_id entries, but not yet api call translating id to subdomain found yet [23:15] ola_norsk: I will of course mention in the channel if I get a response. [23:15] unfortunately the front page cant be scraped, it limits paging numbers to 100 [23:16] ez: Yep, I know. This will be a good test for archiving Wordpress.com, which I assume will have to happen at some point and will be an absolute shitfest. [23:16] At least we have an easy way to find all blogs there though (through the wp.me shortener; we did that in URLTeam a while ago). [23:17] No such shortener here, unfortunately. [23:17] so, there is a need of URL list and/or a possibility to reupload the content somewhere? [23:18] the mirroring itself is doable within the timeframe [23:18] the issue is how to find what to mirror [23:18] ie write a category spider for the front page is probably the best one could do for now [23:18] unless better api is reverse engineered [23:18] Somebody2: good stuff. I think you will get answer, though maybe now at christmas time was the _worst_ time to write a mail to an organization :D [23:19] ez: I don't see an API anywhere...? [23:20] there is no API [23:20] JAA: there isnt [23:20] the id is in javascript for ad serving [23:20] so we know there *is* in fact numerical id per blog [23:21] Somebody2: if not; If there's no response..which there were in my case, there's no fauly in eventually sending a new mail after a while [23:21] Right [23:21] i wrote today earlier to them but the time is bad for such contacts (christmas) [23:21] Somebody2: fault* [23:22] JAA: however everything on the frontpage seems to use subdomain urls [23:22] I asked them for URL list for http://republika.onet.pl/ subdomains - another project to be down (in March) - less blog, more in type of 'Geocities' [23:23] No occurrence of blog_id in any of the JS included on blogs either. [23:23] its directly in the html [23:24] just viewsource [23:24] hmm, http://www.blog.pl/data/cache/thumb_270x200/data/post-images/74105/79552.jpeg [23:24] Yeah, I mean it isn't used anywhere. [23:24] var dige_vars = {"homepage_url":"www.blog.pl","category":"Spo\u0142ecze\u0144stwo","admin_url":"http:\/\/zarzadzanie.blog.pl\/krolowa-superstar.blog.pl\/wp-admin\/","template":"mystique","addthis":{"selector":".post-content","action":"append"},"blog_id":74105,"p [23:25] so the thumbs use blogid/postid format [23:25] ola_norsk: Well, there isn't really any *need* for them to respond to me -- that was the main point of my email. :-) [23:25] which is all nice, if there were something, anything, which could translate id to blog url [23:25] We should make a channel for this. [23:25] We'll need one down the line anyway. 
[23:25] They can simply go forward with getting funding for storage, then download a bunch of IA's stuff, and happily sit on it. :-) [23:25] #blog.pls ? [23:26] I'd hope they'd drop me (and info@archive.org) a quick note to say, "Hi, we've made a copy of 10PB of stuff, thanks for making it available!" -- but it's not required... [23:27] Somebody2: i'm not sure what you mean by that, but NCS is the major computer/it association in Norway. Basically every computer/tech related company is a member...I guess it's kind of like the NRA of computer stuff here [23:28] ola_norsk: What I mean is that the point of my email was that NCS does not need to talk to me any more in order to mirror IA. [23:28] So if they don't respond, it doesn't mean they aren't, you know, mirroring IA. [23:28] aye, they shouldn't..they can email archive.org themselves dammit [23:30] "what more can you tell me about the idea"..the fuckers should know how to google [23:30] ola_norsk: They don't need to email archive.org EITHER. [23:30] That was what I keep trying to point out to you! [23:31] they sure as hell don't need to ask me about 'something more concrete' though :D [23:31] *** icedice has quit IRC (Quit: Leaving) [23:31] I think the "something more concrete" was hopefully along the lines of what I suggested. :-) [23:32] or, maybe i should have just said straight up: It might need a couple of square meters of data lockers and racks [23:32] * Somebody2 going AFK [23:32] Yes, that probably would have been good. :-) [23:33] aye, but my english, or rather, my technical norwegian is not that proficient [23:34] ez: Sounds good to me. [23:36] >>>>> Discussion on archiving blog.pl is now going on in #blog.pls [23:40] Somebody2: the best i can do is try to get people and organizations with 'sway' to consider it :/ [23:40] Somebody2: for all i know, e.g UIO.no has already pitched the idea.. [23:46] geographical location, political standing globally, its focus on 'green energy' and the somewhat hysterical habit of wanting to preserve old useless shit..would be a plus [23:49] *** icedice has joined #archiveteam-bs [23:51] in norway, trying to build anything new close to e.g even an old wooden gate is sometimes a cause for years of controversy :D [23:53] *** M9uy3 has quit IRC (Ping timeout: 260 seconds) [23:55] even 80s and 90s graffiti is at times deemed protected as 'cultural heritage' [23:57] the sad effect is, all the books in local libraries are old as fuck :/ [23:59] *** ola_norsk has quit IRC (It never hurts to ask. Merry christmas! https://youtu.be/wmin5WkOuPw)