[00:00] i'm not sure how to formulate it. E.g from 2006 until today, a file with e.g https://twitter.com/drkbri/status/940731016880312321 [00:01] on each line [00:01] A text file filled with tweet URLs, you mean? [00:01] yeah [00:01] Well, one such URL is something like 60 bytes. [00:02] Idk how many URLs you want to store. [00:02] If storage size is a concern, you could compress this *massively*, of course. [00:03] You could just store 'username id' lines, for example. [00:03] there's roughly 30 tweets in 3 minutes of netneutrality hashtag [00:03] you dont need to store the username, the ids are globally unique [00:03] you can replace it in the url by any text you want [00:03] Oh, it just redirects. TIL [00:04] Was it always like this though? [00:04] I thought it used to 404. [00:04] can it be ballparked to NOT be above a 160GB file? [00:04] Ok yeah, just the ID then. [00:04] And even that is highly compressible since it's only digits. [00:05] 10 tweets per minute = about 53 million tweets in 10 years [00:05] * 60 bytes ? [00:06] i have dyscalculia :D [00:06] Well yeah, if you want to store the entire URL every time. [00:06] That would be 3.2 GB or so. [00:06] If you store username + ID, that probably reduces by half. [00:07] If you only keep the ID, another factor of ~2. [00:07] And if you compress that file, you probably get it down to less than 100 MB. [00:08] zfs with lz4 turned on would automatically make that tiny. [00:09] as long as it doesn't surpass the machine im doing it on it'll be fine. But yeah, i'm thinking several users might've made several tweets with that word. Maybe sqlite could be useful then [00:09] You probably want to avoid sqlite for large databases. [00:09] ^ [00:10] doesn't sqlite go into terabytes? [00:10] I mean it can, and then the database suddenly is corrupt. [00:10] Plus it's going to kill your disk doing so [00:11] PoorHomie: only this machine has SSD, the work machine is good old spinning crap :D [00:12] I don't know what you're trying to do really. [00:12] If you want to store the tweets to search through them afterwards, use a proper database. [00:12] If you just want to compile the tweet URLs, just use a (compressed) text file. [00:13] goal: get link to ALL tweets containing any mention of 'netneutrality' [00:13] just url? [00:13] as far back as it goes [00:13] username + ID of 1000 tweets is around 14.6 kB gzipped. [00:14] astrid, yeah. From that, it could either be made to a warc or fed to wayback..or? [00:14] Yes [00:14] with wget --page-requisites, wayback will apparently even save images [00:15] If you only keep the IDs, that reduces to 7.7 kB gzipped. [00:15] ola_norsk: Yeah, don't do that for that many URLs. [00:15] Use wget or wpull or whichever tool you prefer to create WARCs, then upload those to IA, and they'll get included in the Wayback Machine. [00:16] JAA: there's 'Sleep' to limit excessive /save/ requests though [00:16] hmm ok [00:17] I mean, you can try using /save, but creating WARCs directly will be *much* more efficient. [00:17] but, 6 hours of '#netneutrality' tweets, is quite a lot [00:17] even just 6 hours [00:17] It will also be possible to download the entire archives at once. The WARCs of stuff saved through the WM are not downloadable. [00:19] 3 hours of '#netneutrality' is ~ 290MB [00:19] as warc [00:19] Yeah, the WARCs will be quite large. You'll probably want to upload to IA while you're still grabbing.
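A rough sketch of the arithmetic above, assuming ~10 tweets per minute over 10 years and ~60 bytes per full tweet URL. The random 18-digit IDs are only a stand-in for real tweet IDs, which sort sequentially and compress somewhat better (the 75 kB per 10k IDs quoted below is from real data):

import gzip, random

tweets = 10 * 60 * 24 * 365 * 10              # ~53 million tweets over 10 years
print(f"{tweets:,} tweets, ~{tweets * 60 / 1e9:.1f} GB as full ~60-byte URLs")

# bare IDs are ~18 digits each; sample 10k and see how well they gzip
ids = "\n".join(str(random.randrange(940_000_000_000_000_000,
                                     941_000_000_000_000_000))
                for _ in range(10_000)).encode()
print(f"10k IDs: {len(ids):,} B raw, {len(gzip.compress(ids)):,} B gzipped")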
[00:20] any way to concatenate warc files? [00:21] e.g daily captures [00:21] Yes, you can just concatenate them with e.g. cat. [00:22] cat * [00:22] or you can just upload them to the same item and not bother even cat'ing them [00:22] it still goes to wayback if warc? [00:22] yea [00:22] well, if it gets blessed [00:22] ty [00:23] astrid: what does that entail? [00:23] reviewed? [00:23] someone with admin waves a magic wand over it [00:23] i really don't know tbh [00:23] it still gets uploaded though? [00:24] yes that is all post-upload [00:24] ok [00:26] it's bound to be more likely to be blessed than doing /save/ requests every fucking 3 minutes though :D [00:27] not to mention safer, as when some damn construction company decided to cut my power last tuesday for 2 hours [00:29] what the world needs next is WARC tasks distributed via DHT [00:29] *** astrid has left ][ [00:29] Yeah, you're largely independent from IA while grabbing, too. And you're archiving way more URLs than just the first page of the hashtag. [00:30] the 'requisites' could be acquired after though, at least, some of them [00:30] (and the FDS¤"#!#¤ t.co link could some day get translated into real links) [00:31] ola_norsk: Just ran another test. 10k tweet IDs are 75 kB gzipped, and it took about 9 minutes to grab those. [00:32] basically, in matters of 'netneutrality', it's not so much about digging out meme pictures, but 'PRO' and 'CON' tweets, i guess [00:32] twitter isn't very keen on mirroring, especially of historical data [00:32] You could probably reduce the size a bit more by using a more efficient compression algorithm. [00:32] ez: aye, they make money off of selling it [00:32] ez: Yep, and that's exactly why we should do it. [00:33] in terms of storage its about 100GB a month or so, archiving wise mirroring twitter isnt hard [00:34] in terms of using webarchive though, 3 hours of scrolling a hashtag is ~290 megabytes :/ [00:34] i'd prefer twitter to be a dataset [00:34] people want those archives for research, not to page through [00:34] ez: That doesn't sound right. [00:34] There are around 500 million tweets per day. [00:35] what year did twitter start? [00:35] The text of that alone would be 70 GB already. [00:35] (At 140 characters, but they increased the limit recently.) [00:35] Plus all the metadata, images, and videos. [00:35] JAA: its an old ballpark from my last attempt in 2015 or so, indeed the number could be much higher [00:36] ez: here's my "success (NOT)" at using webarchive https://webrecorder.io/ola_norsk/twitter-hashtags [00:36] ola_norsk: 2006 [00:36] JAA: about the time of the height of the 'netneutrality' issue then? [00:37] JAA: yea, the images and clips might pose trouble, or not. i dont really have a clear idea how big a % those are per tweet [00:37] ~30 tweets per 3 minutes for ~10 years .. [00:38] well, am not sure of the wisdom of mirroring a special hashtag, especially a political buzzword [00:38] if 3 hours of that = ~290MB of stuff.. [00:38] in europe its been called variously 'regulation of state telecoms', and related deregulation of those in mid 2000s [00:39] ola_norsk: Yep, 53 million tweets. That's up to 7.4 GB raw text. [00:39] ola_norsk: Obviously, the website is *much*, *MUCH* larger.
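Going back to the "just cat them" answer above: .warc.gz files are sequences of gzip members, so a byte-wise concatenation of per-day captures is still a valid combined archive. A minimal sketch, with made-up filenames:

import glob, gzip, shutil

parts = sorted(glob.glob('netneutrality-2017-12-*.warc.gz'))   # hypothetical per-day WARCs
with open('netneutrality-combined.warc.gz', 'wb') as out:
    for part in parts:
        with open(part, 'rb') as src:
            shutil.copyfileobj(src, out)

# sanity check: the concatenated gzip members still decompress end to end
with gzip.open('netneutrality-combined.warc.gz', 'rb') as f:
    for _ in f:
        pass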
[00:40] aye, even if subtracting shitty reposted lame memes, it's still a biggy i think [00:41] you can generally count 10% of the raw number, there's not much point storing it raw, unless you need to build a fast reverse index [00:41] it's basically a chore just getting the urls quicker than they're inputted i guess :/ https://en.wikipedia.org/wiki/Infinite_monkey_theorem [00:41] 10% is what it generally compresses to with general algos, and 5% with specialized (very slow) ones [00:41] ez: Hmm, actually, that 500 million per day figure is from 2013. It's probably significantly higher than that now. [00:44] JAA: yea. better way to estimate this is simply pulling the realtime feed and piping it through gzip [00:44] i'm betting if even using /save/ per new tweet in an established hashtag, wayback would be too slow [00:44] twitter generally allows for pulling the realtime feed, the biggest problem is the history [00:44] no sane api, only page scraping [00:45] not if you pay them.. [00:45] There probably is a sane API, but $$$$$. [00:45] aye [00:46] JAA: anyhow, the current rate is something like 6k a second or something like that [00:46] basically, if you have the funds to make them listen you could say, "hey, give me all tweets with the word 'cat' in them".. [00:47] with this distribution in length [00:47] https://thenextweb.com/wp-content/blogs.dir/1/files/2012/01/Aig565bCAAAYgkB.png [00:47] ez: That's the rate from 2013. [00:48] I wonder what that distribution looks like for the past month. [00:48] *** second has quit IRC (Quit: WeeChat 1.4) [00:48] (Since they increased the limit.) [00:48] *** sec0nd is now known as second [00:49] the problem i'm seeing with scraping is that there should be a second process capturing newly entered tweets [00:50] e.g running back 10 years is well and good, but in the meantime there's 1000's of new ones :/ [00:51] maybe there should be a tweep v2 [00:52] I'm sure that's possible since the browser displays that "3 new tweets" bar. [00:52] in tweep? [00:53] ola_norsk: it's the super annoying kind of task only the datamining companies bother with [00:53] No, but in a similar way. [00:53] the stream allows you only a certain rate under a filter [00:53] so you need a shitton of accounts with disparate filters [00:54] ez: some data is worth bothering with for regular people i think :) [00:54] its the sort of thing its just easier to pay for than trying to awkwardly skirt the rules (indian and russian media companies still do, coz they're fairly comfortable with blackhat social media stuff) [00:54] IMO public data entry should be public, but yeah [00:55] twitter is still one of the most open guys in town [00:55] but yea, all big 3 will laugh at you and gaslight you if you ask for something like this [00:55] its the lifeblood of internet advertising, and you want it for free? [00:56] for all we know, people've already typed in one of shakespeare's books.. [00:56] to twitter [00:57] ola_norsk: no, its a really interesting corpus for ML training [00:57] you can easily make a chatbot with a decent sample of twitter's history, and a lot of folks do. [00:58] decent is not perfect [00:58] but i like the idea of what happens when a 1TB torrent appears on piratebay with all of twitter's history [00:58] (IA wouldnt survive the heat of doing that) [00:58] legal heat at least
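A minimal version of the "pull the realtime feed and pipe it through gzip" estimate mentioned above: feed it a sample of scraped tweets (e.g. JSON lines) on stdin and it reports the compression ratio. The file name in the usage comment is hypothetical:

# usage: python3 gzip_ratio.py < sample-tweets.jsonl
import gzip, sys

raw = sys.stdin.buffer.read()
if not raw:
    sys.exit("no input")
packed = gzip.compress(raw, compresslevel=9)
print(f"raw {len(raw):,} B -> gzipped {len(packed):,} B "
      f"({len(packed) / len(raw):.1%})")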
[00:59] how could they get legal heat for archiving public tweets? :/ [00:59] for starters, you didn't ask individual users if they allow you to archive their tweets [00:59] i'm not doubting, just asking what the reasons are [00:59] what does the twitter EULA say? [01:00] its the whole dirty secret biz of data mining [01:00] by using twitter, you allow TWITTER, and its partners, to use your data [01:00] but nobody else [01:00] Could Twitter actually do anything about it though? I assume you retain the copyright when posting. [01:01] i wish these guys were on IRC https://discord.gg/Qb9TSZ (Legal Masses) [01:01] JAA: they could, and they do. citing "privacy concerns" [01:01] which is a hilarious case of gaslighting [01:01] I mean, I could think of various things they could do regarding scraping the data, but what could they do about the data release itself legally? [01:02] here in norway there is the 'right to be forgotten', but that does not trump unwillingness to let memory go [01:02] they take it down, they have fairly clear rules that you are allowed to release data only with their approval. there are provisions for small samples which are useful only for statistics, but not more whole-picture things [01:02] Oh dear, time to switch the topic before we get into that discussion again. [01:03] ez: But on what legal basis would they take it down? [01:03] Their rules don't matter much if they aren't enforceable legally. [01:03] let's just archive it all, and hear who pisses about it :D [01:03] its basically this https://twittercommunity.com/t/sharing-social-graph-dataset-for-research-purposes/77998 [01:03] and thats JUST the social graph [01:04] not even the tweets [01:04] Their rules don't matter much if they aren't enforceable legally. [01:05] JAA: its basically the same as how jstor can paywall public domain works. they legally bully, but it wouldn't pass rigorous scrutiny [01:05] the issue of user's consent remains [01:05] twitter couldn't sue you in the end, but kanye west easily could [01:05] i don't know man, if i see it, i might screenshot it [01:06] Yes, that's exactly my point. The users could certainly do something about it because they hold the copyright to that content and didn't consent to it being distributed in that way. But I don't see what *Twitter* could do about it (ignoring the company's accounts). [01:06] Anyway... [01:06] A datapoint: there are about 27 hours between tweets 940737669532758016 and 940331105898577920. [01:07] archive first, delete whatever later..it's futile to archive after the fact [01:07] Clearly these aren't just numeric IDs, but there's something more complex going on. [01:07] It could be a five-digit random number at the end. [01:07] That would mean that there were 4 billion tweets in 27 hours. [01:07] ola_norsk: this data, at least in terms of an archive, would definitely survive easily as a torrent [01:07] IA would only seed it initially :) [01:08] ez: i'm only 75% into the topic [01:09] too much tech for me, i be grabbing them links! [01:09] *** ola_norsk has quit IRC (skål!) [01:12] JAA: btw, regarding the historical stats. pundits claim that twitter TPM has stalled since 2013 [01:12] and kept around the 5-8k/sec figure since then [01:12] http://www.businessinsider.com/twitter-tweets-per-day-appears-to-have-stalled-2015-6 [01:13] (before that there was hockeystick growth apparently). nobody made a sigmoid curve yet to verify tho [01:15] *** ola_norsk has joined #archiveteam-bs [01:15] I see.
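On the tweet-ID datapoint above: the "something more complex" is Twitter's snowflake ID scheme. The upper bits of a tweet ID are a millisecond timestamp past Twitter's custom epoch, and the low 22 bits are worker/sequence numbers rather than a tweet counter, so the "4 billion tweets in 27 hours" reading doesn't hold. A small sketch recovering the timestamps of the two IDs quoted:

from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657            # 2010-11-04, snowflake epoch

def snowflake_time(tweet_id):
    """Creation time encoded in a snowflake tweet ID."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

a, b = 940737669532758016, 940331105898577920
print(snowflake_time(a), snowflake_time(b))
print(snowflake_time(a) - snowflake_time(b))    # roughly 27 hours, as observed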
[01:15] * ola_norsk plain forgot [01:15] here's output http://paste.ubuntu.com/26173713/ [01:16] quite a mess, but better than nothing [01:17] ez: Hmm, they appear to be basing that entire article just on the 500 million figure stated somewhere on Twitter. :-| [01:17] JAA: no, i googled various sites which claim to know the current TPM [01:17] and they all show 5k, 6k, 7k [01:17] ola_norsk: Well yeah, it's messy. I'd only keep users and tweet IDs if you really intend to grab the entire history. [01:17] but yea, the BI article just compares two years which is a poor sample and argument [01:18] *** ola_norsk has quit IRC (https://youtu.be/EPHPu4PV-Bw) [01:19] this one claims even a decline since the peak in 2014, http://uk.businessinsider.com/tweets-on-twitter-is-in-serious-decline-2016-2 [01:20] could be just a case of BI having an agenda to portray twitter that way tho [01:21] ez: Regarding the total size of Twitter, they apparently have at least 500 PB of storage: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html [01:21] Quite an interesting article in general, really. [01:23] i wish google would elaborate on their technicals more [01:24] compared to twitter, the amount of traffic google (ie yt) gets is real scary [01:25] I'd rather not have my content used for advancing algorithms to manipulate people with advertising. so good thing I don't post anything, I guess :) [01:25] well, "pirating" their data would amount to opening a pandora's box [01:25] Twitter 500 PB? [01:26] Sounds like more than I would expect. [01:26] abuse of the data is inevitable, but at least everyone should get an equal opportunity to do good or evil [01:26] not just the highest bidder [01:28] *** Stiletto has quit IRC (Read error: Connection reset by peer) [01:28] *** ZexaronS has joined #archiveteam-bs [01:28] robogoat: i'd expect a few PBs at most, for the actual content. the number being inflated by massive duplication on the edges [01:29] Yep, that number includes even cold storage. [01:29] Yeah, [01:30] google said they're 10 exa live, 5 on tape [01:30] If you're talking 1PB replicated 500 times. [01:30] They claim to be processing 10s of PB per day on another blog post. [01:30] Google I wouldn't be surprised. [01:30] YouTube alone is ~1 EB. [01:31] (Really rough order of magnitude estimate) [01:32] not sure if anyone has posted numbers since 2013 [01:32] but i suspect google is at the end of the sigmoid too, in user adoption anyway [01:32] they definitely had to have a bump with 1080p/4k which wasn't as prevalent in 2013 [01:35] *** Stilett0 has joined #archiveteam-bs [01:41] *** CoolCanuk has quit IRC (Quit: Connection closed for inactivity) [01:49] hook54321: yes, afaik the offer is still open. [01:49] Did they contact ArchiveTeam specifically, or? [01:50] *** ranavalon has quit IRC (Quit: Leaving) [01:50] No. [01:50] Who did they direct the offer towards? [01:51] But the site itself said that for years: https://web.archive.org/web/20150203022400/http://www.autistics.org/ [01:52] And I have friends-of-a-friend contact with the custodian. [01:52] Custodian being the website owner? [01:52] Yep. [01:55] I have a Facebook friend in common with them, but I don't really know that Facebook friend personally. [01:58] Nods.
[02:12] *** pizzaiolo has joined #archiveteam-bs [02:13] *** pizzaiolo has quit IRC (Client Quit) [02:29] *** Soni has quit IRC (Read error: Operation timed out) [02:32] *** closure has quit IRC (Read error: Operation timed out) [02:34] *** closure has joined #archiveteam-bs [02:35] *** svchfoo1 sets mode: +o closure [02:37] *** Valentin- has quit IRC (Ping timeout: 506 seconds) [02:37] *** dashcloud has quit IRC (Remote host closed the connection) [02:38] *** dashcloud has joined #archiveteam-bs [02:41] *** Asparagir has joined #archiveteam-bs [02:42] *** Asparagir has quit IRC (Client Quit) [02:47] *** MrRadar has quit IRC (Quit: Rebooting) [02:54] *** Stilett0 has quit IRC () [03:06] *** ZexaronS has quit IRC (Read error: Operation timed out) [03:09] *** MrRadar has joined #archiveteam-bs [03:25] Somebody2: I can try contacting the owner unless you think it would be easier for you to contact them. [03:25] It's kinda difficult to email people about domain names because it oftentimes gets seen as the "Your domain is expiring soon" spam [03:34] *** zhongfu has quit IRC (Remote host closed the connection) [03:50] hook54321: Probably better to work out what you are proposing in some more detail, first. [03:50] Let's take this to PM. [04:06] *** qw3rty117 has joined #archiveteam-bs [04:12] *** qw3rty116 has quit IRC (Read error: Operation timed out) [04:37] *** zhongfu has joined #archiveteam-bs [05:05] *** du_ has quit IRC (Quit: Page closed) [05:11] *** Yurume has quit IRC (Read error: Operation timed out) [05:17] *** Yurume has joined #archiveteam-bs [06:08] *** sep332 has quit IRC (Read error: Operation timed out) [06:08] *** sep332 has joined #archiveteam-bs [06:25] *** Nugamus has quit IRC (Ping timeout: 260 seconds) [06:34] *** kimmer2 has quit IRC (Ping timeout: 633 seconds) [06:55] *** jschwart has quit IRC (Quit: Konversation terminated!) [07:49] *** me is now known as yipdw [08:02] *** Mateon1 has joined #archiveteam-bs [08:17] *** ndiddy has quit IRC () [08:42] *** godane has joined #archiveteam-bs [09:25] *** Soni has joined #archiveteam-bs [09:28] *** tuluu has quit IRC (Read error: Operation timed out) [09:28] *** tuluu has joined #archiveteam-bs [10:00] *** beardicus has quit IRC (bye) [10:00] *** beardicus has joined #archiveteam-bs [10:40] *** BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io) [12:06] *** BlueMaxim has quit IRC (Quit: Leaving) [12:27] *** pizzaiolo has joined #archiveteam-bs [12:32] *** pizzaiolo has quit IRC (pizzaiolo) [12:34] *** pizzaiolo has joined #archiveteam-bs [12:55] *** refeed has joined #archiveteam-bs [12:55] *** refeed has quit IRC (Client Quit) [13:15] *** ranavalon has joined #archiveteam-bs [13:16] *** ranavalon has quit IRC (Read error: Connection reset by peer) [13:16] *** ranavalon has joined #archiveteam-bs [13:25] *** purplebot has quit IRC (Quit: ZNC - http://znc.in) [13:25] *** PurpleSym has quit IRC (Quit: *) [13:28] *** PurpleSym has joined #archiveteam-bs [13:29] *** purplebot has joined #archiveteam-bs [13:33] Are we going to do Bitchute after vidme? [13:35] Are they shutting down? 
[13:45] so i'm uploading my tgif abc woc 1998-12-11 tape i have [13:46] i'm slowly uploading christmas shows i have for myspleen [14:14] JAA I heard they are in a similar situation to VidMe (barely making it) and now have an influx of VidMe users to make things worse [14:26] Makes sense considering they're essentially saying "Please come back, Vidme!": https://twitter.com/bitchute/status/936804311492734977 [14:26] On the other hand, they also retweeted messages from former Vidme users saying they're switching to BitChute, e.g. https://twitter.com/ErickAlden/status/937022050761433088 [15:32] *** RichardG has quit IRC (Read error: Connection reset by peer) [15:37] *** RichardG has joined #archiveteam-bs [16:01] SketchCow: FYI, I'm about to upload two huge ArchiveBot WARCs to FOS. 28 and 35 GB or something like that. That pipeline (not mine) was using a buggy version of wpull, so I'm trying to fix it. [16:01] Not sure if this is even problematic or anything, just wanted to let you know. [16:04] And regarding the rsyncd config, if you don't mind giving me your config file from FOS, I'd like to play around to see if I can figure out a proper solution to the overwriting issue. [16:05] I also want to test if --ignore-existing even works at all with a write-only target. Since the client can't get a list of files on the server, it's possible that it won't change anything. [16:16] *** du_ has joined #archiveteam-bs [16:32] *** cloudfunn has joined #archiveteam-bs [16:59] *** kimmer12 has joined #archiveteam-bs [17:02] *** kimmer12 has quit IRC (Read error: Connection reset by peer) [17:02] *** kimmer13 has joined #archiveteam-bs [17:04] *** kimmer1 has quit IRC (Ping timeout: 633 seconds) [17:05] Go ahead and try [17:05] *** kimmer13 has quit IRC (Read error: Connection reset by peer) [17:06] 1.5tb free on FOS at the moment, more free today [17:06] Uploads are done already. :-) [17:08] *** kimmer1 has joined #archiveteam-bs [17:08] *** Stilett0 has joined #archiveteam-bs [17:19] *** Valentine has joined #archiveteam-bs [17:20] Yeah, just saying we went from MANGA CRISIS to ok [17:22] Sweet [17:40] https://pineapplefund.org/ [18:17] *** Stilett0 is now known as Stiletto [18:24] *** kimmer12 has joined #archiveteam-bs [18:30] *** ola_norsk has joined #archiveteam-bs [18:30] *** kimmer13 has joined #archiveteam-bs [18:31] anyone know if there's a way to sort the playlist on a 'community video(s)' item? [18:31] *** kimmer1 has quit IRC (Ping timeout: 633 seconds) [18:31] or, re-sort it, i guess [18:34] i'm guessing it's sorted by filenames. E.g, if all files start with a date like '20170101_' , would it be possible to reverse that sorting; so that the playlist is basically reversed, showing newest -> oldest [18:35] *** kimmer12 has quit IRC (Ping timeout: 633 seconds) [18:41] i'm guessing one rather messy workaround would be using the 'ia move' command, to prepend e.g '0001_' , '0002_' to the filenames based on date. But might there be a better way? [18:54] going with prefixed filenames would require every filename in an item to be renamed as well, if a newer file were to be added to it :/ [19:04] *** ndiddy has joined #archiveteam-bs [19:08] JAA: btw, tweep is working like a mofo :D But, i'm kind of worried about there not being a way to regulate/randomize its frequency of requests. So perhaps running it through proxies might be better? [19:10] JAA: I'm guessing it would be noticeable, at some point by some twitter admin, even if it's faking the user agent, if it's left running for weeks and months :/
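A rough sketch of the 'ia move' renaming workaround mentioned above, using the internetarchive Python library to list the item's files and shelling out to the CLI for the renames. The item identifier, the extension filter, and the numbering scheme are all assumptions for illustration, and the exact 'ia move' syntax should be checked against 'ia move --help' first:

import subprocess
from internetarchive import get_item

ITEM = 'my-community-videos'                      # hypothetical item identifier

names = sorted(f.name for f in get_item(ITEM).get_files()
               if f.name.lower().endswith('.mp4'))
# newest first, assuming filenames start with a YYYYMMDD_ date
for seq, name in enumerate(sorted(names, reverse=True), start=1):
    subprocess.run(['ia', 'move', f'{ITEM}/{name}', f'{ITEM}/{seq:04d}_{name}'],
                   check=True)

As noted above, adding a newer file later forces renumbering everything, so picking a sortable naming scheme at upload time avoids the churn entirely.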
[19:20] How about #archiveteam-offtopic [19:21] that [19:22] SketchCow: i don't want to be OP there though :/ [19:24] SketchCow: How about archiveteam-ot , for short, and similarities to -bs ? [19:33] *** schbirid has joined #archiveteam-bs [19:41] SketchCow: though i'm not seeing how Internet Archive playlist sorting and archiving tweets is not within 'Off-Topic and Lengthy Archive Team and Archive Discussions' [19:47] *** jschwart has joined #archiveteam-bs [19:52] *** Smiley has quit IRC (Ping timeout: 255 seconds) [19:55] *** RichardG has quit IRC (Read error: Connection reset by peer) [19:55] *** Smiley has joined #archiveteam-bs [20:53] *** icedice has joined #archiveteam-bs [21:04] *** BlueMaxim has joined #archiveteam-bs [21:04] -ot might work [21:05] You did the thing [21:05] I was suggesting the channel, not suggesting you were saying something for the channel. [21:05] Me doing something useful; That's rare :D [21:06] btw, for some reason i just naturally figured "bs" to stand for "bullshit(ting)" :/ [21:07] anyway, i can't be OP though, regardless of what it's called [21:07] that's pretty much what the original purpose of this channel was :P [21:08] lol [21:08] PurpleSym already grabbed op in -ot [21:08] he's gotta log off some time [21:08] or maybe he's the OP it needs, who knows [21:09] * ola_norsk just knows ola_norsk is not OP material [21:11] (or she) [21:12] channel squatting is quite useless, since there's a multitude of variations between "*-ot" and "*-offtopic" :D [21:26] PoorHomie: he or she might've just joined and gotten OP by default, like i did [21:40] archiveteam-bs-bs [21:41] *** Asparagir has joined #archiveteam-bs [21:44] so much bs :D [21:47] if there's #archiveteam , and #archiveteam-bs , how much on-topic is required here? [21:48] well I mean I haven't been around in a while so I might be wrong [21:48] *** Asparagir has quit IRC (Asparagir) [21:48] but I always thought this channel was for general nothing talk until it needed to get serious [21:55] Buck stops with me [21:55] People discussing endless what ifs and theories about archiving and saving things [21:55] = OK [21:55] People going off for hours about how to make a good wiki software suite [21:55] ! OK [21:55] People going off about bitcoin [21:56] ! OK [21:56] fair enough :D [21:57] I'm just going to clean shit up [21:58] People who do good work are either being driven away or can't focus on what's needed [22:00] So guess who has the bat [22:00] a vidme user wrote this "What you are doing means a lot to me. And I agree that there is no such thing as "safe" digital data, sadly." [22:03] *** jschwart has quit IRC (Quit: Konversation terminated!) [22:10] SketchCow: i'm starting to upload the Joystiq WoW Insider Show [22:10] i have 16gb of that [22:12] *** pizzaiolo has quit IRC (Read error: Operation timed out) [22:15] Great [22:16] metadata may be a problem until episode 139 [22:16] *** ranavalon has quit IRC (Read error: Connection reset by peer) [22:17] *** ranavalon has joined #archiveteam-bs [22:20] ez: you said the other day that sqlite would not be ideal to use for capturing twitter data. Could 'H2' be better? [22:23] ez: speed and stability would in my case be worth more than database size, since i'm pretty much just looking to reconstruct links to individual tweets, and take it from there [22:25] just use postgres [22:26] schbirid: does it make a single file database?
[22:26] no [22:27] i only have experience with mysql and sqlite :/ [22:27] what kind of volume do you expect? [22:27] never a bad time to learn postgres [22:28] gtg [22:28] *** schbirid has quit IRC (Quit: Leaving) [22:28] schbirid: some calculations were made 1-2 days ago by JAA [22:28] oh [22:29] ola_norsk: its fine if you have some subset, but for archiving the database is too bloaty [22:30] datasets like that are just flat "log" files, aggressively compressed [22:30] ez would writing into a mounted gzip file be too slow? [22:31] no, you just pipe the output of whatever dumper you have [22:31] again, this matters only if you're scraping the whole of twitter, and the 1:20 compression ratio helps a lot with logistics [22:32] doesnt make sense if you use some highly specific filter [22:34] ola_norsk: if you want to reconstruct UI ("links") for the scraped data and make a web interface for it, yea, i'd probably go with an uncompressed db [22:34] the filter is every tweet containing a word, e.g "netneutrality" ..Basically there's no way i can store it as warc. So focusing on using tweep, which seems to get the text and IDs, and reconstruct those IDs into links [22:34] perhaps not even bother with a db, and just save it all as rendered pages [22:34] ie warc [22:34] my harddrive is 160GB :D [22:35] hard to say, but gut feeling is that thats plenty enough for something so specific [22:35] if you were to do, say, 'trump' as a keyword, youd probably need far more [22:36] problem is i'd need some way to check for duplicate entries, in case of e.g powerfail :/ [22:37] well, log style dumps work that way. an append operation is already guaranteed to be atomic by the filesystem [22:37] from what i know of sqlite, it doesn't store anything unless it's solidly entered into the database [22:37] so the mirroring script you make just looks at the end of the log to fetch the last logged entry and continues from there [22:37] as for sqlite, same applies, you just select max() etc [22:38] sqlite guarantees write order, so it behaves like a (much less efficient, but with a nifty query language) log [22:39] ola_norsk: again, my gut feeling is for something so super specific, sqlite is plenty fine [22:39] tho my idea of how many tweets are out there could be off, i'm just expecting a couple hundred millions, not more [22:41] some would indeed be deleted, user banned, profile set to private, etc. So looking for something that is as fast as possible to store it, then validate later if need be. [22:41] without wasting too much storage each day [22:42] well, depends on what you do in the end [22:42] most people scrape twitter on a scale like this for sentiment tracking [22:42] twitter itself has the best api for that. you give it a keyword, it throws a realtime feed back at you. [22:42] plan is to upload to IA every 24 hours, after a 24h capture. [22:43] twitter api seems intentionally limited.. [22:43] not sure if its intentional. im pretty sure keeping a reverse search index for everything would be a monumental task [22:43] so they dont, and just track a 7 day window [22:45] i think the free API is limited to ~3000 tweets [22:45] for e.g one user [22:45] am talking about the realtime feed [22:45] yes, the rest of the api is worthless [22:45] you're better off scraping for that [22:47] wow, that hashtag [22:47] what a cesspool of FUD [22:48] '#netneutrality' ? [22:48] :D [22:48] yea [22:48] lol, yeah [22:48] so, i'm not saving all those pics :D [22:49] 'tweep' seems to get the ID, the user, and perhaps also resolved (fucking!) t.co links though :D [22:49] and text
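A minimal sketch of the append-only log approach described above: one tweet ID per line, appended as it is scraped; after a power failure the scraper reads the tail of the file to find the last logged ID and resumes from there (the sqlite equivalent being SELECT max(id) ...). The file name and resume semantics are assumptions:

import os

LOG = 'netneutrality-ids.txt'

def last_logged_id(path=LOG):
    """Return the most recently appended tweet ID, or None for a fresh start."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    with open(path, 'rb') as f:
        f.seek(max(-4096, -os.path.getsize(path)), os.SEEK_END)
        lines = f.read().split(b'\n')
        return int(lines[-1] or lines[-2])

def append_id(tweet_id, path=LOG):
    # appending one short line is effectively atomic on a local filesystem
    with open(path, 'a') as f:
        f.write(f'{tweet_id}\n')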
[22:49] ola_norsk: i think that archiving twitter in the immediate future is not viable, until somebody commits to playing a constant whack-a-mole with a fleet of proxies [22:49] and accounts [22:50] but writing a bot which just archives whatever is the most controversial trend at any given time might be viable [22:50] that's my main worry. The speed tweep seems to go at is bound to be noticed by some jerk at twitter that might ban my ip or something [22:51] if i leave it running for weeks and months [22:52] i'd not worry about tweep much as long as you use it just for a keyword [22:53] tweep itself is a bit troublesome because of what you said - it sees only whatever is in the search index, and only with fairly long poll delays [22:55] twitter is now somewhat famous for "banning" (they call it deranking, but so far the tweets simply vanish) from search results [22:56] tweep seems to be more focused on just a specified user, not so much words or phrases [22:56] IMO [22:56] well, it does only what the webui does [22:56] and the web ui is very restricted in scope, yes [22:56] there's e.g no option to limit frequency [22:57] iirc it just hammers [22:57] one-request-at-any-given-time [22:57] thats fairly benign by hammering standards [22:57] aye, and imagine if 'lol' or 'omg' was used as the search word..and left running :D [22:58] if you were hammering 500 requests in parallel, they would probably raise some triggers [22:58] aye [22:59] i doubt they'd go "wow, that Firefox user is clicking mighty fast!" :D [23:02] ez the 'anonymity' of tweep seems to be only faking the user agent :/ [23:02] a static hardcoded one, at that [23:02] twitter is a lot like reddit. they're generally lenient towards bots on the plain http level. instead, they just expose an api so crappy it makes any mass scrapes very, very awkward on account of the crappy api [23:03] unless you pay them.. [23:07] i'm thinking an extremely THIN wm where every request goes via Tor or open proxies might even be safer than e.g letting tweep run for 24 hours :/ [23:07] just run tweep via torify [23:08] if you want to play a blackhat whack-a-mole like this though, i suggest you first modify it to support proxies directly [23:08] by just using a single request at a time, and your own ip, i'd consider it fairly legit ... anything beyond that, you're skirting net etiquette a bit [23:09] i appreciate etiquette :D , im not blackhat :D [23:10] (within reason) [23:11] it's why i would at least like the tool to have a request delay [23:11] well, its the street equivalent of a peaceful protest which dissolves in an hour, and a violent black bloc. they both might have a just cause, but the latter is impatient, angry, and more likely to catch bystanders in the paving block and tear gas crossfire. [23:12] i'd prefer walking away with the shit slowly :D [23:12] 'better late than never' :) [23:13] ola_norsk: a request delay is not generally necessary [23:13] what is good manners is a so-called backoff delay [23:14] ola_norsk: https://gist.github.com/ezdiy/17855d7421bbb416cbb3d8e0e1caf213#file-vidme-py-L21 [23:14] this is my vidme scraper for example [23:15] it hammers, until something goes wrong with the api, and starts to exponentially increase the delay [23:15] for as long as there is an error [23:15] the worst thing to do is knowing you broke something and just keep blindly hammering anyway [23:16] lol..like going beyond the date in a shell script? :D [23:16] 1 etc
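A small sketch of the backoff-delay pattern ez describes (not his actual vidme scraper): run at full speed while requests succeed, and double the sleep while they fail, instead of blindly hammering a broken endpoint. The URL handling stands in for one hypothetical tweep/search request:

import time
import requests

def fetch_with_backoff(url, max_delay=600):
    delay = 1
    while True:
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            return r.text                        # success: no artificial delay
        except requests.RequestException as exc:
            print(f'error ({exc}), backing off for {delay}s')
            time.sleep(delay)
            delay = min(delay * 2, max_delay)    # exponential backoff, capped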
[23:17] my sh script was so rotten i could smell it lol [23:18] ez: http://paste.ubuntu.com/26179607/ [23:19] ola_norsk: a fixed delay helps in a pinch (especially in places like sh), but is not quite ideal either [23:20] ola_norsk: yea, for a channel i'd not worry about it much [23:20] aye [23:22] i personally do not care about vidme users, though i feel sorry for the many who moved there from youtube thinking it would be a 'free haven' [23:23] commercialized haven, that is [23:23] i suspect vidme enjoyed a lot of popularity on account of its built-in youtube-dl support [23:24] ie its very easy to "double post" on there [23:24] aye, and a ton of youtube users put their links into that; and when done, cancelled their youtube channels [23:25] not sure if that model would prevail. youtube tends to tell sites who are doing that "you guys, would you please stop doing that?" when they get big enough [23:35] ez: one problem of Vidme might've been https://imgur.com/a/DWXj3 [23:36] ez: never saw a single damn ad, except on profile/video pages.. [23:37] ez: never heard back from them regarding that issue report though [23:46] *** kristian_ has joined #archiveteam-bs [23:48] ola_norsk: i didn't follow vidme in the recent past [23:48] but as of january, vidme had no ads, their model was that of paid subscriptions/tips [23:54] ez: then i wish help@vidme.com would've just said so (in september), instead of acting like they did :D [23:55] iirc they made some vague promises [23:55] in fairly recent times, but as i said, i didn't follow at the time, you better just google reddit convos or something [23:55] 'fake it until you make it' i guess :D [23:56] damnit, they made me second-guess adblock :/ [23:56] but yea, vid.me was a very .com startup in a lot of ways [23:56] aye [23:57] over-reliance on 'users will come', with a mediocre product, and they didn't come. happened a lot in the 90s. [23:57] i honestly have no idea what the bitchute guys are doing [23:57] quickly make a warrior job for bitchute..it's got magnet links :D [23:57] aye [23:57] they might get traction if they position themselves as non-profit [23:58] so people will be willing to seed their webtorrents [23:58] if they dont do that, everybody will be like 'fuck no, why should i help a commercial company lower their opex?' [23:58] it's a viable thing though.. using webtorrent [23:58] sure [23:58] i seed IA torrents [23:59] the webtorrent player could even alleviate some outgoing data on IA i think [23:59] as i consider IA mostly a non-profit endeavor [23:59] honestly, webtorrent is a massive clusterfuck