[00:01] *** RichardG has quit IRC (Ping timeout: 268 seconds)
[00:05] *** RichardG has joined #archiveteam-bs
[00:24] *** killsushi has quit IRC (Quit: Leaving)
[00:41] *** enowaldo has joined #archiveteam-bs
[00:43] *** BlueMax has joined #archiveteam-bs
[00:48] *** enowaldo has quit IRC (Read error: Operation timed out)
[00:58] *** benjins has quit IRC (Quit: Leaving)
[01:03] *** Zerote_ has quit IRC (Ping timeout: 600 seconds)
[01:45] *** icedice has quit IRC (Quit: Leaving)
[01:48] *** enowaldo has joined #archiveteam-bs
[01:50] *** webdownlo has quit IRC (Quit: Page closed)
[02:04] https://www.bbc.co.uk/news/world-europe-48345660
[02:06] godane: https://twitter.com/textfiles/status/1130655750475472896
[02:08] JAA: maybe put in an ArchiveBot patch for ?dl=0 -> ?dl=1 ?
[02:11] i noticed it
[02:12] he sent me a message from vanderbilt library askus service
[02:13] i forwarded the email to your address
[02:14] If anyone's at Carnegie Mellon I'm trying to get a paper on compression, msg me
[02:16] godane: Please just ignore him, forward any reviews he sends, etc.
[02:16] I'm gathering all of this.
[02:16] (I also know where he lives now)
[02:17] i just hope he is not in New England
[02:18] anyways you're getting tons of japanese manuals
[02:18] with metadata mostly
[02:19] so it's going better than amazon manuals at least
[02:21] so i found some japanese hiphop band page : http://www.harlem.co.jp/
[02:22] SketchCow: you may have something for your hiphop tape collection here : https://soundcloud.com/club_harlem
[02:29] *** enowaldo has quit IRC (Read error: Operation timed out)
[03:05] *** Anthony1 has joined #archiveteam-bs
[03:21] *** qw3rty111 has joined #archiveteam-bs
[03:27] *** marked1 has quit IRC (Quit: WeeChat 2.4)
[03:28] *** qw3rty119 has quit IRC (Read error: Operation timed out)
[03:31] *** Anthony1 has quit IRC (Quit: Page closed)
[03:50] *** odemgi_ has joined #archiveteam-bs
[03:53] *** odemgi has quit IRC (Ping timeout: 252 seconds)
[03:53] *** odemg has quit IRC (Ping timeout: 265 seconds)
[03:56] *** enowaldo has joined #archiveteam-bs
[04:05] *** odemg has joined #archiveteam-bs
[04:08] *** enowaldo has quit IRC (Read error: Operation timed out)
[04:14] *** halt_ has quit IRC (irc.efnet.nl efnet.deic.eu)
[04:23] *** systwi has joined #archiveteam-bs
[04:53] *** paul2520 has quit IRC (Read error: Operation timed out)
[04:53] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[04:53] *** dxrt_ has quit IRC (Write error: Broken pipe)
[04:53] *** ivan has quit IRC (Write error: Broken pipe)
[04:53] *** kiska1 has quit IRC (Read error: Operation timed out)
[04:53] *** TC01 has quit IRC (Read error: Operation timed out)
[04:53] *** colona has quit IRC (Read error: Operation timed out)
[04:53] *** JH88 has quit IRC (Read error: Operation timed out)
[04:53] *** HashbangI has quit IRC (Read error: Operation timed out)
[04:53] *** ivan has joined #archiveteam-bs
[04:54] *** systwi has quit IRC (Read error: Operation timed out)
[04:54] *** wyatt8740 has joined #archiveteam-bs
[04:54] *** TigerbotH has quit IRC (Read error: Operation timed out)
[04:54] *** Yurume has quit IRC (Read error: Operation timed out)
[04:55] *** PhrackD has quit IRC (Read error: Operation timed out)
[04:55] *** Lord_Nigh has joined #archiveteam-bs
[04:55] *** Yurume has joined #archiveteam-bs
[04:55] *** PotcFdk has quit IRC (Read error: Operation timed out)
[04:55] *** step has quit IRC (Read error: Operation timed out)
[04:57] *** TC01 has joined #archiveteam-bs
[04:57] *** colona has joined #archiveteam-bs
[04:58] *** qw3rty111 has quit IRC (Ping timeout: 600 seconds)
[05:04] SketchCow: that tweet, mind looping me in on the story about it?
[05:05] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[05:32] ^ also interested in this
[05:34] I assume it has something to do with the crazy whackjob threatening Godane
[05:52] *** qw3rty111 has joined #archiveteam-bs
[05:52] *** kiska1 has joined #archiveteam-bs
[05:52] *** svchfoo3 sets mode: +o kiska1
[05:56] *** dxrt_ has joined #archiveteam-bs
[05:56] *** dxrt sets mode: +o dxrt_
[06:01] *** step has joined #archiveteam-bs
[06:02] *** PhrackD has joined #archiveteam-bs
[06:02] *** HashbangI has joined #archiveteam-bs
[06:02] *** systwi has joined #archiveteam-bs
[06:02] *** TigerbotH has joined #archiveteam-bs
[06:05] *** enowaldo has joined #archiveteam-bs
[06:06] *** PotcFdk has joined #archiveteam-bs
[06:10] *** enowaldo has quit IRC (Ping timeout: 265 seconds)
[06:14] *** paul2520 has joined #archiveteam-bs
[06:21] *** Zerote_ has joined #archiveteam-bs
[06:58] Flashfire: there's a whackjob? fun story or sad, realistic one?
[07:38] *** Zerote_ has quit IRC (Ping timeout: 600 seconds)
[07:46] *** Zerote has joined #archiveteam-bs
[07:50] *** BlueMax has quit IRC (Quit: Leaving)
[07:59] *** enowaldo has joined #archiveteam-bs
[08:18] eientei95: nvm I can't tabcomplete good
[08:18] lol
[08:18] No idea if we ever got that dump though - that edit seems to be from some point last year
[08:18] Huh
[08:19] *** enowaldo has quit IRC (Read error: Operation timed out)
[08:20] Frogging: any idea?
[08:31] I've pm'd the guy anyway
[08:42] *** HashbangI has quit IRC (Read error: Connection reset by peer)
[08:43] *** enowaldo has joined #archiveteam-bs
[08:56] *** enowaldo has quit IRC (Read error: Operation timed out)
[10:10] *** enowaldo has joined #archiveteam-bs
[10:32] *** enowaldo has quit IRC (Read error: Operation timed out)
[10:45] *** zerkalo has joined #archiveteam-bs
[10:56] *** Oddly has joined #archiveteam-bs
[11:12] *** enowaldo has joined #archiveteam-bs
[11:12] *** wp494 has quit IRC (Ping timeout: 615 seconds)
[11:13] *** wp494 has joined #archiveteam-bs
[11:24] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[11:52] *** enowaldo has joined #archiveteam-bs
[12:06] kisspunch: Won't help. As mentioned, AB doesn't grab the actual download on ?dl=1 links. It's due to the --no-parent flag for wpull and Dropbox's redirect setup.
[12:13] *** m007a83_ is now known as m007a83
[12:37] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[12:56] *** HashbangI has joined #archiveteam-bs
[12:59] *** cfarquhar has quit IRC (Read error: Operation timed out)
[12:59] *** cfarquhar has joined #archiveteam-bs
[13:14] *** Dj-Wawa has joined #archiveteam-bs
[13:28] *** enowaldo has joined #archiveteam-bs
[14:07] *** enowaldo has quit IRC (Ping timeout: 268 seconds)
[14:16] So, the thing I tweeted about
[14:16] Godane uploads a lot of video. A lot of everything, a saint.
[14:17] So, it's pretty usual that a bunch of tapes come out, and some of our more... special users post "reviews" like "I AM REQUESTING (OTHER SHOW) PLZ"
[14:17] Like, I get it, self-centered nerds. You post 50 live shows of The Blathering Blootz playing at various NJ nightclubs and someone "reviews" it going "DO YOU HAVE THEIR 1984 CBGB TAPE"
[14:18] But we got a guy
[14:18] And he's super into wanting a specific run of Nightlight
[14:18] Nightline
[14:18] And he has, for months, months mind you, been posting a review on almost everything, demanding we post Nightlight from two specific years
[14:18] Reviews, like... reviews on everything uploaded by godane. Reviews on my uploads.
[14:18] And they started getting weird.
[14:19] Literally "Am I going to have to kill you, am I going to track you down and murder you if you don't post these tapes"
[14:19] So now he just upped the game
[14:19] He just wrote to a third-party archive AS me, and demanded things
[14:19] And now that archive is flipping out
[14:19] So now I have to step careful
[14:20] But it's very likely, thousands of items will go dark because of him
[14:21] Can't we do something about *him*?
[14:21] I mean, I am
[14:22] Hence my tweet?
[14:22] Because now he's in my fuckin' backyard?
[14:22] Forgive me, I don't follow you on twitter
[14:22] You're missing out
[14:22] (Or even go on twitter all that much)
[14:22] I'm the best thing since the thing that slices bread
[14:23] Igloo: https://twitter.com/textfiles/status/1130655750475472896
[14:23] I shall follow
[14:23] SketchCannedBread
[14:23] JAA: want to write a qwarc script for minecraftforum.net?
[14:23] most of it was grabbed in my earlier run
[14:23] I think we just warrior the minecraftforum
[14:23] but there may be some new stuff
[14:24] Sure, when it goes read-only I guess?
[14:24] i'm not giving them another bit of trust on this
[14:24] last time we did, remember
[14:24] 1/3rd of the entire forum vanished
[14:24] Right
[14:25] Igloo: it's fucking huge, it took me weeks to get all forums archived with multiple grab-site jobs running across several systems
[14:25] the html parsing of it is very cpu hungry
[14:25] #shmimecraft ?
[14:25] #cursed
[14:26] *** Zerote has quit IRC (Read error: Operation timed out)
[14:27] Why does that photo exist
[14:27] Jason you mad man
[14:27] :D
[14:27] i love that photo :D
[14:27] Did you see https://www.reddit.com/r/DataHoarder/comments/bn1f8j/introducing_datahoardercloud_a_new_standard_for/
[14:34] I love people who put IPFS out there as a solution to stuff like this. I tried to use IPFS as a method to provide backups of a service I ran, holy shit the diskIO/CPU usage was insane for a 500GB collection of crap
[14:39] *** enowaldo has joined #archiveteam-bs
[14:41] *** deevious has quit IRC (Quit: deevious)
[14:49] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[14:54] *** Zerote has joined #archiveteam-bs
[14:57] Two responses.
[14:57] First, I really like the IPFS guy, as a person and as a thinker and as a dedication to his craft and goals.
[14:57] Second, the amount of secret bro money that drives a lot of tech and hides the true cost of items and usage that people then come to concoct "plans" against is utterly ridiculous.
[14:58] IPFS has a lot of inefficiencies because it's focused on being distributed and being outside of the, let's face it, law.
[14:58] Like, law WANTS centralization and two companies whose pencil necked geeks they can strong-arm into giving up the keys because brown people did something wrong
[14:58] *** Oddly has quit IRC (Read error: Operation timed out)
[14:59] But I think that it's distribution first, cheapness second, and that will always be an issue.
[14:59] We've discovered that with internetarchive.bak and we'll discover it elsewhere.
[15:09] *** enowaldo has joined #archiveteam-bs
[15:40] *** enowaldo has quit IRC (Read error: Operation timed out)
[15:59] *** tomaspark has quit IRC (Read error: Connection reset by peer)
[16:03] *** Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
[17:23] is it possible to --no-offsitelinks a running ArchiveBot run without restarting it? I am concerned that given the volume of offsite links to a 2m+ post forum, ew0hhphkhlajc4w23hfr7e6km will not finish before the shutdown
[17:25] *** enowaldo has joined #archiveteam-bs
[17:29] mozilla is turning into Google Sunsets Everything #2
[17:57] didn't google inherit the screenshots thing from when they acquired pocket?
[17:57] er
[17:57] mozilla
[17:58] didn't mozilla inherit the screenshots thing from when they acquired pocket?
[17:59] Maybe mozilla is trying to limit their liability by not hosting content created by users on their sync servers where they can be transferred from machine to machine, but that doesn't make sense since you could abuse the url/history sync database stuff to transfer data anyway
[17:59] by filling the history with fake urls with data encoded as base64
[17:59] in the url itself
[17:59] i suppose screenshots allow it to be done more easily...
[17:59] joshua_: No, that's not possible.
[18:00] https://www.vice.com/en_us/article/a3x7qz/newly-surfaced-arcade-management-documents-from-the-1970s-predicted-a-wild-future-for-video-games
[18:00] :o
[18:00] *furiously reads*
[18:00] * Lord_Nigh pokes Stiletto
[18:00] This was the stuff I was talking about
[18:00] I tweeted this
[18:00] Stay up on it
[18:01] JAA: ah, that's a shame. well, if it's looking sketch in the next few days, we might want to turn up the concurrency / turn down the delays on that job to intercept before the turndown
[18:01] (also the archive is getting huge already, and I am not sure how much of the actual forum it has already archived; is there a max size limit on any one archivebot job that we're worried about hitting?)
[18:02] Nope, no size limit. We've had jobs of several TB.
[18:02] isn't the max size limit 'when the disk fills, at which point the job crashes/hangs and has to be manually unstuck to upload'?
[18:02] The data is uploaded to IA while the job's still running.
[18:02] ah, i didn't realize that
[18:02] phew. I was reading the archivebot wiki page and it indicates disks of 100GB recommended or so
[18:02] yeah
[18:03] But yes, that is the limit. Specifically, if the log file or URL DB gets too large.
[18:03] And of course those jobs take forever.
[18:04] ... does the code currently slow down the url pulling if the disk is dangerously full with a lot of data still pending to upload to IA? i.e. to let the uploadable-and-clearable parts of the disk get freed up
[18:04] using the disk like a giant buffer
[18:04] it appears to be mirroring half of the known universe of some site's video archive
[18:04] well, no kill like overkill.
[18:05] the url db can't do that since it's a database, so the limit of total size would be when the urldb fills the whole disk leaving no space for files to be pulled
[18:05] (also, hi, Lord_Nigh, ltns)
[18:05] the only way i can see to get around that would be to 'flush' the database when it gets too big AND a specific subdirectory on the server is completely pulled, but this would be complicated
[18:06] hi joshua_ ! i don't think we've spoken in quite some time (I think the last time we talked was me asking if you still had a copy of the gnuboy SVN?)
[18:06] which was ~10 years ago
[18:07] how are things?
[18:09] Lord_Nigh: wpull will pause if it detects that there's less than 500 MiB of free disk space (by default, the exact value is customisable). However, that check only runs between URLs, so it can still happen that the disk fills up if it's grabbing huge files.
[18:13] The exact limit is a bit more complicated. There needs to be enough space also for the temporary files and the log file.
[18:15] hmm. for web servers which report the size of the file being retrieved, this could be used to ensure no single file will fill the entire rest of the disk...
[18:15] but with many retrievals happening in parallel this is still a problem
[18:18] *** icedice has joined #archiveteam-bs
[18:19] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[18:29] *** Zerote_ has joined #archiveteam-bs
[18:33] *** Zerote has quit IRC (Ping timeout: 252 seconds)
[18:36] *** Oddly has joined #archiveteam-bs
[18:50] *** Dj-Wawa has joined #archiveteam-bs
[19:39] Do we want a copy of the EU parliament's video library (committee meetings, plenary speeches, etc.)? Could be huge obviously.
[19:39] E.g. this: http://www.europarl.europa.eu/ep-live/en/plenary/
[19:41] Lord_Nigh, things are ok; work is work, etc. yeah, I am sad that the gbdev channel went full Nazi...
[19:45] joshua_: there's always the discord
[19:46] which only contains trace amounts of beware
[19:46] yeah, though, discord: "like IRC, but with a mandatory 600MB client RAM footprint!"
[19:47] other day i discovered ripcord https://cancel.fm/ripcord/
[19:47] #1 feature is "not made from a web browser"
[19:49] tempting.
[19:49] I see discord using 256M, vs irccloud using 400M
[19:49] but hey, mobile app /shrug
[19:52] *** enowaldo_ has joined #archiveteam-bs
[19:55] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[19:59] irssi: 31 MiB... Just saying...
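Editor's sketch of the free-space check described in the 18:09 to 18:15 messages. This is not wpull's actual code: the function names and the MIN_FREE_BYTES constant are made up for illustration, and only the 500 MiB default and the "checked between URLs" behaviour come from the discussion. The second helper illustrates Lord_Nigh's idea of using a server-reported file size; as noted in the log, it does not help when many retrievals run in parallel.

```python
import shutil

# 500 MiB, the default threshold mentioned in the discussion.
MIN_FREE_BYTES = 500 * 1024 * 1024


def should_pause(free_bytes, min_free=MIN_FREE_BYTES):
    """Return True if the crawler should wait before fetching the next URL.

    Hypothetical helper: meant to be called between URLs only, so a single
    huge download can still fill the disk after the check has passed.
    """
    return free_bytes < min_free


def fits_on_disk(content_length, free_bytes, min_free=MIN_FREE_BYTES):
    """Return True if a file of the reported size leaves min_free bytes spare.

    content_length is None when the server does not report a size; nothing
    can be promised in that case, so the file is optimistically allowed.
    """
    if content_length is None:
        return True
    return free_bytes - content_length >= min_free


def free_bytes_at(path="."):
    """Free space on the filesystem holding path, via the standard library."""
    return shutil.disk_usage(path).free
```

A crawl loop would call should_pause(free_bytes_at(job_dir)) before each URL, where job_dir is whatever directory the job writes to (a placeholder name here).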
[20:01] my weechat is taking 87 megs right now
[20:01] that's RSS; virtual size is about 300 megs
[20:01] Yeah, same here, 31 MiB RSS, 154 MiB VIRT
[20:03] thelounge is using 134M RSS, 1007M VIRT
[20:04] Server or client?
[20:04] Also, let's move this discussion to -ot.
[20:06] JAA: i'm looking at eu parliament videos library
[20:09] Cheers. I know that VideoBot used to upload some videos from europarl, but that stopped last year, and I have no idea how complete it is.
[20:09] JAA: code i used to grab 2008-09-01 date urls : curl -s http://www.europarl.europa.eu/ep-live/en/plenary/video?date=01-09-2008 | grep debate= | sed 's|.*href="|http://www.europarl.europa.eu|g' | sed 's|".*||g'
[20:10] they have a day-month-year date in their urls
[20:10] I was going to just extract the VOD URLs from the ArchiveBot log for the europarl.europa.eu job.
[20:10] I ignored those earlier.
[20:12] code to grab videos from the debate urls: curl -s http://www.europarl.europa.eu/ep-live/en/plenary/video?debate=1220281482063 | grep mp4 | sed 's|> <|>\n<|g' | sed 's|.*value="||g' | sed 's|".*||g'
[20:13] *** Coderjo has quit IRC (Ping timeout: 252 seconds)
[20:13] i could use this code to start making daily dump videos of eu parliament video library
[20:14] *** zerkalo has quit IRC (Read error: Operation timed out)
[20:14] *** HashbangI has quit IRC (Read error: Operation timed out)
[20:14] *** dxrt_ has quit IRC (Read error: Operation timed out)
[20:14] *** jspiros has quit IRC (Read error: Operation timed out)
[20:14] *** paul2520 has quit IRC (Write error: Broken pipe)
[20:14] *** kiska1 has quit IRC (Write error: Broken pipe)
[20:15] *** TigerbotH has quit IRC (Read error: Operation timed out)
[20:15] *** PotcFdk has quit IRC (Read error: Operation timed out)
[20:16] *** PhrackD has quit IRC (Read error: Operation timed out)
[20:16] *** jspiros has joined #archiveteam-bs
[20:16] *** kiska1 has joined #archiveteam-bs
[20:16] *** qw3rty112 has joined #archiveteam-bs
[20:16] this could be better cause then you could set the videos in order maybe : http://www.europarl.europa.eu/ep-live/en/plenary/search-by-date/results?date=01-09-2008&start=0
[20:16] *** svchfoo3 sets mode: +o kiska1
[20:17] *** paul2520 has joined #archiveteam-bs
[20:18] *** PhrackD has joined #archiveteam-bs
[20:19] *** qw3rty111 has quit IRC (Ping timeout: 600 seconds)
[20:20] *** dxrt_ has joined #archiveteam-bs
[20:20] *** dxrt sets mode: +o dxrt_
[20:20] *** TigerbotH has joined #archiveteam-bs
[20:20] *** systwi has quit IRC (Read error: Operation timed out)
[20:24] *** HashbangI has joined #archiveteam-bs
[20:24] JAA: better code : curl -s http://www.europarl.europa.eu/ep-live/en/plenary/video?date=02-09-2008 | grep 'li id=' | sed 's|.*li id="c|http://www.europarl.europa.eu/ep-live/en/plenary/video?debate=|g' | sed 's|".*||g' | grep ^http
[20:25] *** systwi has joined #archiveteam-bs
[20:25] one of the videos didn't have a debate= url cause the first one is a selected video
[20:25] JAA godane: no idea if this is useful, but there's this secondary website web.ep.streamovations.be that also has an index where you can get a list of all videos ever in a single request
[20:25] *** Zerote_ has quit IRC (Read error: Connection reset by peer)
[20:25] *** Zerote__ has joined #archiveteam-bs
[20:27] Here's a script I used previously (that intentionally only gets the DASH urls for currently running european parliament live streams though): https://gist.github.com/phiresky/2312c9d6eb3f69067b61fa1e267867c3
[20:27] *** Coderjo has joined #archiveteam-bs
[20:28] *** zerkalo has joined #archiveteam-bs
[20:29] *** RichardG has joined #archiveteam-bs
[20:32] Looks like there's a lot more content on https://multimedia.europarl.europa.eu/en/home also.
[20:32] *** BartoCH has joined #archiveteam-bs
[20:33] if i remember correctly, web.ep.streamovations.be has recordings you can't (easily?) find on www.europarl.europa.eu . but it's probably only the stuff that went through their live streaming system, not clips and stuff
[20:34] *** Oddly has quit IRC (Read error: Operation timed out)
[20:34] *** phiresky1 is now known as phiresky
[20:44] *** PotcFdk has joined #archiveteam-bs
[21:12] so to everyone here all the nightline and old abc, nbc, cbs news broadcasts i have are dark
[21:13] i'm not going to bother uploading anymore cause of that
[21:13] *** enowaldo_ has quit IRC (Read error: Operation timed out)
[21:15] :-(
[21:15] Don't let that jerk win.
[21:25] *** enowaldo has joined #archiveteam-bs
[21:26] Oh godane i am so sorry
[21:27] godane: why did they go dark?
[21:27] no i mean i'm not uploading anymore old broadcasts from Vanderbilt
[21:27] did they send a DMCA?
[21:28] cause they're from Vanderbilt archive
[21:28] so they sent a DMCA?
[21:28] Ah, Not because of jerkhead?
[21:29] that crazy guy emailed Vanderbilt as me and jason scott asking for episodes of nightline from what i got from the emails
[21:29] ah
[21:29] wtf
[21:30] ^(10:19:36 AM) SketchCow: He just wrote to a third-party archive AS me, and demanded things
[21:30] he has a boner for 2004-2005 Nightline episodes
[21:31] Ah I see
[21:31] so Vanderbilt blocked you?
[21:31] I don't understand what happened really
[21:31] Humans can be shitty
[21:32] that crazy guy talked about ted koppel hair color a few times with passion
[21:37] i grab the videos looking for m3u8 files in pages like this : https://tvnews.vanderbilt.edu/broadcasts/895918
[21:38] I wonder if youtube-dl supports it
[21:39] would probably make that faster for you
[21:39] i then use something like this recently : youtube-dl --hls-prefer-native --hls-use-mpegts --fixup warn --keep-fragments -o $(basename $(dirname $m3u8url) .mp4).ts
[21:39] yes it can
[21:39] but use those commands cause mpegts will give back same checksum
[21:39] ah ok
[21:39] anyway
[21:40] so why were the items blocked on archive.org?
[21:40] who blocked them?
[21:40] it has to be blocked cause of that crazy guy that bothered me for 2004-2005 nightline episodes
[21:41] ah
[21:41] I still don't understand why the guy can't be blocked instead.
[21:41] wait, so you intentionally blocked them?
[21:42] cause Vanderbilt doesn't want IA hosting them i guest
[21:42] *guess
[21:42] anyways i got to go
[21:42] bbl
[21:42] ah ok, i understand it now
[21:42] (sorry for bothering you)
[21:50] *** qwebirc78 has joined #archiveteam-bs
[21:50] Are we going to archive Ghostbin?
[22:00] *** enowaldo has quit IRC (Ping timeout: 265 seconds)
[22:02] *** qwebirc78 has left
[22:03] anyways i'm looking for old japanese computer magazines in pdf/cbr/cbz format
[22:04] a big upload happened a few months ago : https://archive.org/details/micomBASIC19841994
[22:05] what's weird is there are another 2 big uploads of Oh! MZ and Oh! X
[22:06] EU election timetable:
[22:06] - Thursday, 23 May: UK, Netherlands
[22:06] - Friday, 24 May: Czech Republic, Ireland
[22:06] - Saturday, 25 May: Czech Republic, Slovakia, Latvia, Malta
[22:06] - Sunday, 26 May: all other countries
[22:06] So we should try to cover them in this order.
[22:06] I'm running betamax's list from the UK through ArchiveBot currently.
[22:09] those 3 sets of pdfs all have date of feb 28 2014 which makes me think they're from the same person or release group
[22:09] based on pdfinfo command
[22:23] *** enowaldo has joined #archiveteam-bs
[22:24] SketchCow: did we grab Afterhoursdjs.org : http://95.46.199.251/
[22:25] cause it's closing at end of the month: https://old.reddit.com/r/opendirectories/comments/bquq60/afterhoursdjsorg_liveset_recordings_mostly_2000s/
[22:25] omg
[22:26] godane: #archivebot ?
[22:26] i figure it was too big for archivebot
[22:26] Yep, probably.
[22:26] nah
[22:27] Or at least AB won't like it.
[22:27] anyway, keep in mind "There's a copy of these livesets on archive.org already, but it's a little trickier to bulk-download from there"
[22:28] ohh
[22:29] so it's not like this stuff would be lost
[22:29] i did not know about this :P https://archive.org/details/afterhoursdjs_livesets
[22:29] It's actually easier to bulk-download from IA than a recursive wget.
[22:30] ia download --search='collection:afterhoursdjs_livesets'
[22:31] *** Despatche has quit IRC (Ping timeout: 255 seconds)
[22:49] "The following text is what triggered our spam filter: loan"
[22:49] PLS
[22:50] The offending link: http://www.karinegloanecmaurin.eu , website of Karine Gloanec Maurin, French member of the EU Parliament
[22:53] JAA: fyi, I've just put a large list of tweets (~100MB, 1.75 million tweets) coming from twitter accounts owned by UK MEP candidates, into archivebot
[22:54] betamax: Nice, thanks!
[22:55] I put it in as a single list, if you think it would be better as multiple smaller lists, feel free to split it up and do that instead (you're much more familiar with the archivebot system)
[22:55] We'll see how it handles that.
[22:56] Should probably be fine.
[22:57] I assume it's safe to "!yahoo" that job and twitter will handle the load?
[22:57] I already changed the settings.
[22:57] (To more than what !yahoo does.)
[22:57] But yes, Twitter's fine with a high request rate.
[22:58] Ah, excellent! (And btw, all those tweets were obtained with snscrape, which zoomed through the list on a fast 1Gb/sec up/down connection)
[23:00] :-)
[23:00] How many scrapes were you running concurrently?
[23:01] Also, I hope to implement https://github.com/JustAnotherArchivist/snscrape/issues/34 soonish, which should make it even faster.
[23:02] None! All done one after another (but on a very fast server - 64 core Xeon, 400GB RAM)
[23:02] Heh, cool.
[23:03] Though snscrape's single-threaded, so the cores won't matter.
[23:05] Oh, I was wondering if there was a way to limit snscrape's twitter-user scrape to only include tweets after a certain date (e.g. so I can run another scrape in, say, a week's time and get everything from now until after the election)
[23:06] Yeah, two ways to do that: --since option or using twitter-search directly instead of twitter-user.
[23:06] I consider the latter more reliable, but --since should probably work fine for Twitter.
[23:07] The syntax for the search would be something like this: snscrape twitter-search 'from:username since:2019-01-01'
[23:07] Thanks!
[23:07] *** enowaldo has quit IRC (Read error: Operation timed out)
[23:07] And it's more reliable because it does the filtering on the server side.
[23:07] --since iterates over results until it finds a result that is older than the specified datetime.
[23:08] Twitter seems to reliably return results in reverse-chronological order, but Instagram hashtag searches for example are very unreliable in that regard since there are sometimes old results at the end of a page.
[23:10] *** BlueMax has joined #archiveteam-bs
[23:28] *** icedice has quit IRC (Quit: Leaving)
[23:45] *** icedice has joined #archiveteam-bs
[23:50] *** enowaldo has joined #archiveteam-bs
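
Editor's sketch of the difference JAA describes at 23:07 and 23:08 between stopping at the first old result and filtering every result. This is not snscrape's code: take_until_older, filter_all and the sample data are hypothetical. The first function stops at the first result older than the cutoff, which is how --since is described in the log; the second checks every result and stands in for the server doing the filtering.

```python
from datetime import datetime


def take_until_older(results, since):
    """Client-side early stop: yield items until one is older than since.

    Only correct when results arrive newest-first.
    """
    for timestamp, item in results:
        if timestamp < since:
            break
        yield item


def filter_all(results, since):
    """Check every result; the order of the results does not matter."""
    return [item for timestamp, item in results if timestamp >= since]


since = datetime(2019, 5, 1)

# Newest-first feed: both approaches agree.
ordered = [
    (datetime(2019, 5, 20), "a"),
    (datetime(2019, 5, 10), "b"),
    (datetime(2019, 4, 1), "c"),
]

# An old result sits in front of a newer one, the situation reported for
# Instagram hashtag pages: the early stop never reaches "z".
unordered = [
    (datetime(2019, 5, 20), "x"),
    (datetime(2019, 3, 1), "y"),
    (datetime(2019, 5, 15), "z"),
]

print(list(take_until_older(ordered, since)))    # ['a', 'b']
print(filter_all(ordered, since))                # ['a', 'b']
print(list(take_until_older(unordered, since)))  # ['x']
print(filter_all(unordered, since))              # ['x', 'z']
```

The early stop saves requests on a well-ordered feed, which is why it is reasonable for Twitter; on a feed that is not reliably ordered it silently drops results.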